Question

STAR options for RNA Seq

0

7.2 years ago

skhan ▴ 10

I have 2x75 bp TruSeq stranded RNA-Seq data from rat samples, sequenced on an Illumina NextSeq. I removed adapters from the FASTQ files and quality-trimmed them with Trimmomatic. I'd like to align the reads with STAR and generate count matrices for downstream differential expression analysis, but I am confused about which options to use for the STAR alignment.

Here is what I have:

STAR --genomeDir $STARINDICES/ \
--readFilesIn sample1_read1.fq.gz sample1_read2.fq.gz \
--outFileNamePrefix out_ \
--runThreadN 4 \
--outSAMattrRGline ID:"sample1" SM:"sample1" LB:"sample1" PL:"ILLUMINA" \
--outBAMsortingThreadN 4 \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMstrandField intronMotif \
--outFilterIntronMotifs RemoveNoncanonicalUnannotated \
--readFilesCommand zcat \
--chimSegmentMin 20 \
--genomeLoad NoSharedMemory

Specifically, am I correct to select these three options?

--outSAMunmapped Within \   # outputs unmapped reads within the main SAM file.

--outSAMstrandField intronMotif \   # strand derived from the intron motif. Reads with inconsistent and/or non-canonical introns are filtered out.

--outFilterIntronMotifs RemoveNoncanonicalUnannotated \   # filter out alignments that contain non-canonical unannotated junctions when using an annotated splice junctions database. The annotated non-canonical junctions will be kept.

I will be using htseq-count or featureCounts (but may use Cufflinks as well) to generate expression counts.

Have I missed anything? And do I need to modify the resulting BAM file in any way before using it as input for htseq-count / featureCounts?

Thanks.

STAR RNA-seq • 7.0k views

updated 22 months ago by Ram 45k • written 7.2 years ago by skhan ▴ 10

0

You should refer (if you have not already) to section 3.2.2 of the STAR manual, "ENCODE options" (for the long RNA-Seq pipeline).

If you want to read more about the latest ENCODE options for RNA-Seq, you will find documentation here.

7.2 years ago by erwan.scaon ▴ 960

2

7.2 years ago

swbarnes2 15k

Of those last three options, I never have used the introny ones (but my lab cares more about counting than analyzing splice sites), and I always include unmapped reads in the .bam. It's just so much easier down the road if you or someone else wants to reanalyze data to know that you didn't wrongly throw anything away.

And yes, --quantMode GeneCounts is useful (note the capital G; STAR option values are case-sensitive). You can feed those counts into programs that do differential expression.
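For anyone unsure what those counts look like: with --quantMode GeneCounts, STAR writes a ReadsPerGene.out.tab file next to the alignments. It starts with four summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous), followed by one row per gene with three count columns: unstranded, read 1 on the gene's strand (like htseq-count -s yes), and read 1 on the opposite strand (like htseq-count -s reverse). TruSeq stranded libraries usually end up in the last column, but it is worth checking. Below is a rough sketch of that check; the file contents are made-up toy numbers and the gene IDs are hypothetical, so point counts_file at your own output instead.

```shell
# Toy stand-in for out_ReadsPerGene.out.tab (made-up numbers, hypothetical gene IDs).
# Columns: gene ID, unstranded counts, forward-stranded counts, reverse-stranded counts.
counts_file="$(mktemp)"
printf '%s\t%s\t%s\t%s\n' \
  N_unmapped 100 100 100 \
  N_multimapping 50 50 50 \
  N_noFeature 30 900 40 \
  N_ambiguous 5 1 2 \
  geneA 120 3 117 \
  geneB 80 2 78 \
  geneC 10 1 9 > "$counts_file"

# Sum each count column over the gene rows only (skip the four N_* summary rows).
strand_totals="$(awk -F'\t' 'NR > 4 { u += $2; f += $3; r += $4 }
  END { printf "%d %d %d", u, f, r }' "$counts_file")"
echo "unstranded forward reverse totals: $strand_totals"

# Crude heuristic: take whichever stranded column holds more reads.
# For an unstranded library both totals would be similar and column 2 is the one to use.
best_column="$(awk -F'\t' 'NR > 4 { f += $3; r += $4 }
  END { if (r > f) print 4; else print 3 }' "$counts_file")"
echo "stranded column to use: $best_column"

# Two-column gene/count table taken from the chosen column.
gene_counts="$(awk -F'\t' -v col="$best_column" 'NR > 4 { print $1 "\t" $col }' "$counts_file")"
printf '%s\n' "$gene_counts"

rm -f "$counts_file"
```

If one stranded total is far larger than the other, the library is stranded and that column is the one to carry into differential expression.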


score 5 · Accepted Answer · 2018-05-03

I wouldn't fiddle too much with the parameters; STAR does a very good job with its defaults. And unless you plan on variant calling later, you don't need to add read group information.

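More importantly, STAR can output counts similar to those of featureCounts / HTSeq; you just have to use the parameter --quantMode GeneCounts. This requires gene annotations, supplied as a GTF either when the index was built or at mapping time via --sjdbGTFfile.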

I would use something like:

STAR --genomeDir $STARINDICES/ \
  --readFilesIn sample1_read1.fq.gz sample1_read2.fq.gz \
  --outFileNamePrefix out_ --runThreadN 4 \
  --outBAMsortingThreadN 4 \
  --outSAMtype BAM SortedByCoordinate \
  --readFilesCommand zcat \
  --quantMode GeneCounts \
  --genomeLoad NoSharedMemory

Note that you can even use --outSAMtype None if you are not interested in the BAM file.
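Once every sample has been run this way, the per-sample ReadsPerGene.out.tab files can be merged into a single counts matrix for differential expression. Here is a minimal sketch, assuming the files share a common suffix, were produced with the same annotation (so they list the same genes), and that column 4 (reverse-stranded counts) is the right one for the library. The sample names, gene IDs, and counts below are made up for illustration.

```shell
# Toy stand-ins for two per-sample STAR outputs (made-up counts, hypothetical names).
workdir="$(mktemp -d)"
printf '%s\t%s\t%s\t%s\n' \
  N_unmapped 10 10 10 N_multimapping 5 5 5 N_noFeature 3 90 4 N_ambiguous 1 0 0 \
  geneA 120 3 117 geneB 80 2 78 > "$workdir/sample1_ReadsPerGene.out.tab"
printf '%s\t%s\t%s\t%s\n' \
  N_unmapped 12 12 12 N_multimapping 6 6 6 N_noFeature 4 95 5 N_ambiguous 2 0 1 \
  geneA 200 5 195 geneB 60 1 59 > "$workdir/sample2_ReadsPerGene.out.tab"

strand_col=4   # assumption: reverse-stranded library; use 2 for unstranded data
matrix="$workdir/counts_matrix.tsv"

{
  # Header row: "gene" followed by one sample name per input file.
  printf 'gene'
  for f in "$workdir"/*_ReadsPerGene.out.tab; do
    printf '\t%s' "$(basename "$f" _ReadsPerGene.out.tab)"
  done
  printf '\n'
  # Body: skip the four N_* rows of each file and append the chosen column per gene,
  # keeping the gene order of the first file.
  awk -F'\t' -v col="$strand_col" '
    FNR == 1 { nfiles++ }
    FNR > 4 {
      if (nfiles == 1) genes[++n] = $1
      row[$1] = row[$1] "\t" $col
    }
    END { for (i = 1; i <= n; i++) print genes[i] row[genes[i]] }
  ' "$workdir"/*_ReadsPerGene.out.tab
} > "$matrix"

matrix_text="$(cat "$matrix")"
printf '%s\n' "$matrix_text"
rm -rf "$workdir"
```

The resulting tab-separated table has genes as rows and samples as columns, which is the layout most differential expression tools expect as raw count input.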