Question: STAR options for RNA Seq
2.4 years ago, skhan wrote:

I have 2x75 bp TruSeq stranded RNA-Seq data from rat samples, sequenced on an Illumina NextSeq machine. I have removed adapters from the FASTQ files and quality-trimmed them using Trimmomatic. I'd like to align them using STAR and generate count matrices for downstream differential expression analysis. I am confused about which options to use during the STAR alignment.

Here is what I have:

STAR --genomeDir $STARINDICES/ \
--readFilesIn sample1_read1.fq.gz sample1_read2.fq.gz \
--outFileNamePrefix out_ \
--runThreadN 4 \
--outSAMattrRGline ID:"sample1" SM:"sample1" LB:"sample1" PL:"ILLUMINA" \
--outBAMsortingThreadN 4 \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMstrandField intronMotif \
--outFilterIntronMotifs RemoveNoncanonicalUnannotated \
--readFilesCommand zcat \
--chimSegmentMin 20 \
--genomeLoad NoSharedMemory

Specifically, am I correct to select these three options?

--outSAMunmapped Within \   # outputs unmapped reads within the main SAM file.

--outSAMstrandField intronMotif \   # strand derived from the intron motif. Reads with inconsistent and/or non-canonical introns are filtered out.

--outFilterIntronMotifs RemoveNoncanonicalUnannotated \   # filter out alignments that contain non-canonical unannotated junctions when using an annotated splice junctions database. The annotated non-canonical junctions will be kept.

I will be using htseq-count or featureCounts (but may use Cufflinks as well) to generate expression counts.

Have I missed anything? And do I need to modify the resulting BAM file in any way before using it as input for htseq-count / featureCounts?



If you have not already done so, see section 3.2.2 of the STAR manual, "ENCODE options" (for the long RNA-Seq pipeline).

If you want to read more about the latest ENCODE options for RNA-Seq, you will find documentation here.
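For reference, the ENCODE settings listed in that section of the STAR manual can be collected in a shell variable and appended to the STAR call. This is only a sketch: ENCODE_OPTS is a variable name made up here, and the values should be checked against the manual for the STAR version in use.

```shell
# ENCODE long RNA-seq options as listed in the STAR manual (verify against your version).
# ENCODE_OPTS is a hypothetical variable name used only for this sketch.
ENCODE_OPTS="--outFilterType BySJout \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000"

# Left unquoted on purpose so the shell splits it into separate arguments:
# STAR --genomeDir $STARINDICES/ \
#   --readFilesIn sample1_read1.fq.gz sample1_read2.fq.gz \
#   --readFilesCommand zcat $ENCODE_OPTS
```

Keeping the options in one place makes it easier to apply the same settings to every sample.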

Comment written 2.4 years ago by erwan.scaon
2.4 years ago, h.mon wrote:

I wouldn't fiddle too much with the parameters; STAR does a very good job with the defaults. And unless you plan on variant calling later, you don't need to add read group information.

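More importantly, STAR can output counts similar to those of featureCounts / HTSeq; you just have to use the parameter --quantMode GeneCounts. The value is case-sensitive, and STAR needs gene annotations for it, either built into the index or passed at mapping time with --sjdbGTFfile.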

I would use something like:

STAR --genomeDir $STARINDICES/ \
  --readFilesIn sample1_read1.fq.gz sample1_read2.fq.gz \
  --outFileNamePrefix out_ --runThreadN 4 \
  --outBAMsortingThreadN 4 \
  --outSAMtype BAM SortedByCoordinate \
  --readFilesCommand zcat \
  --quantMode GeneCounts \
  --genomeLoad NoSharedMemory

Note that you can even use --outSAMtype None if you have no use for the BAM file.
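With --quantMode GeneCounts, STAR writes a ReadsPerGene.out.tab file (out_ReadsPerGene.out.tab with the prefix above). It has four columns: gene ID, unstranded counts, counts with read 1 on the RNA strand, and counts with read 2 on the RNA strand; the first four rows are summary lines (N_unmapped, N_multimapping, N_noFeature, N_ambiguous). For TruSeq stranded libraries column 4 is usually the one to take, but it is worth confirming by comparing the column totals. A minimal sketch for pulling out one column, where extract_counts is a helper name invented here:

```shell
# Print gene ID and one count column from STAR's ReadsPerGene.out.tab.
# extract_counts is a hypothetical helper, not part of STAR.
extract_counts() {
  # $1 = ReadsPerGene.out.tab file
  # $2 = column to keep: 2 unstranded, 3 read 1 on RNA strand,
  #      4 read 2 on RNA strand (default)
  # NR > 4 skips the four N_* summary rows at the top of the file.
  awk -v c="${2:-4}" 'BEGIN { FS = OFS = "\t" } NR > 4 { print $1, $c }' "$1"
}

# Example call, assuming the out_ prefix used above:
# extract_counts out_ReadsPerGene.out.tab 4 > sample1.counts.txt
```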

2.4 years ago, swbarnes2 (United States) wrote:

Of those last three options, I never have used the introny ones (but my lab cares more about counting than analyzing splice sites), and I always include unmapped reads in the .bam. It's just so much easier down the road if you or someone else wants to reanalyze data to know that you didn't wrongly throw anything away.

And yes, --quantMode GeneCounts is useful. You can feed those counts into programs that do differential expression.
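Differential expression tools generally want one matrix with genes as rows and samples as columns. A rough sketch of building that from several ReadsPerGene.out.tab files follows; build_count_matrix is a made-up helper name, and it assumes every file came from the same STAR index so the gene rows are in the same order.

```shell
# Combine one count column from several ReadsPerGene.out.tab files into a matrix.
# build_count_matrix is a hypothetical helper, not part of STAR.
# Assumption: all files list the same genes in the same order.
build_count_matrix() {
  # $1 = count column to use (2, 3 or 4); remaining arguments = input files
  col="$1"
  shift
  awk -v c="$col" '
    BEGIN { FS = OFS = "\t" }
    FNR == 1 { nfiles++; names[nfiles] = FILENAME }   # new input file starts
    FNR > 4 {                                         # skip the four N_* rows
      row = FNR - 4
      if (nfiles == 1) { gene[row] = $1; nrows = row }
      count[row, nfiles] = $c
    }
    END {
      line = "gene_id"
      for (f = 1; f <= nfiles; f++) line = line OFS names[f]
      print line                                      # header: one column per file
      for (r = 1; r <= nrows; r++) {
        line = gene[r]
        for (f = 1; f <= nfiles; f++) line = line OFS count[r, f]
        print line
      }
    }' "$@"
}

# Example call with hypothetical file names:
# build_count_matrix 4 s1_ReadsPerGene.out.tab s2_ReadsPerGene.out.tab > counts.tsv
```

The header uses the input file names as sample labels, so naming the STAR output prefixes after the samples keeps the matrix readable.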
