Question: STAR options for RNA Seq
11 months ago by
skhan10 wrote:

I have 2x75 bp TruSeq stranded RNA-Seq data from rat samples, sequenced on an Illumina NextSeq machine. I have removed adapters from the FASTQ files and quality-trimmed them using Trimmomatic. I'd like to align them using STAR and generate count matrices for downstream differential expression analysis. I am confused about which options to use during the STAR alignment.

Here is what I have:

STAR --genomeDir $STARINDICES/ \
--readFilesIn sample1_read1.fq.gz sample1_read2.fq.gz \
--outFileNamePrefix out_ \
--runThreadN 4 \
--outSAMattrRGline ID:"sample1" SM:"sample1" LB:"sample1" PL:"ILLUMINA" \
--outBAMsortingThreadN 4 \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMstrandField intronMotif \
--outFilterIntronMotifs RemoveNoncanonicalUnannotated \
--readFilesCommand zcat \
--chimSegmentMin 20 \
--genomeLoad NoSharedMemory

Specifically, am I correct to select these three options?

--outSAMunmapped Within \   # outputs unmapped reads within the main SAM file.

--outSAMstrandField intronMotif \   # strand derived from the intron motif. Reads with inconsistent and/or non-canonical introns are filtered out.

--outFilterIntronMotifs RemoveNoncanonicalUnannotated \   # filter out alignments that contain non-canonical unannotated junctions when using an annotated splice junctions database. The annotated non-canonical junctions will be kept.

I will be using htseq-count or featureCounts (but may use Cufflinks as well) to generate expression counts.

Have I missed anything? And do I need to modify the resulting BAM file in any way before using it as input for htseq-count / featureCounts?



If you haven't already, see section 3.2.2 of the STAR manual, "ENCODE options" (for the long RNA-Seq pipeline).

If you want to read more about the latest ENCODE options for RNA-Seq, you will find documentation here.

written 11 months ago by erwan.scaon670
11 months ago by
h.mon24k wrote:

I wouldn't fiddle too much with the parameters; STAR does a very good job with the defaults. And unless you plan on variant calling later, you don't need to add read group information.

More importantly, STAR can output counts similar to those of featureCounts / HTSeq; you just have to use the parameter --quantMode GeneCounts (note the capital G).

I would use something like:

STAR --genomeDir $STARINDICES/ \
  --readFilesIn sample1_read1.fq.gz sample1_read2.fq.gz \
  --outFileNamePrefix out_ --runThreadN 4 \
  --outBAMsortingThreadN 4 \
  --outSAMtype BAM SortedByCoordinate \
  --readFilesCommand zcat \
  --quantMode GeneCounts \
  --genomeLoad NoSharedMemory

Note that you can even use --outSAMtype None if you have no interest in the BAM file.
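To make the counting step concrete, here is a minimal sketch of pulling one count column out of the table STAR writes in this mode. It assumes the index was built with a GTF annotation (which --quantMode GeneCounts needs), that the output is named out_ReadsPerGene.out.tab given the prefix above, and that the library is reverse-stranded, as is typical for TruSeq stranded kits; check that against your own data. The function name extract_counts is made up for illustration.

```shell
# Extract a two-column gene/count table from STAR's ReadsPerGene.out.tab.
# Column layout, per the STAR manual:
#   1 = gene ID
#   2 = counts for unstranded data
#   3 = counts for read 1 on the feature strand (like htseq-count -s yes)
#   4 = counts for read 2 on the feature strand (like htseq-count -s reverse)
# The first four rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous)
# are summary statistics rather than genes, so they are skipped.
extract_counts() {
    tab_file="$1"        # e.g. out_ReadsPerGene.out.tab
    column="${2:-4}"     # ASSUMPTION: default to column 4 (reverse-stranded)
    awk -v col="$column" 'BEGIN { FS = OFS = "\t" } NR > 4 { print $1, $col }' "$tab_file"
}
```

Usage would be something like `extract_counts out_ReadsPerGene.out.tab 4 > sample1.counts.tsv`. If you are unsure of the strandedness, compare the totals of columns 3 and 4: for a stranded library one of them should be far larger than the other.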

11 months ago by
United States
swbarnes25.2k wrote:

Of those last three options, I have never used the introny ones (but my lab cares more about counting than analyzing splice sites), and I always include unmapped reads in the .bam. If you or someone else wants to reanalyze the data down the road, it's so much easier to know that you didn't wrongly throw anything away.

And yes, --quantMode GeneCounts is useful. You can feed those counts into programs that do differential expression.
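Since differential expression tools generally want one genes-by-samples matrix, here is a rough sketch of merging several per-sample ReadsPerGene.out.tab files into one table. The function name merge_counts and the sample file names are hypothetical, and the choice of column 4 assumes a reverse-stranded library.

```shell
# Merge several STAR ReadsPerGene.out.tab files into one gene x sample matrix.
# Usage: merge_counts COLUMN file1 [file2 ...]
# COLUMN is the count column to take from every file (2, 3 or 4).
# The header uses each input file name as the sample label.
merge_counts() {
    column="$1"
    shift
    awk -v col="$column" '
        BEGIN { FS = OFS = "\t" }
        FNR == 1 { nfiles++; names[nfiles] = FILENAME }   # new input file starts
        FNR > 4 {                                         # skip the 4 N_* summary rows
            if (!($1 in seen)) { seen[$1] = 1; order[++ngenes] = $1 }
            counts[$1 SUBSEP nfiles] = $col
        }
        END {
            header = "gene_id"
            for (i = 1; i <= nfiles; i++) header = header OFS names[i]
            print header
            for (g = 1; g <= ngenes; g++) {
                row = order[g]
                for (i = 1; i <= nfiles; i++) {
                    key = order[g] SUBSEP i
                    # a gene missing from a file is reported as 0
                    row = row OFS ((key in counts) ? counts[key] : 0)
                }
                print row
            }
        }
    ' "$@"
}
```

For example, `merge_counts 4 sample1_ReadsPerGene.out.tab sample2_ReadsPerGene.out.tab > counts_matrix.tsv`. All samples aligned against the same index will list the same genes in the same order, so the missing-gene fallback is only a safety net.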
