Question: STAR options for RNA Seq
skhan wrote (19 months ago):

I have 2x75 bp TruSeq stranded RNA-Seq data from rat samples, collected on an Illumina NextSeq machine. I have removed adapters from the FASTQ files and quality-trimmed them using Trimmomatic. I'd like to align them using STAR and generate count matrices for downstream differential expression analysis. I am confused about the options to use during the STAR alignment.

Here is what I have:

STAR --genomeDir $STARINDICES/ \
--readFilesIn sample1_read1.fq.gz sample1_read2.fq.gz \
--outFileNamePrefix out_ \
--runThreadN 4 \
--outSAMattrRGline ID:"sample1" SM:"sample1" LB:"sample1" PL:"ILLUMINA" \
--outBAMsortingThreadN 4 \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMstrandField intronMotif \
--outFilterIntronMotifs RemoveNoncanonicalUnannotated \
--readFilesCommand zcat \
--chimSegmentMin 20 \
--genomeLoad NoSharedMemory

Specifically, am I correct to select these three options?

--outSAMunmapped Within \   # outputs unmapped reads within the main SAM file.

--outSAMstrandField intronMotif \   # strand derived from the intron motif. Reads with inconsistent and/or non-canonical introns are filtered out.

--outFilterIntronMotifs RemoveNoncanonicalUnannotated \   # filter out alignments that contain non-canonical unannotated junctions when using an annotated splice junctions database. The annotated non-canonical junctions will be kept.

I will be using htseq-count or featureCounts (but may use Cufflinks as well) to generate expression counts.

Have I missed anything? And do I need to modify the resulting BAM file in any way before using it as input for htseq-count / featureCounts?

Thanks.


If you haven't already, see section 3.2.2 of the STAR manual, "ENCODE options" (for the long RNA-Seq pipeline).

If you want to read more about the latest ENCODE options for RNA-Seq, you will find documentation here.

Comment written 19 months ago by erwan.scaon
h.mon wrote (19 months ago):

I wouldn't fiddle too much with the parameters; STAR does a very good job with the defaults. And unless you plan on variant calling later, you don't need to add read group information.

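More importantly, it can output counts similar to those of featureCounts / HTSeq; you just have to use the parameter --quantMode GeneCounts (the value is case-sensitive). This needs gene annotations, either included when the genome index was generated or supplied at mapping time with --sjdbGTFfile.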

I would use something like:

STAR --genomeDir $STARINDICES/ \
  --readFilesIn sample1_read1.fq.gz sample1_read2.fq.gz \
  --outFileNamePrefix out_ --runThreadN 4 \
  --outBAMsortingThreadN 4 \
  --outSAMtype BAM SortedByCoordinate \
  --readFilesCommand zcat \
  --quantMode GeneCounts \
  --genomeLoad NoSharedMemory

Note that you can even use --outSAMtype None if you have no interest in the BAM file.
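For what it's worth, with --quantMode GeneCounts STAR writes a ReadsPerGene.out.tab file next to its other outputs (with the prefix above that would be out_ReadsPerGene.out.tab). The file starts with four summary lines (N_unmapped, N_multimapping, N_noFeature, N_ambiguous), followed by one line per gene with four columns: gene ID, unstranded counts, counts for read 1 on the gene's strand, and counts for read 2 on the gene's strand. TruSeq stranded libraries are normally reverse-stranded, so column 4 is usually the one to use, but it is worth confirming on your own data. A minimal sketch; the function names star_gene_counts and star_column_sums are mine (hypothetical), not part of STAR:

```shell
# Hypothetical helper (not part of STAR): print gene ID plus one count
# column from a ReadsPerGene.out.tab file, skipping the 4 summary lines.
# Usage: star_gene_counts FILE [COLUMN]   (COLUMN defaults to 4)
star_gene_counts() {
  awk -v c="${2:-4}" 'BEGIN { FS = OFS = "\t" } NR > 4 { print $1, $c }' "$1"
}

# Hypothetical strandedness check: sum each count column over all genes.
# For a reverse-stranded library, "reverse" should be close to
# "unstranded" and "forward" should be small.
star_column_sums() {
  awk 'BEGIN { FS = "\t" }
       NR > 4 { u += $2; f += $3; r += $4 }
       END { printf "unstranded=%d forward=%d reverse=%d\n", u, f, r }' "$1"
}
```

Something like `star_column_sums out_ReadsPerGene.out.tab` followed by `star_gene_counts out_ReadsPerGene.out.tab 4 > sample1.counts.tsv` would then give a two-column table for one sample.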

swbarnes2 wrote (19 months ago):

Of those last three options, I have never used the intron-related ones (my lab cares more about counting than analyzing splice sites), and I always include unmapped reads in the .bam. It's just so much easier down the road, if you or someone else wants to reanalyze the data, to know that you didn't wrongly throw anything away.

And yes, --quantMode GeneCounts is useful. You can feed those counts into programs that do differential expression.
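Differential expression tools generally want one genes-by-samples table rather than one file per sample. As a rough sketch, the per-sample ReadsPerGene.out.tab files that STAR writes can be merged with awk. The function name star_count_matrix is hypothetical, and the sketch assumes every sample was mapped against the same annotation so the gene lists match; the first four lines of each file are STAR's N_* summary lines and are skipped:

```shell
# Hypothetical helper (not part of STAR): merge one count column from
# several ReadsPerGene.out.tab files into a tab-separated matrix.
# Usage: star_count_matrix COLUMN FILE1 [FILE2 ...]
# Assumes all files list the same genes (same GTF for every sample).
star_count_matrix() {
  _col="$1"
  shift
  awk -v c="$_col" '
    BEGIN { FS = OFS = "\t" }
    FNR == 1 { nfiles++; header = header OFS FILENAME }  # one header column per file
    FNR > 4 {
      if (nfiles == 1) order[++ngenes] = $1   # keep gene order of the first file
      counts[$1] = counts[$1] OFS $c          # append this sample count
    }
    END {
      print "gene_id" header
      for (i = 1; i <= ngenes; i++) print order[i] counts[order[i]]
    }' "$@"
}
```

For reverse-stranded data, `star_count_matrix 4 s1_ReadsPerGene.out.tab s2_ReadsPerGene.out.tab > counts.tsv` would produce a table with the file paths as column names; rename them to sample IDs before loading the matrix into your differential expression tool of choice.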
