RNA-seq: Explain STAR quantMode geneCounts values
1
31
Entering edit mode
5.7 years ago

Background: I mapped my reads against the reference genome using the STAR command:

star
--outFilterMultimapNmax 1
--outFilterMatchNmin 35
--outSAMtype BAM Unsorted SortedByCoordinate
--quantMode GeneCounts
--twopassMode Basic
--outFileNamePrefix "$newfilename" --genomeDir$indexedreference
--sjdbGTFfile $gtfreference --readFilesIn$inputfile1 inputfile2  Output: As explained on the STAR manual (here), I get a file called *ReadsPerGene.out.tab with the following results: N_unmapped 20460635 20460635 20460635 N_multimapping 0 0 0 N_noFeature 23097482 39371367 23239579 N_ambiguous 497097 12313 293179 ENSDARG00000009262 139 0 139 ENSDARG00000018618 2372 5 2367 ENSDARG00000045874 206 4 303 ENSDARG00000057468 13 101 13 ENSDARG00000045873 273 2 271 ENSDARG00000079143 72 0 72 ENSDARG00000057714 206 31 175  Explanation from the manual: Counting number of reads per gene. With --quantMode GeneCounts option STAR will count number reads per gene while mapping. A read is counted if it overlaps (1nt or more) one and only one gene. Both ends of the pairedend read are checked for overlaps. The counts coincide with those produced by htseq-count with default parameters. This option requires annotations (GTF or GFF with –sjdbGTFfile option) used at the genome generation step, or at the mapping step. STAR outputs read counts per gene into ReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options: column 1: gene ID column 2: counts for unstranded RNA-seq column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes) column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse) Select the output according to the strandedness of your data. Note, that if you have stranded data and choose one of the columns 3 or 4, the other column (4 or 3) will give you the count of antisense reads. With --quantMode TranscriptomeSAM GeneCounts, and get both the Aligned.toTranscriptome.out.bam and ReadsPerGene.out.tab outputs. Questions: 1. Which column should I use for differential expression analysis? Why? 2. Should the sum of columns 3 and 4 equal column 2? Why or why not? What does it mean when they are not equal? 3. Why for some transcripts the value in column 3 is higher than the value in column 4, while for most of the transcripts the value in column 4 is higher than the value in column 3? 4. What does it mean when the value in column 3 or 4 is higher than the value in column 2? Thank you! RNA-Seq STAR Differential expression • 40k views ADD COMMENT 1 Entering edit mode Is this a homework assignment? ADD REPLY 0 Entering edit mode Thanks for the clear explanations. If I need FPKM data, do I use these read counts ? ADD REPLY 8 Entering edit mode 5.7 years ago Rohit ★ 1.5k 1. The column you need might be the 2nd, if you have non-strand specific data. Unless you have specifically designed a strand-specific protocol, usually this is not the case. But for downstram analysis, it is encouraged to run htseq-count or another read-counting program again. 2.2 is for unstranded hits, but those overlapping on opposite strand of features are considered ambiguous. $3 reports hits based on the strand you have given in your gff annotation, and$4 in the reverse direction of your features in gff (for PE-data the 5'3'-direction is also considered). Refer to -s option of htseq-count

3. Some of the reads might have a splice-form expressed higher in the preceeding exon, this can be the reason for lesser reads aligning. Splicing is usually the process which leads to differences in mapping coverages over the read. Usually what you need would be based on the analysis you need to do - gene-specific, transcript-specific or something else altogether.

4. If $3>$2 or $4>$2, it means that unstranded data has lesser hits than the cases of directions, happens when reads are assigned as ambiguous. Reads overlapping on the opposite strand (of the gff feature) are considered ambiguous for unstranded but are assigned to genes for $3 or$4. This is the reason for the difference, this was refered here on STAR discussion

Sorry for the confusion.

1
Entering edit mode

Thanks for the explanations!

Here is my understanding: column 3 and column 4 are corresponding to different strand-specific library preparation protocol. Please see HTSeq -s/--stranded description as follows. This blog explains the read orientations in different strand-specific/un-stranded protocols.

-s <yes no="" reverse="">, --stranded=<yes no="" reverse=""> whether the data is from a strand-specific assay (default: yes) For stranded=no, a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For stranded=reverse, these rules are reversed.

So, ideally, under the condition that we have the totally correct gene annotation and no library errors, there should be one column(column 3 or 4) with all zeros (no reads would be aligned under the wrong protocol specification) if it is strand-specific RNAseq. But the fact is, even under strand-specific protocol, there always read counts in both column 3 and column 4.

I think that's because 1) There might be some small amount of un-stranded reads left during strand-specific library preparation (not pretty sure). 2) The genome annotation is incomplete, there might be some novel anti-sense gene expression, and those RNA reads were incorrectly mapped to that gene.

So, if we analyze strand specific RNA-seq data, we should choose right column (3 or 4) based on the strand specific protocol during library preparation.

Please point out if my understanding is not correct! Thanks!

0
Entering edit mode
1. Since I have different values for columns 3 and 4, does that automatically mean that I have strand-specific data? If I that's the case, should I use column 3 or 4 for differential expression? And why?

2. Why is $2=$3+$4 NOT the case for many of my transcripts? For example, in the extract above I have 206=4+303 and 13=101+13, which are obviously wrong. 3. Wouldn't that bias my differential expression analysis if I use either column? And why should I use column x in particular, instead of the other? 4. In my original question, I show some examples in which$3>$2 and$4>$2, so it is possible. That means that there's something wrong with my data, there's something wrong with the calculations by STAR or there's something wrong in our understanding of those values. Thank you for the quick response! ADD REPLY 0 Entering edit mode I have edited the answer, I did not consider ambiguity of mapping before, sorry for that ADD REPLY 0 Entering edit mode Can someone please explain point 1. Why is it advised to run htseq-count again if STAR is performing it with quantMode option?thanks ADD REPLY 0 Entering edit mode Although one can certainly do it, I don't think it is advisable, and both results should be largely the same. ADD REPLY 0 Entering edit mode Good explanation. Really helped with the clearing out the doubts. Thanks, ADD REPLY 0 Entering edit mode Sorry for digging in this old thread, but I have an understanding problem: 1). I have around 4000 genes (of 14000 overall) where$2 < $3 +$4 but I have no overlapping genes on the +/- strand.
So how can this be explained?

2). For most reads I have $4 >$3, but for ~1000 $3 >$4. This I see in all replicated and all samples. So when I have GeneX with $3 >$4 than for GeneX I have $3 >$4 in all other samples/replicates. All these genes are equally distributed to +/- strand.
Does anyone have an explanation for this?