Question

STAR Aligner --quantMode GeneCounts


2.9 years ago

Seigfried ▴ 80

Hello,

I have searched for an answer to this, as it is by no means a new question. It was discussed in "RNA-seq: Explain STAR quantMode geneCounts values". I read that thread but I still don't understand it.

I have paired-end RNA-Seq human data. The filenames are like this : A1_1.fq.gz A1_2.fq.gz

I have run the aligner STAR with the option --quantMode GeneCounts.

The resulting ReadsPerGene.out.tab has 4 columns:

column 1: gene ID
column 2: counts for unstranded RNA-seq
column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
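For reference, the column layout described above can be read with a few lines of Python. This is a minimal sketch, not an official parser: the function name `read_star_gene_counts`, the gene IDs, and all numbers are made up for illustration. It relies on the file being tab-separated and starting with four summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous) before the per-gene rows.

```python
import csv
import io


def read_star_gene_counts(handle):
    """Split a ReadsPerGene.out.tab stream into summary rows and gene rows.

    Each value is a tuple (column 2, column 3, column 4), i.e. the
    (unstranded, "-s yes", "-s reverse") counts.
    """
    summary, genes = {}, {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        values = tuple(int(v) for v in row[1:4])
        # The N_* summary rows are kept apart from the per-gene rows.
        target = summary if row[0].startswith("N_") else genes
        target[row[0]] = values
    return summary, genes


# Made-up numbers and gene IDs, for illustration only.
example = io.StringIO(
    "N_unmapped\t100\t100\t100\n"
    "N_multimapping\t20\t20\t20\n"
    "N_noFeature\t5\t60\t8\n"
    "N_ambiguous\t3\t0\t1\n"
    "geneA\t50\t2\t48\n"
    "geneB\t30\t1\t29\n"
)
summary, genes = read_star_gene_counts(example)
print(genes["geneA"])
```

For a real file, pass `open(path, newline="")` instead of the in-memory example.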

So I see that I have RNA Seq counts for both R1 and R2 strands. I assume that "1st read strand" refers to A1_1.fq.gz file while the "2nd read strand" refers to A1_2.fq.gz.

Since I have multiple files like this, I want to ultimately find differential gene expression by comparing 2 samples.

What do I do with this output? Do I sum the R1 and R2 strand output counts together?

I know the strandedness of each gene (from the .GFF or .GTF file). For example, if I know a gene is located on the + strand only, I assume that I should not consider the - strand counts at all. So I believe that I should split my GTF into two parts, + and -, and take only the counts that match each gene's strand.

Is this correct?

As a side question, is --quantMode GeneCounts comparable to htseq-count? Or is htseq-count better?

RNA-Seq STAR • 2.0k views

updated 2.9 years ago by swbarnes2 14k • written 2.9 years ago by Seigfried ▴ 80

score 3 · Accepted Answer · 2021-06-02

> So I see that I have RNA Seq counts for both R1 and R2 strands. I assume that "1st read strand" refers to A1_1.fq.gz file while the "2nd read strand" refers to A1_2.fq.gz.

Nope. That's not what that means at all. If you run STAR with just R1, you'll get the same format of output, and the numbers won't be so different either.
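To spell that out: columns 3 and 4 are not per-file counts. They count the same read pairs under two different assumptions about how the library was prepared (htseq-count -s yes versus -s reverse), so you pick one column for the whole experiment based on the library protocol and never sum them. If the protocol is unknown, a common heuristic is to compare the column totals. The sketch below assumes the counts are already loaded as tuples of (column 2, column 3, column 4); the function name `guess_strandedness` and the 0.8 cutoff are arbitrary choices for illustration, not anything STAR defines.

```python
def guess_strandedness(genes, cutoff=0.8):
    """Guess the library type from per-gene (col2, col3, col4) count tuples.

    cutoff is an arbitrary illustrative threshold: if one stranded column
    holds at least that share of the stranded counts, it is assumed to
    match the library protocol.
    """
    col3 = sum(c[1] for c in genes.values())
    col4 = sum(c[2] for c in genes.values())
    total = col3 + col4
    if total == 0:
        return "unknown"
    if col3 / total >= cutoff:
        return "forward"      # use column 3 (htseq-count -s yes)
    if col4 / total >= cutoff:
        return "reverse"      # use column 4 (htseq-count -s reverse)
    return "unstranded"       # use column 2


# Made-up counts: column 4 dominates, as in many dUTP-based stranded kits.
genes = {"geneA": (50, 2, 48), "geneB": (30, 1, 29)}
print(guess_strandedness(genes))
```

When columns 3 and 4 are roughly equal, the library is most likely unstranded and column 2 is the one to use. Whatever the heuristic says, the kit documentation is the better authority.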

> Since I have multiple files like this, I want to ultimately find differential gene expression by comparing 2 samples.

Two? Just two? No biological replicates? Just use Excel. Other software uses clever math to estimate the variability within each group from the replicates, but you don't have any.
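The spreadsheet-level comparison amounts to normalizing each sample for library size and taking a ratio. Here is a minimal sketch, assuming one count per gene has already been taken from the appropriate column for each sample. The helper names `cpm` and `log2_fold_change`, the pseudocount of 1, and the numbers are illustrative assumptions. Without replicates this yields fold changes only, not p-values.

```python
import math


def cpm(counts):
    """Scale raw counts to counts per million of the sample total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counts")
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(sample_a, sample_b, pseudocount=1.0):
    """log2(B / A) on CPM values for genes present in both samples.

    The pseudocount (an arbitrary choice here) avoids division by zero
    for genes with no reads in one sample.
    """
    cpm_a, cpm_b = cpm(sample_a), cpm(sample_b)
    return {
        gene: math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))
        for gene in cpm_a
        if gene in cpm_b
    }


# Made-up counts for two samples.
sample_a = {"geneA": 100, "geneB": 900}
sample_b = {"geneA": 400, "geneB": 600}
lfc = log2_fold_change(sample_a, sample_b)
print(lfc)
```

A positive value means the gene is higher in the second sample. Low-count genes give unstable ratios, so it is worth filtering them out before reading much into the result.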

> I know the strandedness of a gene (from the .GFF or .GTF file). For example, if I know the gene is located on the + strand only, I assume that I should not consider the - strand counts at all.

Wrong. You need to stop and learn what it is you are doing. Pushing nonsense through analysis software is only going to give you grief.

> is --quantMode GeneCounts comparable to htseq-count?

They are supposed to be the same: STAR's counts are meant to coincide with those from htseq-count run with default parameters. Using RSEM would be better, or starting over with kallisto or Salmon on transcriptome references; all of those options will handle ambiguous gene assignments better.