I have searched for an answer to this, since it is by no means a new question. It was discussed in "RNA-seq: Explain STAR quantMode geneCounts values". I read that thread, but I still don't understand it.
I have paired-end human RNA-Seq data. The files are named like this: A1_1.fq.gz and A1_2.fq.gz
I ran the STAR aligner with the option --quantMode GeneCounts.
The resulting ReadsPerGene.out.tab has 4 columns:
column 1: gene ID
column 2: counts for unstranded RNA-seq
column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
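To make the layout concrete, here is a minimal sketch (Python, standard library only) of how such a file could be read. The function name read_reads_per_gene and the sample rows are made up for illustration and are not from my real data. It assumes the file is tab-separated and starts with the summary rows STAR writes (N_unmapped, N_multimapping, N_noFeature, N_ambiguous) before the per-gene rows.

```python
import csv
import io


def read_reads_per_gene(handle):
    """Split a ReadsPerGene.out.tab stream into summary rows and per-gene rows.

    Each value is a tuple: (unstranded, read-1 strand, read-2 strand),
    i.e. columns 2, 3 and 4 of the file.
    """
    summary, genes = {}, {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        name = row[0]
        counts = tuple(int(value) for value in row[1:4])
        # Assumption: summary rows are the ones whose ID starts with "N_".
        if name.startswith("N_"):
            summary[name] = counts
        else:
            genes[name] = counts
    return summary, genes


# Made-up example content, only to show the shape of the file.
example = (
    "N_unmapped\t100\t100\t100\n"
    "N_multimapping\t50\t50\t50\n"
    "N_noFeature\t20\t300\t25\n"
    "N_ambiguous\t5\t1\t2\n"
    "GENE_A\t120\t3\t117\n"
    "GENE_B\t40\t1\t39\n"
)

summary, genes = read_reads_per_gene(io.StringIO(example))

# Total per-gene counts in each of the three count columns.
column_totals = [sum(counts[i] for counts in genes.values()) for i in range(3)]
print(column_totals)  # [160, 4, 156]
```

For a real file, io.StringIO(example) would be replaced with open("ReadsPerGene.out.tab", newline=""). The per-column totals are printed only so the three count columns can be compared side by side.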
So I see that I have RNA-Seq counts for both the R1 and R2 strands. I assume that "1st read strand" refers to the A1_1.fq.gz file, while "2nd read strand" refers to A1_2.fq.gz.
Since I have multiple files like this, I want to ultimately find differential gene expression by comparing 2 samples.
What do I do with this output? Do I sum the R1 and R2 strand output counts together?
I know the strand of each gene from the .GFF or .GTF file. For example, if a gene is located on the + strand only, I assume I should not consider the - strand counts at all. So I believe I should split my GTF into two parts, + and -, and take only the counts that match each gene's strand.
Is this correct?
As a side question, is --quantMode GeneCounts comparable to htseq-count? Or is htseq-count better?