Dear Colleagues,
I see a significant asymmetry between read1 (antisense) and read2 (sense) counts mapped to genes by STAR. I have a couple of questions regarding this; the answers depend on what exactly STAR is doing when aligning stranded data.
Question 1: Which is correct, A, B, or C?
A. When mapping stranded reads to the genome, STAR maps both read1 and read2 to each strand of a gene, simply keeping track of which strand each read mapped to. This implies that there should be equal (or at least very similar) counts of read1 and read2 mapped to each gene. Both reads reflect the frequency of a transcript, and the sum of the two should be used as a measure of transcript abundance.
B. When mapping stranded reads to the genome, STAR takes the orientation of each gene into account and maps read1 to the antisense strand only and read2 to the sense strand only. This means that read2 should map in larger numbers (assuming, as should be the case, transcription from only one strand), and only the read2 count (the reverse read, identical in sequence to the sense strand and to the RNA) should be used as a measure of transcription level. (Read1 would then be a measure of the antisense transcription rate.)
C. When mapping stranded reads to the genome, STAR maps read1 to one strand and read2 to the other, without taking gene orientation into account. In this case one should use read2 counts for transcripts in '+' orientation and read1 counts for transcripts in '-' orientation. (A sketch of how I have been trying to check this from STAR's own gene counts follows below.)
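
For reference, one thing I have tried is running STAR with --quantMode GeneCounts and comparing the two stranded columns of ReadsPerGene.out.tab (as I understand the manual, column 3 counts reads whose strand matches read1, htseq-count -s yes style, and column 4 counts reads whose strand matches read2, -s reverse style). A minimal sketch, assuming such a file is in the working directory:

# a minimal sketch, assuming ReadsPerGene.out.tab from a run with
# --quantMode GeneCounts sits in the working directory
import csv

col3_total = col4_total = 0
with open("ReadsPerGene.out.tab") as fh:
    for row in csv.reader(fh, delimiter="\t"):
        if row[0].startswith("N_"):  # skip N_unmapped, N_multimapping, etc.
            continue
        col3_total += int(row[2])  # reads on the same strand as read1
        col4_total += int(row[3])  # reads on the same strand as read2

print("column 3 (read1-stranded) total:", col3_total)
print("column 4 (read2-stranded) total:", col4_total)

If I understand the convention correctly, for a dUTP-type (reverse) stranded library the column 4 totals should dominate, and the smaller column would then be the candidate antisense signal, which is what my second question is about.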
Question 2. If either A or C is the correct answer to Q1, can I use the data to estimate antisense transcription? ChatGPT said yes and even wrote a Jupyter notebook to do that, but I have a feeling it is hallucinating, caught in the hard-to-grasp difference between "sense vs. antisense reads" and "sense vs. antisense transcript abundance". Something along the lines of the sketch below is what I have in mind.
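
For concreteness, here is roughly what such a notebook would do (my own reconstruction, not ChatGPT's actual code): count sense vs. antisense fragments over a single gene with pysam, assuming a coordinate-sorted, indexed BAM from a dUTP-type (reverse) stranded library. The file name and gene coordinates in the usage line are made up.

import pysam

def sense_antisense(bam_path, contig, start, end, gene_strand):
    """Count sense vs. antisense fragments over one gene interval."""
    sense = antisense = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig, start, end):
            # one count per fragment: use read1 of proper pairs only
            if (not read.is_read1 or not read.is_proper_pair
                    or read.is_secondary or read.is_supplementary):
                continue
            # dUTP protocol: read1 is antisense to the fragment, so the
            # fragment's strand is the opposite of read1's alignment strand
            frag_strand = "+" if read.is_reverse else "-"
            if frag_strand == gene_strand:
                sense += 1
            else:
                antisense += 1
    return sense, antisense

# usage, with made-up coordinates:
# print(sense_antisense("Aligned.sortedByCoord.out.bam", "chr1", 10000, 20000, "+"))

My worry, as stated above, is whether the "antisense" count from something like this reflects genuine antisense transcripts or merely the read-orientation bookkeeping of the protocol.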
Many thanks in advance!
Lev Yampolsky