Question

Read1 and read2 asymmetry in stranded, paired-end rnaseq data aligned by STAR

0

Entering edit mode

12 weeks ago

yampolsk ▴ 10

Dear Colleagues,

I see a significant asymmetry between read1 (antisense) and read2 (sense) counts mapped to genes by STAR. I have a couple of different questions regarding to this. The answers depend on what exactly STAR is doing when aligning stranded data.

Question 1: Which is correct, A, B, or C?

A. When mapping stranded reads to the genome STAR maps both read1 and read2 to each strand of a gene, just keeping track of what mapped to what strand. This implies that there should be equal (or at least very similar) counts of read1 and read2 mapped to each gene. Both reads reflect the frequency of a transcript and the sum of the two should be used as a measure of transcript abundance.

B. When mapping stranded reads to the genome STAR takes into account the orientation of each gene and maps read1 to the antisense strand only and read2 to the sense strand only. This means that read2 should map in larger numbers (assuming, as it should be, transcription from only one strand) and only read2 (reverse read, identical to sense strand and to RNA) count should be used as a measure of transcription level. (And read1 is a measure of antisense transcription rate).

C. When mapping stranded reads to the genome STAR maps read1 to one strand and read2 to the other strand, without gene orientation information taken into account. In this case one should use read2 for transcripts in '+' and read1 for transcripts in '-' orientation.

Question 2. If either A or C is the correct answer to Q1, can I used the data to estimate antisense transcription? chatGPT said yes and even wrote a Jupyter notebood to do that, but I have a feeling it is hallucinating, caught in the hard-to-grasp difference between "sense vs. antisense reads" and "sense vs. antisense transcript abundance".

Many thanks in advance!

Lev Yampolsky

stranded STAR RNAseq antisense sense • 461 views

ADD COMMENT • link updated 12 weeks ago by rfran010 ★ 1.6k • written 12 weeks ago by yampolsk ▴ 10

score 1 · Answer 1 · 2025-06-24

A. may be the most correct, but I think you are making some incorrect assumptions about READ1 and READ2, so that I don't think any of these options are exactly correct. At the very least I think some of the terms are muddled.

READ1 and READ2 are expected to come from the same original fragment, so B is impossible (an individual RNA fragment cannot be both sense and antisense).

STAR maps these reads to the reference genome and keeps track if a read mapped to the reference plus strand or reference minus strand. Since paired reads are expected to come from opposite ends of the same fragment, then if READ1 maps to the plus strand, READ2 will map to the minus strand, and vice versa (if READ1 maps to minus strand, READ2 maps to plus strand). If the pairs do not follow this expectation, then the alignment is discordant.

At this point, there is no "strandedness" information accounted for, technically READ1 can be mapped to plus or minus reference strand. 'sense' and 'antisense' do not make sense here since there's no information on the original fragment orientation yet. To derive stranded (sense/antisense) info, you need to make additional assumptions based on the library preparation and then take into account gene orientation.

In this idea, I would argue there is no stranded 'reads' or 'data', but a stranded library. With a stranded library, you simply have additional expectations for these read alignments.

Strandedness is then largely determined by whether you expect READ1 to be sense or antisense to the original fragments.

In a reverse-stranded library, it is expected READ1 is reverse complementary ("antisense") to the original fragment, therefore you expect READ2 is complementary ("sense") to the original fragment.

So, if READ2 is mapped to the reference plus strand, then the original fragment sequence is derived from the plus strand. But, if READ2 is mapped to the reference minus strand, then the original fragment was derived from the minus strand. At this point, you still cannot say if that fragment is sense or antisense to a gene though.

Now, you can take into account the gene orientation. If a gene is annotated on the minus strand, then any READ2s mapping to the minus strand in that region are "sense" counts, meaning that they are evidence of the expected transcript. Even though the paired READ1's sequence is "antisense", it is counted as "sense" since it is expected to come from the same original fragment, and therefore is also evidence of transcript abundance.

If however, a gene is annotated on the minus strand, and READ2 maps to the plus strand, then that is evidence of antisense transcription, based on the assumptions above, and both READ1 and READ2 are counted towards antisense abundance.

Note, if the library is forward-stranded, the assumptions are opposite: READ1 is complementary to the original fragment...

(It can be confusing because a READ may be "sense" to the fragment, but the same fragment may be "antisense" to the gene or expected transcript. I tried to use "complementary" when referring to comparing the READ to the original fragment. Also, sense/antisense terms are generally independent from plus/minus terms referring to reference strands)