Question

How to improve 2-nucleotide RNA-seq mapping accuracy

7 hours ago

2411110159 • 0

Hi, sorry if this is a dumb question, I might be overthinking this. I’m trying to test mapping accuracy for a special 2-letter RNA-seq system. In our experiment, methylated A and C stay the same, but unmethylated A is converted to G and unmethylated C to T at a very high conversion rate (~99%). So in theory the reads consist almost entirely of G and T.

base-conversion • 70 views

Since we do not have real data yet, I tried to simulate this using strand-specific RNA bisulfite-seq reads (C to T), in which Read1 corresponds to the transcript strand. I first aligned the original bisulfite reads with HISAT-3N to get a reference alignment to compare against. Then, to simulate 2-letter reads, I converted about 99% of the A bases in Read1 to G, so the simulated reads contain mostly G and T with a small amount of A and C.
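For concreteness, the simulation step described above could be sketched roughly like this. This is only an illustration under assumptions: the function names are hypothetical, the FASTQ is assumed to be plain 4-line records (no wrapped sequences), and each A is converted independently with a fixed probability.

```python
import random

def simulate_a_to_g(seq, rate=0.99, rng=None):
    """Convert each A to G independently with probability `rate`."""
    if rng is None:
        rng = random.Random()
    return "".join(
        "G" if base == "A" and rng.random() < rate else base
        for base in seq
    )

def convert_fastq_records(lines, rate=0.99, seed=0):
    """Yield FASTQ lines, converting only the sequence line of each record.

    Assumes strict 4-line records; the quality line is left untouched even
    though it may contain the character 'A'.
    """
    rng = random.Random(seed)
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 1:
            yield simulate_a_to_g(line, rate, rng)
        else:
            yield line
```

Fixing the seed makes the simulated read set reproducible, which helps when comparing several alignment strategies on exactly the same input.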

I then tried to map these simulated reads with HISAT-3N. For the first pass, I converted all A to G in the FASTQ and aligned to an AG-converted reference using HISAT-3N with base-change C,T. For the second pass, I aligned the same reads to a TC-converted reference using base-change G,A. I selected reads from the strand that matches the expected chemistry and then combined the two sets of alignments.
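The full in-silico conversions mentioned above (A to G for the first pass, T to C for the second-pass reference) can be sketched as follows. The helper names are hypothetical, and this only shows the sequence conversion itself, not the HISAT-3N index building or alignment calls.

```python
# Translation tables for the two full conversions; lowercase (soft-masked)
# bases are converted as well so masking is preserved.
A_TO_G = str.maketrans("Aa", "Gg")
T_TO_C = str.maketrans("Tt", "Cc")

def convert_ag(seq):
    """Replace every A with G."""
    return seq.translate(A_TO_G)

def convert_tc(seq):
    """Replace every T with C."""
    return seq.translate(T_TO_C)

def convert_fasta(lines, converter):
    """Yield FASTA lines with sequence lines converted and headers untouched."""
    for line in lines:
        line = line.rstrip("\n")
        yield line if line.startswith(">") else converter(line)
```

Skipping header lines matters here: a name such as `>chrA` would otherwise be rewritten and no longer match the annotation.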

However, the final BAM is very inconsistent with the original bisulfite alignment. Many reads map to different positions, some uniquely mapped reads become unmapped, and overall the agreement is much worse than expected. It seems that the very low sequence complexity of the simulated reads may be causing instability, or my simulation strategy may be wrong.
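To put numbers on "inconsistent", one option is to classify every read by how its placement changed between the two alignments. The sketch below assumes the placements have already been extracted from the two BAM files into dictionaries keyed by read name (that extraction is not shown), and the category names are made up for illustration.

```python
from collections import Counter

def compare_alignments(truth, test, tolerance=0):
    """Classify reads by agreement between two alignments.

    truth, test: dict mapping read name -> (chrom, pos, strand),
    or None if the read is unmapped. Reads missing from `test`
    are treated as unmapped.
    """
    counts = Counter()
    for name, expected in truth.items():
        observed = test.get(name)
        if expected is None and observed is None:
            counts["both_unmapped"] += 1
        elif expected is None:
            counts["newly_mapped"] += 1
        elif observed is None:
            counts["lost"] += 1
        elif (expected[0] == observed[0]
              and expected[2] == observed[2]
              and abs(expected[1] - observed[1]) <= tolerance):
            counts["same_position"] += 1
        else:
            counts["different_position"] += 1
    return counts
```

Breaking the counts down this way separates reads that became unmapped from reads that moved, which may point to different causes.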

I am also not sure how to properly handle paired-end reads in this situation. If the two mates overlap, the heavy conversions mean the overlapping region is no longer complementary, which might confuse the aligner. I wonder if enforcing paired-end constraints would help, or if there are better strategies or aligners for reduced-alphabet reads.
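A toy fragment can help reason about what the two mates look like. The sketch below rests on an assumption that may not hold for the real chemistry: a directional library in which Read1 carries the fully converted transcript-strand sequence and Read2 is sequenced as the reverse complement of that same converted fragment. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
TWO_LETTER = str.maketrans("AC", "GT")  # full A->G and C->T conversion

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def mate_pair(fragment):
    """Return (read1, read2) for a fully converted fragment.

    Assumption: Read2 is the reverse complement of the converted
    fragment, as in a standard directional library.
    """
    read1 = fragment.translate(TWO_LETTER)
    read2 = reverse_complement(read1)
    return read1, read2

read1, read2 = mate_pair("ACGTTAGC")
print(read1)  # GTGTTGGT
print(read2)  # ACCAACAC
```

Under this assumption Read1 contains only G and T while Read2 contains only C and A, so each mate would need its own conversion type when aligning.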

If anyone has experience with reduced-alphabet mapping or has suggestions for how to simulate or align these reads more reliably, I would really appreciate the help. I am trying to understand whether the problem comes from my simulation steps or from the intrinsic limitations of mapping reads with such low sequence complexity.
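One rough way to gauge the intrinsic limitation is to measure how many k-mers of a reference sequence stop being unique once it is projected onto two letters. The sketch below is a toy measure on a single strand with a hypothetical function name, not a substitute for a real mappability analysis.

```python
from collections import Counter

TWO_LETTER = str.maketrans("AC", "GT")  # full A->G and C->T projection

def ambiguous_kmer_fraction(seq, k, table=None):
    """Fraction of k-mer positions whose k-mer occurs more than once in seq.

    If `table` is given, the sequence is first projected with it
    (for example onto the G/T alphabet).
    """
    if table is not None:
        seq = seq.translate(table)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total if total else 0.0

print(ambiguous_kmer_fraction("ACGT", 2))              # 0.0
print(ambiguous_kmer_fraction("ACGT", 2, TWO_LETTER))  # 0.666...
```

Running this on transcript sequences with k set to the read length, before and after projection, would give an upper bound on how many reads can still be placed uniquely even with a perfect aligner.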

Thanks.
