Why do paired ends map to same coordinates of a scaffold?
1
0
Entering edit mode
21 months ago
wmorgan • 0

I used magic-BLAST to map SRA paired reads (Illumina 150 bp) to the the corresponding whole genome assembly (WGA). I expected that paired end reads would map near each other on a scaffold; however, as shown below the paired reads typically overlap. Here is an example from each magic-blast output file that shows the mapping of a read pair (query) from four related libraries to assembled contigs (refID) of the WGA:

queryID               refID  %_ident q_start  q_end   r_start   r_end

SRR_1.sra.6388.1      S107.1  100     27  127       80397   80497
SRR_1.sra.6388.2      S107.1  100     1   101       80497   80397

SRR_2.sra.576423.1    S007.1  100     1   151       297238  297388
SRR_2.sra.576423.2    S007.1  100     58  151       297455  297362

SRR_3.sra.4219.1      S516.1  99.0654   45  151     40745   40639
SRR_3.sra.4219.2      S516.1  99.1379   1   116     40630   40745

SRR_4.sra.3159.1      S557.1  99.3333 1   150       37510   37659
SRR_4.sra.3159.2      S557.1  100     1   151       37706   37556


As seen, the SRR_1 and SRR_3 read pairs are mapping to (nearly) the same (reversed) coordinates of the corresponding contig. The mapped coordinates of the SRR_2 and SRR_4 read pairs are off-set, but still overlap (for other read pairs, the overlap is as extensive as for the SRR_1 and SRR_3 pairs).

I suppose this mapping of read pairs is possible if the library fragments used to create the paired end library were about the same size as the read lengths (150 bp), but I would expect that longer library fragments (300-600 bp) would have been made. Unfortunately, the SRA metadata doesn't include these details about library construction. The SRA metadata does note that two of the four libraries are paired-end and the other two are mate pair, which is even more perplexing. If the read pairs were from a mate pair library, I would expect them to map to the same scaffold but hundreds of base pairs from each other.

My working hypothesis is that all four libraries were made with very small fragments, but perhaps there is another likely explanation. Or perhaps my expectations about the mapping of pair end and mate pair reads are wrong. Any insights?

reads map magicblast SRA • 959 views
0
Entering edit mode
21 months ago
GenoMax 125k

SRR_1 and SRR_3 read pairs are mapping to (nearly) the same (reversed) coordinates of the corresponding contig.

You can check if the two reads overlap by using bbmerge.sh (from BBMap suite) and other programs that do this type of analysis. It would indeed be odd if the insert sizes of these libraries were short or they perfectly overlapped.

Another option would be to estimate the insert size by using instructions here: Target fragment size versus final insert size

If the read pairs were from a mate pair library, I would expect them to map to the same scaffold but hundreds of base pairs from each other.

-

Aligner needs to understand/know if the data is from a mate-pair library. My guess it magicblast is not able to handle this type of data. I would suggest that you use bbmap.sh or a different aligner that can do this. See the following quote from BBMap help guide:

BBMap supports paired reads. When mapped as pairs, reads in the normal “innie” fragment orientation (left read mapped to plus strand, right read mapped to minus strand), with a pair distance within some margin of the average, will be considered properly paired and the mapping score will be increased. If one read maps and the other does not map nearby, a “rescue” operation is performed to look for a good mapping location by brute force. However, some library types (mainly long-mate-pair libraries) do not have reads in the “innie” orientation. In those cases, it’s best to set the flag “requirecorrectstrand=f” (rcs=f) in order to get correct pairing statistics. That simply drops the innie orientation requirement for pairing.

0
Entering edit mode

Thanks for taking the time to respond to my question and confirming my expectations about mapping of paired end and mate paired reads.

I tried your suggestion to use bbmerge.sh, working with a test set of 24 read pairs, and as one would expect from the magicblast data posted originally, >80% of the reads were merged. (I also double-checked one read pair with BLASTN, which confirmed the magicblast mapping.)

At this point, the evidence seems to support my initial hypothesis that all four libraries were made with very small fragments, but I'll try the "Target fragment size versus final insert size" instructions that you pointed me to when I have some more compute power available.