paired end mapping with one end being unique and the other end multiple
1
0
Entering edit mode
9 weeks ago
schu • 0

Hi all,

This may be a question already and repeatedly asked before, but I could not find an answer relevant to my issue...

I would like to map human/mouse genome sequence data of paired end reads (150 bp x 2 etc.) allowing one end being "uniquely" mapped and the other end "multiply" mapped in order to increase mappability to repetitive regions.

Is this (one end uniquely and the other end multiply mapped) possible for common mapping softwares such as bowtie2, bwa etc. by properly setting some option parameters, or do I need to manually filter (by awk etc.) the output, for which I allow for multimapping reads of both ends?

paired-end genome ngs • 607 views
3
Entering edit mode

It is unclear what you are actually trying to achieve.

allowing one end being "uniquely" mapped and the other end "multiply" mapped in order to increase mappability to repetitive regions

As Istvan Albert already stated, this is already done by most aligners. Specifically, most aligners will preferentially place a multi-mapping read in the location that is consistent with location of the uniquely mapping mate based on the library fragment size distribution. Most popular aligners will even report a non-zero mapq for the multi-mapping mate (since the read pair, as a whole, is not multimapping).

I allow for multimapping reads of both ends

Aligners will still preferentially align the both reads in the read pair in a consistent location. If there's no one best location then they'll output a random location with mapq=0. The exception to this is if the sequence is highly abundant in the reference (e.g. alpha satellite repeats) as, for performance reasons, the aligners will have a limit on the number of candidate locations an alignment 'seed' can have. For bwa this is the -c parameter and defaults to 500.

in order to increase mappability to repetitive regions

That's not how alignment works. If your reference genome has 10 identical copies of a particular region (that is large relative to the fragment size/read length), there's nothing any aligner can do to increase the mappability because will be 5 regions that all have identically good alignments. By default, aligners will randomly choose one of the 5. You could ask the aligner to output all the candidate locations (either as secondary alignments or, for bwa, the XA tag using -h), but that doesn't actually change the mappability of any of the regions - it's still an ambiguously aligned read and there's nothing the aligner can do about it^.

to map human/mouse genome sequence data

Is this data (such as mouse xenograft data) where your input reads are a mixture of mouse and human cell that you're trying to align to a joint mouse+human reference? Due to high level of homology across the genomes, you'll get a high multi-mapping rate if you just aligned to a joint reference. This sort of data requires more sophisticated and there are specialised pipeline designed to deal with this.

As a general rule, try to look for and use existing software whenever possible. High quality software development time consuming and writing your own frequently leads to an inferior solution to a general problem. Considering writing your own when there are no options out there, or you are extremely familiar with the state-of-the-art and have concrete solutions to the limitations of the current software.

^ When the homologs aren't identical in the sample being sequenced, there are some sophisticated techniques that can be used for partially disambiguating them, but that's way more advanced than the question you are asking and is likely to require literally years to perfect.

0
Entering edit mode

Dear d-cameron and cmdcolin,

Most popular aligners will even report a non-zero mapq for the multi-mapping mate (since the read pair, as a whole, is not multimapping).

I confirmed this by looking at my alignments of WGS data in IGV focusing on junctions of repetitive elements and non-repetitive regions. Actually, "unique + multi" mate pairs at such junctions were labeled as "unique", while "multi + multi" mate pairs inside repetitive elements were mapped as "multi".

Is this data (such as mouse xenograft data) where your input reads are a mixture of mouse and human ...

No, I am analyzing human WGS vs mouse WGS but not a mixture of human and mouse samples (my colleague is using "Xenome" for this purpose). I am sorry for my ambiguous question.

5
Entering edit mode
9 weeks ago

Multiple mapping should not even require anything on your behalf, it should work like that by default. But collecting all the positions may be a bit more complicated and may require parsing/processing the BAM file with a custom script.

The alternative locations will be specified in the XA tags (if I recall that correctly) and you won't get a separate alignment for each position.

Here is where the choice of aligner may matter more, bowtie2 usually offers more formatting options, perhaps includes all alternative locations, bwa almost certainly provides only the XA tags.

2
Entering edit mode

allowing one end being "uniquely" mapped and the other end "multiply" mapped

This is the bit of requirement mentioned in original post that is unlikely to be supported by any aligner. OP will need to filter the results after the alignments to satisfy this requirement.

0
Entering edit mode

Dear Istvan Albert and GenoMax,

Thank you so much for your kind replies and comments. All right, I will now try to parse SAM/BAM of bowtie2 by using flags and tags etc. Thanks again.

2
Entering edit mode

bwa almost certainly provides only the XA tags

Be aware that the XA tag (and indeed all X, Y, Z* tags) has an implementation-defined meaning. It's used as a light-weight secondary alignment field by bwa but there is no requirement that any other caller either uses the field in this way, or even at all.