Question

How to extract reads that map with BWA exactly twice

1

Entering edit mode

3.8 years ago

thomashartwig80 ▴ 10

Hi all. I am using BWA mem to map reads from a ChiP experiment which comes from an F1. To avoid mapping bias we are mapping them to a diploid genome consisting of both parent genomes. To get unique reads we filter them by q20 but we also want to recover all the multi-map reads which map twice, once to the father and once to the mother genome. We do not want to lose all those positions which are TF binding site, there is just no variation at the site. Since we are doing this for many F1s losing all those non-variant sites would give us very biased results.

I know MAPQ=0 is ambiguous reads but is there a way to distinguish reads that map twice and reads that map 3 or more times with equal score?

Your help would be appreciated.

ChIP-Seq • 1.9k views

ADD COMMENT • link updated 3.7 years ago by wulj2 ▴ 50 • written 3.8 years ago by thomashartwig80 ▴ 10

0

Entering edit mode

we also want to recover all the multi-map reads which map twice, once to the father and once to the mother genome.

Thinking out aloud rather than offering a concrete solution.

I am not sure which aligner you are using and how it handles multi-mappers but this is a difficult task for any aligner (having a diploid genome present in reference). Have you looked at the results to see what is happening? Which aligner are you using? It may be better to align to just one haploid parent version and then look for number of alternate alleles present at sites you know.

ADD REPLY • link 3.8 years ago by GenoMax 141k

0

Entering edit mode

Hi. We are using BWA-mem (now BWA-mem2). Using a diploid genome was the only way to truly eliminate mapping bias. In plants, we have a lot of structure variation so personal genome approaches left to many artefacts.

ADD REPLY • link 3.8 years ago by thomashartwig80 ▴ 10

0

Entering edit mode

But are you able to see

multi-map reads which map twice, once to the father and once to the mother genome.

in the alignments? I assume your reference genome contains something line chr1_parent1 and chr1_parent2 type names. If you were to extracts a set of aligned reads do you see them aligning to both genomes (if you ignore the twice requirement).

ADD REPLY • link 3.8 years ago by GenoMax 141k

0

Entering edit mode

Yes, chromosomes have the names of their parents to distinguish. We use a chain file from WGA to connect which parts of each chromosome belong to each other. Currently, with BWA mem, reads are assigned by random to each position if the mapping score is equal which in this case would be 0. At the moment, we keep reads with a Mapping quality of >20 (>99% unique) and 0 to keep ambiguous reads. So we merge the bam from q<20 and q=0. But instead of keeping all multimap reads, we would like to keep the once that map equally to parent 1 and parent 2 chromosomes, but not reads that map more than twice with an equal score.

ADD REPLY • link 3.8 years ago by thomashartwig80 ▴ 10

1

Entering edit mode

Currently, with BWA mem, reads are assigned by random to each position if the mapping score is equal which in this case would be 0.

That is exactly what I was referring to in my initial comment.

we would like to keep the once that map equally to parent 1 and parent 2 chromosomes, but not reads that map more than twice with an equal score.

That is almost certainly going to require some kind of post processing. I would suggest that you try bbmap.sh with ambig=all option which would align the reads to all possible locations. I don't think bwa has that kind of an option. Then you will need to process the bam file to find reads that fit your requirement, which will need custom code.

ADD REPLY • link 3.8 years ago by GenoMax 141k

score 0 · Answer 1 · 2020-08-03

0

Entering edit mode

3.7 years ago

colindaven 6.4k

Not really an answer (bowtie2 is not as effective for SNV mapping as bwa-mem in comparisons colleagues of mine have done in the past), but in the interests of completeness bowtie2 offers more options for this.

 Reporting:
  (default)          look for multiple alignments, report best, with MAPQ
   OR
  -k <int>           report up to <int> alns per read; MAPQ not meaningful
   OR
  -a/--all           report all alignments; very slow, MAPQ not meaningful

ADD COMMENT • link 3.7 years ago by colindaven 6.4k

0

Entering edit mode

See my comment for @wljiinsky..

ADD REPLY • link 3.7 years ago by GenoMax 141k

score 0 · Answer 2 · 2020-08-03

0

Entering edit mode

3.7 years ago

wulj2 ▴ 50

It seems that bwa mem options -h will help you out

which means if there are < INT hits with score >80% of the max score, output all in XA [5,200] will help you out

INT is the value you passed to bwa mem

my bwa version is "0.7.17-r1188"

ADD COMMENT • link 3.7 years ago by wulj2 ▴ 50

1

Entering edit mode

Issue is there are two copies of almost identical genomes. If a read multi-maps within a single genome it will map to the other one equally well. Options you mention can't control where a read maps. One time across two genomes or twice on the same genome. As I said before to get what original poster wants will require post-processing of the alignments.

ADD REPLY • link 3.7 years ago by GenoMax 141k

0

Entering edit mode

In my opinion, XA tag will tell you all the possible positions across all genome contigs in your reference index feed to bwa mem, maybe I am wrong about this?

ADD REPLY • link 3.7 years ago by wulj2 ▴ 50