Question: How to extract reads that map with BWA exactly twice
1
gravatar for thomashartwig80
4 weeks ago by
thomashartwig8010 wrote:

Hi all. I am using BWA mem to map reads from a ChiP experiment which comes from an F1. To avoid mapping bias we are mapping them to a diploid genome consisting of both parent genomes. To get unique reads we filter them by q20 but we also want to recover all the multi-map reads which map twice, once to the father and once to the mother genome. We do not want to lose all those positions which are TF binding site, there is just no variation at the site. Since we are doing this for many F1s losing all those non-variant sites would give us very biased results.

I know MAPQ=0 is ambiguous reads but is there a way to distinguish reads that map twice and reads that map 3 or more times with equal score?

Your help would be appreciated.

chip-seq • 172 views
ADD COMMENTlink modified 7 days ago by wljlinksly20 • written 4 weeks ago by thomashartwig8010

we also want to recover all the multi-map reads which map twice, once to the father and once to the mother genome.

Thinking out aloud rather than offering a concrete solution.

I am not sure which aligner you are using and how it handles multi-mappers but this is a difficult task for any aligner (having a diploid genome present in reference). Have you looked at the results to see what is happening? Which aligner are you using? It may be better to align to just one haploid parent version and then look for number of alternate alleles present at sites you know.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by genomax87k

Hi. We are using BWA-mem (now BWA-mem2). Using a diploid genome was the only way to truly eliminate mapping bias. In plants, we have a lot of structure variation so personal genome approaches left to many artefacts.

ADD REPLYlink written 4 weeks ago by thomashartwig8010

But are you able to see

multi-map reads which map twice, once to the father and once to the mother genome.

in the alignments? I assume your reference genome contains something line chr1_parent1 and chr1_parent2 type names. If you were to extracts a set of aligned reads do you see them aligning to both genomes (if you ignore the twice requirement).

ADD REPLYlink written 4 weeks ago by genomax87k

Yes, chromosomes have the names of their parents to distinguish. We use a chain file from WGA to connect which parts of each chromosome belong to each other. Currently, with BWA mem, reads are assigned by random to each position if the mapping score is equal which in this case would be 0. At the moment, we keep reads with a Mapping quality of >20 (>99% unique) and 0 to keep ambiguous reads. So we merge the bam from q<20 and q=0. But instead of keeping all multimap reads, we would like to keep the once that map equally to parent 1 and parent 2 chromosomes, but not reads that map more than twice with an equal score.

ADD REPLYlink modified 29 days ago • written 29 days ago by thomashartwig8010
1

Currently, with BWA mem, reads are assigned by random to each position if the mapping score is equal which in this case would be 0.

That is exactly what I was referring to in my initial comment.

we would like to keep the once that map equally to parent 1 and parent 2 chromosomes, but not reads that map more than twice with an equal score.

That is almost certainly going to require some kind of post processing. I would suggest that you try bbmap.sh with ambig=all option which would align the reads to all possible locations. I don't think bwa has that kind of an option. Then you will need to process the bam file to find reads that fit your requirement, which will need custom code.

ADD REPLYlink written 28 days ago by genomax87k
0
gravatar for colindaven
7 days ago by
colindaven2.3k
Hannover Medical School
colindaven2.3k wrote:

Not really an answer (bowtie2 is not as effective for SNV mapping as bwa-mem in comparisons colleagues of mine have done in the past), but in the interests of completeness bowtie2 offers more options for this.

 Reporting:
  (default)          look for multiple alignments, report best, with MAPQ
   OR
  -k <int>           report up to <int> alns per read; MAPQ not meaningful
   OR
  -a/--all           report all alignments; very slow, MAPQ not meaningful
ADD COMMENTlink written 7 days ago by colindaven2.3k

See my comment for @wljiinsky..

ADD REPLYlink written 7 days ago by genomax87k
0
gravatar for wljlinksly
7 days ago by
wljlinksly20
wljlinksly20 wrote:

It seems that bwa mem options -h will help you out

which means if there are < INT hits with score >80% of the max score, output all in XA [5,200] will help you out

INT is the value you passed to bwa mem

my bwa version is "0.7.17-r1188"

ADD COMMENTlink modified 7 days ago • written 7 days ago by wljlinksly20
1

Issue is there are two copies of almost identical genomes. If a read multi-maps within a single genome it will map to the other one equally well. Options you mention can't control where a read maps. One time across two genomes or twice on the same genome. As I said before to get what original poster wants will require post-processing of the alignments.

ADD REPLYlink modified 7 days ago • written 7 days ago by genomax87k

In my opinion, XA tag will tell you all the possible positions across all genome contigs in your reference index feed to bwa mem, maybe I am wrong about this?

ADD REPLYlink written 6 days ago by wljlinksly20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1555 users visited in the last hour