Question: Bowtie 2 - is there a way to discard reads mapping to multiple locations?
3
gravatar for mccoykg
4.0 years ago by
mccoykg30
United States
mccoykg30 wrote:

Hi all. I'm trying to switch from bowtie to bowtie2 right now, because bowtie2 has some neat options. However, when reading over the manual, I noticed it seems to be missing the option to discard reads mapping to multiple locations - which was -m in bowtie. Does anyone know if there's a way to do this that I'm missing?

Thanks,

K

mapping bowtie2 • 7.7k views
ADD COMMENTlink modified 4.0 years ago by Brian Bushnell16k • written 4.0 years ago by mccoykg30

Not sure about bowtie2...  but you can do that with BBMap, by using the "outm" flag instead of "out".  For example...

bbmap.sh ref=ref.fa in=reads.fq outm=mapped.sam outu=unmapped.sam out=both.sam

You can specify any or all of outm, outu, and/or out depending on which reads you want to collect, and output them in other formats like fastq if you want, instead of sam.

ADD REPLYlink written 4.0 years ago by Brian Bushnell16k

I have also encountered that problem switching from Bowtie to Bowtie2. In my case, I would like to keep reads which fall under a certain -m parameter. I have not yet worked out a way to do this.

ADD REPLYlink written 4.0 years ago by jotan1.2k
8
gravatar for Ian
4.0 years ago by
Ian5.5k
University of Manchester, UK
Ian5.5k wrote:

I had exactly the same problem when I moved from bowtie to bowtie2.  For me the reason was because our facility started to routinely output 101bp paired-end sequence.  I will detail below how I replicated the -m1 flag setting in bowtie2, but the overall conclusion is that I no longer use this method and instead allow mapping of reads to multiple locations as long as the read pairs are concordant and high quality.

1) Replicating -m1 in bowtie2 [ not recommended ]
This is mostly based upon Devon Ryan's answer to

C: How To Extract Reads-Pairs Aligned Concordantly Exactly 1 Time?

also here is another great discussion:
How to extract unique mapped results from Bowtie2 bam results?

You can run bowtie2 with default settings, but employ '-k 2', which will report up to two mapped location per read/pair.

The resulting SAM file can then be filtered using the XS:i flag, which indicates the second best mapping location, i.e. it identifies non-uniquely mapping reads.  Below is some dummy code to illustrate:

# remove non-primary mapping reads (-F 256) and retain properly paired, i.e. concordant read pairs (-f 2)
samtools view -Sh -f 2 -F 256 sample.sam | grep -v "XS:i:" > sample_PP_filterXS.sam

# see below for check_pairs.py code, remove orphaned reads when one is deleted due to multi-mapping
./check_pairs.py < sample_PP_filterXS.sam > sample_PP_filterXS_rm-orphan.sam

# SAM to BAM
samtools view -bS -o sample_PP_filterXS_rm-orphan.bam sample_PP_filterXS_rm-orphan.sam

# sort BAM ---> index and flagstat as required
samtools sort sample_PP_filterXS_rm-orphan.bam sample_PP_filterXS_rm-orphan_sorted

check_pairs.py

#!/usr/bin/env python
import csv
import sys

f = csv.reader(sys.stdin, dialect="excel-tab")
of = csv.writer(sys.stdout, dialect="excel-tab")
last_read = None
for line in f :
    #take care of the header
    if(line[0][0] == "@") :
        of.writerow(line)
        continue

    if(last_read == None) :
        last_read = line
    else :
        if(last_read[0] == line[0]) :
            of.writerow(last_read)
            of.writerow(line)
            last_read = None
        else :
            last_read = line

2) Learning to love multi-mapped reads [ recommended - "by me" ]
To cut a long story short after various discussions with colleagues and the reading of an illuminating blog post by Heng Li (http://lh3lh3.users.sourceforge.net/mapuniq.shtml) it was decided we should stop worrying about multi-mapping reads from bowtie2, as long as they belonged to proper-pairs (a.k.a concordant) and are of good quality.

Again here is some dummy code:

# SAM to BAM
samtools view -bS sample.sam > sample.bam

# retain properly paired, i.e. concordant read pairs (-f 2) and high quality reads (-q30)
samtools view -f2 -q30 -bS sample.sam > sample_PP_Q30.bam

# sort BAM ---> index and flagstat as required
samtools sort sample_PP_Q30.bam sample_PP_Q30_sorted


Conclusion
Although it feels safer to only use uniquely mapping reads when using paired-end reads it is OK to allow multi-mapping as long as the pairs are checked to there relative distance, orientation and quality.  At least in examples I have seen mapping "looks" better in ChIP-seq analysis, where "real" peaks remain, but "odd" looking peaks disappear over repetitive regions. 

For short single end reads I would still use bowtie1 '-m1' :)

I hope this helps as I agonised over this for a long time.

 

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Ian5.5k
1

When you filter for -q 30 aren't you automatically removing multi mapped reads? For example bwa will set the quality to 0 for reads that map to multiple locations. I am not sure what bowtie2 does but if reports these has high mapping quality then their MAPQ is not right.

ADD REPLYlink written 4.0 years ago by Istvan Albert ♦♦ 81k

Bowtie2 will never report a multimapper* as having a MAPQ >1, so -q 2 would suffice.

*For the most sane definition of "multimapper".

ADD REPLYlink written 4.0 years ago by Devon Ryan92k

Thanks for the comment.  I am increasingly think I am talking nonsense, but isn't the '-q' setting in 'samtools view' filtering on the Phred score of the quality of each read? I.e. choosing a higher phred score cut off is good?

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Ian5.5k
1

-q selects a MAPQ score, not a phred score, which describes base quality. Filtering should typically not be very aggressive, since selecting a high value ends up excluding a lot of true positive alignments (I've yet to see a circumstance where a threshold over 10 was worthwhile).

Anyway, as Istvan mentioned, documentation and standards aren't exactly a forte of the field :(

ADD REPLYlink written 4.0 years ago by Devon Ryan92k
1

Well just to add to the confusion the MAPQ is on Phred scale and is decoded into a Phred type of probability but it is not actually a Phred score.

And what is the MAPQ? The definition of it is very generic - every aligner makes up its own rules of what it puts there. So nowadays the MAPQ  is a Phred encoded something that is used to filter for a property unrelated to the original intent of what the MAPQ should be.  

ADD REPLYlink written 4.0 years ago by Istvan Albert ♦♦ 81k

You are apparently right.  I feel rather ashamed that I had not realised this.  Thank you for the correction!

ADD REPLYlink written 4.0 years ago by Ian5.5k

no need to feel shame - the field of bioinformatics is choke full of tacit assumptions and undocumented features.

This is one of the greatest weaknesses of the field itself, there is no real standard to uphold and refer to. The SAM format does very-very little - not to mention how clumsy it is. The whole MAPQ field is an arbitrary value, basically just a number that anyone is free to interpret and fill in however they wish.

ADD REPLYlink written 4.0 years ago by Istvan Albert ♦♦ 81k

Thanks much! This was well explained. My lab works with short single end reads (around 15~18 bp) and we can only use reads that map uniquely to keep our fitness calculations accurate, so I suppose we'll have to stick with Bowtie. I can see how multi-mapped reads wouldn't be an issue for most data now though.

ADD REPLYlink written 4.0 years ago by mccoykg30
0
gravatar for Brian Bushnell
4.0 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

If speed is a problem, you can use BBMap like this:

bbmap.sh in=reads.fq out=mapped.sam ref=ref.fa perfectmode ambig=toss

That will only retain reads that map perfectly and unambiguously to a specific location.  It should be faster than Bowtie, though I'd be interested to hear if it isn't.

ADD COMMENTlink written 4.0 years ago by Brian Bushnell16k

what is the difference with the outm flag ?

ADD REPLYlink written 3.0 years ago by Picasa470
1

"out" gets all reads. "outm" only gets mapped reads. But with "ambig=toss", reads mapping to multiple locations will be classified as unmapped, so they will not go to outm.

"perfectmode" only reads that map both perfectly (with no mismatches), imperfectly-mapped reads will be classified as unmapped, so they also won't go to outm. That mode is very fast.

Note that in all cases, if reads are paired and at least one read is mapped, the other will still go to outm even if it is unmapped, to keep pairs together.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Brian Bushnell16k

I'm tempted to try BBmap. Do you have any benchmark that compare BBmaè with the classical mappers ? bwa, bowtie2 etc.

ADD REPLYlink written 3.0 years ago by Picasa470
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1244 users visited in the last hour