Bowtie 2 - is there a way to discard reads mapping to multiple locations?
2
5
Entering edit mode
6.6 years ago
mccoykg ▴ 50

Hi all. I'm trying to switch from bowtie to bowtie2 right now, because bowtie2 has some neat options. However, when reading over the manual, I noticed it seems to be missing the option to discard reads mapping to multiple locations - which was -m in bowtie. Does anyone know if there's a way to do this that I'm missing?

Thanks,

K

bowtie2 mapping • 12k views
0
Entering edit mode

Not sure about bowtie2... but you can do that with BBMap, by using the "outm" flag instead of "out". For example...

bbmap.sh ref=ref.fa in=reads.fq outm=mapped.sam outu=unmapped.sam out=both.sam


You can specify any or all of outm, outu, and/or out depending on which reads you want to collect, and output them in other formats like fastq if you want, instead of sam.

0
Entering edit mode

I have also encountered that problem switching from Bowtie to Bowtie2. In my case, I would like to keep reads which fall under a certain -m parameter. I have not yet worked out a way to do this.

9
Entering edit mode
6.6 years ago
Ian 5.8k

I had exactly the same problem when I moved from bowtie to bowtie2. For me the reason was because our facility started to routinely output 101bp paired-end sequence. I will detail below how I replicated the -m1 flag setting in bowtie2, but the overall conclusion is that I no longer use this method and instead allow mapping of reads to multiple locations as long as the read pairs are concordant and high quality.

1) Replicating -m1 in bowtie2 [ not recommended ] This is mostly based upon Devon Ryan's answer to C: How To Extract Reads-Pairs Aligned Concordantly Exactly 1 Time?

also here is another great discussion: How to extract unique mapped results from Bowtie2 bam results?

You can run bowtie2 with default settings, but employ '-k 2', which will report up to two mapped location per read/pair.

The resulting SAM file can then be filtered using the XS:i flag, which indicates the second best mapping location, i.e. it identifies non-uniquely mapping reads. Below is some dummy code to illustrate:

# remove non-primary mapping reads (-F 256) and retain properly paired, i.e. concordant read pairs (-f 2)
samtools view -Sh -f 2 -F 256 sample.sam | grep -v "XS:i:" > sample_PP_filterXS.sam

# see below for check_pairs.py code, remove orphaned reads when one is deleted due to multi-mapping
./check_pairs.py < sample_PP_filterXS.sam > sample_PP_filterXS_rm-orphan.sam

# SAM to BAM
samtools view -bS -o sample_PP_filterXS_rm-orphan.bam sample_PP_filterXS_rm-orphan.sam

# sort BAM ---> index and flagstat as required
samtools sort sample_PP_filterXS_rm-orphan.bam sample_PP_filterXS_rm-orphan_sorted


### check_pairs.py

#!/usr/bin/env python
import csv
import sys

f = csv.reader(sys.stdin, dialect="excel-tab")
of = csv.writer(sys.stdout, dialect="excel-tab")
for line in f :
#take care of the header
if(line[0][0] == "@") :
of.writerow(line)
continue

if(last_read == None) :
else :
if(last_read[0] == line[0]) :
of.writerow(line)
else :


2) Learning to love multi-mapped reads [ recommended - "by me" ]

To cut a long story short after various discussions with colleagues and the reading of an illuminating blog post by Heng Li (http://lh3lh3.users.sourceforge.net/mapuniq.shtml) it was decided we should stop worrying about multi-mapping reads from bowtie2, as long as they belonged to proper-pairs (a.k.a concordant) and are of good quality.

Again here is some dummy code:

# SAM to BAM
samtools view -bS sample.sam > sample.bam

# retain properly paired, i.e. concordant read pairs (-f 2) and high quality reads (-q30)
samtools view -f2 -q30 -bS sample.sam > sample_PP_Q30.bam

# sort BAM ---> index and flagstat as required
samtools sort sample_PP_Q30.bam sample_PP_Q30_sorted


Conclusion

Although it feels safer to only use uniquely mapping reads when using paired-end reads it is OK to allow multi-mapping as long as the pairs are checked to there relative distance, orientation and quality. At least in examples I have seen mapping "looks" better in ChIP-seq analysis, where "real" peaks remain, but "odd" looking peaks disappear over repetitive regions.

For short single end reads I would still use bowtie1 '-m1' :)

I hope this helps as I agonised over this for a long time.

1
Entering edit mode

When you filter for -q 30 aren't you automatically removing multi mapped reads? For example bwa will set the quality to 0 for reads that map to multiple locations. I am not sure what bowtie2 does but if reports these has high mapping quality then their MAPQ is not right.

0
Entering edit mode

Bowtie2 will never report a multimapper* as having a MAPQ >1, so -q 2 would suffice.

*For the most sane definition of "multimapper".

0
Entering edit mode

Thanks for the comment. I am increasingly think I am talking nonsense, but isn't the -q setting in samtools view filtering on the Phred score of the quality of each read? I.e. choosing a higher phred score cut off is good?

1
Entering edit mode

-q selects a MAPQ score, not a phred score, which describes base quality. Filtering should typically not be very aggressive, since selecting a high value ends up excluding a lot of true positive alignments (I've yet to see a circumstance where a threshold over 10 was worthwhile).

Anyway, as Istvan mentioned, documentation and standards aren't exactly a forte of the field :(

1
Entering edit mode

Well just to add to the confusion the MAPQ is on Phred scale and is decoded into a Phred type of probability but it is not actually a Phred score.

And what is the MAPQ? The definition of it is very generic - every aligner makes up its own rules of what it puts there. So nowadays the MAPQ is a Phred encoded something that is used to filter for a property unrelated to the original intent of what the MAPQ should be.

0
Entering edit mode

You are apparently right. I feel rather ashamed that I had not realised this. Thank you for the correction!

0
Entering edit mode

no need to feel shame - the field of bioinformatics is choke full of tacit assumptions and undocumented features.

This is one of the greatest weaknesses of the field itself, there is no real standard to uphold and refer to. The SAM format does very-very little - not to mention how clumsy it is. The whole MAPQ field is an arbitrary value, basically just a number that anyone is free to interpret and fill in however they wish.

0
Entering edit mode

Thanks much! This was well explained. My lab works with short single end reads (around 15~18 bp) and we can only use reads that map uniquely to keep our fitness calculations accurate, so I suppose we'll have to stick with Bowtie. I can see how multi-mapped reads wouldn't be an issue for most data now though.

0
Entering edit mode
6.6 years ago

If speed is a problem, you can use BBMap like this:

bbmap.sh in=reads.fq out=mapped.sam ref=ref.fa perfectmode ambig=toss


That will only retain reads that map perfectly and unambiguously to a specific location. It should be faster than Bowtie, though I'd be interested to hear if it isn't.

0
Entering edit mode

what is the difference with the outm flag ?

1
Entering edit mode

"out" gets all reads. "outm" only gets mapped reads. But with "ambig=toss", reads mapping to multiple locations will be classified as unmapped, so they will not go to outm.

"perfectmode" only reads that map both perfectly (with no mismatches), imperfectly-mapped reads will be classified as unmapped, so they also won't go to outm. That mode is very fast.

Note that in all cases, if reads are paired and at least one read is mapped, the other will still go to outm even if it is unmapped, to keep pairs together.

0
Entering edit mode

I'm tempted to try BBmap. Do you have any benchmark that compare BBmaè with the classical mappers ? bwa, bowtie2 etc.