Question: How to extract unique mapped results from Bowtie2 bam results?
gravatar for zhwjmch
5.2 years ago by
United States
zhwjmch90 wrote:

How to extract unique mapped results from Bowtie2 bam results?

I used samtools view -bq 1 WG.bam > unique.bam

However, my results contain 54792 lines, why it is not 42097?


After I have the subset of those reads, how can I extract them from sam or bam file to create a new fastq format file? Can anybody offer some code to do so?


"samopen] SAM header is present: 84 sequences.

60291 reads; of these:

  60291 (100.00%) were unpaired; of these:

    2788 (4.62%) aligned 0 times

    42097 (69.82%) aligned exactly 1 time

    15406 (25.55%) aligned >1 times

95.38% overall alignment rate"


SRR029237.10    16    4    21793833    42    242M    *    0    0    TTCCCAGTGTCTAGAATGTGGCATGCCCACAACAAATTCTAGCTGAAGGAATCAGCAAGGAGATGTTATGGAGCTCTACCAAAATACTAACCCAGAACTTGAGACATGGTCAGTCATGAGAATTTCCACTACGCTCTGCTTCTGTGATGTTAATTTTTATATTAACATATAAAATAATGGCATATATAGATTTTGAAGTGTGTGCTAATGGCATAAAATTGCCCTCATAAATAAATGAAGTC    =C1BF9<<=8==<4<C:6:7@::==-@D;<5>7,?D9A;5<<==:=C:B6?<<:;<=C=C<<;;;<B;;=C=8<<<<:;C(6CG<;<;9A,@D9<;C<=D<==;=;9<C9<<===<<<;9;B/AE=D==<8;;:<=<;=<C;:=<<=6<;A=C&09DH<:=<C=C:<;7<(6CG84==<C==9<<=<==;(6CG;=D=<<==<<<=;B:6><6:'5BF2;5)>B:<;=.AE=-@E;;;B=;=    AS:i:-8    XN:i:0    XM:i:2    XO:i:0    XG:i:0    NM:i:2    MD:Z:23G108A109    YT:Z:UU

SRR029237.11    0    17    62886136    1    37M1I159M    *    0    0    TCTCTAGATCNCAAAATCCCTTTTAGAATCCAACTCTGGGGTGGCACCAAGATCAGCAGAACCTCCATTTCCTCCTCTCTTTTCCCAAACCTTATTATGAAAGCCCCACATGGAACCATGTCAGGGCTGCAAGTGAAGCCATTCAACCTTTTTCCCCCCATCAAAAAAATTGGAGAACTATAATGTGCATAAAGTGC    1<;=8;==76!8FB4&<B>)FB4&58=48@8C=8<;=EA3#9C=8=B=B<<<38;<<7<C=C=<?6;FB2C=<C=;;4;GC7+FB0EA.B;C=.B::3<A='8EA3#5=<:@7@8C=:8=:;;FA/=8;<>5=<7B:1D=<D=6>5C=GC8.%GC91)"92=GB92,'"@9B<7<@;:<:9C<;8=;13;EA.<3=<    AS:i:-9    XS:i:-9    XN:i:0    XM:i:1    XO:i:1    XG:i:1    NM:i:2    MD:Z:10A185    YT:Z:UU

alignment • 25k views
ADD COMMENTlink modified 3.2 years ago by merodev140 • written 5.2 years ago by zhwjmch90

according to the documentation of samtools -q is for skipping alignments with MAPQ smaller than INT which in your case 1

ADD REPLYlink written 5.2 years ago by mostafa.shokrof0

Thank you, but the results for the following command is 54792 lines too.

samtools view -q 1 WG.bam > unique2.sam

ADD REPLYlink written 5.2 years ago by zhwjmch90

what command line have you used?

ADD REPLYlink written 5.2 years ago by Varun Gupta1.1k

bowtie2 --very-sensitive -x hg19.fa -U

ADD REPLYlink written 5.2 years ago by zhwjmch90
gravatar for Devon Ryan
5.2 years ago by
Devon Ryan91k
Freiburg, Germany
Devon Ryan91k wrote:

Bowtie2 will give an alignment a MAPQ score of 0 or 1 if it can map equally well to more than one location. Further, there's not always a perfect correspondence between the MAPQ of a read and the summary metrics it prints at the end (I'd need to go through the code again to determine how it arrives at the printed summary metrics, that's not documented anywhere). Finally, you would be well served to completely abandon the concept of "uniquely mapped". It is never useful and is always misleading, since you're simply lying to by labeling something unique. You're better served by simply filtering on a meaningful MAPQ (5 or 10 are often reasonable choices), which has the benefit of actually doing what you want, namely filtering according the the likelihood that an alignment is correct.

ADD COMMENTlink written 5.2 years ago by Devon Ryan91k

Thank you for your reply. So the following filter criterion should be more sensible, right? Any suggestions on how to extract those reads? 

$ less sort.sam |perl -lane 'print $_, if $F[4] >10'|wc

  46470  888752 26242525

$ less sort.sam |perl -lane 'print $_, if $F[4] > 5'|wc

  49667  952267 27997939

ADD REPLYlink written 5.2 years ago by zhwjmch90

It's easiest to just use samtools: samtools view -b -q 10 foo.bam > foo.filtered.bam, or you could use -q 5 or some other threshold instead.

ADD REPLYlink written 5.2 years ago by Devon Ryan91k

If I want to quantify repetitive elements, is it safe to do like this? I want to keep high reliable reads mapping to repetitive elements while not result in over-estimation like @per_ludvik.brattaas said. Thanks. 

ADD REPLYlink written 3.6 years ago by AlicePsyche30

I don't recall what MAPQ bowtie2 will given to secondary alignments, but once can always filter them out with -F 256 and samtools. In point of fact bowtie2 defaults to only outputting one alignment/pair per read/pair, so the over-estimation concern won't happen.

I should note that if you really want to quantify repeats then you should use very different methods. Namely, you should allow multimappers and then count them as one only if they all (or some threshold fraction of them) align to a particular repeat class/family/etc. I used this method with countRepeats, though I'm not sure it's in a usable form at the moment.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Devon Ryan91k

I see. In my lab, we just filter XS to avoid multiple mapping and didn't care MAPQ value much...I would do more research on repeats mapping. Thanks for your help.

ADD REPLYlink written 3.6 years ago by AlicePsyche30

I see how this is in general a good strategy. However, when quantifying reads to repeated sequences, such as transposons, wont this cause a drastic over-estimation?

If you don't extract primary or unique reads from the sam/bam, the reads that maps equally well to multiple sequences will cause a serious bias in quantification of repeated elements. I would say the best strategy here is to

1) Rank the alignments of the same read on alignment score

2) Choose only the best (primary read)

3) If the top scoring alignments have equal alignment score, choose one randomly.

Is this possible in bowtie2? It is in STAR.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by TEman10

That's what bowtie2 does by default. If you want primary alignments only then just use the appropriate samtools flag.

ADD REPLYlink written 3.6 years ago by Devon Ryan91k

I don't know why I wrote that 3rd step. It is of course default behaviour in bowtie2.

What I meant was to only keep an alignment if it is better than all other alignments of that read. That is, if the top alignment has the same score as the second, it should be discarded as it is "impossible" to say which alignment is correct. However, if the first alignment has a perfect alignment and the second has e.g. 4 mismatches, it would be reasonable to keep the first ("primary") read.

I guess an option is to look whether AS > XS.

ADD REPLYlink written 3.5 years ago by TEman10

It's simpler to just filter by MAPQ, since what you care about is the likelihood that the alignment is wrong. Reasonable thresholds tend to be between 5 and 10 for most applications.

ADD REPLYlink written 3.5 years ago by Devon Ryan91k
gravatar for merodev
3.2 years ago by
United States
merodev140 wrote:

If you want only the uniquely mapped reads from bowtie2 sam files then you can proceed as follows:

grep -E "@|NM:" bt2output.sam | grep -v "XS:" > uniq_bt2output.sam

This first checks to see if there is alignment and then removes reads with secondary alignments. It will keep the headers intact. Hope this helps.

ADD COMMENTlink written 3.2 years ago by merodev140
gravatar for poisonAlien
5.2 years ago by
poisonAlien2.8k wrote:

Looking at Bowtie2 manual:


XS:i:  Alignment score for the best-scoring alignment found other than the alignment

reported. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end

mode). Only present if the SAM record is for an aligned read and more than one alignment was

found for the read. Note that, when the read is part of a concordantly-aligned pair, this

score could be greater than AS:i.

I think Bowtie2 adds XS:i tag if more than one valid alignments are found. You can remove lines with XS:i tag from your SAM file maybe ?

ADD COMMENTlink modified 5.2 years ago • written 5.2 years ago by poisonAlien2.8k

I wouldn't recommend that. Just because a read has a valid second-best alignment doesn't mean it's unreliable, which is really what people should be filtering by. For example, if the best alignment has no mismatches and the second best has 6, then the alignment is quite reliable but would be filtered out. The proper way to do things is to filter by MAPQ (the concept of a "unique alignment" is absurd to begin with).

ADD REPLYlink written 5.2 years ago by Devon Ryan91k

What do you mean by "not unreliable" and "proper way to do things"? Isn't it dependent on the use-case? Couldn't there be a biological scenario where we want to find reads that aligned only once?

ADD REPLYlink modified 20 months ago • written 20 months ago by chefarov120
gravatar for mostafa.shokrof
5.2 years ago by
Egypt/Cairo/Nile University
mostafa.shokrof0 wrote:

you can use samtools to sort with the read names and print the lines which have unique read names , the first 12 charcters,

NB: uniq command suppose that the input is sorted so i used sort first

samtools sort  -no <ur bam > | uniq -uw 12

ADD COMMENTlink written 5.2 years ago by mostafa.shokrof0

That won't work. Bowtie2, like many aligners, can and will print only one of the many possible alignments to the SAM file. The correct procedure is to simply filter by MAPQ score and call it done.

ADD REPLYlink written 5.2 years ago by Devon Ryan91k

Thanks for your reply, but the results are not correct, I guess it is because some reads are not 12 characters

$ less sort.sam|uniq -uw 12|wc

  55383 1066786 30950672


SRR029237.22234    16    9    13629074    42    7M1I22M1I194M    *    0    0    TCTTTTCTAGACAAGCAAATGCTGAGGGAAGTTCATTACCAGTATACCTGCCTTACAAGAGCTCCTGAAAGAAGGACTAAATATGGAAAGAAAAGACCGTTATCATTCACTCCAAAAGAGATGTAAGTACAGAGACCAATGACGCTATAAAATAGCCACACAAACAAGTCAGCATAATAAACAGTTAACAACATGATGACAGGATCAAATCCACAAATATCAATA    3:+7CG:51;1;=C;=0BF<8;:512BF<B&<C9==C:<C;=8<<8;B:93<<C87;C;6855=C::0BF<<A<B===2AE<4<7@#;@=(6CG<77@<:B<:=9:B9=;<<C$4AE<8=:1<:6>==;=;===<=C<A=;=<=<9<;+7CG<<;:B==<<-@D<<C<<<9<=<=:B<-@D<=<<C<C<;B<<==<;=<===D=<<0BF<=C;</BF;;==;B8AS:i:-16    XN:i:0    XM:i:0    XO:i:2    XG:i:2    NM:i:2    MD:Z:223    YT:Z:UU


ADD REPLYlink written 5.2 years ago by zhwjmch90
gravatar for zhan2142
4.6 years ago by
United States
zhan214210 wrote:

samtools view -q 255 WG.bam

will give you the uniquely mapped reads.  At least it worked for me just a few minutes ago.

It is a little bit complicated to explain why Bowtie2 uses 255 in the mapping quality score to indicate uniquely mapped reads, but here is a very good blog that discusses the MAPQ score in bowtie2:

My simple (but may not be correct) reasoning is,

1) MAPQ=-10logP, P=the probability of the mapping is wrong

2) 255 is to be used for mapped reads where the MAPQ is not available (SAM specification)

For a perfectly uniquely mapped read, the probability of getting wrong is 0, and we know log0 doesn't exit, so 255 is used here to indicate the perfect match.

Given this reasoning, if the reads could map to two positions, then the probability of the alignment is wrong is 0.5, so MAPQ = -10*log(0.5, base=10) ~ 3. 

However, I might miss some "good" mapping by filtering on -q 255, as explained by the blog post I referred to previously.  Please check that blog carefully.


ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by zhan214210

Bowtie2 does not assign a MAPQ of 255 to "uniquely" aligned reads. You might be thinking of older versions of tophat2.

ADD REPLYlink written 4.6 years ago by Devon Ryan91k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 843 users visited in the last hour