How to extract unique mapped results from Bowtie2 bam results?
5
14
Entering edit mode
8.1 years ago
zhwjmch ▴ 150

How to extract unique mapped results from Bowtie2 bam results?

I used samtools view -bq 1 WG.bam > unique.bam

However, my results contain 54792 lines, why it is not 42097?

After I have the subset of those reads, how can I extract them from sam or bam file to create a new fastq format file? Can anybody offer some code to do so?

[samopen] SAM header is present: 84 sequences.
60291 (100.00%) were unpaired; of these:
2788 (4.62%) aligned 0 times
42097 (69.82%) aligned exactly 1 time
15406 (25.55%) aligned >1 times
95.38% overall alignment rate

SRR029237.10    16    4    21793833    42    242M    *    0    0    TTCCCAGTGTCTAGAATGTGGCATGCCCACAACAAATTCTAGCTGAAGGAATCAGCAAGGAGATGTTATGGAGCTCTACCAAAATACTAACCCAGAACTTGAGACATGGTCAGTCATGAGAATTTCCACTACGCTCTGCTTCTGTGATGTTAATTTTTATATTAACATATAAAATAATGGCATATATAGATTTTGAAGTGTGTGCTAATGGCATAAAATTGCCCTCATAAATAAATGAAGTC    =C1BF9<<=8==<4<C:6:7@::==-@D;<5>7,?D9A;5<<==:=C:B6?<<:;<=C=C<<;;;<B;;=C=8<<<<:;C(6CG<;<;9A,@D9<;C<=D<==;=;9<C9<<===<<<;9;B/AE=D==<8;;:<=<;=<C;:=<<=6<;A=C&09DH<:=<C=C:<;7<(6CG84==<C==9<<=<==;(6CG;=D=<<==<<<=;B:6><6:'5BF2;5)>B:<;=.AE=-@E;;;B=;=    AS:i:-8    XN:i:0    XM:i:2    XO:i:0    XG:i:0    NM:i:2    MD:Z:23G108A109    YT:Z:UU
SRR029237.11    0    17    62886136    1    37M1I159M    *    0    0    TCTCTAGATCNCAAAATCCCTTTTAGAATCCAACTCTGGGGTGGCACCAAGATCAGCAGAACCTCCATTTCCTCCTCTCTTTTCCCAAACCTTATTATGAAAGCCCCACATGGAACCATGTCAGGGCTGCAAGTGAAGCCATTCAACCTTTTTCCCCCCATCAAAAAAATTGGAGAACTATAATGTGCATAAAGTGC    1<;=8;==76!8FB4&<B>)FB4&58=48@8C=8<;=EA3#9C=8=B=B<<<38;<<7<C=C=<?6;FB2C=<C=;;4;GC7+FB0EA.B;C=.B::3<A='8EA3#5=<:@7@8C=:8=:;;FA/=8;<>5=<7B:1D=<D=6>5C=GC8.%GC91)"92=GB92,'"@9B<7<@;:<:9C<;8=;13;EA.<3=<    AS:i:-9    XS:i:-9    XN:i:0    XM:i:1    XO:i:1    XG:i:1    NM:i:2    MD:Z:10A185    YT:Z:UU

alignment • 35k views
0
Entering edit mode

according to the documentation of samtools -q is for skipping alignments with MAPQ smaller than INT which in your case 1

0
Entering edit mode

Thank you, but the results for the following command is 54792 lines too.

samtools view -q 1 WG.bam > unique2.sam

0
Entering edit mode

what command line have you used?

0
Entering edit mode
bowtie2 --very-sensitive -x hg19.fa -U

23
Entering edit mode
8.1 years ago

Bowtie2 will give an alignment a MAPQ score of 0 or 1 if it can map equally well to more than one location. Further, there's not always a perfect correspondence between the MAPQ of a read and the summary metrics it prints at the end (I'd need to go through the code again to determine how it arrives at the printed summary metrics, that's not documented anywhere). Finally, you would be well served to completely abandon the concept of "uniquely mapped". It is never useful and is always misleading, since you're simply lying to by labeling something unique. You're better served by simply filtering on a meaningful MAPQ (5 or 10 are often reasonable choices), which has the benefit of actually doing what you want, namely filtering according the the likelihood that an alignment is correct.

0
Entering edit mode

Thank you for your reply. So the following filter criterion should be more sensible, right? Any suggestions on how to extract those reads?

$less sort.sam |perl -lane 'print$_, if $F[4] >10'|wc 46470 888752 26242525$ less sort.sam |perl -lane 'print $_, if$F[4] > 5'|wc
49667  952267 27997939

5
Entering edit mode

It's easiest to just use samtools:

samtools view -b -q 10 foo.bam > foo.filtered.bam


or you could use -q 5 or some other threshold instead.

0
Entering edit mode

If I want to quantify repetitive elements, is it safe to do like this? I want to keep high reliable reads mapping to repetitive elements while not result in over-estimation like @per_ludvik.brattaas said. Thanks.

0
Entering edit mode

I don't recall what MAPQ bowtie2 will given to secondary alignments, but once can always filter them out with -F 256 and samtools. In point of fact bowtie2 defaults to only outputting one alignment/pair per read/pair, so the over-estimation concern won't happen.

I should note that if you really want to quantify repeats then you should use very different methods. Namely, you should allow multimappers and then count them as one only if they all (or some threshold fraction of them) align to a particular repeat class/family/etc. I used this method with countRepeats, though I'm not sure it's in a usable form at the moment.

0
Entering edit mode

I see. In my lab, we just filter XS to avoid multiple mapping and didn't care MAPQ value much...I would do more research on repeats mapping. Thanks for your help.

0
Entering edit mode

Would -q 20 be a good treshold for skipping the unique read filtering step?

0
Entering edit mode

I see how this is in general a good strategy. However, when quantifying reads to repeated sequences, such as transposons, wont this cause a drastic over-estimation?

If you don't extract primary or unique reads from the sam/bam, the reads that maps equally well to multiple sequences will cause a serious bias in quantification of repeated elements. I would say the best strategy here is to

1. Rank the alignments of the same read on alignment score
2. Choose only the best (primary read)
3. If the top scoring alignments have equal alignment score, choose one randomly.

Is this possible in bowtie2? It is in STAR.

0
Entering edit mode

That's what bowtie2 does by default. If you want primary alignments only then just use the appropriate samtools flag.

0
Entering edit mode

I don't know why I wrote that 3rd step. It is of course default behaviour in bowtie2.

What I meant was to only keep an alignment if it is better than all other alignments of that read. That is, if the top alignment has the same score as the second, it should be discarded as it is "impossible" to say which alignment is correct. However, if the first alignment has a perfect alignment and the second has e.g. 4 mismatches, it would be reasonable to keep the first ("primary") read.

I guess an option is to look whether AS > XS.

1
Entering edit mode

It's simpler to just filter by MAPQ, since what you care about is the likelihood that the alignment is wrong. Reasonable thresholds tend to be between 5 and 10 for most applications.

7
Entering edit mode
6.1 years ago
merodev ▴ 140

If you want only the uniquely mapped reads from bowtie2 sam files then you can proceed as follows:

grep -E "@|NM:" bt2output.sam | grep -v "XS:" > uniq_bt2output.sam


This first checks to see if there is alignment and then removes reads with secondary alignments. It will keep the headers intact. Hope this helps.

2
Entering edit mode
8.1 years ago
poisonAlien ★ 3.1k

Looking at Bowtie2 manual:

XS:i: Alignment score for the best-scoring alignment found other than the alignment reported. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read. Note that, when the read is part of a concordantly-aligned pair, this score could be greater than AS:i. I think Bowtie2 adds XS:i tag if more than one valid alignments are found. You can remove lines with XS:i tag from your SAM file maybe?

4
Entering edit mode

I wouldn't recommend that. Just because a read has a valid second-best alignment doesn't mean it's unreliable, which is really what people should be filtering by. For example, if the best alignment has no mismatches and the second best has 6, then the alignment is quite reliable but would be filtered out. The proper way to do things is to filter by MAPQ (the concept of a "unique alignment" is absurd to begin with).

0
Entering edit mode

What do you mean by "not unreliable" and "proper way to do things"? Isn't it dependent on the use-case? Couldn't there be a biological scenario where we want to find reads that aligned only once?

0
Entering edit mode
8.1 years ago

you can use samtools to sort with the read names and print the lines which have unique read names, the first 12 charcters,

NB: uniq command suppose that the input is sorted so I used sort first

samtools sort -no <ur bam > | uniq -uw 12

3
Entering edit mode

That won't work. Bowtie2, like many aligners, can and will print only one of the many possible alignments to the SAM file. The correct procedure is to simply filter by MAPQ score and call it done.

0
Entering edit mode

Thanks for your reply, but the results are not correct, I guess it is because some reads are not 12 characters

$less sort.sam|uniq -uw 12|wc 55383 1066786 30950672 SRR029237.22234 16 9 13629074 42 7M1I22M1I194M * 0 0 TCTTTTCTAGACAAGCAAATGCTGAGGGAAGTTCATTACCAGTATACCTGCCTTACAAGAGCTCCTGAAAGAAGGACTAAATATGGAAAGAAAAGACCGTTATCATTCACTCCAAAAGAGATGTAAGTACAGAGACCAATGACGCTATAAAATAGCCACACAAACAAGTCAGCATAATAAACAGTTAACAACATGATGACAGGATCAAATCCACAAATATCAATA 3:+7CG:51;1;=C;=0BF<8;:512BF<B&<C9==C:<C;=8<<8;B:93<<C87;C;6855=C::0BF<<A<B===2AE<4<7@#;@=(6CG<77@<:B<:=9:B9=;<<C$4AE<8=:1<:6>==;=;===<=C<A=;=<=<9<;+7CG<<;:B==<<-@D<<C<<<9<=<=:B<-@D<=<<C<C<;B<<==<;=<===D=<<0BF<=C;</BF;;==;B8AS:i:-16    XN:i:0    XM:i:0    XO:i:2    XG:i:2    NM:i:2    MD:Z:223    YT:Z:UU
SRR029237.48319    16    9    13706101    42    20M1D122M1I45M    *    0    0    CCGCTAGGTCAATGGAATCATTCTTTGTGACTTGGGAGTTTCTCAGAGCCAGCAATGTTATCCAAGCTTCATCCCTGTGACTGAGGAGAATTGCATCAGCTGCTCACGGGATACCAGATCTCAGTACGACATCTATTTAAGCAAATGCCCTGCATGGCAGATGTCAGAGTGCAGTTACTGGATGCACG    2;<<8=9A<::B=;B7@<<<:@<+?C83;;;/93AE<(\$;?:9=;=<<8@<==:B;<4=<49A<C<;<C3==)>B88<:=;88==C3<=D3<7<8;<=<;<;=7=7<-@D:<<8@:<;<9<9<==7:9<:<==6:2BF;B=6&<A<=2BF<:=:;=

0
Entering edit mode
7.4 years ago
zhan2142 ▴ 10
samtools view -q 255 WG.bam


will give you the uniquely mapped reads. At least it worked for me just a few minutes ago.

It is a little bit complicated to explain why Bowtie2 uses 255 in the mapping quality score to indicate uniquely mapped reads, but here is a very good blog that discusses the MAPQ score in bowtie2: http://biofinysics.blogspot.com/2014/05/how-does-bowtie2-assign-mapq-scores.html

My simple (but may not be correct) reasoning is,

1. MAPQ=-10logP, P=the probability of the mapping is wrong
2. 255 is to be used for mapped reads where the MAPQ is not available (SAM specification)

For a perfectly uniquely mapped read, the probability of getting wrong is 0, and we know log0 doesn't exit, so 255 is used here to indicate the perfect match.

Given this reasoning, if the reads could map to two positions, then the probability of the alignment is wrong is 0.5, so MAPQ = -10*log(0.5, base=10) ~ 3.

However, I might miss some "good" mapping by filtering on -q 255, as explained by the blog post I referred to previously. Please check that blog carefully.

0
Entering edit mode

Bowtie2 does not assign a MAPQ of 255 to "uniquely" aligned reads. You might be thinking of older versions of tophat2.