Difference In Quality Strings For Samtools Mpileup When Using -L Option Versus -R Option
1
3
Entering edit mode
11.3 years ago
nathanddees ▴ 30

Hello -

I was wondering if anyone else has noticed this wierd phenomena using the Samtools mpileup command:

I am running mpileup on a small test BAM file, looking at chr 3, position 72880191. (The test BAM file is linked publicly here: https://dl.dropbox.com/u/11626982/test.bam.)

In one scenario, I am using the -l option for a list of variants in BED format, of which I have a list of only one:

~$ cat one_site.bed

3       72880190        72880191        A       G

~$ samtools mpileup -f all_sequences.fa -l one_site.bed test.bam

[mpileup] 1 samples in 1 input files

<mpileup> Set max per-file depth to 8000

3       72880191        A    171        .$.$..,.............GG.............,,g.GG.G.G,gG.........,gGg...G.G.G.G...G.,.,.,.....,.,,......,,,,,,,,,,,,,,.,,,,,,,,,,,,,,,gg,gggg,gggggg,gg,gg,,.,,,,G,,GggggG,G,,gGg^],^],^]g^]g      DDDDF@DDBDDBDDB>>@DBDDDDDB>DBBDBBHH9DDDBDD?HGDDD>DDD<D5FDE@DBDGDABB<EDADGCFCDBFD9DD#F#DJDDD?BBDJHDJHJDJDJDDBEDDIDDJDCDDCDFDB;;D9:>AD9;9;>3D?(@>3'D?D>@DDD>#>A8@?0@D<9<CDD#C

(Please note that all_sequences.fa here is the NCBI-human-build36 reference fasta.)

In a second scenario, I am using the -r option for listing the position of interest:

~$ samtools mpileup -f all_sequences.fa -r 3:72880191-72880191 test.bam

[mpileup] 1 samples in 1 input files

<mpileup> Set max per-file depth to 8000

3       72880191        A    171        .$.$..,.............GG.............,,g.GG.G.G,gG.........,gGg...G.G.G.G...G.,.,.,.....,.,,......,,,,,,,,,,,,,,.,,,,,,,,,,,,,,,gg,gggg,gggggg,gg,gg,,.,,,,G,,GggggG,G,,gGg^],^],^]g^]g      $111%1.10!11!,1'!3!!!26-/!*)$(-$3)!!8!!!!,!!"!'!'%-+!&$!#!#$,$!!#$!!!-#,!!%+'$0)!!!#%!6:!!!!(!7-?8,;74.6(538!33'57-5-530162-%)3)(%!3*!-'$)3#(-!)'313733!33#%!*#%0$39)'!44#!

What I am trying to figure out is - Why are the quality strings so different when everything else in the output of both scenarios is identical?

Thank you for any help!

Update - Found this on SeqAnswers:

"it looks like the issue is that if I don't specify the -r flag, samtools does not use BAQ computation."

Is there any way to make sure BAQ computation remains ON when using the -l flag?

Thanks!

samtools mpileup • 4.5k views
ADD COMMENT
1
Entering edit mode
11.3 years ago

I was able to reproduce this error in samtools v0.1.17 and v0.1.18. Their intention was always to keep BAQ computation enabled by default, and this works fine if you use -r or if you run it on the whole BAM. But BAQ computation is disabled when you use -l, and this is an error that you should report to samtools-help@lists.sourceforge.net.

With samtools v0.1.16 and earlier, BAQ computation appears to be disabled for both -r and -l, and only works when running on the whole BAM. Someone must have fixed the problem for -r in later versions, but -l was probably overlooked. (Edited: Andreas points out below that BAQ artifacts can happen at the ends of BED regions, or in this case, for a BED region that is only 1bp long)

For now, if you really need to restrict pileup to regions in a BED file, you can pipe the output of "samtools view -L" into mpileup. For e.g.

samtools view -b -L one_site.bed test.bam | samtools mpileup -f all_sequences.fa -

But that also generates output for neighboring sites outside your BED's regions. So here's a nifty perl one-liner that I made just for you, buddy!

perl -e 'map{chomp; @c=split(/\t/); map{$roi{$c[0]}{$_}=1}($c[1]+1)..$c[2]}`cat one_site.bed`; map{@c=split(/\t/); print $_ if($roi{$c[0]}{$c[1]})}`samtools view -b -L one_site.bed test.bam | samtools mpileup -f all_sequences.fa -`'

The first map{} loads all the loci in your BED file into %roi, and the second map{} prints only the pileup output for loci in %roi.

ADD COMMENT
1
Entering edit mode

Hey Cyriac, I've heard before that BAQ artifacts are possible at the end of bed-regions (that's expected I guess), however, this is the first time I hear BAQ is completely disabled when using a bed file as input. Do you remember where you read that 'BAQ computation was incorrectly disabled for both -r and -l' and that someone 'fixed the problem for -r in later versions'. I'm also affected by this problem and can't actually find the cause in samtools' source code.

ADD REPLY
0
Entering edit mode

I've edited my answer above to clarify that those were presumptions based on my testing with different samtools versions. The "fix for -r in later versions" could have been an unintentional change.

But you make a point that I wasn't aware of - that artifacts are possible at the end of bed-regions. Is this because BAQ is based on the quality of the surrounding alignments? But it seems to work correctly for -r, so there is no reason for it to behave differently with -l. I tested out this theory by adding a few flanking bps around 72880190-72880191 in the OP's BED file, and BAQ computation was properly calculated for the single site in question. So it must be some kind of exception at the ends of BED regions, like you said.

ADD REPLY

Login before adding your answer.

Traffic: 1999 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6