Question

Weird behavior of `--max-depth` argument from `bcftools mpileup` command when its value is too low

1

Entering edit mode

2.1 years ago

sbstevenlee ▴ 480

Dear all,

Today I'm trying to clarify my understanding of --max-depth argument from bcftools mpileup command, which is currently described as:

At a position, read maximally INT reads per input file. Note that the original samtools mpileup command had a minimum value of 8000/n where n was the number of input files given to mpileup. This means that in samtools mpileup the default was highly likely to be increased and the -d parameter would have an effect only once above the cross-sample minimum of 8000. This behavior was problematic when working with a combination of single- and multi-sample bams, therefore in bcftools mpileup the user is given the full control (and responsibility), and an informative message is printed instead [250]

My question is: I get that this argument sets the maximum number of reads used to study each position per input file, but is there any minimum threshold (e.g. 100) below which user-specified value has no effects?

For example, I'm currently using bcftools' mpileup + call pipeline for variant calling of a 30x WGS sample:

$ fasta=/Users/sbslee/OneDrive/TestFiles/genome.fa
$ bam=/Users/sbslee/OneDrive/PyPGx/GeT-RM/grch37-bam/NA18519_PyPGx.sorted.markdup.recal.bam
$ region=chr2:234663648-234663651
$ bcftools mpileup -Ou -f $fasta -r $region $bam | bcftools call -mv -Ov

Which gives:

Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[W::hts_idx_load3] The index file is older than the data file: /Users/sbslee/OneDrive/PyPGx/GeT-RM/grch37-bam/NA18519_PyPGx.sorted.markdup.recal.bai
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.15+htslib-1.15
##bcftoolsCommand=mpileup -Ou -f /Users/sbslee/OneDrive/TestFiles/genome.fa -r chr2:234663648-234663651 /Users/sbslee/OneDrive/PyPGx/GeT-RM/grch37-bam/NA18519_PyPGx.sorted.markdup.recal.bam
##reference=file:///Users/sbslee/OneDrive/TestFiles/genome.fa
##contig=<ID=chrM,length=16571>
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)">
##INFO=<ID=MQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality Bias (closer to 0 is better)">
##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)">
##INFO=<ID=MQSBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality vs Strand Bias (closer to 0 is better)">
##INFO=<ID=SCBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Soft-Clip Length Bias (closer to 0 is better)">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callVersion=1.15+htslib-1.15
##bcftools_callCommand=call -mv -Ov; Date=Sat Apr  2 16:16:56 2022
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA18519_PyPGx
chr2    234663649   .   C   G   225.417 .   DP=35;VDB=0.241828;SGB=-0.693136;MQSBZ=0;FS=0;MQ0F=0;AC=2;AN=2;DP4=0,0,17,18;MQ=60  GT:PL   1/1:255,105,0

Because I didn't change --max-depth, mpileup used the default value of 250 and the sample had a total of 35 reads for chr2-234663649-C-G (i.e. DP=35).

Now, by setting --max-depth 10 I would expect the variant to be still called but with DP=10 instead of DP=35. Let's run this:

$ bcftools mpileup -Ou -f $fasta -r $region --max-depth 10 $bam | bcftools call -mv -Ov

Which gives:

Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[W::hts_idx_load3] The index file is older than the data file: /Users/sbslee/OneDrive/PyPGx/GeT-RM/grch37-bam/NA18519_PyPGx.sorted.markdup.recal.bai
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 10
Note: The maximum per-sample depth with -d 10 is 10.0x
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.15+htslib-1.15
##bcftoolsCommand=mpileup -Ou -f /Users/sbslee/OneDrive/TestFiles/genome.fa -r chr2:234663648-234663651 -d 10 /Users/sbslee/OneDrive/PyPGx/GeT-RM/grch37-bam/NA18519_PyPGx.sorted.markdup.recal.bam
##reference=file:///Users/sbslee/OneDrive/TestFiles/genome.fa
##contig=<ID=chrM,length=16571>
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)">
##INFO=<ID=MQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality Bias (closer to 0 is better)">
##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)">
##INFO=<ID=MQSBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality vs Strand Bias (closer to 0 is better)">
##INFO=<ID=SCBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Soft-Clip Length Bias (closer to 0 is better)">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callVersion=1.15+htslib-1.15
##bcftools_callCommand=call -mv -Ov; Date=Sat Apr  2 16:30:10 2022
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA18519_PyPGx
chr2    234663649   .   C   G   225.417 .   DP=33;VDB=0.184924;SGB=-0.693127;MQSBZ=0;FS=0;MQ0F=0;AC=2;AN=2;DP4=0,0,16,17;MQ=60  GT:PL   1/1:255,99,0

As you can see, the total depth decreased from 35 to 33 but it's nowhere close to our expected 10. Can someone please enlighten me what I'm missing here?

FYI, I'm using the latest version of bcftools available in bioconda (1.15):

$ conda list | grep "bcftools"
bcftools                  1.15                 hd1821b6_2    bioconda

P.S. This question has been cross-posted onto bcftools GitHub repository issue.

bcftools mpileup • 1.4k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 2.1 years ago by sbstevenlee ▴ 480

0

Entering edit mode

P.S. This question has been cross-posted onto bcftools GitHub repository issue.

In general, please don't crosspost — it just leads to duplication of effort as different people answer in different forums.

In particular, please don't crosspost between this Q&A forum and the tool's GitHub repository. Unless you are sure you have encountered a specific bug in the tool, if you are considering whether to post a question here or as an issue in the tool's repository, it is always more beneficial to post as a question here in a Q&A forum with a wider audience.

ADD REPLY • link 2.1 years ago by John Marshall 3.0k

score 1 · Accepted Answer · 2022-04-04

Sharing a satisfying answer I received from a contributor of bcftools:

It's not reading ahead further in the file to work out whether a read which may currently cause over-depth will be needed later, so sadly this artefact does still happen, but I believe it may attempt to mitigate it to a certain degree by limiting the number of alignments starting within close proximity (which is the usual pattern for a high spike caused by misalignment of repeats). Hence the filtering isn't as strict as it may sound, an is perhaps closer to an average maximum depth threshold rather than something to take as a literal limit.

You can read the full discussion here.