Question: vcftools discrepancy in mean depth values per site
gravatar for oscarmirap
2 days ago by
oscarmirap10 wrote:

Dear all,

I'm using VCFtools 0.1.17 to filter out loci with low coverage, this is the process:

  1. Use --site-mean-depth to get a "table.ldepth.mean" with the mean values per site.
  2. Calculate manually how many loci there are above different mean depth values, and choose a minimum depth value.
  3. Use --min-meanDP to filter out the loci below the chosen value.

Problem is:

The final vcf file after pruning low depth sites always has fewer loci than expected, when calculated from the table.

Example: If I check how many loci there are with MEAN_DEPTH above 4 in the table there is 5926; but if I prune loci using "--min-meanDP 4" there are only 2157 loci left in the output file. And there should be more, because it should not only include loci above 4, but also equal to 4.

Tried this with many different data, and combinations of values for --min-meanDP and --max-meanDP, and I always get way fewer loci in the output file than if I calculate their number manually from the table.

After some digging I found out, the problem are the missing values, looks like ----site-mean-depth ignores missing values, while --min-meanDP considers that they have DP=0 (which makes more sense).

If the vcf file has no missing genotypes, the numbers match.

Someone else has this problem?

Is there an alternative to calculate mean depth per locus that will consider missing as 0?


vcftools snp coverage depth vcf • 55 views
ADD COMMENTlink modified 2 days ago by Kevin Blighe61k • written 2 days ago by oscarmirap10
gravatar for Kevin Blighe
2 days ago by
Kevin Blighe61k
Kevin Blighe61k wrote:

Someone else has this problem?

These problems are common, believe it or not. However, these programs come with a vast amount of parameter configurations - I can understand why the VCFtools developer would encode one different from the other at design time.

A missing / NA value at a particular site may not necessarily reflect 0 - depending on the variant caller, it could simply mean that not enough information was available to make a call, but not necessarily indicating 0 reads.

You may want to switch over to BCFtools, by the way. VCFtools is not maintained anymore, as far as I am aware.

Anyway, it is for these discrepancies that I never use these pre-programmed functions of these tools. I became comfortable with AWK and general shell scripting for passing VCFs, and this allows me to do pretty much everything that I need.


ADD COMMENTlink modified 2 days ago • written 2 days ago by Kevin Blighe61k

Thanks for the heads up on "missing", Kevin, you are right. Yes, before seeing your answer we were discussing already that it will be better to write our own script, in perl in my case. Cheers!


ADD REPLYlink written 1 day ago by oscarmirap10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1299 users visited in the last hour