I wish to get the variants of my sequencing data using Samtools (vcfutils). For this I need to specify the maximum read depth in VarFilter function. I have no information of this from my Data. Can anybody tell me, how should I approach to its calculations, or considering default values to obtain SNPs in vcf format??
generally, maximal coverage for variant calling is set out of memory considerations, for this purpose i think the practice recommended in the GATK Unified Genotyper should be appropriate:
"...When running on projects with many samples at low coverage (e.g. 1000 Genomes with 4x coverage per sample) we usually lower this [the maximal coverage] value to about 10 times the average coverage (40x)..."
I have previously considered this. The '-D' flag allows the program to avoid regions that have abnormally high coverage, e.g. PCR amplification errors. I calculated the mean/median coverage for the sample using BEDTools, which was also covered a previous Biostars question.
This method does includes 0 counts for regions not covered by reads.
You can put the output in R or i used the Perl script at the bottom of the above link by Heikki.
To set '-D' i used a value 3 to 5 x that of the mean coverage, you'll to play with this to get the best result for your samples. I saw very little difference in the number of SNPs using any multiplier >3.
Bear in mind SNPs will not be reported for those regions with coverage > '-D'. Setting the value too high will result in too many false positives.
Also it is hightly recommended to locally realign reads to avoid false positive SNPs due to the presence of INDELs, e.g. SRMA or use of the GATK suite.
ADD COMMENT
• link
updated 5.1 years ago by
Ram
44k
•
written 12.7 years ago by
Ian
6.1k
Is the average coverage the average of the depths of all the variants in a VCF file? Or the average coverage of every mapped base in the exome/genome?