Bcftools merge taking too much time and producing large file
6.1 years ago
BAGeno ▴ 190


I have VCFs of different samples, split by chromosome. I want to merge all of these files. For this, I am using the command:

bcftools merge --force-samples -m all *vcf.gz

The total size of all the VCFs in gzipped form is 230 GB, but the file produced by the merge is already 1.2 TB, and it has only merged two chromosomes so far. About 16 hours have passed since I ran this command.

Is something wrong with the command, or is it running normally?

bcftools merge vcf • 4.7k views
6.1 years ago

I believe it is expected that the merged file will be larger than the sum of all the individual VCFs. Why? Because, in the merged file, a record is created for every variant site seen across all of your files, and even if a variant is present in only a single sample (a private variant), a genotype has to be written for every sample at that site.

Consider the following three single-sample files:

Sample S1:

chr1  1234  A  T  1/1
chr1  5678  A  G  0/1
chr1  9999  G  C  0/1
chr6  1983  T  C  0/1

Sample S2:

chr3  1000  A  G  0/1
chr6  1983  T  C  1/1

Sample S3:

chr1  4444  T  C  1/1

Merged:

                  S1   S2   S3
chr1  1234  A  T  1/1  ./.  ./.
chr1  4444  T  C  ./.  ./.  1/1
chr1  5678  A  G  0/1  ./.  ./.
chr1  9999  G  C  0/1  ./.  ./.
chr3  1000  A  G  ./.  0/1  ./.
chr6  1983  T  C  0/1  1/1  ./.
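The padding behaviour above can be sketched with a toy script (plain awk, not bcftools itself; the file names s1.txt and s2.txt are made up for illustration): every site seen in any input gets one output row, with ./. filled in for samples that lack a call there.

```shell
# Toy illustration of merge padding, NOT bcftools itself: emit the union
# of sites across two per-sample tables, with ./. for absent genotypes.
printf 'chr1 1234 A T 1/1\nchr3 1000 A G 0/1\n' > s1.txt
printf 'chr3 1000 A G 1/1\n' > s2.txt

awk '
  { key = $1" "$2" "$3" "$4; gt[key, FILENAME] = $5; seen[key] = 1 }
  END {
    for (k in seen)
      print k, ((k, "s1.txt") in gt ? gt[k, "s1.txt"] : "./."),
               ((k, "s2.txt") in gt ? gt[k, "s2.txt"] : "./.")
  }' s1.txt s2.txt | sort
```

Two input files with three calls between them already produce two full multi-sample rows; with thousands of samples, every private variant costs a genotype column for every other sample.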

The situation becomes 'worse' the more heterogeneous your cohort is, and also if your samples carry many private variants.

Incidentally, if you save your merged file as BCF, it will likely be smaller than the total size of your individual VCFs, because BCF is a binary format. Think of the difference between SAM and BAM: a huge difference in size.

If you still have major issues with space and time, then consider removing all private variants prior to merging, or variants with a MAF below a certain cut-off.
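A hedged sketch of that pre-merge filter: `bcftools view -q FREQ:minor` imposes a minimum minor-allele frequency, and the 1% cut-off and filtered_* output names here are purely illustrative. The loop only prints the commands so you can inspect them before running anything:

```shell
# Print (not run) a per-file MAF filter; 0.01 is an arbitrary example
# cut-off, and the filtered_* output names are assumptions.
maf=0.01
for f in *.vcf.gz; do
    echo "bcftools view -q ${maf}:minor -Oz -o filtered_${f} ${f}"
done
```

Pipe the output to `sh` (or run each line by hand) once you are happy with the threshold.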


Before merging, by the way, I would consider normalising each and every individual VCF with:

while read gzVCF
do
    bcftools norm -m-any "${gzVCF}" | bcftools norm -Ob --check-ref w -f human_g1k_v37.fasta > "${gzVCF}".bcf ;

    bcftools index "${gzVCF}".bcf ;
done < gzVCF.list
  • 1st pipe: splits multi-allelic calls into separate variant records
  • 2nd pipe: left-aligns indels and issues a warning when the REF base in your VCF does not match the base in the supplied FASTA reference genome

The file gzVCF.list contains the paths of all of your VCF / VCF.gz files. This will ensure that your merge is conducted faithfully.
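If you do not already have gzVCF.list, one minimal way to create it (a sketch that assumes all of your compressed VCFs sit in the current directory; the touch calls only create empty stand-ins so the snippet runs anywhere):

```shell
# Build the list file that the normalisation loop reads from. The two
# touch calls create empty stand-in files purely so this runs anywhere;
# with real data they are harmless (touch only updates timestamps).
touch chr1.vcf.gz chr2.vcf.gz
ls *.vcf.gz > gzVCF.list
cat gzVCF.list
```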



Also note that the correct way to merge (and output binary BCF) is:

bcftools merge -Ob -o merged.bcf sample1.vcf sample2.vcf sample3.vcf

Doing it your way may have unexpected consequences, as you've found.

If you have many files to merge, then first build up a long command for bcftools merge:

command="bcftools merge -Ob -m none " ;

while read gzVCF
do
    command="${command}"" ""${gzVCF}" ;
done < gzVCF.list

command="${command}"" -o ProjectMerge.bcf" ;

echo `$command` ;

bcftools index ProjectMerge.bcf ;

Here, all of my individual files are listed in gzVCF.list.

NB - for

echo `$command` ;

those are backticks around $command


Hi Kevin,

Would you please share your thoughts on the presence of very low-frequency variants, or even private variants, in the merged VCF file? With these variants, there are many "./." genotypes throughout the merged VCF. We could keep only variants that are present in, say, 80 or 90% of individuals; however, on the one hand I'm not sure about the threshold value, and on the other hand we would miss many variants with such thresholds. Could you please let me know your thoughts on how we can deal with this issue?



Hello, what are you aiming to do with the data? There are indeed no standard thresholds. You could remove variants with a 'missingness' rate above 10%, i.e. keep sites with a genotype call rate >= 0.9.
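As a hedged sketch of that filter: in bcftools filtering expressions, F_MISSING is the fraction of missing genotypes at a site, so excluding sites with F_MISSING > 0.1 keeps those called in at least 90% of samples. The file names merged.bcf and filtered.bcf are placeholders, and the snippet prints the command rather than running it:

```shell
# Print the missingness filter; F_MISSING>0.1 means more than 10% of the
# genotypes at a site are ./. . File names here are placeholders.
expr='F_MISSING>0.1'
echo "bcftools view -e '${expr}' -Ob -o filtered.bcf merged.bcf"
```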

You can add more tags to your VCF/BCF as described in: How to use bcftools to calculate AF INFO field from AC and AN in VCF?

Note that my code above is now redundant. The way to merge now is via file names listed in a file:

bcftools merge --file-list <file>
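Putting the thread's advice together (a sketch that assumes your paths are in gzVCF.list, as above): build the command, echo it for inspection, and only run it where bcftools is installed and the list exists.

```shell
# Merge via a file list into compressed BCF, then index the result.
# Guarded so the sketch still runs (and just prints the command) where
# bcftools is absent or gzVCF.list has not been created yet.
cmd="bcftools merge --file-list gzVCF.list -m none -Ob -o ProjectMerge.bcf"
echo "$cmd"
if command -v bcftools >/dev/null 2>&1 && [ -s gzVCF.list ]; then
    $cmd && bcftools index ProjectMerge.bcf
fi
```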

Thanks Kevin. The data were generated from whole-genome sequencing of a cohort of ordinary individuals (some healthy, some with complex diseases, without any selection for a particular disease); the data will be used for various subsequent analyses. I'm confused about your mention of AF: I filtered the missing genotypes with --max-missing 0.98 in vcftools, and as I didn't do any filtering on minor allele count, there are many sites with low allele frequency. Still, I'm confused about how to deal with uncalled genotypes without losing many, possibly informative, variants. It would be great to have your suggestion on this issue. Thanks for updating the code.

