Bcftools merge taking too much time and producing large file
6.1 years ago
BAGeno ▴ 190


I have VCFs of different samples, split by chromosome. I want to merge all of these files. For this, I am using the command:

bcftools merge --force-samples -m all *vcf.gz

The total size of all the VCFs in gzipped form is 230 GB, but the file produced by the merge is already 1.2 TB, and it has only merged two chromosomes so far. About 16 hours have passed since I ran this command.

Is something wrong with the command, or is it running normally?

bcftools merge vcf • 4.7k views
6.1 years ago

I believe it is expected that the merged file will be larger than the sum of all the individual VCFs. Why? Because, in the merged file, a record is created for every variant site seen across all of your files, and even if a variant is present in only a single sample (a private variant), a genotype has to be written for every sample at that site.

Consider the following three single-sample files:

Sample S1:

chr1  1234  A  T  1/1
chr1  5678  A  G  0/1
chr1  9999  G  C  0/1
chr6  1983  T  C  0/1

Sample S2:

chr3  1000  A  G  0/1
chr6  1983  T  C  1/1

Sample S3:

chr1  4444  T  C  1/1

Merged:

                  S1   S2   S3
chr1  1234  A  T  1/1  ./.  ./.
chr1  4444  T  C  ./.  ./.  1/1
chr1  5678  A  G  0/1  ./.  ./.
chr1  9999  G  C  0/1  ./.  ./.
chr3  1000  A  G  ./.  0/1  ./.
chr6  1983  T  C  0/1  1/1  ./.
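The padding behaviour above can be sketched with a toy script (plain awk, not bcftools itself; the file names s1.txt and s2.txt are made up for illustration): every site seen in any input gets one output row, with ./. filled in for samples that lack a call there.

```shell
# Toy illustration of merge padding, NOT bcftools itself: emit the union
# of sites across two per-sample tables, with ./. for absent genotypes.
printf 'chr1 1234 A T 1/1\nchr3 1000 A G 0/1\n' > s1.txt
printf 'chr3 1000 A G 1/1\n' > s2.txt

awk '
  { key = $1" "$2" "$3" "$4; gt[key, FILENAME] = $5; seen[key] = 1 }
  END {
    for (k in seen)
      print k, ((k, "s1.txt") in gt ? gt[k, "s1.txt"] : "./."),
               ((k, "s2.txt") in gt ? gt[k, "s2.txt"] : "./.")
  }' s1.txt s2.txt | sort
```

Two input files with three calls between them already produce two full multi-sample rows; with thousands of samples, every private variant costs a genotype column for every other sample.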

The situation becomes 'worse' the more heterogeneous your cohort is, and also if your samples carry many private variants.

Incidentally, if you save your merged file as BCF, it will likely be smaller than the total size of your individual VCFs, because BCF is a binary format. Think of the difference between SAM and BAM: a huge difference in size.

If you still have major issues with space and time, then consider removing all private variants prior to merging, or variants with a MAF below a certain cut-off.
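A hedged sketch of that pre-merge filter: `bcftools view -q FREQ:minor` imposes a minimum minor-allele frequency, and the 1% cut-off and filtered_* output names here are purely illustrative. The loop only prints the commands so you can inspect them before running anything:

```shell
# Print (not run) a per-file MAF filter; 0.01 is an arbitrary example
# cut-off, and the filtered_* output names are assumptions.
maf=0.01
for f in *.vcf.gz; do
    echo "bcftools view -q ${maf}:minor -Oz -o filtered_${f} ${f}"
done
```

Pipe the output to `sh` (or run each line by hand) once you are happy with the threshold.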


Before merging, by the way, I would consider normalising each and every individual VCF with:

while read gzVCF
do
    bcftools norm -m-any "${gzVCF}" | bcftools norm -Ob --check-ref w -f human_g1k_v37.fasta > "${gzVCF}".bcf ;

    bcftools index "${gzVCF}".bcf ;
done < gzVCF.list
  • 1st pipe: splits multi-allelic calls into separate variant records
  • 2nd pipe: left-aligns indels and issues a warning when the REF base in your VCF does not match the base in the supplied FASTA reference genome

The file gzVCF.list contains the paths of all of your VCF / VCF.gz files. This will ensure that your merge is conducted faithfully.
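If you do not already have gzVCF.list, one minimal way to create it (a sketch that assumes all of your compressed VCFs sit in the current directory; the touch calls only create empty stand-ins so the snippet runs anywhere):

```shell
# Build the list file that the normalisation loop reads from. The two
# touch calls create empty stand-in files purely so this runs anywhere;
# with real data they are harmless (touch only updates timestamps).
touch chr1.vcf.gz chr2.vcf.gz
ls *.vcf.gz > gzVCF.list
cat gzVCF.list
```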



Also note that the correct way to merge (and output binary BCF) is:

bcftools merge -Ob -o merged.bcf sample1.vcf sample2.vcf sample3.vcf

Doing it your way may have unexpected consequences, as you've found.

If you have many files to merge, then first build up a long command for bcftools merge:

command="bcftools merge -Ob -m none " ;

while read gzVCF
do
    command="${command}"" ""${gzVCF}" ;
done < gzVCF.list

command="${command}"" -o ProjectMerge.bcf" ;

echo `$command` ;

bcftools index ProjectMerge.bcf ;

Here, all of my individual files are listed in gzVCF.list.

NB - for

echo `$command` ;

those are backticks around $command


Hi Kevin,

Would you please share your thoughts on the presence of very low-frequency variants, or even private variants, in the merged VCF file? With these variants, there are many "./." genotypes throughout the merged VCF. We could keep only variants that are present in, say, 80 or 90% of individuals; however, on the one hand I'm not sure about the threshold value, and on the other hand we would miss many variants with such thresholds. Could you please let me know your thoughts on how we can deal with this issue?



Hello, what are you aiming to do with the data? There are indeed no standard thresholds. You could remove variants with a 'missingness' rate above 10%, i.e. keep sites with a genotype call rate >= 0.9.
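As a hedged sketch of that filter: in bcftools filtering expressions, F_MISSING is the fraction of missing genotypes at a site, so excluding sites with F_MISSING > 0.1 keeps those called in at least 90% of samples. The file names merged.bcf and filtered.bcf are placeholders, and the snippet prints the command rather than running it:

```shell
# Print the missingness filter; F_MISSING>0.1 means more than 10% of the
# genotypes at a site are ./. . File names here are placeholders.
expr='F_MISSING>0.1'
echo "bcftools view -e '${expr}' -Ob -o filtered.bcf merged.bcf"
```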

You can add more tags to your VCF/BCF as described in: How to use bcftools to calculate AF INFO field from AC and AN in VCF?

Note that my code above is now redundant. The way to merge now is via file names listed in a file:

bcftools merge --file-list <file>
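Putting the thread's advice together (a sketch that assumes your paths are in gzVCF.list, as above): build the command, echo it for inspection, and only run it where bcftools is installed and the list exists.

```shell
# Merge via a file list into compressed BCF, then index the result.
# Guarded so the sketch still runs (and just prints the command) where
# bcftools is absent or gzVCF.list has not been created yet.
cmd="bcftools merge --file-list gzVCF.list -m none -Ob -o ProjectMerge.bcf"
echo "$cmd"
if command -v bcftools >/dev/null 2>&1 && [ -s gzVCF.list ]; then
    $cmd && bcftools index ProjectMerge.bcf
fi
```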

Thanks Kevin. The data were generated from whole-genome sequencing of a cohort of ordinary individuals (some healthy, some with complex diseases, without any selection for a particular disease); the data will be used for various subsequent analyses. I'm confused about your mention of AF: I filtered the missing genotypes with --max-missing 0.98 in vcftools, and as I didn't do any filtering on minor allele count, there are many sites with low allele frequency. Still, I'm confused about how to deal with uncalled genotypes without losing many, possibly informative, variants. It would be great to have your suggestion on this issue. Thanks for updating the code.

