Filtering VCF to divide with equal sizes
Entering edit mode
5 months ago
avelarbio46 ▴ 30

Hello everyone!

I have a very large VCF file (>400gb), and I want to divide it to use with VEP. VEP recommends separating the vcf, so I generated a list of contigs, based on the header, with 3^7 bases for each chromosome. This gave me a list list like this:

contig length

All the alt/small contigs are excluded because there are no variants within them.

And I have my chromosomes separated in different vcf files from a preprocessing


But separating chromosomes like:

bcftools view -r "$CHR" bigvcffile.vcf > "$CHR".vcf

Seems very inefficient as bcftools will run the filter on the big file (which takes 5-6 hours), all the times I separate a chromosome. I did this because this process would be worsened if I did a length filtering with 100 pieces, each one filtering based on the original VCF

How can I use bcftools to split these vcf based on my file in a more efficient way? Not sure if it is even possible

bcftools vcf • 565 views
Entering edit mode
5 months ago

I wrote , use it to generate a set of intervals of ~10000bp for your vcf

bcftools view -O v -G in.vcf.gz | java -jar dist/jvarkit.jar vcf2intervals --distance 10000 --min-distance 10    --bed > output.bed

then, for each line of the bed, run an instance of snpEff

bcftools view --regions "<INTERVAL>" in.vcf.gz  | (...snpEff ...) > annot.vcf

at the end, run bcftools concat to merge everything

NF for inspiration:

Entering edit mode
5 months ago
Ram 43k

This thread (in particular the linked answer) might be of use: How To Split A .Vcf.Gz File

It uses GNU parallel with vcftools, but you can change it a tiny bit to use bcftools like shown in a different loop-based answer in the same post.

Also, when you use -r or -R, bcftools uses the index file to get to the region, so it does not scan the whole VCF file. You don't need to worry about efficiency there.

Entering edit mode
5 months ago
avelarbio46 ▴ 30

I divided my contigs using R, but it was more complex than the program Pierre Lindenbaum made (

Then I ran:

 for i in {1..22}; do
     while IFS= read -r line; do
         echo "$i"
         echo "$line"
         echo "$count"
       bcftools view -Oz --regions "$line" /scratch4/vcfs/chr"$i".vcf.gz >
     done < /scratch4/vcf_contigs_3e7/unix_chr"$i".txt done

You can use/read an array with chromosomes names also if you have one with this loop

I didn't know bcftools uses the index, and I'm still benchmarking, but this should be efficient

Entering edit mode

A more efficient way would be to make a 2 column CSV file, one column with the count value and the other with the chr:start-end range, then use a single loop to do the actual computation. That way, you could even use GNU parallel


Login before adding your answer.

Traffic: 1530 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6