Question

Filtering VCF to divide with equal sizes

0

Entering edit mode

10 months ago

avelarbio46 ▴ 30

Hello everyone!

I have a very large VCF file (>400gb), and I want to divide it to use with VEP. VEP recommends separating the vcf, so I generated a list of contigs, based on the header, with 3^7 bases for each chromosome. This gave me a list list like this:

contig length

All the alt/small contigs are excluded because there are no variants within them.

And I have my chromosomes separated in different vcf files from a preprocessing

chr1.vcf.gz
chr2.vcf.gz
etc

But separating chromosomes like:

bcftools view -r "$CHR" bigvcffile.vcf > "$CHR".vcf

Seems very inefficient as bcftools will run the filter on the big file (which takes 5-6 hours), all the times I separate a chromosome. I did this because this process would be worsened if I did a length filtering with 100 pieces, each one filtering based on the original VCF

How can I use bcftools to split these vcf based on my file in a more efficient way? Not sure if it is even possible

bcftools vcf • 735 views

ADD COMMENT • link updated 10 months ago by Ram 44k • written 10 months ago by avelarbio46 ▴ 30

1

Entering edit mode

10 months ago

Ram 44k

This thread (in particular the linked answer) might be of use: How To Split A .Vcf.Gz File

It uses GNU parallel with vcftools, but you can change it a tiny bit to use bcftools like shown in a different loop-based answer in the same post.

Also, when you use -r or -R, bcftools uses the index file to get to the region, so it does not scan the whole VCF file. You don't need to worry about efficiency there.

ADD COMMENT • link 10 months ago by Ram 44k

0

Entering edit mode

10 months ago

avelarbio46 ▴ 30

I divided my contigs using R, but it was more complex than the program Pierre Lindenbaum made (https://jvarkit.readthedocs.io/en/latest/VcfToIntervals/)

Then I ran:

 count=0
 for i in {1..22}; do
     while IFS= read -r line; do
         ((count+=1))
         echo "$i"
         echo "$line"
         echo "$count"
       bcftools view -Oz --regions "$line" /scratch4/vcfs/chr"$i".vcf.gz >
 /scratch4/split_contig/split_"$count"_chr"$i".vcf.gz
     done < /scratch4/vcf_contigs_3e7/unix_chr"$i".txt done

You can use/read an array with chromosomes names also if you have one with this loop

I didn't know bcftools uses the index, and I'm still benchmarking, but this should be efficient

ADD COMMENT • link updated 10 months ago by Ram 44k • written 10 months ago by avelarbio46 ▴ 30

1

Entering edit mode

A more efficient way would be to make a 2 column CSV file, one column with the count value and the other with the chr:start-end range, then use a single loop to do the actual computation. That way, you could even use GNU parallel

ADD REPLY • link 10 months ago by Ram 44k

score 2 · Accepted Answer · 2023-09-21

I wrote https://jvarkit.readthedocs.io/en/latest/VcfToIntervals/ , use it to generate a set of intervals of ~10000bp for your vcf

bcftools view -O v -G in.vcf.gz | java -jar dist/jvarkit.jar vcf2intervals --distance 10000 --min-distance 10    --bed > output.bed

then, for each line of the bed, run an instance of snpEff

bcftools view --regions "<INTERVAL>" in.vcf.gz  | (...snpEff ...) > annot.vcf

at the end, run bcftools concat to merge everything

NF for inspiration: https://github.com/lindenb/gazoduc-nf/blob/4c58ed74595440d40d3cf93b1baad7b270ef3049/subworkflows/jvarkit/jvarkit.vcf2intervals.nf