Filtering VCF to divide with equal sizes
3
0
Entering edit mode
7 months ago
avelarbio46 ▴ 30

Hello everyone!

I have a very large VCF file (>400gb), and I want to divide it to use with VEP. VEP recommends separating the vcf, so I generated a list of contigs, based on the header, with 3^7 bases for each chromosome. This gave me a list list like this:

contig length

All the alt/small contigs are excluded because there are no variants within them.

And I have my chromosomes separated in different vcf files from a preprocessing

chr1.vcf.gz
chr2.vcf.gz
etc

But separating chromosomes like:

bcftools view -r "$CHR" bigvcffile.vcf > "$CHR".vcf

Seems very inefficient as bcftools will run the filter on the big file (which takes 5-6 hours), all the times I separate a chromosome. I did this because this process would be worsened if I did a length filtering with 100 pieces, each one filtering based on the original VCF

How can I use bcftools to split these vcf based on my file in a more efficient way? Not sure if it is even possible

bcftools vcf • 646 views
ADD COMMENT
2
Entering edit mode
7 months ago

I wrote https://jvarkit.readthedocs.io/en/latest/VcfToIntervals/ , use it to generate a set of intervals of ~10000bp for your vcf

bcftools view -O v -G in.vcf.gz | java -jar dist/jvarkit.jar vcf2intervals --distance 10000 --min-distance 10    --bed > output.bed

then, for each line of the bed, run an instance of snpEff

bcftools view --regions "<INTERVAL>" in.vcf.gz  | (...snpEff ...) > annot.vcf

at the end, run bcftools concat to merge everything

NF for inspiration: https://github.com/lindenb/gazoduc-nf/blob/4c58ed74595440d40d3cf93b1baad7b270ef3049/subworkflows/jvarkit/jvarkit.vcf2intervals.nf

ADD COMMENT
1
Entering edit mode
7 months ago
Ram 43k

This thread (in particular the linked answer) might be of use: How To Split A .Vcf.Gz File

It uses GNU parallel with vcftools, but you can change it a tiny bit to use bcftools like shown in a different loop-based answer in the same post.

Also, when you use -r or -R, bcftools uses the index file to get to the region, so it does not scan the whole VCF file. You don't need to worry about efficiency there.

ADD COMMENT
0
Entering edit mode
7 months ago
avelarbio46 ▴ 30

I divided my contigs using R, but it was more complex than the program Pierre Lindenbaum made (https://jvarkit.readthedocs.io/en/latest/VcfToIntervals/)

Then I ran:

 count=0
 for i in {1..22}; do
     while IFS= read -r line; do
         ((count+=1))
         echo "$i"
         echo "$line"
         echo "$count"
       bcftools view -Oz --regions "$line" /scratch4/vcfs/chr"$i".vcf.gz >
 /scratch4/split_contig/split_"$count"_chr"$i".vcf.gz
     done < /scratch4/vcf_contigs_3e7/unix_chr"$i".txt done

You can use/read an array with chromosomes names also if you have one with this loop

I didn't know bcftools uses the index, and I'm still benchmarking, but this should be efficient

ADD COMMENT
1
Entering edit mode

A more efficient way would be to make a 2 column CSV file, one column with the count value and the other with the chr:start-end range, then use a single loop to do the actual computation. That way, you could even use GNU parallel

ADD REPLY

Login before adding your answer.

Traffic: 1836 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6