Question: Quickest way to filter vcf using bed files
23 months ago, spiral01 wrote:

I am filtering VCF files against BED files with vcftools. I have one BED file per VCF (split by chromosome):

for i in "${chroms[@]}"; do vcftools --gzvcf denisovan/chr"$i"_mq25_mapab100.vcf.gz --bed bed/chr"$i"_mask.bed --recode --keep-INFO-all --stdout | gzip -c > filtered/denisovan.filtered."$i".vcf.gz; done

This is proving extremely slow. Is there any tool that is much quicker for doing this?


Note: use bgzip rather than gzip.

written 23 months ago by Pierre Lindenbaum

Hi, what is the reasoning for using bgzip over gzip? Thanks.

written 23 months ago by spiral01

bgzip allows random access, and you can use it with tabix: once the file is indexed, it is faster to extract an arbitrary portion (a genomic interval) of your VCF.

For example, from https://software.broadinstitute.org/software/igv/VCF:

"VCF data files must be indexed for viewing in IGV, either by using igvtools or by using Tabix."
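
To make the bgzip/tabix point concrete, here is a minimal sketch of recompressing a gzip'd VCF as BGZF, indexing it, and pulling out one interval. It assumes htslib's bgzip and tabix are installed; the function names, file names and the region string are hypothetical, and the contig name in the region must match whatever the VCF itself uses (for example `21` versus `chr21`).

```shell
#!/usr/bin/env bash
# Sketch only: BGZF-compress and index a VCF, then extract a genomic interval.
# Assumes bgzip and tabix (htslib) are on PATH; all file names are hypothetical.

index_vcf() {
    local in_gz="$1"     # plain-gzip input VCF
    local out_bgz="$2"   # BGZF-compressed output VCF
    # Recompress in BGZF blocks so the file supports random access.
    gzip -dc "$in_gz" | bgzip -c > "$out_bgz"
    # Build the tabix index; it is written next to the file as "$out_bgz.tbi".
    tabix -p vcf "$out_bgz"
}

fetch_region() {
    local bgz="$1"       # BGZF-compressed, tabix-indexed VCF
    local region="$2"    # e.g. 21:1000000-2000000 (hypothetical interval)
    # -h also prints the VCF header lines before the matching records.
    tabix -h "$bgz" "$region"
}
```

Something like `index_vcf chr21.vcf.gz chr21.bgz.vcf.gz` followed by `fetch_region chr21.bgz.vcf.gz 21:1000000-2000000` would then read only the compressed blocks that cover the requested interval instead of scanning the whole file.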

written 23 months ago by Pierre Lindenbaum
23 months ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Use GNU parallel:

 (seq 1 10 && echo X && echo Y) | parallel ' vcftools --gzvcf denisovan/chr{}_mq25_mapab100.vcf.gz (...) '
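
As a fuller sketch of the same idea, the wrapper below combines the vcftools options from the question with an explicit job count. It assumes GNU parallel, vcftools and bgzip are on PATH; the function name `filter_in_parallel` and its `jobs` argument are hypothetical, and the output is piped through bgzip rather than gzip, following the note in the comments.

```shell
#!/usr/bin/env bash
# Sketch only: run the vcftools filter from the question on several chromosomes at once.
# Assumes GNU parallel, vcftools and bgzip are installed; paths mirror the question.

filter_in_parallel() {
    local jobs="$1"   # hypothetical parameter: number of chromosomes processed at once
    # Chromosome names arrive on stdin, one per line; parallel substitutes each for {}.
    parallel -j "$jobs" \
        'vcftools --gzvcf denisovan/chr{}_mq25_mapab100.vcf.gz --bed bed/chr{}_mask.bed --recode --keep-INFO-all --stdout | bgzip -c > filtered/denisovan.filtered.{}.vcf.gz'
}

# Usage (not run here): (seq 1 22 && echo X) | filter_in_parallel 4
```

Adding `--dry-run` to the parallel call prints the generated commands without executing them, which is a cheap way to check the `{}` substitution before starting a long run. Parallelism only helps while there are free cores and disk bandwidth; each individual vcftools process runs no faster than before.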

That's extremely useful. Thank you.

written 23 months ago by spiral01

I have been running the above code as follows since yesterday:

(seq 1 22 && echo X) | parallel ' vcftools --gzvcf denisovan/chr{}_mq25_mapab100.vcf.gz --bed bed/chr{}_mask.bed --recode --keep-INFO-all --stdout | bgzip -i -I filtered/{}.tbi > denisovan.filtered.{}.vcf.bgz '

vcftools is running on four files at once, but it is still extremely slow. Each vcf.gz file is 3-4 GB, and after almost 24 hours none of the four has finished. Is this normal? I know I am limited by processor speed, but since the full run will take several days, I wanted to check whether there is any way to optimise it. Thanks.

written 23 months ago by spiral01
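
One option that is not proposed anywhere in this thread is to do the BED filtering with bcftools instead of vcftools. The sketch below is untested on these files, so whether it is actually faster here is an open question. It assumes bcftools and tabix are installed; the function name `filter_chrom` is hypothetical, and the paths simply mirror those in the question. Note also that `bgzip -i` writes a BGZF block index (.gzi), not a tabix index, so the sketch runs tabix as a separate step.

```shell
#!/usr/bin/env bash
# Sketch only: filter one chromosome against its BED mask with bcftools.
# Assumes bcftools and tabix (htslib) are on PATH; paths mirror the thread.

filter_chrom() {
    local chrom="$1"
    local in_vcf="denisovan/chr${chrom}_mq25_mapab100.vcf.gz"
    local mask="bed/chr${chrom}_mask.bed"
    local out_vcf="filtered/denisovan.filtered.${chrom}.vcf.gz"

    # -T streams the input and keeps records that fall inside the BED intervals
    # (a .bed suffix makes bcftools read the coordinates as 0-based, half-open).
    # -Oz writes BGZF-compressed output, so tabix can index it directly.
    bcftools view -T "$mask" -Oz -o "$out_vcf" "$in_vcf"
    tabix -p vcf "$out_vcf"
}
```

This can be combined with the GNU parallel approach by running `export -f filter_chrom` in bash and then `(seq 1 22 && echo X) | parallel filter_chrom {}`. Records that only partly overlap an interval, such as indels, may be handled differently from vcftools `--bed`, so it is worth comparing the output for one chromosome before switching.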