Question: Quickest way to filter vcf using bed files
16 months ago, spiral01 wrote:

I am filtering VCF files against BED files with vcftools. I have one BED file per VCF (split by chromosome):

for i in "${chroms[@]}"; do vcftools --gzvcf denisovan/chr"$i"_mq25_mapab100.vcf.gz --bed bed/chr"$i"_mask.bed --recode --keep-INFO-all --stdout | gzip -c > filtered/denisovan.filtered."$i".vcf.gz; done

This is proving extremely slow. Is there any tool that is much quicker for doing this?


Note: use bgzip rather than gzip.

written 16 months ago by Pierre Lindenbaum

Hi, what is the reasoning for using bgzip over gzip? Thanks.

written 16 months ago by spiral01

bgzip allows random access, and you can use it with tabix: it is faster to extract an arbitrary portion (a genomic interval) of your VCF.

e.g.: https://software.broadinstitute.org/software/igv/VCF

VCF data files must be indexed for viewing in IGV, either by using igvtools or by using Tabix.
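
To make the bgzip and tabix point concrete, here is a minimal sketch of recompressing a VCF as BGZF, indexing it, and pulling out intervals. The function names, file names, and the example region are placeholders for illustration and do not come from this thread. Whether index-based extraction actually beats vcftools for a given mask depends on how many intervals the BED file contains, so treat this as something to try rather than a guaranteed speed-up.

```shell
# Sketch only: function names, file names, and regions are placeholders.

# Recompress a plain-gzip VCF as BGZF and write a tabix index (<out>.tbi) beside it.
bgzip_and_index() {
    local in_gz="$1" out_bgz="$2"
    gzip -dc "$in_gz" | bgzip -c > "$out_bgz" && tabix -p vcf "$out_bgz"
}

# Print the header plus every record overlapping one region, e.g. 21:1000000-2000000.
extract_region() {
    local vcf_bgz="$1" region="$2"
    tabix -h "$vcf_bgz" "$region"
}

# Same idea for many intervals at once: -R reads the regions from a file
# (tabix treats it as BED when the file name ends in .bed).
extract_bed() {
    local vcf_bgz="$1" bed="$2"
    tabix -h -R "$bed" "$vcf_bgz"
}
```

The chromosome names in the region or BED file have to match those in the VCF (for example `21` versus `chr21`). Note also that `bgzip -i` writes a BGZF block index (`.gzi`), which is a different file from the `.tbi` index that `tabix -p vcf` builds.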

written 16 months ago by Pierre Lindenbaum
16 months ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Use GNU parallel:

 (seq 1 10 && echo X && echo Y) | parallel ' vcftools --gzvcf denisovan/chr{}_mq25_mapab100.vcf.gz (...) '
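
One way to spell out the elided part is to generate one complete command line per chromosome and hand the list to GNU parallel, which runs each input line as a command when no command is given on its own command line. This is a sketch under assumptions: the paths are copied from the question, while the function names `make_jobs` and `run_jobs`, the default of 4 jobs, and the log file name `filter.joblog` are made up for illustration.

```shell
# Sketch only: make_jobs, run_jobs, JOBS and filter.joblog are illustrative names.

# Print one complete filtering pipeline per chromosome given as an argument.
make_jobs() {
    local chrom
    for chrom in "$@"; do
        printf 'vcftools --gzvcf denisovan/chr%s_mq25_mapab100.vcf.gz --bed bed/chr%s_mask.bed --recode --keep-INFO-all --stdout | bgzip -c > filtered/denisovan.filtered.%s.vcf.gz\n' \
            "$chrom" "$chrom" "$chrom"
    done
}

# Run the generated commands, at most $JOBS at a time (default 4),
# recording start time, runtime, and exit status of each job in a log.
run_jobs() {
    make_jobs "$@" | parallel -j "${JOBS:-4}" --joblog filter.joblog
}

# Usage (assumes the denisovan/, bed/ and filtered/ directories exist):
#   make_jobs 1 2 X          # inspect the commands first
#   JOBS=8 run_jobs $(seq 1 22) X
```

Printing the commands before running them makes it easy to check the paths, and the job log shows which chromosomes finished and how long each one took.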

That's extremely useful. Thank you.

written 16 months ago by spiral01

I have been running the code above as follows since yesterday:

(seq 1 22 && echo X) | parallel ' vcftools --gzvcf denisovan/chr{}_mq25_mapab100.vcf.gz --bed bed/chr{}_mask.bed --recode --keep-INFO-all --stdout | bgzip -i -I filtered/{}.tbi > denisovan.filtered.{}.vcf.bgz '

vcftools is running on 4 files, but it is still extremely slow. Each vcf.gz file is 3-4 GB, yet after almost 24 hours of running, none of the four files has finished. Is this normal? I am aware that I will be limited by processor speed, but since this will take several days to complete, I wanted to check whether there is any way to optimise the process. Thanks.

written 16 months ago by spiral01