Question: bcftools view -Oz -S [sample list] [input vcf] -o [output vcf] incredibly slow for vcf with many samples
curious140 wrote:

I have a tabix-indexed, gzipped VCF that contains about 40K samples (approx. 74 GB).

I just want to isolate 6 samples from this VCF, which I have listed in a sample file.

I run a job on a cluster that basically does:

bcftools view -Oz -S [sample list] [input vcf] -o [output vcf]

Five hours later it is still running and the output file is only about 2000 KB, so I shut it down and am asking here. Why is this so slow? When I run bcftools stats on the partial output, I can see it is adding more variants only very slowly. Should I be going about this a different way, or is this just the reality of working with a file this big?

I tried increasing the number of compression threads with --threads, but it is not obvious that this provides a speedup.
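Roughly, the threaded version of the command was something like this (the thread count here is just an example):

bcftools view -Oz --threads 4 -S [sample list] [input vcf] -o [output vcf]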

Pierre Lindenbaum wrote:

bcftools needs to parse every genotype; maybe that's why it is slow for 40k samples.

Try to run it in parallel, one job per contig?
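Something like this, maybe (a rough sketch, assuming the input is tabix-indexed; file names are placeholders, and on a cluster you would cap the number of simultaneous jobs):

# extract the 6 samples for each contig in parallel
for CHR in $(tabix -l input.vcf.gz); do
    bcftools view -Oz -S samples.txt -r "$CHR" input.vcf.gz -o "subset.$CHR.vcf.gz" &
done
wait

# glue the per-contig pieces back together in the original contig order
for CHR in $(tabix -l input.vcf.gz); do echo "subset.$CHR.vcf.gz"; done > pieces.txt
bcftools concat -Oz -o subset.vcf.gz -f pieces.txt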

Otherwise, try cut?

First, get the column offsets for your samples:

bcftools view --header-only input.vcf.gz |  grep  "#CHROM" | cut -f 10- | tr "\t" "\n" | cat -n | grep -f samples_list.txt
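Note that cat -n counts the samples from 1 while the first sample sits in VCF column 10, so add 9 to each index to get the column number for cut. One way to build the comma-separated list (just a sketch, the awk step is an assumption; samples_list.txt is expected to have one sample name per line):

bcftools view --header-only input.vcf.gz | grep "#CHROM" | cut -f 10- | tr "\t" "\n" | cat -n \
    | grep -w -f samples_list.txt | awk '{cols = cols sep ($1 + 9); sep = ","} END {print cols}'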

Then, use cut:

gunzip -c input.vcf.gz | cut -f 1-9,<the-sample-column-numbers> | bgzip > out.vcf.gz

I'm not sure if it will be faster...

curious140 replied:

I will give that a shot, thank you so much.