bcftools view -Oz -S [sample list] [input vcf] -o [output vcf] is incredibly slow for a VCF with many samples
4.5 years ago
curious ▴ 750

I have a tabix-indexed, gzipped VCF that contains about 40K samples (approx. 74 GB).

I just want to isolate 6 samples from this VCF, which I have in a sample list.

I run a job on a cluster that is basically:

bcftools view -Oz -S [sample list] [input vcf] -o [output vcf]

Five hours later it was still running and the output file was only about 2000 KB, so I shut it down to ask here. Why is this so slow? Running bcftools stats on the output, I can tell it is adding variants very slowly with each write. Should I be going about this a different way, or is this just the reality of working with a file this big?

I tried increasing the number of compression threads with --threads, but it is not obvious that this provides a speedup.

vcf bcftools • 1.7k views
4.5 years ago

bcftools needs to parse every genotype; maybe that's slow for 40k samples.

Try running in parallel, one job per contig?
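A minimal sketch of the per-contig idea, assuming the contig names come from the tabix index and that 4 parallel jobs suit your cluster. The helper name contig_jobs, the out.<contig>.vcf.gz naming and the file names are placeholders of mine, not anything bcftools provides. It only prints one bcftools view command per contig, so you can inspect the commands before running them:

```shell
# Hypothetical helper: read contig names on stdin and print one
# "bcftools view" command per contig.
#   $1 = input VCF (must be indexed, since -r needs the index)
#   $2 = sample list file
contig_jobs() {
  while IFS= read -r contig; do
    printf 'bcftools view -Oz -S %s -r %s -o out.%s.vcf.gz %s\n' \
      "$2" "$contig" "$contig" "$1"
  done
}

# Usage sketch (placeholders; assumes contig names have no shell-special characters):
#   tabix -l input.vcf.gz \
#     | contig_jobs input.vcf.gz samples_list.txt \
#     | xargs -P 4 -I{} sh -c '{}'
# Afterwards, join the pieces in contig order, e.g.:
#   bcftools concat -Oz -o out.vcf.gz out.chr1.vcf.gz out.chr2.vcf.gz
```

Each job still has to parse all 40k genotype columns for its own region, so this only spreads the work over several cores; it does not reduce the total work.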

Otherwise, try cut?

First, get the column numbers of your samples (this numbers the whole #CHROM line, so the first sample is column 10):

bcftools view --header-only input.vcf.gz | grep "^#CHROM" | tr "\t" "\n" | cat -n | grep -wFf samples_list.txt

Then use cut, keeping the nine fixed columns (1-9) plus those sample columns:

gunzip -c input.vcf.gz | cut -f 1-9,<and-the-columns-indexes> | bgzip > out.vcf.gz

I'm not sure if it will be faster...
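If you want to script the column lookup rather than read the numbers off by eye, here is a rough sketch using awk. The function name sample_columns is made up for illustration, and it assumes the sample list has one exact sample name per line:

```shell
# Hypothetical helper: print the comma-separated column numbers of the
# samples listed in file $1, reading VCF header lines on stdin.
sample_columns() {
  awk -F'\t' '
    NR == FNR { want[$0] = 1; next }   # first input: wanted sample names
    /^#CHROM/ {
      for (i = 10; i <= NF; i++)       # sample columns start at column 10
        if ($i in want) cols = cols (cols == "" ? "" : ",") i
      print cols
      exit
    }
  ' "$1" -
}

# Usage sketch (file names are placeholders):
#   cols=$(bcftools view --header-only input.vcf.gz | sample_columns samples_list.txt)
#   gunzip -c input.vcf.gz | cut -f "1-9,${cols}" | bgzip > out.vcf.gz
```

Two caveats: cut emits columns in file order, not in the order of your list, and unlike bcftools view it leaves the INFO column untouched, so tags such as AC and AN will still describe all 40k samples.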


I will give that a shot, thank you so much.
