Split a VCF file into individual sample files
5
1
Entering edit mode
6.6 years ago
win ▴ 890

I have a 1000 Genomes VCF file such as ALL.chrX.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz, which contains data for all the samples across many, many columns. I want to split this file into separate VCF files, one for each sample. I tried the following code with the new bcftools and kept getting an error that -h is not defined. Is there some other way I can convert it into individual VCF files?

    for file in *.vcf*; do
        for sample in $(bcftools view -h $file | grep "^#CHROM" | cut -f10-); do
            bcftools view -c1 -Oz -s $sample -o ${file/.vcf*/.$sample.vcf.gz} $file
        done
    done

Thanks in advance.

VCF • 8.8k views

0
Entering edit mode

The code I put is from the suggested post, it does not work, and my files are not generated using GATK.

1
Entering edit mode
6.6 years ago
Len Trigg ★ 1.5k

I would just use zgrep rather than your bcftools command to get the header:

    for file in *.vcf*; do
        for sample in $(zgrep -m 1 "^#CHROM" $file | cut -f10-); do
            bcftools view -c1 -Oz -s $sample -o ${file/.vcf*/.$sample.vcg.gz} $file
        done
    done

0
Entering edit mode

• zgrep parses the entire file
• bcftools view -h parses only the header, so it's faster
• bcftools query -l lists all samples, so it's the fastest

0
Entering edit mode

Incorrect, zgrep -m 1 does not parse the entire file. (bcftools query -l is still better though :-))

0
Entering edit mode

You're right. I didn't realize the -m 1 option stops reading the file after the first match.

1
Entering edit mode
4.7 years ago

Already stated here and here:

    for file in *.vcf*; do
        for sample in $(bcftools query -l $file); do
            bcftools view -c1 -Oz -s $sample -o ${file/.vcf*/.$sample.vcf.gz} $file
        done
    done
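The header-parsing step these loops rely on can be sanity-checked without bcftools or a real VCF: the sample names are simply tab-separated fields 10 onward of the #CHROM line. A minimal sketch with a toy header and made-up sample IDs:

```shell
# Toy #CHROM header line, shaped like a multi-sample VCF header
# (the sample IDs here are invented for illustration).
header=$(printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00097\tHG00099')

# Fields 10 onward are the sample IDs; this mirrors what
# `zgrep -m 1 "^#CHROM" $file | cut -f10-` (or, more directly,
# `bcftools query -l $file`) hands to the inner loop.
samples=$(printf '%s\n' "$header" | cut -f10-)

# The loop body would then run one `bcftools view -s $sample` per ID.
for sample in $samples; do
    echo "would write ${sample}.vcf.gz"
done
```

This only demonstrates the field extraction; on a real file, `bcftools query -l` is preferable because it reads just the header.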

0
Entering edit mode
6.6 years ago
Brice Sarver ★ 3.7k

view -h is not defined in (at least) bcftools v0.1.19; it is defined in v1.0+. I would check your binaries.

0
Entering edit mode
9 weeks ago
bacon4000 ▴ 70

FYI, I wrote a tool that can write out thousands of single-sample VCFs during a single pass through the multi-sample input. The limiting factor will be the OS and file system; most can handle 10k or more open files before hitting a bottleneck.

https://github.com/auerlab/vcf-split

0
Entering edit mode

Being able to do everything in a single pass would be great, but as you say, you have to be able to keep lots of file handles open, and that can be slower than writing to each file sequentially. Have a look at my "performance note" on this answer, which does the job purely with perl+sort+perl in a 2-pass manner without any OS or memory limitation, as this is what we have used the few times we had to divide multi-sample files with a few thousand exomes in them.
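The two-pass idea can be sketched with awk+sort+awk on toy data (this is an illustration of the approach, not Jorge's actual perl script): pass 1 tags each genotype with its sample name, the sort groups records per sample, and pass 2 writes one output file at a time, so no large number of file handles is ever open.

```shell
tmp=$(mktemp -d)

# Toy 3-sample "VCF" body: CHROM, POS, then one genotype column per sample.
cat > "$tmp/body.txt" <<'EOF'
chr1 100 0/1 0/0 1/1
chr1 200 0/0 0/1 0/0
EOF

samples="S1 S2 S3"

# Pass 1: reshape to long format ("sample chrom pos gt"), then sort so
# each sample's records become contiguous (this is the disk-backed step).
awk -v s="$samples" '{
    n = split(s, name, " ")
    for (i = 1; i <= n; i++) print name[i], $1, $2, $(i + 2)
}' "$tmp/body.txt" | sort -k1,1 -k3,3n > "$tmp/long.txt"

# Pass 2: split sequentially; only one output file is written at a time.
awk -v d="$tmp" '{ print $2, $3, $4 >> (d "/" $1 ".txt") }' "$tmp/long.txt"

wc -l "$tmp"/S*.txt
```

The trade-off discussed below follows directly from this structure: the sort over the reshaped long file is what grows with the number of calls.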

0
Entering edit mode

Thanks Jorge - I took a look at your script and quickly realized that comparing the two approaches is complicated. A lot will depend on the size and shape of the multi-sample VCF. More samples and fewer calls will favor your approach, I think, since it avoids the need to partition the samples into multiple passes. I set a hard limit of 10,000 samples in the code to prevent an I/O bottleneck.

Fewer samples and more calls favor my one-pass approach, since it doesn't need to sort a huge temporary file.

Also, vcf-split has a feature that was critical to my analysis: it supports filtering individual samples for het sites, which greatly reduces the I/O load since most sites are homozygous.
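The het-site filter can be sketched in plain shell (an illustration of the idea, not vcf-split's code): a genotype is heterozygous when its two alleles differ, regardless of whether the separator is / or |.

```shell
# Toy genotype values as they appear in VCF sample columns.
gts="0/0 0/1 1|0 1/1 0|0"

# Keep only het genotypes: normalize the phase separator, then
# compare the two alleles.
het=$(for gt in $gts; do
    a1=$(printf '%s' "$gt" | tr '|' '/' | cut -d/ -f1)
    a2=$(printf '%s' "$gt" | tr '|' '/' | cut -d/ -f2)
    [ "$a1" != "$a2" ] && echo "$gt"
done)

echo "$het"
```

Applied per sample column, a filter like this drops the homozygous majority before anything is written, which is where the I/O savings come from.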

I need to update my README to better document the pros and cons.

0
Entering edit mode

Your explanation seems logical to me, so thank you for taking the time and for helping me understand my own code. I wrote it for a couple of particular tasks, but I hadn't thought that thoroughly about its limitations. I'll keep this in mind if I ever need to use it again.

0
Entering edit mode
9 weeks ago
bacon4000 ▴ 70

With a task like this, I think one will have to pay the piper in some way. Creating thousands of single-sample files is inherently I/O intensive.

BTW, because I was only interested in het sites for my analysis, I was able to use over 40,000 simultaneous open files in vcf-split without a problem on my workstation. If I were keeping every site I'd want to limit it to under 10k.

For anyone unfamiliar, Unix sort commands limit memory use for large files by sorting small sections, saving the data to temporary files, and merging them. If the list is huge, there are a lot of temp files and it becomes very I/O intensive to write them and merge them. You can mitigate this to some extent using sort --buffer-size, telling it to use larger fragments and hence fewer temp files.
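For illustration, the knobs mentioned above look like this with GNU sort (these long-option names are from GNU coreutils; BSD sort spells them -S and -T):

```shell
# Sort a small numeric list with an explicit memory buffer and temp
# directory. On large inputs, a bigger --buffer-size means longer
# in-memory runs and therefore fewer temporary files to merge.
tmpdir=$(mktemp -d)
printf '%s\n' 30 10 20 > "$tmpdir/in.txt"
sort -n --buffer-size=1M --temporary-directory="$tmpdir" \
    "$tmpdir/in.txt" > "$tmpdir/out.txt"
cat "$tmpdir/out.txt"
```

Pointing --temporary-directory at a fast local disk also helps when the default temp location is on slow or shared storage.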

I'm not sure a better strategy exists for files with relatively few calls and more than a few thousand samples.

One other thing I noticed but forgot to mention: running on a VCF with 10,000 samples, the perl script was actually CPU-bound during the first step (separating the fixed fields and reshaping the samples into a linear format). Perl CPU time appears to be the bottleneck, while disk I/O is minimal. I haven't checked the split step, but I suspect it's fine since it's simple and writes one file at a time.

FreeBSD barracuda.uits  bacon ~/Barracuda/TOPMed/phased 1005: top

PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
12053 bacon         1 103    0    39M    28M CPU0     0 559:53  99.80% perl
12055 bacon         1  23    0    53M    40M pipdwt   1  43:25   7.62% bcftools
12054 bacon         1  20    0    27M    11M piperd   0   1:25   0.33% gsort
17792 bacon         1  20    0    13M  3356K CPU1     1   0:00   0.04% top

FreeBSD barracuda.uits  bacon ~/Barracuda/TOPMed/phased 1006: zpool iostat 1

capacity     operations    bandwidth
----------  -----  -----  -----  -----  -----  -----
zroot       9.41T  1.47T     42     33   298K   347K
zroot       9.41T  1.47T      0      0      0      0
zroot       9.41T  1.47T      0      0      0      0
zroot       9.41T  1.47T      0    671      0  8.45M
zroot       9.41T  1.47T      0      0   103K      0
zroot       9.41T  1.47T      0      0      0      0
zroot       9.41T  1.47T      0      0      0      0


You might be able to speed it up by reimplementing it in C. Biolibc and libxtend can help with this. I may add this as an alternative in vcf-split (with proper credit to Jorge) so users can pick the approach that best suits their data.