I am currently working with the latest 1000 Genomes release, which comes as a single large (>60 GB) .vcf.gz file. I am having difficulty processing it the way I used to process .vcf.gz files, so I would like to split it into smaller files.
My first idea is to split it by chromosome, but I have checked the vcftools site thoroughly and haven't found a built-in way to do such a split. I know I can extract the lines for a single chromosome with vcftools, but if I query this large file once per chromosome, wouldn't it be accessed (hence read) 22 times for the 22 chromosomes I want?
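For illustration, this is the kind of per-chromosome loop I mean; as far as I can tell, each iteration would decompress and scan the entire 60+ GB file again (ALL.vcf.gz stands in for the real file name):

```bash
# One full pass over the big file per chromosome -- 22 scans in total.
for chr in {1..22}; do
    vcftools --gzvcf ALL.vcf.gz --chr ${chr} --recode --out chr${chr}
done
```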
I have a home-made Perl script that can do the split in a single pass by parsing the entire file and checking each line's contents, but I'm pretty sure it will be slow. Before I start processing the file and waiting for the results, I wanted to ask whether anyone can suggest anything more elegant.
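To make the question concrete, the single-pass idea is roughly the following awk sketch of what my Perl script does (untested; it assumes awk can keep a couple of dozen output files open at once, and the file names are placeholders):

```bash
# Read the compressed VCF once and route each record to a per-chromosome file.
zcat ALL.vcf.gz | awk '
    /^#/ { hdr = hdr $0 ORS; next }                                # buffer the header lines
    {
        out = "chr" $1 ".vcf"                                      # the CHROM column picks the output file
        if (!(out in seen)) { seen[out]; printf "%s", hdr > out }  # write the header on first use
        print > out                                                # awk keeps the file open, so this appends
    }'
```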
Simple and elegant, yet powerful. But for the A+ I would need to compress the results as I generate them, not at the end of the whole process. Would another pipe to gzip fit into that one-line command, along the lines of the sketch below?
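My guess at the on-the-fly version is this; it assumes awk will happily keep one open pipe to gzip per chromosome, and the file names are again placeholders:

```bash
# Same single pass, but each chromosome's stream is compressed as it is written.
zcat ALL.vcf.gz | awk '
    /^#/ { hdr = hdr $0 ORS; next }                                # buffer the header lines
    {
        cmd = "gzip -c > chr" $1 ".vcf.gz"                         # one persistent gzip pipe per chromosome
        if (!(cmd in seen)) { seen[cmd]; printf "%s", hdr | cmd }  # send the header down the pipe first
        print | cmd                                                # records are compressed as they arrive
    }'
```

If the split files need to stay tabix-indexable, I imagine swapping bgzip in for gzip in that command string would work the same way, though I haven't tried it.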