Question: How To Split Multiple Samples In Vcf File Generated By Gatk?
gravatar for newDNASeqer
7.5 years ago by
United States
newDNASeqer710 wrote:

I did variant calling using BWA + PiCard + GATK and have just got the filtered VCF files from GATK. In the process of running GATK, I used list of inputs (11 samples) and for most steps, I had only one output file for each step. Now, I got two VCF files (one for SNPs and the other is for indels), each of which contains 11 samples. I can see the names of the 11 samples in the header of vcf files, and each sample seems to have one column of data. So I am wondering how to split each VCF files into individual sample vcf files?

From my search, vcf tools seems to have the capability of splitting vcf, but I could not find an example for splitting multiple samples. Can someone please help me? Thanks a lot

vcf gatk split • 23k views
ADD COMMENTlink modified 3 months ago by William4.7k • written 7.5 years ago by newDNASeqer710
gravatar for Jorge Amigo
6.4 years ago by
Jorge Amigo12k
Santiago de Compostela, Spain
Jorge Amigo12k wrote:

I know this is an old question but, as I've already stated in this post, there's a very efficient way of doing this that hasn't been reported yet. hope it helps:

 for sample in `bcftools query -l multisample.vcf.gz`; do
   bcftools view -c1 -Oz -s $sample -o $sample.vcf.gz multisample.vcf.gz

performance note: the above code is quite slow when dealing with hundreds or thousands of samples. if that'd be the case, I've developed this other perl-based code for very large multisample files, that splits a few thousands of exomes in less than 1 hour (approximately half of the time for the split process itself and the other half for the individual compressing) by storing a couple of temporal files that do not get that large at all considering the size of the original and the splitted files:

time zcat multisample.vcf.gz | perl -lane '
if (/^#/) { if (/^##/) { print STDERR } else {
 print STDERR join "\t", @F[0..8]; @samples = @F;
} } else {
 print STDERR join "\t", @F[0..8];
 for ($i = 9; $i <= $#F; $i++) {
  if ($F[$i] =~ /^..[1-9]/) {
   print STDOUT join "\t", $samples[$i], $lc, $F[$i];
} } } $lc++' 2>_vcf.common.txt | sort -k1,1 -k2,2n >_vcf.private.txt

mkdir -p split
time perl -lane 'BEGIN {
open IN, "_vcf.common.txt" or die $!;
chomp(@common = <IN>); foreach (@common) {
 if (/^##/) { $headers .= "$_\n" } else { $headers .= $_; last }
} close IN }
if ($F[0] ne $previousSample) {
 close OUT if $previousSample;
 open OUT, ">split/$F[0].vcf";
 print OUT "$headers\t$F[0]";
} $previousSample = $F[0];
print OUT "$common[$F[1]]\t$F[2]";
END { close OUT }' _vcf.private.txt

time for file in split/*vcf; do
 bgzip -f $file; tabix -fp vcf $file.gz
ADD COMMENTlink modified 10 months ago • written 6.4 years ago by Jorge Amigo12k

This one is at least 10 times faster than the vcf-subset

ADD REPLYlink written 5.6 years ago by wqshi.nudt20

This one is great... You seriously made my day todAY. THANKS A LOT.

ADD REPLYlink written 4.7 years ago by ste.arnoux0

You're the man! Best solution ever!

ADD REPLYlink written 3.5 years ago by she.xinwei0

It works perfectly. Could you explain how does the latest part of the third line, '-o ${file/.vcf/.$sample.vcf.gz} $file', works? I know it has nothing to do with the BCF and it is just basic bash renaming but I cant figure out how can I google it (I mean what to actually write in the search bar), and I found it quite useful for any script (so I got cleaner, clearer file names). I understand how it works in this case but I'll like to learn a bit more.

Thank you

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Carles Borreda0

sure. since you will be generating a file per sample you could just simply use -o $sample.vcf.gz, but I personally prefer to keep the original file name to know where that data came from. for that reason I use .o ${file/.vcf*/.$sample.vcf.gz}, which uses a bash string manipulation function to substitute anything from .vcf to the right of the original file name for .$sample.vcf.gz. the result is a single file name for each sample that keeps the original file name (but the .vcf or .vcf.gz extension) before the sample name.

ADD REPLYlink written 3.4 years ago by Jorge Amigo12k

bash string manipulation

This is what I was looking for. Yeah I understand what you are doing but I didn't know it was possible, I was just using ${file}.${sample}.vcf.gz, but I ended up with a, which is quite annoying. I didn't know that it was possible to change $file on the fly. Not that I know I'll add this to may of my scripts. Thank you!

ADD REPLYlink written 3.4 years ago by Carles Borreda0

Hi I have one concern regarding this. When I download the individual samples using vcf view, I am getting the entire file. But with the above bcftools view method, the download does not contain Homozygous reference alleles. It does not display the 0/0 variants. Should I take some precaution for this? This is what is happening: For the following code ` vcf-subset -c NA21127 ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz > S1.vcf

Here is what the first five rows of output sample contains:


0 22 16050075 rs587697622 A G 0

1 22 16050115 rs587755077 G A 0

2 22 16050213 rs587654921 C T 0

3 22 16050319 rs587712275 C T 0

4 22 16050527 rs587769434 C A 0

For the code:

bcftools view -c1 -Oz -s NA21127 -o Sample.vcf.gz ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

The out put contains the following first five rows:


0 22 16052080 rs4965031 G A 1

1 22 16052167 rs375684679 A AAAAC 1

2 22 16052986 rs200777521 C A 1

3 22 16053444 rs80167676 A T 1

4 22 16053659 rs915675 A C 1

I even looked at the entire file separately by converting it to a dataframe, I noticed that it completely skipped the homozygous reference alleles.( I just converted the genotype column from 0/0, 0/1, 1/0 to 0,1,2 respectively just for my personal use. ) Is there anything wrong I am doing here?

ADD REPLYlink modified 16 months ago • written 16 months ago by mono0

the key is the -c1option, which filters variant lines containing less that 1 nonref allele, therefore it's supposed to be used to retrieve each sample's private variants and not to list reference homozigotes:

if you whish to have theses laters too in your output, just remove the -c1 option.

ADD REPLYlink modified 16 months ago • written 16 months ago by Jorge Amigo12k

that clears my doubt. thank you

ADD REPLYlink written 16 months ago by mono0

I am getting a syntax error by running this script at the first line:

*syntax error at line 1, near "time zcat multisamples1000gPhase3*"

Can you please tell what the first line is doing? I don't use perl so.

ADD REPLYlink written 4 months ago by Dhara Awasthi20

Without knowing what exactly have you done, I'll assume that you have copied the second piece of code of my answer into and ran it to split all 1000g samples, and you're seeing that error you describe. If that's the case, you've done one thing right and one thing wrong: the right thing is to choose that second code instead of the first since it will take much less time to give you the same results, and the wrong thing is to think that this is a Perl code. Although it uses perl in it, it is a bash script that you should save into and run it with a simple sh

ADD REPLYlink modified 4 months ago • written 4 months ago by Jorge Amigo12k
gravatar for Neilfws
7.5 years ago by
Sydney, Australia
Neilfws49k wrote:

You want vcf-subset, with the -c option:

-c, --columns <string>           File or comma-separated list of columns to keep in the vcf file. If file, one column per row

So if your sample is named S1 and you want a VCF file for only that sample named S1.vcf:

vcf-subset -c S1 bigfile.vcf > S1.vcf

There are examples on the VCFtools documentation page, but they are unhelpfully labelled "Stripping columns".

ADD COMMENTlink written 7.5 years ago by Neilfws49k

Is the bcftools view faster than vcf-subset for subsetting since it uses the new htslib?


ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by smilefreak420

it should be. in fact, on the vcftools page you can read the following note: "A fast HTSlib C version of this tool is now available (see bcftools view)."

ADD REPLYlink modified 16 months ago by Ram32k • written 6.4 years ago by Jorge Amigo12k
gravatar for William
7.5 years ago by
William4.7k wrote:

Use GATK SelectVariants -sn mySampleName to extract a single sample from a multiple sample vcf.

You can specify multiple sample names to keep, but if you want to extract all 11 to a separate vcf file you need to run the command 11 times.

I would try try to stick to the GATK vcf prostprocessing tools, because you also generated the vcf with GATK, and there is always the chance another tool set has a different interpretation of the same format. (And I don't trust tools written in Perl, but that is just my personal bias. )

ADD COMMENTlink modified 16 months ago by Ram32k • written 7.5 years ago by William4.7k

It is a good idea to use the same toolkit as much as possible to avoid things getting "lost in translation" between tools. Some tools out there do not enforce format specifications strictly enough and you can end up with files that are not compatible with other tools.

ADD REPLYlink written 7.2 years ago by vdauwera990

Thank you! I was not aware of GATK selectvariants can do it and was writing my own code.

Here are the updated links to the post:

ADD REPLYlink written 3.8 years ago by DVA550
gravatar for Ashutosh Pandey
7.5 years ago by
Ashutosh Pandey12k wrote:

vcf-subset is what you need. But you may find this post from another forum helpful.

ADD COMMENTlink written 7.5 years ago by Ashutosh Pandey12k
gravatar for alexej.knaus
7.5 years ago by
alexej.knaus130 wrote:

go to , upload your vcf file with multiple patients, let it preproocess and download single files, or analyze singe or multiple vcf files.

ADD COMMENTlink written 7.5 years ago by alexej.knaus130
gravatar for William
3 months ago by
William4.7k wrote:

There now also is a plugin in bcftools which does the split in a single pass over the multi-sample VCF/BCF file. It does not seem to be very fast, but looks correct and there are options to do the split in custom ways.

You do need to install bcftools with the plugins

Split plugin

    split VCF by sample, creating single- or multi-sample VCFs

Example command line

bcftools plugin split input.bcf -Oz -o ./

The help

About: Split VCF by sample, creating single- or multi-sample VCFs.

Usage: bcftools +split [Options]
Plugin options:
   -e, --exclude EXPR              exclude sites for which the expression is true (applied on the outputs)
   -G, --groups-file FILE          similar to -S, but the samples are split by group:

                                       # Create two output files (third column) with the second sample appearing
                                       # in both. The second column is for optional renaming of the samples, use
                                       # dash "-" to keep sample names unchanged
                                       sample1   -          file1
                                       sample2   -          file1,file2
                                       sample3   new-name3  file2

   -i, --include EXPR              include only sites for which the expression is true (applied on the outputs)
   -k, --keep-tags LIST            list of tags to keep. By default all tags are preserved
   -o, --output DIR                write output to the directory DIR
   -O, --output-type b|u|z|v       b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
   -r, --regions REGION            restrict to comma-separated list of regions
   -R, --regions-file FILE         restrict to regions listed in a file
   -S, --samples-file FILE         list of samples to keep with up to three columns, one line per output file:

                                       # Create two output files, the first sample is the basename
                                       # of the new file

                                       # Optional second column to rename the samples
                                       sample1           new-name2
                                       sample2,sample3   new-name2,new-name3

                                       # Optional third column to provide output file base name, use dash "-"
                                       # to keep sample names unchanged
                                       sample1           new-name1   output1
                                       sample2,sample3   -           output2

   -t, --targets REGION            similar to -r but streams rather than index-jumps
   -T, --targets-file FILE         similar to -R but streams rather than index-jumps
       --hts-opts LIST             low-level options to pass to HTSlib, e.g. block_size=32768

   # Split a VCF file
   bcftools +split input.bcf -Ob -o dir

   # Exclude sites with missing or hom-ref genotypes
   bcftools +split input.bcf -Ob -o dir -i'GT="alt"'

   # Keep all INFO tags but only GT and PL in FORMAT
   bcftools +split input.bcf -Ob -o dir -k INFO,FMT/GT,PL

   # Keep all FORMAT tags but drop all INFO tags
   bcftools +split input.bcf -Ob -o dir -k FMT
ADD COMMENTlink modified 3 months ago • written 3 months ago by William4.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1284 users visited in the last hour