Question: How To Split Multiple Samples In Vcf File Generated By Gatk?
5.9 years ago by
United States
newDNASeqer650 wrote:

I did variant calling using BWA + Picard + GATK and have just got the filtered VCF files from GATK. While running GATK, I used a list of inputs (11 samples), and most steps produced only one output file. Now I have two VCF files (one for SNPs and the other for indels), each of which contains 11 samples. I can see the names of the 11 samples in the header of the VCF files, and each sample seems to have one column of data. So I am wondering how to split each VCF file into individual per-sample VCF files?

From my search, VCFtools seems to be able to split VCF files, but I could not find an example for splitting multiple samples. Can someone please help me? Thanks a lot

vcf gatk split • 19k views
modified 2.3 years ago by Biostar ♦♦ 20 • written 5.9 years ago by newDNASeqer650
4.8 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

I know this is an old question but, as I've already stated in this post, there's a very efficient way of doing this that hasn't been reported yet. hope it helps:

for file in *.vcf*; do
  # sample names are columns 10 onward of the #CHROM header line
  for sample in $(bcftools view -h "$file" | grep "^#CHROM" | cut -f10-); do
    bcftools view -c1 -Oz -s "$sample" -o "${file/.vcf*/.$sample.vcf.gz}" "$file"
  done
done

EDIT: bcftools query -l lists all samples, so the fastest loop would be the following:

for file in *.vcf*; do
  for sample in $(bcftools query -l "$file"); do
    bcftools view -c1 -Oz -s "$sample" -o "${file/.vcf*/.$sample.vcf.gz}" "$file"
  done
done
modified 2.1 years ago • written 4.8 years ago by Jorge Amigo11k

This one is at least 10 times faster than vcf-subset.

written 4.0 years ago by wqshi.nudt20

This one is great... You seriously made my day today. THANKS A LOT.

written 3.1 years ago by ste.arnoux0

You're the man! Best solution ever!

written 22 months ago by she.xinwei0

It works perfectly. Could you explain how the last part of the third line, '-o ${file/.vcf*/.$sample.vcf.gz} $file', works? I know it has nothing to do with bcftools and is just basic bash renaming, but I can't figure out how to google it (I mean, what to actually type in the search bar), and I find it quite useful for any script (so I get cleaner, clearer file names). I understand how it works in this case, but I'd like to learn a bit more.

Thank you

modified 21 months ago • written 21 months ago by Carles Borreda0

sure. since you will be generating a file per sample you could simply use -o $sample.vcf.gz, but I personally prefer to keep the original file name to know where that data came from. for that reason I use -o ${file/.vcf*/.$sample.vcf.gz}, which uses a bash string manipulation function to replace everything from .vcf to the end of the original file name with .$sample.vcf.gz. the result is a single file name for each sample that keeps the original file name (minus the .vcf or .vcf.gz extension) before the sample name.

written 21 months ago by Jorge Amigo11k
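
The substitution described above is bash pattern replacement, ${var/pattern/replacement}, documented in the bash manual under shell parameter expansion. A minimal sketch of what it does, assuming the made-up names cohort.snps.vcf.gz and S1 (neither comes from the thread):

```shell
#!/usr/bin/env bash
# Hypothetical names, for illustration only.
file="cohort.snps.vcf.gz"
sample="S1"

# ${file/.vcf*/REPLACEMENT} replaces the first match of the glob pattern
# ".vcf*" with REPLACEMENT. The "*" also swallows a trailing ".gz".
out="${file/.vcf*/.$sample.vcf.gz}"
echo "$out"        # prints cohort.snps.S1.vcf.gz

# Plain concatenation leaves the old extension in the middle of the name.
plain="${file}.${sample}.vcf.gz"
echo "$plain"      # prints cohort.snps.vcf.gz.S1.vcf.gz
```

Searching for "bash parameter expansion" or "bash pattern substitution" should turn up the relevant manual section.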

bash string manipulation

This is what I was looking for. Yeah, I understand what you are doing, but I didn't know it was possible. I was just using ${file}.${sample}.vcf.gz, but I ended up with a, which is quite annoying. I didn't know that it was possible to change $file on the fly. Now that I know, I'll add this to many of my scripts. Thank you!

written 21 months ago by Carles Borreda0
5.9 years ago by
Sydney, Australia
Neilfws48k wrote:

You want vcf-subset, with the -c option:

-c, --columns <string>           File or comma-separated list of columns to keep in the vcf file. If file, one column per row

So if your sample is named S1 and you want a VCF file for only that sample named S1.vcf:

vcf-subset -c S1 bigfile.vcf > S1.vcf

There are examples on the VCFtools documentation page, but they are unhelpfully labelled "Stripping columns".

written 5.9 years ago by Neilfws48k
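
To get one file per sample with vcf-subset, the single-sample command above could be wrapped in a loop over the sample names on the #CHROM header line. A rough sketch, assuming an uncompressed input VCF and vcf-subset on the PATH; the function names list_samples and split_with_vcf_subset are made up here, and bigfile.vcf is the placeholder name from the answer:

```shell
#!/usr/bin/env bash

# Print the sample names of an uncompressed VCF, one per line.
# Samples are the tab-separated columns from the 10th onward on the #CHROM line.
list_samples() {
  grep -m1 '^#CHROM' "$1" | cut -f10- | tr '\t' '\n'
}

# Write SAMPLE.vcf for every sample in the given VCF.
# Assumes vcf-subset (VCFtools Perl utilities) is installed.
split_with_vcf_subset() {
  local vcf="$1" sample
  list_samples "$vcf" | while IFS= read -r sample; do
    [ -n "$sample" ] || continue   # skip if the file has no sample columns
    vcf-subset -c "$sample" "$vcf" > "$sample.vcf"
  done
}

# Usage: split_with_vcf_subset bigfile.vcf
```

The loop reads names line by line rather than word-splitting them, so a sample name containing spaces would still be passed to -c as a single argument.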

Is the bcftools view faster than vcf-subset for subsetting since it uses the new htslib?


modified 4.8 years ago • written 4.8 years ago by smilefreak420

it should be. in fact, on the vcftools page you can read the following note: "A fast HTSlib C version of this tool is now available (see bcftools view)."

written 4.8 years ago by Jorge Amigo11k
5.9 years ago by
William4.4k wrote:

Use GATK SelectVariants -sn mySampleName to extract a single sample from a multiple sample vcf.

You can specify multiple sample names to keep, but if you want to extract each of the 11 samples to its own VCF file, you need to run the command 11 times.

I would try to stick to the GATK VCF postprocessing tools, because you also generated the VCF with GATK, and there is always the chance that another tool set interprets the same format differently. (And I don't trust tools written in Perl, but that is just my personal bias.)

written 5.9 years ago by William4.4k
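
Running SelectVariants once per sample is easy to script. The sketch below assumes the GATK 4 command-line style (a gatk wrapper script and -O for the output); GATK 3, which was current when this thread started, used java -jar GenomeAnalysisTK.jar -T SelectVariants with -R for the reference and -o for the output, so adjust to your version. The function name split_with_gatk and the file name cohort.vcf are hypothetical:

```shell
#!/usr/bin/env bash

# Run SelectVariants once per sample name given on the command line.
# Usage: split_with_gatk cohort.vcf S1 S2 S3 ...
split_with_gatk() {
  local vcf="$1" sample
  shift
  for sample in "$@"; do
    # GATK 4 style invocation (an assumption, see the note above).
    # Output name: input name with ".vcf*" replaced by ".SAMPLE.vcf".
    gatk SelectVariants -V "$vcf" -sn "$sample" \
      -O "${vcf/.vcf*/.$sample.vcf}" || return 1
  done
}
```

Some GATK versions may also require a reference (-R); check the SelectVariants documentation for the release you have installed.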

It is a good idea to use the same toolkit as much as possible to avoid things getting "lost in translation" between tools. Some tools out there do not enforce format specifications strictly enough and you can end up with files that are not compatible with other tools.

written 5.6 years ago by vdauwera940

Thank you! I was not aware that GATK SelectVariants can do this and was writing my own code.

Here are the updated links to the post:

written 2.2 years ago by DVA530
5.9 years ago by
Ashutosh Pandey11k wrote:

vcf-subset is what you need. But you may find this post from another forum helpful.

written 5.9 years ago by Ashutosh Pandey11k
5.9 years ago by
alexej.knaus120 wrote:

go to , upload your vcf file with multiple patients, let it preprocess and download single files, or analyze single or multiple vcf files.

written 5.9 years ago by alexej.knaus120