Question: How To Split Multiple Samples In Vcf File Generated By Gatk?
4
gravatar for newDNASeqer
6.8 years ago by
newDNASeqer680
United States
newDNASeqer680 wrote:

I did variant calling using BWA + PiCard + GATK and have just got the filtered VCF files from GATK. In the process of running GATK, I used list of inputs (11 samples) and for most steps, I had only one output file for each step. Now, I got two VCF files (one for SNPs and the other is for indels), each of which contains 11 samples. I can see the names of the 11 samples in the header of vcf files, and each sample seems to have one column of data. So I am wondering how to split each VCF files into individual sample vcf files?

From my search, vcf tools seems to have the capability of splitting vcf, but I could not find an example for splitting multiple samples. Can someone please help me? Thanks a lot

vcf gatk split • 21k views
ADD COMMENTlink modified 3.1 years ago by Biostar ♦♦ 20 • written 6.8 years ago by newDNASeqer680
15
gravatar for Jorge Amigo
5.6 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

I know this is an old question but, as I've already stated in this post, there's a very efficient way of doing this that hasn't been reported yet. hope it helps:

 for sample in `bcftools query -l multisample.vcf.gz`; do
   bcftools view -c1 -Oz -s $sample -o $sample.vcf.gz multisample.vcf.gz
 done

performance note: the above code is quite slow when dealing with hundreds or thousands of samples. if that'd be the case, I've developed this other perl-based code for very large multisample files, that splits a few thousands of exomes in less than 1 hour (approximately half of the time for the split process itself and the other half for the individual compressing) by storing a couple of temporal files that do not get that large at all considering the size of the original and the splitted files:

time zcat multisample.vcf.gz | perl -lane '
if (/^#/) { if (/^##/) { print STDERR } else {
 print STDERR join "\t", @F[0..8]; @samples = @F;
} } else {
 print STDERR join "\t", @F[0..8];
 for ($i = 9; $i <= $#F; $i++) {
  if ($F[$i] =~ /^..[1-9]/) {
   print STDOUT join "\t", $samples[$i], $lc, $F[$i];
} } } $lc++' 2>_vcf.common.txt | sort -k1,1 -k2,2n >_vcf.private.txt

mkdir -p split
time perl -lane 'BEGIN {
open IN, "_vcf.common.txt" or die $!;
chomp(@common = <IN>); foreach (@common) {
 if (/^##/) { $headers .= "$_\n" } else { $headers .= $_; last }
} close IN }
if ($F[0] ne $previousSample) {
 close OUT if $previousSample;
 open OUT, ">split/$F[0].vcf";
 print OUT "$headers\t$F[0]";
} $previousSample = $F[0];
print OUT "$common[$F[1]]\t$F[2]";
END { close OUT }' _vcf.private.txt

time for file in split/*vcf; do
 bgzip -f $file; tabix -fp vcf $file.gz
done
ADD COMMENTlink modified 5 weeks ago • written 5.6 years ago by Jorge Amigo11k
1

This one is at least 10 times faster than the vcf-subset

ADD REPLYlink written 4.8 years ago by wqshi.nudt20

This one is great... You seriously made my day todAY. THANKS A LOT.

ADD REPLYlink written 3.9 years ago by ste.arnoux0

You're the man! Best solution ever!

ADD REPLYlink written 2.7 years ago by she.xinwei0

It works perfectly. Could you explain how does the latest part of the third line, '-o ${file/.vcf/.$sample.vcf.gz} $file', works? I know it has nothing to do with the BCF and it is just basic bash renaming but I cant figure out how can I google it (I mean what to actually write in the search bar), and I found it quite useful for any script (so I got cleaner, clearer file names). I understand how it works in this case but I'll like to learn a bit more.

Thank you

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Carles Borreda0

sure. since you will be generating a file per sample you could just simply use -o $sample.vcf.gz, but I personally prefer to keep the original file name to know where that data came from. for that reason I use .o ${file/.vcf*/.$sample.vcf.gz}, which uses a bash string manipulation function to substitute anything from .vcf to the right of the original file name for .$sample.vcf.gz. the result is a single file name for each sample that keeps the original file name (but the .vcf or .vcf.gz extension) before the sample name.

ADD REPLYlink written 2.6 years ago by Jorge Amigo11k

bash string manipulation

This is what I was looking for. Yeah I understand what you are doing but I didn't know it was possible, I was just using ${file}.${sample}.vcf.gz, but I ended up with a foo.vcf.bar.vcf.gz.sample.vcf, which is quite annoying. I didn't know that it was possible to change $file on the fly. Not that I know I'll add this to may of my scripts. Thank you!

ADD REPLYlink written 2.6 years ago by Carles Borreda0

Hi I have one concern regarding this. When I download the individual samples using vcf view, I am getting the entire file. But with the above bcftools view method, the download does not contain Homozygous reference alleles. It does not display the 0/0 variants. Should I take some precaution for this? This is what is happening: For the following code ` vcf-subset -c NA21127 ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz > S1.vcf

Here is what the first five rows of output sample contains:

CHROM POS ID REF ALT_1 Genotype

0 22 16050075 rs587697622 A G 0

1 22 16050115 rs587755077 G A 0

2 22 16050213 rs587654921 C T 0

3 22 16050319 rs587712275 C T 0

4 22 16050527 rs587769434 C A 0

For the code:

bcftools view -c1 -Oz -s NA21127 -o Sample.vcf.gz ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

The out put contains the following first five rows:

CHROM POS ID REF ALT_1 Genotype

0 22 16052080 rs4965031 G A 1

1 22 16052167 rs375684679 A AAAAC 1

2 22 16052986 rs200777521 C A 1

3 22 16053444 rs80167676 A T 1

4 22 16053659 rs915675 A C 1

I even looked at the entire file separately by converting it to a dataframe, I noticed that it completely skipped the homozygous reference alleles.( I just converted the genotype column from 0/0, 0/1, 1/0 to 0,1,2 respectively just for my personal use. ) Is there anything wrong I am doing here?

ADD REPLYlink modified 7 months ago • written 7 months ago by mono0
1

the key is the -c1option, which filters variant lines containing less that 1 nonref allele, therefore it's supposed to be used to retrieve each sample's private variants and not to list reference homozigotes:

http://www.htslib.org/doc/bcftools.html#view

if you whish to have theses laters too in your output, just remove the -c1 option.

ADD REPLYlink modified 7 months ago • written 7 months ago by Jorge Amigo11k

that clears my doubt. thank you

ADD REPLYlink written 7 months ago by mono0
8
gravatar for Neilfws
6.8 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

You want vcf-subset, with the -c option:

-c, --columns <string>           File or comma-separated list of columns to keep in the vcf file. If file, one column per row

So if your sample is named S1 and you want a VCF file for only that sample named S1.vcf:

vcf-subset -c S1 bigfile.vcf > S1.vcf

There are examples on the VCFtools documentation page, but they are unhelpfully labelled "Stripping columns".

ADD COMMENTlink written 6.8 years ago by Neilfws48k

Is the bcftools view faster than vcf-subset for subsetting since it uses the new htslib?

 

ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by smilefreak420

it should be. in fact, on the vcftools page you can read the following note: "A fast HTSlib C version of this tool is now available (see bcftools view)."

ADD REPLYlink modified 7 months ago by RamRS27k • written 5.6 years ago by Jorge Amigo11k
4
gravatar for William
6.8 years ago by
William4.6k
Europe
William4.6k wrote:

Use GATK SelectVariants -sn mySampleName to extract a single sample from a multiple sample vcf.

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html

You can specify multiple sample names to keep, but if you want to extract all 11 to a separate vcf file you need to run the command 11 times.

I would try try to stick to the GATK vcf prostprocessing tools, because you also generated the vcf with GATK, and there is always the chance another tool set has a different interpretation of the same format. (And I don't trust tools written in Perl, but that is just my personal bias. )

ADD COMMENTlink modified 7 months ago by RamRS27k • written 6.8 years ago by William4.6k

It is a good idea to use the same toolkit as much as possible to avoid things getting "lost in translation" between tools. Some tools out there do not enforce format specifications strictly enough and you can end up with files that are not compatible with other tools.

ADD REPLYlink written 6.4 years ago by vdauwera960

Thank you! I was not aware of GATK selectvariants can do it and was writing my own code.

Here are the updated links to the post: http://gatkforums.broadinstitute.org/gatk/discussion/54/selecting-variants-of-interest-from-a-callset https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php

ADD REPLYlink written 3.1 years ago by DVA540
0
gravatar for Ashutosh Pandey
6.8 years ago by
Philadelphia
Ashutosh Pandey12k wrote:

vcf-subset is what you need. But you may find this post from another forum helpful.

http://seqanswers.com/forums/showthread.php?t=24711

ADD COMMENTlink written 6.8 years ago by Ashutosh Pandey12k
0
gravatar for alexej.knaus
6.7 years ago by
alexej.knaus120
Berlin
alexej.knaus120 wrote:

go to www.Gene-Talk.de , upload your vcf file with multiple patients, let it preproocess and download single files, or analyze singe or multiple vcf files.

ADD COMMENTlink written 6.7 years ago by alexej.knaus120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1771 users visited in the last hour