Question

Extracting Information From Vcf File About Heterozygous,Homozygous From Gatk Pipeline And Comparing It With Other Outputs

1

Entering edit mode

10.7 years ago

skm770 ▴ 150

Hi

We have analyzed some data-sets for genome analysis using a couple of different pipelines : bwa-GATK, GSNAP-Samtools and for few soybean genomes. Although the final set of output from both the pipeline is in vcf-format which is a widely accepted format for further downstream analysis of SNP-effect prediction etc our collaborators want to get the output in different format that they have generated in the past. This is how that output looks like.

chr position REFERENCE IA3023_ALLELE J105_3_4_ALLELE M20_2_5_2_ALLELE CL0J095_4_6_ALLELE CL0J173_6_8_ALLELE

Gm01 975 G G G G G G

Gm01 3135 T H T T T T

Gm01 3727 T T T T T -

Gm01 3745 G G - - - -

Gm01 3748 G H - - G G

Here - (dash) indicates positions at which there was no reads mapped for that particular genotype. H,V,K etc denote ambiguous nucleotides. I am a newbie when it comes to vcf files as I have not perused all the information available at http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 but I would be interested to know How can I generate something similar using the vcf files and what tools exist to do that already if any. I have both single-sample and multi-sample vcf files. vcf-tools has a lot of options for getting info from vcf files but its documentation is rather scanty so I am not sure if it would fit the bill thus I would really appreciate any insights from folks out there who have done something similar

BTW I just found out vcf-to-tab program in vcf-tools gives the output very close to what I want.

vcf-to-tab <input.vcf > out.tab

'#CHROM POS REF HN001 HN002 HN003 HN004 HN005 HN006 HN007 HN008 HN009 HN010 HN011 HN012 HN013 HN014 HN015 HN018 HN019 HN020 HN021 HN022 HN023 HN024 HN026 HN027 Gm01 5 T T/T T/G T/G ./. T/T T/G T/T ./. T/T ./. T/G T/T T/G T/G T/G T/G T/G T/G T/T T/G T/T T/T T/G T/T Gm01 8 T G/G T/G T/G ./. T/T T/G T/T ./. T/G ./. T/G T/G T/G T/G T/G T/G T/G T/G G/G T/G T/T G/G T/G T/G Gm01 13 G G/A G/G G/G G/G G/G G/G G/G ./. G/G A/A G/G G/G G/G G/G G/G G/G G/G G/G G/A G/G G/A G/A G/G G/A '

gatk vcf • 7.0k views

ADD COMMENT • link 10.7 years ago by skm770 ▴ 150

score 2 · Answer 1 · 2013-08-18

2

Entering edit mode

10.7 years ago

Alex Paciorkowski 3.5k

It looks like your collaborators extracted specific columns from a multisample vcf, if they find that useful. I suppose having an extract like this gets to the bottom line if all you care about are the genotypes per sample, but it leaves out a lot of quality metrics, annotations that also can be helpful. But you can do something like this:

#!/usr/bin/perl
# use strict;
# use warnings;
my $file = shift;
open my ( $F ), $file;
LINE: while ($_=<$F>) {
     my @line = split /\t/;
     my @chr = $line[0];
     my @position = $line[1];
     my @reference = $line[2];
     my @IA3023_ALLELE = $line[9];  # or whatever column your sample variant calls start at
     ...   # lines for your other samples here
     my @CL0J173_6_8_ALLELE = $line[13]; # your last sample

     my $printme = 0;
     ++$printme if @chr;
     print join(qq/\t/, @line) if $printme;
}

print STDERR "Done.\n";

This should give you output that looks like this:

chr position REFERENCE IA3023_ALLELE J105_3_4_ALLELE M20_2_5_2_ALLELE CL0J095_4_6_ALLELE CL0J173_6_8_ALLELE
Gm01     975     G     0/0     0/0     0/0     0/0     0/0     0/0
...
Gm01    3745    G     0/0     .     .     .     .

Which is pretty close to what you want. I'll leave it to you to substitute in the actual nt's for the allele calls.

ADD COMMENT • link 10.7 years ago by Alex Paciorkowski 3.5k

0

Entering edit mode

Yeah Thanks I had figured that much out I don't know how to substitute actual nt's for the allele calls. I am going through the vcf specs rt now to determine that thanks

ADD REPLY • link 10.7 years ago by skm770 ▴ 150

0

Entering edit mode

If you keep the reference and alt columns you can convert easily between the numbers and the actual nucleotides for the allele calls. You may run in to some difficulties when you have multiple alternative alleles at that position though. If you are comfortable coding in Python you may find the PyVCF package easier to work with to convert your vcf files in to the format you want. I use it routinely to parse VCF files into tab delimited variant reports for collaborators.

ADD REPLY • link 10.7 years ago by DG 7.3k

0

Entering edit mode

Thanks for the reply. I was wondering that only way one could have H in the variant call is when there is a triploid call at that loci for that particular genotype. Let me know if I am correct.

ADD REPLY • link 10.7 years ago by skm770 ▴ 150

score 0 · Answer 2 · 2013-08-18

0

Entering edit mode

10.7 years ago

William ★ 5.3k

Tell your collaborators that it is 2013 and that they should use the vcf format for SNPs and indels.

If they have an outdated tertiary analyses pipeline that they didn't update to have vcf as an input format that is their problem not yours.

Or find a new collaborator that did keep up with recent developments.

ADD COMMENT • link 10.7 years ago by William ★ 5.3k

0

Entering edit mode

I'll bet the OPs collaborators were working from a vcf. It looks like they found it useful -- perhaps for a further downstream analysis -- to extract out certain columns from a vcf. We do this all the time. Additionally, lots of labs -- including sequencing cores -- generate all kinds of non-standard vcf files. Dealing with this certainly is our problem. And collaborators shouldn't be so easily discarded.

ADD REPLY • link 10.7 years ago by Alex Paciorkowski 3.5k