Question: add coordinates to line of output extracted from vcf
0
gravatar for cmccabe
17 months ago by
cmccabe180
Chicago
cmccabe180 wrote:

I am trying to extrapolate/convert variants from a vcf file to genotype. I am close but can not seem to add the chr and start of each line to the output. Maybe there is a better way? Thank you :).

#!/usr/bin/perl   # execute perl script

 use strict;
 use warnings;
 use Vcf;

 my $filename = $ARGV[0];

 open ( my $handle, "<", $filename);
 my $vcf = Vcf->new(fh=>$handle);
 $vcf->parse_header();
 vcf_iterate();

 sub vcf_iterate
 {
 while ( my $x=$vcf->next_data_hash() )
 {
 foreach my $sample ( keys $$x{gtypes} )
 {
 print "SAMPLE=$sample, genotype_raw=$$x{gtypes}{$sample}{GT}, ";
 my $decoded_genotype = decode_genotype($x, $sample);
 print "decoded_genotype=$decoded_genotype\n";
   }
  }
}

sub decode_genotype
{
my ( $x, $sample ) = ( $_[0], $_[1] );
my $gt = $vcf->decode_genotype($$x{REF}, $$x{ALT}, $$x{gtypes}{$sample}{GT}); # returns 'G/G'
return $gt;
 }

current output:

SAMPLE=MEC1, genotype_raw=0/1, decoded_genotype=G/A
SAMPLE=MEC1, genotype_raw=1/1, decoded_genotype=G/G
SAMPLE=MEC1, genotype_raw=0/1, decoded_genotype=T/C
SAMPLE=MEC1, genotype_raw=0/1, decoded_genotype=A/G
SAMPLE=MEC1, genotype_raw=0/1, decoded_genotype=T/C

desired output:

             $1  tab   $2,
SAMPLE=MEC1, chr1   949608, genotype_raw=0/1, decoded_genotype=G/A
SAMPLE=MEC1, chr1   949654, genotype_raw=1/1, decoded_genotype=G/G
SAMPLE=MEC1, chr1   977330, genotype_raw=0/1, decoded_genotype=T/C
SAMPLE=MEC1, chr1   981931, genotype_raw=0/1, decoded_genotype=A/G
SAMPLE=MEC1, chr1   982994, genotype_raw=0/1, decoded_genotype=T/C

vcf:

##fileformat=VCFv4.1
....
....
....
##INFO=<ID=OREF,Number=.,Type=String,Description="List of original reference bases">
##INFO=<ID=OALT,Number=.,Type=String,Description="List of original variant bases">
##INFO=<ID=OMAPALT,Number=.,Type=String,Description="Maps OID,OPOS,OREF,OALT entries to specific ALT alleles">
##deamination_metric=0.183561
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  MEC1
chr1    949608  .   G   A   852.727 PASS    .   .
chr1    949654  .   A   G   3815.58 PASS    .   .
chr1    977330  .   T   C   261.147 PASS    .   .
chr1    981931  .   A   G   352.821 PASS    .   .
chr1    982994  .   T   C   496.098 PASS    .   .
genotype vcf • 723 views
ADD COMMENTlink modified 17 months ago • written 17 months ago by cmccabe180
1

Why are you doing this from scratch? You can use bcftools query to do this, and they have predefined format labels (%GT, %TGT, etc)

ADD REPLYlink modified 17 months ago • written 17 months ago by RamRS21k

I was not aware bcftools could do this. I ran:

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' NA12878s5.vcf.gz | head 5

chr1    949608  G   A   MEC1=0/1
chr1    949654  A   G   MEC1=1/1
chr1    977330  T   C   MEC1=0/1
chr1    981931  A   G   MEC1=0/1
chr1    982994  T   C   MEC1=0/1

Can the raw GT be coverted to actual genotype? Thank you :).

chr1    949608  G   A   MEC1=G/A
chr1    949654  A   G   MEC1=G/G
chr1    977330  T   C   MEC1=T/C
chr1    981931  A   G   MEC1=A/G
chr1    982994  T   C   MEC1=T/C

Thank you :).

ADD REPLYlink written 17 months ago by cmccabe180

The %TGT I mentioned might help

ADD REPLYlink written 17 months ago by RamRS21k

Amazing, works great.... thank you, really nice tool :)

ADD REPLYlink written 17 months ago by cmccabe180
1
gravatar for Pierre Lindenbaum
17 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar dist/bioalcidaejdk.jar -e 'stream().forEach(V->{println(V.getContig()+"\t"+V.getStart()+"\t"+V.getReference().getDisplayString()+"\t"+V.getAlternateAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining(","))+"\t"+V.getGenotypes().stream().map(G->G.getSampleName()+"="+G.getAlleles().stream().filter(A->A.isCalled()).map(A->A.getDisplayString()).collect(Collectors.joining("/"))).collect(Collectors.joining("\t")));});' src/test/resources/test_vcf01.vcf

1   956852  C   T   S1= S2=T/T  S3=C/T  S4=C/T  S5=T/T  S6=C/T
1   959155  G   A   S1= S2=G/A  S3=G/A  S4=G/A  S5=A/A  S6=G/A
1   959169  G   C   S1= S2=G/C  S3=G/C  S4=G/C  S5=C/C  S6=G/C
1   959231  G   A   S1= S2=G/A  S3=G/A  S4=G/A  S5=A/A  S6=G/A
1   960409  G   C   S1= S2=G/C  S3=G/C  S4=G/C  S5=C/C  S6=G/C
1   962210  A   G   S1=A/G  S2=A/G  S3=A/G  S4=A/G  S5= S6=A/G
1   962606  G   A   S1= S2=G/A  S3=G/A  S4=G/A  S5=A/A  S6=G/A
1   962891  C   T   S1= S2=C/T  S3=C/T  S4=C/T  S5=T/T  S6=C/T
ADD COMMENTlink written 17 months ago by Pierre Lindenbaum119k

Times like these, I miss C#'s properties :-)

ADD REPLYlink written 17 months ago by RamRS21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 661 users visited in the last hour