Question

Phasing a trio in multi-sample vcf with zygosity

2

Entering edit mode

3.9 years ago

ellieuk ▴ 40

Hi there,

I'd be grateful for some help please.

I've got several joint-called trio VCFs (unaffected parents and proband) which I'm analysing. The multi-sample vcf nicely shows 0/0, 0/1, 1/1 for each individual per variant. However, I'd like to convert the notation to three columns of simply het, hom and ref to make it much easier to analyse.

I've checked out the annovar script convert2annovar.pl --withzyg, however this only works for individual sample vcfs which isn't what I need. I'd like all the samples together in the same multi-sample vcf but with zygosity stated for each variant.

Any help or suggestions would be greatly appreciated. E

genome • 1.4k views

ADD COMMENT • link updated 3.9 years ago by Pierre Lindenbaum 161k • written 3.9 years ago by ellieuk ▴ 40

0

Entering edit mode

t. However, I'd like to convert the notation to three columns of simply het, hom and ref to make it much easier to analyse.

sounds like a xyz problem. What are you trying to do?

ADD REPLY • link 3.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi Pierre, I’m trying to get an annotated annovar file that includes all samples (mum, dad, and proband) but instead of having the genotypes as 1/1, 1/0, 0/0 etc, I would like 3 columns for mum, dad, and proband in the same file but with het, hom or ref.

Currently annovar allows you to print zygosity with (—withzyg) however, this only works for the first sample in the multi vcf, unless you include all samples in which case you end up with 3 separate files. I do not want 3 separate annotated files, I’d like everything together.

The idea is that in the final annotated file, I can quickly and easily filter each individual by het/hom/ref etc to suit the pattern of inheritance I’m expecting.

Hope that helps!

ADD REPLY • link 3.9 years ago by ellieuk ▴ 40

0

Entering edit mode

The idea is that in the final annotated file, I can quickly and easily filter each individual by het/hom/ref etc to suit the pattern of inheritance

so that's a XYZ problem.

What is this pattern of inheritance ?

ADD REPLY • link 3.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I want to be able to filter for all! We don’t know. It could be de novo, dominant and with incomplete penetrance, recessive, X linked etc. The idea is that I can filter for all inheritance patterns if I have the ref/het/hom for all individuals simply laid out in one single annotated file.

ADD REPLY • link 3.9 years ago by ellieuk ▴ 40

0

Entering edit mode

so it means that you only need the correct expression for bcftools view http://samtools.github.io/bcftools/bcftools.html#expressions or gatk select variants https://gatk.broadinstitute.org/hc/en-us/articles/360035891011-JEXL-filtering-expressions . There is no need to convert the genotypes.

ADD REPLY • link 3.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi Pierre,

Thanks for your response. The problem is that this doesn’t work for the student I’m supervising who can’t use the command line. They need to work on a flat csv file such as in excel to filter on inheritance patterns. We need a multi-sample Annovar output that includes all variants as het/hom/ref instead of as usual GT output from a vcf.

ADD REPLY • link 3.9 years ago by ellieuk ▴ 40

0

Entering edit mode

enter image description here

ADD REPLY • link 3.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

laughing emoji

I can use R, but my student can't!

ADD REPLY • link 3.9 years ago by ellieuk ▴ 40

score 2 · Accepted Answer · 2020-05-25

2

Entering edit mode

3.9 years ago

Pierre Lindenbaum 161k

Here is a solution using bioalcidaejdk. http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar src/jvarkit-git/dist/bioalcidaejdk.jar -e 'stream().forEach(V->println(V.getContig()+":"+V.getStart()+" "+V.getGenotypes().stream().map(G->G.getType().name()).collect(Collectors.joining(" "))));' input.vcf

RF01:970 HOM_REF HOM_REF HOM_REF HOM_REF HOM_VAR
RF02:251 HOM_REF HET HET HOM_REF HOM_REF
RF02:578 HOM_REF HOM_REF HOM_REF HOM_VAR HOM_REF
RF02:877 HET HOM_REF HOM_REF HOM_REF HOM_REF
RF02:1726 HOM_REF HET HET HOM_REF HOM_REF
RF02:1962 HET HOM_REF HOM_REF HOM_REF HOM_REF
RF03:1221 HOM_REF HOM_VAR HOM_VAR HOM_REF HOM_REF
RF03:1242 HOM_REF HOM_REF HOM_REF HOM_VAR HOM_REF
RF03:1688 HOM_REF HOM_REF HOM_REF HOM_REF HOM_VAR
RF03:1708 HOM_REF HOM_REF HOM_REF HOM_REF HOM_VAR

ADD COMMENT • link 3.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks you very much, worked beautifully!

ADD REPLY • link 3.9 years ago by ellieuk ▴ 40

1

Entering edit mode

if that worked, please validate the answer by clicking on the green mark on the left.

ADD REPLY • link 3.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Will do. As an addition (in case it helps for anyone else) I've been working on an alternative solution using a perl script. A further option would be to use:

convert2annovar.pl -format vcf4 example/ex2.vcf -outfile ex2 -allsample

That will add zygosities and split a multi-sample vcf into individual samples. You can then use the following perl script:

#!/usr/bin/perl; 
use strict; use v5.8.8;
##List all .var files in directory
my $path = '.'; #Reads same directory as running .pl file
opendir(DIR, $path);
my @files = grep { /\.var$/ } readdir(DIR); #Reads names of .var files into @files
@files = sort @files;
my @samples;
foreach (@files) {
if ($_ =~ /\_/) {
    $_ =~ /(^.+?)\_/;
    push @samples, $1; 
} else {
    $_ =~ /(^.+)\.var/;
    push @samples, $1;
}}
print "@samples " . ".\n";
closedir(DIR);
@files = sort @files; 
my $file_count = @files;
print "\nRunning on $file_count files:\n";
print "@files ";
##Make unique list of variants observed in all opened .var files
my @variants; ##Create concatenated list of all variant locations and alleles
foreach my $file (@files) {
open VAR, $file;
chomp (my @var = <VAR>);
while ($var[0] =~ /^SAMPLE ID\t/) {
    shift @var;
}
foreach (@var) {
    my @fields = split /\t/, $_;
    my $locus = join "\t", @fields[1..5]; #first 5 columns of annovar file (chr, LBP, RBP, ref, alt)
    push @variants, $locus;
}
close VAR;
} 
my %variants_hash = map { $_, 1 } @variants; ##Generates the unique list
my @variants_list = keys %variants_hash;
%variants_hash = undef;
@variants_list = sort @variants_list; #List of just loci
my %variants_geno; #List of loci to which genotypes will be appended
my %variants_info; # List of variant information to append after genotypes for printing.
my $sum_variants = @variants_list;
print "\nRunning on $sum_variants non-redundant variants.\n";
#Pull out genotypes for all of the non-redundant variants for each individual
my @variant_geno; #Pointless array for anonymous assignation of arrays in hash
foreach my $file (@files) {
open VAR, $file;
my @var = <VAR>;
foreach my $variant (@variants_list) {
    my $found = 0;
    my $genotype = 'REF?';
    foreach my $line (@var) {
        if ($line =~ /$variant/) {
            my @fields = split /\t/, $line;
            if ($found == 0) {
                $variants_info{$variant} = (join "\t", @fields[9..25]); ## this joins remaining columns of annovar file (or as many as desired)
                $found ++;
            }
            if ($fields[8] eq 'HET') {
                $genotype = 'HET';
            }
            if ($fields[8] eq 'HOM') {
                $genotype = 'HOM';
            }
            last;
        }
    }
    if ($variants_geno{$variant} =~ /./) {
        push @{$variants_geno{$variant}}, $genotype;
    } else {
        @{$variants_geno{$variant}} = [@variant_geno];
        push @{$variants_geno{$variant}}, $genotype;
    }   
}
close VAR;
}
open OUT, ">$ARGV[0]" . '_merge.vars';
my $samples_list = join "\t", @samples; ##re below, include the headers for the matching number of columns from annotated file
print OUT "CHR  LBP RBP Ref Alt $samples_list   Exonic? Gene    Variant type    Variant info    Novel or Clinical   phylop,1-sift,polyphen2,lrt,mutationtaster,gerp++   Grantham score  dbSNP Nonflagged    dbSNP   cg46    1000g   esp5400_ea  Soton disease   Soton non disease   Soton somatic   Mutability  False+ve Fuentes etal 12\n";
foreach my $variant_ident (@variants_list) {
shift @{$variants_geno{$variant_ident}};
my $genotypes = join "\t", @{$variants_geno{$variant_ident}};
print OUT "$variant_ident   $genotypes  $variants_info{$variant_ident}\n";
}
close OUT;

Happy for anyone to streamline the code!! E

ADD REPLY • link 3.9 years ago by ellieuk ▴ 40