Question: Output per variant and per sample heterozygosity fraction from VCF.
0
gravatar for William
4.4 years ago by
William4.7k
Europe
William4.7k wrote:

As a QC measure I would like to know the per variant and per sample heterozygosity fraction.

I already used vcftools to output the missingness per variant and sample.

https://vcftools.github.io/man_latest.html

Is there any tool that can do the same for heterozygosity?

qc vcf • 2.5k views
ADD COMMENTlink modified 4.4 years ago by Pierre Lindenbaum129k • written 4.4 years ago by William4.7k

Using vcftools --het ? According to the manual this option calculates a measure of heterozygosity on a per-individual basis.

ADD REPLYlink written 4.4 years ago by iraun3.8k

This functionality seems to be broken in my installed version of vcftools (v0.1.14 (= the latest?) ) . It says all variants and samples are included but outputs a file called out.het with a header " INDV O(HOM) E(HOM) N_SITES F " and no records. I am also still looking for the per site heterozygosity .

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by William4.7k
2
gravatar for Zev.Kronenberg
4.4 years ago by
United States
Zev.Kronenberg11k wrote:

Here is a VCFLIB solution:

  ./genotypeSummary -f ../samples/1kg-phaseIII-v5a.20130502.genotypes.chr22-16-16.5mb.vcf.gz -y GT -t 0,1,2,3,4

Here is the output:

sample-id   n-nocall    n-hom-ref   n-het   n-hom-alt
HG00096 0   5899    133 49
HG00097 0   5855    149 77
HG00099 0   5909    130 42
HG00100 0   5864    167 50
HG00101 0   5884    132 65
ADD COMMENTlink written 4.4 years ago by Zev.Kronenberg11k

I get an error about a unknown genotype. The file is produced by Freebayes. but might contain ./. and . as missing genotypes? genotypeSummary -f my_file.vcf.gz -y GT -t 0,1,2,3,4,5 FATAL: unknown genotype: .

Also I would like to do this for a large set of samples so a parsing for a range of targets might be usefull. i.e.: -t 1-100

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by William4.7k

William Thanks for reporting the error. I just fixed the genotype parser to allow both '.' and './.' for genotypes. Let me know if the most recent version generates the same error; I don't have a test file handy. The range parsing would be ideal. I'll look into that. Feel free to open an issue on github.

ADD REPLYlink written 4.4 years ago by Zev.Kronenberg11k
2
gravatar for Pierre Lindenbaum
4.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

using bioalcidae : https://github.com/lindenb/jvarkit/wiki/BioAlcidae

var sample2count={};
var samples = header.getSampleNamesInOrder();
for(var i=0;i< samples.size();i++) {
sample2count[samples.get(i)]={
    "nocall":0,
    "homref":0,
    "homvar":0,
    "hetnonref":0,
    "het":0,
    };
}

while(iter.hasNext()) {
    var ctx=iter.next();
    for(var i=0;i< samples.size();i++) {
        var count = sample2count[samples.get(i)];
        var  g = ctx.getGenotype(samples.get(i));
        if( !g.isCalled()) count["nocall"]++;
        else if( g.isHomRef()) count["homref"]++;
        else if( g.isHomVar()) count["homvar"]++;
        else if( g.isHetNonRef()) count["hetnonref"]++;
        else if( g.isHet()) count["het"]++;
        }
    }
 out.println("Sample nocall homref homvar hetnonref het");  
 for(var i=0;i< samples.size();i++)
 {
 var count = sample2count[samples.get(i)];
 out.println(samples.get(i)+" "+ count["nocall"]+" "+count["homref"]+" "+ count["homvar"]+" "+count["hetnonref"]+" "+count["het"]);
 }

run:

 curl -s "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" | gunzip -c  |\
head -n 1000 | \
java -jar ~/src/jvarkit-git/dist/bioalcidae.jar -f script.js -F vcf 

Sample nocall homref homvar hetnonref het
HG00096 0 729 4 0 14
HG00097 0 712 7 0 28
HG00099 0 715 2 0 30
HG00100 0 711 13 0 23
HG00101 0 708 16 0 23
HG00102 0 714 7 0 26
HG00103 0 720 10 0 17
HG00105 0 720 2 0 25
HG00106 0 704 12 0 31
HG00107 0 707 14 0 26
HG00108 0 719 17 0 11
HG00109 0 707 18 0 22
HG00110 0 726 10 0 11
HG00111 0 724 14 0 9
HG00112 0 712 18 0 17
HG00113 0 729 3 0 15
HG00114 0 724 6 0 17
HG00115 0 713 22 0 12
HG00116 0 718 9 0 20

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Pierre Lindenbaum129k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 689 users visited in the last hour