Question: Comparing genotypes from NA12878 to those of her parents (NA12891 and NA12892)
sichan (Canada) wrote, 4.8 years ago:

Hello,

I'm interested in comparing the genotypes from Genome in a Bottle's NA12878 (GIAB) to those of her parents (NA12891 and NA12892).

I downloaded GIAB's NA12878 vcf from here:
ftp://ftp-trace.ncbi.nih.gov/giab/ftp/release/NA12878_HG001/latest/NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz

After a lot of searching around, I found this page from Broad describing a vcf containing the genotypes for the trio:
http://gatkforums.broadinstitute.org/discussion/1292/which-datasets-should-i-use-for-reviewing-or-benchmarking-purposes

And I downloaded the variants for NA12891 and NA12892 from here:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20140625_high_coverage_trios_broad/CEU.wgs.consensus.20131118.snps_indels.high_coverage_pcr_free_v2.genotypes.vcf.gz

In total, there are ~3.3 million variants in the GIAB vcf. I compared the alternate alleles and genotypes in the GIAB vcf with the corresponding values in her parents and found that at ~27% of the positions the parental genotypes were inconsistent with Mendelian inheritance.

For example, a position in the daughter is genotyped as 1/1, but the father is 0/1 and the mother is 0/0. The daughter cannot be 1/1 in that case, because the mother has no alternate allele to pass on.
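The check described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the comparison: the function names are hypothetical, it assumes diploid autosomal genotypes, and it assumes the allele indices mean the same thing in the child and parent records (that is, the same REF/ALT representation at the same position). De novo mutations are reported as inconsistent.

```python
from itertools import product


def parse_gt(gt):
    """Split a VCF GT string such as '0/1' or '1|0' into allele strings."""
    return gt.replace("|", "/").split("/")


def is_mendelian_consistent(child, father, mother):
    """Return True if the child's genotype can be formed by taking one
    allele from the father and one from the mother.

    Returns None when any genotype is missing ('./.') or not diploid,
    since no decision can be made in that case.
    """
    c, f, m = parse_gt(child), parse_gt(father), parse_gt(mother)
    if any(len(g) != 2 or "." in g for g in (c, f, m)):
        return None
    target = sorted(c)
    # Try every (paternal allele, maternal allele) combination.
    return any(sorted(pair) == target for pair in product(f, m))


# The example from the question: daughter 1/1, father 0/1, mother 0/0.
print(is_mendelian_consistent("1/1", "0/1", "0/0"))
```

For the example in the question this reports an inconsistency, because no combination of one paternal and one maternal allele yields 1/1.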

I'm aware that the GIAB vcf has gone through a lot more curation than those of her parents, so perhaps that accounts for the discrepancy?  

I'm pretty sure I'm using the correct files, but if anyone thinks otherwise, please let me know.

Thank you.

Tags: snp, next-gen, genome

You need to make sure to restrict your analyses to the high confidence regions provided in the NIST bed file. 

written 4.8 years ago by Zev.Kronenberg

According to the README from Genome in a Bottle, the VCF contains high-confidence heterozygous and homozygous variant calls, which implies that those variants lie within the high-confidence regions.
Any position in the confident BED file but not the VCF can be confidently treated as homozygous reference.

ftp://ftp-trace.ncbi.nih.gov/giab/ftp/release/NA12878_HG001/latest/README.GIAB.v0.2.txt

As a quick sanity check, I previously used bedtools to confirm that there were zero positions in the VCF that were not in the confident BED file.
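A rough Python sketch of the same idea, for anyone who wants to do the lookup without bedtools. The function names and the `calls` dictionary are hypothetical, and it assumes the BED intervals do not overlap. It also simplifies by looking only at the exact position, so it does not account for an upstream deletion that spans the position.

```python
from bisect import bisect_right


def load_bed(lines):
    """Parse BED lines into {chrom: sorted list of (start, end)}.

    BED coordinates are 0-based and half-open.
    """
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions


def in_confident_region(regions, chrom, pos):
    """pos is a 1-based VCF position; convert to 0-based before comparing."""
    intervals = regions.get(chrom, [])
    # Index of the last interval whose start is <= pos - 1.
    i = bisect_right(intervals, (pos - 1, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos - 1 < intervals[i][1]


def confident_genotype(regions, calls, chrom, pos):
    """calls maps (chrom, pos) -> GT string taken from the VCF.

    Returns the called genotype, '0/0' for an uncalled position inside a
    confident region, or None for a position outside the confident regions.
    """
    if not in_confident_region(regions, chrom, pos):
        return None
    return calls.get((chrom, pos), "0/0")


regions = load_bed(["chr1\t100\t200", "chr1\t300\t400"])
calls = {("chr1", 150): "0/1"}
print(confident_genotype(regions, calls, "chr1", 160))  # uncalled but confident
```

Positions that return None should simply be left out of the trio comparison rather than counted as consistent or inconsistent.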

 

written 4.8 years ago by sichan
Len Trigg (New Zealand) wrote, 4.8 years ago:

Besides the better Mendelian consistency you should get by restricting to the high confidence regions, much of the discrepancy may be due to differences in variant representation between the NIST set and what is produced by GATK. As noted here:

IMPORTANT NOTE: Some differences between the integrated calls and your datasets are likely due to different representations of the same complex variants, so be careful about this. In our experience, for some datasets, over half of the putative false positive snps and indels can be due to different correct representations of complex variants. Running vcflib vcfallelicprimitives on your vcf should allow proper comparison of all homozygous complex variants, but not all heterozygous complex variants since our calls are currently unphased. Real Time Genomics has freely released their vcf comparison algorithm vcfeval, which can properly compare most unphased heterozygous complex variants. Currently for complex variants, our calls generally use the representation from Real Time Genomics caller.

(RTG Tools includes the vcfeval tool, which compares a call set against a baseline while handling these representational difficulties, and also a separate tool for flagging Mendelian violations as you have been doing, but AFAIK there isn't something that does both together.)

 

written 4.8 years ago by Len Trigg

Len Trigg has a great point about variant normalization. This is especially important for indels. Here is another way to normalize variants: http://genome.sph.umich.edu/wiki/Variant_Normalization
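To make the idea concrete, here is a minimal sketch of left-aligning and trimming a single biallelic variant: trim shared trailing bases, extend to the left whenever an allele becomes empty, then trim shared leading bases. The function name and the toy reference are assumptions for illustration, and the sketch does not handle multi-allelic records or variants that would need to shift past the first base of a chromosome.

```python
def normalize(ref_seq, pos, ref, alt):
    """Left-align and trim one biallelic variant.

    ref_seq: reference sequence of the chromosome as a string.
    pos:     1-based position of the first base of `ref`.
    Returns (pos, ref, alt) in normalized form.
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    alleles = [ref, alt]
    changed = True
    while changed:
        changed = False
        # Trim the last base while both alleles end with the same base.
        if all(alleles) and alleles[0][-1] == alleles[1][-1]:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If an allele is now empty, pull in one reference base on the left.
        if not all(alleles):
            if pos == 1:
                raise ValueError("cannot extend left of the first base")
            pos -= 1
            alleles = [ref_seq[pos - 1] + a for a in alleles]
            changed = True
    # Trim shared leading bases, keeping at least one base per allele.
    while min(len(a) for a in alleles) >= 2 and alleles[0][0] == alleles[1][0]:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles[0], alleles[1]


toy_ref = "GGGCACACACAGGG"
# Two different ways of writing the same 2-bp deletion in the CA repeat.
print(normalize(toy_ref, 8, "CAC", "C"))
print(normalize(toy_ref, 3, "GCACA", "GCA"))
```

Both inputs describe the same deletion in the CA repeat, and both come out as the same left-aligned record, which is what makes position-based comparison between call sets possible for simple indels.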

written 4.8 years ago by Zev.Kronenberg

Normalization does help somewhat, but there are still plenty of problematic situations with complex variants where you need to go beyond that -- the endgame involves replaying the variants into the reference so that comparisons are carried out at the local haplotype level. See: http://www.slideshare.net/GenomeInABottle/140127-rtg-vcfeval-vcf-comparison-tool

AFAIK, only RTG vcfeval and possibly the Java version of SMaSH do this with any degree of sophistication.
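As a toy illustration of what "replaying variants into the reference" means, the sketch below applies a set of variants to a reference string and compares the resulting haplotypes. It is not the vcfeval algorithm: the function name is hypothetical, it handles a single haplotype with non-overlapping variants, and it does not search over phasings of unphased heterozygous calls, which is the hard part a real tool has to solve.

```python
def apply_variants(ref_seq, variants):
    """Apply non-overlapping variants to ref_seq and return the haplotype.

    variants: iterable of (pos, ref, alt) with 1-based positions.
    """
    out = []
    cursor = 0  # 0-based index of the next unconsumed reference base
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError("overlapping variants")
        if ref_seq[start:start + len(ref)] != ref:
            raise ValueError("REF allele does not match the reference")
        out.append(ref_seq[cursor:start])  # untouched reference bases
        out.append(alt)                    # the variant allele
        cursor = start + len(ref)
    out.append(ref_seq[cursor:])
    return "".join(out)


toy_ref = "GGGCACACACAGGG"
# One complex (MNP-style) record versus two separate SNP records.
one_record = [(4, "CAC", "TAG")]
two_records = [(4, "C", "T"), (6, "C", "G")]
print(apply_variants(toy_ref, one_record) == apply_variants(toy_ref, two_records))
```

The two call sets share no identical record, so a record-by-record comparison would report three mismatches, yet they produce the same sequence once replayed.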

 

written 4.8 years ago by Len Trigg