Question: Why is 0/0 homozygous reference call in VCF useful information?
20 months ago by
lillian.thistlethwaite30 wrote:

I have a set of patient VCF files, and I am looking at the genotypes of each variant called. I observe the three expected categories:

0/0 : homozygous for the reference allele
0/1 : heterozygous (one ref allele, one alt allele)
1/1 : homozygous for the alternate allele

My question is: what is the point of calling a homozygous reference genotype? Is this only done when there is low confidence in the variant call, so as to say "hey, this could be a different genotype, but here's what we got given the data quality"?

What would I lose if I filtered all rows in my VCFs out with the genotype listed as "0/0"?

sequencing variant-calling vcf • 1.9k views
ADD COMMENTlink modified 19 months ago by Biostar ♦♦ 20 • written 20 months ago by lillian.thistlethwaite30

How many samples per VCF file?

ADD REPLYlink written 20 months ago by Pierre Lindenbaum129k

Just one sample per file.

ADD REPLYlink written 20 months ago by lillian.thistlethwaite30

I like to know the difference between "we have confident data and see hom-ref" and "no-data". Your single-sample VCF is only showing mutant sites, but I often query a VCF at a given set of coordinates, and I want to know which of four situations is going on at each one: 0/0, 0/1, 1/1, or no-data.
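A minimal sketch of that kind of lookup, assuming a single-sample VCF whose lines have already been read as text. The function names `load_genotypes` and `genotype_at` and the example records are made up for illustration; only the standard tab-separated VCF column layout is relied on.

```python
def load_genotypes(vcf_lines):
    """Map (chrom, pos) -> raw GT string for a single-sample VCF."""
    calls = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        format_keys = fields[8].split(":")    # e.g. GT:DP
        sample_values = fields[9].split(":")  # e.g. 0/1:25
        calls[(chrom, pos)] = sample_values[format_keys.index("GT")]
    return calls


def genotype_at(calls, chrom, pos):
    """Return the genotype with sorted alleles, or 'no-data'."""
    gt = calls.get((chrom, pos))
    if gt is None or "." in gt:
        return "no-data"  # site absent from the file, or an explicit no-call
    alleles = sorted(gt.replace("|", "/").split("/"), key=int)
    return "/".join(alleles)


# Hypothetical example records, not taken from any real file.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tpatientA",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:30",
    "chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT:DP\t1|0:25",
    "chr1\t300\t.\tG\tA\t70\tPASS\t.\tGT:DP\t1/1:40",
]
calls = load_genotypes(example)
for pos in (100, 200, 300, 400):
    print(pos, genotype_at(calls, "chr1", pos))
```

Position 400 is not in the file at all, so the only honest answer there is "no-data"; it cannot be read as 0/0.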

ADD REPLYlink written 20 months ago by karl.stamm3.6k

Ah, so I think what both Pierre and Karl are saying is that "0/0" helps to rule OUT a disease if you know the variant is found at location X and patientA has a "0/0" genotype at location X.

However, if you are looking for patterns in an unsupervised fashion, and want to look for variants that you can then attribute to a patient's phenotype, filtering out all "0/0"'s makes sense, because you want to prioritize the 0/1 and 1/1 calls.

Did I represent what you are saying correctly?

Also, I assume "no data" in a VCF is just the "absence" of information for a particular locus?

ADD REPLYlink written 20 months ago by lillian.thistlethwaite30

And how did you generate the VCF?

ADD REPLYlink written 20 months ago by Pierre Lindenbaum129k

I didn't generate the VCF. It was given to me by a collaborator. What decisions regarding generation would affect the answer to this question?

We used Atlas-SNP2 v1.4.3, with the following flags (removed the input and output default/required flags) and filters:

-F -y 6 -s --Illumina -f 3500

##FILTER=<ID=low_snpqual,Description="SNP posterior probability lower than 0.95">
##FILTER=<ID=low_VariantReads,Description="Number of variant reads is less than 3">
##FILTER=<ID=low_VariantRatio,Description="Variant read ratio is less than 0.1">
##FILTER=<ID=single_strand,Description="All variant reads are in a single strand direction">
##FILTER=<ID=low_coverage,Description="Total coverage is less than 6">
##FILTER=<ID=high_coverage,Description="Total coverage is more than 3500">
##FILTER=<ID=No_data,Description="No valid reads on this site">
##FILTER=<ID=No_var,Description="No valid variant read on this site">
ADD REPLYlink modified 20 months ago • written 20 months ago by lillian.thistlethwaite30

"What decisions regarding generation would affect the answer to this question?"

Tools like samtools can print either every position or only the variant sites. Printing every position is useful later, for example when you want to merge several VCFs: if all the positions are present, there is no ambiguity between a HOM_REF and a NO_CALL for a missing value.
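To make the merging point concrete, here is a small sketch in plain Python. It is not how any particular merging tool works internally; `merge_calls`, the sample names, and the genotypes are hypothetical. Each sample is represented as a dict from (chrom, pos) to its GT string, and a site missing from a sample is written as "./." because nothing in the file says it was hom-ref.

```python
def merge_calls(per_sample_calls):
    """Merge {sample: {(chrom, pos): GT}} into {(chrom, pos): {sample: GT}}.

    A site absent from a sample's calls becomes './.' (no-call), since
    absence alone does not tell us the sample matched the reference.
    """
    sites = sorted(set().union(*per_sample_calls.values()))
    merged = {}
    for site in sites:
        merged[site] = {
            sample: calls.get(site, "./.")
            for sample, calls in per_sample_calls.items()
        }
    return merged


# Variant-only output: patientB has no record at chr1:200.
variant_only = {
    "patientA": {("chr1", 200): "0/1"},
    "patientB": {("chr1", 300): "1/1"},
}
print(merge_calls(variant_only)[("chr1", 200)])

# All-sites output: patientB was explicitly called 0/0 at chr1:200.
all_sites = {
    "patientA": {("chr1", 200): "0/1", ("chr1", 300): "0/0"},
    "patientB": {("chr1", 200): "0/0", ("chr1", 300): "1/1"},
}
print(merge_calls(all_sites)[("chr1", 200)])
```

In the first case patientB can only be reported as "./." at chr1:200; in the second the explicit 0/0 record carries through the merge.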

ADD REPLYlink written 20 months ago by Pierre Lindenbaum129k
20 months ago by
Kevin Blighe61k
University College London
Kevin Blighe61k wrote:

Yes, it is important to know that a homozygous reference may exist at a particular site. If you think of a clinical context, it is just as important to know with high confidence that a particular variant does not exist as it is to know that it does exist. The absence of information from a VCF over a particular site does not necessarily imply anything.

Usually, variant callers are configured to only output sites that have a variant call with regard to the chosen reference genome. They can be configured to output calls at each and every site, though. If a call cannot be made due to lack of information, you may see ./.
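As a rough illustration of the genotype strings involved, the sketch below sorts GT values into hom-ref, het, hom-alt, and no-call, and tallies them so you can see how many confident reference calls a "drop all 0/0" filter would discard. The function name `classify_genotype` and the example list are hypothetical; a partially missing genotype such as "./1" is treated as a no-call here, which is a simplifying assumption.

```python
from collections import Counter


def classify_genotype(gt):
    """Classify a VCF GT string such as '0/0', '0|1', '1/1' or './.'."""
    alleles = gt.replace("|", "/").split("/")  # treat phased like unphased
    if any(a == "." for a in alleles):
        return "no-call"   # assumption: any missing allele counts as no-call
    if all(a == "0" for a in alleles):
        return "hom-ref"
    if len(set(alleles)) == 1:
        return "hom-alt"   # same non-reference allele on every copy
    return "het"           # includes 0/1 and multi-allelic cases like 1/2


# Hypothetical genotype column from a single-sample VCF.
genotypes = ["0/0", "0/1", "1/1", "./.", "0/0", "1|0"]
counts = Counter(classify_genotype(gt) for gt in genotypes)
print(counts)
```

Filtering out the "hom-ref" rows would leave those sites indistinguishable from sites that were never covered.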

In most situations, if you see a 0/0, it will be in a multi-sample VCF, at a site where a heterozygous or homozygous-alternate variant was called in another sample.

ADD COMMENTlink written 20 months ago by Kevin Blighe61k

Awesome - very clear! Thank you so much!

ADD REPLYlink written 19 months ago by lillian.thistlethwaite30