Why is a 0/0 homozygous reference call in a VCF useful information?
3.7 years ago

I have a set of patient VCF files, and I am looking at the genotypes of each variant called. I observe the three expected categories:

0/0 : homozygous for the reference allele
0/1 : heterozygous (one ref allele, one alt allele)
1/1 : homozygous for the alternate allele
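As a minimal sketch of the categories above, the following Python function classifies a diploid GT string. The function name `classify_genotype` is made up for illustration. It also handles the no-call value `./.` and the phased separator `|`, which are part of the VCF format but not in the list above.

```python
import re

def classify_genotype(gt: str) -> str:
    """Classify a VCF GT string such as '0/0', '0|1' or './.'."""
    # '/' separates unphased alleles, '|' separates phased alleles.
    alleles = re.split(r"[/|]", gt)
    if any(a == "." for a in alleles):
        return "no-call"      # at least one allele could not be called
    if all(a == "0" for a in alleles):
        return "hom-ref"      # every allele matches the reference
    if len(set(alleles)) == 1:
        return "hom-alt"      # same non-reference allele on both copies
    return "het"              # two different alleles, e.g. 0/1 or 1/2

for gt in ("0/0", "0/1", "1/1", "./."):
    print(gt, classify_genotype(gt))
```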

My question is, what is the point of calling a homozygous reference allele? Is this only done when there is low confidence in the variant call? So as to say "hey, this could be a different genotype, but here's what we got with the data quality given"?

What would I lose if I filtered all rows in my VCFs out with the genotype listed as "0/0"?

vcf variant-calling sequencing • 5.9k views

How many samples per VCF file?


Just one sample per file.


I like to know the difference between "confident data we see hom-ref" and "no-data". Your single-sample VCF is only showing mutant sites, but I often run a VCF at a given set of coordinates, and I want to know which of the four situations is going on. 0/0 or 0/1 or 1/1 or no-data.
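A rough sketch of that lookup, in plain Python with no VCF library, might look like this. The helper name `genotype_at` and the toy records are invented for illustration, and it assumes a single-sample VCF where an absent site simply has no line.

```python
def genotype_at(vcf_lines, chrom, pos):
    """Return the first sample's GT at chrom:pos, or 'no-data' if the site is absent."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        if fields[0] == chrom and int(fields[1]) == pos:
            format_keys = fields[8].split(":")   # FORMAT column, e.g. GT:DP
            sample_values = fields[9].split(":")  # first sample column
            return sample_values[format_keys.index("GT")]
    return "no-data"

# Toy single-sample VCF (made-up records).
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tpatientA",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0",
    "chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT\t0/1",
]

for pos in (100, 200, 300):
    print(pos, genotype_at(example, "chr1", pos))
# 100 0/0
# 200 0/1
# 300 no-data
```

Position 300 comes back as "no-data" rather than 0/0, which is exactly the distinction being described.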


Ah, so I think what both Pierre and Karl are saying is that "0/0" helps to rule OUT a disease if you know the variant is found at location X and patientA has a "0/0" genotype at location X.

However, if you are looking for patterns in an unsupervised fashion, and want to look for variants that you can then attribute to a patient's phenotype, filtering out all "0/0"'s makes sense, because you want to prioritize the 0/1 and 1/1 calls.

Did I represent what you are saying correctly?

Also, I assume "no data" in a VCF is just the "absence" of information for a particular locus?


And how did you generate the VCF?


I didn't generate the VCF. It was given to me by a collaborator. What decisions regarding generation would affect the answer to this question?

We used Atlas-SNP2 v1.4.3, with the following flags (removed the input and output default/required flags) and filters:

-F -y 6 -s --Illumina -f 3500

##FILTER=<ID=low_snpqual,Description="SNP posterior probability lower than 0.95">
##FILTER=<ID=low_VariantReads,Description="Number of variant reads is less than 3">
##FILTER=<ID=low_VariantRatio,Description="Variant read ratio is less than 0.1">
##FILTER=<ID=single_strand,Description="All variant reads are in a single strand direction">
##FILTER=<ID=low_coverage,Description="Total coverage is less than 6">
##FILTER=<ID=high_coverage,Description="Total coverage is more than 3500">
##FILTER=<ID=No_data,Description="No valid reads on this site">
##FILTER=<ID=No_var,Description="No valid variant read on this site">
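To see which of these filters a given record carries, one could parse the header lines and the FILTER column along these lines. This is only a sketch: `parse_filters` and `failed_filters` are made-up helper names, and it relies on the general VCF convention that the FILTER column is either PASS, `.`, or a semicolon-separated list of filter IDs. How Atlas-SNP2 sets the genotype on sites flagged No_data or No_var is not something this sketch assumes.

```python
import re

# Matches header lines of the form shown above.
FILTER_RE = re.compile(r'##FILTER=<ID=([^,>]+),Description="([^"]*)">')

def parse_filters(header_lines):
    """Map each FILTER ID declared in the header to its description."""
    filters = {}
    for line in header_lines:
        match = FILTER_RE.match(line.strip())
        if match:
            filters[match.group(1)] = match.group(2)
    return filters

def failed_filters(filter_field):
    """Return the filter IDs in a record's FILTER column ([] for PASS or '.')."""
    if filter_field in ("PASS", "."):
        return []
    return filter_field.split(";")

header = [
    '##FILTER=<ID=low_coverage,Description="Total coverage is less than 6">',
    '##FILTER=<ID=No_data,Description="No valid reads on this site">',
    '##FILTER=<ID=No_var,Description="No valid variant read on this site">',
]
descriptions = parse_filters(header)

# Hypothetical FILTER value for one record.
for filter_id in failed_filters("low_coverage;No_var"):
    print(filter_id, "->", descriptions[filter_id])
```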

"What decisions regarding generation would affect the answer to this question?"

Tools like samtools can print either all positions or only the variant sites. Having all positions is useful later, for example when you want to merge some VCFs: there will be no ambiguity between a HOM_REF and a NO_CALL for a missing value.
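The merge ambiguity can be illustrated with a toy sketch. This is not how samtools or any real merge tool is implemented; `merge_calls` and the sample data are invented, and each sample's calls are reduced to a dict from (chrom, pos) to GT.

```python
def merge_calls(calls_by_sample):
    """Merge per-sample {(chrom, pos): GT} dicts into {(chrom, pos): {sample: GT}}.

    A site missing from a sample is filled with './.', because from the
    input alone we cannot tell hom-ref apart from no coverage.
    """
    sites = sorted(set().union(*[calls.keys() for calls in calls_by_sample.values()]))
    return {
        site: {sample: calls.get(site, "./.") for sample, calls in calls_by_sample.items()}
        for site in sites
    }

# Variant-only input: each sample lists only its non-reference sites.
variant_only = {
    "patientA": {("chr1", 200): "0/1"},
    "patientB": {("chr1", 300): "1/1"},
}

# All-positions input: hom-ref sites are reported explicitly.
all_positions = {
    "patientA": {("chr1", 200): "0/1", ("chr1", 300): "0/0"},
    "patientB": {("chr1", 200): "0/0", ("chr1", 300): "1/1"},
}

print(merge_calls(variant_only))
print(merge_calls(all_positions))
```

With the variant-only input, patientB at chr1:200 ends up as `./.`; with the all-positions input the same cell is a known `0/0`.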

3.7 years ago

Yes, it is important to know that a homozygous reference may exist at a particular site. If you think of a clinical context, it is just as important to know with high confidence that a particular variant does not exist as it is to know that it does exist. The absence of information from a VCF over a particular site does not necessarily imply anything.

Usually, variant callers are configured to only output sites that have a variant call with regard to the chosen reference genome. They can be configured to output calls at each and every site, though. If a call cannot be made due to lack of information, you may see ./.

In most situations, if you see a 0/0, it will appear in a multi-sample VCF at a site where a HET or HOM variant was called in another sample at the same site.
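As a small illustration of that multi-sample case, here is a sketch that pulls the per-sample genotypes out of one record. The helper name `sample_genotypes`, the sample names and the record are all made up.

```python
def sample_genotypes(header_line, record_line):
    """Return {sample_name: GT} for one multi-sample VCF record."""
    samples = header_line.rstrip("\n").split("\t")[9:]   # sample names start at column 10
    fields = record_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")           # position of GT within FORMAT
    return {
        sample: column.split(":")[gt_index]
        for sample, column in zip(samples, fields[9:])
    }

header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tpatientA\tpatientB\tpatientC"
record = "chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT:DP\t0/0:31\t0/1:28\t./.:0"

print(sample_genotypes(header, record))
# {'patientA': '0/0', 'patientB': '0/1', 'patientC': './.'}
```

Here the site is in the file because patientB is heterozygous; patientA's 0/0 is an actual call with read support, while patientC has no call at all.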

Added August 1st, 2020:

Also be aware of situations like the one discussed in this answer: "Alternate nucleotide is more frequent than reference nucleotide. OMG I'm dizzy."


Awesome - very clear! Thank you so much!
