VCF files 101 - for people from a non-bioinformatics background
3.7 years ago
akshaykum684 ▴ 20

Hello Everyone,

I am new to bioinformatics analysis. Currently, I have a set of VCF files and have been tasked to analyze them. However, I am not from a bioinformatics background and have only a limited understanding of biology (high-school level). I went through the tutorials below and also bought the Biostar book, but the resources I referred to are a bit technical. I am looking for a tutorial that simplifies the concepts with examples, for people with no bioinformatics background like me. I understand that this also depends on the reader's ability, but I want to make sure that I don't miss any well-known or popular resource that beginners usually refer to.

Can any of you please direct me to such a resource?

Tutorials that I referred to:

https://faculty.washington.edu/browning/intro-to-vcf.html#intro

http://alimanfoo.github.io/2017/06/14/read-vcf.html

https://www.ebi.ac.uk/training-beta/online/courses/human-genetic-variation-introduction/variant-identification-and-analysis/understanding-vcf-format/

vcf sequencing genome variant-calling • 1.7k views
ADD COMMENT

Currently, I have a set of VCF files and have been tasked to analyze them.

What kind of analysis are you looking to do? VCF is a format with a defined specification. Here is an additional link (GATK help) that simplifies things.
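To make "a format with a defined specification" concrete, here is a minimal Python sketch that splits one tab-separated VCF data line into its eight fixed columns plus the per-sample columns. The record is illustrative, modelled on the example in the VCF specification, and `parse_vcf_line` is a hypothetical helper written for this post, not part of any library. For real work, an established parser is a safer choice.

```python
# The eight fixed, tab-separated columns that every VCF data line starts with.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line into a dict (hypothetical helper, a sketch only)."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info = {}
    if record["INFO"] != ".":
        for entry in record["INFO"].split(";"):
            key, sep, value = entry.partition("=")
            info[key] = value if sep else True
    record["INFO"] = info

    # Column 9 (FORMAT) names the colon-separated values in each sample column.
    format_keys = fields[8].split(":") if len(fields) > 8 else []
    record["SAMPLES"] = [dict(zip(format_keys, s.split(":"))) for s in fields[9:]]
    return record


# Illustrative record with three samples.
line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
record = parse_vcf_line(line)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
print(record["INFO"])
print(record["SAMPLES"][0])
```

The fixed columns describe the variant site, while the FORMAT and sample columns describe what was observed in each individual at that site.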

ADD REPLY

The first link was last updated in 2014. You're better off reading VCFv4.3 specs, given how much has changed since then. The EBI link is good though.

ADD REPLY

@RamRs @genomax

Can you help me with the fields below in a VCF file? The image is from the VCF specification.

https://ibb.co/c1hxX12

1) Where do the QUAL, FILTER and INFO values in a VCF file come from?

2) How are the INFO and QUAL values calculated? Could you show a simple example of how these numbers are computed?

ADD REPLY

See GATK link in my previous comment for explanation for both of your questions.
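As a rough illustration of the scale involved: the VCF specification defines QUAL as a Phred-scaled value, that is, -10 * log10 of the probability that the call in ALT is wrong. The sketch below only converts between the two scales. How a variant caller arrives at the underlying probability depends on its own statistical model, which is not shown here, and the function names are made up for this example.

```python
import math


def phred_to_error_probability(qual):
    """Convert a Phred-scaled score to the probability that the call is wrong."""
    return 10 ** (-qual / 10)


def error_probability_to_phred(probability):
    """Convert an error probability back to the Phred scale."""
    return -10 * math.log10(probability)


# QUAL 30 means a 1-in-1000 chance that the call is wrong.
print(phred_to_error_probability(30))

# Every additional 10 points of QUAL divides the error probability by 10.
for qual in (10, 20, 30, 40):
    print(qual, phred_to_error_probability(qual))
```

FILTER and INFO are likewise written by the variant caller or by later filtering and annotation tools; the header lines starting with `##FILTER` and `##INFO` in the same file describe what each entry means.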

ADD REPLY

Hi @Genomax @RamRS

I went through the link that you shared. However, I have another question about PL and GQ. Consider the sample record below:

20 10001019 . T G 364.77 . [CLIPPED] GT:AD:DP:GQ:PL 0/1:18,15:33:99:393,0,480

For this record, I read the following in the linked article:

"The degree of certainty in our genotype is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned allele; the next PL is PL(0/0) = 393, corresponding to 10^(-39.3), or 5.0118723e-40 which is a very small number indeed; and the next one will be even smaller"

"The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99."

Sorry, but I am not able to understand this. I am not sure whether that is due to my limited English or to being new to this domain.

Actually, I understand what those two fields are, but I don't understand how they are computed.

Can you help me understand this better in layman's terms? Is there a tutorial that walks users through the computation step by step?

Q1) Here, which is the most likely genotype and which is the second most likely genotype?

Q2) How is the normalized score used to answer Q1? Is it even used? I understand how the normalized score is calculated, but what is the use of calculating it?

That would be really helpful.

ADD REPLY

My personal opinion is that it is not always necessary to understand the statistical underpinnings of how each field is calculated in order to use the data. It always helps to be curious, but keep in mind the fact that things need to get done, and maybe knowing that the GT is accurate to a high degree of certainty is more relevant to us than knowing exactly how that degree of certainty was arrived at.
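For anyone who does want to see the arithmetic, here is a minimal Python sketch that works through the sample column quoted above. It assumes a single ALT allele, so the three PL values are taken in the order 0/0, 0/1, 1/1, and it follows the GQ rule as described in the quoted GATK text (gap between the two smallest PLs, capped at 99). It is an illustration of that description, not GATK's actual code.

```python
# FORMAT keys and sample column copied from the record quoted in this thread.
format_keys = "GT:AD:DP:GQ:PL".split(":")
sample_values = "0/1:18,15:33:99:393,0,480".split(":")
sample = dict(zip(format_keys, sample_values))

# Assumption: one REF and one ALT allele, so the PL order is 0/0, 0/1, 1/1.
genotypes = ["0/0", "0/1", "1/1"]
pls = [int(value) for value in sample["PL"].split(",")]

# PL is Phred-scaled: relative likelihood = 10 ** (-PL / 10).
# The best genotype has PL 0, which corresponds to a relative likelihood of 1.0.
relative_likelihood = {gt: 10 ** (-pl / 10) for gt, pl in zip(genotypes, pls)}

# Rank genotypes from most to least likely (smallest PL first).
ranked = sorted(zip(pls, genotypes))
most_likely = ranked[0][1]
second_most_likely = ranked[1][1]

# GQ: difference between the two smallest PLs, capped at 99.
gq = min(ranked[1][0] - ranked[0][0], 99)

print("most likely genotype:", most_likely)
print("second most likely genotype:", second_most_likely)
print("GQ:", gq)
for gt in genotypes:
    print(gt, relative_likelihood[gt])
```

So for Q1, the most likely genotype is 0/1 (PL 0) and the second most likely is 0/0 (PL 393). For Q2, normalizing so that the best genotype has PL 0 means every other PL reads directly as "how much less likely than the best call", which is why GQ is simply the second smallest PL: 393 here, capped to the 99 shown in the GQ field.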

ADD REPLY

Hi, yes, good suggestion. But sometimes I get lost with so much information to interpret from the VCF file. Anyway, thanks for the help. Could I also kindly request your help with the post below?

C: VCF file analysis - Tutorial resources

ADD REPLY

genomax pretty much covers it in their comment. A VCF file is two tables masquerading as one.

In my experience, the learning is never-ending, but the basics are annotation and filtering. It would also help to look at parsimony, left alignment and normalization to understand certain conventions for non-SNV variants in repeat regions. GATK's Best Practices pipeline is a good place to learn about pipelines that generate variant calls, and also a good place to learn about gVCF files.
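To give a feel for what parsimony means, here is a small Python sketch that trims bases shared by REF and ALT while keeping at least one base in each allele. It covers only the parsimony part. True left alignment also needs the reference sequence, so use a dedicated tool such as `bcftools norm` for real data. The function name `trim_alleles` is made up for this example.

```python
def trim_alleles(pos, ref, alt):
    """Return (pos, ref, alt) with redundant shared bases removed.

    Sketch of the parsimony step only: no reference lookup, no left alignment.
    """
    # Drop identical trailing bases; this does not change the position.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Drop identical leading bases; each one shifts the position right by 1.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# A SNV written with unnecessary flanking bases.
print(trim_alleles(10, "ATG", "ACG"))

# A one-base deletion written with extra trailing bases.
print(trim_alleles(5, "CAAA", "CAA"))
```

The same variant can be written in several equivalent ways, which is why two VCF files describing the same deletion may not match line for line until both are normalized.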

ADD REPLY

ADD REPLY

Please stop doing this. Your post will receive comments from people without you spamming links to it on other threads.

ADD REPLY

ADD REPLY

If you wish to delete a comment, use the Delete Post option under moderate. Erasing the content is bad etiquette as it removes context.

ADD REPLY
