Help Needed With Annovar - Csv Summary
10.7 years ago
newDNASeqer ▴ 760

I used AnnoVar to annotate my final VCF output files, running the "summarize_annovar.pl" script. The resulting CSV file is big (90 MB for 11 exome sequencing samples of cancer/tumor tissues). I need to set up criteria to shrink this large amount of data to something more manageable and useful. The criteria I can think of are: filter out synonymous SNVs, and perhaps intronic variants(?). I am also thinking about using the PolyPhen-2 prediction score to screen the data, but I am not sure if this is the right way. What other criteria do you recommend for finding the relevant variants in the cancer samples?

Also, in the AAChange column, I don't understand how to read "uc001gkl.1:c.C2789T:p.P930L". Does this mean two point mutations (C2789T and P930L)? And what do the "c." and "p." that precede the two changes mean? My hunch is that "c." is for confidence and "p." is for probable. I did not find an answer on the AnnoVar website, so I decided to post the questions here. Thanks

annovar vcf • 4.2k views
10.7 years ago

It sounds like you are fairly new working with this kind of data. Here's my advice:

1) Learn some basic command-line skills in Unix. grep, for example, will help you wrangle a great deal of this type of data.

2) Keep your files in VCF format. It allows you to do so much more, like filtering from the command line. Don't go for CSV; that makes me worry you are trying to load all your data into MS Excel, and you don't ever want to do that. So, for example:

$ grep -E '^#|nonsynonymous' your_annotated_file.vcf > your_annotated_file_nonsynon.vcf

That will give you a subset of variants that is perhaps more interesting to you. Many people also write filters in Perl or Python, or use something like awk from the command line. These are all things to read about and try.
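As a rough illustration of the "write a filter in Python" route, here is a minimal sketch. It keeps the VCF header lines (those starting with "#") so the output is still a readable VCF, plus any record whose text contains a keyword. The sample records and the `ExonicFunc=` / `Func=` INFO keys are made up for the example; check how the annotation is actually spelled in your own file.

```python
import io

def filter_vcf(lines, keyword="nonsynonymous"):
    """Yield VCF header lines plus records whose text contains `keyword`."""
    for line in lines:
        if line.startswith("#") or keyword in line:
            yield line

# Tiny made-up VCF for illustration only; the INFO keys are hypothetical.
SAMPLE_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\t.\tC\tT\t50\tPASS\tExonicFunc=nonsynonymous_SNV\n"
    "1\t2000\t.\tG\tA\t50\tPASS\tExonicFunc=synonymous_SNV\n"
    "1\t3000\t.\tA\tG\t50\tPASS\tFunc=intronic\n"
)

kept = list(filter_vcf(io.StringIO(SAMPLE_VCF)))
for line in kept:
    print(line, end="")
```

For a real file you would pass an open file handle instead of the `io.StringIO` object and write the kept lines to a new file. Matching on plain text is crude; a stricter filter would parse the INFO column rather than search the whole line.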

3) Ask around other labs that are working on data like yours, and read papers in your field with experiments like yours. Then you can formulate reasonable hypotheses and test them against your data. Filtering by PolyPhen may or may not be a reasonable approach; it depends on what you are trying to accomplish.

4) Learn about genomic notation. "uc001gkl.1:c.C2789T:p.P930L" describes a single mutation at two levels. In the transcript with ID uc001gkl.1 there is a coding-sequence ("c.") nucleotide change from C to T at position 2789, and this results in a protein-level ("p.") amino acid substitution from P (proline) to L (leucine) at residue 930. It has nothing to do with confidence or probable or anything like that.
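To make the notation concrete, here is a small sketch that splits such a string into its parts. It assumes the three-part form quoted in the question (ID, then "c." change, then "p." change) and handles only simple substitutions; the function name `parse_aa_change` is made up for this example and is not part of AnnoVar.

```python
import re

def parse_aa_change(annotation):
    """Split a string like 'uc001gkl.1:c.C2789T:p.P930L' into its parts.

    Only simple substitutions are handled; anything else is left as None.
    """
    result = {"transcript": None, "coding": None, "protein": None}
    for field in annotation.split(":"):
        if field.startswith("c."):
            # c. = change in the coding (cDNA) sequence: ref base, position, alt base
            m = re.fullmatch(r"c\.([ACGT])(\d+)([ACGT])", field)
            if m:
                result["coding"] = (m.group(1), int(m.group(2)), m.group(3))
        elif field.startswith("p."):
            # p. = change in the protein: ref residue, position, alt residue
            m = re.fullmatch(r"p\.([A-Z*])(\d+)([A-Z*])", field)
            if m:
                result["protein"] = (m.group(1), int(m.group(2)), m.group(3))
        elif result["transcript"] is None:
            # Assumption: the first field that is neither c. nor p. is the ID
            result["transcript"] = field
    return result

parsed = parse_aa_change("uc001gkl.1:c.C2789T:p.P930L")
print(parsed)
# {'transcript': 'uc001gkl.1', 'coding': ('C', 2789, 'T'), 'protein': ('P', 930, 'L')}
```

Both tuples describe the same event: one nucleotide substitution and the amino acid change it causes.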


Thanks Alex for your detailed reply. Another question related to AnnoVar:

As my VCF file contains 11 samples, when I use AnnoVar to annotate the VCF file, will the final output retain the same order of the 11 samples? I tried to verify this myself, but have not been able to match the order in the AnnoVar output with the input VCF file.
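One way to get a reference list for that comparison is to read the sample names from the "#CHROM" header line of the input VCF, where they follow the nine fixed columns. The sketch below does only that; it says nothing about what AnnoVar does with the order. The sample names in the demo header are invented.

```python
def sample_names(vcf_lines):
    """Return the sample names, in file order, from a VCF's #CHROM header line."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # The first 9 columns are fixed: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
            return columns[9:]
    return []

# Made-up header with three hypothetical sample names
demo_header = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ttumor_01\ttumor_02\ttumor_03\n",
]

names = sample_names(demo_header)
print(names)
```

With a real file, pass the open file handle, then compare the per-sample genotype columns of a few variants in the annotated output against the same positions in the input.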
