Question

Variant Reduction Using Annovar

2

10.4 years ago

ivivek_ngs ★ 5.2k

Dear all,

I have some questions about the filtering and annotation steps usually used to retrieve causal variants and extract the most important candidate genes likely to carry mutations in tumor samples. I designed an exome sequencing analysis pipeline after reading through several pipelines available online and adapting them to my experimental design, and I now have variant calls that I want to annotate with ANNOVAR and filter down to the potentially mutated genes. The catch is that I have no normal samples; my experimental design so far is to analyze the exome data of a tumor sample and the iPSC line derived from the same tumor. I analyzed the tumor and its corresponding iPSC line separately and then annotated each with ANNOVAR. The commands I used are:

# Annotation using ANNOVAR
# Conversion to ANNOVAR file format

perl5.8.8 /data/PGP/exome/annovar/convert2annovar.pl /scratch/GT/vdas/pietro/exome_seq/results/T_S7999/T_S7999.recal.snps.vcf -format vcf4 --outfile /scratch/GT/vdas/pietro/exome_seq/results/T_S7999/T_S7999.recal.snps.vcf.annovar -includeinfo

# Final annotation
perl5.8.8 /data/PGP/exome/annovar30_01_2013/summarize_annovar.pl -veresp 6500 -ver1000g 1000g2012apr -buildver hg19 -verdbsnp 137 /scratch/GT/vdas/pietro/exome_seq/results/T_S7999/T_S7999.recal.snps.vcf.annovar /data/PGP/exome/annovar30_01_2013/humandb -outfile /scratch/GT/vdas/pietro/exome_seq/results/T_S7999/T_S7999_snps -step 1-9

I did the same for the iPSC line. This gave me a list of over 5000 mutated genes, and I want to compare the tumor and its iPSC line to check whether the genetic landscape of both is still maintained. I suspect that 5000 genes is too many, and since I have no normal samples I cannot apply the subtraction method, i.e. discard the mutations that are also found in the normal sample with respect to the reference. So I would like to ask whether there is a protocol for filtering the nonsynonymous and synonymous SNVs obtained from the ANNOVAR step, to reduce the mutated genes to a smaller set of stronger candidates that I can then compare between the tumor and its iPSC line.

I see that ANNOVAR ships another program, variants_reduction.pl, which could be used. Does anyone have experience with it, or is there a standard filtering method that can be applied to the output of the final annotation step above? The only prediction score I can see is AVSIFT; I could rank on it and keep the genes whose variants have AVSIFT scores below 0.05. Does this idea sound reasonable? I am not looking for novel mutations for now, so I am not removing the variants found in dbSNP and 1000 Genomes. Is there also a way to check the MAF value in ANNOVAR, set a stringent threshold, and keep only the variants with a MAF below it?

I would be thankful if anyone could share the annotation and filtering process they have used for exome analysis. This field is new to me, so I may be wrong in places; please feel free to correct me and show me the right path. It would be nice if anyone could share a script for filtering after ANNOVAR, or the variants_reduction.pl command with the parameters they use.

Thanks

annovar exome-sequencing filtering • 7.5k views

ADD COMMENT • link updated 10.4 years ago by Jorge Amigo 14k • written 10.4 years ago by ivivek_ngs ★ 5.2k

score 4 · Answer 1 · 2013-11-12

4

10.4 years ago

Jorge Amigo 14k

Regarding the methodology, I sincerely doubt you can draw any reliable conclusions from your data without control samples, for example from the same individual but from a different, non-affected tissue. I'm sure there are people around here who can suggest a better way to proceed than I can.

But regarding the filtering process, I can share our experience. In fact, "filtering" is a concept we don't usually like, because we are very much oriented towards clinical matters, so we favor the concept of "prioritizing" instead. We use ANNOVAR to annotate all our variants with as much information as possible, and for that reason we were using the summarize_annovar.pl script, as it gave us plenty of annotation columns (a fixed list, though) in an easy-to-handle tabular format. Recently, however, ANNOVAR released an update that deprecates summarize_annovar.pl and proposes table_annovar.pl instead, as it lets you select the exact databases you want in your final output table, and even place them in any particular order. The result of an ANNOVAR run is always the same number of variants as detected by the variant caller, but with a lot more information to be used later for prioritizing them. Think of it as a tool that should always be run, and always in the same way: it is much better to let the end user decide what to do with the whole pack than to implement filters and thresholds along the way that surely depend on each run or on each end user's needs.

Having said all this, we use ANNOVAR to get all the information you are looking for: allele frequencies (we get them from 1000genomes, ESP and COSMIC), genetic function (we get it from RefGene) or functional predictions (ljb_all contains up to 10 different functional scores including AVSIFT), but we also get conserved and segmental duplication sites, dbSNP code, clinical relevance reported in dbSNP, and so on. Once you have all this information condensed in a single table, it's just a matter of implementing whatever prioritizing algorithm you are interested in. Typical criteria include exonic variants, nonsynonymous changes, frameshifts, stop codons, low allele frequencies and relevant functional predictions. We have seen that filtering is useful, but building a weighted algorithm that evaluates all these annotations and ranks the variant list can be even more useful. Unfortunately the formula depends heavily on the experiment you're performing, so it's very complicated to build and share a single weighting algorithm, although commercial software is moving in this direction.
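As a rough illustration of the weighted ranking idea, here is a minimal Python sketch. Everything specific in it is an assumption made for the example: the column names (`Func`, `ExonicFunc`, `AVSIFT`, `1000g2012apr_ALL`, `ESP6500_ALL`) are modelled on a summarize_annovar.pl-style table and should be checked against your own header, and the weights and cutoffs are hypothetical values that would need tuning for each experiment.

```python
from typing import Dict, List, Optional

Row = Dict[str, str]

# Assumed column names; check them against the header of your own table.
FREQ_COLUMNS = ("1000g2012apr_ALL", "ESP6500_ALL")
DAMAGING = {"nonsynonymous SNV", "stopgain SNV",
            "frameshift insertion", "frameshift deletion"}

# Hypothetical weights; they are placeholders, not recommended values.
WEIGHTS = {"exonic": 1.0, "damaging": 2.0, "rare": 1.5, "sift": 1.5}


def to_float(value: Optional[str]) -> Optional[float]:
    """Parse a table cell, returning None for empty or non-numeric cells."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def max_frequency(row: Row) -> float:
    """Highest population frequency found; missing values count as 0."""
    values = [to_float(row.get(col)) for col in FREQ_COLUMNS]
    return max([v for v in values if v is not None], default=0.0)


def score_variant(row: Row, maf_cutoff: float = 0.01,
                  sift_cutoff: float = 0.05) -> float:
    """Add up the weights of every criterion the variant satisfies."""
    score = 0.0
    if row.get("Func") == "exonic":
        score += WEIGHTS["exonic"]
    if row.get("ExonicFunc") in DAMAGING:
        score += WEIGHTS["damaging"]
    if max_frequency(row) < maf_cutoff:
        score += WEIGHTS["rare"]
    sift = to_float(row.get("AVSIFT"))
    if sift is not None and sift < sift_cutoff:
        score += WEIGHTS["sift"]
    return score


def rank_variants(rows: List[Row]) -> List[Row]:
    """Return all variants, highest score first; nothing is discarded."""
    return sorted(rows, key=score_variant, reverse=True)
```

Note that the ranking keeps every variant and only reorders them, which matches the "prioritize rather than filter" approach: the end user decides where to stop reading the list.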

ADD COMMENT • link 10.4 years ago by Jorge Amigo 14k

0

Thank you very much for the reply. Indeed, working without control samples is not a good approach, but for now we do not have them, as we are dealing with patient samples and our preliminary analysis has to be done on the tumor and its derived iPS lines only. As for prioritizing and filtering: I am interested in nonsynonymous SNVs, synonymous SNVs and stop codons. My plan is to pull these three categories out of the exome summary CSV file, which is the output of summarize_annovar.pl holding all the tabular information on the mutations, and then filter on AVSIFT scores below 0.05. The candidates extracted this way will be my priority mutations, which I can then compare between the tumor and its iPSC line. This should give me a much more focused list of candidates. How does this sound as a filtering strategy? Please let me know. It would also be nice if you could point me to any open source prioritizing algorithms available online that I could use for this. Thanks again for the suggestions.
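The plan described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the column names (`Chr`, `Start`, `End`, `Ref`, `Obs`, `ExonicFunc`, `AVSIFT`) are assumed to match the header of the summary CSV, and the function names are hypothetical. One design choice is also an assumption: because SIFT scores amino acid substitutions, synonymous and stop-gain variants usually carry no score, so the cutoff is applied to nonsynonymous SNVs only and the other two categories are kept as they are.

```python
import csv
from typing import Dict, Iterable, List, Optional, Set, Tuple

Row = Dict[str, str]

# Assumed column names; check them against your own CSV header.
KEY_COLUMNS = ("Chr", "Start", "End", "Ref", "Obs")
CATEGORIES = {"nonsynonymous SNV", "synonymous SNV", "stopgain SNV"}


def read_rows(lines: Iterable[str]) -> List[Row]:
    """Read a CSV (open file handle or list of lines) into dict rows."""
    return list(csv.DictReader(lines))


def to_float(value: Optional[str]) -> Optional[float]:
    """Parse a table cell, returning None for empty or non-numeric cells."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def select_candidates(rows: List[Row], sift_cutoff: float = 0.05) -> List[Row]:
    """Keep the three categories; require AVSIFT < cutoff for nonsynonymous."""
    selected = []
    for row in rows:
        category = row.get("ExonicFunc")
        if category not in CATEGORIES:
            continue
        if category == "nonsynonymous SNV":
            sift = to_float(row.get("AVSIFT"))
            if sift is None or sift >= sift_cutoff:
                continue
        selected.append(row)
    return selected


def variant_key(row: Row) -> Tuple[str, ...]:
    """Identify a variant by position and alleles."""
    return tuple(row.get(col, "") for col in KEY_COLUMNS)


def compare(tumor_rows: List[Row],
            ipsc_rows: List[Row]) -> Dict[str, Set[Tuple[str, ...]]]:
    """Split candidate variants into shared, tumor-only and iPSC-only sets."""
    tumor = {variant_key(r) for r in tumor_rows}
    ipsc = {variant_key(r) for r in ipsc_rows}
    return {"shared": tumor & ipsc,
            "tumor_only": tumor - ipsc,
            "ipsc_only": ipsc - tumor}
```

With real files, `read_rows(open(path, newline=""))` would feed `select_candidates`, and `compare` would then report how many candidate variants the tumor and its iPSC line share. A variant absent from one list may simply have been missed by the caller there (for example because of low coverage), so the sample-specific sets are best read as candidates rather than confirmed differences.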

ADD REPLY • link 10.4 years ago by ivivek_ngs ★ 5.2k

0

As I said, filtering by exonic function and a single functional prediction score is fine, but very limited. I would broaden this filter at least to include allele frequencies (in tumour variants they're very useful) and any of the available functional prediction scores. As far as I know, no open source prioritizing algorithm has been published as a universal way of highlighting relevant variants in any kind of experiment, but I'm sure you will find plenty of ideas in tumour sequencing papers to build your own.
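A broadened filter of this kind might look like the sketch below, which keeps a variant when it is rare in every population frequency column and at least one prediction score flags it. The column names are assumptions modelled on a summarize_annovar.pl-style table, the 0.01 frequency cutoff is a hypothetical default, and the two score thresholds (AVSIFT below 0.05, LJB_PhyloP above 0.95) are simply the ones quoted in this thread.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]

# Assumed column names; check them against your own table header.
FREQ_COLUMNS = ("1000g2012apr_ALL", "ESP6500_ALL")

# Each predictor maps a score to True when it suggests a damaging change.
PREDICTORS: Dict[str, Callable[[float], bool]] = {
    "AVSIFT": lambda s: s < 0.05,
    "LJB_PhyloP": lambda s: s > 0.95,
}


def to_float(value: Optional[str]) -> Optional[float]:
    """Parse a table cell, returning None for empty or non-numeric cells."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def is_rare(row: Row, cutoff: float = 0.01) -> bool:
    """True when every reported frequency is below the cutoff.

    A variant with no reported frequency is treated as rare.
    """
    for col in FREQ_COLUMNS:
        freq = to_float(row.get(col))
        if freq is not None and freq >= cutoff:
            return False
    return True


def flagged_by_any_predictor(row: Row) -> bool:
    """True when at least one available score passes its threshold."""
    for col, is_damaging in PREDICTORS.items():
        score = to_float(row.get(col))
        if score is not None and is_damaging(score):
            return True
    return False


def broadened_filter(rows: List[Row], cutoff: float = 0.01) -> List[Row]:
    """Keep variants that are both rare and flagged by some predictor."""
    return [r for r in rows
            if is_rare(r, cutoff) and flagged_by_any_predictor(r)]
```

Adding another prediction score is a matter of adding one entry to `PREDICTORS`, after checking in which direction that score runs.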

ADD REPLY • link 10.4 years ago by Jorge Amigo 14k

0

"ANNOVAR to get all the information you are looking for: allele frequencies (we get them from 1000genomes, ESP and COSMIC), genetic function (we get it from RefGene) or functional predictions (ljb_all contains up to 10 different functional scores including AVSIFT), but we also get conserved and segmental duplication sites, dbSNP code, clinical relevance reported in dbSNP"

I want to know how these scores are generated in the final output of ANNOVAR's summarize_annovar.pl. Are they predefined scores, or are they computed from the samples under analysis against the different databases holding information on those nucleotides? Below is the table from the avsift.txt file. Can anyone explain what each column signifies, what the score in one of the columns is, and how it is used during annotation to assign scores to each mutation in the respective samples? Please let me know.

ADD REPLY • link 10.4 years ago by ivivek_ngs ★ 5.2k

0

Can anyone tell me how the score in the AVSIFT column of the ANNOVAR output CSV file, produced by summarize_annovar.pl, is organized? In the SIFT paper, scores greater than 0.05 are considered benign and scores below 0.05 deleterious, while the dbNSFP article describes new SIFT scores that run the opposite way, with higher scores denoting deleterious and lower scores neutral. I want to prioritize the mutated genes of my tumor sample on the basis of the AVSIFT scores. So what should the criterion be? Can anyone guide me on this?

ADD REPLY • link 10.4 years ago by ivivek_ngs ★ 5.2k

0

The SIFT score is always as explained on their website: "Positions with normalized probabilities less than 0.05 are predicted to be deleterious, those greater than or equal to 0.05 are predicted to be tolerated." If you use the old "avsift" annotation, the numbers should match this description, but if you use the recent "ljb2_sift", as dbNSFP seems to be doing, the numbers would be the opposite, as it uses "1-SIFT".
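To keep the two conventions from being mixed up in a script, one option is to convert every score back to the original SIFT scale before applying the 0.05 cutoff. The sketch below is a hypothetical helper, not part of ANNOVAR; the convention labels `"sift"` and `"one_minus_sift"` are names invented for the example, and which convention a given annotation table uses should be checked against its documentation.

```python
def to_original_sift(score: float, convention: str = "sift") -> float:
    """Convert a score to the original SIFT scale (low = deleterious).

    convention: "sift" for original scores, "one_minus_sift" for tables
    that store 1 - SIFT (high = deleterious).
    """
    if convention == "sift":
        return score
    if convention == "one_minus_sift":
        return 1.0 - score
    raise ValueError(f"unknown convention: {convention}")


def is_deleterious(score: float, convention: str = "sift",
                   cutoff: float = 0.05) -> bool:
    """Apply the SIFT rule: below the cutoff is predicted deleterious."""
    return to_original_sift(score, convention) < cutoff
```

For example, `is_deleterious(0.01)` and `is_deleterious(0.99, "one_minus_sift")` describe the same prediction under the two conventions, and both return `True`.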

ADD REPLY • link 10.4 years ago by Jorge Amigo 14k

0

Thank you very much for the valuable input; this will help me in prioritizing my candidates.

ADD REPLY • link 10.4 years ago by ivivek_ngs ★ 5.2k

0

I would like to know the meaning of the LJB_PhyloP score. It says that a score >0.95 gives an idea of pathogenicity and that the site is conserved. What conservation is it talking about? If I am not wrong, phyloP uses an HMM to construct lineage-specific selection scores for the substitutions of amino acids. So in a tumor sample, a nonsynonymous mutation with a phyloP score >0.95 means the site is conserved and the change is deleterious, but I would like to know what conservation it refers to in that context. Can anyone enlighten me?

ADD REPLY • link 10.4 years ago by ivivek_ngs ★ 5.2k