Dear all,
I have some questions about the filtering and annotation steps that are usually used to retrieve causal variants and to extract the most important candidate genes likely to carry mutations in tumor samples. I designed an exome sequencing data analysis pipeline after reading through the different pipelines available online and adapting them to my experimental design, and I have been able to extract the variants. I now want to annotate them with ANNOVAR and filter for the potentially mutated genes. The catch in my analysis is that I do not have any normal samples, so my experimental design so far is to analyze the exome data of the tumor sample and of the iPSC line derived from the same tumor. I analyzed the tumor and its corresponding iPSC line separately and then annotated each of them with ANNOVAR. The commands I used for ANNOVAR are:
#### Annotations using annovar
### Conversion to annovar file format
perl5.8.8 /data/PGP/exome/annovar/convert2annovar.pl /scratch/GT/vdas/pietro/exome_seq/results/T_S7999/T_S7999.recal.snps.vcf -format vcf4 --outfile /scratch/GT/vdas/pietro/exome_seq/results/T_S7999/T_S7999.recal.snps.vcf.annovar -includeinfo
######final annotation
perl5.8.8 /data/PGP/exome/annovar30_01_2013/summarize_annovar.pl -veresp 6500 -ver1000g 1000g2012apr -buildver hg19 -verdbsnp 137 /scratch/GT/vdas/pietro/exome_seq/results/T_S7999/T_S7999.recal.snps.vcf.annovar /data/PGP/exome/annovar30_01_2013/humandb -outfile /scratch/GT/vdas/pietro/exome_seq/results/T_S7999/T_S7999_snps -step 1-9
I have done the same for the iPSC line. I ended up with a list of over 5000 mutated genes, and I want to compare the tumor and its iPSC line to check whether the genetic landscape is maintained. However, I suspect that 5000 genes is too many, and since I have no normal samples I cannot apply the subtraction method, where the mutations that are also found in the normal sample (with respect to RefGene) are removed. So I would like to ask whether there is any protocol for filtering the nonsynonymous and synonymous SNVs obtained from the ANNOVAR step, to reduce the number of mutated genes to a smaller set of more likely candidates before comparing the tumor and its iPSC line. I see that ANNOVAR includes another program, variants_reduction.pl, which could be used. Does anyone have experience with this program, or is there a standard filtering method that can be applied to the output of the final annotation step above? The only score I can see is AVSIFT, so I could rank on it and keep the genes with AVSIFT scores below 0.05. Does this idea sound good? I am not looking for novel mutations at the moment, so I am not removing the variants that are found in dbSNP and 1000 Genomes. Also, is there any way to check the MAF value in ANNOVAR, set a stringent threshold, and keep only the genes whose variants have a MAF below that threshold? I would be thankful if anyone could share their experience with the annotation and filtering process they have used in exome analysis. This field is new to me, so I might be wrong in some areas; please feel free to correct me and show me the right path. It would be nice if anyone could share a script for filtering after ANNOVAR, or the variants_reduction.pl command with the parameters they use.
Thanks
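For the tumor versus iPSC comparison described above, one simple option is to compare the two annotated variant lists as sets. The following is only a minimal Python sketch: the column names (Chr, Start, Ref, Obs) are assumptions about the summary file header and should be checked against your own output, and the rows are made-up placeholders, not real data.

```python
# Assumed column names that identify a variant; adjust to your own header.
KEY_COLS = ("Chr", "Start", "Ref", "Obs")


def variant_key(row):
    """Build a hashable identifier for one variant row."""
    return tuple(row[col] for col in KEY_COLS)


def compare_variants(tumor_rows, ipsc_rows):
    """Return shared and sample-specific variants plus a Jaccard overlap."""
    tumor = {variant_key(r) for r in tumor_rows}
    ipsc = {variant_key(r) for r in ipsc_rows}
    shared = tumor & ipsc
    union = tumor | ipsc
    return {
        "shared": shared,
        "tumor_only": tumor - ipsc,
        "ipsc_only": ipsc - tumor,
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Made-up rows, only to show the behaviour.
tumor_demo = [
    {"Chr": "chr1", "Start": "100", "Ref": "A", "Obs": "G"},
    {"Chr": "chr2", "Start": "200", "Ref": "C", "Obs": "T"},
    {"Chr": "chr3", "Start": "300", "Ref": "G", "Obs": "A"},
]
ipsc_demo = [
    {"Chr": "chr1", "Start": "100", "Ref": "A", "Obs": "G"},
    {"Chr": "chr2", "Start": "200", "Ref": "C", "Obs": "T"},
    {"Chr": "chr4", "Start": "400", "Ref": "T", "Obs": "C"},
]
result = compare_variants(tumor_demo, ipsc_demo)
print(len(result["shared"]), len(result["tumor_only"]), len(result["ipsc_only"]))
print(result["jaccard"])
```

Comparing at the variant level rather than the gene level avoids counting two different mutations in the same gene as "shared". The same function could be run on gene names instead if a gene-level overlap is what you want.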
Thank you very much for the reply. Indeed, working without control samples is not a good approach, but for now we do not have them, as we are dealing with patient samples and our preliminary analysis has to be done on the tumor and its derived iPS lines only. As far as prioritizing and filtering are concerned, I am interested in nonsynonymous SNVs, synonymous SNVs and stop codons. So what I am thinking of doing is to separate these three categories out of the exome summary CSV file (the output of summarize_annovar.pl, which holds all the tabular information on the mutations) and then filter for AVSIFT scores below 0.05. The candidates extracted this way will be my priority mutations, which I can then compare between the tumor and its iPSC lines. This should give me a much more focused list of candidates. How does this sound as a filtering strategy? Please let me know. It would also be nice if you could point me to any open-source prioritization algorithms available online that I could use for this. Thanks again for the suggestions.
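The filter described above (keep three exonic categories, then apply AVSIFT < 0.05) could be sketched in Python roughly as follows. The column names ExonicFunc and AVSIFT and the category labels are assumptions about the summary CSV and need checking against the real header; the demo rows and gene names are made up. One design choice to be aware of: SIFT scores amino acid substitutions, so synonymous variants are not expected to carry a score, and the sketch therefore applies the cutoff only to nonsynonymous rows instead of dropping everything without a score.

```python
import csv
import io

# Assumed column names and labels; check the header row of your own file.
FUNC_COL = "ExonicFunc"
SIFT_COL = "AVSIFT"
KEEP = {"nonsynonymous SNV", "synonymous SNV", "stopgain SNV", "stoploss SNV"}


def passes(row, sift_cutoff=0.05):
    """True if the row is in a wanted category and, if nonsynonymous, has SIFT < cutoff."""
    func = (row.get(FUNC_COL) or "").strip()
    if func not in KEEP:
        return False
    if func != "nonsynonymous SNV":
        return True  # the SIFT cutoff is not applied to synonymous or stop variants
    try:
        return float(row.get(SIFT_COL) or "") < sift_cutoff
    except ValueError:
        return False  # missing or non-numeric score: no SIFT evidence


def prioritize(handle):
    """Read a summary CSV from an open file handle and return the rows that pass."""
    return [row for row in csv.DictReader(handle) if passes(row)]


# Made-up rows, only to show the behaviour.
demo = io.StringIO(
    "Gene,ExonicFunc,AVSIFT\n"
    "GENE_A,nonsynonymous SNV,0.01\n"
    "GENE_B,nonsynonymous SNV,0.40\n"
    "GENE_C,synonymous SNV,\n"
    "GENE_D,stopgain SNV,\n"
    "GENE_E,,\n"
)
kept = [row["Gene"] for row in prioritize(demo)]
print(kept)
```

On a real file, `prioritize(open(path, newline=""))` would be the equivalent call, where `path` is the location of your own summary CSV.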
As I said, filtering by exonic function and a single functional prediction score is fine, but very limited. I would broaden this filter to include at least allele frequencies (they're very useful for tumour variants) and any of the other available functional prediction scores. As far as I know, there's no open-source prioritization algorithm published as a universal way of highlighting relevant variants in any kind of experiment, but I'm sure you will find plenty of ideas in tumour sequencing papers to build your own.
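An allele frequency filter along these lines might look like the Python sketch below. The column names (1000g2012apr_ALL, ESP6500_ALL) are guesses based on the database versions in the command above, and the 0.01 threshold is purely illustrative, not a recommendation. An empty frequency field is treated here as "not observed in that database", which is an assumption worth confirming for your own output.

```python
# Assumed frequency column names; adjust to your own header.
FREQ_COLS = ("1000g2012apr_ALL", "ESP6500_ALL")


def max_population_freq(row):
    """Highest frequency reported across the assumed columns (0.0 if none)."""
    freqs = []
    for col in FREQ_COLS:
        raw = (row.get(col) or "").strip()
        try:
            freqs.append(float(raw))
        except ValueError:
            pass  # empty or non-numeric: treated as not observed
    return max(freqs) if freqs else 0.0


def is_rare(row, max_freq=0.01):
    """True if no assumed database reports a frequency above max_freq."""
    return max_population_freq(row) <= max_freq


# Made-up rows, only to show the behaviour.
rare_row = {"1000g2012apr_ALL": "0.002", "ESP6500_ALL": ""}
common_row = {"1000g2012apr_ALL": "0.25", "ESP6500_ALL": "0.001"}
print(is_rare(rare_row), is_rare(common_row))
```

Taking the maximum across databases is the stricter choice: a variant is kept only if it is rare in every database that reports it.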
" ANNOVAR to get all the information you are looking for: allele frequencies (we get them from 1000genomes, ESP and COSMIC), genetic function (we get it from RefGene) or functional predictions (ljb_all contains up to 10 different functional scores including AVSIFT), but we also get conservated and segmental duplication sites, dbSNP code, clinical relevance reported in dbSNP"
I want to know how these scores are generated in the final output of ANNOVAR's summarize_annovar.pl. Are they predefined scores, or are they computed from the samples under analysis against the different databases that hold information on those nucleotides? Below is the table from the avsift.txt file. Can anyone explain what each column signifies, what the score in one of the columns is, and how it is used during annotation to assign a score to each mutation in the respective samples? Please let me know.
Can anyone tell me how the scores in the AVSIFT column of the ANNOVAR output CSV file (produced by summarize_annovar.pl) are organized? In the SIFT paper I see that SIFT scores greater than 0.05 are considered benign and scores below 0.05 deleterious, while the dbNSFP article describes new SIFT scores that run the opposite way, with higher scores denoting deleterious and lower scores neutral. I want to prioritize the mutated genes of my tumor sample on the basis of the AVSIFT scores. So what should the criterion be? Can anyone guide me on this?
The SIFT score is always as explained on their website: "Positions with normalized probabilities less than 0.05 are predicted to be deleterious, those greater than or equal to 0.05 are predicted to be tolerated." If you use the old "avsift" annotation then the numbers should match this description, but if you use the recent "ljb2_sift", as dbNSFP seems to be doing, then the numbers would be the opposite, as it uses "1-SIFT".
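Since the two conventions differ only by the "1-SIFT" transformation, a small helper can bring both onto the original SIFT scale before applying the 0.05 cutoff. This is a hedged Python sketch: the `inverted` flag is a hypothetical parameter that you set yourself after checking the documentation of the database you annotated with, and the helper does not detect the convention automatically.

```python
def original_sift(score, inverted=False):
    """Convert a stored score to the original SIFT scale.

    inverted=True means the stored value is 1 - SIFT.
    """
    return 1.0 - score if inverted else score


def is_sift_deleterious(score, inverted=False, cutoff=0.05):
    """Apply the SIFT rule: original score < cutoff is predicted deleterious."""
    return original_sift(score, inverted) < cutoff


# Original-scale score of 0.01, and the same prediction stored as 1 - SIFT.
print(is_sift_deleterious(0.01))
print(is_sift_deleterious(0.99, inverted=True))
```

With this in place, the same cutoff logic can be reused whichever annotation column the score came from.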
Thank you very much for the valuable input; this will help me in prioritizing my candidates.
I would like to know the meaning of the LJB_PhyloP score. It is said that a score >0.95 indicates pathogenicity and that the site is conserved. What conservation is it talking about? If I am not wrong, phyloP uses an HMM to construct lineage-specific selection scores for amino acid substitutions. So in a tumor sample, a nonsynonymous mutation with a phyloP score >0.95 would be at a conserved position and therefore deleterious, but I would like to know what conservation is being referred to in that context. Can anyone enlighten me?