Question

Indel Prioritizing Method To Extract Most Damaging Effect On Protein

10.4 years ago

ivivek_ngs ★ 5.2k

Dear All,

I have been using ANNOVAR for a while on my tumor samples. I do not have controls; rather, I am trying to understand the variants that are novel to my samples (not present in dbSNP) and common to each tumor and its corresponding IPS line. After annotation I extracted the nonsynonymous SNPs and prioritized the candidates using the functional scores that ANNOVAR reports in its output file. For indels, however, I do not get any scores in the ANNOVAR output.

I filtered my indels manually with GATK using ( --filterExpression "QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0" --filterName GATKStandard --missingValuesInExpressionsShouldEvaluateAsFailing ). I then annotated the resulting VCF file with ANNOVAR, but the exome_summary.csv in the ANNOVAR output shows no functional scores at all for the indels.

Is there any way to prioritize the candidates? Or should I just consider the large indels and rank them by quality score? Can anyone suggest a method for prioritizing indels from the ANNOVAR output without functional scores? I am most interested in frameshift indels, but how do I assess their impact on protein function from the ANNOVAR output if there are no functional scores to suggest how damaging they are?

There are also many genes in which I see indels marked as unknown, which I take to mean they have never been annotated. How should I use those, or prioritize among them? Is there any strategy that can be applied?

exome-sequencing annovar gatk • 2.6k views

updated 10.4 years ago by Alex Paciorkowski 3.5k • written 10.4 years ago by ivivek_ngs ★ 5.2k

score 2 · Answer 1 · 2013-12-09

A couple of points to think about:

1) How can you identify variants novel for your samples when you don't have controls? If these are human samples, you may be able to use some of the publicly available exome data (e.g. EVS, the Exome Variant Server).

2) What kind of functional scores would you like? SIFT and PolyPhen are meant for single-nucleotide variants, not indels, so as far as I know ANNOVAR wouldn't have anything to output for them.

3) In general, any indel that changes the reading frame is a strong suspect for affecting protein function, because the shifted frame will probably introduce a premature stop codon at some point. Non-frameshift (in-frame) indels are harder to judge: depending on their position and size, they may have functional consequences as well. This is one reason why interpreting indels is so much harder than interpreting SNVs (indel calls also contain more false positives).

4) One thing to consider is running both GATK and Pindel on your samples and comparing the output. Indels in the intersection of the two outputs (called by both tools) are less likely to be false positives, so this should at least improve your confidence in them. You can also get SNP arrays on your samples and compare those with the indel output.

5) This is where a lot of labs use pathway analysis and other prediction techniques to identify genes that are good candidates for the phenotype being studied, and from there evaluate whether the indels are likely to be causative, though that may need to be done at the bench. But you are right: if annotation of your genes is incomplete, this won't take you all that far. Still, many labs do use the "novel indels were found in genes in the blahblah pathway and not in controls" approach.
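To make points 3 and 4 concrete, here is a minimal Python sketch that keeps only indels reported by both callers and lists frameshifts ahead of in-frame events. It is an illustration, not something ANNOVAR, GATK, or Pindel does for you: the names `Indel`, `indel_class`, and `prioritize` are hypothetical, and it assumes both call sets have already been parsed and normalized to the same left-aligned REF/ALT representation and restricted to coding sequence. The length rule is only a first pass; it ignores indels that hit splice sites or start/stop codons.

```python
from typing import Iterable, List, NamedTuple


class Indel(NamedTuple):
    """Hypothetical minimal record for one indel call (not a library type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def indel_class(v: Indel) -> str:
    """Classify a coding indel by its REF/ALT length difference.

    Assumes the variant lies entirely within coding sequence.
    """
    delta = abs(len(v.alt) - len(v.ref))
    if delta == 0:
        return "not_indel"
    # A length change that is a multiple of 3 preserves the reading frame.
    return "inframe" if delta % 3 == 0 else "frameshift"


def prioritize(gatk_calls: Iterable[Indel],
               pindel_calls: Iterable[Indel]) -> List[Indel]:
    """Keep indels seen by both callers; frameshifts first, then in-frame."""
    shared = set(gatk_calls) & set(pindel_calls)
    frameshift = sorted(v for v in shared if indel_class(v) == "frameshift")
    inframe = sorted(v for v in shared if indel_class(v) == "inframe")
    return frameshift + inframe


# Made-up example calls, for illustration only.
gatk = [
    Indel("1", 100, "A", "AT"),     # 1 bp insertion
    Indel("1", 200, "ACGT", "A"),   # 3 bp deletion
    Indel("2", 50, "G", "GAA"),     # GATK only
]
pindel = [
    Indel("1", 100, "A", "AT"),
    Indel("1", 200, "ACGT", "A"),
    Indel("3", 10, "C", "CT"),      # Pindel only
]

for v in prioritize(gatk, pindel):
    print(v.chrom, v.pos, v.ref, v.alt, indel_class(v))
```

Exact matching on (chrom, pos, ref, alt) is deliberately strict. Two callers often report the same indel at slightly different positions, so in practice you would normalize both VCFs first, or allow a small position window when comparing.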