Question: Generating Protein Databases From SNP And Indel Information
Doug20 wrote:

Using NGS technology, I recently detected thousands of SNPs and indels in a yeast strain for which we have proteomic data. I wrote software to generate a protein database from these results. However, I ran into several troubling cases, including disruption of start codons, stop codons, and intron/exon boundaries. For each case, I made my own judgement calls and moved on, but I would like to compare my results to others'. Is anyone aware of software that generates protein FASTA files from genomic data? I currently have a .vcf file but could probably convert it into other usable formats if necessary.

I have looked into several variant effect predictor tools, including PolyPhen-2, ANNOVAR, snpEff, and the Ensembl Variant Effect Predictor. However, these tools are more focused on predicting phenotypic effect than on simply generating a FASTA file. They might do what I am looking for, but if so I haven't figured out how to do it. I would appreciate any input or feedback on this subject.

Zev.Kronenberg11k wrote:

Have you thought about annotating the variants using VAT, which is part of the VAAST suite? VCF->GVF->annotation is relatively easy.

You could then use these annotations to create protein sequences.
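
As a rough illustration of that last step, here is a minimal Python sketch that applies variants to a coding sequence and translates the result. It is not part of VAT or VAAST. It assumes a single-exon, forward-strand gene, variants already converted to 0-based CDS coordinates as (position, ref, alt) tuples, non-overlapping variants, and the standard genetic code. The function names (apply_variants, translate, variant_protein) are hypothetical, and the handling of lost start or stop codons is just one possible judgement call: they are flagged rather than repaired.

```python
# Standard genetic code, built in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((a, b, c) for a in BASES for b in BASES for c in BASES), AMINO_ACIDS
    )
}


def apply_variants(cds, variants):
    """Apply (pos, ref, alt) variants to a CDS string.

    Positions are 0-based CDS coordinates (an assumption of this sketch).
    Variants are applied right to left so earlier positions stay valid.
    """
    seq = cds.upper()
    for pos, ref, alt in sorted(variants, reverse=True):
        if seq[pos:pos + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at CDS position {pos}")
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq


def translate(cds):
    """Translate up to the first stop codon.

    Returns (protein, found_stop). A trailing partial codon is ignored
    and unknown codons become 'X'.
    """
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            return "".join(protein), True
        protein.append(aa)
    return "".join(protein), False


def variant_protein(cds, variants):
    """Return the variant protein plus flags for the awkward cases."""
    new_cds = apply_variants(cds, variants)
    protein, found_stop = translate(new_cds)
    flags = {
        "start_lost": not new_cds.startswith("ATG"),
        "stop_lost": not found_stop,
    }
    return protein, flags


if __name__ == "__main__":
    toy_cds = "ATGGCTGAATAA"  # toy sequence, not a real yeast gene
    print(variant_protein(toy_cds, [(4, "C", "T")]))
```

Flagging instead of silently fixing these cases keeps the decision visible, so entries with a lost start or stop can be reviewed or excluded before the database is used for searching.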

Larry_Parnell16k wrote:

Because your genome encodes mostly non-spliced or single-exon protein-coding genes, I think that the analysis approach would be rather straightforward. Thus, what comes to mind is the analysis pipeline followed by those looking into pathogenic outbreaks such as EHEC/EAEC O104:H4 in Germany last summer. While the focus of that and similar studies was genome sequencing without proteomic data, they likely employed a rapid screen to identify protein-based differences between a standard, benign strain and the one (or several) isolated during the outbreak.

This topic is not my forte. Just an idea that comes to mind.
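
A rapid screen of the kind mentioned above could look roughly like the following Python sketch. It is not taken from any outbreak pipeline. It assumes two protein FASTA files (reference strain and sequenced strain) that share identifiers, and it reports the 1-based position of the first differing residue per protein. The function names and the toy records are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}.

    The identifier is the first whitespace-delimited token after '>'.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def first_difference(ref_seq, alt_seq):
    """Return the 0-based index of the first difference, or None if equal."""
    for i, (a, b) in enumerate(zip(ref_seq, alt_seq)):
        if a != b:
            return i
    if len(ref_seq) != len(alt_seq):
        # One sequence is a prefix of the other (truncation or extension).
        return min(len(ref_seq), len(alt_seq))
    return None


def screen_proteomes(reference, sample):
    """Map each changed protein to the 1-based position of its first change.

    A value of None means the protein is absent from the sample set.
    Unchanged proteins are left out of the report.
    """
    report = {}
    for name, ref_seq in reference.items():
        alt_seq = sample.get(name)
        if alt_seq is None:
            report[name] = None
            continue
        pos = first_difference(ref_seq, alt_seq)
        if pos is not None:
            report[name] = pos + 1
    return report


if __name__ == "__main__":
    # Toy records with invented sequences.
    ref = parse_fasta(">geneA\nMAE\n>geneB\nMKT\n")
    alt = parse_fasta(">geneA\nMVE\n>geneB\nMKT\n")
    print(screen_proteomes(ref, alt))
```

Reporting only the first difference is enough to shortlist proteins for closer inspection. A frameshift changes everything downstream, so a full residue-by-residue diff would add little for those cases.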

