Question

Generating Protein Databases From SNP And Indel Information

2


12.6 years ago

Doug ▴ 20

Using NGS technology, I recently detected thousands of SNPs and indels in a yeast strain for which we have proteomic data. I wrote software to generate a protein database from these results. However, I ran into several troubling cases, including disruption of start codons, stop codons, and intron/exon boundaries. For each case, I made my own judgement call and moved on, but I would like to compare my results to others'. Is anyone aware of software that generates protein FASTA files from genomic data? I currently have a .vcf file but could probably convert it into other usable formats if necessary.

I have looked into several variant effect predictor tools, including PolyPhen-2, ANNOVAR (coding_change.pl), SnpEff, and the Ensembl Variant Effect Predictor. However, these tools are more focused on predicting phenotypic effect than on simply generating a FASTA file. They might do what I am looking for, but if so, I haven't figured out how to do it. I would appreciate any input or feedback on this subject.

proteomics vcf fasta • 3.0k views

updated 12.6 years ago by Zev.Kronenberg 12k • written 12.6 years ago by Doug ▴ 20

score 1 · Answer 1 · 2011-11-22

1


12.4 years ago

Zev.Kronenberg 12k

Have you thought about annotating the variants using VAT, which is part of the VAAST suite? VCF->GVF->annotation is relatively easy.


0


You could then use these annotations to create protein sequences.
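As a rough illustration of that last step, here is a minimal sketch of turning variants into a protein sequence. It is not what VAT or any other tool does internally. It assumes a single-exon gene on the forward strand, variants given as hypothetical (pos, ref, alt) tuples with 1-based positions relative to the coding sequence, non-overlapping variants, and the standard nuclear genetic code. The function names are made up for the example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDDDEEEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def apply_variants(cds, variants):
    """Apply (pos, ref, alt) variants to a coding sequence.

    Positions are 1-based and relative to the CDS (an assumption of this
    sketch). Variants are applied right to left so that an indel does not
    shift the coordinates of variants that still have to be applied.
    """
    seq = cds.upper()
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        start = pos - 1
        if seq[start:start + len(ref)] != ref.upper():
            raise ValueError(f"REF allele {ref!r} does not match CDS at {pos}")
        seq = seq[:start] + alt.upper() + seq[start + len(ref):]
    return seq


def translate(seq):
    """Translate from the first base and stop at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def fasta_record(name, protein, width=60):
    """Format one protein as a FASTA record with wrapped sequence lines."""
    lines = [protein[i:i + width] for i in range(0, len(protein), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    cds = "ATGGCTAAATGA"                 # toy CDS: Met-Ala-Lys-stop
    variants = [(5, "C", "A")]           # SNP changing GCT (Ala) to GAT (Asp)
    mutated = apply_variants(cds, variants)
    print(fasta_record("toy_gene_variant", translate(mutated)), end="")
```

The sketch deliberately leaves the hard cases open. A lost start codon is simply translated from the first base, a lost stop codon ends translation at the end of the CDS, and splice-site changes are not modelled at all, so each of those still needs an explicit policy.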

12.4 years ago by Zev.Kronenberg 12k

score 0 · Answer 2 · 2011-09-29

Because your genome encodes mostly non-spliced or single-exon protein-coding genes, I think that the analysis approach would be rather straightforward. Thus, what comes to mind is the analysis pipeline followed by those looking into pathogenic outbreaks such as EHEC/EAEC O104:H4 in Germany last summer. While the focus of that and similar studies was genome sequencing without proteomic data, they likely employed a rapid screen to identify protein-based differences between a standard, benign strain and the one (or several) isolated during the outbreak.

This topic is not my forte. Just an idea that comes to mind.
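To make the idea of a rapid protein-level screen concrete, here is a minimal sketch that compares a reference protein FASTA against a variant one and reports where each shared protein first differs. It is not taken from any outbreak study. It assumes both files use the same identifier as the first word of each header, and the function names and toy records are invented for the example.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}.

    The identifier is the first whitespace-delimited word of the header.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def first_difference(a, b):
    """Return the 1-based position where a and b first differ, else None."""
    for i, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return i
    if len(a) != len(b):
        # One sequence is a prefix of the other (e.g. premature stop).
        return min(len(a), len(b)) + 1
    return None


def diff_proteomes(reference, variant):
    """Map each shared protein that differs to its first differing position."""
    changes = {}
    for name in sorted(reference.keys() & variant.keys()):
        pos = first_difference(reference[name], variant[name])
        if pos is not None:
            changes[name] = pos
    return changes


if __name__ == "__main__":
    ref = parse_fasta(">geneA reference\nMAK\n>geneB\nMDKL\n")
    var = parse_fasta(">geneA\nMAK\n>geneB\nMDRL\n")
    for name, pos in diff_proteomes(ref, var).items():
        print(f"{name}: first difference at residue {pos}")
```

Reporting only the first differing position keeps the screen cheap. A frameshift or premature stop shows up as a single early difference, and anything flagged can then be aligned properly.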