Question: VCF to UniProt Mutation
3.3 years ago by
ostrokach280
Canada
ostrokach280 wrote:

Does anyone know of a tool for converting SNPs in VCF format to amino acid mutations in UniProt proteins?


I know snpEff can do this for Ensembl variants.

For example, for the VCF file with the line:

1   69538   COSM75742   G   A   .   .

snpEff adds the following annotation:

 1  69538   COSM75742   G   A   .   .   ANN=A|missense_variant|MODERATE|OR4F5|ENSG00000186092|transcript|ENST00000335137.3|protein_coding|1/1|c.448G>A|p.Val150Met|448/918|448/918|150/305||

I am looking for something that would give me the UniProt ID and the protein mutation mapped to the UniProt sequence.
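To make the starting point concrete, here is a minimal Python sketch that unpacks the snpEff ANN field shown above into named fields and pulls the wild-type residue, position, and mutant residue out of the HGVS protein string. The field labels in `ANN_FIELDS` are my own shorthand for the 16 pipe-separated ANN subfields, and `parse_ann` / `parse_hgvs_p` are hypothetical helper names, not part of snpEff. Only simple missense notation (e.g. p.Val150Met) is handled.

```python
import re

# Shorthand labels (an assumption of this sketch) for the 16 ANN subfields.
ANN_FIELDS = [
    "allele", "annotation", "impact", "gene_name", "gene_id", "feature_type",
    "feature_id", "transcript_biotype", "rank", "hgvs_c", "hgvs_p",
    "cdna_pos", "cds_pos", "aa_pos", "distance", "errors",
]

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def parse_ann(info):
    """Return one dict per annotation found in the ANN= entry of a VCF INFO string."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            return [
                dict(zip(ANN_FIELDS, ann.split("|")))
                for ann in entry[len("ANN="):].split(",")
            ]
    return []


def parse_hgvs_p(hgvs_p):
    """Parse a simple missense string such as 'p.Val150Met' into ('V', 150, 'M')."""
    m = re.fullmatch(r"p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})", hgvs_p)
    if m is None:
        return None
    wt, pos, mut = m.groups()
    if wt not in THREE_TO_ONE or mut not in THREE_TO_ONE:
        return None
    return THREE_TO_ONE[wt], int(pos), THREE_TO_ONE[mut]


# The INFO column from the example line above.
info = ("ANN=A|missense_variant|MODERATE|OR4F5|ENSG00000186092|transcript|"
        "ENST00000335137.3|protein_coding|1/1|c.448G>A|p.Val150Met|"
        "448/918|448/918|150/305||")
record = parse_ann(info)[0]
print(record["feature_id"], parse_hgvs_p(record["hgvs_p"]))
```

This gives the Ensembl transcript and the protein change relative to the Ensembl translation; what is still missing is the step that carries the position over to a UniProt sequence.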

Tags: snp, uniprot, snpeff
modified 3.0 years ago • written 3.3 years ago by ostrokach280
3.0 years ago by
ostrokach280
Canada
ostrokach280 wrote:

The best tool that I could find for annotating VCF files with UniProt mutations is Oncotator. It explicitly provides "Site-specific protein annotations from UniProt".

Alternatively, you can annotate VCF files with Ensembl mutations, and then map Ensembl to UniProt using pairwise sequence alignments between proteins mapped to the same gene.
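A rough sketch of the alignment route, in plain Python so that it has no dependencies: a small Needleman-Wunsch global alignment that maps 1-based positions in the Ensembl protein onto the UniProt protein, followed by a check that the wild-type residue is actually present at the mapped position. The scoring values, the function names, and the two toy sequences are assumptions made for illustration; for real proteins you would want a proper aligner with a substitution matrix.

```python
def align_positions(ens_seq, uni_seq, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences and return {ensembl_pos: uniprot_pos} (1-based)."""
    n, m = len(ens_seq), len(uni_seq)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ens_seq[i - 1] == uni_seq[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in the UniProt sequence
                score[i][j - 1] + gap,      # gap in the Ensembl sequence
            )
    # Trace back from the bottom-right corner, recording aligned residue pairs.
    mapping = {}
    i, j = n, m
    while i > 0 and j > 0:
        sub = match if ens_seq[i - 1] == uni_seq[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            mapping[i] = j
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return mapping


def map_mutation(ens_seq, uni_seq, wt, pos, mut):
    """Map an Ensembl-numbered mutation to UniProt numbering, or return None."""
    uni_pos = align_positions(ens_seq, uni_seq).get(pos)
    if uni_pos is None or uni_seq[uni_pos - 1] != wt:
        return None  # position falls in a gap, or the wild type disagrees
    return f"{wt}{uni_pos}{mut}"


# Toy sequences (made up): the UniProt isoform has two extra N-terminal residues.
ensembl_protein = "MKTAYIAKQR"
uniprot_protein = "MSMKTAYIAKQR"
print(map_mutation(ensembl_protein, uniprot_protein, "A", 4, "V"))
```

Returning None when the wild-type residue disagrees is deliberate: it is better to drop a mutation than to silently assign it to the wrong UniProt position.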

modified 3.0 years ago • written 3.0 years ago by ostrokach280
3.3 years ago by
Geneva
Elisabeth Gasteiger1.6k wrote:

On the UniProt FTP site, you can find files for amino acid altering variants imported from Ensembl Variation databases. Mapped sequence variants are supplied per species in tab delimited text files. ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/variants/

In particular, the human file is homo_sapiens_variation.txt.gz: The variants listed are the Ensembl Variation databases' set of 1000 Genomes project (http://www.1000genomes.org/) and Catalogue of Somatic Mutations In Cancer (COSMIC) v71, imported directly from COSMIC and via Ensembl Variation, protein altering variants (SO:0001583). COSMIC v71 variants are the last freely available somatic variants from COSMIC before their licence change; therefore the accuracy of the information provided for a COSMIC variant should be verified with COSMIC. (Text from README file in that directory)

These files should help you map from Ensembl to UniProt for these variants.

Please don't hesitate to contact the UniProt helpdesk in case of questions.

written 3.3 years ago by Elisabeth Gasteiger1.6k

Thank you for your answer! As you point out, Ensembl and consequently UniProt only have access to COSMIC v71. One of the things I am trying to accomplish is to map variants in a more recent version of COSMIC to UniProt.

reply written 3.3 years ago by ostrokach280
3.3 years ago by
proteins-ebi10
EMBL-EBI
proteins-ebi10 wrote:

Hi

The protein dataservices (http://www.ebi.ac.uk/uniprot/api/doc/swagger/#!/coordinates/search) may be able to provide a solution to your problem, although at this stage they will not return the protein sequence mapping when given a single-nucleotide genomic coordinate. If you have the ENSG/ENST/ENSP identifiers, you can get the genomic coordinates of each exon that is translated into the final protein sequence. The coordinate service returns the protein sequence range within each exon. From there you can calculate the protein sequence position and get the wild-type amino acid.
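The calculation in the last step can be sketched roughly as follows in Python. The function name and the input layout (a list of coding segments as 1-based inclusive genomic ranges, in transcript order) are assumptions of this sketch and do not reproduce the response format of the coordinate service. The example segment is inferred from the snpEff annotation in the question (c.448 of 918 at position 69538, exon rank 1/1), assuming a single coding exon on the forward strand.

```python
def genomic_to_protein_pos(cds_segments, strand, genomic_pos):
    """Convert a genomic coordinate to (protein position, base within codon).

    cds_segments: (start, end) genomic ranges of the coding sequence,
                  1-based inclusive, listed in transcript order.
    strand:       "+" or "-".
    Returns None when the coordinate is not inside any coding segment.
    """
    offset = 0  # number of coding bases in the segments already passed
    for start, end in cds_segments:
        if start <= genomic_pos <= end:
            within = genomic_pos - start if strand == "+" else end - genomic_pos
            cds_index = offset + within          # 0-based index in the CDS
            return cds_index // 3 + 1, cds_index % 3
        offset += end - start + 1
    return None


# Coding segment inferred from c.448G>A (448/918) at chr1:69538 in the question.
or4f5_cds = [(69091, 70008)]
print(genomic_to_protein_pos(or4f5_cds, "+", 69538))
```

With the protein position in hand, the wild-type amino acid is simply the residue at that position in the protein sequence; comparing it with the expected residue is a cheap way to catch a wrong transcript or isoform.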

If the COSMIC variant existed in v71 of COSMIC, you can get all the annotation information UniProtKB holds concerning the variant from the variation dataservice, taking the UniProt accession as your starting point.

Both the coordinate and variation dataservice will return data for reviewed canonical sequences, isoforms and unreviewed TrEMBL entries.

written 3.3 years ago by proteins-ebi10
3.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

Use VEP? http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_uniprot

written 3.3 years ago by Pierre Lindenbaum123k

I am hesitant to use VEP (or other web services) because to me they are black boxes (I am not a Perl expert) and they do not scale to millions of mutations. As far as I understand, VEP relies on the Ensembl Core and Variation databases. However, those databases map to UniProt through gene identifiers (ENSG) rather than protein identifiers (ENST / ENSP) and, therefore, carry no sequence information.

reply modified 3.3 years ago • written 3.3 years ago by ostrokach280

"they do not scale to millions of mutations."

It does. You can download VEP as standalone software plus a cache database.

"because to me they are black boxes (I am not a Perl expert)"

And snpEff is not a black box?

reply written 3.3 years ago by Pierre Lindenbaum123k

The reason I called VEP a black box is that it isn't clear where the data comes from (this is my gripe with BioMart as well). It is easy to see the UniProt checkbox and assume that it does what you want it to do. But then several steps down your pipeline you realise that 30% of your mutations don't match the UniProt sequence that they are supposed to mutate (been there, done that).
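The kind of sanity check I mean can be sketched in a few lines of Python: for each mutation written as wild-type residue, position, mutant residue (e.g. V150M), confirm that the UniProt sequence really has the wild-type residue at that position. The function name and the toy sequence are made up for illustration.

```python
import re

MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def check_mutations(uniprot_seq, mutations):
    """Split mutations into those consistent with the sequence and those not."""
    consistent, inconsistent = [], []
    for mutation in mutations:
        m = MUTATION_RE.match(mutation)
        if m is not None:
            wt, pos = m.group(1), int(m.group(2))
            # Positions are 1-based; the wild-type residue must match.
            if 1 <= pos <= len(uniprot_seq) and uniprot_seq[pos - 1] == wt:
                consistent.append(mutation)
                continue
        inconsistent.append(mutation)
    return consistent, inconsistent


# Toy sequence (made up) and a mix of good and bad mutations.
toy_seq = "MSMKTAYIAKQR"
good, bad = check_mutations(toy_seq, ["A6V", "K6V", "R99W", "bogus"])
print(f"{len(bad)} of {len(good) + len(bad)} mutations do not match the sequence")
```

Running this right after the annotation step, rather than several steps later, makes a mismatch rate like the one above visible immediately.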

reply written 3.3 years ago by ostrokach280