VCF to Uniprot Mutation
4
3
6.0 years ago
ostrokach ▴ 340

Does anyone know of a tool for converting SNPs in VCF format to amino acid mutations in UniProt proteins?

I know snpEff can do this for Ensembl variants.

For example, for the VCF file with the line:

1   69538   COSM75742   G   A   .   .


snpEff adds the following annotation:

 1  69538   COSM75742   G   A   .   .   ANN=A|missense_variant|MODERATE|OR4F5|ENSG00000186092|transcript|ENST00000335137.3|protein_coding|1/1|c.448G>A|p.Val150Met|448/918|448/918|150/305||
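For anyone who wants to pull the protein change out of that annotation programmatically, here is a minimal sketch. The field names in `ANN_FIELDS` are my own labels for the 16 pipe-separated values seen in the example line, and `parse_ann` / `parse_hgvs_p` are hypothetical helper names, not part of snpEff. It only handles simple missense notation such as `p.Val150Met`.

```python
import re

# Labels for the pipe-separated ANN values, in the order seen in the example line.
# These names are illustrative assumptions, not identifiers defined by snpEff.
ANN_FIELDS = [
    "allele", "effect", "impact", "gene_name", "gene_id", "feature_type",
    "feature_id", "biotype", "rank", "hgvs_c", "hgvs_p",
    "cdna_pos", "cds_pos", "protein_pos", "distance", "messages",
]

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def parse_ann(info):
    """Return one dict per annotation found in the ANN entry of an INFO string."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            annotations = entry[len("ANN="):].split(",")
            return [dict(zip(ANN_FIELDS, a.split("|"))) for a in annotations]
    return []


def parse_hgvs_p(hgvs_p):
    """Turn 'p.Val150Met' into ('V', 150, 'M'); return None for other notations."""
    m = re.fullmatch(r"p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})", hgvs_p)
    if m is None:
        return None
    ref, pos, alt = m.groups()
    if ref not in THREE_TO_ONE or alt not in THREE_TO_ONE:
        return None
    return THREE_TO_ONE[ref], int(pos), THREE_TO_ONE[alt]


# The annotated line from the example, rebuilt with tab separators.
line = "\t".join([
    "1", "69538", "COSM75742", "G", "A", ".", ".",
    "ANN=A|missense_variant|MODERATE|OR4F5|ENSG00000186092|transcript|"
    "ENST00000335137.3|protein_coding|1/1|c.448G>A|p.Val150Met|"
    "448/918|448/918|150/305||",
])

ann = parse_ann(line.split("\t")[7])[0]
mutation = parse_hgvs_p(ann["hgvs_p"])
print(ann["feature_id"], mutation)
```

This gives the Ensembl transcript and the residue change relative to the Ensembl protein; it says nothing yet about the UniProt sequence.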


I am looking for something that would give me the UniProt ID and the protein mutation mapped to the UniProt sequence.

SNP uniprot snpeff • 3.0k views
2
5.7 years ago
ostrokach ▴ 340

The best tool that I could find for annotating VCF files with UniProt mutations is Oncotator. It explicitly provides "Site-specific protein annotations from UniProt".

Alternatively, you can annotate VCF files with Ensembl mutations and then map Ensembl to UniProt using pairwise sequence alignments between proteins mapped to the same gene.
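A rough sketch of the alignment-based mapping, assuming you already have the Ensembl and UniProt protein sequences as strings. The scoring (match 1, mismatch -1, gap -2) and the toy sequences are assumptions made for illustration; in practice you would probably use an established aligner with a substitution matrix rather than this hand-rolled Needleman-Wunsch. `global_align` and `map_position` are hypothetical helper names.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def map_position(aligned_src, aligned_dst, src_pos):
    """Map a 1-based residue number in the source to the destination sequence.

    Returns None when the source residue is aligned to a gap.
    """
    src_i = dst_i = 0
    for s, d in zip(aligned_src, aligned_dst):
        if s != "-":
            src_i += 1
        if d != "-":
            dst_i += 1
        if s != "-" and src_i == src_pos:
            return dst_i if d != "-" else None
    return None


# Toy sequences (not real proteins): the "UniProt" one has two extra N-terminal residues.
ensembl_seq = "MKTAYIAKQR"
uniprot_seq = "GSMKTAYIAKQR"
aln_ens, aln_uni = global_align(ensembl_seq, uniprot_seq)
uniprot_pos = map_position(aln_ens, aln_uni, 4)
print(aln_ens, aln_uni, uniprot_pos)
```

After mapping, it is worth confirming that the residue at the mapped UniProt position equals the reference amino acid of the mutation before trusting the result.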

1
6.0 years ago

On the UniProt FTP site, you can find files of amino-acid-altering variants imported from the Ensembl Variation databases. Mapped sequence variants are supplied per species in tab-delimited text files: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/variants/

In particular, the human file is homo_sapiens_variation.txt.gz: The variants listed are the Ensembl Variation databases' set of 1000 Genomes project (http://www.1000genomes.org/) and Catalogue of Somatic Mutations In Cancer (COSMIC) v71, imported directly from COSMIC and via Ensembl Variation, protein altering variants (SO:0001583). COSMIC v71 variants are the last freely available somatic variants from COSMIC before their licence change; therefore the accuracy of the information provided for a COSMIC variant should be verified with COSMIC. (Text from README file in that directory)

These files should help you map from Ensembl to UniProt for these variants.

Please don't hesitate to contact the UniProt helpdesk in case of questions.

0

Thank you for your answer! As you point out, Ensembl and consequently UniProt only have access to COSMIC v71. One of the things I am trying to accomplish is to map variants in a more recent version of COSMIC to UniProt.

1
6.0 years ago
proteins-ebi ▴ 10

Hi

Protein dataservices: http://www.ebi.ac.uk/uniprot/api/doc/swagger/#!/coordinates/search may be able to provide a solution to your problem, though at this stage it will not return the protein sequence mapping when given a single-nucleotide genomic coordinate. If you have the ENSG/ENST/ENSP identifiers, you can get the genomic coordinates for each exon transcribed into the final protein sequence. The coordinate service returns the protein sequence range within each exon. From there you will be able to calculate the protein sequence location and get the wild-type amino acid.
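The calculation from exon coordinates to a protein position can be sketched as below. This assumes you have already reduced the service's output to a list of coding genomic ranges in transcription order; that input layout, the function name `genomic_to_protein_pos`, and the toy coordinates are all assumptions for illustration and do not reflect the actual response format of the coordinate service.

```python
def genomic_to_protein_pos(coding_exons, strand, genomic_pos):
    """Map a genomic coordinate to a protein residue.

    coding_exons: list of (start, end) inclusive genomic ranges covering only
        the coding part of each exon, listed in transcription order.
    strand: "+" or "-".
    Returns (1-based residue number, 0-based position within the codon),
    or None if the coordinate is not in any coding range.
    """
    cds_offset = 0  # coding nucleotides in the exons already passed
    for start, end in coding_exons:
        if start <= genomic_pos <= end:
            if strand == "+":
                within = genomic_pos - start
            else:
                within = end - genomic_pos
            cds_index = cds_offset + within  # 0-based index into the CDS
            return cds_index // 3 + 1, cds_index % 3
        cds_offset += end - start + 1
    return None


# Toy forward-strand transcript with two coding ranges (hypothetical coordinates).
forward_exons = [(100, 109), (200, 210)]
print(genomic_to_protein_pos(forward_exons, "+", 205))

# The same ranges on the reverse strand, listed in transcription order.
reverse_exons = [(200, 210), (100, 109)]
print(genomic_to_protein_pos(reverse_exons, "-", 109))
```

Counting coding nucleotides cumulatively, rather than working exon by exon from the protein ranges, avoids having to handle codons that are split across an exon boundary as a special case.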

If the COSMIC variant existed in v71 of COSMIC, you can get all the annotation information UniProtKB holds concerning the variant using the variation dataservice, taking the UniProt accession as your starting point.

Both the coordinate and variation dataservice will return data for reviewed canonical sequences, isoforms and unreviewed TrEMBL entries.

1
6.0 years ago
0

I am hesitant to use VEP (or other web services) because to me they are black boxes (I am not a Perl expert) and they do not scale to millions of mutations. As far as I understand, VEP relies on the Ensembl Core and Variation databases. However, those databases map to UniProt through gene identifiers (ENSG) rather than transcript or protein identifiers (ENST / ENSP) and therefore carry no sequence information.

0

they do not scale to millions of mutations.

It does. You can download VEP as standalone software plus a cache database.

because to me they are black boxes (I am not a Perl expert)

And snpEff is not?

0

The reason I called VEP a black box is that it isn't clear where the data comes from (this is my gripe with Biomart as well). It is easy to see the UniProt checkbox and feel that it does what you want it to do. But then several steps down your pipeline you realise that 30% of your mutations don't match the UniProt sequence that they are supposed to mutate (been there, done that).
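One way to catch that kind of mismatch early is to check every mutation against the sequence it is supposed to mutate, right after the mapping step. A minimal sketch, assuming mutations are written like `V150M` and sequences are held in a dict keyed by accession; the helper names, the `TOY001` accession, and the toy sequence are made up for illustration.

```python
import re


def matches_reference(sequence, mutation):
    """True if the sequence has the mutation's wild-type residue at its position."""
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if m is None:
        raise ValueError(f"unsupported mutation format: {mutation!r}")
    ref, pos = m.group(1), int(m.group(2))
    return 1 <= pos <= len(sequence) and sequence[pos - 1] == ref


def find_mismatches(sequences, mutations):
    """Return the (accession, mutation) pairs that do not fit their sequence.

    sequences: dict mapping accession -> protein sequence.
    mutations: iterable of (accession, mutation) pairs.
    A missing accession counts as a mismatch.
    """
    failed = []
    for accession, mutation in mutations:
        sequence = sequences.get(accession)
        if sequence is None or not matches_reference(sequence, mutation):
            failed.append((accession, mutation))
    return failed


# Toy data: a placeholder accession and sequence, not a real UniProt entry.
toy_sequences = {"TOY001": "MKTAYIAKQR"}
toy_mutations = [
    ("TOY001", "A4G"),   # residue 4 is A, so this one fits
    ("TOY001", "K4G"),   # wrong wild-type residue
    ("TOY001", "R50A"),  # position beyond the end of the sequence
]
failed = find_mismatches(toy_sequences, toy_mutations)
print(f"{len(failed)} of {len(toy_mutations)} mutations do not match")
```

Running this on the full set of mapped mutations gives the mismatch rate up front instead of several steps down the pipeline.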