Navigating Public Data For A Variant / Variant Annotation
13.2 years ago
Biomed 5.0k

I have thousands of nucleotide changes such as the following (coordinates are based on hg18):

chr   position   ref nucleotide  var nucleotide
chr11 1112345    A               T

First, I want to identify the most important (or all?) transcripts that this change occurs in, find the gene name, and find the amino acid change. If the change falls in a Swiss-Prot protein domain, I also want the name of that protein so I can use the PolyPhen-2 pre-annotated data to predict deleteriousness.

How can I do this using publicly available datasets? I can program, but I need guidance in the form of "You need to get the transcript name from this table, which you can download from (e.g. the UCSC Table Browser or this FTP address), and find the AA position of your change in this related table", and so on. Although this is not a programming question, any simple solutions would also be appreciated.

There is a similar discussion here and here. Although there are very good answers in those discussions, I am looking for a more detailed and practical recipe than general information like "you can get this from Ensembl or BioMart".

Please feel free to answer only parts of this question, as we can build a pipeline out of the answers. Thanks.

variant annotation • 3.9k views
13.2 years ago
Chaim ▴ 40

You can also try SeattleSeq Annotation and SeqAnt.

13.2 years ago

For a pre-built solution to this, snpEff is very good.

Here's a Python script that uses snpEff to calculate the effects directly from a VCF file.

If you want to dig into the Ensembl databases, this Clojure code queries the public MySQL databases for transcripts associated with variations by rs name. The function variation-genes is the entry point, and this could help with navigating the tables.
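
Since snpEff takes VCF, the four-column rows from the question would first need converting. A minimal sketch of that conversion is below; the function name `rows_to_vcf` is made up for illustration, the header is the bare minimum, and the chromosome names are passed through unchanged, which assumes they match the naming used by whatever annotation database you pick. Coordinates stay on the input build (hg18 in the question), so the database must be for the same build.

```python
def rows_to_vcf(lines):
    """Turn 'chr position ref var' rows into minimal VCF lines (hypothetical helper)."""
    out = [
        "##fileformat=VCFv4.0",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for line in lines:
        fields = line.split()
        # Skip blank lines, the column header, and anything malformed.
        if len(fields) != 4 or not fields[1].isdigit():
            continue
        chrom, pos, ref, alt = fields
        # ID, QUAL, FILTER and INFO are unknown here, so they are set to '.'.
        out.append("\t".join([chrom, pos, ".", ref.upper(), alt.upper(), ".", ".", "."]))
    return out


if __name__ == "__main__":
    example = [
        "chr   position   ref nucleotide  var nucleotide",
        "chr11 1112345    A               T",
    ]
    print("\n".join(rows_to_vcf(example)))
```

This only handles single-nucleotide changes like the one in the question; insertions and deletions need the VCF anchoring-base convention and are not covered.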


Actually, snpEff supports VCF input (see option -vcf4 in the manual: http://snpeff.sourceforge.net/manual.html) :-)


Sorry Pablo, that's exactly what I was trying to say in my answer. I re-phrased it to be more clear. Thanks.

13.2 years ago

I wrote a program, VCFAnnotator, for annotating VCF files. It only uses resources from UCSC. The code is available on GitHub. See my post.

13.2 years ago
Mutated_Dater ▴ 290

I'm new to this field, but this is what I would do based on my limited knowledge.

Step 1) Run the Ensembl SNP Effect Predictor. It's a Perl script that reports the genes/transcripts affected by a variant (though you will have to reformat your data into the appropriate input format). Here is some example output:

Uploaded Variation      Location        Gene    Transcript      Consequence     Position in cDNA        Position in protein     Amino acid change       Corresponding Variation
1_63268_T/C     1:63268 ENSG00000240361 ENST00000492842 WITHIN_NON_CODING_GENE  -       -       -       rs28664618
1_69511_A/G     1:69511 ENSG00000177693 ENST00000326183 NON_SYNONYMOUS_CODING   457     141     T/A     rs2691305
1_565901_G/A    1:565901        ENSG00000248149 ENST00000503254 WITHIN_NON_CODING_GENE  -       -       -       rs7411575
1_565901_G/A    1:565901        ENSG00000230021 ENST0

Step 2) Parse the results and keep the consequences that are actually inside a transcript. For example, NON_SYNONYMOUS_CODING is inside a transcript, whereas an upstream consequence is not, although the nearby transcript is still given in the results. Use the Ensembl Perl API to fetch these transcripts, get the Swiss-Prot domains for each one, and check whether a domain overlaps your SNP. All of these methods are documented on the Ensembl website for the API. Note that the predictor script can give more than one consequence per SNP, because consequences aren't mutually exclusive (e.g. a synonymous SNP could also be in a splice site, so you will get 2 rows of output for that SNP), so bear that in mind.

Step 3) I've never heard of the PolyPhen pre-annotated data, but hopefully the above will give you enough information to use it.
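
The parsing in Step 2 can be sketched roughly as follows, assuming the predictor output is tab-separated with the columns shown in the example output above. The set of "outside a transcript" consequence names and the domain tuples are assumptions for illustration only; check the consequence names your version of the script actually emits, and take real domain coordinates from the Ensembl API.

```python
import csv
import io
from collections import defaultdict

# Assumed names for consequences that lie outside any transcript.
OUTSIDE_TRANSCRIPT = {"UPSTREAM", "DOWNSTREAM", "INTERGENIC"}


def parse_predictor_output(text):
    """Group in-transcript rows by variant; one variant may have several rows."""
    by_variant = defaultdict(list)
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        # Skip the header, comments, and truncated rows.
        if len(row) < 8 or row[0].startswith(("#", "Uploaded")):
            continue
        variant, _location, gene, transcript, consequence, _cdna, prot_pos, aa = row[:8]
        if consequence in OUTSIDE_TRANSCRIPT:
            continue
        by_variant[variant].append({
            "gene": gene,
            "transcript": transcript,
            "consequence": consequence,
            # '-' means the variant does not map to a protein position.
            "protein_pos": int(prot_pos) if prot_pos.isdigit() else None,
            "aa_change": None if aa == "-" else aa,
        })
    return dict(by_variant)


def domains_hit(protein_pos, domains):
    """Names of domains whose inclusive (start, end) range contains the residue."""
    return [name for name, start, end in domains if start <= protein_pos <= end]


# Two complete rows copied from the example output above.
sample = (
    "1_63268_T/C\t1:63268\tENSG00000240361\tENST00000492842\t"
    "WITHIN_NON_CODING_GENE\t-\t-\t-\trs28664618\n"
    "1_69511_A/G\t1:69511\tENSG00000177693\tENST00000326183\t"
    "NON_SYNONYMOUS_CODING\t457\t141\tT/A\trs2691305\n"
)

if __name__ == "__main__":
    hypothetical_domains = [("example_domain", 100, 200)]  # made-up coordinates
    for variant, rows in parse_predictor_output(sample).items():
        for r in rows:
            if r["protein_pos"] is not None:
                print(variant, r["transcript"], r["aa_change"],
                      domains_hit(r["protein_pos"], hypothetical_domains))
```

Keeping a list of rows per variant, rather than one row, is what handles the "more than one consequence per SNP" caveat from Step 2.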
