Question

Navigating Public Data For A Variant / Variant Annotation

2

14.5 years ago

Biomed 5.0k

I have thousands of nucleotide changes such as the following (coordinates are based on hg18):

chr   position   ref nucleotide  var nucleotide
chr11 1112345    A               T

First I want to identify the most important (or all?) transcripts that each change occurs in, find the gene name, find the amino acid change, and, if the change falls in a Swiss-Prot protein domain, find the name of that protein so I can use the PolyPhen-2 pre-annotated data to predict deleteriousness.

How can I do this using publicly available datasets? I can program, but I need guidance in the form of "You need to get the transcript name from this table, which you can download from (e.g. the UCSC Table Browser, or this FTP address), and find the AA position of your change in this related table", etc. Although this is not a programming question, any simple solutions would also be appreciated.

There is a similar discussion here and here. Although there are very good answers in those discussions, I am looking for a more detailed and practical recipe than general information like "you can get this from Ensembl or BioMart".

Please feel free to answer only parts of this question, as we can build a pipeline out of the answers. Thanks.

variant annotation • 4.5k views

updated 14.5 years ago by Brad Chapman 9.7k • written 14.5 years ago by Biomed 5.0k

Ram · Answer 1 · 2011-02-02

4

14.5 years ago

Chaim ▴ 40

You can also try SeattleSeq Annotation and SeqAnt.

updated 5.9 years ago by Ram 45k • written 14.5 years ago by Chaim ▴ 40

Ram · Answer 2 · 2011-02-02

3

14.5 years ago

Brad Chapman 9.7k

For a pre-built solution to this, snpEff is very good.

Here's a Python script that uses snpEff to calculate the effects directly from a VCF file.

If you want to dig into the Ensembl databases, this Clojure code queries the public MySQL databases for transcripts associated with variations by rs name. The function variation-genes is the entry point, and this could help with navigating the tables.

updated 5.9 years ago by Ram 45k • written 14.5 years ago by Brad Chapman 9.7k

0

Actually, snpEff supports VCF input (see option -vcf4: http://snpeff.sourceforge.net/manual.html) :-)

updated 5.9 years ago by Ram 45k • written 14.5 years ago by Pablo ★ 1.9k

0

Sorry Pablo, that's exactly what I was trying to say in my answer. I rephrased it to make it clearer. Thanks.

written 14.5 years ago by Brad Chapman 9.7k

Ram · Answer 3 · 2011-02-01

2

14.5 years ago

Pierre Lindenbaum 166k

I wrote a program, VCFAnnotator, for annotating VCF files. It only uses resources from UCSC. The code is available on GitHub. See my post.

updated 5.9 years ago by Ram 45k • written 14.5 years ago by Pierre Lindenbaum 166k

Ram · Answer 4 · 2011-02-02

I'm new to this field, but this is what I would do based on my limited knowledge:

Step 1) Run the Ensembl SNP Effect Predictor. It's a Perl script that reports the genes and transcripts affected by each variant (though you will have to reformat your data into its input format). Here is some example output:

Uploaded Variation      Location        Gene    Transcript      Consequence     Position in cDNA        Position in protein     Amino acid change       Corresponding Variation
1_63268_T/C     1:63268 ENSG00000240361 ENST00000492842 WITHIN_NON_CODING_GENE  -       -       -       rs28664618
1_69511_A/G     1:69511 ENSG00000177693 ENST00000326183 NON_SYNONYMOUS_CODING   457     141     T/A     rs2691305
1_565901_G/A    1:565901        ENSG00000248149 ENST00000503254 WITHIN_NON_CODING_GENE  -       -       -       rs7411575
1_565901_G/A    1:565901        ENSG00000230021 ENST0
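The reformatting mentioned in Step 1 could look roughly like the sketch below. It assumes the predictor accepts one variant per line as chromosome, start, end, ref/var allele string and strand; that layout is my recollection rather than something stated in this thread, so check it against the script's own documentation. The function names are made up for illustration. Also make sure the coordinates match the assembly of the Ensembl release you run against, since the question's positions are on hg18.

```python
def to_predictor_line(line):
    """Convert 'chr11 1112345 A T' into 'chrom<TAB>start<TAB>end<TAB>ref/var<TAB>strand'."""
    chrom, pos, ref, var = line.split()
    # Ensembl uses bare chromosome names ('11'), while UCSC-style names are 'chr11'.
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    # For a single-nucleotide change, start and end are the same base.
    # Assumption: the alleles are given on the forward strand.
    return "\t".join([chrom, pos, pos, f"{ref}/{var}", "+"])


def convert(lines):
    """Convert every data line, skipping blank lines and the header row."""
    converted = []
    for line in lines:
        fields = line.split()
        # A data line has exactly four fields and a numeric position.
        if len(fields) != 4 or not fields[1].isdigit():
            continue
        converted.append(to_predictor_line(line))
    return converted
```

Calling `convert(open("changes.txt"))` on a file in the question's layout (the file name is hypothetical) gives a list of lines you can write out and feed to the script.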

Step 2) Parse the results and keep the consequences that actually fall inside a transcript. For example, NON_SYNONYMOUS_CODING is inside a transcript, whereas UPSTREAM isn't, even though the nearby transcript is still reported in the results. Then use the Ensembl Perl API to fetch those transcripts, get the Swiss-Prot domains for each one, and check whether a domain overlaps your SNP. All of these API methods are documented on the Ensembl website. Note that the predictor script can give more than one consequence per SNP, because consequences aren't mutually exclusive (e.g. a synonymous SNP could also be in a splice site, so you will get two rows of output for that SNP), so you'll need to bear that in mind.
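The parsing part of Step 2 might be sketched like this, assuming the output is whitespace-separated with the nine columns shown in the example above. The short column names and the set of "outside a transcript" consequence terms are my own choices for illustration; adjust the set to whatever terms your version of the script emits. Rows are grouped by variant because one SNP can produce several rows.

```python
from collections import defaultdict

# Short names for the nine columns in the example output (names are illustrative).
COLUMNS = ["variation", "location", "gene", "transcript", "consequence",
           "cdna_pos", "protein_pos", "aa_change", "known_variation"]

# Assumption: consequence terms meaning the variant lies outside any transcript.
OUTSIDE_TRANSCRIPT = {"UPSTREAM", "DOWNSTREAM", "INTERGENIC"}


def parse_rows(lines):
    """Turn output lines into dicts, skipping the header and incomplete rows."""
    rows = []
    for line in lines:
        fields = line.split()
        # The header splits into more than nine words and a truncated row
        # into fewer, so both are dropped by this length check.
        if len(fields) != len(COLUMNS):
            continue
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def consequences_in_transcripts(rows):
    """Group rows by variant, dropping rows whose consequences are all outside a transcript."""
    by_variant = defaultdict(list)
    for row in rows:
        terms = set(row["consequence"].split(","))
        if terms <= OUTSIDE_TRANSCRIPT:
            continue
        by_variant[row["variation"]].append(row)
    return dict(by_variant)
```

Each kept row still carries the transcript ID and the protein position, which is what the domain lookup needs next.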

Step 3) I've never heard of the PolyPhen pre-annotated data, but hopefully the above will give you enough information to use it.
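Once you have the protein position from the predictor output and the domain coordinates for the transcript's protein, the overlap check from Step 2 is a simple interval test. Here is a minimal sketch; the `Domain` type and `domains_at` function are hypothetical, and fetching the domain coordinates (from the Ensembl API or Swiss-Prot annotations) is not shown.

```python
from typing import List, NamedTuple


class Domain(NamedTuple):
    """A protein domain with 1-based, inclusive residue coordinates."""
    name: str
    start: int
    end: int


def domains_at(protein_pos: int, domains: List[Domain]) -> List[str]:
    """Return the names of all domains that contain the given residue position."""
    return [d.name for d in domains if d.start <= protein_pos <= d.end]
```

A variant can fall in more than one annotated feature, which is why the function returns a list rather than a single name.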