Question: Navigating Public Data For A Variant / Variant Annotation
8.2 years ago by
Bethesda, MD, USA
Biomed4.5k wrote:

I have thousands of nucleotide changes such as the following (coordinates are based on hg18):

chr   position   ref nucleotide  var nucleotide
chr11 1112345    A               T

First, I want to identify the most important (or all?) transcripts in which this change occurs, find the gene name, and find the amino acid change. If the change falls in a Swiss-Prot protein domain, I also want the name of that protein so I can use the PolyPhen-2 pre-annotated data to predict deleteriousness.

How can I do this using publicly available datasets? I can program, but I need guidance in the form of "You need to get the transcript name from this table, which you can download from (e.g. the UCSC Table Browser or this FTP address), then find the AA position of your change in this related table", and so on. Although this is not a programming question, any simple solutions would also be appreciated.

There are similar discussions here and here. Although there are very good answers in those discussions, I am looking for a more detailed and practical recipe than general information like "you can get this from Ensembl or BioMart".

Please feel free to answer only parts of this question, since we can assemble a pipeline from the answers. Thanks.

8.2 years ago by
Chaim40 wrote:

You can also try SeattleSeq Annotation and SeqAnt.

8.2 years ago by
Boston, MA
Brad Chapman9.4k wrote:

For a pre-built solution to this, snpEff is very good.

There is also a Python script that uses snpEff to calculate the effects directly from a VCF file.

If you want to dig into the Ensembl databases, this Clojure code queries the public MySQL databases for transcripts associated with variations by rs name. The function variation-genes is the entry point, and it could help with navigating the tables.
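
The linked scripts are not reproduced here. As a separate, minimal sketch of getting the question's table into a form that VCF-aware tools can read, the rows can be rewritten as a bare-bones VCF. The function name `table_to_vcf` is hypothetical, positions are assumed to be 1-based, and the `chr11` naming is passed through unchanged, which may or may not match the chromosome names a given annotation database expects.

```python
# Minimal sketch: convert "chr position ref var" rows into a bare-bones VCF.
# Assumptions: whitespace-separated input, 1-based positions, SNVs only.

VCF_HEADER = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)


def table_to_vcf(lines):
    """Return VCF text for rows shaped like 'chr11 1112345 A T'."""
    out = [VCF_HEADER]
    for line in lines:
        fields = line.split()
        # Skip blank lines and the column-header row of the table.
        if len(fields) != 4 or not fields[1].isdigit():
            continue
        chrom, pos, ref, alt = fields
        # ID, QUAL, FILTER and INFO are unknown here, so use "." placeholders.
        record = [chrom, pos, ".", ref.upper(), alt.upper(), ".", ".", "."]
        out.append("\t".join(record) + "\n")
    return "".join(out)


if __name__ == "__main__":
    table = [
        "chr   position   ref nucleotide  var nucleotide",
        "chr11 1112345    A               T",
    ]
    print(table_to_vcf(table), end="")
```

The output can be written to a file and passed to whichever annotator is being used. The genome build of the annotation database has to match the coordinates (hg18 in the question).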


Actually, snpEff supports VCF input (see option -vcf4) :-)

Reply written 8.2 years ago by Pablo1.9k

Sorry Pablo, that's exactly what I was trying to say in my answer. I rephrased it to be clearer. Thanks.

Reply written 8.2 years ago by Brad Chapman9.4k
8.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

I wrote a program, VCFAnnotator, for annotating VCF files. It uses only resources from UCSC. The code is available on GitHub; see my post.

8.2 years ago by
Mutated_Dater290 wrote:

I'm new to this field, but this is what I would do based on my limited knowledge.

Step 1) Use the Ensembl SNP Effect Predictor program. It's a Perl script that reports the genes/transcripts affected by a variant (though you will have to reformat your data into the appropriate input format). Here is some example output:

Uploaded Variation      Location        Gene    Transcript      Consequence     Position in cDNA        Position in protein     Amino acid change       Corresponding Variation
1_63268_T/C     1:63268 ENSG00000240361 ENST00000492842 WITHIN_NON_CODING_GENE  -       -       -       rs28664618
1_69511_A/G     1:69511 ENSG00000177693 ENST00000326183 NON_SYNONYMOUS_CODING   457     141     T/A     rs2691305
1_565901_G/A    1:565901        ENSG00000248149 ENST00000503254 WITHIN_NON_CODING_GENE  -       -       -       rs7411575
1_565901_G/A    1:565901        ENSG00000230021 ENST0
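
As a rough sketch of the reformatting mentioned in step 1, assuming the predictor's default whitespace-separated input (chromosome, start, end, alleles written as ref/var, strand), each row of the question's table could be converted like this. The function name `to_vep_line` is hypothetical, the `chr` prefix is stripped on the assumption that Ensembl-style chromosome names are expected, and all variants are assumed to be single-nucleotide changes reported on the forward strand.

```python
# Minimal sketch: turn one "chr position ref var" record into a line in the
# assumed default predictor input format: chrom, start, end, ref/var, strand.


def to_vep_line(chrom, pos, ref, alt, strand="+"):
    """Format a single-nucleotide change; start and end are the same base."""
    # Ensembl-style names drop the UCSC "chr" prefix (chr11 -> 11).
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return f"{chrom}\t{pos}\t{pos}\t{ref.upper()}/{alt.upper()}\t{strand}"


if __name__ == "__main__":
    print(to_vep_line("chr11", 1112345, "A", "T"))
```

The coordinates must be on the same assembly as the Ensembl release the script queries, so hg18 positions need either a matching older release or a coordinate lift-over first.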

Step 2) Parse the results and keep the consequences that actually fall inside a transcript. For example, NON_SYNONYMOUS_CODING is inside a transcript, whereas UPSTREAM is not, even though the nearby transcript is still reported in the results. Use the Ensembl Perl API to fetch these transcripts, then get the Swiss-Prot domains for each transcript and check whether a domain overlaps your SNP. All of these methods are documented on the Ensembl website for the API. Note that the predictor script can give more than one consequence per SNP, because consequences aren't mutually exclusive (e.g. a synonymous SNP could also be in a splice site, so you will get two rows of output for that SNP), so bear that in mind.
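
The parsing part of step 2 could be sketched as follows. This is only an illustration under stated assumptions: the output is taken to be tab-delimited with the nine columns shown in the example above, the set of "outside a transcript" consequence terms is my own guess and should be checked against the script's documentation, and the domain coordinates passed to `overlaps_domain` are hypothetical (real ones would come from the Ensembl API).

```python
from collections import defaultdict

# Column names for the nine fields in the example output above.
COLUMNS = [
    "variation", "location", "gene", "transcript", "consequence",
    "cdna_pos", "protein_pos", "aa_change", "known_variation",
]

# Assumption: consequence terms that do not lie inside a transcript.
OUTSIDE_TRANSCRIPT = {"UPSTREAM", "DOWNSTREAM", "INTERGENIC"}


def parse_rows(lines):
    """Group in-transcript rows by uploaded variation.

    One variant can yield several rows, so the result maps each variant
    to a list of row dictionaries.
    """
    by_variant = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the header row and incomplete (e.g. truncated) rows.
        if len(fields) != len(COLUMNS) or fields[0] == "Uploaded Variation":
            continue
        row = dict(zip(COLUMNS, fields))
        terms = set(row["consequence"].split(","))
        if terms <= OUTSIDE_TRANSCRIPT:
            continue  # every reported term is outside a transcript
        by_variant[row["variation"]].append(row)
    return dict(by_variant)


def overlaps_domain(protein_pos, domain_start, domain_end):
    """True if a protein position lies within an inclusive domain range."""
    return domain_start <= protein_pos <= domain_end


if __name__ == "__main__":
    rows = [
        "1_69511_A/G\t1:69511\tENSG00000177693\tENST00000326183\t"
        "NON_SYNONYMOUS_CODING\t457\t141\tT/A\trs2691305",
    ]
    for variant, hits in parse_rows(rows).items():
        for hit in hits:
            pos = int(hit["protein_pos"])
            # Hypothetical domain spanning residues 100-200.
            print(variant, hit["transcript"], overlaps_domain(pos, 100, 200))
```

Grouping by variant keeps all consequences for a SNP together, so a later step can decide which transcript or consequence to prioritise.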

Step 3) I've never heard of the PolyPhen pre-annotated data, but hopefully the above will give you enough information to use it.
