GO Annotation Data of Similar Sequences
8.6 years ago

Hi,

To perform function prediction with a machine learning algorithm, where can I find GO annotation data for similar sequences? I found some GO annotation data sets in NCBI GEO, but is there a place where I can get annotation data together with similar sequences directly?

gene annotation sequence • 1.5k views
8.6 years ago
cyril-cros ▴ 950

Check out Ensembl's BioMart interface. You can take an organism such as mouse and ask for all its genes, then grab GO terms, genomic location, Entrez ID, whatever you want. Use no filter; in the Attributes panel, the GO terms are in the External category. You can even grab orthologous genes from Panther afterwards if you want different species.
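The same export can also be scripted rather than clicked through. Below is a minimal sketch that only builds a BioMart XML query with no filters; it does not contact any server. The dataset name `mmusculus_gene_ensembl` and the attribute names `ensembl_gene_id`, `entrezgene_id` and `go_id` are assumptions on my part and should be checked against the names shown in the BioMart interface for your Ensembl release. The helper name `build_biomart_query` is made up for this example.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, attributes):
    """Build a BioMart XML query string with no filters (hypothetical helper)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",          # tab-separated output
        "header": "1",               # include a header row
        "uniqueRows": "1",           # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    dataset_el = ET.SubElement(query, "Dataset",
                               {"name": dataset, "interface": "default"})
    # One <Attribute> element per column wanted in the output table.
    for name in attributes:
        ET.SubElement(dataset_el, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


# Assumed dataset and attribute names; verify them in the BioMart interface.
xml_query = build_biomart_query(
    "mmusculus_gene_ensembl",
    ["ensembl_gene_id", "entrezgene_id", "go_id"],
)
print(xml_query)
```

As far as I know, the resulting string is what the BioMart web service expects in its `query` parameter, but treat that as an assumption and check the Ensembl documentation before relying on it.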

However, 'similar sequences' is a loose term, as Istvan Albert writes. Two protein sequences can share as little as ~20% identity and still be homologous (roughly the same structure and function). It is much worse at the nucleotide level, because the genetic code is redundant: different codons can encode the same amino acid. You can also have multiple variants depending on species or subunits (think of all the variations on hemoglobin).
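To make "percent identity" concrete, here is a small sketch that computes it for two sequences that have already been aligned. It assumes gaps are written as `-` and uses the number of gap-free columns as the denominator; other definitions (for example dividing by the full alignment length) exist and give different numbers. The two sequences are toy strings, not real proteins.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy pre-aligned sequences (made up for illustration).
identity = percent_identity("MKT-AYIAK", "MRTLAY-AR")
print(round(identity, 1))
```

A cutoff on this number alone is a weak proxy for shared function, which is the point made above.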

The usual way is to compare protein domains, as that is more relevant. Note that two proteins can share the same type of domain and still have quite different biological process GO terms, especially if you are looking at protein complexes with many subunits. The only GO aspect you can reasonably transfer this way is molecular function.

There are also tools like HMMER, which uses a training set of proteins (given as a multiple sequence alignment) to build a profile hidden Markov model. It can then score query proteins for similarity to your training set. In that sense it already works as a machine learning algorithm for function prediction.

More generally, think about the problem of de novo annotation: you have a new genome, and you want to find all the genes and what they do. Many different methods and approaches can be used, but they often rely on comparing your new gene models to existing ones and assigning them the same function.


Thank you! I tried Ensembl's BioMart interface. It has almost all the data I need, but I want the sequence information and the GO annotations in the same dataset. I can get GO annotations and homologs separately, but not for the same genes/gene IDs (GenBank or Entrez).
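One way to end up with a single dataset is to export the sequences and the GO terms separately and join them on the shared gene ID. The sketch below assumes a FASTA file whose header starts with the gene ID and a two-column, tab-separated table of gene ID and GO ID; both layouts, the function names and the toy records are assumptions for illustration, not a fixed BioMart format.

```python
from collections import defaultdict


def read_fasta(lines):
    """Map the first word of each FASTA header to its sequence."""
    sequences, current, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                sequences[current] = "".join(chunks)
            current, chunks = line[1:].split()[0], []
        elif current is not None:
            chunks.append(line)
    if current is not None:
        sequences[current] = "".join(chunks)
    return sequences


def read_go_table(lines):
    """Map gene ID to its set of GO IDs; header rows and empty GO cells are skipped."""
    go_terms = defaultdict(set)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2 and cols[1].startswith("GO:"):
            go_terms[cols[0]].add(cols[1])
    return dict(go_terms)


def join_on_gene_id(sequences, go_terms):
    """Keep only genes that have both a sequence and at least one GO term."""
    return {gene_id: (seq, sorted(go_terms[gene_id]))
            for gene_id, seq in sequences.items() if gene_id in go_terms}


# Toy inputs (made-up gene IDs and sequences).
fasta_text = ">G1 demo gene 1\nMKTA\nYIAK\n>G2 demo gene 2\nMRTLAYAR\n>G3 demo gene 3\nMSSS\n"
go_text = "gene_id\tgo_id\nG1\tGO:0003824\nG1\tGO:0003677\nG2\tGO:0005515\nG3\t\nG4\tGO:0008152\n"

joined = join_on_gene_id(read_fasta(fasta_text.splitlines()),
                         read_go_table(go_text.splitlines()))
for gene_id, (seq, terms) in joined.items():
    print(gene_id, seq, ",".join(terms))
```

Genes without any GO term (G3 here) or without a sequence (G4) drop out of the join, which is usually what you want for a labelled training set.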

8.6 years ago

There is no such thing as annotation for similar sequences. The concept of "similar" is too loose to be meaningful. Every sequence is similar to every other sequence; it is all about the degree of similarity.

There are GO annotations for individual proteins on the GO website: http://geneontology.org/page/download-annotations
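The annotation files offered there are in GAF format: tab-separated text where comment lines start with `!`, column 2 is the gene product ID, column 4 the qualifier, column 5 the GO ID, column 7 the evidence code and column 9 the aspect (F, P or C). A minimal parsing sketch follows; the function name, its parameters and the sample rows (IDs `DEMO1` and `DEMO2`) are made up for illustration, and a real file would be read line by line from disk instead.

```python
from collections import defaultdict


def parse_gaf(lines, aspect=None, exclude_evidence=()):
    """Collect GO IDs per gene product from GAF lines (hypothetical helper).

    aspect: keep only 'F', 'P' or 'C' annotations; None keeps all.
    exclude_evidence: evidence codes to drop, e.g. ('IEA',).
    """
    annotations = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # blank line or GAF comment/header
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:
            continue  # not a complete annotation row
        obj_id, qualifier, go_id = cols[1], cols[3], cols[4]
        evidence, row_aspect = cols[6], cols[8]
        if "NOT" in qualifier.split("|"):
            continue  # negated annotation: the product does NOT have this term
        if aspect is not None and row_aspect != aspect:
            continue
        if evidence in exclude_evidence:
            continue
        annotations[obj_id].add(go_id)
    return dict(annotations)


def _demo_row(obj_id, symbol, qualifier, go_id, evidence, row_aspect):
    # Build one 17-column GAF row with placeholder values in the unused columns.
    return "\t".join(["UniProtKB", obj_id, symbol, qualifier, go_id, "PMID:0",
                      evidence, "", row_aspect, "", "", "protein",
                      "taxon:10090", "20200101", "UniProt", "", ""])


# Made-up sample rows; only the GO IDs are real terms.
sample_gaf = [
    "!gaf-version: 2.2",
    _demo_row("DEMO1", "Abc1", "enables", "GO:0003824", "IDA", "F"),
    _demo_row("DEMO1", "Abc1", "involved_in", "GO:0008152", "IEA", "P"),
    _demo_row("DEMO2", "Xyz2", "NOT|enables", "GO:0005515", "IPI", "F"),
    _demo_row("DEMO2", "Xyz2", "enables", "GO:0003677", "IDA", "F"),
]

molecular_function = parse_gaf(sample_gaf, aspect="F")
print(molecular_function)
```

Whether to drop electronically inferred annotations (`IEA`) depends on the prediction task, so the sketch leaves that as an option rather than a default.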

Then there are methods by which you can find homologs (I assume you mean homologs, not just similar sequences; otherwise functional characterization would be less likely to make sense):

Homolog Finding

There may also be other compiled lists of these.


Thank you for the clarification Istvan!
