Genomic Positions Of Protein Domains
13.4 years ago
Pascal ▴ 130

Hi there,

I am looking for a database containing genomic positions of known protein domains. In principle, I need the genomic start and stop position of each domain of each gene. I know that these positions would span introns, but this is not important for my purpose. Is there something like that? I took a look at BioMart and other sources, but mostly I just got the position on the protein sequence, not the absolute position on the genome.

Regards

protein protein • 11k views

Did you find a solution that can be used by others?

13.4 years ago

To create this resource I would:


I was hoping to get around that, but this is maybe the best possibility.


Sorry, this is probably a silly question, but I don't get how you build the translated protein using knownGene. Moreover, when you say to align the SWP entry and the reconstituted protein, you are talking about amino acids, aren't you? So my problem is that I can't follow the procedure, because you are aligning proteins and then you must go back to DNA.


Yes, but the knownGene table contains the structure of the exons on the genomic reference, so you can map each amino acid back to a base on the genome. See my program backlocate: http://code.google.com/p/variationtoolkit/wiki/BackLocate
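To make the amino acid to genome step concrete, here is a minimal Python sketch. It is not how backlocate is implemented; it only assumes knownGene-style fields (strand, cdsStart, cdsEnd, exonStarts, exonEnds) in UCSC's 0-based, half-open convention. The function names and the toy gene are made up for illustration.

```python
from typing import List, Sequence, Tuple


def cds_blocks(exon_starts: Sequence[int], exon_ends: Sequence[int],
               cds_start: int, cds_end: int) -> List[Tuple[int, int]]:
    """Clip each exon to the coding region (0-based, half-open intervals)."""
    blocks = []
    for start, end in zip(exon_starts, exon_ends):
        start, end = max(start, cds_start), min(end, cds_end)
        if start < end:  # skip exons that are entirely UTR
            blocks.append((start, end))
    return blocks


def codon_positions(aa_index: int, strand: str,
                    exon_starts: Sequence[int], exon_ends: Sequence[int],
                    cds_start: int, cds_end: int) -> List[int]:
    """Return the three 0-based genomic positions encoding residue aa_index (1-based)."""
    # Enumerate every coding base in genomic order, then flip for the minus
    # strand so that the list follows the direction of translation.
    bases = [pos
             for start, end in cds_blocks(exon_starts, exon_ends, cds_start, cds_end)
             for pos in range(start, end)]
    if strand == "-":
        bases.reverse()
    first = 3 * (aa_index - 1)
    if aa_index < 1 or first + 3 > len(bases):
        raise ValueError("residue index outside the coding sequence")
    return bases[first:first + 3]


# Hypothetical two-exon gene: exons [100,200) and [300,400), CDS [150,350).
toy = dict(exon_starts=[100, 300], exon_ends=[200, 400], cds_start=150, cds_end=350)
print(codon_positions(1, "+", **toy))   # first codon on the plus strand
print(codon_positions(17, "+", **toy))  # a codon split across the intron
print(codon_positions(1, "-", **toy))   # first codon if the gene were on the minus strand
```

Enumerating every coding base is wasteful but keeps the strand and intron handling easy to check by eye. A codon that straddles an exon boundary simply comes back as non-contiguous genomic positions.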

13.4 years ago

A few weeks back I was also looking for such a resource for my analysis and realised exactly what you figured out: you won't be able to get this information from BioMart. I contacted the Ensembl help desk and they suggested that I integrate data using Ensembl resources (some of the data via BioMart and the rest via the Ensembl Core/Variation API). So you now have two options: explore the Ensembl API path, or proceed as described by Pierre using UCSC resources. Also remember that it gets a bit more complex with alternative transcripts and alternative exons: these can change the final protein product and the exact genomic location of the domains, so you may not be able to get a perfect one-to-one mapping.

13.4 years ago
Neilfws 49k

What you're describing is a coordinate conversion problem.

It's possible, of course, if you know all of the required coordinates in both coordinate systems (i.e. exons and domains, in amino acid and nucleotide coordinates), but it is quite technically challenging.
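As a rough illustration of the conversion, the Python sketch below turns a domain given in amino acid coordinates into genomic blocks and an intron-spanning start/stop pair. It assumes the coding parts of the exons are already available as sorted 0-based, half-open genomic intervals; the function name and the example numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]


def domain_to_genome(aa_start: int, aa_end: int, strand: str,
                     coding_blocks: Sequence[Interval]) -> List[Interval]:
    """Map a domain (1-based, inclusive residues) onto genomic blocks.

    coding_blocks are the coding portions of the exons, sorted by genomic
    position, as 0-based half-open intervals. The result uses the same
    convention and is returned in ascending genomic order.
    """
    total = sum(end - start for start, end in coding_blocks)
    # Domain as a half-open range of CDS offsets, in translation direction.
    lo, hi = 3 * (aa_start - 1), 3 * aa_end
    if not 0 <= lo < hi <= total:
        raise ValueError("domain lies outside the coding sequence")
    if strand == "-":
        # On the minus strand translation runs right to left, so mirror the range.
        lo, hi = total - hi, total - lo
    result = []
    offset = 0  # CDS offset of the first base of the current block
    for start, end in coding_blocks:
        length = end - start
        a, b = max(lo, offset), min(hi, offset + length)
        if a < b:  # the domain overlaps this block
            result.append((start + a - offset, start + b - offset))
        offset += length
    return result


# Hypothetical coding blocks and a domain covering residues 10 to 20.
blocks = [(150, 200), (300, 350)]
plus = domain_to_genome(10, 20, "+", blocks)
minus = domain_to_genome(10, 20, "-", blocks)
print(plus, (plus[0][0], plus[-1][1]))     # blocks, then intron-spanning span
print(minus, (minus[0][0], minus[-1][1]))
```

The span is just the first block's start and the last block's end, which is the "start and stop including introns" that the question asks for. The per-block list is what you would keep if the introns mattered.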

One solution, if you're comfortable with Perl/BioPerl, might be the BioPerl module Bio::Coordinate::GeneMapper, which was written for just this purpose. There may be similar libraries available for other languages.

As Pierre mentioned, you may also be able to use the UCSC tables, many of which have positional information.

13.4 years ago

You say that you are using BioMart? Does that mean your genome of interest is in Ensembl? If so, the work may already be done for the annotated protein domains; these are stored as Bio::EnsEMBL::ProteinFeatures which have a location both on the protein (in protein coordinates) and on the genome (in chromosome coordinates).

To find these you would need to obtain genes, then transcripts and from those the translations. Given a translation, you can get the protein features and then filter these to include only those whose analysis type you require e.g. Pfam.

While this is possible according to the API docs, I don't know whether these data are present for your organism. It's probably worth checking, though, because it will only take a short script to find out.


That's what I checked first. Unfortunately, it seems that BioMart (the web site) doesn't offer any positional information about protein domains at all. I also checked out the Perl API, but I could only get the location on the protein, not the genomic positions.


No, BioMart doesn't offer this, but the fact that the data are in BioMart means that there is very probably a core Ensembl database for your organism, and you can use the API to get the information.

13.4 years ago
Darked89 4.6k

For COGs there is Genome ProtMap:

http://www.ncbi.nlm.nih.gov/sutils/protmap.cgi?cluster=COG4690E&result=map

The (very) hard way would be to map selected Pfam domains back to the genome of interest using genewise.

13.4 years ago

PAL2NAL is a tool that can project a protein alignment onto nucleotide sequences. It's not exactly meant for what you want to do, but might be usable if you use the domain sequence and the nucleotide sequence of the gene.
