Question: Genomic Positions Of Protein Domains
Pascal wrote (8.4 years ago):

Hi there,

I am looking for a database containing the genomic positions of known protein domains. In principle, I need the genomic start and stop position of each domain of each gene. I know that these positions would span introns, but this is not important for my purpose. Is there something like that? I took a look at BioMart and other sources, but mostly I just got the position on the protein sequence, not the absolute position on the genome.

Regards


Bioinfosm replied (7.2 years ago):

Did you find a solution that can be used by others?
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (8.4 years ago):

To create this resource I would:


Pascal replied (8.4 years ago):

I was hoping to get around that, but this may be the best option.

Tonig replied (7.2 years ago):

Sorry, this is probably a silly question, but I don't understand how you build the translated protein using knownGene. Moreover, when you say to align the SWP entry and the reconstituted protein, you are talking about amino acids, aren't you? My problem is that I can't follow the procedure, because you are aligning proteins and then you must go back to DNA.

Pierre Lindenbaum replied (7.2 years ago):

Yes, but the knownGene table contains the structure of the exons on the genomic reference, so you can map each amino acid back to its bases on the genome. See my program backlocate: http://code.google.com/p/variationtoolkit/wiki/BackLocate
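
To make the amino-acid-to-base step concrete, here is a minimal Python sketch. It is not the BackLocate code; it only assumes a knownGene-style record (exon starts/ends sorted along the genome, 0-based half-open coordinates, plus cdsStart, cdsEnd and strand). The function names and the toy coordinates are made up for illustration.

```python
def coding_positions(exon_starts, exon_ends, cds_start, cds_end, strand):
    """List the 0-based genomic position of every coding base, in transcript order.

    Assumes knownGene-style input: exons sorted by genomic position,
    0-based starts and exclusive ends.
    """
    positions = []
    for start, end in zip(exon_starts, exon_ends):
        # Clip each exon to the coding region; UTR-only exons contribute nothing.
        lo, hi = max(start, cds_start), min(end, cds_end)
        positions.extend(range(lo, hi))
    if strand == "-":
        # On the minus strand the transcript runs right to left on the genome.
        positions.reverse()
    return positions


def residue_to_genome(residue, coding):
    """Return the three genomic positions of the codon for a 1-based residue."""
    i = (residue - 1) * 3
    if residue < 1 or i + 3 > len(coding):
        raise ValueError("residue outside the coding sequence")
    return tuple(coding[i:i + 3])


# Toy gene (made-up coordinates): two exons, CDS from 105 to 213.
coding = coding_positions([100, 200], [110, 215], 105, 213, "+")
print(residue_to_genome(2, coding))  # codon split by the intron
```

With the toy gene above, residue 2 comes out as (108, 109, 200): the codon starts in the first exon and finishes in the second, which is why a plain "protein position times three" offset is not enough.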
Khader Shameer (Manhattan, NY) wrote (8.4 years ago):

A few weeks back I was looking for such a resource for my own analysis and realised exactly what you figured out: you won't be able to get this information from BioMart. I contacted the Ensembl help desk, and they suggested that I integrate data from Ensembl resources (some of the data via BioMart and the rest via the Ensembl Core/Variation API). So you have two options: explore the Ensembl API path, or proceed as described by Pierre using UCSC resources. Also remember that it gets more complex because of alternative transcripts and alternative exons, which can change the final protein product and the exact genomic location of the domains. Because of this complexity, you may not be able to get a perfect one-to-one mapping.

Neilfws (Sydney, Australia) wrote (8.4 years ago):

What you're describing is a coordinate conversion problem.

It's possible, of course, if you know all of the required coordinates in both coordinate systems (i.e. exons and domains, in amino acid and nucleotide coordinates), but it is quite technically challenging.

One solution, if you're comfortable in Perl/BioPerl, might be the BioPerl module Bio::Coordinate::GeneMapper, which was written for just this purpose. There may be similar libraries available for other languages.

As Pierre mentioned, you may also be able to use the UCSC tables, many of which have positional information.
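
As a rough illustration of the conversion itself (in Python rather than BioPerl, and not using Bio::Coordinate::GeneMapper), the sketch below turns a domain given in 1-based amino acid coordinates into a genomic span plus per-exon blocks. It assumes the exons are supplied as sorted 0-based half-open (start, end) pairs; the function name and the toy coordinates are hypothetical.

```python
def domain_to_genomic(exons, cds_start, cds_end, strand, aa_start, aa_end):
    """Map a protein domain (1-based, inclusive residues) onto the genome.

    Returns (span_start, span_end, blocks) in 0-based half-open genomic
    coordinates. The span includes introns; the blocks do not.
    """
    # Coding part of each exon, in genomic order.
    segments = [(max(s, cds_start), min(e, cds_end)) for s, e in exons]
    segments = [(s, e) for s, e in segments if s < e]
    cds_len = sum(e - s for s, e in segments)

    # Domain as a half-open offset range along the coding sequence.
    lo, hi = (aa_start - 1) * 3, aa_end * 3
    if not 0 <= lo < hi <= cds_len:
        raise ValueError("domain lies outside the coding sequence")
    if strand == "-":
        # Offset 0 of a minus-strand CDS sits at the genomic right end.
        lo, hi = cds_len - hi, cds_len - lo

    blocks = []
    offset = 0  # CDS offset at which the current segment begins
    for s, e in segments:
        b_lo, b_hi = max(lo, offset), min(hi, offset + (e - s))
        if b_lo < b_hi:
            blocks.append((s + b_lo - offset, s + b_hi - offset))
        offset += e - s
    return blocks[0][0], blocks[-1][1], blocks


# Toy gene (made-up coordinates): residues 2-4 cross the single intron.
print(domain_to_genomic([(100, 110), (200, 215)], 105, 213, "+", 2, 4))
```

The first two returned values are the intron-spanning start and stop asked for in the question. Running the same call against each annotated transcript of a gene shows how the span can differ between isoforms.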

iw9oel_ad wrote (8.4 years ago):

You say that you are using BioMart. Does that mean your genome of interest is in Ensembl? If so, the work may already be done for the annotated protein domains; these are stored as Bio::EnsEMBL::ProteinFeature objects, which have a location both on the protein (in protein coordinates) and on the genome (in chromosome coordinates).

To find these you would need to obtain genes, then transcripts, and from those the translations. Given a translation, you can get the protein features and then filter these to include only those whose analysis type you require, e.g. Pfam.

While this is possible according to the API docs, I don't know whether these data are present for your organism. It's probably worth checking, though, because it will only take a short script to find out.
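
The answer above refers to the Perl API. As a swapped-in alternative, the sketch below uses Python against Ensembl's REST service instead: it fetches the protein features of one translation, keeps those of one analysis type, and asks the server to convert each feature's protein coordinates to genomic ones. The endpoint paths (`overlap/translation` and `map/translation`) and the JSON field names are my reading of the REST documentation and should be treated as assumptions to verify; `domain_genomic_blocks` and `fetch_json` are names invented here.

```python
import json
import urllib.request

SERVER = "https://rest.ensembl.org"  # public Ensembl REST server


def fetch_json(path):
    """GET a path from the REST server and parse the JSON reply."""
    request = urllib.request.Request(
        SERVER + path, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def domain_genomic_blocks(translation_id, analysis="Pfam", fetch=fetch_json):
    """Return protein and genomic coordinates for domains of one analysis type.

    `fetch` is injectable so the logic can be exercised without network access.
    Field names ("type", "id", "start", "end", "mappings", ...) are assumed
    from the REST documentation, not guaranteed.
    """
    features = fetch(f"/overlap/translation/{translation_id}?feature=protein_feature")
    results = []
    for feature in features:
        # Filter on the analysis that produced the feature, ignoring case.
        if str(feature.get("type", "")).lower() != analysis.lower():
            continue
        region = f"{feature['start']}..{feature['end']}"
        mapped = fetch(f"/map/translation/{translation_id}/{region}")
        blocks = [
            (m["seq_region_name"], m["start"], m["end"], m["strand"])
            for m in mapped["mappings"]
        ]
        results.append(
            {
                "domain": feature.get("id"),
                "protein": (feature["start"], feature["end"]),
                "genome": blocks,
            }
        )
    return results
```

Calling `domain_genomic_blocks("<translation stable ID>")` with a real ENSP identifier would query the live server. One block per exon piece is expected when a domain crosses an intron, so the overall span is the minimum start and maximum end of the blocks.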


Pascal replied (8.4 years ago):

That's what I checked first. Unfortunately, it seems that BioMart (the web site) doesn't offer any positional information about protein domains at all. I also checked out the Perl API, but I could only get the location on the protein, not the genomic positions.

iw9oel_ad replied (8.4 years ago):

No, BioMart doesn't offer this, but the fact that the data are in BioMart means that there is very probably a core Ensembl database for your organism, and you can use the API to get the information.
Darked89 (Barcelona, Spain) wrote (8.4 years ago):

For COGs there is Genome ProtMap:

http://www.ncbi.nlm.nih.gov/sutils/protmap.cgi?cluster=COG4690E&result=map

The (very) hard way would be to map selected Pfam domains back to the genome of interest using GeneWise.

Michael Kuhn (EMBL Heidelberg) wrote (8.4 years ago):

PAL2NAL is a tool that can project a protein alignment onto nucleotide sequences. It's not exactly meant for what you want to do, but might be usable if you use the domain sequence and the nucleotide sequence of the gene.
