Question: How often are domains single- versus multi-exon?
3.2 years ago, ellan wrote:

Hi! I am working on a project whose purpose is to answer the question in the title: how often are domains single- or multi-exon? Basically, is a protein domain more likely to be encoded by a single exon or by multiple exons? Within the project we have to look at human genes and use Python. We also have to use BioMart to download the data.

First, my approach is to download all human protein-coding genes from BioMart and, in the Attributes settings, select the exon sequences together with exon_start and exon_end to get the exon coordinates. Then I would download the same thing again, but with the coding sequences and the CDS_start and CDS_end coordinates instead. If the latter coordinates fall within one exon's coordinates, then that CDS (i.e. that protein) should be single-exon, right? And if the coordinates are split between several exons, it is multi-exon. By dividing the number of single-exon and multi-exon sequences (saved in arrays using Python) by the total, I will get the proportion of each. Is this even reasonable, and can you actually use the coordinates in this way?

Second, to show whether the result is significant or not, I have to perform a randomization experiment where I ignore a domain's actual position and just align it to the exon sequences to see if, by chance, it is found in one exon or in several exons. How would you simulate protein domains? Since I cannot use the actual domains because their positions are known, I assume I will have to create random sequences of random but reasonable lengths to resemble average domains?
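One possible way to set up such a randomization, sketched below, avoids inventing sequences altogether: keep each domain's real length but drop it at a random position in the coding sequence of its own transcript, then count how often it lands in a single exon. This is only one reading of the experiment described above, not a prescribed method. The function names and the input layout (domain length in nucleotides plus the cumulative coding end position of each exon) are hypothetical, and each domain is assumed to be no longer than its CDS.

```python
import random
from bisect import bisect_left


def is_single_exon(cds_start, cds_end, exon_ends):
    """True if CDS positions cds_start..cds_end (1-based, inclusive) lie in one exon.

    exon_ends holds the cumulative CDS end position of each exon in transcript order.
    """
    return bisect_left(exon_ends, cds_start) == bisect_left(exon_ends, cds_end)


def null_single_fraction(domains, rng):
    """Place every domain once at a random CDS position; return the single-exon fraction.

    domains: list of (domain_length_nt, exon_ends) pairs.
    """
    hits = 0
    for length, exon_ends in domains:
        total = exon_ends[-1]  # CDS length of the transcript
        start = rng.randint(1, total - length + 1)
        hits += is_single_exon(start, start + length - 1, exon_ends)
    return hits / len(domains)


def empirical_p(observed, domains, n_iter=1000, seed=0):
    """One-sided empirical p-value: random placement gives >= observed single-exon fraction."""
    rng = random.Random(seed)
    null = [null_single_fraction(domains, rng) for _ in range(n_iter)]
    return (1 + sum(f >= observed for f in null)) / (n_iter + 1)


# Made-up toy input: a 90 nt domain in a three-exon CDS and a 150 nt domain in a one-exon CDS.
toy_domains = [(90, [90, 210, 270]), (150, [300])]
print(empirical_p(0.5, toy_domains, n_iter=200))
```

Keeping the real exon structure and domain lengths means the null distribution already reflects how long exons and domains are, so a small p-value would suggest that domain positions line up with exon boundaries more often than chance placement would.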

I think the programming is fairly OK; what is problematic is getting the correct data, considering the different coordinate systems. Looking at BioMart, it doesn't seem consistent, and I'm not sure I would be able to trust my results even if the program worked on a test set.

Any feedback is helpful! Thank you!

ensembl exon biomart python domain • 1.1k views

You started talking about multi-exon domains but then you are only focussing on multi-exon genes, which is definitely not the same. Could you clarify? I'm a bit confused by your approach. Are you sure you know what a domain is?

written 3.2 years ago by WouterDeCoster

My approach was to download all the known exon sequences from BioMart. Let's say exon 1 has coordinates 20-400 bp (just as an example) and a domain sequence has coordinates 60-300; then we could say that the domain sequence was found in (or comes from) one exon, namely exon 1. If, however, half of the sequence is found in exon 1 and half in exon 2, then the domain would be multi-exon, since it is encoded by several exons and not just one. Is this clearer? Therefore, I'd like to use the coordinates of the exons and the coordinates of the coding sequences (also available from BioMart) and compare them. For example, are the coordinates (in bp) of CDS sequence 1 within the coordinate span of one exon or of several? And if the specific sequence for the domain within the CDS is known, one could see whether that sequence lies within one exon or is partially found in several exons. Sorry, I'm not sure if this makes sense. Thank you for your help though!
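The comparison in this example can be sketched in a few lines of Python, assuming the domain and the exons are given as 1-based inclusive intervals in the same coordinate system. The function names and the second exon (650-900) are made up for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def exons_overlapped(domain: Interval, exons: List[Interval]) -> List[int]:
    """Return the indices of all exons sharing at least one base with the domain."""
    d_start, d_end = domain
    return [
        i
        for i, (e_start, e_end) in enumerate(exons)
        if e_start <= d_end and d_start <= e_end
    ]


def classify(domain: Interval, exons: List[Interval]) -> str:
    """Label a domain by the number of exons it overlaps."""
    n = len(exons_overlapped(domain, exons))
    if n == 0:
        return "no-overlap"
    return "single-exon" if n == 1 else "multi-exon"


# Exon 1 is taken from the example above; exon 2 is hypothetical.
exons = [(20, 400), (650, 900)]
print(classify((60, 300), exons))   # single-exon
print(classify((350, 700), exons))  # multi-exon
```

The check only works if both intervals really are in the same coordinate system (for instance, both genomic or both relative to the CDS), which is exactly the concern raised in the question.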

written 3.2 years ago by ellan

I understand the part about finding the overlap between the coordinates of exons and domains. But I think you are biologically incorrect when you use the CDS: it has nothing to do with the protein domains. The CDS is just the coding part of all the exons of a transcript joined together, regardless of where the domains are.

written 3.2 years ago by WouterDeCoster

Do you have a suggestion for what kind of data I could use instead? For example, I could download actual domain sequences from another database, align them to the exon sequences, and see whether they align within one exon or partially within several exons. Actually aligning them will be a lot more work than just looking at given coordinates, though. However, it seems that different databases use different coordinate systems, so the coordinates given for the protein domain sequences might not be the same as those given for the exons, and therefore they can't be compared directly.
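If the domain positions are given in amino acids along the protein and the exons are described by how many coding bases each contributes, the two systems can be reconciled by converting everything to positions along the CDS. The sketch below assumes 1-based inclusive amino acid coordinates, a CDS that starts at the first base of the start codon, and a list of per-exon coding lengths in transcript order; the function names are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def domain_to_cds(aa_start, aa_end):
    """Convert a 1-based inclusive amino acid range to a 1-based inclusive CDS range."""
    return 3 * (aa_start - 1) + 1, 3 * aa_end


def exons_spanned(aa_start, aa_end, coding_lengths):
    """Count the exons that contribute coding bases to the domain.

    coding_lengths: number of coding bases in each exon, in transcript order.
    """
    cds_start, cds_end = domain_to_cds(aa_start, aa_end)
    ends = list(accumulate(coding_lengths))  # cumulative CDS end of each exon
    if cds_end > ends[-1]:
        raise ValueError("domain extends beyond the end of the CDS")
    first = bisect_left(ends, cds_start)  # exon holding the first domain base
    last = bisect_left(ends, cds_end)     # exon holding the last domain base
    return last - first + 1


# Made-up transcript: three exons contributing 90, 120 and 60 coding bases.
coding_lengths = [90, 120, 60]
print(exons_spanned(5, 25, coding_lengths))   # 1
print(exons_spanned(20, 50, coding_lengths))  # 2
```

Note that a codon split across an exon boundary counts towards both exons here; whether such a domain should be called multi-exon is a choice to make explicitly.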

written 3.2 years ago by ellan

I'm not really a protein guy, but I just had a look at BioMart, and it turns out you can download the protein domain annotation together with the rest of your data (see 'Attributes').
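For a scripted download, BioMart also accepts an XML query over HTTP. The following is a rough sketch that only builds such a query and defines a helper to send it. The endpoint URL, the dataset name `hsapiens_gene_ensembl`, the `biotype` filter and the attribute names (`pfam`, `pfam_start`, `pfam_end`, `cds_start`, `cds_end`, `rank`) are assumptions that should be checked against the attribute list shown in the BioMart web interface. As far as I know, attributes from different attribute pages (e.g. Features and Structures) cannot be mixed in a single query, so domain and exon data are requested separately here and would be joined on the transcript ID afterwards.

```python
import urllib.parse
import urllib.request

# Assumed public Ensembl BioMart endpoint; verify before use.
BIOMART_URL = "https://www.ensembl.org/biomart/martservice"


def build_query(attributes, dataset="hsapiens_gene_ensembl"):
    """Build a BioMart XML query for protein-coding genes with the given attributes."""
    attrs = "".join(f'<Attribute name="{name}"/>' for name in attributes)
    return (
        '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
        '<Query virtualSchemaName="default" formatter="TSV" header="1" uniqueRows="1">'
        f'<Dataset name="{dataset}" interface="default">'
        '<Filter name="biotype" value="protein_coding"/>'
        f"{attrs}</Dataset></Query>"
    )


def fetch(xml):
    """Send the query to BioMart and return the tab-separated result as text."""
    url = BIOMART_URL + "?" + urllib.parse.urlencode({"query": xml})
    with urllib.request.urlopen(url) as response:
        return response.read().decode()


# Assumed attribute names: domain positions (protein coordinates) per transcript.
domain_query = build_query(["ensembl_transcript_id", "pfam", "pfam_start", "pfam_end"])

# Assumed attribute names: coding span of each exon within the CDS.
exon_query = build_query(
    ["ensembl_transcript_id", "ensembl_exon_id", "rank", "cds_start", "cds_end"]
)

# Example call (needs network access):
# table = fetch(domain_query)
```

A request covering every protein-coding gene may be slow, so it could be worth adding a chromosome filter and looping over chromosomes.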

written 3.2 years ago by WouterDeCoster