Question

How often are domains single- versus multi-exon?

7.3 years ago

ellan • 0

Hi! I am working on a project whose purpose is to answer the question in the title: how often are domains single- or multi-exon? In other words, is a protein domain more likely to be encoded within a single exon or across multiple exons? For the project we have to look at human genes and use Python. We also have to use BioMart to download the data.

First, my approach is to download all human protein-coding genes from BioMart and, in the Attributes settings, select the exon sequences together with exon_start and exon_end to get the coordinates. Then I would download the same thing again, but with the coding sequences and the CDS_start and CDS_end coordinates. If the latter coordinates fall within one exon's coordinates, then that CDS (i.e. that protein) should be single-exon, right? And if the coordinates are split across several exons, it is multi-exon. By dividing the number of single-exon and multi-exon sequences (saved in arrays using Python) by the total, I will get the proportion of each. Is this even reasonable, and can you actually use the coordinates in this way?

Second, to show whether the result is significant, I have to perform a randomization experiment in which I ignore a domain's actual position and just align it to the exon sequences to see whether, by chance, it falls in one exon or in several. How would you simulate protein domains? Since I cannot use the actual domains because their positions are known, I assume I will have to create random sequences of reasonable random lengths to resemble average domains?
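One possible way to set up such a randomization, sketched here under assumptions rather than as the required method: instead of inventing random sequences, keep each observed domain's length and re-place it at a uniformly random position in the same CDS, then count how often it stays inside one exon. Everything below is hypothetical illustration: the function names, the toy data, and the choice to describe each transcript by the last CDS position of each exon (`exon_ends`, 1-based, in CDS coordinates).

```python
import bisect
import random

def exon_index(pos, exon_ends):
    """Index of the exon containing 1-based CDS position `pos`.

    `exon_ends` is the sorted list of the last CDS position of each exon.
    """
    return bisect.bisect_left(exon_ends, pos)

def is_single_exon(start, end, exon_ends):
    """True if the domain [start, end] starts and ends in the same exon."""
    return exon_index(start, exon_ends) == exon_index(end, exon_ends)

def shuffled_single_fraction(domains, rng):
    """Re-place every domain at a random CDS position, keeping its length."""
    single = 0
    for start, end, exon_ends in domains:
        length = end - start + 1
        cds_length = exon_ends[-1]
        new_start = rng.randint(1, cds_length - length + 1)
        single += is_single_exon(new_start, new_start + length - 1, exon_ends)
    return single / len(domains)

def permutation_test(domains, n_iter=1000, seed=0):
    """Observed single-exon fraction, null fractions, one-sided empirical p-value."""
    rng = random.Random(seed)
    observed = sum(is_single_exon(s, e, ends) for s, e, ends in domains) / len(domains)
    null = [shuffled_single_fraction(domains, rng) for _ in range(n_iter)]
    # How often is random placement at least as single-exon as the real data?
    p = (1 + sum(f >= observed for f in null)) / (n_iter + 1)
    return observed, null, p

# Toy data (made up): (domain_start, domain_end, exon_ends) in CDS coordinates.
domains = [
    (10, 250, [300, 600, 900]),
    (310, 590, [300, 600, 900]),
    (100, 180, [200, 450]),
]
observed, null, p = permutation_test(domains, n_iter=200, seed=1)
print(observed, p)
```

A small p-value here would suggest that real domains sit inside single exons more often than randomly placed segments of the same length; testing the opposite direction would need `f <= observed` instead.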

I think the programming is fairly OK; what is problematic is getting the correct data, considering the coordinate systems. Looking at BioMart, the coordinates don't seem consistent, and I'm not sure I could trust my results even if the program worked on a test set.

Any feedback is helpful! Thank you!

domain exon python ensembl biomart • 1.8k views

7.3 years ago by ellan • 0

You started talking about multi-exon domains but then you are only focussing on multi-exon genes, which is definitely not the same. Could you clarify? I'm a bit confused by your approach. Are you sure you know what a domain is?

7.3 years ago by WouterDeCoster 47k

My approach was to download all the known exon sequences from BioMart. Let's say exon 1 has coordinates 20-400 bp (just as an example) and a domain sequence has coordinates 60-300; then we could say that the domain sequence was found in (or comes from) one exon, namely exon 1. If, however, half of the sequence is found in exon 1 and half in exon 2, then the domain would be multi-exon, since it is encoded by several exons and not just one. Is this clearer? Therefore, I'd like to use the coordinates of the exons and the coordinates of the coding sequences (also available from BioMart) and compare them. For example, are the coordinates (in bp) of CDS sequence 1 within the coordinate span of one exon or several? And if the specific sequence for the domain within the CDS is known, one could see whether that sequence lies within one exon or is partially found in several exons. Sorry, I'm not sure if this makes sense. Thank you for your help, though!
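To make the coordinate comparison concrete, here is a minimal sketch of the overlap check. It assumes the domain and the exons are already expressed in the same coordinate system (1-based, inclusive); the function names are hypothetical. The first exon and first domain are the 20-400 and 60-300 example above; the second exon and second domain are made up for illustration.

```python
from collections import Counter

def exons_hit(domain, exons):
    """Return the exons that overlap a domain.

    `domain` and each exon are (start, end) tuples, 1-based and inclusive,
    and must be in the SAME coordinate system (assumption).
    """
    d_start, d_end = domain
    return [(s, e) for s, e in exons if s <= d_end and e >= d_start]

def classify(domain, exons):
    """Label a domain by the number of exons it overlaps."""
    n = len(exons_hit(domain, exons))
    if n == 0:
        return "no-overlap"
    return "single-exon" if n == 1 else "multi-exon"

# Toy data: (20, 400) and (60, 300) come from the example in the post;
# the other exon and domain are invented.
exons = [(20, 400), (650, 900)]
domains = [(60, 300), (350, 700)]

labels = [classify(d, exons) for d in domains]
counts = Counter(labels)
proportions = {label: n / len(labels) for label, n in counts.items()}
print(labels)
print(proportions)
```

On real data the exons would be grouped per transcript, so each domain is only compared with the exons of the transcript it belongs to.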

7.3 years ago by ellan • 0

I understand the part about finding the overlap between the coordinates of exons and domains. But I think you are biologically incorrect when you use the CDS. It has nothing to do with the protein domains. The CDS is the coding portion of the spliced exons (the exons minus the UTRs), regardless of the domains.

7.3 years ago by WouterDeCoster 47k

Do you have a suggestion for what kind of data I could use instead? For example, I could download actual domain sequences from another database, align them to the exon sequences, and see whether they align within one exon or partially within several exons. Aligning will be a lot more work than just looking at given coordinates, though. However, it seems that different databases use different coordinate systems, so the coordinates given for the protein domain sequences might not match those given for the exons, in which case they can't be compared.
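On the coordinate-system worry: if the domain positions are given in amino acids along the protein and the exon positions in nucleotides along the CDS (an assumption about the data, to be checked against whatever is actually downloaded), the two can be put on the same scale with simple codon arithmetic. The helper name below is hypothetical.

```python
def aa_to_cds(aa_start, aa_end):
    """Convert 1-based inclusive amino-acid positions to 1-based inclusive
    CDS nucleotide positions (3 nucleotides per codon)."""
    return 3 * (aa_start - 1) + 1, 3 * aa_end

print(aa_to_cds(1, 1))     # the first codon covers nucleotides 1..3
print(aa_to_cds(21, 100))  # a domain spanning residues 21..100
```

Once the domain is in CDS nucleotide positions, it can be compared directly with exon boundaries expressed relative to the CDS, without any sequence alignment.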

7.3 years ago by ellan • 0

I'm not really a protein guy, but I just had a look at BioMart, and it turns out you can download the protein domain annotation together with the rest of your data (see 'Attributes').
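For scripting the download from Python, a rough sketch of building the BioMart XML query is given below. It only constructs the query text and does not contact any server. The dataset name `hsapiens_gene_ensembl`, the filter `biotype`, and the attribute names (`pfam`, `pfam_start`, `pfam_end`, `rank`, `cds_start`, `cds_end`) are assumptions that should be checked against the Attributes and Filters pages in BioMart; the helper `build_biomart_query` is hypothetical. Two queries are built because, as far as I know, BioMart does not let you mix attributes from the Structures and Features pages in a single query.

```python
import xml.etree.ElementTree as ET

def build_biomart_query(dataset, attributes, filters=None):
    """Return a BioMart XML query string (nothing is sent anywhere)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in (filters or {}).items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return "<!DOCTYPE Query>" + ET.tostring(query, encoding="unicode")

# Attribute and filter names below are assumptions; verify them in BioMart.
protein_coding = {"biotype": "protein_coding"}

# Domain annotation per transcript (positions assumed to be in amino acids).
domain_query = build_biomart_query(
    "hsapiens_gene_ensembl",
    ["ensembl_transcript_id", "pfam", "pfam_start", "pfam_end"],
    protein_coding,
)

# Exon structure per transcript (positions assumed to be relative to the CDS).
structure_query = build_biomart_query(
    "hsapiens_gene_ensembl",
    ["ensembl_transcript_id", "rank", "cds_start", "cds_end"],
    protein_coding,
)
print(domain_query)
```

The resulting string would be sent as the `query` parameter to the Ensembl BioMart `martservice` endpoint, and the two tab-separated results joined on the transcript ID.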

7.3 years ago by WouterDeCoster 47k