Hi! I am working on a project whose purpose is to answer the question in the title: how often are protein domains encoded by a single exon versus multiple exons? In other words, is a protein domain more likely to lie within a single exon, or to be encoded by several? For the project we have to look at human genes, use Python, and download the data from BioMart.
First, my approach is to download all human protein-coding genes from BioMart and, under Attributes, select the exon sequences together with exon_start and exon_end to get the exon coordinates. Then I would download the same set again, but with the coding sequences and the CDS_start and CDS_end coordinates instead. If the CDS coordinates fall within one exon's coordinates, then that CDS (i.e. that protein) should be single-exon, right? And if the coordinates are split across several exons, it is multi-exon. By dividing the number of single-exon and multi-exon sequences (saved in arrays using Python) by the total, I get the proportion of each. Is this even reasonable, and can you actually use the coordinates in this way?
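To make the coordinate comparison concrete, here is a minimal sketch of the check I have in mind. It assumes that a feature and the exons are given as inclusive (start, end) pairs on the same chromosome and in the same coordinate system, which is exactly the part I am unsure about. The function names and the example numbers are made up for illustration and do not come from BioMart.

```python
from collections import Counter


def exons_overlapped(feature_start, feature_end, exons):
    """Return the exons that overlap the inclusive interval [feature_start, feature_end].

    Assumption: every exon is an inclusive (start, end) pair with start <= end,
    in the same coordinate system as the feature.
    """
    return [(s, e) for s, e in exons if s <= feature_end and e >= feature_start]


def classify(feature_start, feature_end, exons):
    """Label a feature by the number of exons it overlaps."""
    n = len(exons_overlapped(feature_start, feature_end, exons))
    if n == 0:
        return "none"  # feature lies entirely outside the given exons
    return "single-exon" if n == 1 else "multi-exon"


def proportions(labels):
    """Fraction of each label among all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical exons of one transcript.
    exons = [(100, 200), (300, 400), (500, 600)]
    features = [(120, 180), (150, 350), (310, 390)]
    labels = [classify(start, end, exons) for start, end in features]
    print(proportions(labels))
```

A feature that starts in one exon and ends in another is counted as multi-exon even though the intron between them is not part of the feature, so the check only makes sense if both coordinate sets really are genomic.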
Second, to show whether the result is significant, I have to perform a randomization experiment in which I ignore a domain's actual position and just align it to the exon sequences to see whether, by chance, it falls in one exon or in several. How would you simulate protein domains? Since I cannot use the actual domains because their positions are known, I assume I will have to create random sequences of random but reasonable lengths to resemble average domains?
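One alternative to inventing random sequences, which I could imagine but am not sure is what is expected, is to keep a set of domain lengths and only randomize where each domain is placed along the spliced coding sequence. The rough sketch below does that. It assumes exon lengths are given in transcript order as the coding bases each exon contributes, that domain lengths are in nucleotides (3 times the amino-acid length), and that positions are 0-based offsets into the spliced sequence. All names and numbers are hypothetical.

```python
import bisect
import random
from itertools import accumulate


def exons_spanned(start, length, exon_lengths):
    """Number of exons touched by the segment [start, start + length).

    Positions are 0-based offsets into the spliced sequence (an assumption of this sketch).
    """
    ends = list(accumulate(exon_lengths))  # exclusive end offset of each exon
    first = bisect.bisect_right(ends, start)
    last = bisect.bisect_right(ends, start + length - 1)
    return last - first + 1


def random_single_exon_fraction(domain_lengths, exon_lengths, rng):
    """Place each domain at a uniformly random position; return the single-exon fraction."""
    total = sum(exon_lengths)
    placed = single = 0
    for length in domain_lengths:
        if length > total:
            continue  # domain does not fit into this coding sequence
        start = rng.randrange(total - length + 1)
        placed += 1
        if exons_spanned(start, length, exon_lengths) == 1:
            single += 1
    return single / placed if placed else float("nan")


def empirical_p_value(observed, null_values):
    """One-sided p-value: share of null values >= observed, with a +1 correction."""
    at_least = sum(value >= observed for value in null_values)
    return (at_least + 1) / (len(null_values) + 1)


if __name__ == "__main__":
    rng = random.Random(0)           # fixed seed so the run is repeatable
    exon_lengths = [120, 90, 300, 150]  # hypothetical coding bases per exon
    domain_lengths = [150, 240, 90]     # hypothetical domain lengths in nucleotides
    null = [random_single_exon_fraction(domain_lengths, exon_lengths, rng)
            for _ in range(1000)]
    observed = 2 / 3                 # hypothetical observed single-exon fraction
    print(empirical_p_value(observed, null))
```

The observed single-exon fraction would then be compared with the distribution of fractions from the random placements; whether a one-sided comparison is the right test here is part of what I am asking.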
I think the programming is fairly OK; the problem is getting the correct data given the coordinate systems. Looking at BioMart, the coordinates don't seem consistent, and I'm not sure I could trust my results even if the program worked on a test set.
Any feedback is helpful! Thank you!