Question: isoforms and the definition of a protein
3.1 years ago by
rayoub (110) wrote:

I'm reviewing the UniProt data model and I'm confused about what a protein is. Originally, I thought that a protein was defined by its sequence of amino acids. However, the fact that isoforms of a gene are stored in one UniProt/Swiss-Prot entry leads me to believe a protein is defined by the gene from which it originates. Otherwise, these alternative splice products would receive distinct UniProt/Swiss-Prot entries. Or perhaps it is more complicated than that, and isoforms that are different enough receive separate entries? I'm just a bit confused by the definition of a protein in this light.

Looking at the word itself, iso- means equal, so isoform would seem to mean equal form. Given that structure is more conserved than sequence, I could understand form being a better way to define a protein than sequence. But from looking at the data, I don't think this is what isoform really means.

Any help appreciated.

proteins protein isoforms • 1.2k views
modified 3.1 years ago by cdsouthan (1.8k) • written 3.1 years ago by rayoub (110)
3.1 years ago by
United States / Los Angeles /
Petr Ponomarenko (2.6k) wrote:

A protein is a molecule made of amino acids that has a specific function. Sometimes the changes in protein sequence due to alternative splicing are very small and have almost no effect on function; in other cases the differences in sequence and function are dramatic. So one gene can encode multiple isoforms of the same protein and/or multiple different proteins.

written 3.1 years ago by Petr Ponomarenko (2.6k)

Also, you can have multiple genes that encode identical proteins. This can happen both within an organism and between organisms. Suppose you have a human and a mouse protein that are 100% identical across the full length of the protein. Is that the same protein or two different proteins? On one hand, they are molecularly identical. On the other hand, one is a human protein and one is a mouse protein. As far as I know, whereas it is reasonably well defined what a protein is, there is no commonly agreed-upon definition of when two proteins are to be considered "the same".

written 3.1 years ago by Lars Juhl Jensen (11k)

Lars Juhl Jensen is right, and biology raises even more complicated questions. What if we make a construct and put a human gene into yeast? Is the product the same protein as in the human body? If we look at a protein with the same sequence in different human tissues, are these the same protein (we know that, due to chaperones, the very same sequence can be folded differently in different tissues and can have a different function)? What if the sequences and functions are the same but the proteins come from different genes? What if a gene has a protein product that is always altered in different ways in the process of maturing; are all of these altered proteins the same? What if the ability to have all of these different alterations is very important for the key function of the protein? What if a protein modification, such as phosphorylation at a certain position, changes its function; is it still the same protein?

Here we are at the meeting point of two very distinct approaches: biology is a descriptive science, while programming is based on mathematics, which is basic (fundamental) knowledge. Databases and their schemas belong to programming and have to be well defined, while the biological information stored in them is descriptive and thus by nature much less structured and defined (and our understanding of how to structure the data, and of what the data can be, changes very quickly over time). This is why we have such complicated data structures in bioinformatics. I prefer to think about it this way: "life is so diverse and unpredictable that if you can imagine something then most likely it can be found somewhere, plus lots of things beyond my imagination".

Protein is a biological term, and the way we use it in data analysis is just an oversimplified projection of real life; it usually differs from database to database, from tool to tool and from year to year. When an experimental biologist studies a given protein, he or she considers it to be a substance made of amino acids that has a particular observable function. So when the protein is extracted and given a name, all we usually know is its weight in kilodaltons, its function and maybe some extra properties, such as its ability to bind certain molecules or how quickly its concentration changes in response to a given stress. Only later do we start to find the protein sequence, the gene it comes from, its processing, its 3D structure and so on. When the protein is first discovered, it might be unknown that it is multisubunit and made of several different gene products; moreover, some of the genes can be from the nucleus and others from mitochondria or chloroplasts. Scientists will eventually figure this out, but by then papers have already been published in which the protein was called, say, ABC. What should we do with the updated information? We start to call the subunits ABCI and ABCII; some databases will still hold old data about protein ABC, others will call it multisubunit protein complex ABC, and to make a distinction between the two some third database might call it protein complex UAT, and so on. We try to sort out this mess of names and definitions by making up ontologies, but it is still a mess because biology is a descriptive science.

I was trained as a physicist first and a bioinformatics specialist second, so I think about such "biological definition problems" this way: "The most important question is how we can model processes in a way that lets us predict outcomes." If a protein is defined one way in one data set and another way in another data set, then it is merely a problem of normalizing the data before using it in my model.
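As a minimal sketch of that normalization step, the same set of entries can be grouped under two different definitions of "the same protein". The records, identifiers and field layout below are invented for illustration and do not come from any real database.

```python
from collections import defaultdict

# Hypothetical records: (entry_id, organism, gene, sequence). Illustrative only.
RECORDS = [
    ("E1", "human", "GENE1", "MKTAYIAKQR"),
    ("E2", "mouse", "Gene1", "MKTAYIAKQR"),  # same sequence, other species
    ("E3", "human", "GENE1", "MKTAYQR"),     # shorter isoform of GENE1
    ("E4", "human", "GENE2", "MKTAYIAKQR"),  # second gene, identical product
]


def group_by(records, key):
    """Group entry ids under whatever 'same protein' key the model needs."""
    groups = defaultdict(list)
    for entry_id, organism, gene, sequence in records:
        groups[key(organism, gene, sequence)].append(entry_id)
    return dict(groups)


# Definition 1: a protein is its amino acid sequence.
by_sequence = group_by(RECORDS, lambda org, gene, seq: seq)
# Definition 2: a protein is whatever one gene in one organism encodes.
by_gene = group_by(RECORDS, lambda org, gene, seq: (org, gene))

print(len(by_sequence), "groups by sequence;", len(by_gene), "groups by gene")
```

Neither grouping is "the" right one; the point is only that the key is chosen per application and applied before the data reach the model.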

written 3.1 years ago by Petr Ponomarenko (2.6k)

This is a fantastic explanation. Your last sentence is what it boils down to. I have to figure out what I want and normalize the data per application.

I want to play around with predicting residue-residue contacts from sequence alone within a single protein chain, but it looks like what I have to do is stick with monomers in the PDB and only grab chain A (multiple chains will just be asymmetric unit stuff), and also find a way to pick the representative one, perhaps the one that best matches the canonical sequence in Swiss-Prot. Doing this will reduce the impact of all that mutagenesis and ligand-binding stuff in the SEQADV section that I barely understand. Not to mention isoforms, ugh.

Is it at all possible that folks coming at structure prediction from the comp sci/math direction are being foiled by incorrect data normalization and usage? Double counting is one obvious possibility here, since proteins are represented more than once in the PDB and there is redundancy elsewhere, such as in TrEMBL. I know that folks with a deep learning algorithm in hand are often more concerned with the deep learning than with the actual biological data; that kind of bias, in one direction or the other, comes with interdisciplinary work.
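A rough sketch of the "representative chain" idea, under stated assumptions: the chain identifiers and sequences are made up, and `difflib`'s similarity ratio is only a crude stand-in for a proper alignment score.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Rough 0..1 similarity between two sequences (not a real alignment score)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pick_representative(canonical, chains):
    """Return the (chain_id, sequence) pair closest to the canonical sequence."""
    return max(chains.items(), key=lambda item: similarity(canonical, item[1]))


# Invented canonical sequence and candidate chains (hypothetical identifiers).
canonical = "MKTAYIAKQRQISFVKSHFSRQ"
candidates = {
    "1abc_A": "HHHHHHMKTAYIAKQRQISFVKSHFSRQ",  # His-tagged construct
    "2xyz_A": "MKTAYIAKQRQISFVKSHFSRQ",        # identical to the canonical sequence
    "3def_A": "MKTAYIAKQAQISFVKSHF",           # point mutant, truncated
}

best_id, best_seq = pick_representative(canonical, candidates)
print("representative chain:", best_id)
```

For real data the scoring function would presumably be replaced by a pairwise alignment, and ties between equally good chains would need an explicit rule (resolution, for example).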

written 3.1 years ago by rayoub (110)

Bioinformatics is an interdisciplinary field. I have seen people from very different backgrounds doing it, including trained linguists and economists. When people from different fields work together on the same project, interesting things can be found. I would say a team of a statistician, a physicist, a computational biologist, an experimental biologist and a good research manager (who can be one of them) will easily outcompete 4 computational biologists =)

Also, with "stick with monomers in the PDB and only grab chain A (multiple chains will just be asymmetric unit stuff)" you are somewhat wrong from a practical point of view. Even with exactly the same sequence, every copy in the asymmetric unit can be different, and this is why some PDB structures contain several copies of the same protein or protein complex. Try to superimpose these "exact copies" and see the difference for yourself. Usually it is not dramatic, but free loops and free tails are sometimes very different between such "copies". Fortunately, these are rarely important for the function and thus for your study.
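To get a feel for this without a structural toolkit, here is a small sketch that compares two copies through their internal C-alpha distance matrices. This is a superposition-free stand-in for actually superimposing the chains, not the superposition itself, and the coordinates are invented rather than taken from a PDB entry.

```python
import math


def distance_matrix(coords):
    """All pairwise distances between the C-alpha positions of one chain."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def per_residue_deviation(coords_a, coords_b):
    """Mean absolute difference in internal distances for each residue.

    Assumes both chains have the same residues in the same order.
    """
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    n = len(coords_a)
    return [sum(abs(da[i][j] - db[i][j]) for j in range(n)) / n for i in range(n)]


# Made-up C-alpha coordinates (angstroms) for two "identical" copies.
copy_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
# Same core shifted by 10 A, but the last residue (a free tail) swings away.
copy_b = [(10.0, 0.0, 0.0), (13.8, 0.0, 0.0), (17.6, 0.0, 0.0), (17.6, 3.8, 0.0)]

deviation = per_residue_deviation(copy_a, copy_b)
most_variable = deviation.index(max(deviation))
print("most variable residue index:", most_variable)
```

Because only internal distances are compared, the rigid shift between the copies is ignored and the residue that stands out is the flexible tail, which is the kind of difference described above.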

written 3.1 years ago by Petr Ponomarenko (2.6k)
3.1 years ago by
cdsouthan (1.8k) wrote:

Isoform is such an ambiguous qualifier that it should be banned (but we know this is spitting in the wind). It has ancient origins, back in the days of protein isoelectric focusing, where band splits were just conveniently named isoforms (even though this splitting was dominated by carbohydrate side chains, not endogenous protein ionisation states). These days "isoform" can conflate alternative splicing and/or alternative initiation, sometimes sequence variants of different mechanistic origins, and a range of post-translational differences. As implied above, each of these "isoforms" needs to be rigorously defined.

The key (or what I find useful, at least) is to grasp the canonical concept of UniProt already alluded to (actually only the Swiss-Prot section). Simply put, the curators default to the longest, maximum-exon sequence as a defined reference (though this differs from the RefSeq concept), to which all other data-supported changes in coding sequence (or post-translational modifications) are then mapped as cross-references. This is not the only organizing principle one could come up with, by a long chalk, but it's a good one in my opinion. Note, though, that from 1986 until after 2000 this was applied pre-genomically (to the longest CDS in a cDNA), so post-genomic mapping brings a new set of challenges (as mentioned, one of these is what to do with, say, histones with identical protein sequences). But the canonical model still holds, as the exon set from a single gene locus.

PDB mappings open their own can of worms but, in answer to the question of matching to the canonical sequence in Swiss-Prot, this is exactly what PDBe tries to do. The actual sequence as a string, as resolved from the individual structures (with its own errors or PCR-induced variants), used to be explicitly designated via a GI number (and indexed in BLASTP against nr). I'm not sure how that is handled now but, as mentioned, even the smallest changes between PDBs (e.g. trimming or leaving a His purification tag) will spawn a new sequence in nr and TrEMBL.
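A toy sketch of that organizing principle, assuming only the simplified "longest sequence wins" rule described here rather than the curators' full criteria. The isoform names and sequences are invented.

```python
def build_entry(isoforms):
    """Pick the longest isoform as canonical; describe the others relative to it."""
    canonical_id = max(isoforms, key=lambda name: len(isoforms[name]))
    canonical = isoforms[canonical_id]
    others = {
        name: {"length": len(seq), "missing_residues": len(canonical) - len(seq)}
        for name, seq in isoforms.items()
        if name != canonical_id
    }
    return {"canonical_id": canonical_id, "canonical": canonical, "isoforms": others}


# Hypothetical splice isoforms of a single gene (sequences invented).
isoforms = {
    "iso1": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "iso2": "MKTAYIAKQRQISFVKSHFSRQ",  # lacks the last exon
    "iso3": "MKTAYIAKLEERLGLIEVQ",     # skips a middle exon
}

entry = build_entry(isoforms)
print(entry["canonical_id"], sorted(entry["isoforms"]))
```

The result is one entry per gene locus with a single reference sequence and the other products hanging off it, which is why isoforms do not show up as separate entries.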
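The His-tag point can be illustrated with a short sketch. The tag pattern (six or more histidines at either end) and the sequences are assumptions for illustration; real constructs often carry linkers or cleavage sites that this does not handle.

```python
import hashlib
import re

# Assumed tag pattern: a run of 6+ His at the very start or very end.
# A native sequence that really ends in 6 His would be trimmed by mistake.
HIS_TAG = re.compile(r"^H{6,}|H{6,}$")


def sequence_key(seq):
    """Short stable identifier for an exact sequence string."""
    return hashlib.sha256(seq.encode("ascii")).hexdigest()[:12]


def strip_his_tag(seq):
    """Remove a terminal His tag so tagged and untagged constructs collapse."""
    return HIS_TAG.sub("", seq)


untagged = "MKTAYIAKQR"            # invented sequence
tagged = "HHHHHH" + untagged       # same protein expressed with an N-terminal tag

print(sequence_key(tagged) == sequence_key(untagged))                 # different strings
print(sequence_key(strip_his_tag(tagged)) == sequence_key(untagged))  # same after trimming
```

Any resource that indexes exact strings will treat the two constructs as different sequences, so this kind of trimming has to happen on the user's side before counting unique proteins.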

modified 3.1 years ago • written 3.1 years ago by cdsouthan (1.8k)