I'm reviewing the UniProt data model and I'm confused about what a protein is. Originally, I thought that a protein was defined by its sequence of amino acids. However, the fact that the isoforms of a gene are stored in a single UniProt/Swiss-Prot entry leads me to believe a protein is defined by the gene from which it originates. Otherwise, these alternative splicing products would receive distinct UniProt/Swiss-Prot entries. Or perhaps it is more complicated than that, and isoforms that are different enough do receive separate entries? I'm just a bit confused by the definition of a protein in this light.
Looking at the word itself, iso- means equal, so isoform would seem to mean "equal form". Since structure is more conserved than sequence, I could see form being a better way to define a protein than sequence. But from looking at the data, I don't think this is what isoform really means.
Any help appreciated.
Also, you can have multiple genes that encode identical proteins. This can happen both within an organism and between organisms. Suppose you have a human and a mouse protein that are 100% identical across the full length of the protein. Is that the same protein or two different proteins? On one hand they are molecularly identical. On the other hand, one is a human protein and one is a mouse protein. As far as I know, although it is reasonably well defined what a protein is, there is no commonly agreed-upon definition of when two proteins are to be considered "the same".
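To make the point concrete, here is a minimal sketch that counts "distinct proteins" in a toy data set under three different identity keys. The records, gene names and sequences are made up for illustration (they are not real UniProt entries), and count_distinct is a hypothetical helper, not part of any library.

```python
from collections import namedtuple

# Hypothetical records; organisms, gene names and sequences are made up.
Record = namedtuple("Record", ["organism", "gene", "sequence"])

records = [
    Record("human", "GENE1", "MKTAYIAK"),
    Record("mouse", "Gene1", "MKTAYIAK"),    # identical to the human sequence
    Record("human", "GENE2", "MKTAYIAK"),    # second human gene, same product
    Record("human", "GENE1", "MKTAYIAKQR"),  # an isoform of GENE1
]


def count_distinct(records, key):
    """Count how many 'different proteins' exist under a given identity key."""
    return len({key(r) for r in records})


# Identity = amino acid sequence only.
by_sequence = count_distinct(records, lambda r: r.sequence)
# Identity = gene of origin (per organism), roughly one entry per gene.
by_gene = count_distinct(records, lambda r: (r.organism, r.gene))
# Identity = organism + gene + sequence, so every isoform counts separately.
by_all = count_distinct(records, lambda r: (r.organism, r.gene, r.sequence))

print(by_sequence, by_gene, by_all)
```

The same four records give 2, 3 or 4 proteins depending on the key, which is why the answer to "is it the same protein?" depends on the definition you choose.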
Lars Juhl Jensen is right, and biology raises even more complicated questions. What if we make a construct and put a human gene into yeast? Is the product the same protein as in the human body? If we look at a protein with the same sequence in different human tissues, are these the same protein (we know that, due to chaperones, the very same sequence can be folded differently in different tissues and can have a different function)? What if the sequences and functions are the same but the proteins come from different genes? What if a gene's protein product is always mutated in different ways during maturation: are all of these mutated proteins the same? What if the ability to acquire all of these different mutations is essential to the protein's key function? What if a modification such as phosphorylation at a certain position changes the protein's function: is it still the same protein?
Here two very distinct approaches collide: biology is a descriptive science, while programming is based on mathematics, which is basic (fundamental) knowledge. Databases and their schemas belong to programming and have to be well defined, while the biological information stored in them is descriptive and thus, by its nature, much less structured and defined (and our understanding of how to structure the data, and of what the data can be, changes very quickly over time). This is why we have such complicated data structures in bioinformatics. I prefer to think about it this way: "life is so diverse and unpredictable that if you can imagine something, then most likely it can be found somewhere, plus lots of things beyond my imagination".
Protein is a biological term, and the way we use it in our data analysis is just an oversimplified projection of real life; it usually differs from database to database, from tool to tool and from year to year. When an experimental biologist studies a given protein, he/she considers it to be a substance made of amino acids that has a particular observable function. So when the protein is extracted and given a name, all we usually know is its weight in kilodaltons, its function and maybe some extra properties, like its ability to bind certain molecules or how quickly its concentration changes in response to a given stress. Only later do we start to find out the protein's sequence, the gene it comes from, its processing, its 3D structure and so on. When the protein is first discovered, it might not be known that it is multisubunit and made of several different gene products; moreover, some of the genes can be from the nucleus and others from mitochondria or chloroplasts. Scientists will eventually figure this out, but papers have already been published in which the protein was called, say, ABC. What should we do with the updated information? We start to call the subunits ABCI and ABCII; some databases will still hold the old data about protein ABC, others will call it multisubunit protein complex ABC, and, to make a distinction between the two, some third database might call it protein complex UAT, and so on. We try to sort out this mess of names and definitions by making up ontologies, but it is still a mess because biology is a descriptive science.
I was trained as a physicist first and a bioinformatics specialist second, so I think about such "biological definition problems" this way: "The most important question is how we can model processes in such a way that we can predict outcomes." If a protein is defined one way in one data set and another way in another, then it is merely a problem of normalizing the data before using it in my model.
This is a fantastic explanation. Your last sentence is what it boils down to. I have to figure out what I want and normalize the data per application.
I want to play around with predicting residue-residue contacts from sequence alone within a single protein chain. It looks like what I have to do is stick with monomers in the PDB and only grab chain A (multiple chains will just be asymmetric unit stuff), and also find a way to pick a representative structure, perhaps the one that best matches the canonical sequence in Swiss-Prot. Doing this will reduce the impact of all the mutagenesis and ligand-binding stuff in the SEQADV section that I barely understand. Not to mention isoforms, ugh.
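A rough sketch of that "best match to the canonical sequence" step might look like the following. Everything here is an assumption for illustration: the structure identifiers and sequences are made up, pick_representative is a hypothetical helper, and difflib's SequenceMatcher ratio is only a crude stand-in for a proper alignment-based identity score.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pick_representative(canonical, chains):
    """Return the key of the chain whose sequence best matches the canonical one."""
    return max(chains, key=lambda k: similarity(canonical, chains[k]))


# Made-up canonical sequence and chain sequences (not real entries).
canonical = "MKTAYIAKQRQISFVK"
chains = {
    ("struct1", "A"): "MKTAYIAKQRQISFVK",  # full-length wild type
    ("struct2", "A"): "MKTAYIAKARQISFVK",  # point mutant
    ("struct3", "A"): "GSHMKTAYIAKQRQ",    # tagged, truncated construct
}

best = pick_representative(canonical, chains)
print(best)
```

With these toy inputs the wild-type chain is selected because it is the only exact match; with real data, ties and partial coverage would need an explicit rule.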
Is it at all possible that folks coming at structure prediction from the comp sci/math direction are being foiled by incorrect data normalization and usage? Double counting is one obvious possibility here, since proteins are represented more than once in the PDB and there is redundancy elsewhere, such as in TrEMBL. I know that folks with a deep learning algorithm in hand are often more concerned with the deep learning than with the actual biological data; that kind of bias, in one direction or the other, comes with interdisciplinary work.
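As a toy illustration of guarding against that kind of double counting, here is a greedy redundancy filter. It is only a sketch: the sequences are made up, the 0.9 threshold is an arbitrary assumption, reduce_redundancy is a hypothetical helper, and the difflib ratio is a crude substitute for real sequence identity. For real data sets, dedicated clustering tools such as CD-HIT or MMseqs2 are the usual choice.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for real sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def reduce_redundancy(sequences, threshold=0.9):
    """Greedy filter: keep a sequence only if no kept sequence is too similar."""
    kept = []
    # Visit longer sequences first so they become the representatives.
    for seq in sorted(sequences, key=len, reverse=True):
        if all(similarity(seq, rep) < threshold for rep in kept):
            kept.append(seq)
    return kept


# Made-up sequences for illustration.
sequences = [
    "MKTAYIAKQRQISFVK",
    "MKTAYIAKQRQISFVK",  # exact duplicate (same protein deposited twice)
    "MKTAYIAKARQISFVK",  # point mutant of the first
    "GLSDGEWQLVLNVWGK",  # unrelated sequence
]

nonredundant = reduce_redundancy(sequences)
print(len(sequences), "->", len(nonredundant))
```

Filtering like this before splitting into training and test sets keeps near-identical chains from landing on both sides of the split.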
Bioinformatics is an interdisciplinary field. I have seen people from very different backgrounds doing it, including trained linguists and economists. When people from different fields work together on the same project, interesting things can be found. I would say a team of a statistician, a physicist, a computational biologist, an experimental biologist and a good research manager (who can be one of them) will easily outcompete 4 computational biologists =)
Also, with "stick with monomers in the PDB and only grab chain A (multiple chains will just be asymmetric unit stuff)" you are kind of wrong from a practical point of view. Even with exactly the same sequence, every copy in the asymmetric unit can be different, and this is why some PDB structures contain several copies of the same protein or protein complex. Try to superimpose these "exact copies" and see the difference for yourself. Usually it is not dramatic, but free loops and free tails are sometimes very different between such "copies". Fortunately, these are rarely important for the function and thus for your study.
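If you want a quick number for how much two copies differ, one option is sketched below. Note that this is not a true superposition (that would need an optimal rotation, e.g. via the Kabsch algorithm); instead it uses the distance RMSD, which compares the internal distances of the two copies and so needs no fitting. The coordinates are made up, and distance_rmsd is a hypothetical helper written for this example.

```python
import math
from itertools import combinations


def distance_rmsd(coords_a, coords_b):
    """RMS difference between corresponding intra-chain distances of two copies."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both copies must have the same number of atoms")
    if len(coords_a) < 2:
        raise ValueError("need at least two atoms to compare distances")
    squared_diffs = []
    for i, j in combinations(range(len(coords_a)), 2):
        d_a = math.dist(coords_a[i], coords_a[j])
        d_b = math.dist(coords_b[i], coords_b[j])
        squared_diffs.append((d_a - d_b) ** 2)
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))


# Made-up C-alpha coordinates (in angstroms) for two copies of a 4-residue chain.
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
# Copy B is shifted by 10 angstroms in z and its last residue bends away,
# mimicking a "free tail" that differs between copies.
chain_b = [(0.0, 0.0, 10.0), (3.8, 0.0, 10.0), (7.6, 0.0, 10.0), (7.6, 3.8, 10.0)]

print(round(distance_rmsd(chain_a, chain_a), 3))
print(round(distance_rmsd(chain_a, chain_b), 3))
```

Because only internal distances are compared, a rigid shift of one copy does not change the value; only real conformational differences such as the bent tail contribute.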