Question: Same Protein Sequence
5.7 years ago by
Farhad10 wrote:

I'm a beginner in bioinformatics, sorry for asking this question!

When I have two different UniProt IDs for two proteins with exactly the same sequence, how should I interpret this? Are these the same protein?

Example : Q68149 , Q6QY91

I also know that hepatitis C virus (HCV) has about 11 proteins, but when I search databases I find tens of UniProt IDs for HCV proteins!

I would really appreciate any comments on this question.



modified 11 months ago by Biostar ♦♦ 20 • written 5.7 years ago by Farhad10
5.7 years ago by
United States
pld4.8k wrote:

For future reference, UniProt, ENSEMBL, NCBI, etc. are all very good about documenting these things. UniProt's documentation on accession numbers says:

Entries can have more than one accession number. This can be due to two distinct mechanisms:
  • a) When two or more entries are merged, the accession numbers from all entries are kept. The first accession number is referred to as the ‘Primary (citable) accession number’, while the others are referred to as ‘Secondary accession numbers’. These are listed in alphanumerical order.
  • b) If an existing entry is split into two or more entries (‘demerged’), new ‘primary’ accession numbers are attributed to all the split entries while all original accession numbers are retained as ‘secondary’ accession numbers.
Example: P29358 which has been ‘demerged’ into P68250 and P68251.
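
To make the primary/secondary distinction concrete, here is a minimal sketch of reading it out of a UniProt flat-file record, where the `AC` line lists accessions separated by semicolons with the primary one first. The function name `parse_accessions` is made up for illustration, and the sample line is an assumption built from the demerge example above, not a copy of the live entry.

```python
def parse_accessions(ac_lines):
    """Split UniProt flat-file AC lines into (primary, secondaries)."""
    accessions = []
    for line in ac_lines:
        # Only AC lines carry accession numbers; skip anything else.
        if not line.startswith("AC"):
            continue
        # Drop the two-letter line code, then split the ';'-separated list.
        body = line[2:].strip()
        accessions.extend(a.strip() for a in body.split(";") if a.strip())
    if not accessions:
        raise ValueError("no AC lines found")
    # The first accession is the primary (citable) one.
    return accessions[0], accessions[1:]


# Illustrative AC line (assumed, based on the P29358 example above).
primary, secondary = parse_accessions(["AC   P68250; P29358;"])
print(primary, secondary)
```

If two IDs you are comparing show up on the same `AC` line, they are the same entry; if each is the first accession of its own record, they are separate entries.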


There are way more than "tens" of entries.

A bit of virology:

HCV is a (+)ssRNA virus of the family Flaviviridae. RNA viruses are prone to mutation, and there are often high degrees of interhost and intrahost genetic variability. HCV has been studied frequently in this arena; a common example is using deep sequencing to look at how the viral populations in each person may impact treatment or disease outcome. In a similar sense, it has been used as a model for studying quasispecies. It has also been a fairly major public health concern, so one would expect many genome sequences to be available. So you're going to have a wealth of sequence data, which is not what I've gotten to deal with (till very recently).

In general with all viruses, but especially RNA and ssRNA viruses, you shouldn't consider each entry under the virus (species) as the same thing. For example, check out the NCBI Influenza Virus Resource.

Back to your case:

You shouldn't assume that each entry will have the same sequence, especially if you can see that a different source/isolate was provided. The trick is knowing when two accessions refer to different entries and when they are an old and a new accession for the same entry. If they have the same sequence you can assume they're probably the same thing, but that isn't always the case: a second entry may really have had the same sequence, or the sequencing approach used may not have been sensitive enough to capture some low-frequency variant.
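
If you want to find which entries share an identical sequence before deciding what to do with them, something like the following would work. It is only a sketch: `group_by_sequence` is a made-up name, the accessions are placeholders, and the sequences are toy strings rather than real HCV data.

```python
import hashlib
from collections import defaultdict


def group_by_sequence(entries):
    """Return groups of accessions whose sequences are identical."""
    groups = defaultdict(list)
    for accession, sequence in entries.items():
        # Strip whitespace and fix case so formatting differences don't matter.
        normalized = "".join(sequence.split()).upper()
        digest = hashlib.sha256(normalized.encode("ascii")).hexdigest()
        groups[digest].append(accession)
    # Keep only sequences shared by more than one accession.
    return [sorted(accs) for accs in groups.values() if len(accs) > 1]


# Placeholder accessions and toy sequences (assumptions, not real entries).
entries = {
    "ENTRY_A": "MSTNPKPQ",
    "ENTRY_B": "mstnpkpq\n",
    "ENTRY_C": "MSTNPKPR",
}
shared = group_by_sequence(entries)
print(shared)
```

Identical sequence only tells you the entries are candidates for being the same thing; you still need to check the source/isolate and references for each group.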

One caveat is that the ends (NTRs) of ssRNA genomes are often highly structured, and NGS can have trouble getting through those regions. Some labs will go in and finish them correctly with RACE/Sanger; some labs reuse the ends from whatever isolate they felt was close enough. Genes typically aren't a problem, but you should be aware of this issue. So if you start seeing that the ends of fairly distinct isolates are the same, you may want to dig deeper. You should also be aware of intergenic and non-3'/5' proximal structures, which may look like regions of low quality. These regions can be very important, if not totally required, to the life cycle and greater infectious process of the virus (in a species-specific way).


written 5.7 years ago by pld4.8k

Thanks for the detailed reply.

If my goal is to create a data set of known and verified interactions between human and HCV proteins from public databases, how should I remove duplicate interactions?

Thanks for your help.


written 5.7 years ago by Farhad10

When an entry has more than one accession, UniProt still designates a "primary (citable) accession", while the other accessions for that entry are listed as secondary. You could add a step in your processing where you convert the accessions from the database to the primary one UniProt gives.
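
A rough sketch of that conversion step, assuming you have already built a lookup from secondary to primary accessions (how you build it is up to you; the `mapping` dict and the function names here are hypothetical). Note that a demerged accession such as P29358 from the UniProt example earlier in the thread points at more than one primary, so the sketch sets those aside instead of guessing.

```python
def to_primary(accession, secondary_to_primary):
    """Return the candidate primary accessions for one accession."""
    # Accessions missing from the lookup are assumed to be primary already.
    return secondary_to_primary.get(accession, [accession])


def normalize(accessions, secondary_to_primary):
    """Resolve accessions to primaries; collect the ambiguous ones."""
    resolved, ambiguous = {}, []
    for acc in accessions:
        candidates = to_primary(acc, secondary_to_primary)
        if len(candidates) == 1:
            resolved[acc] = candidates[0]
        else:
            # Demerged entries map to several primaries: needs manual review.
            ambiguous.append(acc)
    return resolved, ambiguous


# Hypothetical lookup, filled in from the demerge example quoted earlier.
mapping = {"P29358": ["P68250", "P68251"]}
resolved, ambiguous = normalize(["P68250", "P29358"], mapping)
print(resolved, ambiguous)
```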

In that case I would use the accession that the database uses. Since you're looking at the protein level, you could just call everything by the same ORF symbol/name.

You could get a bit more complex and have an interaction "instance" wherein you collect each piece of evidence (e.g. an entry in phisto) that says virus x interacts with human y. This would allow you to easily make lists of interactions, but would also allow for more in-depth filtering if needed.

Lastly, it sort of depends on what a duplicate really is. Did lab A use an old accession and lab B use a new accession, or did one lab enter the same interaction for every accession available? From a quick look, it seems some interactions may have been confirmed in multiple ways by the same lab. If there are multiple entries for a single interaction, you should track the method and the group that did it. Even if they used the same approach, it is good to know that someone else was able to replicate the result.
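
The interaction "instance" idea, with method and group tracked per piece of evidence, might look roughly like this. All field names (`source_db`, `method`, `group`, `isolate`) and the records are invented for illustration; map them onto whatever columns your database actually exports.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One database record supporting an interaction (hypothetical fields)."""
    source_db: str
    method: str
    group: str
    isolate: str


def collect_instances(records):
    """Group evidence under each (viral protein, human protein) pair."""
    instances = defaultdict(set)
    for viral, human, evidence in records:
        # A set drops exact duplicate records but keeps distinct evidence.
        instances[(viral, human)].add(evidence)
    return instances


# Toy records (made up): the same pair reported by two groups, one repeated.
records = [
    ("core", "HUMAN_X", Evidence("db1", "two-hybrid", "lab A", "isolate 1")),
    ("core", "HUMAN_X", Evidence("db1", "two-hybrid", "lab A", "isolate 1")),
    ("core", "HUMAN_X", Evidence("db1", "co-IP", "lab B", "isolate 2")),
]
instances = collect_instances(records)
for pair, evidence in instances.items():
    print(pair, len(evidence), sorted({e.group for e in evidence}))
```

The keys give you the duplicate-free interaction list, while the evidence sets let you filter later by method, group, or isolate.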

You may also want to be careful about what isolate each interaction is coming from. I'm not an HCV expert, so I don't really know how different the isolates are, but if an interaction is only reported for one isolate, is it because that group was the only one who looked or because that interaction is specific to that isolate?

written 5.7 years ago by pld4.8k

Thanks again for your comment.

About my examples: both are listed as primary, and they have the same sequence.

In this case it seems that I can keep only unique sequences to be sure the data set is duplicate-free.

written 5.7 years ago by Farhad10

Look at the references for each entry; they come from two different papers. It may be that the core region really was the same, or, because these are protein sequences, that synonymous mutations in the coding region are masked.
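
To illustrate the synonymous-mutation point, here is a toy example: two different nucleotide sequences that translate to the same protein. The sequences are made up for the demonstration (not taken from either entry), and the codon table only covers the codons used here.

```python
# Partial standard codon table: only the codons needed for this toy example.
CODONS = {
    "ATG": "M",
    "AGC": "S", "TCT": "S",
    "ACG": "T", "ACA": "T",
    "AAT": "N", "AAC": "N",
}


def translate(nucleotides):
    """Translate a coding sequence codon by codon using CODONS."""
    codons = [nucleotides[i:i + 3] for i in range(0, len(nucleotides), 3)]
    return "".join(CODONS[c] for c in codons)


# Two made-up coding sequences that differ at three codons.
isolate_a = "ATGAGCACGAAT"
isolate_b = "ATGTCTACAAAC"

print(translate(isolate_a), translate(isolate_b))
print(isolate_a == isolate_b)
```

So identical protein sequences don't guarantee identical isolates; comparing the underlying nucleotide records is the only way to tell.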

written 5.7 years ago by pld4.8k