Question

Same Protein Sequence

9.3 years ago

Farhad ▴ 10

I'm a beginner in bioinformatics, sorry for asking this question!

When I have two different UniProt IDs for two proteins with exactly the same sequence, how should I interpret this? Are these the same protein?

Example: Q68149, Q6QY91

I also know that hepatitis C virus (HCV) has about 11 proteins, but when I search the databases I find tens of UniProt IDs for HCV proteins!

I would really appreciate any comments on this question.

Updated 2.1 years ago by Ram • written 9.3 years ago by Farhad

Ram · Accepted Answer · 2015-01-06

http://www.uniprot.org/help/accession_numbers

For future reference, UniProt, Ensembl, NCBI, etc. are all very good about documenting these things. Quoting that page:

Entries can have more than one accession number. This can be due to two distinct mechanisms:

a) When two or more entries are merged, the accession numbers from all entries are kept. The first accession number is referred to as the 'Primary (citable) accession number', while the others are referred to as 'Secondary accession numbers'. These are listed in alphanumerical order.

b) If an existing entry is split into two or more entries ('demerged'), new 'primary' accession numbers are attributed to all the split entries while all original accession numbers are retained as 'secondary' accession numbers.

Example: P29358 which has been 'demerged' into P68250 and P68251.
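To make the primary/secondary distinction concrete, here is a minimal sketch that reads the `AC` line(s) of a UniProt flat-file (text format) entry and treats the first accession as primary and the rest as secondary, as described above. The function name `parse_accessions` and the sample entry are made up for illustration; the sample is not a copy of the real P68250 record.

```python
def parse_accessions(flat_file_text):
    """Return (primary, secondaries) from the AC lines of one flat-file entry."""
    accessions = []
    for line in flat_file_text.splitlines():
        if line.startswith("AC "):
            # Drop the two-letter line code, then split the ';'-separated list.
            for token in line[2:].split(";"):
                token = token.strip()
                if token:
                    accessions.append(token)
    if not accessions:
        raise ValueError("no AC line found")
    # The first accession listed is the primary (citable) one.
    return accessions[0], accessions[1:]


# Illustrative entry only (assumption): real entries carry many more lines.
sample_entry = """\
ID   EXAMPLE_ENTRY           Reviewed;         245 AA.
AC   P68250; P29358;
DE   Illustrative entry, not copied from UniProt.
"""

primary, secondary = parse_accessions(sample_entry)
print(primary, secondary)
```

If an accession you are holding shows up in the secondary list of some entry, it is an old name for that entry rather than a separate protein record.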

There are way more than "tens":

http://www.uniprot.org/uniprot/?query=taxonomy:11103

A bit of virology:

HCV is a (+)ssRNA virus of the family Flaviviridae. RNA viruses are prone to mutation, and there are often high degrees of interhost and intrahost genetic variability. HCV has been studied frequently in this arena; a common example is using deep sequencing to look at how the viral populations in each person may impact treatment or disease outcome. In a similar sense, it has been used as a model for studying quasispecies. It has also been a fairly major public health concern, so one would expect many genome sequences to be available. So you're going to have a wealth of sequence data, which is not what I've gotten to deal with (until very recently).

In general with all viruses, but especially RNA and ssRNA viruses, you shouldn't consider each entry under the virus (species) as the same thing. For example check out the NCBI Influenza Virus Resource: http://www.ncbi.nlm.nih.gov/genomes/FLU/growth.html.

Back to your case:

You shouldn't assume that each entry will have the same sequence, especially if you can see that a different source/isolate was provided. The trick is knowing when two accessions refer to different entries and when they are old and new accessions for the same entry. If they have the same sequence, you can assume they're probably the same thing, but that isn't always the case: it is possible that a second entry really did have the same sequence, or that the sequencing approach used wasn't sensitive enough to capture some low-frequency variant.
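As a starting point for spotting accessions that share a sequence, here is a rough sketch that groups FASTA records by identical sequence. It assumes UniProt-style headers (`>db|ACCESSION|NAME ...`) and falls back to the first word of the header otherwise. The helper names and the toy records (`ACC001` etc.) are hypothetical, not real UniProt entries.

```python
from collections import defaultdict


def read_fasta(text):
    """Parse FASTA text into {accession: sequence}."""
    records = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            parts = header.split("|")
            # UniProt-style header: db|ACCESSION|ENTRY_NAME description
            accession = parts[1] if len(parts) >= 3 else header.split()[0]
            records[accession] = ""
        elif accession is not None:
            records[accession] += line.upper()
    return records


def group_identical(records):
    """Return lists of accessions whose sequences are exactly identical."""
    groups = defaultdict(list)
    for accession, sequence in records.items():
        groups[sequence].append(accession)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Toy data (assumption): made-up accessions and sequences.
toy_fasta = """\
>tr|ACC001|ACC001_HYPO hypothetical isolate A
MSTNPKPQRK
TKRNTNRRPQ
>tr|ACC002|ACC002_HYPO hypothetical isolate B
MSTNPKPQRKTKRNTNRRPQ
>tr|ACC003|ACC003_HYPO hypothetical isolate C
MSTNPKPQRKTKRNTNRRPE
"""

print(group_identical(read_fasta(toy_fasta)))
```

A shared sequence only tells you where to look. You still need to check each entry's source/isolate and accession history to decide whether the records are one entry under two names or two independent submissions.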

One caveat is that the ends (the nontranslated regions, NTRs) of ssRNA genomes are often highly structured, and NGS can have trouble getting through those regions. Some labs will go in and finish them correctly with RACE/Sanger; some labs reuse the ends from whatever isolate they felt was close enough. Genes typically aren't a problem, but you should be aware of this issue. So if you start seeing that the ends of fairly distinct isolates are the same, you may want to dig deeper. You should also be aware of intergenic structures and structures that aren't near the 3'/5' ends; they may look like regions of low quality. These regions can be very important, if not totally required, to the life cycle and greater infectious process of the virus (in a species-specific way).
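One crude way to flag the "same ends, different genome" pattern is sketched below. It assumes the two genome sequences are already in the same orientation and start/end at comparable positions; the function name `shared_ends_only` and the `end_length` default are arbitrary choices for illustration, not an established method.

```python
def shared_ends_only(seq_a, seq_b, end_length=20):
    """True when both termini match exactly but the interior differs."""
    if min(len(seq_a), len(seq_b)) < 2 * end_length:
        # Too short to separate termini from interior.
        return False
    same_5_prime = seq_a[:end_length] == seq_b[:end_length]
    same_3_prime = seq_a[-end_length:] == seq_b[-end_length:]
    interior_differs = seq_a[end_length:-end_length] != seq_b[end_length:-end_length]
    return same_5_prime and same_3_prime and interior_differs


# Toy sequences (assumption): far shorter than any real genome.
isolate_a = "AAAACCCCGGGG"
isolate_b = "AAAACTCCGGGG"
print(shared_ends_only(isolate_a, isolate_b, end_length=4))
```

A hit here is only a prompt to read the submission notes. Identical termini can also be genuine, since these regions tend to be conserved.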