Question

How should I handle proteins that STRING cannot find, or that it matches to a similar protein instead?

3.2 years ago

Farah ▴ 80

Hello,

I have a list of proteins with their UniProt IDs from an MS experiment (with no quantitative data). I need to explore the protein-protein interaction network for this list using the STRING database. However, STRING could not identify some of the proteins ("Sorry, STRING found no proteins by this name in Homo sapiens"). I looked for alternative IDs for these proteins in other databases such as GeneCards and Ensembl, but STRING still could not find them.

Instead of finding these proteins, STRING returned similar proteins and/or paralogs.

For example, I have these four proteins:

protein name in my dataset     --->  String output
Q5JXB2 (UBE2NL)                ---> P61088 (UBE2N) 
P0CG22 (DHRS4L1)               ---> Q9BTZ2 (DHRS4)
Q5T1J5 (CHCHD2P9)              ---> Q9Y6H1 (CHCHD2) 
P0C7P4 (UQCRFS1P1)             ---> P47985 (UQCRFS1)

I am confused about what to do in this situation. Should I remove these four proteins from my dataset and leave them out of the analysis, since STRING cannot identify them? Should I keep them and accept STRING's mapping to UBE2N, DHRS4, CHCHD2 and UQCRFS1? Or is there another way to handle this that I am not aware of?

Could you advise me on the best way to proceed? Any advice or suggestions would be highly appreciated.

Many thanks.

Best wishes, Farah

proteomics STRING interaction network uniprot • 1.7k views

score 2 · Answer 1 · 2021-03-06

3.2 years ago

damian.szk ▴ 80

Hi Farah,

STRING is a protein-centric resource, and as such carries only protein-coding genes.

However, according to Ensembl (from which STRING's human proteome is derived), all of the genes you have listed seem to be either processed transcripts or pseudogenes:

https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000276380;r=X:143884071-143885255;t=ENST00000618570

https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000225766;r=14:24036453-24051028

https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000186940;r=9:79391304-79391759;t=ENST00000461726

https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000226085;r=22:39875289-39876108;t=ENST00000435169
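
For a longer list, the same check could be scripted. The sketch below is only an illustration: it pulls the Ensembl gene ID out of a gene-summary URL and assumes the Ensembl REST `lookup/id` endpoint, whose JSON record includes a `biotype` field. The helper names (`gene_id_from_url`, `fetch_biotype`, `keep_protein_coding`) are hypothetical, and `fetch_biotype` needs network access.

```python
import json
import re
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST server


def gene_id_from_url(url):
    """Return the ENSG identifier from an Ensembl gene-summary URL, or None."""
    match = re.search(r"[?;&]g=(ENSG\d+)", url)
    return match.group(1) if match else None


def fetch_biotype(gene_id):
    """Look up a gene's biotype via Ensembl REST (requires network access)."""
    request = urllib.request.Request(
        f"{ENSEMBL_REST}/lookup/id/{gene_id}",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response).get("biotype")


def keep_protein_coding(biotypes):
    """Given {gene_id: biotype}, return the IDs annotated as protein_coding."""
    return [gene for gene, biotype in biotypes.items() if biotype == "protein_coding"]
```

With network access, something like `{gid: fetch_biotype(gid) for gid in gene_ids}` would build the dictionary that `keep_protein_coding` expects.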

I don't think you should use their paralogs in the network, as some of them are not particularly close homologs.
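
To keep such substitutions from slipping into the network unnoticed, one option is to compare the gene symbol you submitted with the symbol STRING returned and set aside anything that differs. A minimal sketch using the four pairs from the question; the function name `split_by_symbol_match` is made up for illustration, and a plain case-insensitive comparison is assumed to be good enough:

```python
def split_by_symbol_match(pairs):
    """Split (query_symbol, string_symbol) pairs into exact matches and substitutions."""
    matched, substituted = [], []
    for query, hit in pairs:
        # Case-insensitive comparison; any difference counts as a substitution.
        if query.strip().upper() == hit.strip().upper():
            matched.append((query, hit))
        else:
            substituted.append((query, hit))
    return matched, substituted


# Symbol pairs taken from the question: (symbol in dataset, symbol STRING returned)
pairs = [
    ("UBE2NL", "UBE2N"),
    ("DHRS4L1", "DHRS4"),
    ("CHCHD2P9", "CHCHD2"),
    ("UQCRFS1P1", "UQCRFS1"),
]

matched, substituted = split_by_symbol_match(pairs)
for query, hit in substituted:
    print(f"{query}: STRING returned {hit}; excluded from the network input")
```

All four pairs end up in `substituted`, so none of them would be passed on to the network.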

Hope that answers your question.

Best, Damian.

Hi Damian, thank you so much for your explanation and clarification. My understanding now is that I should remove from my proteomic analysis those genes that are either processed transcripts or pseudogenes and keep only protein-coding genes, since STRING does not contain processed transcripts or pseudogenes and therefore could not identify them. Could you also advise on some other cases? For example, my MS list contains two NEDD4L entries with different IDs, and I do not know which one to keep or remove. I also do not know what "(Fragment)" means in K7ENS6.

protein name in my dataset     
K7ENS6 (NEDD4L)  --->   E3 ubiquitin-protein ligase NEDD4-like (Fragment) OS=Homo sapiens OX=9606 GN=NEDD4L PE=1 SV=1
A0A1B0GVY1 (NEDD4L)  --->   E3 ubiquitin-protein ligase NEDD4-like OS=Homo sapiens OX=9606 GN=NEDD4L PE=1 SV=1
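
The two descriptions above follow the UniProt FASTA header layout (protein name, then `OS=`, `OX=`, `GN=`, `PE=`, `SV=`), so entries that share a gene name can be listed and the ones marked "(Fragment)" flagged automatically. A rough sketch under that assumption; `parse_description` and the `is_fragment` key are names invented here, not part of any library:

```python
import re

# Assumed layout: "<name> OS=<organism> OX=<taxon> [GN=<gene>] PE=<n> SV=<n>"
HEADER_RE = re.compile(
    r"^(?P<name>.+?) OS=(?P<organism>.+?) OX=(?P<taxon>\d+)"
    r"(?: GN=(?P<gene>\S+))? PE=(?P<pe>\d+) SV=(?P<sv>\d+)$"
)


def parse_description(line):
    """Parse a UniProt-style description line into a dict, or return None."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # UniProt marks incomplete sequences with "(Fragment)" in the protein name.
    record["is_fragment"] = "(Fragment)" in record["name"]
    return record


entries = {
    "K7ENS6": "E3 ubiquitin-protein ligase NEDD4-like (Fragment) "
              "OS=Homo sapiens OX=9606 GN=NEDD4L PE=1 SV=1",
    "A0A1B0GVY1": "E3 ubiquitin-protein ligase NEDD4-like "
                  "OS=Homo sapiens OX=9606 GN=NEDD4L PE=1 SV=1",
}

for accession, description in entries.items():
    record = parse_description(description)
    print(accession, record["gene"], "fragment" if record["is_fragment"] else "not flagged")
```

This only surfaces the annotation; which of the two accessions to keep is still a judgment call.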

Also, for PRSS2, STRING returns PRSS3P2, which seems to be a different protein.

protein name in my dataset     --->  String output
A6XMV9 (PRSS2) --->  PRSS3P2 (Q8NHM4)
ENSG00000275896 (PRSS2) ---> PRSS3P2 (Q8NHM4)

I would highly appreciate your great help. Best, Farah

ADD REPLY • link 3.2 years ago by Farah ▴ 80