Question

How to retrieve the original protein code before it has been published on NCBI as XP_?

0

4.8 years ago

Raito92 ▴ 90

Hello everyone, I'm new to Biostars and to bioinformatics in general.

I'm having some trouble with NCBI accession numbers, which I need in order to run an analysis. I'm using a previously published gene annotation dataset for Olea europaea var. sylvestris (Unver et al. 2017). The genome and annotation data have also been published on NCBI, where the proteins can be found by their XP_ accession numbers.

My question is quite basic. I've downloaded the specific dataset I need from their website (more specifically, the one where each gene is annotated with GO terms), but as you can see, the genes there are named with codes like Oeu064910.1 (Oeu* in general), while the dataset I downloaded from NCBI uses the official XP_ codes.

How can I find the matches between the two naming systems, the one internal to the project and the official NCBI one? In other words, how do I find which Oeu code a given XP_ code used to be?

On the left is the GO annotation file with Oeu* codes; on the right is the official NCBI annotation file, which contains XP_ codes as well as LOC codes.

[Image: GO annotation file with Oeu* codes (left) next to the NCBI annotation file with XP_ and LOC codes (right)]

I checked all the files on the website, but this information doesn't seem to be anywhere...

Thanks in advance for your help!

genbank accession number annotation transcripts • 950 views

ADD COMMENT • link 4.8 years ago by Raito92 ▴ 90

1

Is there any specific reason you want to use the NCBI one? Why not use the version they offer for download on their olivegenome.org website (or ORCAE, Phytozome, ...)?

ADD REPLY • link 4.8 years ago by lieven.sterck 15k

0

Actually, no. I was thinking of running my analysis again using the provided databases, but I got curious about whether the matching data are kept somewhere and whether this kind of information is easy to retrieve.

ADD REPLY • link 4.8 years ago by Raito92 ▴ 90

0

I had a quick look at the NCBI ones, and it turns out that their IDs come from a new annotation round that NCBI performed with its in-house pipeline. It is thus very likely that no clean 1-to-1 conversion table for the IDs exists, as the gene models themselves will differ.

One option is to construct such a table yourself: get all the CDSs from NCBI and the ones from the olivegenome.org version, blastn them against each other, and extract the gene ID conversion table from the hits. This is an acceptable approach, but keep in mind that you likely will not get a 100% success rate with it.
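
As a rough illustration of the last step, here is a minimal Python sketch that turns blastn results into a conversion table using reciprocal best hits. It assumes blastn was run in both directions with tabular output (`-outfmt 6`, default column order). The reciprocal-best-hit filter and the 95% identity cutoff are choices made for this sketch, not something prescribed above, and the IDs in the toy data are invented placeholders rather than real olive genome or NCBI identifiers.

```python
import csv

def best_hits(blast_lines, min_identity=95.0):
    """Keep the highest-bitscore hit per query from BLAST tabular (-outfmt 6) lines.

    Default column order is assumed: qseqid, sseqid, pident, ..., bitscore (column 12).
    Returns {query_id: (subject_id, bitscore)}.
    """
    best = {}
    for row in csv.reader(blast_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        identity, bitscore = float(row[2]), float(row[11])
        if identity < min_identity:
            continue  # arbitrary cutoff chosen for this sketch
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best

def reciprocal_best_hits(forward_lines, reverse_lines, min_identity=95.0):
    """Return {id_a: id_b} for pairs that are each other's best hit."""
    fwd = best_hits(forward_lines, min_identity)
    rev = best_hits(reverse_lines, min_identity)
    return {q: s for q, (s, _) in fwd.items() if s in rev and rev[s][0] == q}

# Toy data with invented IDs: project CDSs vs. NCBI CDSs, and the reverse search.
forward = [
    "Oeu000001.1\tncbi_cds_A\t99.5\t900\t4\t0\t1\t900\t1\t900\t0.0\t1640",
    "Oeu000001.1\tncbi_cds_B\t96.0\t300\t12\t0\t1\t300\t1\t300\t1e-140\t480",
    "Oeu000002.1\tncbi_cds_B\t98.0\t600\t12\t0\t1\t600\t1\t600\t0.0\t1040",
]
reverse = [
    "ncbi_cds_A\tOeu000001.1\t99.5\t900\t4\t0\t1\t900\t1\t900\t0.0\t1640",
    "ncbi_cds_B\tOeu000003.1\t99.0\t700\t7\t0\t1\t700\t1\t700\t0.0\t1250",
]
table = reciprocal_best_hits(forward, reverse)
print(table)
```

With real data you would pass open file handles for the two blastn result files instead of the lists. Genes that were split, merged, or newly predicted in the NCBI annotation will simply be absent from the table, which is where the less-than-100% success rate shows up.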

ADD REPLY • link 4.8 years ago by lieven.sterck 15k