Question: How to relate protein IDs (protein sequences) to GenBank file sequences
3.0 years ago by
kws1540 wrote:

Hi everyone,

I am completely new to bioinformatics and I'm working on a project about tomato. I have used a package to identify the S. pennellii orthologs of the S. lycopersicum transcription factors. I did that by aligning the S. lycopersicum transcription factor protein sequences against all the S. pennellii protein sequences (a FASTA file from NCBI).

Now I basically have something like this:

Solyc07g053610.2.1 100%,Sopen07g027560.100%

I want to relate these protein IDs to a GenBank file (nucleotide sequences). Does anyone have any idea how I can do this? The protein IDs may not be compatible with the GenBank files, since they use a different naming system. Thank you very much.

3.0 years ago by
planet earth
piet1.6k wrote:

It seems that these protein identifiers have only been used internally by ITAG (the International Tomato Annotation Group) and were never submitted to GenBank.

There is currently only one full tomato genome in GenBank. It has been upgraded several times in recent years, and with every upgrade the chromosomal coordinates shift. The latest assembly from ITAG is available as an NCBI RefSeq sequence. This RefSeq sequence has been automatically reannotated by NCBI, but the original ITAG annotation can be downloaded from


The GFF file can be searched for the position of the gene encoding protein Solyc07g053610 in the chromosomal DNA sequence; the sed step replaces the ITAG chromosome name with the RefSeq accession:

awk '$3~/gene/ && $9~/Solyc07g053610/' ITAG2.4_gene_models.gff3 | sed 's/SL2.50ch07/NC_015444.2/'

NC_015444.2     ITAG_eugene     gene    62033451        62049779        .       +       .       ID=gene:Solyc07g053610.2;Name=Solyc07g053610.2;Alias=Solyc07g053610;from_BOGAS=1;length=16329
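For anyone who prefers a script to the one-liner, here is a rough Python sketch of the same lookup. The function name `find_gene` and the `SEQID_TO_REFSEQ` dictionary are made up for illustration. The dictionary holds only the chromosome 7 pair used in the sed command above; the other chromosomes would have to be filled in from the mapping table. Unlike the awk pattern, it requires the feature type to be exactly `gene`.

```python
import io

# Hypothetical lookup table. Only the chromosome 7 pair comes from the
# sed command in this thread; add the other chromosomes from the mapping table.
SEQID_TO_REFSEQ = {"SL2.50ch07": "NC_015444.2"}


def find_gene(gff_handle, gene_id, seqid_map=SEQID_TO_REFSEQ):
    """Return (accession, start, end, strand) for the first 'gene' feature
    whose attribute column mentions gene_id, or None if nothing matches."""
    for line in gff_handle:
        # Skip comment/directive lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF3 feature line
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = fields
        if ftype == "gene" and gene_id in attrs:
            # Fall back to the original sequence name if it is not in the map.
            return seqid_map.get(seqid, seqid), int(start), int(end), strand
    return None


# One feature line as it would look before the sed substitution.
example = (
    "SL2.50ch07\tITAG_eugene\tgene\t62033451\t62049779\t.\t+\t.\t"
    "ID=gene:Solyc07g053610.2;Name=Solyc07g053610.2;"
    "Alias=Solyc07g053610;from_BOGAS=1;length=16329\n"
)

hit = find_gene(io.StringIO(example), "Solyc07g053610")
print(hit)  # ('NC_015444.2', 62033451, 62049779, '+')
```

With the real annotation you would pass an open file instead, e.g. `with open("ITAG2.4_gene_models.gff3") as fh: find_gene(fh, "Solyc07g053610")`. GFF3 coordinates are 1-based and inclusive, so end - start + 1 gives the 16329 reported in the `length` attribute.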

A table mapping chromosome numbers to NCBI RefSeq accessions is here

