To compare two genes from different species from a .gbff
2.2 years ago

I'm trying to compare two genes from different species using GenBank flat files (.gbff), but I'm not sure how to do it. Specifically, how can I tell whether two genes from different gbff files code for the same protein without comparing the actual sequences? For instance, is the protein_id field the way to go?

PS: Is the locus_tag unique for each gene from each species?

Thanks in advance!

EDIT: I have an example as requested. I took the first gene of Mycoplasma hyopneumoniae 7422 and 232 (two similar strains), shown below:

7422

 gene            207..1598
                 /locus_tag="MHL_RS00010"
                 /old_locus_tag="MHL_3508"
 CDS             207..1598
                 /locus_tag="MHL_RS00010"
                 /old_locus_tag="MHL_3508"
                 /inference="COORDINATES: similar to AA
                 sequence:RefSeq:WP_011205840.1"
                 /note="Derived by automated computational analysis using
                 gene prediction method: Protein Homology."
                 /codon_start=1
                 /transl_table=4
                 /product="chromosomal replication initiator protein DnaA"
                 /protein_id="WP_020835480.1"
                 /translation="MQTNKNNLKVRTQQIRQQIENLLNDRMLYNNFFSTIYVFNETET
                 EIIIDFTDLIAKQEVISRWVDTVEKAIKNLEISKILTFNNTNNYTINSKESQNFSIKN
                 KYCSFNINNVLNKFTFRNFIKSSYNFQIFSIYDAIVANSRLNYSPIFISGPSGIGKTH
                 FINAIGNLLVEKQKKVFYINDYKFISCVSSWMQNGQNEKISEFLNWLSQVDAFLFDDI
                 QGLANKQQTSIVALEILNRFIEEDKTVIITSDKSPSLLGGFEERFITRFSSGLHIKLN
                 KPKKEDFLRIFKHKLVEEKLEKHIWTNDAFEFLSKHFRNSIRELEGALKSIVFYIQTN
                 KNKFENEIYFDKKKMFEIFVEKYEIEQTITPDLIIEVVSKYYGVSILDIKSEKRGKNI
                 VHARDIAIWLIKNILDLTHNSVGTFFNNRRHSTIISTLKKIDTLKQSNNNELEIALNH
                 IYKQLNWSFKQRK"

232

 gene            1..1392
                 /locus_tag="MHP_RS00005"
                 /old_locus_tag="mhp001"
 CDS             1..1392
                 /locus_tag="MHP_RS00005"
                 /old_locus_tag="mhp001"
                 /inference="COORDINATES: similar to AA
                 sequence:RefSeq:WP_011205840.1"
                 /note="Derived by automated computational analysis using
                 gene prediction method: Protein Homology."
                 /codon_start=1
                 /transl_table=4
                 /product="chromosomal replication initiator protein DnaA"
                 /protein_id="WP_011205840.1"
                 /translation="MQTNKNNLKVRTQQIRQQIENLLNDRMLYNNFFSTIYVLNETET
                 EIIIDFTDLIAKQEVISRWVDTVEKAIKNLEISKILTFNNTNNYTINSKESQNFSIKN
                 KYCSFNINNVLNKFTFRNFIKSSYNFQIFSIYDAIVANSRLNYSPIFISGPSGIGKTH
                 FINAIGNLLVEKQKKVFYINDYKFISCVSSWMQNGQNEKISEFLNWLSQVDAFLFDDI
                 QGLANKQQTSIVALEILNRFIEEDKTVIITSDKSPSLLGGFEERFITRFSSGLHIKLN
                 KPKKEDFLRIFKHKLVEEKLEKHIWTNDAFEFLSKHFRNSIRELEGALKSIVFYIQTN
                 KNKFEDEIYFDKKKMFEIFVEKYEIEQTITPDLIIEVVSKYYGVSILDIKSEKRGKNI
                 VHARDIAIWLIKNILDLTHNSVGTFFNNRRHSTIISTLKKIDTLKQSNNNELEIALNH
                 IYKQLNWSFKQRK"
genome gene gbff ncbi genbank • 708 views

You will need to compare the sequences if you really care about them being identical. If you only want to know whether they encode the same product at a 'phenotypic' level (for instance, both are the same polymerase subunit), it might be enough to compare the product fields, but these are notoriously inconsistent.

I wouldn't use the locus tags under any circumstances.
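A direct sequence comparison can be sketched in a few lines of Python. This is only a minimal illustration, assuming the two /translation strings have already been copied out of the records and have the same length; the function name `diff_proteins` and the two constants are made up for this example, and the constants hold just the first line of each translation from the question.

```python
# First line of each /translation from the question (excerpts, not the full proteins).
FIRST_LINE_7422 = "MQTNKNNLKVRTQQIRQQIENLLNDRMLYNNFFSTIYVFNETET"
FIRST_LINE_232 = "MQTNKNNLKVRTQQIRQQIENLLNDRMLYNNFFSTIYVLNETET"


def diff_proteins(seq_a, seq_b):
    """Return (position, residue_a, residue_b) for every mismatch.

    Positions are 1-based. Only equal-length sequences are handled;
    anything else needs a real alignment (for example blastp).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("lengths differ; align the sequences first")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


if __name__ == "__main__":
    for pos, a, b in diff_proteins(FIRST_LINE_7422, FIRST_LINE_232):
        print(f"position {pos}: {a} -> {b}")
```

An empty list means the two strings are identical. For the excerpts above, as pasted in the question, the list is not empty: there is one F/L substitution in the first line alone.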


I am not sure how you would be able to do that unless you compare the sequences. If you are happy to use identical "protein_id" or "locus_tag" values as the criterion for similarity/identity, then it would be different.

As I recall, locus_tags used to be unique in the early days of GenBank. I'm not sure that is still the case, now that many more sequences are available.

Can you provide examples of some accessions you are looking at?


I've added examples in the original post as requested. I've also run a blastp comparison between both protein sequences, and they are the same.
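For anyone wanting to repeat the blastp step, the protein sequence first has to be pulled out of the wrapped /translation qualifier. Below is a rough sketch using only the standard library, assuming the text of a single CDS feature has been copied into a string. The function name `cds_to_fasta` is hypothetical, and a proper GenBank parser would be more robust than these regular expressions.

```python
import re

# The quoted value may wrap over several indented lines in a .gbff file;
# [^"] also matches newlines, so the whole wrapped value is captured.
PROTEIN_ID = re.compile(r'/protein_id="([^"]+)"')
TRANSLATION = re.compile(r'/translation="([^"]+)"')


def cds_to_fasta(feature_text):
    """Build a FASTA record from the text of one CDS feature."""
    pid = PROTEIN_ID.search(feature_text)
    translation = TRANSLATION.search(feature_text)
    if pid is None or translation is None:
        # Some CDS features (e.g. pseudogenes) carry neither qualifier.
        raise ValueError("feature has no protein_id or no translation")
    # Remove the line breaks and indentation introduced by the wrapping.
    sequence = re.sub(r"\s+", "", translation.group(1))
    return f">{pid.group(1)}\n{sequence}\n"
```

The two resulting FASTA records can then be written to files and used as query and subject for blastp.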


If the blastp comparison shows 100% identity, then that would be a reasonable inference. This also assumes that there are no additional hits that are equally good.

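Both are WP_* records. These are RefSeq non-redundant protein records: a single WP_ accession represents one unique protein sequence and is shared by every genome annotated with that exact sequence.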
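Because RefSeq assigns one WP_ accession per unique protein sequence, comparing protein_id values between two gbff files is a quick way to find proteins that are exactly identical in both annotations. Here is a minimal sketch, assuming each file has been read into a string; the function names are made up for illustration, and the regular expression deliberately matches only the /protein_id qualifier, not WP_ accessions mentioned inside /inference.

```python
import re

PROTEIN_ID = re.compile(r'/protein_id="([^"]+)"')


def protein_ids(gbff_text):
    """Collect every /protein_id value found in the text of a .gbff file."""
    return set(PROTEIN_ID.findall(gbff_text))


def shared_protein_ids(gbff_text_a, gbff_text_b):
    """Accessions present in both files, i.e. proteins annotated as identical."""
    return protein_ids(gbff_text_a) & protein_ids(gbff_text_b)
```

A shared accession implies an identical protein sequence. The reverse does not hold: different accessions only mean the sequences are not exactly identical, and they may still differ by a single residue. In the records pasted in the question, the two protein_id values are WP_020835480.1 and WP_011205840.1, so this gene would not show up in the intersection.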
