Question: How to compare two genes from different species using .gbff files
19 months ago by
marcos.sep.ro10 wrote:

I'm trying to compare two genes from different species using GenBank flat files (gbff), but I'm not sure how to do it. Specifically, how do I know whether two genes from different gbff files code for the same protein without comparing the actual sequences? For instance, is the protein_id field the way to go?

PS: Is the locus_tag unique for each gene from each species?

Thanks in advance!

EDIT: I have added an example as requested. I took the first gene of Mycoplasma hyopneumoniae strains 7422 and 232 (two similar strains), shown below:

7422

 gene            207..1598
                 /locus_tag="MHL_RS00010"
                 /old_locus_tag="MHL_3508"
 CDS             207..1598
                 /locus_tag="MHL_RS00010"
                 /old_locus_tag="MHL_3508"
                 /inference="COORDINATES: similar to AA
                 sequence:RefSeq:WP_011205840.1"
                 /note="Derived by automated computational analysis using
                 gene prediction method: Protein Homology."
                 /codon_start=1
                 /transl_table=4
                 /product="chromosomal replication initiator protein DnaA"
                 /protein_id="WP_020835480.1"
                 /translation="MQTNKNNLKVRTQQIRQQIENLLNDRMLYNNFFSTIYVFNETET
                 EIIIDFTDLIAKQEVISRWVDTVEKAIKNLEISKILTFNNTNNYTINSKESQNFSIKN
                 KYCSFNINNVLNKFTFRNFIKSSYNFQIFSIYDAIVANSRLNYSPIFISGPSGIGKTH
                 FINAIGNLLVEKQKKVFYINDYKFISCVSSWMQNGQNEKISEFLNWLSQVDAFLFDDI
                 QGLANKQQTSIVALEILNRFIEEDKTVIITSDKSPSLLGGFEERFITRFSSGLHIKLN
                 KPKKEDFLRIFKHKLVEEKLEKHIWTNDAFEFLSKHFRNSIRELEGALKSIVFYIQTN
                 KNKFENEIYFDKKKMFEIFVEKYEIEQTITPDLIIEVVSKYYGVSILDIKSEKRGKNI
                 VHARDIAIWLIKNILDLTHNSVGTFFNNRRHSTIISTLKKIDTLKQSNNNELEIALNH
                 IYKQLNWSFKQRK"

232

 gene            1..1392
                 /locus_tag="MHP_RS00005"
                 /old_locus_tag="mhp001"
 CDS             1..1392
                 /locus_tag="MHP_RS00005"
                 /old_locus_tag="mhp001"
                 /inference="COORDINATES: similar to AA
                 sequence:RefSeq:WP_011205840.1"
                 /note="Derived by automated computational analysis using
                 gene prediction method: Protein Homology."
                 /codon_start=1
                 /transl_table=4
                 /product="chromosomal replication initiator protein DnaA"
                 /protein_id="WP_011205840.1"
                 /translation="MQTNKNNLKVRTQQIRQQIENLLNDRMLYNNFFSTIYVLNETET
                 EIIIDFTDLIAKQEVISRWVDTVEKAIKNLEISKILTFNNTNNYTINSKESQNFSIKN
                 KYCSFNINNVLNKFTFRNFIKSSYNFQIFSIYDAIVANSRLNYSPIFISGPSGIGKTH
                 FINAIGNLLVEKQKKVFYINDYKFISCVSSWMQNGQNEKISEFLNWLSQVDAFLFDDI
                 QGLANKQQTSIVALEILNRFIEEDKTVIITSDKSPSLLGGFEERFITRFSSGLHIKLN
                 KPKKEDFLRIFKHKLVEEKLEKHIWTNDAFEFLSKHFRNSIRELEGALKSIVFYIQTN
                 KNKFEDEIYFDKKKMFEIFVEKYEIEQTITPDLIIEVVSKYYGVSILDIKSEKRGKNI
                 VHARDIAIWLIKNILDLTHNSVGTFFNNRRHSTIISTLKKIDTLKQSNNNELEIALNH
                 IYKQLNWSFKQRK"
ncbi genbank gbff gene genome • 481 views

You will need to compare the sequences if you really care about them being identical. If you only want to know whether they encode the same product at a 'phenotypic' level, i.e. they are both the same polymerase subunit, for instance, it might be enough to compare the product fields, but these are notoriously inconsistent between annotations.

I wouldn't use the locus tags under any circumstances.
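
As a minimal sketch of the field comparison described above, the snippet below pulls the qualifiers out of two CDS blocks and reports which of them agree. It uses only the standard library; in practice a dedicated GenBank parser (for example Biopython's SeqIO) would likely be more robust. The helper names `parse_qualifiers` and `compare_fields` are hypothetical, and the two excerpts are shortened copies of the CDS features from the question, with each translation cut to its first two lines.

```python
import re

def parse_qualifiers(feature_text):
    """Collect /key=value qualifiers from one GenBank feature block."""
    qualifiers = {}
    # Quoted values may wrap across lines, so match up to the closing quote.
    for key, value in re.findall(r'/(\w+)="([^"]*)"', feature_text):
        if key == "translation":
            value = re.sub(r"\s+", "", value)   # rejoin wrapped residues
        else:
            value = re.sub(r"\s+", " ", value)  # rejoin wrapped words
        qualifiers[key] = value
    # Unquoted values such as /codon_start=1.
    for key, value in re.findall(r'/(\w+)=([^"\s]+)', feature_text):
        qualifiers.setdefault(key, value)
    return qualifiers

def compare_fields(a, b, keys=("product", "protein_id", "translation")):
    """Map each qualifier name to True when both features carry the same value."""
    return {k: a.get(k) is not None and a.get(k) == b.get(k) for k in keys}

# Shortened excerpts from the question (translations truncated for brevity).
CDS_7422 = '''
     CDS             207..1598
                     /locus_tag="MHL_RS00010"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:WP_011205840.1"
                     /codon_start=1
                     /transl_table=4
                     /product="chromosomal replication initiator protein DnaA"
                     /protein_id="WP_020835480.1"
                     /translation="MQTNKNNLKVRTQQIRQQIENLLNDRMLYNNFFSTIYVFNETET
                     EIIIDFTDLIAKQEVISRWVDTVEKAIKNLEISKILTFNNTNNYTINSKESQNFSIKN"
'''
CDS_232 = '''
     CDS             1..1392
                     /locus_tag="MHP_RS00005"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:WP_011205840.1"
                     /codon_start=1
                     /transl_table=4
                     /product="chromosomal replication initiator protein DnaA"
                     /protein_id="WP_011205840.1"
                     /translation="MQTNKNNLKVRTQQIRQQIENLLNDRMLYNNFFSTIYVLNETET
                     EIIIDFTDLIAKQEVISRWVDTVEKAIKNLEISKILTFNNTNNYTINSKESQNFSIKN"
'''

q_7422 = parse_qualifiers(CDS_7422)
q_232 = parse_qualifiers(CDS_232)
print(compare_fields(q_7422, q_232))
```

For these two excerpts the product strings match while the protein_id values do not, which illustrates why a matching product alone says little about sequence identity.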

written 19 months ago by Joe18k

I am not sure how you would be able to do that unless you compare the sequences. If you are happy with the "protein_id" or "locus_tag" values being identical as a criterion for similarity/identity, then it would be different.

As I recall, locus_tags used to be unique in the early days of GenBank. I'm not sure whether they still are, now that many more sequences are available.

Can you provide examples of some accessions you are looking at?

modified 19 months ago • written 19 months ago by GenoMax92k

I've added examples to the original post as requested. I've also run a blastp comparison between the two protein sequences, and they are the same.

written 19 months ago by marcos.sep.ro10

If the blastp comparisons are 100% identical, then that would be a reasonable inference. This also assumes that there are no additional hits that are equally good.

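Both are WP_* records. These are RefSeq's non-redundant protein records: each WP_ accession stands for one unique protein sequence, so identical proteins annotated in different genomes share the same accession.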
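
When two translations have the same length, identity can also be checked directly, position by position, as a quick complement to blastp. The sketch below is only an illustration: the function name `differences` is hypothetical, and for brevity it is run on just the first line of each /translation pasted in the question rather than on the full proteins.

```python
def differences(seq_a, seq_b):
    """Return (position, residue_a, residue_b) for each mismatch, 1-based."""
    if len(seq_a) != len(seq_b):
        # Unequal lengths imply indels, which need a real alignment.
        raise ValueError("lengths differ; an alignment (e.g. blastp) is needed")
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
            if a != b]

# First line of each /translation qualifier as pasted in the question.
FIRST_LINE_7422 = "MQTNKNNLKVRTQQIRQQIENLLNDRMLYNNFFSTIYVFNETET"
FIRST_LINE_232 = "MQTNKNNLKVRTQQIRQQIENLLNDRMLYNNFFSTIYVLNETET"

diffs = differences(FIRST_LINE_7422, FIRST_LINE_232)
identity = 100 * (1 - len(diffs) / len(FIRST_LINE_7422))
print(diffs)
print(f"{identity:.1f}% identical over {len(FIRST_LINE_7422)} residues")
```

On these two fragments the check reports a single F/L substitution at position 39, which is consistent with the two records carrying different WP_ accessions.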

modified 19 months ago • written 19 months ago by GenoMax92k