Structure of the Entrez Gene xml: Entrezgene_locus
9.6 years ago
bioruffo • 0

Edit - yes, I had found the structure of the Gene Record, but unfortunately this doesn't answer my question.

Hello all,

I'm trying to understand the structure of the Entrez Gene XML. In the simplest scenario, I want to retrieve a list of mRNA GIs associated with a gene's reference sequences. Apparently I should rely on the "Entrezgene_locus" key for this, but I'm afraid I didn't grasp the structure of "Entrezgene_locus", and after trying to look around the net for explanations, I ended up here.

Say, for example: if I retrieve the record of human TP53:

from Bio import Entrez

# NCBI asks for a contact address with every E-utilities request
Entrez.email = "name@provider.com"
handle = Entrez.efetch(db="gene", id="7157", retmode="xml")
records = Entrez.read(handle)
handle.close()

I see that the first GI, 371502114, which corresponds to Isoform A, can be found at several paths in the record:

records[0]['Entrezgene_locus'][0]['Gene-commentary_products'][2]['Gene-commentary_seqs'][0]['Seq-loc_whole']['Seq-id']['Seq-id_gi']
records[0]['Entrezgene_locus'][1]['Gene-commentary_products'][3]['Gene-commentary_seqs'][0]['Seq-loc_whole']['Seq-id']['Seq-id_gi']
records[0]['Entrezgene_locus'][2]['Gene-commentary_products'][3]['Gene-commentary_seqs'][0]['Seq-loc_whole']['Seq-id']['Seq-id_gi']
records[0]['Entrezgene_locus'][3]['Gene-commentary_products'][2]['Gene-commentary_seqs'][0]['Seq-loc_whole']['Seq-id']['Seq-id_gi']
records[0]['Entrezgene_comments'][5]['Gene-commentary_comment'][0]['Gene-commentary_products'][0]['Gene-commentary_source'][0]['Other-source_src']['Dbtag']['Dbtag_tag']['Object-id']['Object-id_id']
records[0]['Entrezgene_comments'][5]['Gene-commentary_comment'][0]['Gene-commentary_products'][0]['Gene-commentary_seqs'][0]['Seq-loc_whole']['Seq-id']['Seq-id_gi']

Of these, the first four are different ListElement objects all under the Entrezgene_locus key. The structure is identical, but three of them resolve into a 15-element list under Gene-commentary_products, while one has only 8 elements, missing a few of the isoforms.

So, the question is: why are there four elements in this list? I mean, what do these four different elements refer to, biologically? And if I want to download all RefSeqs for a gene, should I make a set of all GIs listed in all four elements of the list, or should I assume that element 0 is the most complete (but where is this written?), or should I rely on Entrezgene_comments, which seems more linear for this specific gene (but is it always so)?
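To make the first option concrete, here is a minimal sketch of taking the union of GIs over every element of Entrezgene_locus, following only the path shown above (Gene-commentary_products, then Gene-commentary_seqs, then Seq-loc_whole / Seq-id / Seq-id_gi). The function name product_gis and the mock data are my own placeholders, not part of Biopython or of the real record, and the sketch assumes the parsed record behaves like nested dicts and lists.

```python
def product_gis(locus_entries):
    """Union of the GIs found at the path above, over every locus entry and product."""
    gis = set()
    for entry in locus_entries:
        # Not every entry is guaranteed to carry products, so default to an empty list.
        for product in entry.get("Gene-commentary_products", []):
            for seq in product.get("Gene-commentary_seqs", []):
                whole = seq.get("Seq-loc_whole")
                if whole is None:
                    continue  # skip locations that are not given as a whole sequence
                gi = whole.get("Seq-id", {}).get("Seq-id_gi")
                if gi is not None:
                    gis.add(str(gi))
    return gis


def _seq(gi):
    # Helper that builds one mock sequence location (hypothetical data).
    return {"Seq-loc_whole": {"Seq-id": {"Seq-id_gi": gi}}}


# Hypothetical stand-in for records[0]['Entrezgene_locus']; "111" is a made-up GI.
mock_locus = [
    {"Gene-commentary_products": [
        {"Gene-commentary_seqs": [_seq("371502114")]},
        {"Gene-commentary_seqs": [_seq("111")]},
    ]},
    {"Gene-commentary_products": [
        {"Gene-commentary_seqs": [_seq("371502114")]},
    ]},
    {},  # an entry without any products
]

all_gis = product_gis(mock_locus)
```

On the real record this would be called as product_gis(records[0]['Entrezgene_locus']). It only shows how the union could be built; it does not tell me whether the union is the biologically correct answer.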

Thank you very much! Sorry for the long post!

Roberto

Biopython Entrez gene • 1.9k views
9.6 years ago

The structure of the Gene record should be given in the EntrezGene DTD: http://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_Entrezgene.dtd -> http://www.ncbi.nlm.nih.gov/dtd/NCBI_Entrezgene.mod.dtd

(...)
<!ELEMENT Gene-commentary (
        Gene-commentary_type, 
        Gene-commentary_heading?, 
        Gene-commentary_label?, 
        Gene-commentary_text?, 
        Gene-commentary_accession?, 
        Gene-commentary_version?, 
        Gene-commentary_xtra-properties?, 
        Gene-commentary_refs?, 
        Gene-commentary_source?, 
        Gene-commentary_genomic-coords?, 
        Gene-commentary_seqs?, 
        Gene-commentary_products?, 
        Gene-commentary_properties?, 
        Gene-commentary_comment?, 
        Gene-commentary_create-date?, 
        Gene-commentary_update-date?)>
<!-- type of Gene Commentary -->
<!ELEMENT Gene-commentary_type (%INTEGER;)>

<!--
    property    -  used to display tag/value pair
         for this type label is used as property tag, text is used as property value, 
         other fields are not used.
    reference   -  currently not used             
    generif -  to include generif in the main blob             
    phenotype   -  to display phenotype information
    complex -  used (but not limited) to identify resulting 
         interaction complexes
    compound    -  pubchem entities
    gene-group  -  for relationship sets (such as pseudogene / parent gene)
    assembly    -  for full assembly accession
    assembly-unit   -  for the assembly unit corresponding to the refseq
-->

(...)
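Going by the optional fields in the DTD excerpt above, one way to see what distinguishes the entries of a Gene-commentary list is to print the descriptive fields of each entry side by side. The sketch below is only illustrative: the function name summarize_commentaries and the mock values are hypothetical, and it assumes the parsed record can be read like nested dicts and lists.

```python
# Descriptive fields taken from the Gene-commentary element in the DTD excerpt.
FIELDS = (
    "Gene-commentary_type",
    "Gene-commentary_heading",
    "Gene-commentary_label",
    "Gene-commentary_accession",
    "Gene-commentary_version",
)


def summarize_commentaries(entries):
    """Return one dict per Gene-commentary with its descriptive fields and product count."""
    rows = []
    for entry in entries:
        # Every field except the type is optional in the DTD, so missing ones become None.
        row = {field: entry.get(field) for field in FIELDS}
        row["n_products"] = len(entry.get("Gene-commentary_products", []))
        rows.append(row)
    return rows


# Hypothetical mock entries; the values are placeholders, not real TP53 data.
mock_entries = [
    {
        "Gene-commentary_type": "1",
        "Gene-commentary_heading": "EXAMPLE_HEADING",
        "Gene-commentary_accession": "EXAMPLE_ACC",
        "Gene-commentary_products": [{}, {}],
    },
    {"Gene-commentary_type": "1"},
]

rows = summarize_commentaries(mock_entries)
for row in rows:
    print(row)
```

Calling summarize_commentaries(records[0]['Entrezgene_locus']) on the real record should show whether the entries differ in heading, label, or accession, which may hint at what each one refers to.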

Thank you... I had found this, but it doesn't really explain why there are multiple, only partially overlapping entries in the 'Entrezgene_locus' list. Or does it, and if so, where?
