Edit - yes, I had found the structure of the Gene Record, but unfortunately this doesn't answer my question.
Hello all,
I'm trying to understand the structure of the Entrez Gene XML. In the simplest scenario, I want to retrieve a list of mRNA GIs associated with a gene's reference sequences. Apparently I should rely on the "Entrezgene_locus" key for this, but I'm afraid I didn't grasp the structure of "Entrezgene_locus", and after trying to look around the net for explanations, I ended up here.
Say, for example: if I retrieve the record of human TP53:
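from Bio import Entrez
# NCBI asks for a contact address on every E-utilities request
Entrez.email = "name@provider.com"
# Fetch the Entrez Gene record for human TP53 (Gene ID 7157) as XML
handle = Entrez.efetch(db="gene", id="7157", retmode="xml")
records = Entrez.read(handle)
handle.close()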
I see that the first GI, 371502114, which corresponds to Isoform A, can be found in a series of entries:
records[0]['Entrezgene_locus'][0]['Gene-commentary_products'][2]['Gene-commentary_seqs'][0]['Seq-loc_whole']['Seq-id']['Seq-id_gi']
records[0]['Entrezgene_locus'][1]['Gene-commentary_products'][3]['Gene-commentary_seqs'][0]['Seq-loc_whole']['Seq-id']['Seq-id_gi']
records[0]['Entrezgene_locus'][2]['Gene-commentary_products'][3]['Gene-commentary_seqs'][0]['Seq-loc_whole']['Seq-id']['Seq-id_gi']
records[0]['Entrezgene_locus'][3]['Gene-commentary_products'][2]['Gene-commentary_seqs'][0]['Seq-loc_whole']['Seq-id']['Seq-id_gi']
records[0]['Entrezgene_comments'][5]['Gene-commentary_comment'][0]['Gene-commentary_products'][0]['Gene-commentary_source'][0]['Other-source_src']['Dbtag']['Dbtag_tag']['Object-id']['Object-id_id']
records[0]['Entrezgene_comments'][5]['Gene-commentary_comment'][0]['Gene-commentary_products'][0]['Gene-commentary_seqs'][0]['Seq-loc_whole']['Seq-id']['Seq-id_gi']
Of these, the first four are different ListElement objects, all under the Entrezgene_locus key. The structure is identical, but three of them resolve into a 15-element list under Gene-commentary_products, while one has only 8 elements, missing a few of the isoforms.
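To compare the four entries side by side, something like the sketch below could be used. It assumes each entry carries 'Gene-commentary_heading', 'Gene-commentary_label' and 'Gene-commentary_accession' keys; those names are my assumption and have not been checked against this record, so missing keys simply come back as None. The helper name summarize_locus is made up for illustration.

```python
def summarize_locus(record):
    """Return one (heading, label, accession, n_products) tuple per
    entry of record['Entrezgene_locus'].

    The heading/label/accession keys are assumed, not confirmed;
    .get() returns None when an entry does not have them.
    """
    rows = []
    for entry in record.get("Entrezgene_locus", []):
        rows.append((
            entry.get("Gene-commentary_heading"),
            entry.get("Gene-commentary_label"),
            entry.get("Gene-commentary_accession"),
            len(entry.get("Gene-commentary_products", [])),
        ))
    return rows
```

Calling summarize_locus(records[0]) on the TP53 record should give four tuples, one per list element, which may hint at what each entry annotates.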
So, the question is: why are there four elements in this list? What do these four elements refer to, biologically? And if I want to download all RefSeqs for a gene, should I make a set of all GIs listed in all four elements of the list, or should I assume that element 0 is the most complete (but where is this documented?), or should I rely on Entrezgene_comments, which seems more linear for this specific gene (but is it always so)?
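For reference, the "set of all GIs" option could be sketched as follows. It follows only the key path shown above (Gene-commentary_products, Gene-commentary_seqs, Seq-loc_whole, Seq-id, Seq-id_gi) and skips any sequence entry that lacks that path. The helper name collect_product_gis is hypothetical, and whether the union is the biologically right answer is exactly what I am asking.

```python
def collect_product_gis(record):
    """Union of the GIs of the top-level products across all entries
    of record['Entrezgene_locus'].

    Only the path seen in the TP53 record is followed; entries with a
    different layout (e.g. no 'Seq-loc_whole') are skipped.
    """
    gis = set()
    for entry in record.get("Entrezgene_locus", []):
        for product in entry.get("Gene-commentary_products", []):
            for seq in product.get("Gene-commentary_seqs", []):
                try:
                    gi = seq["Seq-loc_whole"]["Seq-id"]["Seq-id_gi"]
                except (KeyError, TypeError):
                    continue  # this sequence entry has a different shape
                gis.add(str(gi))
    return gis
```

Usage would be gis = collect_product_gis(records[0]). Note that this does not filter by product type, so it would also pick up any non-mRNA products listed at the same level.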
Thank you very much! Sorry for the long post!
Roberto
Thank you... I had found this, but it doesn't really explain why there are multiple, only partially overlapping entries in the 'Entrezgene_locus' list. Or does it, and if so, where?