Question: (Closed) How To Parse The Xml From Entrez Db=Protein Using Biopython
heath wrote:

I want to obtain a protein's GI number, along with its CD root number and taxid, starting from the protein's NP_xxx.1 accession (ID). I use Biopython's Entrez module as follows:

 from Bio import Entrez
 handle = Entrez.efetch(db='protein', id='NP_000368.1', retmode='xml')
 record = Entrez.read(handle)

The record is a ListElement object and looks very complicated, and I cannot locate the "tag" or "key" I need. How do I parse the output to obtain the information I am interested in?
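One way to find the right "key" in such a result is to walk the nested structure and print the path to every leaf value. The sketch below assumes only that the object returned by Entrez.read() behaves like nested Python lists and dicts (Biopython's ListElement and DictionaryElement subclass list and dict). The helper name iter_paths and the toy data are made up for illustration; the toy keys are stand-ins, not actual NCBI field names, and the GI value is a placeholder.

```python
def iter_paths(node, path="record"):
    """Yield (path, leaf_value) pairs for a nested list/dict structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_paths(value, f"{path}[{key!r}]")
    elif isinstance(node, (list, tuple)):
        for index, value in enumerate(node):
            yield from iter_paths(value, f"{path}[{index}]")
    else:
        # Strings, numbers, etc. are treated as leaves.
        yield path, node


# Toy stand-in for a parsed result; keys and GI value are hypothetical.
toy = [{"accession": "NP_000368.1", "seqids": ["ref|NP_000368.1|", "gi|0000000"]}]

for path, value in iter_paths(toy):
    print(path, "=", value)
```

With a real result you would call iter_paths(record) on the object returned by Entrez.read(handle) and search the printed paths for the values you need, for example by filtering on "gi" or "taxon".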

BTW, where can I find an overview of which DTD applies to the data retrieved from NCBI?

Tags: xml, protein, biopython, entrez
written 6.9 years ago by heath

What do you mean by the "tag" or the "key"?

written 6.9 years ago by Pierre Lindenbaum

If the rettype is changed to "gb":

handle= Entrez.efetch(db='protein', id='NP_000368.1', rettype='gb')
In [26]: ret=SeqIO.read(handle,'genbank')
In [27]: feats=set()
In [28]: for feat in ret.features:
   ....:     feats.add(feat.type)
   ....:    
In [29]: feats
Out[29]: set(['source', 'Protein', 'CDS', 'Region', 'Site'])

Keys. "Tag" might not be the accurate word, but I meant the identifying label/tag names. I need to extract the GI and CD information.

modified 6.9 years ago by Istvan Albert • written 6.9 years ago by heath

Why do you need to parse the XML manually, when above you already have Biopython reading the file?

modified 6.9 years ago • written 6.9 years ago by Istvan Albert

Eh... I am not sure whether Biopython is able to parse the XML from db='protein'. As far as I know, Biopython can parse most of the XML from Entrez, but not all of it. I can somehow put the above information into SeqIO.read() and convert it into a Bio.SeqRecord, but then I only get the following information:

 In [23]: from Bio.Seq import Seq
 In [24]: from Bio.SeqRecord import SeqRecord
 In [25]: from Bio.Alphabet import IUPAC
 In [26]: record = SeqRecord(rec2)
 In [27]: print record
 ID: <unknown id>
 Name: <unknown name>
 Description: <unknown description>
 Number of features: 0
 SeqRecord(seq=Seq('MSGGPMGGRPGGRGAPAVQQNIPSTLLQDHENQRLFEMLGRKCLTLATAVVQLY...WDD', IUPACProtein()), id='NP000368.1', name='NP000368', description='wiskott-Aldrich syndrome protein [Homo sapiens].', dbxrefs=[])

But I need the protein GI along with the protein_cdd information.

modified 6.9 years ago • written 6.9 years ago by heath

I am confused. If you want a SeqRecord from a Biopython parser, just ask Entrez for a GenBank format file (as you have shown). Currently the SeqIO parsing framework doesn't handle the equivalent XML file. It could be done, but it appears to add relatively little benefit in terms of new functionality.
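As a rough sketch of what the GenBank route can give you: taxon and CDD identifiers typically appear as db_xref qualifiers on the features of the parsed record, so they can be collected by looping over record.features. The function name collect_db_xrefs is hypothetical, and the stand-in record below (built with SimpleNamespace instead of a real SeqRecord, with a placeholder CDD ID) exists only so the example runs without network access.

```python
from types import SimpleNamespace


def collect_db_xrefs(record):
    """Return {database: [ids]} gathered from the db_xref qualifiers of all features."""
    xrefs = {}
    for feature in record.features:
        # Qualifier values are lists of strings such as "taxon:9606".
        for xref in feature.qualifiers.get("db_xref", []):
            database, _, identifier = xref.partition(":")
            ids = xrefs.setdefault(database, [])
            if identifier not in ids:
                ids.append(identifier)
    return xrefs


# Stand-in for a parsed GenBank record; the CDD ID is a placeholder, not real data.
fake_record = SimpleNamespace(features=[
    SimpleNamespace(type="source", qualifiers={"db_xref": ["taxon:9606"]}),
    SimpleNamespace(type="Region", qualifiers={"db_xref": ["CDD:000000"]}),
    SimpleNamespace(type="Region", qualifiers={"db_xref": ["CDD:000000"]}),
    SimpleNamespace(type="Protein", qualifiers={"product": ["example protein"]}),
])

xrefs = collect_db_xrefs(fake_record)
print(xrefs)
```

With a real record you would pass the object returned by SeqIO.read(handle, 'genbank'). Whether the GI is available depends on what NCBI returns: if the VERSION line of the flat file still carries a GI, the parser may expose it as record.annotations.get('gi'), but treat that as an assumption to verify against your own output.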

written 6.9 years ago by Peter

Thanks a lot for replying, and sorry for the confusion. It does not really matter whether the retmode/rettype is 'xml' or 'gb'. I think for now I should simplify the question to: "How do I use the RefSeq ID of a protein to obtain its GI and CDD information?"

modified 6.9 years ago • written 6.9 years ago by heath

I think that makes more sense. Please post it as a new question.

written 6.9 years ago by Istvan Albert
The thread is closed. No new answers may be added.
