Parsing SRA summary using Biopython Entrez
8 months ago
kmyers2 ▴ 40

I am trying to parse the descriptions of SRA records and compile them into a table that I can export to a TXT file. I'm using Biopython's Entrez module for this. Here is my code:

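from Bio import Entrez

# Collect the SRA UIDs that match the search term
sraList = []
handle = Entrez.esearch(db="sra", term=searchTerm, retmax='100000')
result = Entrez.read(handle)
for each in result['IdList']:
    sraList.append(each)
# Fetch the summary for each UID
for each in sraList:
    test = Entrez.esummary(db="sra", id=each)
    record = Entrez.read(test)
    for entry in record:
        with open(outFile, 'a') as f: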

The issue arises when I run the record = line. The record is a dictionary, but the entry with the experimental metadata I need is in XML format:

for each in record:
    print(each.keys())
dict_keys(['Item', 'Id', 'ExpXml', 'Runs', 'ExtLinks', 'CreateDate', 'UpdateDate'])

for each in record:
    print(each['ExpXml'])

<Summary><Title>Zymomonas mobilis subsp. mobilis ZM4 ATCC 31821 - transcriptome</Title><Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform><Statistics total_runs="1" total_spots="26824861" total_bases="5364972200" total_size="2357069488" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA623091" center_name="JGI" contact_name="JGI SRA" lab_name=""/><Experiment acc="SRX3316534" ver="1" status="public" name="Zymomonas mobilis subsp. mobilis ZM4 ATCC 31821 - transcriptome"/><Study acc="SRP121239" name="Zymomonas mobilis mobilis ZM4 transcriptome - GS-26"/><Organism taxid="264203" ScientificName="Zymomonas mobilis subsp. mobilis ZM4 = ATCC 31821"/><Sample acc="SRS2622106" name=""/><Instrument ILLUMINA="Illumina HiSeq 2500"/><Library_descriptor><LIBRARY_NAME>ANSWP</LIBRARY_NAME><LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY><LIBRARY_SOURCE>TRANSCRIPTOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>other</LIBRARY_SELECTION><LIBRARY_LAYOUT> <PAIRED/> </LIBRARY_LAYOUT><LIBRARY_CONSTRUCTION_PROTOCOL>Low Input (RNA)</LIBRARY_CONSTRUCTION_PROTOCOL></Library_descriptor><Bioproject>PRJNA409960</Bioproject><Biosample>SAMN07686944</Biosample>

I have tried to parse this with xmltodict, but I get an error:

for entry in record:
    summary = entry["ExpXml"]
    parsed = xmltodict.parse(summary, xml_attribs=False)

ExpatError: junk after document element: line 1, column 304

I don't have much experience with XML files, but from what I can tell this suggests there is a problem with the XML formatting from NCBI. If that's the case, I don't have the experience to know how to fix it.

Does anyone have any suggestions on how to solve this problem?

python biopython entrez sra xml • 500 views
8 months ago

I have seen many problems with XML from NCBI; it often doesn't work well with tools that require well-formed XML.
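In this particular case the ExpXml string looks like a fragment rather than a complete document: it has several top-level elements (Summary, Submitter, Experiment, ...) and no single root, which is what "junk after document element" usually means. The error position is around where the first top-level element closes and the next one begins. If you want to stay in Python, one possible workaround is to wrap the fragment in a dummy root element before parsing. The sketch below is only an illustration under that assumption: the function name parse_expxml, the wrapper tag "root", and the choice of fields are mine, not anything from Biopython or NCBI, and I have not checked it against every SRA record.

```python
import xml.etree.ElementTree as ET


def parse_expxml(fragment):
    """Parse an ExpXml fragment by wrapping it in a dummy root element."""
    # "root" is an arbitrary wrapper tag chosen for this sketch.
    root = ET.fromstring("<root>" + fragment + "</root>")

    def attr(path, name):
        # Return the attribute value, or "" if the element is missing.
        node = root.find(path)
        return node.get(name, "") if node is not None else ""

    return {
        "experiment": attr("Experiment", "acc"),
        "study": attr("Study", "acc"),
        "title": root.findtext("Summary/Title", default=""),
        "instrument": attr("Summary/Platform", "instrument_model"),
        "organism": attr("Organism", "ScientificName"),
        "strategy": root.findtext("Library_descriptor/LIBRARY_STRATEGY", default=""),
        "bioproject": root.findtext("Bioproject", default=""),
        "biosample": root.findtext("Biosample", default=""),
    }


# Shortened copy of the ExpXml string from the question.
example = (
    '<Summary><Title>Zymomonas mobilis subsp. mobilis ZM4 ATCC 31821 - transcriptome</Title>'
    '<Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform></Summary>'
    '<Experiment acc="SRX3316534" ver="1" status="public"/>'
    '<Study acc="SRP121239" name="Zymomonas mobilis mobilis ZM4 transcriptome - GS-26"/>'
    '<Organism taxid="264203" ScientificName="Zymomonas mobilis subsp. mobilis ZM4 = ATCC 31821"/>'
    '<Library_descriptor><LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY></Library_descriptor>'
    '<Bioproject>PRJNA409960</Bioproject><Biosample>SAMN07686944</Biosample>'
)

fields = parse_expxml(example)
# One tab-separated row per record, ready to write to the TXT file.
print("\t".join(fields.values()))
```

In the loop from the question this would be called as parse_expxml(entry["ExpXml"]). Passing the same wrapped string to xmltodict.parse should also get past the ExpatError, although I have only sketched the standard-library version here.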

To parse this kind of output, I would recommend using the command-line versions of the tools (Entrez Direct), chained together like so:

esearch -db genome -query "22954[uid]" | \
elink -target bioproject | \
efetch -format xml | \
xtract -pattern DocumentSummary -element Salinity OxygenReq OptimumTemperature TemperatureRange Habitat

This will print:

eMesophilic     eAerobic        85      eHyperthermophilic      eAquatic

Example taken from:

The Entrez Direct manual has many more examples:

