I am trying to parse the descriptions of SRA files in order to compile them into a table and export it to a TXT file. I'm using Biopython's Entrez module for this. Here is my code:
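from Bio import Entrez

# Collect the IDs of all SRA records matching the search term
sraList = []
handle = Entrez.esearch(db="sra", term=searchTerm, retmax='100000')
result = Entrez.read(handle)
for each in result['IdList']:
    sraList.append(each)

# Fetch the summary of each record and append it to the output file
for each in sraList:
    test = Entrez.esummary(db="sra", id=each)
    record = Entrez.read(test)
    for entry in record:
        with open(outFile, 'a') as f:
            f.write()  # placeholder: the parsed metadata should be written here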
The issue arises after I run record = Entrez.read(test). Each entry in the record is a dictionary, but the value holding the experimental metadata I need is a string of XML:
for each in record:
    print(each.keys())
dict_keys(['Item', 'Id', 'ExpXml', 'Runs', 'ExtLinks', 'CreateDate', 'UpdateDate'])
for each in record:
    print(each["ExpXml"])
<Summary><Title>Zymomonas mobilis subsp. mobilis ZM4 ATCC 31821 - transcriptome</Title><Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform><Statistics total_runs="1" total_spots="26824861" total_bases="5364972200" total_size="2357069488" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA623091" center_name="JGI" contact_name="JGI SRA" lab_name=""/><Experiment acc="SRX3316534" ver="1" status="public" name="Zymomonas mobilis subsp. mobilis ZM4 ATCC 31821 - transcriptome"/><Study acc="SRP121239" name="Zymomonas mobilis mobilis ZM4 transcriptome - GS-26"/><Organism taxid="264203" ScientificName="Zymomonas mobilis subsp. mobilis ZM4 = ATCC 31821"/><Sample acc="SRS2622106" name=""/><Instrument ILLUMINA="Illumina HiSeq 2500"/><Library_descriptor><LIBRARY_NAME>ANSWP</LIBRARY_NAME><LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY><LIBRARY_SOURCE>TRANSCRIPTOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>other</LIBRARY_SELECTION><LIBRARY_LAYOUT> <PAIRED/> </LIBRARY_LAYOUT><LIBRARY_CONSTRUCTION_PROTOCOL>Low Input (RNA)</LIBRARY_CONSTRUCTION_PROTOCOL></Library_descriptor><Bioproject>PRJNA409960</Bioproject><Biosample>SAMN07686944</Biosample>
I have tried to parse this with xmltodict, but I get an error:
for entry in record:
    summary = entry["ExpXml"]
    parsed = xmltodict.parse(summary, xml_attribs=False)
    print(parsed)
ExpatError: junk after document element: line 1, column 304
I don't have much experience with XML files, but from what I can tell this suggests there is a problem with the XML formatting from NCBI. If that's the case, I don't have the experience to know how to fix it.
Does anyone have any suggestions on how to solve this problem?
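One possible reading of the error, offered as a sketch rather than a confirmed diagnosis: the ExpXml string has several top-level elements (Summary, Submitter, Experiment, and so on), while an XML document may have only one root element. Column 304 appears to fall roughly where </Summary> closes, which would fit that reading. The example below uses a shortened copy of the string above and the standard-library xml.etree.ElementTree instead of xmltodict. The names exp_xml, parse_fragment and the wrapper tag root are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the ExpXml string shown above (hypothetical variable name).
exp_xml = (
    '<Summary><Title>Zymomonas mobilis subsp. mobilis ZM4 ATCC 31821 - transcriptome</Title>'
    '<Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform></Summary>'
    '<Experiment acc="SRX3316534" ver="1" status="public"/>'
    '<Bioproject>PRJNA409960</Bioproject>'
    '<Biosample>SAMN07686944</Biosample>'
)


def parse_fragment(fragment):
    """Wrap a multi-root XML fragment in a dummy root element and parse it."""
    return ET.fromstring("<root>" + fragment + "</root>")


# Parsing the fragment directly fails, because it has more than one top-level element.
try:
    ET.fromstring(exp_xml)
    direct_parse_ok = True
except ET.ParseError:
    direct_parse_ok = False

# With the dummy root in place, the individual fields can be looked up by path.
root = parse_fragment(exp_xml)
title = root.findtext("Summary/Title")
experiment_acc = root.find("Experiment").get("acc")
bioproject = root.findtext("Bioproject")
```

If that reading is right, the same wrapping might also work with xmltodict, for example xmltodict.parse("<root>" + summary + "</root>"), though I have not verified this against the live NCBI output.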