I am trying to parse the descriptions of SRA files in order to compile them into a table and export it to a TXT file. I'm using Biopython's Entrez module for this. Here is my code:
    from Bio import Entrez

    # Collect the IDs of all SRA records matching the search term
    sraList = []
    handle = Entrez.esearch(db="sra", term=searchTerm, retmax='100000')
    result = Entrez.read(handle)
    for each in result['IdList']:
        sraList.append(each)

    # Fetch the summary for each ID and append it to the output file
    for each in sraList:
        test = Entrez.esummary(db="sra", id=each)
        record = Entrez.read(test)
        for entry in record:
            with open(outFile, 'a') as f:
                f.write()
The issue arises when I run the line record = Entrez.read(test). Each entry in the record is a dictionary, but the entry with the experimental metadata I need is in an XML format:
    for each in record:
        print(each.keys())

    dict_keys(['Item', 'Id', 'ExpXml', 'Runs', 'ExtLinks', 'CreateDate', 'UpdateDate'])

    for each in record:
        print(each["ExpXml"])

    <Summary><Title>Zymomonas mobilis subsp. mobilis ZM4 ATCC 31821 - transcriptome</Title><Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform><Statistics total_runs="1" total_spots="26824861" total_bases="5364972200" total_size="2357069488" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA623091" center_name="JGI" contact_name="JGI SRA" lab_name=""/><Experiment acc="SRX3316534" ver="1" status="public" name="Zymomonas mobilis subsp. mobilis ZM4 ATCC 31821 - transcriptome"/><Study acc="SRP121239" name="Zymomonas mobilis mobilis ZM4 transcriptome - GS-26"/><Organism taxid="264203" ScientificName="Zymomonas mobilis subsp. mobilis ZM4 = ATCC 31821"/><Sample acc="SRS2622106" name=""/><Instrument ILLUMINA="Illumina HiSeq 2500"/><Library_descriptor><LIBRARY_NAME>ANSWP</LIBRARY_NAME><LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY><LIBRARY_SOURCE>TRANSCRIPTOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>other</LIBRARY_SELECTION><LIBRARY_LAYOUT> <PAIRED/> </LIBRARY_LAYOUT><LIBRARY_CONSTRUCTION_PROTOCOL>Low Input (RNA)</LIBRARY_CONSTRUCTION_PROTOCOL></Library_descriptor><Bioproject>PRJNA409960</Bioproject><Biosample>SAMN07686944</Biosample>
I have tried to parse this with xmltodict, but I get an error:
    for entry in record:
        summary = entry["ExpXml"]
        parsed = xmltodict.parse(summary, xml_attribs=False)
        print(parsed)

    ExpatError: junk after document element: line 1, column 304
I don't have much experience with XML files, but from what I can tell this suggests there is a problem with the XML formatting from NCBI. If that's the case, I don't have the experience to know how to fix it.
Does anyone have any suggestions on how to solve this problem?
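One observation that may help narrow it down: the ExpXml string printed above has several top-level elements (Summary, Submitter, Experiment, and so on) and no single enclosing root, which is the usual cause of a "junk after document element" error. The sketch below assumes that is what is happening here and wraps the string in a made-up root element before parsing. It uses the standard library's xml.etree.ElementTree rather than xmltodict so that it runs without extra packages. The exp_xml value is a shortened copy of the output above, and the names parse_exp_xml, root and row are only for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for entry["ExpXml"]: several top-level elements, no single root.
exp_xml = (
    '<Summary><Title>Zymomonas mobilis subsp. mobilis ZM4 ATCC 31821 - transcriptome</Title>'
    '<Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform></Summary>'
    '<Experiment acc="SRX3316534" ver="1" status="public"/>'
    '<Organism taxid="264203" ScientificName="Zymomonas mobilis subsp. mobilis ZM4 = ATCC 31821"/>'
    '<Bioproject>PRJNA409960</Bioproject><Biosample>SAMN07686944</Biosample>'
)


def parse_exp_xml(fragment):
    """Wrap the fragment in a hypothetical <root> element so it parses as one document."""
    return ET.fromstring("<root>" + fragment + "</root>")


root = parse_exp_xml(exp_xml)

# Pull a few fields into a flat dict, one dict per table row.
row = {
    "title": root.findtext("Summary/Title"),
    "experiment": root.find("Experiment").get("acc"),
    "taxid": root.find("Organism").get("taxid"),
    "bioproject": root.findtext("Bioproject"),
    "biosample": root.findtext("Biosample"),
}

# Tab-separated line, suitable for appending to a TXT file.
print("\t".join(row.values()))
```

If this assumption holds, the same wrapping should also let xmltodict.parse accept the string, with the parsed fields then sitting under the added root key.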