Question

XML from entrez not formatted correctly

5.7 years ago

noodle ▴ 650

Hi all, I'm hoping someone can advise on the XML text output from rentrez produced by the workflow below. It seems that the output does not open/close all tags properly, and I'm not sure how to clean this up. For example, in the output below the opening "Platform" tag is not directly followed by &gt;, so replacing &gt; with gsub does not give me something workable/readable.

#retrieve data from SRR
r_search <- entrez_search(db="sra", term="SRR10025068")
r_search.id <- r_search$ids
all_the_links <- entrez_link(dbfrom='sra', id=r_search.id, db='all')
r_summ <- entrez_summary(db="sra", id=all_the_links$links$sra_bioproject_all)
xml.data.dirty <- r_summ$expxml
xml.data.dirty
[1] "  &lt;Summary&gt;&lt;Title&gt;Mouse 57&lt;/Title&gt;&lt;Platform instrument_model=\"454 GS FLX Titanium\"&gt;LS454&lt;/Platform&gt;&lt;Statistics total_runs=\"1\" total_spots=\"6058\" total_bases=\"2449911\" total_size=\"1638287\" load_done=\"true\" cluster_name=\"public\"/&gt;&lt;/Summary&gt;&lt;Submitter acc=\"SRA115778\" center_name=\"Texas A&amp;amp;M University\" contact_name=\"Sean McCaffrey\" lab_name=\"Gastrointestinal Laboratory\"/&gt;&lt;Experiment acc=\"SRX390677\" ver=\"1\" status=\"public\" name=\"Mouse 57\"/&gt;&lt;Study acc=\"SRP033709\" name=\"Mice gut bacteria Targeted Locus (Loci)\"/&gt;&lt;Organism taxid=\"10090\" ScientificName=\"Mus musculus\"/&gt;&lt;Sample acc=\"SRS514105\" name=\"\"/&gt;&lt;Instrument LS454=\"454 GS FLX Titanium\"/&gt;&lt;Library_descriptor&gt;&lt;LIBRARY_NAME/&gt;&lt;LIBRARY_STRATEGY&gt;AMPLICON&lt;/LIBRARY_STRATEGY&gt;&lt;LIBRARY_SOURCE&gt;GENOMIC&lt;/LIBRARY_SOURCE&gt;&lt;LIBRARY_SELECTION&gt;unspecified&lt;/LIBRARY_SELECTION&gt;&lt;LIBRARY_LAYOUT&gt;                 &lt;SINGLE/&gt;               &lt;/LIBRARY_LAYOUT&gt;&lt;/Library_descriptor&gt;&lt;Bioproject&gt;PRJNA231086&lt;/Bioproject&gt;&lt;Biosample&gt;SAMN02440270&lt;/Biosample&gt;  "

#get usable XML file
xml.data.5knwn <- gsub("&gt;", ">", xml.data.dirty)
xml.data.5knwn <- gsub("&lt;", "<", xml.data.5knwn)
xml.data.5knwn <- gsub("&amp;", "&", xml.data.5knwn)
xml.data.5knwn <- gsub("&apos;", "'", xml.data.5knwn)
xml.data.5knwn <- gsub("&quot;", '"', xml.data.5knwn)
xml.data.5knwn.clean <- gsub(" ", "", xml.data.5knwn)
xml.data.5knwn.clean
[1] "<Summary><Title>Mouse57</Title><Platforminstrument_model=\"454GSFLXTitanium\">LS454</Platform><Statisticstotal_runs=\"1\"total_spots=\"6058\"total_bases=\"2449911\"total_size=\"1638287\"load_done=\"true\"cluster_name=\"public\"/></Summary><Submitteracc=\"SRA115778\"center_name=\"TexasA&amp;MUniversity\"contact_name=\"SeanMcCaffrey\"lab_name=\"GastrointestinalLaboratory\"/><Experimentacc=\"SRX390677\"ver=\"1\"status=\"public\"name=\"Mouse57\"/><Studyacc=\"SRP033709\"name=\"MicegutbacteriaTargetedLocus(Loci)\"/><Organismtaxid=\"10090\"ScientificName=\"Musmusculus\"/><Sampleacc=\"SRS514105\"name=\"\"/><InstrumentLS454=\"454GSFLXTitanium\"/><Library_descriptor><LIBRARY_NAME/><LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY><LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>unspecified</LIBRARY_SELECTION><LIBRARY_LAYOUT><SINGLE/></LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA231086</Bioproject><Biosample>SAMN02440270</Biosample>"

Edit: typo

rentrez rXML entrez R XML • 2.4k views

ADD COMMENT • link 5.7 years ago by noodle ▴ 650

I am not sure what exactly you need to parse from this dataset but this looks clean enough.

$ efetch -db sra -id "SRR10025068" -format runinfo
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
SRR10025068,2019-08-27 18:44:15,2019-08-27 16:31:34,12479737,3768880574,12479737,302,1154,,https://sra-download.ncbi.nlm.nih.gov/traces/sra25/SRR/009790/SRR10025068,SRX6762113,1168003_D6_S,WGS,RANDOM,METAGENOMIC,PAIRED,0,0,ILLUMINA,Illumina NovaSeq 6000,SRP219390,PRJNA561398,,561398,SRS5310604,SAMN12617402,simple,408170,human gut metagenome,30185,,,,,,,no,,,,,YALE SCHOOL OF PUBLIC HEALTH,SRA948009,,public,E799243BAFB62132C20AC9F550F70206,052064FDF091B79E9DA48242EF5F98A2

ADD REPLY • link 5.7 years ago by GenoMax 152k

Thanks, I tried the Entrez e-utils as well, but the data returned by the two functions overlap without being identical, and unfortunately the fields I need are the ones that differ.

ADD REPLY • link 5.7 years ago by noodle ▴ 650

Wow, I just realized that the issue isn't only the XML: the data returned is also incorrect, which is a much bigger issue.

ADD REPLY • link 5.7 years ago by noodle ▴ 650

"the data returned is also incorrect"

Could you please explain what data are incorrect?

ADD REPLY • link 5.7 years ago by vkkodali_ncbi ★ 3.8k

I think this bit in the original post does not match information I obtained when using runinfo

<Summary><Title>Mouse 57</Title><Platform instrument_model=\"454 GS FLX Titanium\">LS454</Platform><Statistics total_runs=\"1\" total_spots=\"6058\" total_bases=\"2449911\" total_size=\"1638287\" load_done=\"true\" cluster_name=\"public\"/></Summary><Submitter acc=\"SRA115778\" center_name=\"Texas A&M University\" contact_name=\"Sean McCaffrey\" lab_name=\"Gastrointestinal Laboratory\"/><Experiment acc=\"SRX390677\" ver=\"1\" status=\"public\" name=\"Mouse 57\"/><Study acc=\"SRP033709\" name=\"Mice gut bacteria Targeted Locus (Loci)\"/><Organism taxid=\"10090\" ScientificName=\"Mus musculus\"/><Sample acc=\"SRS514105\" name=\"\"/><Instrument LS454=\"454 GS FLX Titanium\"/><Library_descriptor><LIBRARY_NAME/><LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY><LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>unspecified</LIBRARY_SELECTION><LIBRARY_LAYOUT> <SINGLE/>
</LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA231086</Bioproject><Biosample>SAMN02440270</Biosample> "

ADD REPLY • link 5.7 years ago by GenoMax 152k

I think this bit in the original post does not match information I obtained when using runinfo

Is that not expected? Your command is fetching the runinfo table whereas @joe was downloading the Bioproject docsum. The corresponding edirect command for what @joe was doing:

esearch -db sra -query 'SRR10025068' \
  | elink -db sra -target bioproject -name sra_bioproject_all \
  | esummary

The XML returned by the command shown above is still not well formatted, but it can be cleaned up by piping the output to xtract -format.

If I understand this correctly, the issue @joe has is with the encoding of HTML characters in the r_summ$expxml object, not with the data itself.
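To illustrate the encoding point, here is a minimal sketch in R (not a rentrez feature). It assumes the expxml string is a run of sibling elements escaped one level deep, as in the question. The sample string is a shortened copy, `decode_entities` is a hypothetical helper, and the `<root>` wrapper is an arbitrary name added only because a parseable document needs a single root element. Two details differ from the gsub() calls in the question: `&amp;` is decoded last so that nothing gets unescaped twice, and only the outer whitespace is trimmed, because removing every space also removes the separator between a tag name and its attributes (which is how `<Platforminstrument_model=...>` came about).

```r
# Shortened copy of the escaped string from the question (illustrative only).
xml_dirty <- paste0(
  "  &lt;Summary&gt;&lt;Title&gt;Mouse 57&lt;/Title&gt;",
  "&lt;Platform instrument_model=\"454 GS FLX Titanium\"&gt;LS454&lt;/Platform&gt;",
  "&lt;/Summary&gt;",
  "&lt;Submitter acc=\"SRA115778\" center_name=\"Texas A&amp;amp;M University\"/&gt;  "
)

# Hypothetical helper: undo one level of escaping. "&amp;" goes last so that
# a sequence such as "&amp;lt;" is not decoded twice.
decode_entities <- function(x) {
  x <- gsub("&lt;", "<", x, fixed = TRUE)
  x <- gsub("&gt;", ">", x, fixed = TRUE)
  x <- gsub("&quot;", "\"", x, fixed = TRUE)
  x <- gsub("&apos;", "'", x, fixed = TRUE)
  gsub("&amp;", "&", x, fixed = TRUE)
}

# Trim only the outer whitespace; spaces inside tags separate attributes.
xml_clean <- trimws(decode_entities(xml_dirty))

# The fragment has several top-level elements, so wrap it in a single root.
xml_doc_text <- paste0("<root>", xml_clean, "</root>")

# The XML package is a dependency of rentrez; skip parsing if it is missing.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::xmlParse(xml_doc_text, asText = TRUE)
  platform <- XML::xpathSApply(doc, "//Platform", XML::xmlValue)
  model <- XML::xpathSApply(doc, "//Platform", XML::xmlGetAttr, "instrument_model")
  center <- XML::xpathSApply(doc, "//Submitter", XML::xmlGetAttr, "center_name")
  print(c(platform = platform, model = model, center = center))
}
```

Once the string parses, any other element or attribute in the docsum can be pulled out the same way by changing the XPath expression.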

ADD REPLY • link 5.7 years ago by vkkodali_ncbi ★ 3.8k

Ah I see. I only looked at the accession OP was using and looked up the runinfo. That is a NovaSeq 6000 run.

If we look at the BioProject that SRR10025068 belongs to (as far as I can see from this SRA page), where is the reference to 454 in OP's output coming from?

ADD REPLY • link 5.7 years ago by GenoMax 152k

Good eyes! It was my (and the OP's) mistake. You see, we both used -target sra as the target db in the elink step, so the data being fetched was for the identifier 561398 from SRA instead of BioProject. I have now fixed my command to use -target bioproject to get the correct data out.

ADD REPLY • link 5.7 years ago by vkkodali_ncbi ★ 3.8k

My (original) issue was that the xml output was not correctly formatted, and I later realized the data returned was not correct.

ADD REPLY • link 5.7 years ago by noodle ▴ 650

5.7 years ago

noodle ▴ 650

Thanks everyone for the responses. In the end I did the below, not exactly what I wanted but I got by...

this.runID <- "SRR10025068"
#
entrez.cmd <- paste0("esearch -db 'sra' -query '", this.runID,"' | esummary  -db 'all' -format runinfo")
entrez.cmd
[1] "esearch -db 'sra' -query 'SRR10025068' | esummary  -db 'all' -format runinfo"
#
entrez.intern <- system(entrez.cmd, intern=TRUE)
entrez.intern
[1] "Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash"
[2] "SRR10025068,2019-08-27 18:44:15,2019-08-27 16:31:34,12479737,3768880574,12479737,302,1154,,https://sra-download.ncbi.nlm.nih.gov/traces/sra25/SRR/009790/SRR10025068,SRX6762113,1168003_D6_S,WGS,RANDOM,METAGENOMIC,PAIRED,0,0,ILLUMINA,Illumina NovaSeq 6000,SRP219390,PRJNA561398,,561398,SRS5310604,SAMN12617402,simple,408170,human gut metagenome,30185,,,,,,,no,,,,,YALE SCHOOL OF PUBLIC HEALTH,SRA948009,,public,E799243BAFB62132C20AC9F550F70206,052064FDF091B79E9DA48242EF5F98A2"                                             
[3] ""
#
entrez.colnames <- unlist(strsplit(entrez.intern[1], ","))
entrez.data <- unlist(strsplit(entrez.intern[2], ","))
this.entrez.data <- t(data.frame(entrez.data))
colnames(this.entrez.data) <- as.character(entrez.colnames)
rownames(this.entrez.data) <- this.runID
this.entrez.data
            Run           ReleaseDate           LoadDate              spots      bases        spots_with_mates avgLength
SRR10025068 "SRR10025068" "2019-08-27 18:44:15" "2019-08-27 16:31:34" "12479737" "3768880574" "12479737"       "302"    
            size_MB AssemblyName download_path                                                               Experiment  
SRR10025068 "1154"  ""           "https://sra-download.ncbi.nlm.nih.gov/traces/sra25/SRR/009790/SRR10025068" "SRX6762113"
            LibraryName    LibraryStrategy LibrarySelection LibrarySource LibraryLayout InsertSize InsertDev Platform  
SRR10025068 "1168003_D6_S" "WGS"           "RANDOM"         "METAGENOMIC" "PAIRED"      "0"        "0"       "ILLUMINA"
            Model                   SRAStudy    BioProject    Study_Pubmed_id ProjectID Sample       BioSample     
SRR10025068 "Illumina NovaSeq 6000" "SRP219390" "PRJNA561398" ""              "561398"  "SRS5310604" "SAMN12617402"
            SampleType TaxID    ScientificName         SampleName g1k_pop_code source g1k_analysis_group Subject_ID Sex
SRR10025068 "simple"   "408170" "human gut metagenome" "30185"    ""           ""     ""                 ""         "" 
            Disease Tumor Affection_Status Analyte_Type Histological_Type Body_Site CenterName                    
SRR10025068 ""      "no"  ""               ""           ""                ""        "YALE SCHOOL OF PUBLIC HEALTH"
            Submission  dbgap_study_accession Consent  RunHash                           
SRR10025068 "SRA948009" ""                    "public" "E799243BAFB62132C20AC9F550F70206"
            ReadHash                          
SRR10025068 "052064FDF091B79E9DA48242EF5F98A2"
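
For anyone reusing this, the strsplit() steps can probably be collapsed into a single read.csv() call. A minimal sketch, assuming the vector returned by system(..., intern=TRUE) holds a header line, one line per run and a trailing empty string, as shown above. The columns here are a shortened, illustrative subset, and `runinfo_lines` stands in for `entrez.intern`. One practical difference: strsplit() drops a trailing empty field, so a row whose last column is empty would come out one element shorter than the header, whereas read.csv() keeps it as NA.

```r
# Lines in the shape returned by system(..., intern = TRUE); the columns are a
# shortened, illustrative subset of the real runinfo header.
runinfo_lines <- c(
  "Run,spots,bases,LibraryLayout,Platform,Model,BioProject,Study_Pubmed_id",
  "SRR10025068,12479737,3768880574,PAIRED,ILLUMINA,Illumina NovaSeq 6000,PRJNA561398,",
  ""
)

# read.csv() accepts the lines directly via `text`, skips the blank line and
# converts numeric columns, giving one data frame row per run.
runinfo <- read.csv(text = runinfo_lines, stringsAsFactors = FALSE)
rownames(runinfo) <- runinfo$Run

# Look up a field by run accession and column name.
print(runinfo["SRR10025068", "Model"])
print(runinfo$spots)
```

The result is a data frame rather than a character matrix, so numeric columns such as spots and bases can be used in calculations without converting them first.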

ADD COMMENT • link 5.7 years ago by noodle ▴ 650

score 2 · Accepted Answer · 2019-11-06

This appears to be unnecessarily complicated to me. For a given list of SRA accessions, you should be able to just download the comma-separated runinfo table from the command line (without going through R) and then parse the output file as a CSV from within R. Do you need to do everything from within R? If you do need a parsable XML from within R, you can do the following:

> r1 <- entrez_fetch(db='sra', id='SRR10025068', rettype='runinfo', retmode='xml', parsed=TRUE)
> r1
[1] "\n<SraRunInfo>\n<Row>\n<Run>SRR10025068</Run>\n<ReleaseDate>2019-08-27 18:44:15</ReleaseDate>\n<LoadDate>2019-08-27 16:31:34</LoadDate>\n<spots>12479737</spots>\n<bases>3768880574</bases>\n<spots_with_mates>12479737</spots_with_mates>\n<avgLength>302</avgLength>\n<size_MB>1154</size_MB>\n<download_path>https://sra-download.ncbi.nlm.nih.gov/traces/sra25/SRR/009790/SRR10025068</download_path>\n<Experiment>SRX6762113</Experiment>\n<LibraryName>1168003_D6_S</LibraryName>\n<LibraryStrategy>WGS</LibraryStrategy>\n<LibrarySelection>RANDOM</LibrarySelection>\n<LibrarySource>METAGENOMIC</LibrarySource>\n<LibraryLayout>PAIRED</LibraryLayout>\n<InsertSize>0</InsertSize>\n<InsertDev>0</InsertDev>\n<Platform>ILLUMINA</Platform>\n<Model>Illumina NovaSeq 6000</Model>\n<SRAStudy>SRP219390</SRAStudy>\n<BioProject>PRJNA561398</BioProject>\n<ProjectID>561398</ProjectID>\n<Sample>SRS5310604</Sample>\n<BioSample>SAMN12617402</BioSample>\n<SampleType>simple</SampleType>\n<TaxID>408170</TaxID>\n<ScientificName>human gut metagenome</ScientificName>\n<SampleName>30185</SampleName>\n<Tumor>no</Tumor>\n<CenterName>YALE SCHOOL OF PUBLIC HEALTH</CenterName>\n<Submission>SRA948009</Submission>\n<Consent>public</Consent>\n<RunHash>E799243BAFB62132C20AC9F550F70206</RunHash>\n<ReadHash>052064FDF091B79E9DA48242EF5F98A2</ReadHash>\n</Row>\n\n</SraRunInfo>\n"
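
Individual fields can then be pulled out by element name. A minimal sketch, assuming the result is (or can be converted to) a string shaped like the output above. `runinfo_xml` is a shortened, illustrative copy, and if r1 is already a parsed document the xmlParse() step can be skipped. It uses the XML package, which rentrez itself depends on.

```r
# Shortened copy of the runinfo XML shown above (illustrative only).
runinfo_xml <- paste0(
  "\n<SraRunInfo>\n<Row>\n<Run>SRR10025068</Run>\n",
  "<LibraryLayout>PAIRED</LibraryLayout>\n<Platform>ILLUMINA</Platform>\n",
  "<Model>Illumina NovaSeq 6000</Model>\n<BioProject>PRJNA561398</BioProject>\n",
  "</Row>\n\n</SraRunInfo>\n"
)

# Skip parsing if the XML package is not installed.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::xmlParse(trimws(runinfo_xml), asText = TRUE)

  # Text content of every field of the first (here: only) run, named by
  # the element it came from.
  fields <- XML::xpathSApply(doc, "//Row[1]/*", XML::xmlValue)
  names(fields) <- XML::xpathSApply(doc, "//Row[1]/*", XML::xmlName)

  print(fields[c("Run", "Platform", "Model", "BioProject")])
}
```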