Question: XML from entrez not formatted correctly
joe120 wrote:

Hi all, I'm hoping someone can advise on the XML text output that rentrez returns in the workflow below. The output does not seem to open/close all tags properly, and I'm not sure how to clean it up. For example, in the output below the opening tag "Platform" isn't terminated with > and therefore I can't gsub > to make it workable/readable.

#retrieve data from SRR
r_search <- entrez_search(db="sra", term="SRR10025068")
r_search.id <- r_search$ids
all_the_links <- entrez_link(dbfrom='sra', id=r_search.id, db='all')
r_summ <- entrez_summary(db="sra", id=all_the_links$links$sra_bioproject_all)
xml.data.dirty <- r_summ$expxml
xml.data.dirty
[1] "  &lt;Summary&gt;&lt;Title&gt;Mouse 57&lt;/Title&gt;&lt;Platform instrument_model=\"454 GS FLX Titanium\"&gt;LS454&lt;/Platform&gt;&lt;Statistics total_runs=\"1\" total_spots=\"6058\" total_bases=\"2449911\" total_size=\"1638287\" load_done=\"true\" cluster_name=\"public\"/&gt;&lt;/Summary&gt;&lt;Submitter acc=\"SRA115778\" center_name=\"Texas A&amp;amp;M University\" contact_name=\"Sean McCaffrey\" lab_name=\"Gastrointestinal Laboratory\"/&gt;&lt;Experiment acc=\"SRX390677\" ver=\"1\" status=\"public\" name=\"Mouse 57\"/&gt;&lt;Study acc=\"SRP033709\" name=\"Mice gut bacteria Targeted Locus (Loci)\"/&gt;&lt;Organism taxid=\"10090\" ScientificName=\"Mus musculus\"/&gt;&lt;Sample acc=\"SRS514105\" name=\"\"/&gt;&lt;Instrument LS454=\"454 GS FLX Titanium\"/&gt;&lt;Library_descriptor&gt;&lt;LIBRARY_NAME/&gt;&lt;LIBRARY_STRATEGY&gt;AMPLICON&lt;/LIBRARY_STRATEGY&gt;&lt;LIBRARY_SOURCE&gt;GENOMIC&lt;/LIBRARY_SOURCE&gt;&lt;LIBRARY_SELECTION&gt;unspecified&lt;/LIBRARY_SELECTION&gt;&lt;LIBRARY_LAYOUT&gt;                 &lt;SINGLE/&gt;               &lt;/LIBRARY_LAYOUT&gt;&lt;/Library_descriptor&gt;&lt;Bioproject&gt;PRJNA231086&lt;/Bioproject&gt;&lt;Biosample&gt;SAMN02440270&lt;/Biosample&gt;  "

#get usable XML file
xml.data.5knwn <- gsub("&gt;", ">", xml.data.dirty)
xml.data.5knwn <- gsub("&lt;", "<", xml.data.5knwn)
xml.data.5knwn <- gsub("&amp;", "&", xml.data.5knwn)
xml.data.5knwn <- gsub("&apos;", "'", xml.data.5knwn)
xml.data.5knwn <- gsub("&quot;", '"', xml.data.5knwn)
xml.data.5knwn.clean <- gsub(" ", "", xml.data.5knwn)
xml.data.5knwn.clean
[1] "<Summary><Title>Mouse57</Title><Platforminstrument_model=\"454GSFLXTitanium\">LS454</Platform><Statisticstotal_runs=\"1\"total_spots=\"6058\"total_bases=\"2449911\"total_size=\"1638287\"load_done=\"true\"cluster_name=\"public\"/></Summary><Submitteracc=\"SRA115778\"center_name=\"TexasA&amp;MUniversity\"contact_name=\"SeanMcCaffrey\"lab_name=\"GastrointestinalLaboratory\"/><Experimentacc=\"SRX390677\"ver=\"1\"status=\"public\"name=\"Mouse57\"/><Studyacc=\"SRP033709\"name=\"MicegutbacteriaTargetedLocus(Loci)\"/><Organismtaxid=\"10090\"ScientificName=\"Musmusculus\"/><Sampleacc=\"SRS514105\"name=\"\"/><InstrumentLS454=\"454GSFLXTitanium\"/><Library_descriptor><LIBRARY_NAME/><LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY><LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>unspecified</LIBRARY_SELECTION><LIBRARY_LAYOUT><SINGLE/></LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA231086</Bioproject><Biosample>SAMN02440270</Biosample>"

Edit: typo

xml rentrez entrez rxml R
modified 13 days ago • written 14 days ago by joe120

I am not sure what exactly you need to parse from this dataset but this looks clean enough.

$ efetch -db sra -id "SRR10025068" -format runinfo
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
SRR10025068,2019-08-27 18:44:15,2019-08-27 16:31:34,12479737,3768880574,12479737,302,1154,,https://sra-download.ncbi.nlm.nih.gov/traces/sra25/SRR/009790/SRR10025068,SRX6762113,1168003_D6_S,WGS,RANDOM,METAGENOMIC,PAIRED,0,0,ILLUMINA,Illumina NovaSeq 6000,SRP219390,PRJNA561398,,561398,SRS5310604,SAMN12617402,simple,408170,human gut metagenome,30185,,,,,,,no,,,,,YALE SCHOOL OF PUBLIC HEALTH,SRA948009,,public,E799243BAFB62132C20AC9F550F70206,052064FDF091B79E9DA48242EF5F98A2
written 14 days ago by genomax74k

Thanks. I tried the Entrez E-utilities as well, but the data returned by the two functions is similar, not identical, and unfortunately I need the part that differs.

written 14 days ago by joe120

Wow, I just realized that the issue isn't only the XML: the data returned is also incorrect, which is a much bigger issue.

written 14 days ago by joe120

it's also the data returned is incorrect

Could you please explain what data are incorrect?

written 13 days ago by vkkodali1.4k

I think this bit in the original post does not match information I obtained when using runinfo

<Summary><Title>Mouse 57</Title><Platform instrument_model=\"454 GS FLX Titanium\">LS454</Platform><Statistics total_runs=\"1\" total_spots=\"6058\" total_bases=\"2449911\" total_size=\"1638287\" load_done=\"true\" cluster_name=\"public\"/></Summary><Submitter acc=\"SRA115778\" center_name=\"Texas A&amp;M University\" contact_name=\"Sean McCaffrey\" lab_name=\"Gastrointestinal Laboratory\"/><Experiment acc=\"SRX390677\" ver=\"1\" status=\"public\" name=\"Mouse 57\"/><Study acc=\"SRP033709\" name=\"Mice gut bacteria Targeted Locus (Loci)\"/><Organism taxid=\"10090\" ScientificName=\"Mus musculus\"/><Sample acc=\"SRS514105\" name=\"\"/><Instrument LS454=\"454 GS FLX Titanium\"/><Library_descriptor><LIBRARY_NAME/><LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY><LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>unspecified</LIBRARY_SELECTION><LIBRARY_LAYOUT> <SINGLE/>
</LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA231086</Bioproject><Biosample>SAMN02440270</Biosample> "

modified 13 days ago • written 13 days ago by genomax74k

I think this bit in the original post does not match information I obtained when using runinfo

Is that not expected? Your command is fetching the runinfo table whereas @joe was downloading the Bioproject docsum. The corresponding edirect command for what @joe was doing:

esearch -db sra -query 'SRR10025068' \
  | elink -db sra -target bioproject -name sra_bioproject_all \
  | esummary

The XML output of the command shown above is still not well formatted, but it can be cleaned up by piping it to xtract -format.

If I understand this correctly, the issue @joe has is related to the encoding of HTML characters in the r_summ$expxml object, not to the data itself.
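
A minimal sketch of that clean-up in R, assuming the escaped string looks like the one in the original post (shortened here). It decodes the entities once and leaves the whitespace alone; the gsub(" ", "", ...) step in the post is what fuses "Platform" with "instrument_model". Because the fragment has several top-level elements, it is wrapped in a made-up <DocSum> root before parsing. The helper name unescape_expxml and the <DocSum> wrapper are illustrative and not part of rentrez; the xml2 step is optional and only runs if that package is installed.

```r
# Decode XML entities exactly once. "&amp;" is handled last so that a
# double-escaped "&amp;amp;" ends up as the still-valid "&amp;".
# Spaces are kept: they separate tag names from attributes.
unescape_expxml <- function(x) {
  x <- gsub("&lt;", "<", x, fixed = TRUE)
  x <- gsub("&gt;", ">", x, fixed = TRUE)
  x <- gsub("&quot;", "\"", x, fixed = TRUE)
  x <- gsub("&apos;", "'", x, fixed = TRUE)
  x <- gsub("&amp;", "&", x, fixed = TRUE)
  trimws(x)
}

# Shortened copy of the string from the original post.
dirty <- paste0(
  "  &lt;Summary&gt;&lt;Title&gt;Mouse 57&lt;/Title&gt;",
  "&lt;Platform instrument_model=\"454 GS FLX Titanium\"&gt;LS454&lt;/Platform&gt;",
  "&lt;/Summary&gt;",
  "&lt;Submitter acc=\"SRA115778\" center_name=\"Texas A&amp;amp;M University\"/&gt;  "
)

clean <- unescape_expxml(dirty)

# Hypothetical wrapper element so the fragment has a single root.
wrapped <- paste0("<DocSum>", clean, "</DocSum>")

# Optional: parse with xml2 if it is available.
if (requireNamespace("xml2", quietly = TRUE)) {
  doc <- xml2::read_xml(wrapped)
  platform <- xml2::xml_find_first(doc, ".//Platform")
  print(xml2::xml_attr(platform, "instrument_model"))
  print(xml2::xml_text(platform))
}
```

Decoding "&amp;" last matters here because the center name arrives double-escaped; decoding it first would leave a bare "&" in the attribute and the parse would fail.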

modified 13 days ago • written 13 days ago by vkkodali1.4k

Ah I see. I only looked at the accession OP was using and looked up the runinfo. That is a NovaSeq 6000 run.

If we look at the BioProject that SRR10025068 belongs to (as far as I can see from this SRA page), where is the reference to 454 in OP's output coming from?

modified 13 days ago • written 13 days ago by genomax74k

Good eyes! It was my (and the OP's) mistake. You see, we both used -target sra as the target db in the elink step, so the data being fetched was for the identifier 561398 from SRA instead of BioProject. I have now fixed my command to use -target bioproject to get the correct data.
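
On the R side, the same fix would presumably be to summarise the linked IDs against the bioproject database instead of sra. A sketch, assuming the rentrez calls from the original post and keeping its sra_bioproject_all link name; the wrapper function bioproject_summary_for_run is hypothetical, and it needs the rentrez package and network access when called.

```r
# Sketch only: requires the rentrez package and network access when called.
bioproject_summary_for_run <- function(run_accession) {
  run_search <- rentrez::entrez_search(db = "sra", term = run_accession)
  links <- rentrez::entrez_link(dbfrom = "sra", id = run_search$ids, db = "all")
  bioproject_ids <- links$links$sra_bioproject_all
  # The linked IDs are BioProject UIDs, so they are summarised against
  # db = "bioproject". Passing them to db = "sra" returns whichever SRA
  # record happens to share that numeric UID.
  rentrez::entrez_summary(db = "bioproject", id = bioproject_ids)
}

# Example call (not run here):
# bioproject_summary_for_run("SRR10025068")
```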

written 13 days ago by vkkodali1.4k

My (original) issue was that the xml output was not correctly formatted, and I later realized the data returned was not correct.

written 13 days ago by joe120
vkkodali1.4k wrote:

This appears to be unnecessarily complicated to me. For a given list of SRA accessions, you should be able to just download the comma-separated runinfo table from the command line (without going through R) and then parse the output file as a CSV from within R. Do you need to do everything from within R? If you do need a parsable XML from within R, you can do the following:

> r1 <- entrez_fetch(db='sra', id='SRR10025068', rettype='runinfo', retmode='xml', parsed=TRUE)
> r1
[1] "\n<SraRunInfo>\n<Row>\n<Run>SRR10025068</Run>\n<ReleaseDate>2019-08-27 18:44:15</ReleaseDate>\n<LoadDate>2019-08-27 16:31:34</LoadDate>\n<spots>12479737</spots>\n<bases>3768880574</bases>\n<spots_with_mates>12479737</spots_with_mates>\n<avgLength>302</avgLength>\n<size_MB>1154</size_MB>\n<download_path>https://sra-download.ncbi.nlm.nih.gov/traces/sra25/SRR/009790/SRR10025068</download_path>\n<Experiment>SRX6762113</Experiment>\n<LibraryName>1168003_D6_S</LibraryName>\n<LibraryStrategy>WGS</LibraryStrategy>\n<LibrarySelection>RANDOM</LibrarySelection>\n<LibrarySource>METAGENOMIC</LibrarySource>\n<LibraryLayout>PAIRED</LibraryLayout>\n<InsertSize>0</InsertSize>\n<InsertDev>0</InsertDev>\n<Platform>ILLUMINA</Platform>\n<Model>Illumina NovaSeq 6000</Model>\n<SRAStudy>SRP219390</SRAStudy>\n<BioProject>PRJNA561398</BioProject>\n<ProjectID>561398</ProjectID>\n<Sample>SRS5310604</Sample>\n<BioSample>SAMN12617402</BioSample>\n<SampleType>simple</SampleType>\n<TaxID>408170</TaxID>\n<ScientificName>human gut metagenome</ScientificName>\n<SampleName>30185</SampleName>\n<Tumor>no</Tumor>\n<CenterName>YALE SCHOOL OF PUBLIC HEALTH</CenterName>\n<Submission>SRA948009</Submission>\n<Consent>public</Consent>\n<RunHash>E799243BAFB62132C20AC9F550F70206</RunHash>\n<ReadHash>052064FDF091B79E9DA48242EF5F98A2</ReadHash>\n</Row>\n\n</SraRunInfo>\n"
written 13 days ago by vkkodali1.4k

Nice! ...next time. Now I'm more familiar with rentrez. I was calling this as part of a bigger function on a list of a few hundred SRA accessions, so yes, in this case I needed (wanted) to work from R.

written 13 days ago by joe120

joe : I moved @vkkodali's comment to an answer since it seems to do what you need efficiently. Feel free to accept that (and your own) answer to provide closure to this thread.

written 13 days ago by genomax74k

I'll just comment that the original issue of the incorrectly formatted XML was not addressed.

written 11 days ago by joe120
joe120 wrote:

Thanks everyone for the responses. In the end I did the following; it's not exactly what I wanted, but I got by...

this.runID <- "SRR10025068"
#
entrez.cmd <- paste0("esearch -db 'sra' -query '", this.runID,"' | esummary  -db 'all' -format runinfo")
entrez.cmd
[1] "esearch -db 'sra' -query 'SRR10025068' | esummary  -db 'all' -format runinfo"
#
entrez.intern <- system(entrez.cmd, intern=TRUE)
entrez.intern
[1] "Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash"
[2] "SRR10025068,2019-08-27 18:44:15,2019-08-27 16:31:34,12479737,3768880574,12479737,302,1154,,https://sra-download.ncbi.nlm.nih.gov/traces/sra25/SRR/009790/SRR10025068,SRX6762113,1168003_D6_S,WGS,RANDOM,METAGENOMIC,PAIRED,0,0,ILLUMINA,Illumina NovaSeq 6000,SRP219390,PRJNA561398,,561398,SRS5310604,SAMN12617402,simple,408170,human gut metagenome,30185,,,,,,,no,,,,,YALE SCHOOL OF PUBLIC HEALTH,SRA948009,,public,E799243BAFB62132C20AC9F550F70206,052064FDF091B79E9DA48242EF5F98A2"                                             
[3] ""
#
entrez.colnames <- unlist(strsplit(entrez.intern[1], ","))
entrez.data <- unlist(strsplit(entrez.intern[2], ","))
this.entrez.data <- t(data.frame(entrez.data))
colnames(this.entrez.data) <- as.character(entrez.colnames)
rownames(this.entrez.data) <- this.runID
this.entrez.data
            Run           ReleaseDate           LoadDate              spots      bases        spots_with_mates avgLength
SRR10025068 "SRR10025068" "2019-08-27 18:44:15" "2019-08-27 16:31:34" "12479737" "3768880574" "12479737"       "302"    
            size_MB AssemblyName download_path                                                               Experiment  
SRR10025068 "1154"  ""           "https://sra-download.ncbi.nlm.nih.gov/traces/sra25/SRR/009790/SRR10025068" "SRX6762113"
            LibraryName    LibraryStrategy LibrarySelection LibrarySource LibraryLayout InsertSize InsertDev Platform  
SRR10025068 "1168003_D6_S" "WGS"           "RANDOM"         "METAGENOMIC" "PAIRED"      "0"        "0"       "ILLUMINA"
            Model                   SRAStudy    BioProject    Study_Pubmed_id ProjectID Sample       BioSample     
SRR10025068 "Illumina NovaSeq 6000" "SRP219390" "PRJNA561398" ""              "561398"  "SRS5310604" "SAMN12617402"
            SampleType TaxID    ScientificName         SampleName g1k_pop_code source g1k_analysis_group Subject_ID Sex
SRR10025068 "simple"   "408170" "human gut metagenome" "30185"    ""           ""     ""                 ""         "" 
            Disease Tumor Affection_Status Analyte_Type Histological_Type Body_Site CenterName                    
SRR10025068 ""      "no"  ""               ""           ""                ""        "YALE SCHOOL OF PUBLIC HEALTH"
            Submission  dbgap_study_accession Consent  RunHash                           
SRR10025068 "SRA948009" ""                    "public" "E799243BAFB62132C20AC9F550F70206"
            ReadHash                          
SRR10025068 "052064FDF091B79E9DA48242EF5F98A2"
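
A possible shortcut for the last few steps, sketched under the assumption that entrez.intern is a character vector of CSV lines like the one above. The input here is cut down to a few columns, and the second run SRR00000000 is made up to show that several accessions end up as rows of one table. read.csv(text = ...) does the splitting and column naming in one call; the helper name parse_runinfo is illustrative.

```r
# Stand-in for the vector returned by system(..., intern = TRUE).
runinfo_lines <- c(
  "Run,spots,bases,AssemblyName,Platform,Model,BioProject",
  "SRR10025068,12479737,3768880574,,ILLUMINA,Illumina NovaSeq 6000,PRJNA561398",
  "SRR00000000,100,30000,,ILLUMINA,Illumina NovaSeq 6000,PRJNA561398",  # hypothetical run
  ""
)

parse_runinfo <- function(lines) {
  # Drop the empty trailing element, then let read.csv split and name the fields.
  lines <- lines[nzchar(lines)]
  info <- read.csv(text = paste(lines, collapse = "\n"),
                   stringsAsFactors = FALSE,
                   colClasses = "character")
  rownames(info) <- info$Run
  info
}

runinfo <- parse_runinfo(runinfo_lines)
print(runinfo["SRR10025068", c("Platform", "Model", "BioProject")])
```

Reading every column as character mirrors the all-character matrix above; dropping colClasses would let read.csv guess numeric types for columns such as spots and bases.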
written 13 days ago by joe120