Question: Extracting all information about a sample when using xtract from e-utilities
An Ignorant Wanderer wrote (6 weeks ago):

I would like to extract all information about each SAMPLE after running the following query (run the query and add a | grep SAMPLE for clarification on what I mean by SAMPLE):

esearch -db sra -query PRJNA514750 | efetch -format xml

I tried the following:

esearch -db sra -query PRJNA514750 | efetch -format xml | xtract -pattern EXPERIMENT -element SAMPLE

but this returns nothing (PS: SAMPLEs are within an EXPERIMENT tag). I read in the e-utilities guide that -pattern divides the data into rows and -element into columns, so I presume this didn't work because SAMPLE has multiple tags within it. So I then tried:

esearch -db sra -query PRJNA514750 | efetch -format xml | xtract -pattern SAMPLE -element random_SAMPLE_tag

where random_SAMPLE_tag is any tag within SAMPLE.

Here's a concrete example:

esearch -db sra -query PRJNA514750 | efetch -format xml | xtract -pattern SAMPLE -element TITLE

This works, but I want to get all the information about each SAMPLE, and I do not know beforehand what the tags within it are (I found TITLE manually in this case). Since I want to get this info for quite a few studies, I can't check each one manually.

Istvan Albert wrote (6 weeks ago):

First save the search output into a file:

esearch -db sra -query PRJNA514750 | efetch -format xml > out.xml

that way you don't need to rerun the query. You can view the structure of the file with:

cat out.xml | xtract -outline

it prints:

SAMPLE
  IDENTIFIERS
    PRIMARY_ID
    EXTERNAL_ID
    EXTERNAL_ID
  TITLE
  SAMPLE_NAME
    TAXON_ID
    SCIENTIFIC_NAME
  SAMPLE_LINKS
    SAMPLE_LINK
      XREF_LINK
        DB
        ID
        LABEL
  SAMPLE_ATTRIBUTES
    SAMPLE_ATTRIBUTE
      TAG
      VALUE

You can also view the XML file in a browser to see the actual content of the file.

Now, xtract has some crazy constructs; see the seemingly never-ending stream of increasingly complex examples here: https://www.ncbi.nlm.nih.gov/books/NBK179288/

I don't know of a construct that flattens the entire file into text, but as you can imagine that process is not nearly as simple as one might think. There is usually a lot of redundant information that would be useless if fully flattened. It is typically better to leave the data as XML and figure out how to get the fields you need with xtract when you do need them.
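
If you really do want a generic per-SAMPLE dump without knowing the tags in advance, one option is to walk the saved XML outside of xtract. What follows is only a minimal Python sketch, not an xtract feature. It assumes the file parses as a single XML document. The inline SAMPLE_XML is a made-up stand-in shaped like the outline above (its attribute names and values are illustrative only), and flatten and samples_to_rows are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for out.xml, shaped like the outline above.
# Attribute names and all values here are illustrative, not real records.
SAMPLE_XML = """<EXPERIMENT_PACKAGE_SET>
  <EXPERIMENT_PACKAGE>
    <SAMPLE alias="demo_alias" accession="DEMO0001">
      <IDENTIFIERS>
        <PRIMARY_ID>DEMO0001</PRIMARY_ID>
        <EXTERNAL_ID namespace="BioSample">DEMO_BS1</EXTERNAL_ID>
      </IDENTIFIERS>
      <TITLE>demo title</TITLE>
      <SAMPLE_ATTRIBUTES>
        <SAMPLE_ATTRIBUTE>
          <TAG>strain</TAG>
          <VALUE>demo strain</VALUE>
        </SAMPLE_ATTRIBUTE>
      </SAMPLE_ATTRIBUTES>
    </SAMPLE>
  </EXPERIMENT_PACKAGE>
</EXPERIMENT_PACKAGE_SET>"""


def flatten(elem, path=""):
    """Yield (path, value) for each attribute and non-empty text node under elem."""
    here = f"{path}/{elem.tag}" if path else elem.tag
    for name, value in elem.attrib.items():
        yield f"{here}@{name}", value
    text = (elem.text or "").strip()
    if text:
        yield here, text
    for child in elem:
        yield from flatten(child, here)


def samples_to_rows(root):
    """Return one list of (path, value) pairs per SAMPLE element."""
    return [list(flatten(sample)) for sample in root.iter("SAMPLE")]


root = ET.fromstring(SAMPLE_XML)
# Print one tab-separated line per SAMPLE, each field as path=value.
for row in samples_to_rows(root):
    print("\t".join(f"{path}={value}" for path, value in row))
```

For the saved file, replace the ET.fromstring(SAMPLE_XML) line with ET.parse("out.xml").getroot(). Repeated tags such as EXTERNAL_ID simply show up several times in a row, which is part of why a full flatten tends to be noisy.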

genomax wrote (6 weeks ago):

Perhaps this would help. I have truncated information to include only two samples here.

$ esearch -db sra -query PRJNA514750 | efetch -format runinfo
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
SRR8435655,2019-01-23 17:34:09,2019-01-11 15:13:41,20246690,1032581190,0,51,455,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-2/SRR8435655/SRR8435655.1,SRX5243190,,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP178555,PRJNA514750,3,514750,SRS4245865,SAMN10734300,simple,6239,Caenorhabditis elegans,GSM3560682,,,,,,,no,,,,,GEO,SRA833758,,public,5581E5CC4A0EFFEDADC3BEAE797E0A38,C7501C5F9F0424FB05F81C48477BE7E4
SRR8435656,2019-01-23 17:34:09,2019-01-11 15:14:13,22222562,1133350662,0,51,498,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-15/SRR8435656/SRR8435656.1,SRX5243191,,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP178555,PRJNA514750,3,514750,SRS4245866,SAMN10734299,simple,6239,Caenorhabditis elegans,GSM3560683,,,,,,,no,,,,,GEO,SRA833758,,public,E03956EFF39BF1150F5E08C1303BBA4E,65078A40AA70B0F6E4B5142130FA9586
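
If the runinfo table already has what you need, picking columns by header name avoids counting commas. Here is a minimal Python sketch, assuming the output is available as CSV text. RUNINFO below is a trimmed copy of the two rows above with only six of the columns kept, and select_columns is a hypothetical helper name.

```python
import csv
import io

# Trimmed copy of the runinfo output above: only six columns are kept
# so the example stays short. The real output has many more columns.
RUNINFO = """Run,Experiment,Sample,BioSample,ScientificName,SampleName
SRR8435655,SRX5243190,SRS4245865,SAMN10734300,Caenorhabditis elegans,GSM3560682
SRR8435656,SRX5243191,SRS4245866,SAMN10734299,Caenorhabditis elegans,GSM3560683
"""


def select_columns(handle, wanted):
    """Return the wanted columns, looked up by header name, for each CSV row."""
    return [[row[name] for name in wanted] for row in csv.DictReader(handle)]


wanted = ["Run", "BioSample", "SampleName"]
for fields in select_columns(io.StringIO(RUNINFO), wanted):
    print("\t".join(fields))
```

To run this on real output, save the efetch -format runinfo result to a file and pass an open file handle instead of io.StringIO(RUNINFO).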