Question: How to download the sample attributes (sample metadata) file from the European Nucleotide Archive (EMBL-EBI)?
alowi33 wrote:

Project PRJEB99111 has 147 samples. I want to download the metadata (age, sex, disease status, etc.) of each sample, not the fastq files. The only way I have found to download the metadata is to fetch the XML file of each sample accession one by one. Is there a way to bulk download all 147 metadata files? I can work with XML files if I have to.

You can view the metadata for a specific sample accession by clicking on the "attributes" tab. Here is an example for one sample: https://www.ebi.ac.uk/ena/data/view/SAMEA104228123

modified 9 months ago by genomax • written 9 months ago by alowi33

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

With an XSLT stylesheet saved as transform.xsl:

$ wget -q -O - "https://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=PRJEB99111&result=read_run&fields=study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,tax_id,scientific_name,instrument_model,library_layout,fastq_ftp,fastq_galaxy,submitted_ftp,submitted_galaxy,sra_ftp,sra_galaxy,cram_index_ftp,cram_index_galaxy&download=txt" |
    grep -v sample_accession |
    cut -f 3 |
    awk '{printf("https://www.ebi.ac.uk/ena/data/view/%s&display=xml\n",$0);}' |
    while read U; do wget -O - -q "$U" | xsltproc transform.xsl - ; done


ERS1887136|age|61
ERS1887136|age_units|years
ERS1887136|body_habitat|UBERON:feces
ERS1887136|body_product|UBERON:feces
ERS1887136|body_site|UBERON:feces
ERS1887136|collection_site|UCSF
ERS1887136|collection_timestamp|2013-10-08
ERS1887136|day_in_timeseries|Missing: Not provided
ERS1887136|disease_course|RRMS
ERS1887136|disease_state|MS
ERS1887136|dna_extracted|TRUE
ERS1887136|elevation|124
ERS1887136|env_biome|urban biome
ERS1887136|env_feature|human-associated habitat
ERS1887136|env_material|feces
ERS1887136|env_package|human-gut
ERS1887136|flare|No
ERS1887136|geo_loc_name|USA:CA:San Francisco
ERS1887136|height|Missing: Not provided
ERS1887136|height_units|Missing: Not provided
ERS1887136|host_common_name|human
ERS1887136|host scientific name|Homo sapiens
ERS1887136|host_subject_id|34
ERS1887136|host_taxid|9606
ERS1887136|household|H1004
ERS1887136|investigation_type|mimarks-survey
ERS1887136|latitude|37.76
ERS1887136|life_stage|adult
ERS1887136|longitude|-122.46
ERS1887136|physical_specimen_location|UCSF
ERS1887136|physical_specimen_remaining|FALSE
ERS1887136|repeated_sequencing|1
ERS1887136|sample_type|stool
ERS1887136|sequencing_set|2
ERS1887136|sex|female
ERS1887136|sinai_unmarked_rep|Missing: Not provided
ERS1887136|submission_number|1
(...)
written 9 months ago by Pierre Lindenbaum

Exquisite solution. However, it only works for the first 3 samples, and then the following error message is repeated many times:

unable to parse -
-:1: parser error : Document is empty

^
-:1: parser error : Start tag expected, '<' not found

Perhaps the site stopped granting us access, thinking that we were not human. I don't know.
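
The "Document is empty" message means xsltproc received no input, i.e. the download step returned nothing. The cause is not established in this thread. If the failures are transient, one option is to retry each download a few times and only pass non-empty output on to xsltproc. Below is a minimal sketch of that idea; the function name retry_nonempty and its arguments are hypothetical, not part of wget or xsltproc.

```sh
#!/bin/sh
# retry_nonempty TRIES DELAY CMD [ARGS...]
# Run CMD until it prints something on stdout, at most TRIES times,
# sleeping DELAY seconds between attempts. Prints the output of the first
# non-empty attempt; returns 1 if every attempt came back empty.
# (Hypothetical helper written for this example.)
retry_nonempty() {
  rn_tries=$1
  rn_delay=$2
  shift 2
  rn_i=1
  while [ "$rn_i" -le "$rn_tries" ]; do
    rn_out=$("$@" 2>/dev/null)
    if [ -n "$rn_out" ]; then
      printf '%s\n' "$rn_out"
      return 0
    fi
    rn_i=$((rn_i + 1))
    if [ "$rn_i" -le "$rn_tries" ]; then
      sleep "$rn_delay"
    fi
  done
  return 1
}

# Possible use inside the loop of the pipeline above (not executed here):
#   while read -r U; do
#     xml=$(retry_nonempty 3 5 wget -q -O - "$U") &&
#       printf '%s\n' "$xml" | xsltproc transform.xsl -
#   done
```

With this guard, a URL that still returns nothing after three attempts is skipped instead of producing a parser error.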

written 9 months ago by alowi33

However, it only works for the first 3 samples

works on my machine

https://pastebin.com/sq6dzSKX

written 9 months ago by Pierre Lindenbaum

Astounding! Much appreciated. I wonder why it didn't fully work on my machine...

written 9 months ago by alowi33

How can the XSLT stylesheet be changed so that the output is tab-delimited, with sample names as rows and sample attributes as columns? Example:

               age       age_units   ...
ERS1887136     61        years     ...
ERS1887137     61        years     ...
ERS1887138     44        years     ...
...            ...         ...
modified 9 months ago • written 9 months ago by alowi33

Use datamash? https://www.gnu.org/software/datamash/

Or something in sqlite; see the answer to "formatting problem (awk/bash)".
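
As an alternative to datamash or sqlite, the reshaping can also be done in plain awk. The sketch below assumes the input is the "accession|attribute|value" output shown earlier in this thread; the function name pivot_attributes, the "sample" header label, and the "NA" filler for missing values are choices made for this example.

```sh
#!/bin/sh
# Pivot "sample|attribute|value" lines (long format) read from stdin into a
# tab-separated table: one row per sample, one column per attribute.
# Samples and attributes keep the order in which they first appear.
pivot_attributes() {
  awk -F'|' '
    NF < 3 { next }   # skip lines that are not sample|attribute|value
    {
      sample = $1; attr = $2
      # Everything after the second "|" is the value, even if it contains "|".
      val = substr($0, length(sample) + length(attr) + 3)
      if (!(sample in seen_sample)) { seen_sample[sample] = 1; samples[++ns] = sample }
      if (!(attr in seen_attr))     { seen_attr[attr] = 1;     attrs[++na] = attr }
      cell[sample, attr] = val
    }
    END {
      printf "sample"
      for (j = 1; j <= na; j++) printf "\t%s", attrs[j]
      printf "\n"
      for (i = 1; i <= ns; i++) {
        printf "%s", samples[i]
        for (j = 1; j <= na; j++) {
          key = samples[i] SUBSEP attrs[j]
          printf "\t%s", ((key in cell) ? cell[key] : "NA")
        }
        printf "\n"
      }
    }'
}
```

Appending "| pivot_attributes > metadata.tsv" to the wget/xsltproc pipeline would write the wide table to a file (metadata.tsv is just an example name).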

written 9 months ago by Pierre Lindenbaum

piet (planet earth) wrote:

Unfortunately NCBI does not contain metadata for this project.

This is not true. You can easily download an XML file containing all of the attributes of all the biosamples from NCBI. Since the procedure may also be useful in other contexts, I will describe it step by step.

First go to the page of the project (the bioproject database, in NCBI speak):

https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJEB99111

Next, get a list of all biosamples which are linked to this project. There is a section entitled "Related information" on the right side of the page. To get the list of biosamples, click on the hyperlink "Biosample".

This will open a new page which lists the first 20 biosamples in the project. The URL of that page is:

https://www.ncbi.nlm.nih.gov/biosample?LinkName=bioproject_biosample_all&from_uid=400734

At the top of this page (on the right side) is a pull-down menu entitled "Send to:". Click on this menu, then select "File", then select the format "Full XML (text)", and finally click on the button "Create File". Store the XML file on your local disk and parse it with your favorite XML tool.

modified 9 months ago • written 9 months ago by piet

That is what I was looking for. Usually bioprojects in NCBI contain a file with all metadata. This file is available in other bioprojects but I couldn't find it in this project. I didn't know about the option you described. Very simple yet useful. Many thanks.

written 9 months ago by alowi33

genomax (United States) wrote:

Using NCBI EDirect (the Entrez Direct command-line tools for E-utilities):

esearch -db bioproject -query "PRJEB99111" | elink -target biosample | efetch -format docsum | xtract -pattern DocumentSummary -block Attribute -element Attribute

produces (only a sample below)

2017-08-28  2017-08-26  ERS1887138  female  44  years   UBERON:feces    UBERON:feces    UBERON:feces    UCSF    2013-09-25  Missing: Not provided   RRMS    MS  TRUE    124 urban biome human-associated habitat    feces   human-gut   No  USA:CA:San Francisco    Missing: Not provided   Missing: Not provided   Homo sapiens    111 9606    Missing: Not provided   mimarks-survey  37.76   adult   -122.46 UCSF    FALSE   1   stool   1   Missing: Not provided   1   1_a No  Gut dysbiosis in patients with multiple sclerosis is characterized by bacteria that regulate T lymphocyte differentiation in vitro  No_Treatment    Off Missing: Not provided   Missing: Not provided   dry 1990
2017-08-28  2017-08-26  ERS1887137  male    61  years   UBERON:feces    UBERON:feces    UBERON:feces    UCSF    Missing: Not provided   Missing: Not provided   RRMS    MS  TRUE    124 urban biome human-associated habitat    feces   human-gut   No  USA:CA:San Francisco    Missing: Not provided   Missing: Not provided   Homo sapiens    62  9606    Missing: Not provided   mimarks-survey  37.76   adult   -122.46 UCSF    FALSE   1   stool   2   Missing: Not provided   1   1_a No  Gut dysbiosis in patients with multiple sclerosis is characterized by bacteria that regulate T lymphocyte differentiation in vitro  No_Treatment    Off Missing: Not provided   Missing: Not provided   dry 1984
modified 9 months ago • written 9 months ago by genomax

Worked great :) Any way to include each attribute's category (name) as a header in the first line?

written 9 months ago by alowi33

piet (planet earth) wrote:

In my opinion, NCBI Entrez/Eutils is more versatile than EBI for downloads like this. If you want to stick with EBI, you can run the loop over all entries of the project on your local computer. There are only 147 samples. Since tasks like this are usually run only once, do not worry too much about computational efficiency.

First download the list of all sample accessions in the project:

wget 'https://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=PRJEB99111&result=read_run&fields=sample_accession&download=txt' -O - | tee /tmp/acc.lst | less

A single biosample with all attributes can be fetched in this way:

wget 'https://www.ebi.ac.uk/ena/data/view/SAMEA104228123&display=xml' -O - | less

To fetch all samples, loop over all of the sample accessions in the list:

# sed 1d skips the "sample_accession" header line of the list
foreach a (`sed 1d /tmp/acc.lst`)
      wget "https://www.ebi.ac.uk/ena/data/view/$a&display=xml" -O $a.xml
end

The above shows how to do it in C shell. It should also be easy to achieve with Python and requests.
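
For readers who use a POSIX shell rather than C shell, the same loop could be sketched as below. The helper names, the DELAY variable, and the de-duplication step are assumptions made for this example; the URL pattern is the one quoted in this thread and may have changed since.

```sh
#!/bin/sh
# Sketch: fetch one XML file per sample accession, with a pause between
# requests. Helper names and the DELAY variable are hypothetical.

# Build the per-sample XML URL using the pattern quoted in this thread.
sample_xml_url() {
  printf 'https://www.ebi.ac.uk/ena/data/view/%s&display=xml\n' "$1"
}

# Drop the "sample_accession" header line and duplicate accessions
# (the read_run report has one row per run, so a sample can repeat).
clean_accessions() {
  grep -v '^sample_accession$' "$1" | sort -u
}

# Download every accession listed in file $1 into directory $2.
fetch_samples() {
  mkdir -p "$2" || return 1
  clean_accessions "$1" | while read -r acc; do
    [ -n "$acc" ] || continue
    wget -q --tries=3 -O "$2/$acc.xml" "$(sample_xml_url "$acc")" ||
      echo "failed: $acc" >&2
    sleep "${DELAY:-1}"   # seconds to wait between requests
  done
}
```

For example, "fetch_samples /tmp/acc.lst xml_out" would write one file per sample into the directory xml_out, named after the accession.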

modified 9 months ago • written 9 months ago by piet

It worked fine for me. I modified the code a little bit:

code:

$ wget 'https://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=PRJEB99111&result=read_run&fields=sample_accession&download=txt' -O samples.lst
$ sed 1d samples.lst | parallel --delay 2 'wget "https://www.ebi.ac.uk/ena/data/view/{}&display=xml" -O {}.xml'

Note:

  1. The sample list file comes with a header line, so the first line is removed.
  2. I used a delay of 2 seconds in parallel. This can be reduced or removed. The output is in XML format, one file per sample, named after the sample accession with an .xml extension.
written 9 months ago by cpad0112

Unfortunately NCBI does not contain metadata for this project. I get the error "Unable to establish SSL connection" using your code. I have tried Python's requests library, but after one successful XML read the connection fails when I try to read again. You can see my sample code in my other post, "python stopped opening xml url, connection closed".

written 9 months ago by alowi33

Are you behind an HTTP proxy?

modified 9 months ago • written 9 months ago by piet

I am ssh-ed into a remote server. I didn't ssh using a key - that is the "key" to solving my problem ;)

written 9 months ago by alowi33

LLTommy wrote:

You can also get ENA metadata from EBI's BioSamples database. For example, the sample SAMEA104228123 you mentioned can be found under https://www.ebi.ac.uk/biosamples/samples/SAMEA104228123 in the browser. You can get the data as XML but also as JSON (find the button in the right corner) via the API (e.g. https://www.ebi.ac.uk/biosamples/api/samples/SAMEA104228123).
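
A rough sketch of how the JSON route could be scripted follows. It assumes that curl and jq are installed, and that the JSON document has a top-level "accession" field and a "characteristics" object mapping each attribute name to a list of entries with a "text" field; check this against the API documentation before relying on it. The function names are made up for this example.

```sh
#!/bin/sh
# Build the BioSamples API URL for one accession (pattern from this thread).
biosample_api_url() {
  printf 'https://www.ebi.ac.uk/biosamples/api/samples/%s\n' "$1"
}

# Read one BioSamples JSON document on stdin and print tab-separated
# "accession, attribute, value" lines, taking the first value of each
# attribute. Assumes the "characteristics" layout described above.
biosample_attributes() {
  jq -r '.accession as $acc
         | (.characteristics // {})
         | to_entries[]
         | [$acc, .key, ((.value[0].text // "") | tostring)]
         | @tsv'
}

# Possible use (not executed here; the Accept header is an assumption):
#   curl -s -H "Accept: application/json" \
#     "$(biosample_api_url SAMEA104228123)" | biosample_attributes
```

Looping this over a list of accessions yields the same long format as the XSLT approach, without an XML parsing step.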

modified 9 months ago • written 9 months ago by LLTommy

I did not know about the JSON, that is interesting.

written 9 months ago by piet

Glad I could help. If you are interested in the API and JSON, have a look at the API documentation for BioSamples: https://www.ebi.ac.uk/biosamples/help/api

written 9 months ago by LLTommy