download all metadata from SRA
3.1 years ago

From SRA, how would you get the number of DNA-seq samples per year for the top 10 most frequently sequenced species? Alternatively, how can all SRA metadata be downloaded? This source does not contain species information: ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata/SRA_Accessions.tab

SRA • 2.6k views
3.1 years ago
GenoMax 141k

Using EntrezDirect to get you started.
This is likely not a perfect query. I will think about this some more later. Adjust date range as needed.

$ esearch -db sra -query "2021/1/1:2021/1/2[Publication Date]"  | elink -target biosample | esummary | xtract -pattern DocumentSummary -element Organism | sort | uniq -c | sort -k1,1nr
 170 Glycine max
 121 Rhodeus ocellatus kurumeus
  83 air metagenome
  62 Culex bitaeniorhynchus
  48 Culex tritaeniorhynchus
  39 Escherichia coli
  37 Kalanchoe laxiflora
  36 Homo sapiens
  35 soil metagenome
  32 Mus musculus
  22 Rhodeus ocellatus ocellatus
  20 feces metagenome
  13 Salmonella enterica subsp. enterica serovar Infantis
   9 Cardamine flexuosa
   8 Salmonella enterica subsp. enterica serovar Kentucky
   7 Arabidopsis thaliana
   7 Campylobacter jejuni
   7 Salmonella enterica subsp. enterica serovar Enteritidis
   6 Zea mays
   4 Salmonella enterica subsp. enterica serovar Typhimurium
   3 Salmonella enterica
   3 Salmonella enterica subsp. enterica
   3 Salmonella enterica subsp. enterica serovar Newport
   2 Salmonella enterica subsp. enterica serovar Agona
   2 Salmonella enterica subsp. enterica serovar Eko
   2 Salmonella enterica subsp. enterica serovar London
   2 Salmonella enterica subsp. enterica serovar Schwarzengrund
   2 Vicia sativa
   2 mixed culture
   1 Abeliophyllum distichum f. lilacinum
   1 Aspergillus aculeatinus
   1 Campylobacter jejuni subsp. jejuni
   1 Fagus sylvatica
   1 Nicotiana
   1 Physalis pubescens
   1 Polygonatum kingianum
   1 Rhus punjabensis var. sinica
   1 Salmonella enterica subsp. enterica serovar 4,[5],12:i:-
   1 Salmonella enterica subsp. enterica serovar Anatum
   1 Salmonella enterica subsp. enterica serovar Brandenburg
   1 Salmonella enterica subsp. enterica serovar Derby
   1 Salmonella enterica subsp. enterica serovar Johannesburg
   1 Salmonella enterica subsp. enterica serovar Senftenberg
   1 Shigella sonnei
   1 freshwater sediment metagenome
   1 riverine metagenome
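
To cut a tally like this down to the most frequent species, as the question asks, the counting stage can be wrapped in a small helper. This is only a sketch: `top_organisms` is a made-up name, and it assumes one organism name per line on stdin, which is what the `xtract` step above emits.

```shell
# Hypothetical helper: read one organism name per line on stdin and
# print the n most frequent names with their counts (default n = 10).
top_organisms() {
    n=${1:-10}
    sort | uniq -c | sort -k1,1nr | head -n "$n"
}
```

It would replace the trailing `sort | uniq -c | sort -k1,1nr` in the pipeline, e.g. `... | xtract -pattern DocumentSummary -element Organism | top_organisms 10`.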

If you are willing to write some code, you can extract a lot more information from a query like this:

$ esearch -db sra -query "2021/1/1:2021/1/2[Publication Date]"  | esummary | head -100

<!DOCTYPE DocumentSummarySet PUBLIC "-//NLM//DTD esummary sra 20130524//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130524/esummary_sra.dtd">

<DocumentSummarySet status="OK">
<DocumentSummary>
<Id>11835626</Id>
    <ExpXml>  <Summary><Title>RNA-Seq of early induced cardiac progenitors (Day-7)</Title><Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform><Statistics total_runs="1" total_spots="36730347" total_bases="11019104100" total_size="4562391529" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA1123873" center_name="University of Cincinnati" contact_name="Jialiang Liang" lab_name="Department of Pathology"/><Experiment acc="SRX9106574" ver="4" status="public" name="RNA-Seq of early induced cardiac progenitors (Day-7)"/><Study acc="SRP282054" name="Activation of endogenous genes by CRISPR enables conversion of mouse fibroblasts into cardiac progenitor cells"/><Organism taxid="10090" ScientificName="Mus musculus"/><Sample acc="SRS7349991" name=""/><Instrument ILLUMINA="Illumina HiSeq 2500"/><Library_descriptor><LIBRARY_NAME>T4</LIBRARY_NAME><LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY><LIBRARY_SOURCE>TRANSCRIPTOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>Oligo-dT</LIBRARY_SELECTION><LIBRARY_LAYOUT>                 <PAIRED/>               </LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA662934</Bioproject><Biosample>SAMN16109872</Biosample>  </ExpXml>
    <Runs>                                <Run acc="SRR12623858" total_spots="36730347" total_bases="11019104100" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/>                                </Runs>
    <ExtLinks></ExtLinks>
    <CreateDate>2021/01/01</CreateDate>
    <UpdateDate>2021/02/02</UpdateDate>
</DocumentSummary>
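
As a rough illustration of "writing some code" against this output, the sketch below pulls the year, library strategy and organism out of each record. It assumes the layout printed above (the ExpXml content unescaped and on a single line, with CreateDate on a later line of the same record); if the XML arrives entity-escaped it would need decoding first. The function name `summary_to_tsv` is made up for this example.

```shell
# Hypothetical helper: turn esummary XML on stdin into
# "year<TAB>library strategy<TAB>organism", one line per DocumentSummary.
# Assumes the unescaped, line-oriented layout shown in the output above.
summary_to_tsv() {
    awk '
        # Start of a record: clear fields left over from the previous one.
        /<DocumentSummary>/ { year = strategy = organism = "" }
        match($0, /ScientificName="[^"]*"/) {
            organism = substr($0, RSTART + 16, RLENGTH - 17)
        }
        match($0, /<LIBRARY_STRATEGY>[^<]*</) {
            strategy = substr($0, RSTART + 18, RLENGTH - 19)
        }
        match($0, /<CreateDate>[0-9][0-9][0-9][0-9]/) {
            year = substr($0, RSTART + 12, 4)
        }
        # End of a record: emit one tab-separated line.
        /<\/DocumentSummary>/ { print year "\t" strategy "\t" organism }
    '
}
```

Something like `esearch ... | esummary | summary_to_tsv | awk -F'\t' '$2 == "WGS"' | cut -f1,3 | sort | uniq -c` would then give per-year counts per organism for one library strategy. Which strategies count as "DNA-seq" is a judgment call; WGS is only one candidate.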

thanks, seems like a good starting point!


Hi GenoMax, I'd like to revive this thread. For some reason this command

esearch -db sra -query "2019/1/1:2020/1/1[Publication Date]" | elink -target biosample | esummary | xtract -pattern DocumentSummary -element Organism | sort | wc -l

retrieves only 952 entries, which is obviously far too few if the command is correct. I could not find this form of date specification in the esearch documentation, so I am wondering whether anyone has an idea how to fix the command.


You may want to go back to the metadata file folder and get the SRA_Accessions.tab file (LINK, 10G download). Extract the accessions for the date range you need and then look up their organisms. That may be more foolproof.

$ esearch -db sra -query "SAMN20125464"  | elink -target biosample | esummary | xtract -pattern DocumentSummary -element Organism
Homo sapiens
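
The "extract accessions for the date range" step might look like the sketch below. It assumes SRA_Accessions.tab is tab-separated with a header row naming the columns `Type`, `Published` and `BioSample`, and that `Published` starts with an ISO date (YYYY-MM-DD); check this with `head -1 SRA_Accessions.tab` before relying on it. `biosamples_in_range` is a made-up name.

```shell
# Hypothetical helper: print the unique BioSample accessions of RUN records
# whose Published date falls in [from, to), with dates given as YYYY-MM-DD.
# Column names are looked up in the header row; they are assumptions.
biosamples_in_range() {
    file=$1 from=$2 to=$3
    awk -F'\t' -v from="$from" -v to="$to" '
        NR == 1 {
            # Map header names to column numbers.
            for (i = 1; i <= NF; i++) col[$i] = i
            t = col["Type"]; p = col["Published"]; b = col["BioSample"]
            next
        }
        $t == "RUN" {
            day = substr($p, 1, 10)
            # ISO dates compare correctly as plain strings.
            if (day >= from && day < to && $b != "-") print $b
        }
    ' "$file" | sort -u
}
```

For example, `biosamples_in_range SRA_Accessions.tab 2019-01-01 2020-01-01 > biosamples_2019.txt` would give a list of accessions that can then be looked up one at a time or in batches with a query like the one above.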

Thanks, I am already downloading the data I need from SRA. Let's see which one is faster.
