Question: Find NGS datasets on GEO that have a certain read length?
0
4.3 years ago by
USA
justin.nelligan60 wrote:

Hi everyone,

I plan to cross-correlate a large dataset we are generating with public data, from GEO for example. The only problem is that I want experiments sequenced with a certain read length (e.g. excluding anything sequenced at only 36 bp, for mappability reasons).

I haven't found a way to use advanced filters or some other way to filter GEO or SRA or DDBJ. Is it possible to do such a restricted search?

Thanks!

chip-seq ngs • 1.4k views
modified 4.3 years ago by Pierre Lindenbaum116k • written 4.3 years ago by justin.nelligan60

many thanks

written 4.3 years ago by Elnaaz40
5
4.3 years ago by
Istvan Albert ♦♦ 79k
University Park, USA
Istvan Albert ♦♦ 79k wrote:

You could try fetching the runinfo for all samples, then filter on the Average Length column (column 7 of the runinfo output). I think that is either the read length or twice that in paired runs. Also look at the other columns. For a single run it would be:

esearch -db sra -query SRR1613384 | efetch -format runinfo | cut -d ',' -f 1,7

You can expand the query to match more entries (quote the wildcard so the shell does not expand it):

esearch -db sra -query 'SRR1613*' | efetch -format runinfo | cut -d ',' -f 1,7
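The filtering step itself could look something like the sketch below. It is not part of the original answer: the function name `filter_by_length` is made up, and it assumes the runinfo header names the relevant columns `Run`, `avgLength` and `LibraryLayout`, and that no earlier field contains an embedded comma. Following the guess above, it halves the average length for runs marked PAIRED, so check a few runs by hand before trusting it.

```shell
# Sketch: keep runs whose per-read length is at least MIN (default 50).
# Reads runinfo CSV on stdin and finds columns by header name.
# Assumptions: header fields "Run", "avgLength", "LibraryLayout";
# no quoted commas before those columns; avgLength is doubled for PAIRED runs.
filter_by_length() {
    awk -F',' -v min="${1:-50}" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "Run") run = i
                if ($i == "avgLength") len = i
                if ($i == "LibraryLayout") layout = i
            }
            next
        }
        # Skip blank lines and any repeated header lines
        run && len && $run != "" && $run != "Run" {
            per_read = $len + 0
            if (layout && $layout == "PAIRED") per_read /= 2
            if (per_read >= min) print $run "," per_read
        }
    '
}

# Usage (needs Entrez Direct and network access):
# esearch -db sra -query 'SRR1613*' | efetch -format runinfo | filter_by_length 50
```

Looking the columns up by name rather than by position means the filter keeps working if the column order of the runinfo output differs from what `cut -f 1,7` expects.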

It may be possible to search on this average length directly, but it is not clear how.

Edit: the code above requires NCBI Entrez Direct, the Entrez utilities on the Unix command line.

modified 4.3 years ago • written 4.3 years ago by Istvan Albert ♦♦ 79k
3
4.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum116k wrote:

The data in SRA are listed today in ftp://ftp-trace.ncbi.nih.gov/sra/reports/Datalist/NCBI_SRA_Datalist_20140926

Using the Firefox console, I see an NCBI web service loading a few reads for a given run using the following URL: "http://trace.ncbi.nlm.nih.gov/Traces/sra/?run_spot=(id)". The result is a JSON file.

All in one, a quick-and-dirty solution (I don't parse the JSON) is:

 

 curl -s "ftp://ftp-trace.ncbi.nih.gov/sra/reports/Datalist/NCBI_SRA_Datalist_20140926" | cut -f2  | grep -v ERR | while read L; do curl -s  "http://trace.ncbi.nlm.nih.gov/Traces/sra/?run_spot=${L}" |  grep read_len | sed "s/^/${L}  /" ; done | uniq
DRR000001  read_len:[36,36],
DRR000002  read_len:[36,36],
DRR000003  read_len:[36],
DRR000004  read_len:[36],
DRR000005  read_len:[36],
DRR000006  read_len:[36],
DRR000007  read_len:[36],
DRR000008  read_len:[36],
DRR000009  read_len:[36],
DRR000010  read_len:[36],
DRR000011  read_len:[36],
DRR000012  read_len:[36],
DRR000013  read_len:[36],

(...)
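To go from that listing to the runs worth keeping, something like the following sketch could be appended to the pipeline. It is not part of the original answer: the function name `keep_long_runs` is made up, and it assumes each line looks like the output above (run ID first, then `read_len:[...]`, with or without a quote after `read_len`). A run is kept only if every read length in the brackets is at least the threshold.

```shell
# Sketch: print run IDs whose reads are all at least MIN bases long (default 50).
# Input lines are assumed to look like:  DRR000001  read_len:[36,36],
keep_long_runs() {
    awk -v min="${1:-50}" '
        /read_len"?: *\[[0-9, ]+\]/ {
            # Isolate the comma-separated list between "[" and "]"
            lens = $0
            sub(/^.*\[/, "", lens)
            sub(/\].*$/, "", lens)
            n = split(lens, a, ",")
            ok = (n > 0)
            for (i = 1; i <= n; i++) {
                if (a[i] + 0 < min) ok = 0
            }
            if (ok) print $1
        }
    '
}

# Usage: append to the loop above, e.g.
#   ... ; done | uniq | keep_long_runs 50
```

Lines that do not contain a `read_len` list are skipped silently, so runs for which the web service returned nothing simply drop out.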

 

written 4.3 years ago by Pierre Lindenbaum116k