Question

Find NGS datasets on GEO that have a certain read length?


9.5 years ago

justin.nelligan ▴ 70

Hi everyone,

I plan to cross-correlate a large dataset we are generating with public data, for example from GEO. The only problem is that I want experiments done with a certain read length (e.g. excluding anything sequenced at only 36 bp, for mappability reasons).

I haven't found a way to use advanced filters or some other way to filter GEO or SRA or DDBJ. Is it possible to do such a restricted search?

Thanks!

NGS ChIP-Seq • 2.6k views

updated 2.2 years ago by Ram 43k • written 9.5 years ago by justin.nelligan ▴ 70


many thanks

9.5 years ago by Elnaaz ▴ 40

Ram · Accepted Answer · 2014-10-17

You could try fetching the runinfo for all samples and then filtering on the average length column (field 7 in the commands below). I think that is either the read length or, for paired runs, twice that. Also look at the other columns. For a single run it would be:

esearch -db sra -query SRR1613384 | efetch -format runinfo | cut -d ',' -f 1,7

You can expand the query to match more entries (quote the wildcard so the shell does not try to expand it):

esearch -db sra -query 'SRR1613*' | efetch -format runinfo | cut -d ',' -f 1,7

It may be possible to search on this average length directly, but it is not clear how.

Edit: the code above requires NCBI Entrez Direct (EDirect), the Entrez utilities on the Unix command line.
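The filtering step itself could look something like the sketch below. It is only a sketch under a few assumptions: the function name `filter_runinfo_by_length` is made up for illustration, the runinfo header is assumed to contain columns named `avgLength` and `LibraryLayout`, the average length is assumed to sum both mates in paired runs (as guessed above), and fields are assumed not to contain embedded commas. Check a few runs by hand before trusting it.

```shell
# Hypothetical helper: keep runs whose estimated per-read length is >= MIN_LEN.
# Reads runinfo CSV on stdin and prints "Run,per_read_length" lines.
# Possible use (needs EDirect and network access):
#   esearch -db sra -query 'SRR1613*' | efetch -format runinfo | filter_runinfo_by_length 50
filter_runinfo_by_length() {
  awk -F ',' -v min="$1" '
    # Header line (may be repeated in the output): find columns by name.
    $1 == "Run" {
      for (i = 1; i <= NF; i++) {
        if ($i == "avgLength") len_col = i
        if ($i == "LibraryLayout") layout_col = i
      }
      next
    }
    # Data line: skip blank lines and anything seen before a header.
    len_col && $1 != "" {
      len = $len_col
      # Assumption: avgLength covers both mates in paired runs, so halve it.
      if (layout_col && $layout_col == "PAIRED") len = len / 2
      if (len + 0 >= min + 0) print $1 "," len
    }
  '
}
```

Looking the columns up by name rather than by position keeps the filter working if the column order ever differs from what `cut -f 1,7` assumes.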

Ram · Accepted Answer · 2014-10-17

The data in SRA are listed today in ftp://ftp-trace.ncbi.nih.gov/sra/reports/Datalist/NCBI_SRA_Datalist_20140926

Using the Firefox console, I see that an NCBI web service loads a few reads for a given run from the following URL: "http://trace.ncbi.nlm.nih.gov/Traces/sra/?run_spot=(id)". The result is a JSON file.

All in one, a quick-and-dirty solution (I don't parse the JSON) is:

 curl -s "ftp://ftp-trace.ncbi.nih.gov/sra/reports/Datalist/NCBI_SRA_Datalist_20140926" | cut -f2  | grep -v ERR | while read L; do curl -s  "http://trace.ncbi.nlm.nih.gov/Traces/sra/?run_spot=${L}" |  grep read_len | sed "s/^/${L}  /" ; done | uniq
DRR000001  read_len:[36,36],
DRR000002  read_len:[36,36],
DRR000003  read_len:[36],
DRR000004  read_len:[36],
DRR000005  read_len:[36],
DRR000006  read_len:[36],
DRR000007  read_len:[36],
DRR000008  read_len:[36],
DRR000009  read_len:[36],
DRR000010  read_len:[36],
DRR000011  read_len:[36],
DRR000012  read_len:[36],
DRR000013  read_len:[36],

(...)
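To get from that listing to the runs that are actually long enough, the output could be piped through a small filter. The following is a minimal sketch, assuming the lines look exactly like the ones printed above (`ACCESSION  read_len:[a,b],`); the function name `filter_read_len` is hypothetical.

```shell
# Hypothetical helper: reads lines such as "DRR000001  read_len:[36,36]," on
# stdin and prints the accessions whose reads are all at least MIN_LEN long.
filter_read_len() {
  awk -v min="$1" '
    match($0, /read_len:\[[0-9,]+\]/) {
      # Drop the 10-character "read_len:[" prefix and the closing "]".
      lens = substr($0, RSTART + 10, RLENGTH - 11)
      n = split(lens, a, ",")
      ok = 1
      for (i = 1; i <= n; i++) if (a[i] + 0 < min + 0) ok = 0
      if (ok) print $1
    }
  '
}
```

Appending `| filter_read_len 50` to the loop above would then print only the runs in which every read is at least 50 bases long.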