Question: Find NGS datasets on GEO that have a certain read length?
4.3 years ago by
justin.nelligan60 wrote:

I plan to cross-correlate a large dataset we are generating with public data, for example from GEO. The only problem is that I want experiments sequenced at or above a certain read length (i.e. excluding anything sequenced at only 36 bp, for mappability reasons).

I haven't found advanced filters or any other way to filter GEO, SRA or DDBJ by read length. Is it possible to do such a restricted search?


4.3 years ago by
Istvan Albert ♦♦ 79k
University Park, USA
Istvan Albert ♦♦ 79k wrote:

You could try fetching the runinfo for all samples, then filter on the average length column (avgLength, the 7th field). I think that is either the read length or, for paired-end runs, twice the read length. Also look at the other columns. For a single run it would be:

esearch -db sra -query SRR1613384 | efetch -format runinfo | cut -d ',' -f 1,7

You can expand the query to match more entries (quote the wildcard so the shell does not try to expand it):

esearch -db sra -query "SRR1613*" | efetch -format runinfo | cut -d ',' -f 1,7

It may be possible to search on this average length directly, but it is not clear how.
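To go from listing the average length to actually filtering on it, something along these lines might work. This is only a sketch: the function name filter_by_read_length and the default threshold of 50 are made up for illustration, the columns are looked up by the header names Run, avgLength and LibraryLayout (an assumption about the runinfo header), and halving the value for PAIRED runs follows the guess above that the average length is twice the read length in paired runs.

```shell
# Keep runs whose per-mate read length is at least MIN (default 50).
# Reads a runinfo CSV on stdin and prints "Run,avgLength,LibraryLayout".
# Assumptions: the header names the columns avgLength and LibraryLayout,
# avgLength is the summed length of both mates for PAIRED runs, and no
# field before those columns contains a quoted comma.
filter_by_read_length() {
    awk -F ',' -v min="${1:-50}" '
        # A header line starts with "Run"; it can repeat in concatenated output.
        $1 == "Run" {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Skip blank lines and anything seen before the first header.
        NF == 0 || !("avgLength" in col) { next }
        {
            avg = $(col["avgLength"]) + 0
            layout = ("LibraryLayout" in col) ? $(col["LibraryLayout"]) : "NA"
            len = (layout == "PAIRED") ? avg / 2 : avg
            if (len >= min) print $1 "," avg "," layout
        }
    '
}
```

It could then be appended to the pipeline above, e.g. esearch -db sra -query "SRR1613*" | efetch -format runinfo | filter_by_read_length 50. It is worth checking a few runs by hand before trusting the halving rule.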

Edit: the code above requires NCBI's Entrez Direct (the Entrez utilities on the Unix command line).

4.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum116k wrote:

The data in SRA are listed today in

Using the Firefox console, I see an NCBI web service loading a few reads for a given run using the following URL: "". The result is a JSON file.

All in one, a quick-and-dirty solution (I don't parse the JSON) is:


 curl -s "" | cut -f2  | grep -v ERR | while read L; do curl -s  "${L}" |  grep read_len | sed "s/^/${L}  /" ; done | uniq
DRR000001  read_len:[36,36],
DRR000002  read_len:[36,36],
DRR000003  read_len:[36],
DRR000004  read_len:[36],
DRR000005  read_len:[36],
DRR000006  read_len:[36],
DRR000007  read_len:[36],
DRR000008  read_len:[36],
DRR000009  read_len:[36],
DRR000010  read_len:[36],
DRR000011  read_len:[36],
DRR000012  read_len:[36],
DRR000013  read_len:[36],
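The listing above still has to be filtered by length. A small sketch of that step follows, assuming the lines keep exactly the "RUN  read_len:[a,b]," shape shown here; the function name filter_read_len_lines and the default threshold of 50 are hypothetical.

```shell
# Keep runs whose shortest read is at least MIN (default 50).
# Reads lines such as "DRR000001  read_len:[36,36]," on stdin and
# prints the run accession and its read lengths, tab-separated.
filter_read_len_lines() {
    awk -v min="${1:-50}" '
        {
            lo = index($0, "[")
            hi = index($0, "]")
            # Ignore lines without a well-formed [..] list.
            if (lo == 0 || hi <= lo) next
            lens = substr($0, lo + 1, hi - lo - 1)
            n = split(lens, a, ",")
            if (n == 0) next
            # A paired run lists one length per mate; use the shortest.
            shortest = a[1] + 0
            for (i = 2; i <= n; i++)
                if (a[i] + 0 < shortest) shortest = a[i] + 0
            if (shortest >= min) print $1 "\t" lens
        }
    '
}
```

Piping the output of the loop above through filter_read_len_lines 50 would drop every run listed here, since they are all 36 bp.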



