Find NGS datasets on GEO that have a certain read length?
9.5 years ago

Hi everyone,

I plan to cross-correlate a large dataset we are generating with public data, for example from GEO. The only problem is that I want only experiments sequenced with a certain read length (i.e. excluding anything sequenced at only 36 bp, for mappability reasons).

I haven't found advanced filters or any other way to filter GEO, SRA or DDBJ by read length. Is it possible to do such a restricted search?

Thanks!

NGS ChIP-Seq • 2.6k views

many thanks

9.5 years ago

You could try fetching the runinfo for all samples and then filtering on the average length column (avgLength). I think that is either the read length or, in paired runs, twice the read length. Also look at the other columns. For a single run it would be:

esearch -db sra -query SRR1613384 | efetch -format runinfo | cut -d ',' -f 1,7

You can expand the query to match more entries:

esearch -db sra -query "SRR1613*" | efetch -format runinfo | cut -d ',' -f 1,7

It may be possible to search on this average length directly, but it is not clear how.
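As a rough sketch of the filtering step, the function below reads runinfo CSV on standard input and prints the runs whose estimated per-read length meets a minimum. The function name `filter_runinfo` is made up for illustration. It assumes the header contains `Run`, `avgLength` and `LibraryLayout` columns, that `avgLength` covers both mates in `PAIRED` runs (so it is halved), and that no field before those columns contains an embedded comma. Check these assumptions against a few runs you know before trusting the result.

```shell
# Hypothetical helper: keep runs whose estimated per-read length is >= MIN.
# Usage: ... | efetch -format runinfo | filter_runinfo 50
filter_runinfo() {
  awk -F ',' -v min="${1:-50}" '
    # Header line: remember each column position by name.
    # The header may repeat in the stream, so it is re-read every time.
    $1 == "Run" {
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    # Skip blank lines and anything seen before a usable header.
    NF == 0 || !("avgLength" in col) { next }
    {
      len = $(col["avgLength"])
      # Assumption: avgLength is the sum of both mates in paired runs.
      if (("LibraryLayout" in col) && $(col["LibraryLayout"]) == "PAIRED")
        len = len / 2
      if (len + 0 >= min + 0) print $(col["Run"]) "\t" len
    }
  '
}
```

For example, `esearch -db sra -query "SRR1613*" | efetch -format runinfo | filter_runinfo 50` would list the matching runs with a per-read length of at least 50, one run per line, followed by a tab and the estimated length.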

Edit: the code above requires Entrez Direct; see "NCBI Releases Entrez Direct, The Entrez Utilities On The Unix Command Line".

9.5 years ago

The data in SRA are listed today in ftp://ftp-trace.ncbi.nih.gov/sra/reports/Datalist/NCBI_SRA_Datalist_20140926

Using the Firefox console, I see an NCBI web service that loads a few reads for a given run using the following URL: "http://trace.ncbi.nlm.nih.gov/Traces/sra/?run_spot=(id)". The result is a JSON file.

All in one, a quick and dirty solution (I don't parse the JSON) is:

 curl -s "ftp://ftp-trace.ncbi.nih.gov/sra/reports/Datalist/NCBI_SRA_Datalist_20140926" | cut -f2  | grep -v ERR | while read L; do curl -s  "http://trace.ncbi.nlm.nih.gov/Traces/sra/?run_spot=${L}" |  grep read_len | sed "s/^/${L}  /" ; done | uniq
DRR000001  read_len:[36,36],
DRR000002  read_len:[36,36],
DRR000003  read_len:[36],
DRR000004  read_len:[36],
DRR000005  read_len:[36],
DRR000006  read_len:[36],
DRR000007  read_len:[36],
DRR000008  read_len:[36],
DRR000009  read_len:[36],
DRR000010  read_len:[36],
DRR000011  read_len:[36],
DRR000012  read_len:[36],
DRR000013  read_len:[36],

(...)
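
To go from this listing to the runs that are long enough, a filter along the following lines could work. It is only a sketch: `filter_read_len` is a made-up name, and it assumes the input lines look exactly like the output above (run accession, whitespace, then `read_len:[...]`). A run is kept only if every read length in the brackets meets the minimum.

```shell
# Hypothetical helper: keep runs where every listed read length is >= MIN.
# Input lines are expected to look like: DRR000001  read_len:[36,36],
filter_read_len() {
  awk -v min="${1:-50}" '
    /read_len:\[/ {
      lens = $0
      sub(/^.*read_len:\[/, "", lens)   # drop everything up to the "["
      sub(/\].*$/, "", lens)            # drop the "]" and what follows
      n = split(lens, a, ",")
      ok = (n > 0)
      for (i = 1; i <= n; i++)
        if (a[i] + 0 < min + 0) ok = 0
      if (ok) print $1 "\t" lens
    }
  '
}
```

Piping the output of the one-liner above into `filter_read_len 50` would print the accession and the read lengths of each run that passes, separated by a tab.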