How To Retrieve Data From JGI Automatically Given A Set Of IDs?
7.5 years ago

Hi, I need to retrieve genomes and metagenomes (assemblies or raw sequences) from the JGI databases. This is doable, though not easy, using the HTML forms and following links. However, I need to repeat the process hundreds of times, and I would rather not do it all by hand.

JGI does provide users with (very brief) API documentation and an XML schema (XSD) for the API (usually understood only by Pierre, alias @yokofakun ;-) ), but I cannot even get the curl "signing on" command to work. Do you know a way to automate this task (e.g. using R, Python, or any GNU tool) given some IDs (such as project or sample IDs)?

Thanks, Manu

r python api xml • 3.9k views

Hi glarue,

I have read the script "jgi-query.py" from https://github.com/glarue/jgi-query, but I don't understand it yet.

Best, Bing


Geez, sorry to have missed this for so long—my notification settings must not be set up correctly.

The answer to your question depends on what you mean by "metagenome" and on how JGI structures its databases, although I fear the answer may be "no". Basically, you have to provide a category to jgi-query, and all of the files organized under that category will be listed. If you are interested in multiple fungal genomes, for example, you can use the query fungi to retrieve a (huge) list of all available files, and then download individual files from within that set (probably using the regex option, r, at the prompt). If the species you are interested in are not in fungi, you will have to experiment to find a query broad enough to include everything you need.

jgi-query was originally designed for grabbing files on a per-species basis. It can also download large file sets, but how well that works depends on your specific needs. Hope that helps clarify things.
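To illustrate the regex-based selection idea mentioned above, here is a minimal Python sketch that filters a file listing with a regular expression. The file names and the `select_files` helper are hypothetical, made up for this example; how jgi-query applies a regex internally is not shown in this thread, so treat this only as an illustration of the concept.

```python
import re


def select_files(filenames, pattern):
    """Return the file names that match a regular expression.

    Hypothetical helper for illustration; not part of jgi-query.
    Matching is a case-insensitive search anywhere in the name.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    return [name for name in filenames if regex.search(name)]


# Made-up listing; a real listing would come from the JGI portal.
listing = [
    "Exampleus_speciesA_AssemblyScaffolds.fasta.gz",
    "Exampleus_speciesA_GeneCatalog_proteins.aa.fasta.gz",
    "Exampleus_speciesB_AssemblyScaffolds.fasta.gz",
    "Exampleus_speciesB_README.txt",
]

# Keep only the assembly files, whatever the species prefix is.
assemblies = select_files(listing, r"AssemblyScaffolds\.fasta\.gz$")
for name in assemblies:
    print(name)
```

Anchoring the pattern on a shared suffix rather than on species names is what makes this kind of selection practical for a very large listing.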

5.9 years ago
glarue ▴ 50

I know this is a late response, and it may not do exactly what you need, but feel free to check out a script I wrote to do something similar here: https://github.com/glarue/jgi-query

It's written in Python and runs from the command line. I haven't tested it on Mac or Windows, but it should (theoretically) work there too, as long as cURL and Python are installed.

Hope it helps, if you still need it!

EDIT: while jgi-query was designed primarily to download various files for a single organism, you can download very large datasets with it as well by using higher-level phylum names and range-formatted file selection syntax. For example, you can retrieve the entire fungal database with the command "jgi-query fungi", although selecting specific subsets of files can become onerous with large databases.
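For the original use case (repeating the retrieval for many IDs), one possible approach is to drive the command above from a small Python loop. The sketch below assumes the tool is installed and reachable on the PATH as `jgi-query`, as in the command quoted above; the `build_commands` and `run_all` helpers are hypothetical names introduced here. Because the tool asks for file selections at a prompt, each run would presumably still need input, so this only automates launching the queries.

```python
import shutil
import subprocess


def build_commands(queries, executable="jgi-query"):
    """Build one argument list per query, e.g. ["jgi-query", "fungi"].

    Assumption: the tool is invoked as `<executable> <query>`, as in the
    "jgi-query fungi" example; no other options are assumed here.
    """
    return [[executable, query] for query in queries]


def run_all(queries, executable="jgi-query"):
    """Run each query in turn, stopping at the first failure."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} was not found on the PATH")
    for command in build_commands(queries, executable):
        # stdin/stdout are inherited, so the tool's prompts stay usable.
        subprocess.run(command, check=True)


# Replace with your own organism or category names.
queries = ["fungi"]
commands = build_commands(queries)
for command in commands:
    print(" ".join(command))

# Uncomment to launch the queries once the tool is installed:
# run_all(queries)
```

Passing each command as a list instead of a single shell string avoids quoting problems if a query name contains spaces or special characters.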