8.6 years ago
bioinfo ▴ 830

parsing • 14k views
8.6 years ago
5heikki 10k

I did something like this a while back:

cat keywordTableSortedUniqueIds.txt
mgm4440036.3
mgm4440037.3
mgm4440038.3
mgm4440039.3
mgm4440040.3
mgm4440041.3
mgm4440055.3
mgm4440056.3
..

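# Read one metagenome ID per line from the list above and download file 425.1 for each
while read -r line
do
curl "http://api.metagenomics.anl.gov/1/download/${line}?file=425.1" > "$line.gz"
done < keywordTableSortedUniqueIds.txt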

The "file=XXX" part specifies what exactly you want to download from the given metagenome, e.g. 425.1 here specifies predicted rRNA.
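
A rough sketch of the same idea with the file ID pulled out into a parameter, so a different stage can be requested without editing the URL. The function names build_url and download_all are made up for illustration; the URL pattern is simply the one from the curl command above, and I have not checked whether that endpoint still responds.

```shell
# Hypothetical helpers; the URL pattern is copied from the curl command above.

# Build the download URL for one metagenome ID and one file ID.
build_url() {
    local id="$1"
    local file="$2"
    printf 'http://api.metagenomics.anl.gov/1/download/%s?file=%s\n' "$id" "$file"
}

# Read metagenome IDs (one per line) from a list file and
# fetch the same file ID for each, saving it as <id>.gz.
download_all() {
    local list="$1"
    local file="$2"
    local id
    while IFS= read -r id; do
        [ -n "$id" ] || continue    # skip blank lines
        curl -sS "$(build_url "$id" "$file")" > "$id.gz"
    done < "$list"
}
```

With these defined, something like download_all keywordTableSortedUniqueIds.txt 425.1 should behave like the loop above.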


That was very helpful. I have just gone through the MG-RAST manual but didn't find much about the download stages or the "file=xxx"/"stage=xxx" parameters. You mentioned file=425.1 for predicted rRNA. Do you know which file number I should use for the raw, originally submitted metagenome FASTA sequences? I tried "file=100.2" but I'm not sure it is right.


Hey, I'm not sure you can get the raw data through the API. However, I think file=100.2 contains the reads/contigs that passed quality filtering. There's probably also a file that contains the reads/contigs that didn't pass QC, so you could combine the two if you really wanted the full set. You could always ask on the MG-RAST mailing list.


Thanks. I have decided to go with the reads that passed the QC filtering and dereplication stages.


Hi, thanks for these details about how to download data from the MG-RAST API. Did you add your webkey to access data that is not public yet, or were these public metagenomes? I tried adding my webkey:

but I just get a summary of the file info (bp_count etc), and I'm unable to download the fasta file.

Thank you!
Katrine

File 050.2 - This is the unfiltered metagenome that was originally uploaded to MG-RAST
File 100.1 - preprocess.passed.fna
File 100.2 - preprocess.removed (low quality)
File 350.2 & 350.3 - These are the protein coding genes (amino acids and nucleotides)
File 440.1 - These are predicted rRNA sequences (I do not recommend using MG-RAST for sensitive rRNA annotation. It does not use the internal structure of the gene, which other programs appropriately use for classification)
File 550.1 - This file shows clustered sequences which are 90% identical, to reduce the number of sequences that need to be annotated. Many folks don’t even know that this happens within MG-RAST.
File 650.1 & 650.2 - These files are essentially the blat tabular output from comparing your sequence to the database.
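
For use in a script, the list above could be wrapped in a small lookup like the sketch below. describe_file is a made-up name, and the labels are just shortened from this list. Note that earlier replies in this thread used other numbers (425.1 for rRNA, 100.2 for QC-passed reads), so it is probably worth checking the IDs against the download page of your own metagenome.

```shell
# Hypothetical lookup; IDs and labels are taken from the list above.
describe_file() {
    case "$1" in
        050.2)       echo "unfiltered metagenome as originally uploaded" ;;
        100.1)       echo "preprocess.passed.fna" ;;
        100.2)       echo "preprocess.removed (low quality)" ;;
        350.2|350.3) echo "protein-coding genes (amino acids and nucleotides)" ;;
        440.1)       echo "predicted rRNA sequences" ;;
        550.1)       echo "sequences clustered at 90% identity" ;;
        650.1|650.2) echo "BLAT tabular output against the database" ;;
        *)           echo "unknown file ID: $1" >&2; return 1 ;;
    esac
}
```

An unknown ID prints a message to stderr and returns a non-zero status, so a download script can stop before requesting something that is not in the list.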


Hello, what language is this, apart from curl? I have a Windows computer and am using curl through the command prompt, and I would like to do something similar to what you described.


It's Bash. I presume you could get Bash working on Windows with Cygwin or something but I haven't used Windows since XP so I can't really say. Alternatively, it probably doesn't take much effort to make a simple while read line loop with something that works on Windows by default like Python (?) or Java (?). If you plan to do lots of bioinformatics in the future, I suggest you ditch Windows for Linux or OS X.


Update: I downloaded Git Bash for Windows, and I think I am having success using Bash and curl with the command you listed. I am not very experienced with Bash yet, so I haven't added any echo tests to check that the script is working the way I think it is. I do agree with you that Windows is a hassle, but there are often workarounds. I will continue down this line for the people who use Windows and cannot afford to, or do not want to, switch operating systems.
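
On the echo tests: one way to check the loop before downloading anything is a dry run that only prints the command it would execute. This is a minimal sketch, with dry_run as a made-up function name and the URL pattern taken from the answer above; it reads IDs from standard input and never touches the network.

```shell
# Hypothetical dry run: print the curl command for every ID instead of running it.
# Usage: dry_run FILE_ID < list_of_ids.txt
dry_run() {
    local file="$1"
    local id
    while IFS= read -r id; do
        [ -n "$id" ] || continue    # skip blank lines
        echo "curl http://api.metagenomics.anl.gov/1/download/${id}?file=${file} > ${id}.gz"
    done
}
```

Running dry_run 425.1 < keywordTableSortedUniqueIds.txt should print one curl line per ID, which makes it easy to see whether the IDs are being read the way you expect.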
