Thoughts On Galaxy For Next Generation Sequence Data Analysis
5
5
10.3 years ago
Travis ★ 2.8k

Hi,

We would like to be able to save workflows for use by non-bioinformatics users in-house. Whilst going through the laborious task of stringing together workflows into individual scripts I began to wonder if Galaxy might represent a less labor-intensive option...

Therefore I am trying to get a feel for who uses Galaxy for their NGS analyses (or who doesn't) and why.

How are upload/download speeds? I can only assume that uploading large datasets would be a nightmare and we plan to have some very large data sets. Data processing speeds would also be important - what kind of power does the public version offer?

Do people use the public instance or create local versions? I assume to have custom workflows immediately available it would be necessary to have a local instance. This would eliminate the uploads and may mean faster processing too.

I would like to get people's opinions before I dedicate real time to it.

next-gen sequencing galaxy • 9.5k views
8
10.3 years ago

Your question is very much in the spirit of the times, e.g. from Peter Cock's ISMB Twitter feed:

Do you want to #usegalaxy too? RT@passDan Getting the feeling that Galaxy is the cool kid and everyone wants to be his friend. #bosc2011

Farhat's legitimate space considerations aside (they also apply to some extent to non-Galaxy workflows), I would say that the answer to your query is a definitive "yes": Galaxy is a very serious contender for remote and local NGS analyses. I would even venture to say that it was the implementation of NGS tools and Cloud images by Galaxy in 2010 that has led to the explosive growth in Galaxy users over the last 12 months, as evidenced by posts to the Galaxy users mailing list.

I know half a dozen or so wet-lab biologists who use Galaxy to do their own NGS analyses because: 1) they don't have to install code, grok UNIX or program; 2) they get free storage and compute; and 3) they can share their results with supervisors, collaborators, etc.

There is also a big push from bioinformaticians to use Galaxy. You can get a real sense of this on the Galaxy developers mailing list. We are currently rolling out a local Galaxy installation in our bioinformatics core facility so we can provide NGS results to users via an interface they understand. The aim is to use Galaxy to cut down on the time required to explain results and protocols, and to allow users to perform their own follow-up analyses. We are just trialling this now, and though local versions on a desktop are easy to get going and customize for NGS work, we have not launched a production instance, so I can't report on that yet.

Lastly, there is going to be really productive interaction in the future between Galaxy and Taverna, the two major players in the bioinformatics workflows market. I predict a synergistic co-evolution between Galaxy and Taverna, similar to what was observed between the UCSC and Ensembl Browsers, that will generate a lot of new functionality specifically in the area of NGS.

2

I forgot to add one major advantage of Galaxy, the automatic recording of all the steps in a workflow along with the parameters. It is quite easy to forget to record how a one-off analysis was performed if one is not used to doing that.
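For scripts run outside Galaxy, a rough approximation of this record-keeping is to log every command and its arguments as it is executed. The sketch below is only an illustration of the idea, not how Galaxy does it: the function name `run_and_record` and the JSON-lines log format are assumptions made for this example.

```python
import json
import subprocess
import sys
from datetime import datetime, timezone


def run_and_record(cmd, log_path):
    """Run a command (list of strings) and append what was run to a log file.

    Hypothetical helper: one JSON object per line, so the log can be
    re-read later to see exactly which parameters were used.
    """
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(cmd, capture_output=True, text=True)
    entry = {
        "command": cmd,                  # full argument list, parameters included
        "started_utc": started,
        "returncode": result.returncode,
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")
    return result
```

Calling `run_and_record([sys.executable, "-c", "print('ok')"], "provenance.jsonl")` runs the command and leaves one line in `provenance.jsonl` describing it.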

0

+1 for the answer but I wish I could do another +1 for quoting my tweet from ISMB!


7
10.3 years ago

I would say that Galaxy is definitely a very good option for this! If you want to use their online version, though, it might be impossible for large datasets (they have a 1Gb limit on the upload size, so for many NGS datasets it's not an option).

But I have been using some of their Python scripts in my own pipelines and they are very useful! The whole Galaxy project is open source and easily installable. I use it in combination with my own scripts and build my pipelines this way.

If you have a webserver, I would recommend you install your own version of Galaxy on it and then add your own scripts to it. You would then be able to build the complete pipeline and have it ready online for your users.

3

Yes, actually, any language should work as long as it's installed on your server. Galaxy is "just" a front-end to a lot of scripts, with the appropriate documentation in XML files.

Here are some details on how to incorporate a new script: http://wiki.g2.bx.psu.edu/Admin/Tools/Add%20Tool%20Tutorial
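As a sketch of the kind of script that can sit behind such an XML description, here is a small command-line program that takes an input path and an output path as arguments. Everything in it is an assumption made for illustration (the task of counting FASTQ records, the function names, the argument order); the XML wrapper itself is not shown, so follow the tutorial above for that part.

```python
import argparse
import sys


def count_fastq_records(path):
    """Count records in an uncompressed FASTQ file (4 lines per record)."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def main(argv):
    # Input and output are plain file paths supplied on the command line,
    # so any front-end that can build a command line can call this script.
    parser = argparse.ArgumentParser(description="Count FASTQ records")
    parser.add_argument("input_fastq")
    parser.add_argument("output_tsv")
    args = parser.parse_args(argv)

    n_records = count_fastq_records(args.input_fastq)
    with open(args.output_tsv, "w") as out:
        out.write(f"records\t{n_records}\n")
    return 0


# Only act as a command-line tool when arguments were actually given.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

The same shape (read the paths from the arguments, write the result to the given output file) works just as well in Perl or any other language installed on the server.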

0

Can custom scripts be written in Perl?

0

The file size limit is imposed by what can be uploaded through a browser. You can also upload to Galaxy main (and to a local instance after configuration) by FTP: http://wiki.g2.bx.psu.edu/Learn/Upload%20via%20FTP

6
10.3 years ago
Mnkyboy ▴ 60

I am a big fan of using Galaxy on the cloud for RNA-seq. I use it often to offload from our local servers and to test out data sets. Easy to use and very cost effective.

4
10.3 years ago
Farhat ★ 2.9k

Galaxy is a good option for workflows especially for nonprogrammers. I have a local install as well as occasionally use the main one. I do not use the main one for NGS as data transfer is a huge bottleneck. Also, occasionally you may have to wait before your analysis starts if they are busy. The local install is not very difficult to set up but one serious issue I faced with Galaxy is that it stores the results of every step in uncompressed format. Thus, e.g. a command like

bwa samse ~/genomes/hsap/hg19.fa sampleTF8.sai sampleTF8.de.fastq.gz |samtools view -bS -|samtools sort - sampleTF8


in Galaxy will end up storing the uncompressed fastq file, the sam file resulting from alignment, the bam file and the sorted bam file. This can lead to heavy disk activity, which can slow down the analysis unless you have plenty of fast storage. Another thing I noticed with Galaxy (though it may be my install) was that simple tasks like uploading a file would peg one core of the CPU at 100%.
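The one-liner above avoids intermediate files because each program streams its output straight into the next one. For anyone scripting this rather than typing it in a shell, the same streaming can be sketched in Python with `subprocess`. The helper name `run_piped` is made up for this example, and the commands passed to it are placeholders, not a tested bwa/samtools invocation.

```python
import subprocess
import sys


def run_piped(commands, output_path):
    """Run commands as one pipeline; only the final output touches the disk.

    `commands` is a list of argument lists. Each process reads the previous
    process's stdout through a pipe, like `a | b | c > output_path`.
    """
    procs = []
    with open(output_path, "wb") as out:
        for i, cmd in enumerate(commands):
            is_last = i == len(commands) - 1
            proc = subprocess.Popen(
                cmd,
                stdin=procs[-1].stdout if procs else subprocess.DEVNULL,
                stdout=out if is_last else subprocess.PIPE,
            )
            if procs:
                # Close our copy of the pipe so the upstream process sees
                # a broken pipe if the downstream one exits early.
                procs[-1].stdout.close()
            procs.append(proc)
        # All processes run concurrently; collect their exit codes.
        return [p.wait() for p in procs]
```

Each element of the returned list is the exit code of the corresponding command, so a non-zero entry shows which stage of the pipeline failed.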