17 months ago

Dear all,

What are good bioinformatics servers where user data can be uploaded and then analyzed using pre-existing pipelines or user-defined custom pipelines or both?

I was alerted to Galaxy Server. Is it free to use for an academic undergraduate researcher based in the US? And are there limits? How do I register? Please share any relevant links (I am overwhelmed by Galaxy-related info online).

One graduate alumnus also pointed out an XSEDE allocation that can also be used on iPlant / CyVerse. But he said this is old info and is not sure how easy it would be for me with "minimal ssh setup and experience". And that I would need to submit a proposal etc. Your thoughts?

Some context: I am looking for a resource with high RAM because I need to perform some genome assemblies (>100GB memory) and also genome-genome alignment with something like Cactus (also easily >100GB memory). I do not need >500GB or 1TB of storage because I can back up input and output locally, with 2-3X redundancy.

I am surprised that as an undergraduate you are in charge of organizing the computational infrastructure for your research. No offense, but this is not something an undergraduate can or should do; it is something an experienced supervisor, who knows what the planned analyses require in terms of storage, CPU and memory, must provide. I suggest you talk to your supervisor and ask for help. You most likely will need local guidance because as an undergraduate you simply lack the experience (which is totally normal, don't get me wrong) to be on your own for a research project. Big data do not make the job any easier as everything simply takes longer, even trivial things such as upload and download between the remote instance and local backup. I say this as someone who had to learn all the bioinformatics required for my current job on my own... it takes time, lots of time and much frustration, and many mistakes will be made.

=> Get local guidance if you want to be productive rather than reinventing every wheel in the world, especially as an undergraduate. Otherwise it can well be that weeks and months pass and, despite having solved a lot of computational problems, the actual science comes up short.

Yes, it's more fun to perform the analyses and interpret the results, versus spend time in just setting up the resources! I understand your advice, thanks for that, and I also agree. Which is why I am on the forum asking for advice :)

A local supervisor who is familiar with what you do is 100 times more valuable than strangers on the internet :)

I was aiming for consensus advice which is less likely to be wrong than 1 person's advice. But I see your point about a local sup's familiarity with my skillset, abilities, and goals.

Often local institutions have hardware available that one is not even aware of. I just recently learned that my university has a decent cloud computing environment which simply never came to my attention. Check with the local IT department; maybe there is some server open to you that you do not even know about.

If you are dealing with human data then you should not use any public resources without approval from your institutional IT/research czars. Unauthorized transfer of human data to public resources is simply asking for trouble.

It's not human data I work with...

Besides CyVerse there are probably no other servers that are going to be free. Sounds like you need to find a collaborator who would have the necessary expertise and access to internal infrastructure. Paying for cloud resources is another option.

• Unfortunately, paying for cloud services is not an option since it is not budgeted
• By "internal infrastructure" do you mean knowledge of how to upload, access, download, setup jobs, execute, monitor etc.?
• And are there reasons why UseGalaxy.org is not a viable option for me? Because of RAM requirements and/or my need for software tools like Cactus that are not yet available there?

You can use Galaxy within the limits stipulated by the resources described on this page. You are not going to get SSH access to Galaxy servers. This is meant to be a GUI-based tool.

Super useful link, thanks a TON!

When I came to this dilemma, I ended up just buying a 16-core rack server from, I think, Amazon. I also bought 128GB of RAM for the server on eBay, and just installed Ubuntu to do all my analysis. I bargained for the RAM on eBay, and got a decent deal.

Since you are an undergrad, there may be a bioinformatics course at the graduate or undergraduate level you could take to learn some of the skills. I'm pretty sure most universities allow undergrads to take grad courses. You could probably even use your university cluster through the course, if they have one. There are plenty of tutorials online for analysis pipelines that you could go through.

The above would definitely be an investment of time and energy, but could be worth it?

Could you please share Amazon links to your purchases? It's something my research group may be interested in (or not, I'm not sure...). Also, my univ's HPCC caps RAM at 100GB for free users, so that's my limitation for now.

See, this is what I mean. You have a local HPC but consider buying stuff on your own or using external solutions? Are the costs for the runs on the HPC so aberrantly high? You could set up scripts and workflows on a local PC using a small but representative dataset and then simply pay for the computation time on the HPC. Otherwise you will have to work yourself into using Galaxy (or whatever you use) or spend time installing and setting up a local workstation. I would go for the HPC if given the opportunity. Everything else takes time not spent on science.
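To make the "small but representative dataset" idea concrete, here is a minimal sketch of drawing a random subset of reads from a FASTQ file so a workflow can be tested on a local PC first. It assumes plain 4-line FASTQ records and uses single-pass reservoir sampling; the function names `read_fastq` and `subsample` are made up for illustration, and whether a random subset is representative enough for an assembly pilot is a judgment call that depends on the data.

```python
import random
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: (header, sequence, separator, qualities)
FastqRecord = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from plain 4-line FASTQ text (assumption: no wrapped sequences)."""
    it = iter(lines)
    while True:
        header = next(it, None)
        if header is None:
            return  # end of input
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        yield header, seq, plus, qual


def subsample(records: Iterable[FastqRecord], k: int, seed: int = 0) -> List[FastqRecord]:
    """Reservoir sampling: keep a uniform random sample of k records in one pass."""
    rng = random.Random(seed)
    reservoir: List[FastqRecord] = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec  # replace with probability k / (i + 1)
    return reservoir


if __name__ == "__main__":
    # Tiny in-memory example; for a real file, pass an open file handle instead.
    demo = []
    for n in range(1000):
        demo += [f"@read{n}\n", "ACGTACGT\n", "+\n", "IIIIIIII\n"]
    pilot = subsample(read_fastq(demo), k=50, seed=42)
    print(len(pilot), "reads kept for the local pilot run")
```

Only `k` records are held in memory at a time, so this works on files much larger than RAM. For paired-end data, running it with the same seed on both files should keep mates together, provided both files list the same reads in the same order.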

ATpoint has a good point. Learning to work with HPC clusters will be a good skill to have as well esp. when/if you search for jobs in the field : )

I, personally, didn't have access to my university cluster at the time, lol, so I just purchased a server myself. Definitely granted me flexibility and freedom.

Not for an experimental bench scientist. You should leave system administration tasks to qualified system admins whose day job fits that description. In this day and age, one data breach (god forbid if it is human/HIPAA data) can lead to a substantial financial hit (fines etc.) to your institution. They are not going to look kindly on you if you were responsible for it. Always follow institutional policy related to IT and research data.

HPC is free to use. But not when my RAM requirements > 100GB... should have made that clear in my original post, sorry

HP Proliant DL360p G8 8 Bays 2.5 Server - 2X Intel Xeon E5-2680 2.7GHz 8 Core...

HP 731761-B21 8GB 1Rx4 PC3-14900R 1x8GB 731657-081 Server Memory (bought 16 of these and made an offer for a bargained price)

https://www.ebay.com/itm/373149113930

good to know, thanks! - should be plan B or C. Prefer to keep # of moving parts to a minimum, but storing away this info for future use, just in case :)

17 months ago

In the US you can use usegalaxy.org; the main European version is at usegalaxy.eu, but there are also a few national instances. Check the learning material, maybe start with the Galaxy 101 tutorial. There's also a guided tour of the interface. Public instances of Galaxy are free to use; you just need to register an account on the server you want to use.
Then there are also some academic clouds that can be free to use though I don't know about the US. You can find European resources by browsing the European Open Science Cloud marketplace catalog.

Thanks for your suggestions and the links. I will check them out shortly.

Are the EU servers IP restricted to EU users? Or can I access / use them from the US?

I cannot imagine any pros or cons to using EU vs. US local servers, but I am not yet a Galaxy user, so I don't want to assume anything.

I am not aware of IP-based restrictions but there are definitely restrictions on the amount of resources one can get and these vary from one service to the next. However, sometimes it is also possible to ask for more resources than the default.
Reading the discussion above, I definitely agree with ATpoint that you shouldn't be alone on this and a well thought out research project should have planned for the necessary resources. I second the suggestion of going with your local HPC. This is almost always going to be cheaper in the long run than many alternatives, including building your own server if you include the time spent on building and maintaining it. Familiarize yourself with the environment and set up pilot runs on the free tier and upgrade the resources when you're ready to go but don't expect a smooth run even at this stage as scaling up is not always easy.

17 months ago
Dunois ★ 2.2k

I propose something slightly different: there are a lot of bioinformatics groups out there, and even more departments, cloud service providers, etc., with a couple of servers lying around. If you ask nicely, you might be able to get some time on one of these machines; or a price discount of some sort if you're asking a business. For some outfit with a couple of thousand CPU cores and hundreds of terabytes of RAM, sparing a small fraction as an act of goodwill shouldn't be too much of a concern.

If you're going with the BYOC route that Pratik Mehta suggested, look for deals for individual parts on eBay also: plenty of ex-server room stuff shows up there regularly. ECC DDR3 RAM (maybe DDR4 also) can be had on the cheap if you contact sellers operating out of mainland China or Hong Kong. You might actually end up with a cheaper build getting the mainboard, CPU(s), and RAM separately (and throwing them into a tower case with a good PSU) as opposed to buying a pre-built server, for instance. With a server, you also have a bit of a disadvantage in that if the PSU or the mainboard goes, getting spares might be a pain. Plus you don't need that kind of noise in your apartment.

thanks for your insights and inputs

