Forum: I am really pissed off by the bioinformatics software world. Do/can we have a better solution?
8
4
4.7 years ago
moxu ▴ 480

I have been a professional software developer since 1998, after I got my master's degree in computer science. Yet, when I try to do some simple next-gen analysis, there are so many software tools with different flavors, which require different input file formats and different options, each with different pros and cons, and most of them are poorly documented. It's really a pain in the ass, and a shame for me, not being able to just "hook up" (pipeline) such programs together.

My question is: do we have a better solution? Say, one stop shop for NGS analysis? If not, who is interested in building up such a shop? Maybe we can figure out a better way to do NGS for everyone.

software error sequencing next-gen Forum • 3.2k views
7

Did you look at Galaxy? This isn't as relevant to what you are directly asking, but in a sense it is a "one stop shop for NGS analysis" for people who don't program. I think the variety of tools exists because different purposes/projects need different things (i.e. are you looking for SNPs? are you doing genome assembly? etc). MSU has a nice NGS analysis workshop that you can look into if you are frustrated.

0

Isn't the problem with Galaxy that you cannot pipeline with it, or that it's very hard to?

1

It's in fact very easy to do so and requires 0 computer knowledge. We have our bench scientists doing it.

0

Yeah? I mean to pipeline with python/perl/sh scripts.

0

That's certainly doable (the convenient Python interface is called BioBlend), but most people prefer to create a workflow and run it all within Galaxy.

4

Yet, when I try to do some simple next-gen analysis

Maybe, just maybe, all this is not as simple as you think it is. How long have you been in bioinformatics? And if you are a pro and don't like it, go and develop software somewhere else; there are many fields that need software developers. I don't know, I just have the feeling you might have been a software developer since 1998 but have - sorry - no idea about bioinformatics. Could that be the case? And if so - believe me, all this is more complicated than it might seem, and not because everybody in the bioinformatics software world is stupid.

3

How to really help... Don't make another 'better' solution. Use what already exists and improve it! Try really hard not to re-invent the wheel.

Yes. Sometimes a fundamentally different approach will require a new tool (alignment graph calling for example), but most of the proliferation in tools and especially formats is folks reinventing the wheel to implement rather marginal gains/improvements. We'd be far better off if that work went into common existing tools than creating new ones.

BTW: If a tool/project doesn't make improving it relatively easy, then it should die. There are perhaps some exceptions for particular commercial (human) analysis pipelines. However, for research, if you can't fix or improve the code and/or there aren't responsive maintainers, just don't use it. That narrows the choices down quite a bit.

2

There's diversity of software because there's diversity of requirements. That doesn't excuse poor documentation, but it explains (some of) the proliferation. To me, Galaxy is as close to a one stop shop as it's going to get. Have you also read "Scientific coding and software engineering: what's the difference" and "Why bad scientific code beats code following 'best practices'"?

18
4.7 years ago
1

Yeah, that's another problem. e.g. UCSC & NCBI builds.

5

My question is: do we have a better solution? Say, one stop shop for NGS analysis? If not, who is interested in building up such a shop? Maybe we can figure out a better way to do NGS for everyone.

I think you missed the comic strip irony.

0

Yes, I don't think this is a bioinformatics problem, it's true of computer programming/IT in general.

8
4.7 years ago
Benn 8.1k

If you don't like searching for the right tool every time you have to solve a biological problem, maybe bioinformatics just isn't something for you.

One size fits all just doesn't work here...

6

I agree.

Bioinformaticians must have an in-depth knowledge of the biology and biotechnology they are working with. Otherwise they are vulnerable to making mistakes and using the wrong tools. This knowledge can only be obtained after years of debugging, troubleshooting, and decision making.

0

... and publishing -_-;

1

that's a whole different can of worms.

1

In a lot of rising areas (e.g. Hi-C data processing), we do have to choose the right tools. In more established areas (e.g. variant calling and RNA-seq expression), having too many choices is a bad thing IMHO. Few users, in particular biologists, have the relevant experience to make the right choice. When I move into a new area and read reviews, I don't like those giving a page-long list of available software. I much prefer a review that tells me: use this tool; it is widely adopted, easy to use and has been shown to have good performance and accuracy. Similarly, one of my design rationales is to not make users choose unless they have to. A developer compared one of my tools to his and kindly asked me if I wanted to provide settings that would work better on his data set. I told him something like: "if the default does not work well, that is my fault; feel free to report it in your paper".

6
4.7 years ago

The closest one stop shop is Galaxy, which is wrapping many of the common tools. There are many formats and programs with pros and cons because there are many different goals and priorities. Documentation is generally poor because there's little if any incentive to write it (welcome to academia). I should note that at least the wide variety of formats and programs is not different from the commercial software world.

Anyway, in many cases there are common formats (e.g., fastq, BAM and VCF) used across programs, so pipelining is mostly a matter of choosing the parameters you want to use.
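As a rough illustration of that point, here is a minimal sketch of chaining an aligner into a sorter from Python. It assumes `bwa` and `samtools` are installed and on the PATH; the function names, the thread count and the read-group convention (sample name reused as the ID) are illustrative choices, not anything prescribed by either tool.

```python
import subprocess


def build_alignment_commands(reference, fastq, out_bam, sample, threads=4):
    """Return the two command lines for `bwa mem | samtools sort`."""
    # bwa accepts a literal backslash-t in -R and turns it into a tab.
    # Reusing the sample name as the read-group ID is an assumption here.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}"
    align = ["bwa", "mem", "-t", str(threads), "-R", read_group, reference, fastq]
    # The trailing "-" tells samtools sort to read from standard input.
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return align, sort


def run_pipeline(align, sort):
    """Run the two commands connected by a pipe; raise if either one fails."""
    aligner = subprocess.Popen(align, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(sort, stdin=aligner.stdout)
    # Close our copy of the pipe so the aligner sees a broken pipe if the sorter dies.
    aligner.stdout.close()
    sorter_rc = sorter.wait()
    aligner_rc = aligner.wait()
    if aligner_rc != 0 or sorter_rc != 0:
        raise RuntimeError(f"pipeline failed: bwa={aligner_rc}, samtools={sorter_rc}")
```

Calling `run_pipeline(*build_alignment_commands("ref.fa", "reads.fq", "out.bam", "sampleA"))` would run the chain; the file names are placeholders. Keeping command construction separate from execution makes the parameter choices easy to inspect and log.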

0

Although FASTQ is fairly standard at this point (thanks Illumina, I guess), I would argue that BAM and VCF are not standardized. A significant portion of the tools that require BAM/VCF input will produce an error with a random BAM/VCF. You frequently have to go back and add additional parameters to a previous step or just modify the file manually to make it actually work. Even very common workflows do not work with default settings (for example, BWA+GATK or STAR+Cufflinks).

1

I don't agree: BAM and VCF are standards, see https://samtools.github.io/hts-specs/.

FASTQ, on the other hand, is not: http://maq.sourceforge.net/fastq.shtml (is the quality offset +33? what goes in the header name?)

The problem is that the tools implement the standards wrongly.
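The "quality +33?" ambiguity can be made concrete with a small heuristic. This is only a sketch: the thresholds reflect the commonly described ASCII ranges of the +33 and +64 encodings, the function name is made up, and real data (for example very high quality scores in a +33 file) can fool it.

```python
def guess_phred_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) of FASTQ quality strings.

    Returns 33, 64, or None when the characters seen fit both encodings.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters this low are not produced by the +64 family of encodings.
        return 33
    if highest > 74:
        # Assumption: +33 data rarely goes above 'J' (Q41), so this looks like +64.
        return 64
    # Everything falls in the overlap of both ranges: cannot tell.
    return None
```

A tool that cannot decide should ask the user rather than pick silently, which is exactly the kind of choice that tends to go undocumented.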

1

Sure, there is a BAM standard, but it allows for optional tags. Different tools require different tags and may implement them in different ways. Technically, the BAMs may still be valid, but what good is a standard if the results are not really comparable?

4

I think sometimes people are a little harsh on these formats. Designing a format is intrinsically hard. You have to balance a lot of points. If you require too much, few would like to write the format; if you require too little, few would like to read the format. If you consider too few use cases in the format, the format is not generic enough; if you consider too many, the format becomes too complicated for people to comprehend or to use. In addition, some data models are just too difficult to represent. I have heard many people complain about the VCF format, but after years of discussions in the GA4GH circle, we still have not reached a good data model to elegantly describe variants; it's just too hard. Honestly, I used to complain about VCF as well, but then I asked myself: "can I fix it?" I stopped.

1

ah ok, I see your point.

1

More on your specific problem: interop of BAM files. BAM encourages tools to only look at required fields. Samtools was implemented this way. Requiring optional tags is often a fault of the downstream tool IMO. Sometimes a tool really has to read a tag for it to work, but this is domain-specific. We can't design SAM to demand all domain-specific knowledge. In fact, quite a few concepts were not available when SAM was designed. For example, TopHat had not been published by then. I am not sure there is a solution to the interop problem.
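To make the "only look at required fields" idea concrete, here is a minimal sketch of reading one SAM text line so that the eleven mandatory columns are always available and optional tags are treated as genuinely optional. The function name and the returned structure are illustrative, not part of any library.

```python
MANDATORY_FIELDS = (
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
)


def parse_sam_line(line):
    """Split one SAM alignment line into (mandatory fields, optional tags)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(MANDATORY_FIELDS):
        raise ValueError(f"expected at least 11 columns, got {len(fields)}")
    record = dict(zip(MANDATORY_FIELDS, fields[:11]))
    tags = {}
    for item in fields[11:]:
        # Optional fields have the form TAG:TYPE:VALUE; the value may contain ':'.
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return record, tags
```

A downstream tool written this way would use `tags.get("NM")` and decide what to do when the tag is absent, instead of failing on any file that lacks it.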

0

I agree there is not an easy solution (or any solution). I only meant to highlight the problem.

I guess one option is something like the BED format where there are clear variants, such as BED3, BED6, BED12, bedGraph, and BEDPE. Yes, the BED format has its own problems, but at least there is some attempt to keep it somewhat organized.

Even with the current implementation, it's very rare that a tool specifies exactly what is required. They usually only say BAM. Then you give it a BAM and you get an error that might just be a number and you have to scour the web to find other people who ran into the same problem only to find out that you had to pre-process your BAM with some obscure flag. This part is easily fixable by tool developers.
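From the tool developer's side, the fix could look something like the sketch below: check the SAM header up front and report, in plain words, what the tool needs. The specific requirements chosen here (coordinate sorting and at least one read group) are only examples of what a tool might demand, and the function name is hypothetical.

```python
def check_sam_header(header_lines, required_sort_order="coordinate",
                     require_read_groups=True):
    """Return a list of human-readable problems found in SAM header lines."""
    sort_order = None
    has_read_group = False
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@HD":
            # The sort order is declared as SO:<value> on the @HD line.
            for field in fields[1:]:
                if field.startswith("SO:"):
                    sort_order = field[3:]
        elif fields[0] == "@RG":
            has_read_group = True

    problems = []
    if required_sort_order and sort_order != required_sort_order:
        problems.append(
            f"input must be sorted by {required_sort_order} "
            f"(header says: {sort_order or 'nothing'}); sort the file first"
        )
    if require_read_groups and not has_read_group:
        problems.append(
            "input has no @RG header lines; add read groups upstream, "
            "for example at alignment time"
        )
    return problems
```

Printing these messages and exiting is a few lines of work, and it replaces the bare error number with something the user can act on.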

0

This (not always being able to daisy-chain things) is true of everything in biology, especially the wet-lab side.

5
4.7 years ago
John 13k

I agree with the answers above. Another project won't help. We already have Galaxy trying to unify all the different tools.

I think your frustration moushengxu - and it's a frustration many of us have felt at some point - probably has more to do with the lack of documentation, support and guidelines for using the tools. It is totally unacceptable when a tool's documentation says something like:

-p, --peterson : Use the Peterson method rather than the default Rutherford method. Requires --noflux to not be unset.

This can seem like a very precise bit of documentation to the person who wrote the program, but means literally nothing to anyone else. In a commercial situation where someone other than the software developer has to OK all the user-interface decisions, it would be removed or improved, but in Bioinformatics we can't afford big development teams most of the time. And it leads towards an environment where people don't question why such-and-such a program is always run with -p, particularly if the person comes from a biological background. I think it comes back to the Law of Triviality somehow. There's a severe lack of 'why' in bioinformatic documentation, and it gets worse the more complicated things get.

So in my humble opinion more 'streamlined' tools won't help much, because the complexity is real. The only way you'll improve usability is to either hide that complexity (bad idea), or start the enormous task of improving the education surrounding the problem. Better tutorials, video explanations, etc etc - that sort of colourful documentation is what Biologists really want, but the knowledgeable Bioinformaticians don't really need. And that's probably the root of the problem.

Only the people who don't need good tutorials can typically write them - and there's rarely any incentive to do so.

4

So in my humble opinion more 'streamlined' tools won't help much, because the complexity is real. The only way you'll improve usability is to either hide that complexity (bad idea), or start the enormous task of improving the education surrounding the problem.

Well, I agree mostly, yes, everything should be well documented (in general) but seriously....in the end it comes down to a really simple reason - money. Every bioinformatic tool should be free, there should be tons of documentation and video tutorials, and best of all would be if these open source developers replied to emails within an hour or offered online support. And don't get me wrong, many people do care about their open source software, but that only goes so far. Nobody is paying for the super cool documentation. If you write a grant, it is about the new cool piece of software that you're coding, not about the documentation.

1

Documentation and reproducible research - the most under-appreciated items in our research checklist.

2

any good guidelines or write-ups on how to write good documentation?

1

As a short intro to think about, maybe this: "Creating great documentation for bioinformatics software"

0

I think this is going to be the last time I talk bio-info-politics, but that paper on how documentation in bioinformatics should be is really everything wrong with bioinformatics documentation :P I mean, it's almost funny. I really respect that the authors went there and tried to improve what is a bad situation, but I don't think they say anything other than "this is how the big names in bioinformatics do it, so I guess we should do it like this too."

For example, they say that you should look to the MEME suite for good documentation, because they employ a hierarchical layout with programs with similar functions clustered together, but sorted on the most important software at the top. Hehehe, sorry I just think that is so funny. Pedagogy is a full blown field of scientific research - the science of learning. The idea that if we just arrange the information with the most important stuff at the top, then we're teaching it as effectively as possible, is a very computer-science outlook of how to program humans. Not to mention completely wrong. I'd say the MEME suite was terribly un-userfriendly, given that all the tools in the suite had their names drawn from a Scrabble bag.

But whatever. I don't see John Longinotto writing any papers about how to write documentation, so maybe I should just keep my mouth shut.

1

I just quickly read this paper and have to say, it reminds me more of a blog post than of a paper. Or is that too harsh?

1

On the other hand, they also highlight bedtools, which is probably the most user-friendly bioinformatic CLI tool. I think it's the only one where I always know exactly what will happen when I run a command for the first time.

3
4.7 years ago
Satyajeet Khare ★ 1.6k

Yes, there is a need for such a one stop shop. Our biologist colleagues would also agree. But the problem is not lack of willingness or ideas or programming skills. The problem is how vast the field of biology is. Even our colleagues in labs don't have a single standard protocol to prepare the samples that we analyze on our computers. Of course there is no harm in giving it a try. But even if such a one stop shop is created, it will ask our biologist friend so many questions that s/he will end up hiring someone to operate it! Which is already the case.

2
4.7 years ago

I for one am with the OP.

We have many workarounds and band-aid type of solutions - none addresses the real problem - bioinformatics software is not built to the standards that we have come to expect from most open source software.

The reasons for this situation are often expressed in many ways yet all boil down to the same root cause - in life sciences resources are primarily spent towards "scientific discovery" and software is a second if not third/fourth class citizen of the discovery. The net effect of this misguided policy is a waste of epic proportions where people's time is wasted at unimaginable rates. It is sort of an "invisible" waste - no one sees when I and tens of thousands like me lose one, two or ten hours trying to track down some stupid little corner case, and it can easily be blamed on the user: why did you not know what flag -XUSJKKDSADASd*&$@!&&^$# did? Time is the highest cost yet it is the easiest to ignore.

Until one can get reasonable, reliable and quick funding for bioinformatics software, these problems will never get solved. The current state of affairs, where one submits a grant that takes nine months to be reviewed and even then may or may not support a piece of software, is fundamentally inappropriate for software maintenance.

So to summarize - yes this is a valid problem. It is a major problem, and the way we deal with it is that we call the skill of dealing with all that nonsense, which we shouldn't have to deal with, "bioinformatics".

Now whereas the sorry state of bioinformatics is bad for society and science there is actually a positive side at the individual level. It is so complicated and senseless that as a job it does actually pay well since few people can deal with this complexity - so once you get good at it there is not that much competition.

1

I agree. Just like I said earlier, 'You don't get funding for documentation' - you are right, it is also very hard to get funding for software maintenance. And that is a problem: grants support the shiny things - often not the boring but necessary parts of the whole process. Many people support their open source project as well as they can, but at a certain point you need to get paid for it to really move forward. (See above: video tutorials, guides etc etc. At a certain point people can NOT do that in their leisure time any more.)

Therefore, I think we will have to deal with that situation for a while. I don't see that changing anytime soon.

1
4.7 years ago
Biogeek ▴ 400

You could always use CLC-Bio if you want a one stop shop. The amount of software out there is great, but I agree on the poor documentation. I literally come here, or email software developers, for more info. If only documents were written up nicely. When I download software to my directory and see that I don't need to compile it, I also smile.

0
4.7 years ago
moxu ▴ 480

The solution is there or almost there, it's just not publicly available.

Example 1: several of you mentioned Galaxy. If it can be done with a website, it can be done at command line.

Example 2: there are companies like Seven Bridges that offer online NGS analysis for a fee. Same logic as for Galaxy above.

And think about some use case scenarios:

Scenario 1: you have a fastq RNAseq file, and you simply want to map to the genome and get the expression level for each of the genes, right? Then you have the freedom to choose which program to use.

Scenario 2: you have a 23andMe genomic profile and simply want to impute the ungenotyped SNPs. Why should you have to worry about converting the format to VCF, bgzipping it, googling for the reference files, etc.? Again, you should only need to worry about which imputation method is better suited to your own biological questions -- and you only need to choose one after you read the docs.
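To give a feel for the first step of Scenario 2, here is a minimal sketch of reading a 23andMe raw data export. It assumes the tab-separated layout I believe those files use (comment lines starting with `#`, then rsid, chromosome, position, genotype, with `--` meaning no call); the class and function names are made up. Note that the file carries no reference allele, which is why a conversion to VCF still needs external reference files.

```python
from typing import Iterable, Iterator, NamedTuple


class Genotype(NamedTuple):
    rsid: str
    chromosome: str
    position: int
    genotype: str


def parse_23andme(lines: Iterable[str], keep_no_calls: bool = False) -> Iterator[Genotype]:
    """Yield one Genotype per data line, skipping comments and blank lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        if genotype == "--" and not keep_no_calls:
            # "--" is assumed to mark a marker that could not be called.
            continue
        yield Genotype(rsid, chromosome, int(position), genotype)
```

Usage would be along the lines of `with open("genome.txt") as handle: calls = list(parse_23andme(handle))`, where the file name is a placeholder.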

May I propose we set up such an effort, first to draft the use cases/scenarios, and then to implement them with Java/Python/Perl?

I think it's very doable.

1

I think saying "just use Galaxy" or some other consolidating tool may be missing the point.

The problem is often that tools do not actually do what we think they should, or what they claim they do. Often few people (if any) know exactly what actually happens in there. When a tool is used via Galaxy or another interface, we may "think" that the person who created that interface knows, but they don't. If anything, these aggregators sweep the complexity under the rug.

As an example, use two different tools that count reads that overlap a transcript. It is the fundamental first step of every RNA-Seq analysis.

• First observation: the counts will NEVER match when using different counters. Most often the counts are similar - but in some cases they can be wildly different! Yet both tools are published in the most "selective" journals.
• Second observation: I was never able to get the same results when I tried to re-implement myself what I believed the tool did. Mine comes up with a third and different answer - to me that means that the specification of what the tool does is incomplete.

I might have the "freedom" to use any tool that I want - but it is not freedom I am after. I want a correct answer, or at least to know why tool A computes a different value than tool B.
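A toy sketch of how two counters can disagree on identical data: the only difference below is the definition of "overlaps a transcript". The two rules and the function name are illustrative and are not taken from any specific published tool; coordinates are half-open intervals.

```python
def count_reads(reads, tx_start, tx_end, rule):
    """Count reads assigned to the transcript [tx_start, tx_end) under a rule."""
    count = 0
    for start, end in reads:
        if rule == "any_overlap":
            # The read shares at least one base with the transcript.
            hit = start < tx_end and end > tx_start
        elif rule == "fully_contained":
            # The read lies entirely inside the transcript.
            hit = start >= tx_start and end <= tx_end
        else:
            raise ValueError(f"unknown rule: {rule}")
        if hit:
            count += 1
    return count


reads = [(90, 140), (100, 150), (180, 230), (300, 350)]
print(count_reads(reads, 100, 200, "any_overlap"))      # 3
print(count_reads(reads, 100, 200, "fully_contained"))  # 1
```

Real counters also differ in how they treat strand, multi-mapping reads, and reads spanning several features, so an undocumented choice at any of those points is enough to produce a "third answer".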

0

Scenario 1: you want a workflow manager like Galaxy (web), or snakemake, GNU make... (command line)

Scenario 2: you can wrap everything into a shell script (=Scenario 1), or you want to build an amalgam (like Excel). For the latter, this is not the way Linux works, where all simple tasks are independent.