Forum: Unbelievably rapid NGS analysis by Churchill
9.2 years ago
Ram 43k

A friend of mine shared this paper on new tech that promises to go from raw reads to VCF in ~2 hours thanks to massive, MASSIVE parallelization. I find it a little hard to believe.

What does the community here think of it?

Paper: http://genomebiology.com/2015/16/1/6/abstract (provisional PDF available)

Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics

Abstract (provisional)

While advances in genome sequencing technology make population-scale genomics a possibility, current approaches for analysis of this data rely upon parallelization strategies that have limited scalability, complex implementation and lack reproducibility. Churchill, a balanced regional parallelization strategy, overcomes these challenges, fully automating the multiple steps required to go from raw sequencing reads to variant discovery. Through implementation of novel deterministic parallelization techniques, Churchill allows computationally efficient analysis of a high-depth whole genome sample in less than two hours. The method is highly scalable, enabling full analysis of the 1000 Genomes raw sequence dataset in a week using cloud resources. http://churchill.nchri.org/.

sequencing NGS churchill
9.2 years ago

OK, I've downloaded it and taken a look; here is my one-minute summary: this is a bioinformatics tool paper that seems to lack basic software engineering principles. It provides a black-box binary library that appears to have been purposefully obfuscated:

  1. No documentation of any kind is provided on the website. You have to agree to the licensing agreement, download the package, and read a PDF to realize that it requires a large variety of software that is considered obsolete: Java 1.7, samtools 0.1.19, and so on.
  2. The 'software' consists of a naive Python script of frankly very low quality. It does not even make use of built-in parsing modules, and most of the program consists of endless cascading if/elif constructs to parse a config file.
  3. Then there is a dubious shared library (.so) that seems to have originally been written in Python, then compiled with Cython into a .so library (I will note that this is the recommended practice when trying to maximally obfuscate a Python program).
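For comparison, Python's standard library already covers this kind of config parsing in a few lines. A minimal sketch (the section and key names below are made up for illustration, not Churchill's actual config):

```python
# Minimal sketch: parsing an INI-style config with the standard library
# instead of hand-rolled if/elif chains. Section/key names are hypothetical.
from configparser import ConfigParser

config = ConfigParser()
config.read_string("""
[pipeline]
threads = 16
reference = hg19.fa
""")

threads = config.getint("pipeline", "threads")   # typed access, with validation
reference = config.get("pipeline", "reference")
print(threads, reference)  # -> 16 hg19.fa
```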

There is an inherent conflict in this paper: on one hand, presenting it as an open research paper; on the other, a desire to hide and obfuscate the software as much as possible.


It appears that this is not a requirement in GB:

Genome Biology recommends, but does not require, that the source code of the software should be made available under a suitable open-source license that will entitle other researchers to further develop and extend the software if they wish to do so.

Anyway, I agree that unfortunately this paper provides almost no detailed description of the algorithm, and there is a huge conflict of interest related to Genomenext.


I for one don't see anything wrong with commercialization. That is perfectly fine to pursue.

It is the willful obfuscation that, in my opinion, runs counter to the spirit of science and should render a paper inappropriate to be published as research.


I completely agree. Unfortunately, having commercialization in mind frequently leads to being overly cautious with source code, algorithms, etc. I have some examples from my personal experience where I had to convince people that releasing software as open source is fine for showing proof of principle. One can always build a commercial "out of the box" alternative that is highly optimized and easy to use. I think the dilemma of how to demonstrate the superiority of a method while leaving room for commercially viable software is quite common.


Sorry for reviving this nine-month-old thread, but I have been working with Churchill for a while now and have yet to run it to completion with sample data, both on a single computer and in a cluster environment. I've tried a lot of different things, but nothing has worked. There is little discussion of successful usage of Churchill anywhere on the internet, and the development team at the Research Institute at Nationwide Children's Hospital in Ohio has yet to respond to my email. At this point, I would really like to attempt to reconstruct Churchill, perhaps in a different language.

So basically what I'm asking is: would anyone be so kind as to help me white-box this project? I imagine it will involve examining what commands are executed, in what order, and what the output is for each step. Of course, examining the output is difficult in my situation, considering I cannot get Churchill to run to completion. I am most curious about how this step is accomplished:

< image not found >

Thanks


As I mentioned above in my mini-review, I don't believe that this tool works at all. It is an example of what is wrong with bioinformatics: complete nonsense can be published and claimed to work like magic. Move on and try something else.


@Istvan I'm slowly realizing that... Thank you for your input. This product is absolute garbage. The team that created it should be ashamed to have their name on something of such poor quality. I'm going to attempt to switch to bcbio-nextgen. It seems to have incredible documentation and a much larger community supporting it.

Cheers

9.2 years ago

This is absolutely possible and, as Devon mentioned, similar to approaches we use in bcbio. In bcbio we haven't yet worked on single-sample optimization. Practically speaking, we don't have users with a large number of idle cores and the use case of pushing a single sample through as fast as possible. This may change with future work.

What we have worked on is optimizing for lots of concurrent samples: we can run 75 WGS samples in 100 hours on 560 cores, for an effective time of 1 1/2 hours per sample. This presentation from last year has full timing details on slides 29 to 33:

The Churchill paper has a full comparison to bcbio, although there was unfortunately a mistake in setting up the bcbio configuration, so it only used 16 cores for bwa. I'm in discussion with the authors to see if we can fix that.

In summary, this is definitely doable with existing tools and there is no reason to be especially skeptical. We're excited to see other folks working on these problems and looking forward to collaborating with the Churchill developers in the future.


I did see the comparison to bcbio and I was wondering how existing established platforms could overlook something so game-changing (at least in terms of time taken).

I wonder what the monetary difference will be, though - given the dramatic increase in the number of cores + a major decrease in time per core.


I notice that they've filed a patent for the parallelization strategy. Do you have any idea how close their strategy is to what you've previously implemented in bcbio-nextgen?


I wish I had patented "The concept of parallelizing an existing process that could be trivially parallelized." I'm just not evil enough no matter how hard I try; these things don't occur to me until I hear about the latest Supreme Court decisions.

9.2 years ago

Sounds plausible to me; the problem is mostly perfectly parallelizable.


I'm gonna wait and watch this one. Looks really interesting, but I'm still a bit skeptical on the time claims.


Well, most claims of huge performance increases are bogus or involve unacceptable trade-offs, so I am skeptical of them as well, and this case is no exception. But I think 2 hours from raw data to variant calls would be quite achievable with good programmers using distributed computing and reimplementing most tools from scratch to avoid currently very slow programs like GATK/Picard.


Exactly. You'd have to rework it from scratch, and unless I see it with my own eyes, I cannot believe this claim. Like you said, it might involve unacceptable trade-offs.


This pipeline still uses bwa for mapping and gatk/freebayes for variant calling.


I wonder how they gain so much time if GATK is involved.


The mention of "balanced regional parallelization" is probably the key here. I'm reminded of a post by Brad Chapman about parallelizing things by subsections of chromosomes/contigs bounded by features like repetitive elements or other low-mappability regions. That general concept, combined with good algorithm design, could probably lead to the speed improvements they're claiming.
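The scatter/gather pattern described here can be sketched with the standard library alone. This is a toy illustration, not Churchill's or bcbio's actual implementation; the regions and the per-region "calling" step are stand-ins (a real pipeline would use processes and run aligners/callers on each slice):

```python
# Toy sketch of regional scatter/gather: process genomic regions
# independently, then merge results in deterministic input order.
# call_region is a stand-in for a real per-region variant-calling step.
from concurrent.futures import ThreadPoolExecutor

def call_region(region):
    chrom, start, end = region
    # A real pipeline would run alignment/calling on just this slice.
    return f"{chrom}:{start}-{end}"

regions = [("chr1", 0, 50_000_000),
           ("chr1", 50_000_000, 100_000_000),
           ("chr2", 0, 60_000_000)]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Executor.map preserves input order, so the merge is deterministic
    # no matter which region finishes first.
    results = list(pool.map(call_region, regions))

print(results)
```

Determinism is the point the paper's title stresses: merging per-region outputs in a fixed order means repeated runs give identical results.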


I want to believe, but I'll hold off on judgement until I see it working on my data.


Well, that also means that the speedup is species-dependent. While it shows great timing on the human genome, there might be no speedup for plant genomes.


The approach isn't species-specific. The general idea is to find regions where we can reliably break analysis steps into chunks that are smaller than whole chromosomes and as balanced as possible. We do this in bcbio, and it has been used across a wide variety of species without issues.
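A minimal, species-agnostic sketch of the chunking idea (real tools choose split points at repeats or low-mappability regions rather than fixed offsets; the contig names and lengths here are hypothetical):

```python
# Split contigs of any assembly into roughly balanced fixed-size chunks.
# Real implementations pick split points at repeat/low-mappability
# boundaries instead of fixed offsets; names and lengths are made up.
def chunk_contigs(contig_lengths, chunk_size):
    chunks = []
    for contig, length in contig_lengths.items():
        for start in range(0, length, chunk_size):
            chunks.append((contig, start, min(start + chunk_size, length)))
    return chunks

# Works identically for a plant or human assembly: only the lengths differ.
chunks = chunk_contigs({"chr1": 250, "chr2": 120}, chunk_size=100)
print(chunks)
# -> [('chr1', 0, 100), ('chr1', 100, 200), ('chr1', 200, 250),
#     ('chr2', 0, 100), ('chr2', 100, 120)]
```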

Another tool that uses a similar parallelization approach is SpeedSeq:

https://github.com/cc2qe/speedseq#speedseq

9.2 years ago
alartin • 0

Brad leads a great project in bcbio-nextgen. I suggest it could establish a standard benchmark using the human genome and make performance comparisons more transparent and reliable. I hope nobody just uses some dubious numbers to attract our eyeballs.


The authors did compare to bcbio-nextgen, but they did not optimize its bwa parallelization, so let's see how they respond to Brad now that he's trying to get them to optimize bcbio for the benchmarking.

