News: Google announces DeepVariant
7
12 months ago by
Hussain Ather890
National Institutes of Health, Bethesda, MD
Hussain Ather890 wrote:

Google announced the release of DeepVariant, a deep learning tool for constructing true genome sequences with greater accuracy than classical methods. It only works on somatic calls, but it is very interesting to see image recognition applied to genome reconstruction.

DeepVariant is the first of what we hope will be many contributions that leverage Google's computing infrastructure and ML expertise to both better understand the genome and to provide deep learning-based genomics tools to the community.

deep learning news google genome • 2.0k views
modified 12 months ago by Chris Miller20k • written 12 months ago by Hussain Ather890
5

One (more) step towards "Ok Google .. analyze this dataset, predict the downstream consequences".

modified 12 months ago • written 12 months ago by genomax59k
1

Haha, sounds familiar, but I was expecting this from Google. It's finally out, and to be honest it seems pretty impressive, with open-source availability as well.

written 12 months ago by vchris_ngs4.6k
1

We implemented the DeepVariant pipeline with Docker and Nextflow here

Lifebit integrated the pipeline with example parameters

Would love your feedback on this.

Thanks!

modified 9 weeks ago • written 9 weeks ago by alaincoletta110
13
12 months ago by
Chris Miller20k
Washington University in St. Louis, MO
Chris Miller20k wrote:

Some thoughts:

1) DeepVariant does not work on somatic calls - only germline.

2) Yes, it beat GATK, but only barely (don't quote me on the numbers, but it was something like 98% vs 98.5%)

3) The method is insane, in that they actually create millions of images, encoding read information as colors and alpha, and then use their image-processing neural network to do pattern recognition for calling.

4) It is quite computationally expensive to run, not to mention training the NN.

5) It absolutely requires new training data for each platform that you're going to run it on. Chemistry changed slightly? Got a new type of instrument? Doing targeted regions instead of WGS? You'll need a new gold standard run and you'll need to retrain the algorithm from scratch. They used the Genome in a Bottle dataset. That's limited to ~80% of the genome, and their TPs are only calls validated on at least two sequencing technologies.
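To make point 3 a bit more concrete, here is a minimal sketch of what "encoding read information" into an image-like array could look like. This is not DeepVariant's actual encoding: the function name `encode_pileup`, the three channels, and the `BASE_CODE` scaling are all assumptions made up for illustration.

```python
# Hypothetical, simplified pileup encoding. The channel layout and scaling
# below are illustrative assumptions, not DeepVariant's real format.
BASE_CODE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}


def encode_pileup(ref, reads, max_qual=40):
    """Encode reads over a reference window as a rows x columns x 3 nested list.

    ref   : reference sequence for the window, e.g. "ACGTA"
    reads : list of (start_offset, bases, quals) tuples; start_offset is the
            0-based window column of the read's first base
    Channels per cell: [base identity, base quality scaled to 0..1,
                        1.0 if the base mismatches the reference else 0.0]
    """
    width = len(ref)
    image = []
    for start, bases, quals in reads:
        # One row per read; columns not covered by the read stay all zeros.
        row = [[0.0, 0.0, 0.0] for _ in range(width)]
        for i, (base, qual) in enumerate(zip(bases, quals)):
            col = start + i
            if 0 <= col < width:
                row[col] = [
                    BASE_CODE.get(base, 0.0),
                    min(qual, max_qual) / max_qual,
                    1.0 if base != ref[col] else 0.0,
                ]
        image.append(row)
    return image


ref = "ACGTA"
reads = [
    (0, "ACGTA", [40, 40, 40, 40, 40]),  # matches the reference everywhere
    (1, "CTTA", [30, 20, 40, 40]),       # mismatch at window column 2 (G -> T)
]
tensor = encode_pileup(ref, reads)
```

A network would then be trained on many such arrays, each labeled with the known genotype at the candidate site, which is why a gold-standard call set is needed in the first place.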

Don't get me wrong - it's cool to see someone enter the space with a really crazy orthogonal method, but it's not a panacea, and the hype about AI solving all of our variant calling problems is pretty clearly overblown. That doesn't mean that this won't be useful in the future, just that it's not there yet.

modified 12 months ago • written 12 months ago by Chris Miller20k
1

Thanks Chris for such a concise & informative review. I have a question/comment about your point 5, on the need to retrain the model every time something changes (bear with me: I haven't read the DeepVariant method in any detail).

First, I wonder to what extent it is really necessary to retrain the parameters even for small changes in the library preparation. Presumably (big if), small changes in, say, chemistry should still give good results.

But most importantly, I don't think DeepVariant is conceptually different from other methods when it comes to using training and test data. I mean, DeepVariant makes the need for training data explicit. But implicitly, other methods also need training data that in theory should be re-analysed every time something changes. For example, when we ("we" meaning us or the program we use) decide to filter out variants supported by fewer than 3 reads, effectively we are saying "given the training data I've seen until now, 3 is a good threshold".
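A toy sketch of that last point, to show that a hard-coded read-support cutoff is really a parameter fitted to whatever labeled data one has seen. The function names, the F1 criterion, and the six example calls are all invented for illustration and do not come from any real caller.

```python
def filter_by_support(calls, min_alt_reads=3):
    """Keep calls supported by at least min_alt_reads alternate-allele reads."""
    return [c for c in calls if c["alt_reads"] >= min_alt_reads]


def best_threshold(labeled_calls, candidates=range(1, 11)):
    """Pick the read-support threshold that maximizes F1 on labeled calls.

    Each call is a dict with 'alt_reads' (int) and 'is_true' (bool).
    Ties are resolved in favor of the smallest threshold.
    """
    total_true = sum(c["is_true"] for c in labeled_calls)
    best, best_f1 = None, -1.0
    for t in candidates:
        kept = filter_by_support(labeled_calls, t)
        tp = sum(c["is_true"] for c in kept)
        fp = len(kept) - tp
        fn = total_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best, best_f1 = t, f1
    return best, best_f1


# Made-up "training data": three artifacts with low support, three real variants.
labeled = [
    {"alt_reads": 1, "is_true": False},
    {"alt_reads": 2, "is_true": False},
    {"alt_reads": 2, "is_true": False},
    {"alt_reads": 3, "is_true": True},
    {"alt_reads": 5, "is_true": True},
    {"alt_reads": 8, "is_true": True},
]
threshold, f1 = best_threshold(labeled)
```

On this toy set the search lands on 3, but a different platform or depth would shift the labeled examples and therefore the "right" cutoff, which is the implicit retraining being described.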

written 12 months ago by dariober9.7k
1

My assumption is that whether running GATK with data produced on a HiSeq 2500, a NovaSeq patterned flow-cell, or an amplicon-based technology, you'll get reasonable results. This is thanks to lots of effort that went into making their model (and its heuristics) general. (In essence, yes, using all the training data we've seen up until now).

The NN picks out artifact patterns automatically, which is impressive, but that makes it very susceptible to changes. Given a large and diverse training corpus, there's no reason why it can't learn general patterns too! My point is that these large, highly validated training sets don't exist, so if you hop to a new (or older) technology, you can't expect DeepVariant to just work. (again, for now)

It's also a contrast to current callers, where you can often look at your new type of data, see "oh, it looks like I'm overcalling at homopolymer runs", and then tweak some parameters to fix the problem. NN is a total black box and has to be retrained from scratch.
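As a rough illustration of the kind of knob this refers to, here is a small sketch that flags calls sitting inside long homopolymer runs so they can be filtered or inspected. The helper names and the `max_run` cutoff are hypothetical and are not parameters of GATK or any other specific caller.

```python
def homopolymer_run_length(seq, pos):
    """Length of the run of identical bases in seq that contains index pos."""
    base = seq[pos]
    left = pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = pos
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1


def flag_homopolymer_calls(ref, positions, max_run=4):
    """Return the 0-based positions that fall inside a run longer than max_run."""
    return [p for p in positions if homopolymer_run_length(ref, p) > max_run]


ref = "ACGTTTTTTAGC"   # contains a run of six T's at indices 3..8
calls = [1, 5, 10]     # hypothetical 0-based call positions
suspicious = flag_homopolymer_calls(ref, calls)
```

If overcalling shows up on a new data type, `max_run` is a single visible value that can be tightened or relaxed, whereas a trained network offers no such direct handle.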

So yeah, I absolutely think that NN-based variant calling will be useful (and probably better!) in the future. I'm just trying to inject some reality into the proceedings here. :)

written 12 months ago by Chris Miller20k

Chris, from my experience, SAMtools / BCFtools mpileup achieves higher sensitivity / specificity than GATK when compared to the gold standard in clinical genetics, Sanger. I imagine that it also beats DeepVariant in this regard. Variant calling need not be so complex / convoluted.

modified 9 weeks ago • written 9 weeks ago by Kevin Blighe33k

While essentially saying that germline calling is a solved problem is probably a little bit of a stretch, it's absolutely not true for somatic calling. Tumor ploidy and purity come into play, FFPE may be involved, or you might be looking for very low-frequency events, etc. There's a lot of complexity there, and a lot of places where a NN might offer substantial improvements if designed correctly.

written 9 weeks ago by Chris Miller20k

Yes, I should have stated that mpileup beats everything else (from my experience) where germline variants are concerned. Never benchmarked it for somatic. You are correct: a lot of extra factors go into somatic variant calling.

written 9 weeks ago by Kevin Blighe33k
1

Just to clarify, DeepVariant does not use images. Their first implementation was based on Inception and it used images.

But now DeepVariant doesn't use images, but rather tensor representations of genome data.

modified 7 months ago • written 7 months ago by danutempyrium10

Just want to be clear - the DV team deserves kudos. Variant calling is a hard problem, their method is interesting, and their performance is admirable. If I'm negative about anything, it's the breathless "Google AI has solved genomics!" press coverage, which you can't blame the authors for!

written 12 months ago by Chris Miller20k
1

I totally agree about the hoopla over "Google AI solved genomics!". At the end of the day it is a product they are launching, and I'm pretty sure the buzz will exceed what it actually delivers. Having said that, I feel it is worth taking a look at how germline calls are made and improved, but to what extent it will be useful will be a matter of time. For somatic calls, I am sure they will come up with something soon. I still need to understand the algorithm and how they implemented it. But I am happy that this kind of work also pushes genomics one step closer to being a research service product, and I support that.

written 12 months ago by vchris_ngs4.6k
5
12 months ago by
mdepristo60
mdepristo60 wrote:

Hi all,

Glad to see a post here on Biostars about DeepVariant's open source release. If you'd like more information on accuracy and runtime of DeepVariant across a variety of datasets, have a look at the blog post from DNANexus about DeepVariant on their internal benchmark datasets.

written 12 months ago by mdepristo60

Did you not develop it? Should probably state a disclaimer.

modified 9 weeks ago • written 9 weeks ago by Kevin Blighe33k
1

Mark isn't trying to hide that fact - consider your post the disclaimer!

written 9 weeks ago by Chris Miller20k
3
12 months ago by
Chris Miller20k
Washington University in St. Louis, MO
Chris Miller20k wrote:

Steven Salzberg has a nice response to the hype:

No, Google's AI Program Can't Build Your Genome Sequence

https://www.forbes.com/sites/stevensalzberg/2017/12/11/no-googles-new-ai-cant-build-your-genome-sequence/#5e35eefb5774

written 12 months ago by Chris Miller20k