News: Google announces DeepVariant
6 days ago by Hussain Ather (500), National Institutes of Health, Bethesda, MD:

Google announced the release of DeepVariant, a deep learning tool for constructing true genome sequences with greater accuracy than classical methods. It only works on somatic calls, but it is very interesting to see image recognition used in genome reconstruction.

DeepVariant is the first of what we hope will be many contributions that leverage Google's computing infrastructure and ML expertise to both better understand the genome and to provide deep learning-based genomics tools to the community.

modified 1 hour ago by Chris Miller (19k) • written 6 days ago by Hussain Ather (500)

One (more) step towards "Ok Google .. analyze this dataset, predict the downstream consequences".

modified 6 days ago • written 6 days ago by genomax (39k)

Haha, sounds familiar, but I was expecting this from Google. It's finally out and, to be honest, it seems pretty impressive, with open-source availability as well.

written 6 days ago by vchris_ngs (4.2k)
6 days ago by Chris Miller (19k), Washington University in St. Louis, MO:

Some thoughts:

1) DeepVariant does not work on somatic calls - only germline.

2) Yes, it beat GATK, but only barely (don't quote me on the numbers, but it was something like 98% vs 98.5%)

3) The method is insane, in that they actually create millions of images, encoding read information as colors and alpha, and then use their image-processing neural network to do pattern recognition for calling.

4) It is quite computationally expensive to run, not to mention training the NN.

5) It absolutely requires new training data for each platform that you're going to run it on. Chemistry changed slightly? Got a new type of instrument? Doing targeted regions instead of WGS? You'll need a new gold standard run and you'll need to retrain the algorithm from scratch. They used the Genome in a Bottle dataset. That's limited to ~80% of the genome, and their TPs are only calls validated on at least two sequencing technologies.
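To make point 3 a bit more concrete, here is a toy sketch of what "encoding read information as colors and alpha" could look like. The channel layout, the intensity values and the function names (`encode_read`, `encode_pileup`) are all assumptions made up for illustration; this is not DeepVariant's actual encoding, just the general idea of turning a pileup into an image-like tensor.

```python
# Toy sketch only: the channel layout below is an assumption for illustration,
# not DeepVariant's actual encoding.

# Hypothetical mapping from base to a "color" intensity.
BASE_TO_INTENSITY = {"A": 250, "C": 180, "G": 100, "T": 30}


def encode_read(bases, quals, ref_window, max_qual=40):
    """Encode one aligned read as a row of (base, quality, mismatch, alpha) pixels."""
    row = []
    for base, qual, ref_base in zip(bases, quals, ref_window):
        base_channel = BASE_TO_INTENSITY.get(base, 0)  # which base was read
        qual_channel = int(255 * min(qual, max_qual) / max_qual)  # scaled base quality
        is_mismatch = base != ref_base
        mismatch_channel = 255 if is_mismatch else 0  # differs from the reference?
        alpha = 255 if is_mismatch else 100  # mismatches are drawn more opaque
        row.append((base_channel, qual_channel, mismatch_channel, alpha))
    return row


def encode_pileup(reads, ref_window):
    """Stack encoded reads into a height x width x 4 'image' (nested lists)."""
    return [encode_read(bases, quals, ref_window) for bases, quals in reads]


ref = "ACGTA"
reads = [
    ("ACGTA", [40, 40, 40, 40, 40]),
    ("ACTTA", [40, 38, 20, 40, 40]),  # G>T mismatch at the third position
]
image = encode_pileup(reads, ref)
print(len(image), len(image[0]), len(image[0][0]))  # reads, window width, channels
```

One image like this per candidate site is what would then be handed to an image classifier, which is where the "millions of images" come from.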

Don't get me wrong - it's cool to see someone enter the space with a really crazy orthogonal method, but it's not a panacea, and the hype about AI solving all of our variant calling problems is pretty clearly overblown. That doesn't mean that this won't be useful in the future, just that it's not there yet.

modified 6 days ago • written 6 days ago by Chris Miller (19k)

Just want to be clear - the DV team deserves kudos. Variant calling is a hard problem, their method is interesting, and their performance is admirable. If I'm negative about anything, it's the breathless "Google AI has solved genomics!" press coverage, which you can't blame the authors for!

written 5 days ago by Chris Miller (19k)

I totally agree about the hoopla over "Google AI solved genomics!". At the end of the day it is a product they are bringing out, and I'm pretty sure the buzz will exceed what it actually delivers. Having said that, I feel it is worth taking a look at how germline calls are made and improved, though how useful it can be will only become clear with time. For somatic calls, I'm sure they will bring out something soon. I still need to understand how they implemented the algorithm. But I'm happy that this kind of work also pushes genomics one step closer to being a research service product, and I support that.

written 5 days ago by vchris_ngs (4.2k)

Thanks Chris for such a concise and informative review. I have a question/comment about your point 5, the need to retrain the model every time something changes (bear with me: I haven't read the DeepVariant method in any detail).

First, I wonder to what extent it is really necessary to retrain the parameters even for small changes in library preparation. Presumably (big if), small changes in, say, chemistry should still give good results.

But most importantly, I don't think DeepVariant is conceptually different from other methods when it comes to using training and test data. I mean, DeepVariant makes the need for training data explicit. But implicitly other methods also need training data that in theory should be re-analysed every time something changes. For example, when we ("we" meaning us or the program we use) decide to filter out variants supported by fewer than 3 reads, effectively we are saying "given the training data I've seen until now, 3 is a good threshold".
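To illustrate the point about a hard-coded threshold being implicit training, here is a small sketch that picks the read-support cutoff maximising F1 on labelled calls. Everything in it is hypothetical: the function names, the candidate range and both toy datasets are invented for the example and do not come from any real caller or benchmark.

```python
def f1_at_threshold(calls, threshold):
    """F1 when keeping calls with at least `threshold` supporting reads.

    calls: list of (supporting_reads, is_true_variant) pairs.
    """
    tp = sum(1 for reads, truth in calls if reads >= threshold and truth)
    fp = sum(1 for reads, truth in calls if reads >= threshold and not truth)
    fn = sum(1 for reads, truth in calls if reads < threshold and truth)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(calls, candidates=range(1, 11)):
    """Return the candidate threshold with the highest F1 on the labelled calls."""
    return max(candidates, key=lambda t: f1_at_threshold(calls, t))


# Made-up labelled calls standing in for "the training data I've seen until now".
platform_a = [
    (1, False), (1, False), (2, False), (2, False),
    (2, True), (3, True), (3, True), (4, True), (5, True), (8, True),
]
# A second made-up dataset with noisier low-support calls.
platform_b = [
    (1, False), (2, False), (3, False), (3, False), (4, False),
    (4, True), (5, True), (6, True), (6, True), (9, True),
]

print(best_threshold(platform_a))  # 3
print(best_threshold(platform_b))  # 4
```

On the first toy dataset the familiar cutoff of 3 comes out best, and on the second it shifts to 4, which is the sense in which a fixed filter quietly depends on the data it was tuned on.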

written 3 hours ago by dariober (8.3k)

My assumption is that whether you run GATK on data produced on a HiSeq 2500, a NovaSeq patterned flow-cell, or an amplicon-based technology, you'll get reasonable results. This is thanks to lots of effort that went into making their model (and its heuristics) general. (In essence, yes, using all the training data we've seen up until now.)

The NN picks out artifact patterns automatically, which is impressive, but that makes it very susceptible to changes. Given a large and diverse training corpus, there's no reason why it can't learn general patterns too! My point is that these large, highly validated training sets don't exist, so if you hop to a new (or older) technology, you can't expect DeepVariant to just work. (again, for now)

It's also a contrast to current callers, where you can often look at your new type of data, see "oh, it looks like I'm overcalling at homopolymer runs", and then tweak some parameters to fix the problem. NN is a total black box and has to be retrained from scratch.
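As a rough picture of the kind of parameter tweak I mean, here is a sketch of a filter that demands more read support inside homopolymer runs. The function names and parameters (`min_reads`, `homopolymer_len`, `homopolymer_min_reads`) are hypothetical and are not options of GATK or any other real caller.

```python
def homopolymer_run_at(seq, pos):
    """Length of the run of identical bases that contains index `pos`."""
    base = seq[pos]
    start = pos
    while start > 0 and seq[start - 1] == base:
        start -= 1
    end = pos
    while end < len(seq) - 1 and seq[end + 1] == base:
        end += 1
    return end - start + 1


def keep_call(seq, pos, supporting_reads,
              min_reads=3, homopolymer_len=5, homopolymer_min_reads=6):
    """Keep a call if it has enough support; be stricter in long homopolymers."""
    in_homopolymer = homopolymer_run_at(seq, pos) >= homopolymer_len
    required = homopolymer_min_reads if in_homopolymer else min_reads
    return supporting_reads >= required


ref = "ACGTAAAAAAGCT"  # six A's at indices 4 to 9
print(keep_call(ref, 6, 4))  # inside the A run, 4 reads: filtered out
print(keep_call(ref, 2, 4))  # outside any run, 4 reads: kept
```

If you see overcalling at homopolymers on a new data type, you can raise `homopolymer_min_reads` and move on, whereas there is no equivalent knob on a trained network.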

So yeah, I absolutely think that NN-based variant calling will be useful (and probably better!) in the future. I'm just trying to inject some reality into the proceedings here. :)

written 2 hours ago by Chris Miller (19k)
5 days ago by mdepristo (30):

Hi all,

Glad to see a post here on Biostars about DeepVariant's open source release. If you'd like more information on the accuracy and runtime of DeepVariant across a variety of datasets, have a look at the blog post from DNAnexus about DeepVariant on their internal benchmark datasets.

written 5 days ago by mdepristo (30)
1 hour ago by Chris Miller (19k), Washington University in St. Louis, MO:

Steven Salzberg has a nice response to the hype:

No, Google's AI Program Can't Build Your Genome Sequence

https://www.forbes.com/sites/stevensalzberg/2017/12/11/no-googles-new-ai-cant-build-your-genome-sequence/#5e35eefb5774

written 1 hour ago by Chris Miller (19k)