Looking for DeepVariant training data on yeast strains
16 months ago
Michael 54k

I would like to try DeepVariant to call variants in Illumina WGS data from some yeast strains, using the S. cerevisiae R64 reference genome. For that, I need training data to use with dv_make_examples.py. If I understand correctly, these should ideally come from the same species (Saccharomyces cerevisiae). If I am not mistaken, I also need VCF files with "true" or validated variants and the corresponding sequencing data (which I can get from SRA). I was unable to find such variant data for download in SGD. Or does the species not matter, so that I could simply use human data to train the model?

data training DeepVariant • 996 views

You could use the VCF files from https://www.nature.com/articles/s41586-018-0030-5; they are available for download here: http://1002genomes.u-strasbg.fr/files/


Yes, I have seen this paper. My concern is that by training on those calls I would simply replicate the GATK pipeline the authors used. As far as I understand, there was some filtering involved but no manual curation or independent validation.
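To make the circularity concern concrete, here is a minimal sketch in plain Python. It does not call DeepVariant or GATK; the function name and the variant tuples are made up for illustration, and it assumes a variant can be identified by (chrom, pos, ref, alt). It only shows that scoring against labels derived from a caller cannot reveal what that caller missed.

```python
def precision_recall(calls, truth):
    """Return (precision, recall) of `calls` measured against `truth`.

    Both arguments are sets of (chrom, pos, ref, alt) tuples.
    """
    true_positives = len(calls & truth)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical variants, invented for illustration only.
gatk_calls = {
    ("chrI", 1200, "A", "G"),
    ("chrI", 5400, "C", "T"),
    ("chrII", 310, "G", "A"),
}

# Assumption: an independently validated set would also contain
# a variant that the pipeline missed.
independent_truth = gatk_calls | {("chrII", 9000, "T", "C")}

# Using the filtered GATK calls as "truth": perfect by construction.
print(precision_recall(gatk_calls, gatk_calls))         # (1.0, 1.0)

# Against an independent truth set the missed variant shows up in recall.
print(precision_recall(gatk_calls, independent_truth))  # (1.0, 0.75)
```

A model trained on such labels would be pushed toward reproducing the caller's output, including its misses, which is why a validated truth set matters here.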

16 months ago
Michael 54k

I have come to the conclusion that using DeepVariant does not make sense in my case, so I will not pursue it. Instead, we will use GATK4 in a similar fashion to Peter et al. 2018, who used an older version. Deep learning methods make a lot of sense when curated and manually validated training data are available, which doesn't seem to be the case here. The gold-standard truth set matters; for human data it was obtained by generations of scientists using more traditional methods of de novo variant calling and labor-intensive validation. The sheer amount of training data has allowed Google to outperform sequence-based variant callers in competitions, but one should not forget that this success would not have been possible without de novo variant callers. So, for other organisms where training data are sparse, using DeepVariant makes no sense. (Alphabet folks: feel free to provide evidence to the contrary)
