Question: Variant Callers for deep sequencing
4.9 years ago, Rad wrote:


I have a deep sequencing experiment to analyze, and I am unsure which variant caller to use because I have doubts about scalability.

For a small experiment, not a deep-seq one, we usually rely on the GATK recommendations and best-practices guides, combining a few steps such as deduplication, which keeps the analysis time more or less acceptable. For a deep sequencing experiment there is no rationale for removing duplicates, which makes the variant-calling step very long and not very scalable.

I haven't tried the GATK variant caller on deep-seq data yet, but I guess it will take a lot of time. What is the best option for variant calling on deep-seq data in terms of scalability? If anyone has tried this before, I would be grateful for hints on the best way to do it.



modified 3.9 years ago by Biostar • written 4.9 years ago by Rad

Do you have a BAM that's already sorted and aligned? Besides the added I/O burden, I wouldn't expect variant calling on deep sequencing data to take that much longer. I would expect the time needed to scale more with the size of the target region. Maybe I'm missing something?

written 4.9 years ago by Katie D'Aco

Yes, I have BAMs that are sorted and indexed. I am at the stage where I need to call variants on them, but I have not yet decided which variant caller can handle such a sequencing depth. It is a MiSeq run, so even on a cluster that would be a long shot, I guess. Now the question: which variant caller is the best bet for such coverage? I can't find any comparison along those lines.

written 4.9 years ago by Rad

Two questions: What kind of coverage do you have? (At 10-90x coverage you should just stick to gatk-bp and be patient or parallelize. At >200x coverage the smart callers will be too slow.) What kind of information do you want to end up with? A mammalian diploid sequence could be seen with high probability by sampling down to 30x. Do it twice if you're not certain. Metagenomes or heterogeneous tumor sequencing need alternate-allele percentage precision and can't be downsampled that far.
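To make the downsampling suggestion concrete, here is a minimal sketch of the arithmetic, assuming the mean depth of the BAM is already known. The function name `downsample_fraction` and the example depths are illustrative and not from the thread. The resulting fraction would be handed to a read subsampler (for example the subsampling option of `samtools view`), and "doing it twice" would correspond to two runs with different random seeds.

```python
def downsample_fraction(mean_depth: float, target_depth: float = 30.0) -> float:
    """Fraction of reads to keep so that mean depth drops to about target_depth."""
    if mean_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    # Data already at or below the target is kept whole (no "upsampling").
    return min(1.0, target_depth / mean_depth)


if __name__ == "__main__":
    # Illustrative depths only; substitute the measured mean depth of each BAM.
    for depth in (90, 250, 1000):
        print(depth, round(downsample_fraction(depth), 3))
```

As noted above, this only suits germline diploid calling; for tumor or metagenomic samples the discarded reads carry the allele-fraction signal.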

written 4.9 years ago by karl.stamm

Thanks Karl, yes, my coverage is about 10-90X. To define 'slow' for people reading this thread: I am talking about weeks to call variants on a single run on an SGE cluster :) I am not doing that now, but it is what I want to avoid, which is why I asked the question. Besides, I don't want a program to crash because it does not scale to high coverage, so I want to rule those out before planning my analysis pipeline.

written 4.9 years ago by Rad
4.9 years ago, Sean Davis (National Institutes of Health, Bethesda, MD) wrote:

For 90x coverage, GATK works fine for us. You can parallelize GATK by running it per chromosome. If you want to go even further, freebayes can be safely parallelized across non-overlapping regions of chromosomes. In any case, for most callers, 90x shouldn't be too bad. Do make sure that any high-depth filters that are inherent in the defaults are either turned off or set to more sane numbers for your data.
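As a rough illustration of splitting chromosomes into non-overlapping regions, the sketch below tiles each contig with fixed-size windows. The function name `split_into_regions`, the contig lengths, and the `contig:start-end` string format (1-based, inclusive) are assumptions made for the example. Callers differ in the coordinate convention they expect, so check the documentation of whichever one you use.

```python
from typing import Dict, List


def split_into_regions(contig_lengths: Dict[str, int], chunk_size: int) -> List[str]:
    """Tile each contig with non-overlapping 1-based inclusive regions."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    regions = []
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, chunk_size):
            # The last window on a contig is clipped to the contig length.
            end = min(start + chunk_size - 1, length)
            regions.append(f"{contig}:{start}-{end}")
    return regions


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    print(split_into_regions({"chr1": 2500, "chrM": 900}, chunk_size=1000))
```

Each region string can then become one cluster job. Passing a whole chromosome length as `chunk_size` gives the per-chromosome split.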


written 4.9 years ago by Sean Davis

Thank you Sean. And what about, let's say, 1000x (the datasets I receive vary in coverage)? Is there any recommendation for such runs?

written 4.9 years ago by Rad

Exomes can definitely achieve that level of coverage with modern sequencers, so I'd give it a try. Variant calling is almost embarrassingly parallel, so for many callers, you can simply run a per-chromosome or per-region analysis and combine the results.
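The run-per-region-and-combine pattern could look roughly like the sketch below. It assumes a user-supplied `call_region` function (hypothetical here) that runs some caller on one region and returns its calls as a list. In practice that function would typically launch the caller as a subprocess and parse or collect its output, which is not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Variant = Tuple[str, int, str, str]  # (contig, position, ref, alt)


def call_in_parallel(
    regions: Sequence[str],
    call_region: Callable[[str], List[Variant]],
    max_workers: int = 4,
) -> List[Variant]:
    """Run call_region on every region concurrently and concatenate the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `regions`, so the combined
        # list stays in region order even if jobs finish out of order.
        per_region = list(pool.map(call_region, regions))
    combined: List[Variant] = []
    for calls in per_region:
        combined.extend(calls)
    return combined


if __name__ == "__main__":
    def fake_caller(region: str) -> List[Variant]:
        # Stand-in for a real caller: reports one made-up variant per region.
        contig, span = region.split(":")
        return [(contig, int(span.split("-")[0]), "A", "G")]

    print(call_in_parallel(["chr1:1-1000", "chr1:1001-2000"], fake_caller))
```

Threads are enough here on the assumption that each worker mostly waits on an external process. On a cluster, the same split-and-combine idea maps onto one scheduler job per region.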

written 4.9 years ago by Sean Davis

Cool thanks Sean, much appreciated

written 4.9 years ago by Rad