Five years ago, we needed multiple libraries and multiple Illumina runs to sequence a human individual to high coverage. At that time, the reads were short, the base quality was not good, and the indel models used by the SNP callers were primitive. The best practice was appropriate for that time.
Things are very different now. We typically sequence one human sample from one library in one run. Reads are longer. Base quality is better calibrated. The indel models of modern callers are much better. Nowadays, I usually recommend:
- Use Unix pipes and avoid generating temporary files as much as possible. Brad Chapman and several others do pre-filtering, mapping, SAM-to-BAM conversion, duplicate marking and sorting in one command line.
- Use samblaster to mark duplicates, which is much easier and faster when the two ends of a pair are grouped together. Note that for samblaster to work, you need to map one library in one go.
- Personally, I have found sambamba to be faster than samtools for sorting, but I have not done careful comparisons. Sambamba also generates the BAM index while sorting. This could save a wall-clock hour or so.
- Skip base recalibration. The gatk implementation is slow and, most of the time, it is no longer necessary. I have also seen recalibration lead to slightly unexpected results.
- Skip gatk indel realignment if you use gatk-hc, freebayes or platypus. These modern callers realign reads on the fly while calling SNPs/INDELs. The gatk team has shown that realignment has almost no effect when we use gatk-hc. In addition, for >100bp reads, false variants caused by indels are usually not a big concern. Given 100bp reads, even older callers like samtools and gatk-ug work reasonably well without realignment. That said, I do think a more sophisticated indel realigner might still be able to improve indel calling, but no public tools are available for now.
Following these practices, you can typically get an "analysis-ready" 30X BAM in <24 hours on 8-16 CPU cores.
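To make the first three recommendations concrete, here is a minimal sketch that assembles such a one-line pipeline as a string. It is not Brad Chapman's actual command. The file names and the `build_pipeline` helper are hypothetical, and the flags are only the basic ones I believe these tools accept (`bwa mem -t`, `samtools view -u -`, `sambamba sort -t ... -o ... /dev/stdin`), so check them against the versions you have installed. Older samtools releases may also need `-S` to read SAM input.

```python
import shlex


def build_pipeline(ref, fq1, fq2, out_bam, threads=8):
    """Return one shell command line: map, mark duplicates, convert, sort.

    Hypothetical helper for illustration; it only builds the string and
    does not run anything.
    """
    q = shlex.quote
    n = int(threads)
    stages = [
        # bwa mem writes SAM to stdout with the two ends of a pair adjacent,
        # which is the grouping samblaster relies on.
        f"bwa mem -t {n} {q(ref)} {q(fq1)} {q(fq2)}",
        # samblaster reads SAM on stdin and writes duplicate-marked SAM to stdout.
        "samblaster",
        # Convert SAM to uncompressed BAM so the sorter does not have to parse text.
        "samtools view -u -",
        # Sort the stream and write the final BAM; no intermediate files are kept.
        f"sambamba sort -t {n} -o {q(out_bam)} /dev/stdin",
    ]
    return " | ".join(stages)


cmd = build_pipeline("ref.fa", "reads_1.fq.gz", "reads_2.fq.gz", "sample.sorted.bam")
print(cmd)
```

The resulting string can be pasted into a shell or passed to `subprocess.run(cmd, shell=True, check=True)`. All reads of one library have to go through a single such command, since samblaster needs to see the whole library in one stream.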
As for other tools, I think vt is for VCF processing, not for BAM processing (right?). elPrep sounds very interesting, but it seems to require more RAM than many people have. In addition, BGI has a GPU-powered pipeline which is much faster, though you need to use SOAP for mapping. ADAM provides massively parallelized MarkDuplicates, realignment and variant calling, though it probably does not save CPU time. The SNAP group from UC Berkeley has shown me a pipeline that does SNAP mapping, deduping and sorting in one go in a very short time.
6.0 years ago by
lh3 ♦ 32k