Question: sam/bam file handling -> new tools?
5
3.5 years ago by
Richard550
Canada
Richard550 wrote:

Hi all,

We are thinking about ways to make our production pipeline run faster. Right now we're settled on an aligner, but everything that gets us from SAM/BAM creation to a sorted, merged, duplicate-marked BAM could be updated.

We'll need tools to help us:

  • sort
  • merge
  • mark duplicates
  • flagstats
  • make BAM indices (.bai)

On my list of tools to evaluate, I have some combination of the following:

  • samtools
  • picard
  • sambamba
  • samblaster

Are there any tools that I am missing? Are there any combinations of tools that people find particularly effective?

Right now, our current workflow involves

  1. Align with BWA
  2. Convert to BAM and sort (samtools)
  3. Mark duplicates in each BAM (Picard)
  4. Merge and mark duplicates across all the lanes for a sample (Picard)

Looking forward to your suggestions!

sam markduplicates sort bam • 2.5k views
ADD COMMENTlink modified 3.5 years ago by lh331k • written 3.5 years ago by Richard550
9
3.5 years ago by
lh331k
United States
lh331k wrote:

Interesting question.

Five years ago, we needed multiple libraries and multiple Illumina runs to sequence a human individual to high coverage. At that time, the reads were short, the base quality was not good, and the indel models used by the SNP callers were primitive. The best practices of that era were appropriate for it.

Things are very different now. We typically sequence one human sample from one library in one run. Reads are longer. Base quality is better calibrated. The indel models of modern callers are much better. Nowadays, I usually recommend:

  1. Use Unix pipes and avoid generating temporary files as much as possible. Brad Chapman and several others do prefiltering, mapping, SAM-to-BAM conversion, duplicate marking and sorting in one command line.
  2. Use samblaster to mark duplicates, which is much easier and faster when the two ends of a pair are grouped together. Note that for samblaster to work, you need to map one library in one go.
  3. Personally, I have found sambamba to be faster than samtools for sorting, but I have not done careful comparisons. Sambamba also generates the BAM index while sorting. This could save a wall-clock hour or so.
  4. Skip base recalibration. The gatk implementation is slow and, most of the time, it is no longer necessary. I have also seen recalibration lead to slightly unexpected results.
  5. Skip gatk indel realignment if you use gatk-hc, freebayes or platypus. These modern callers realign reads on the fly while calling SNPs/INDELs. The gatk team has shown that realignment has almost no effect with gatk-hc. In addition, for >100bp reads, false variants caused by indels are usually not a big concern. Given 100bp reads, even older callers like samtools and gatk-ug work reasonably well without realignment. That said, I do think a more sophisticated indel realigner might still improve indel calling, but no public tools are available for now.
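
As a rough illustration of points 1 and 2, the Python sketch below assembles a single piped command line (map, mark duplicates, convert to BAM, sort) with no intermediate files. It is only a sketch under stated assumptions: it assumes bwa, samblaster and samtools 1.3 or later are on the PATH, the helper names (build_pipeline, post_commands, run_pipeline) and all file names are hypothetical, and samtools sort stands in for sambamba, so the index is built in a separate step here.

```python
import shlex
import subprocess


def build_pipeline(ref, fq1, fq2, out_bam, threads=8):
    """Return one shell pipeline: bwa mem | samblaster | samtools view | samtools sort.

    Hypothetical helper; assumes bwa, samblaster and samtools >= 1.3 are installed.
    """
    q = shlex.quote
    stages = [
        # Map both ends of one library in a single run so mates stay adjacent.
        f"bwa mem -t {threads} {q(ref)} {q(fq1)} {q(fq2)}",
        # samblaster reads SAM on stdin and writes duplicate-marked SAM to stdout.
        "samblaster",
        # Convert the SAM stream to BAM without touching disk.
        "samtools view -b -",
        # Coordinate-sort the stream into the final BAM.
        f"samtools sort -@ {threads} -o {q(out_bam)} -",
    ]
    return " | ".join(stages)


def post_commands(out_bam):
    """Commands to run once the sorted BAM exists: index (.bai) and flagstat."""
    return [
        ["samtools", "index", out_bam],
        ["samtools", "flagstat", out_bam],
    ]


def run_pipeline(pipeline):
    """Run the pipeline under bash with pipefail so a failing stage is not masked."""
    subprocess.run(["bash", "-o", "pipefail", "-c", pipeline], check=True)
```

For example, print(build_pipeline("ref.fa", "r1.fq", "r2.fq", "sample.bam")) shows the command line before anything is executed; run_pipeline() would then run it, followed by the commands from post_commands().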

Following these practices, you can typically get an "analysis-ready" 30X BAM in <24 hours on 8-16 CPU cores.

As to other tools, I think vt is for VCF processing, not for BAM processing (right?). elPrep sounds very interesting, but it seems to require more RAM than many people have. In addition, BGI has a GPU-powered pipeline which is much faster, but you need to use SOAP for mapping. ADAM provides massively parallelized MarkDuplicates, realignment and variant calling, though it probably does not save CPU time. The SNAP group from UC Berkeley has shown me a pipeline that does SNAP mapping, deduping and sorting in one go in a very short time.

ADD COMMENTlink written 3.5 years ago by lh331k
1
3.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum112k wrote:

GATK: realign, recalibrate... See the Best Practices: http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0-retired

ADD COMMENTlink written 3.5 years ago by Pierre Lindenbaum112k
1
3.5 years ago by
Ying W3.8k
South San Francisco, CA
Ying W3.8k wrote:

You might also consider elPrep as a replacement for samtools/Picard (it is supposed to be fast, but I have not used it myself).

I would also recommend you look into what speedseq does, since it might be doing something similar to what you are trying to achieve.

Keep in mind that the approaches described in the posts "Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them" and "Piping With Samtools, Bwa And Bedtools" could also help you increase speed.
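
To illustrate the parallelization idea, here is a minimal Python sketch that runs independent per-lane sort jobs concurrently and then builds a merge command. It is not GNU parallel itself: it uses concurrent.futures from the standard library as a stand-in for the same pattern. The helper names and lane file names are hypothetical, and it assumes samtools 1.3 or later is installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def lane_sort_command(lane_bam, sorted_bam, threads=2):
    """Build the samtools command that coordinate-sorts one lane's BAM."""
    return ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, lane_bam]


def merge_command(sorted_bams, merged_bam):
    """Build the samtools command that merges the sorted per-lane BAMs."""
    return ["samtools", "merge", merged_bam, *sorted_bams]


def run_all(commands, max_jobs=4, runner=subprocess.run):
    """Run independent commands concurrently; results come back in input order.

    Threads are enough here because each job only waits on an external process.
    The runner argument exists so the scheduling can be tried without samtools.
    """
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(lambda cmd: runner(cmd, check=True), commands))
```

A typical use would be to build one lane_sort_command() per lane, pass the list to run_all(), and then run merge_command() on the sorted outputs. Keep max_jobs times the per-job thread count within the number of cores available.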

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Ying W3.8k

+1 for GNU parallel. Great tool.

I'll check out elPrep and speedseq.
 

ADD REPLYlink written 3.5 years ago by Richard550
0
3.5 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

You should also use vt normalize on your BAM files to left-align and trim.

See this paper: http://bioinformatics.oxfordjournals.org/content/early/2015/02/19/bioinformatics.btv112.abstract

for the difference it can make.

ADD COMMENTlink written 3.5 years ago by brentp22k