Question

speed of bam-read count is too slow

1

Entering edit mode

8.0 years ago

fizer ▴ 30

Hi all,

I am running bam-readcount program on one of my bam files of size 7Gb on a machine with 24G memory. I provided 2.5M positions and reference genome (hg19) and it's been 12 hours now and output has resulted only for 30k position. Why is it that much slow? Is there a way to increase the efficiency or run the program with multiple cores? Please help me out. Thanks

bam-readcount multi threading slow speed • 3.0k views

ADD COMMENT • link updated 8.0 years ago by ernfrid ▴ 220 • written 8.0 years ago by fizer ▴ 30

0

Entering edit mode

If you just want basic stats use samtools idxstats. It very fast because it only reads the bam index.

ADD REPLY • link 8.0 years ago by Zev.Kronenberg 12k

score 3 · Answer 1 · 2016-04-01

bam-readcount generates many stats at a given position. As shown in the other two answers, if you are looking for general or simple readcounts then other strategies will be much faster. If you do want all the stats then one common cause of excessive bam-readcount slowness is if the sites you provide are not sorted by reference sequence. In addition, if you have consecutive positions you can also reduce the number of queries by collapsing them into ranges in your file of positions.

For example, instead of:

1   1   1
1   2   2
1   3   3

use a file like:

1   1   3

score 2 · Answer 2 · 2016-04-01

You could do:

$ bedmap --echo --count regions-of-interest.bed <(bam2bed < reads.bam) > roi-with-read-counts.bed

The --echo option reports the region of interest, and the --count operation counts the number of reads that overlap each region of interest. The results are written to a file called roi-with-read-counts.bed.

To speed this up further, if you have a computational cluster, you can split the task by chromosome and run each chromosome on separate nodes, e.g., convert the BAM to BED via a SLURM cluster:

$ bam2bed_slurm indexed-reads.bam reads.bed

Or convert via an SGE cluster:

$ bam2bed_sge indexed-reads.bam reads.bed

Or convert via GNU Parallel:

$ bam2bed_gnuParallel indexed-reads.bam reads.bed

Then do per-chromosome mapping on each node:

$ bedmap --echo --count --chrom ${chromosome} regions-of-interest.bed reads.bed > roi-with-read-counts.${chromosome}.bed

The --chrom option takes a chromosome name ${chromosome} and uses the sort order of the input BED files to quickly jump to the right place in the BED file.