Optimizing samtools shell scripts: how to make the code run faster

I would like to count the number of reads within a genomic region.

This is the command I would like to use:

samtools view -f 0x02 -q 40 in.bam 1:10000000-10001000 | sort | uniq | wc -l

I add the sort and uniq because in some instances I might otherwise count the same read twice.
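
If the double counting comes from the same read name appearing more than once, counting unique read names (QNAME, the first SAM column) might be closer to what I want. This is just a sketch of that idea, not the exact command I am running:

samtools view -f 0x02 -q 40 in.bam 1:10000000-10001000 | cut -f1 | sort -u | wc -l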

My problem is that I need to count the reads for this position and 500k more positions, for over 200 samples.

As it stands, my code is written like this:

use Parallel::ForkManager;    # @bam_files and @positions are defined elsewhere

my $pm = Parallel::ForkManager->new(16);    # fork up to 16 child processes

foreach my $bam_file (@bam_files) {
    $pm->start and next;                    # fork; the parent moves on to the next BAM
    open my $out, '>', $bam_file . "_out.txt" or die "Cannot open output: $!";

    foreach my $pos (@positions) {
        my $command = "samtools view -f 0x02 -q 40 $bam_file $pos | sort | uniq | wc -l";
        chomp(my $count = `$command`);      # run the pipeline and capture the count
        print $out $pos, "\t", $count, "\n";
    }

    close $out;
    $pm->finish;                            # child exits
}

$pm->wait_all_children;

I am parallelizing over the samples in sets of 16 and writing out to separate files. For some reason my code runs fast at first and then slows down.

I tried working with MPI but to no avail (samtools is not built for MPI on my cluster). I know some C and Python. Any advice?

I want my code to run as fast as possible. Please help!

If you have a cluster, consider an approach that splits work up into smaller pieces.

Let's say you have indexed, non-redundant BAM files (deduplicated with samtools rmdup, or rmdup -s for single-end data, and indexed with samtools index), an SGE/OGE cluster, and a BED file containing the regions of interest.
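
For paired-end data, the pre-processing might look something like this (a sketch; reads.bam stands in for your input file):

$ samtools rmdup reads.bam nodup_reads.bam
$ samtools index nodup_reads.bam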

You can use a parallelized version of bam2starch to make a compressed Starch archive:

$ bam2starch_sge < nodup_reads.bam > nodup_reads.starch

Then you can farm out per-chromosome mapping tasks to nodes on a cluster.

On the first node, for instance:

$ bedmap --chrom 1 --echo --count regions-of-interest.bed nodup_reads.starch > 1.roi.counts.bed

On the second node:

$ bedmap --chrom 10 --echo --count regions-of-interest.bed nodup_reads.starch > 10.roi.counts.bed

And so on, for subsequent nodes and chromosomes.
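
If the scheduler accepts plain qsub submissions, one way to script the farming-out step rather than running each command by hand is a loop over chromosomes. This is only a sketch; it assumes the BEDOPS bedextract utility is available to list the chromosomes in the BED file, and that your site does not require extra qsub options:

for chrom in $(bedextract --list-chr regions-of-interest.bed); do
    echo "bedmap --chrom ${chrom} --echo --count regions-of-interest.bed nodup_reads.starch > ${chrom}.roi.counts.bed" \
        | qsub -N "count_${chrom}" -cwd
done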

Each of the *.roi.counts.bed files contains the regions of interest on the specified chromosome, along with the number of unique reads across each region.

At the end, you can collate this into one BED file via:

$ bedops --everything *.roi.counts.bed > roi.counts.bed

This is a basic map-reduce approach. With parallelization, the overall time cost comes down to pre-processing the BAM file into a non-redundant Starch archive, plus the time to run the map operation on the largest chromosome.

This should give an order-of-magnitude reduction in processing time, compared with searching for one region at a time over the whole BAM file and repeating until all regions are done.

I don't think my cluster is SGE/OGE, but I'm not sure; they say to use MPI for parallel processing across nodes.

What grid or job scheduler software do you use? This could be Sun Grid Engine, Oracle Grid Engine, PBS, etc.

SLURM on the fast cluster and PBS on the slightly slower one; 24 CPUs per node vs. 16.

What you could do is review the SGE script and replace the qsub statements with the equivalent commands for starting jobs via SLURM: http://www.sdsc.edu/~hocks/FG/PBS.slurm.html
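
As a rough illustration (count_chr1.sh here is a placeholder job script; your site may also require options such as a partition or account):

# SGE submission
qsub -N count_chr1 -cwd count_chr1.sh

# approximate SLURM equivalent
sbatch --job-name=count_chr1 count_chr1.sh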

Use a Makefile: the following Makefile defines a set of BAM files and a set of positions, with two loops, one over the BAMs and one over the positions. Invoke this Makefile with make and the option -j num_of_parallel_tasks.

SHELL=/bin/bash
.PHONY:all
BAMS = file1.bam file2.bam file3.bam file4.bam
POSITIONS = 1_100_200  1_300_400 1_500_600 1_700_800

# loop_pos(bam,pos): generate a rule that builds <pos>_<bam basename>.count,
# the read count for one position (encoded as chrom_start_end) in one BAM
define loop_pos

$$(addprefix $(2)_,$$(addsuffix .count,$$(basename $(1)))) : $(1)
    echo -en "$$(word 1,$$(subst _, ,$2)):$$(word 2,$$(subst _, ,$2))-$$(word 3,$$(subst _, ,$2))\t" > $$@
    samtools view -f 0x02 -q 40 $$< $$(word 1,$$(subst _, ,$2)):$$(word 2,$$(subst _, ,$2))-$$(word 3,$$(subst _, ,$2)) | sort | uniq | wc -l >> $$@

endef

# loop_bam(bam): generate a rule that concatenates the per-position counts into
# <bam basename>.count, and instantiate loop_pos for every position
define loop_bam

$$(addsuffix .count,$$(basename $(1))): $(foreach P,${POSITIONS},$$(addsuffix .count,$$(addprefix ${P}_,$$(basename $(1)) )))
    cat $$^ > $$@

$(eval $(foreach P,${POSITIONS},$(call loop_pos,$(1),${P})))

endef

all: $(addsuffix .count,$(basename ${BAMS}))

$(eval $(foreach B,${BAMS},$(call loop_bam,${B})))
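
To run it, something like the following should work (assuming GNU Make; -n previews the generated commands without executing them):

make -j 16        # run up to 16 recipes in parallel
make -n           # dry run: print the commands only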

output:

echo -n "1:100-200\t" > 1_100_200_file1.count
samtools view -f 0x02 -q 40 file1.bam 1:100-200 | sort | uniq | wc -l >> 1_100_200_file1.count
echo -n "1:300-400\t" > 1_300_400_file1.count
samtools view -f 0x02 -q 40 file1.bam 1:300-400 | sort | uniq | wc -l >> 1_300_400_file1.count
echo -n "1:500-600\t" > 1_500_600_file1.count
samtools view -f 0x02 -q 40 file1.bam 1:500-600 | sort | uniq | wc -l >> 1_500_600_file1.count
echo -n "1:700-800\t" > 1_700_800_file1.count
samtools view -f 0x02 -q 40 file1.bam 1:700-800 | sort | uniq | wc -l >> 1_700_800_file1.count
cat 1_100_200_file1.count 1_300_400_file1.count 1_500_600_file1.count 1_700_800_file1.count > file1.count
echo -n "1:100-200\t" > 1_100_200_file2.count
samtools view -f 0x02 -q 40 file2.bam 1:100-200 | sort | uniq | wc -l >> 1_100_200_file2.count
echo -n "1:300-400\t" > 1_300_400_file2.count
samtools view -f 0x02 -q 40 file2.bam 1:300-400 | sort | uniq | wc -l >> 1_300_400_file2.count
echo -n "1:500-600\t" > 1_500_600_file2.count
samtools view -f 0x02 -q 40 file2.bam 1:500-600 | sort | uniq | wc -l >> 1_500_600_file2.count
echo -n "1:700-800\t" > 1_700_800_file2.count
samtools view -f 0x02 -q 40 file2.bam 1:700-800 | sort | uniq | wc -l >> 1_700_800_file2.count
cat 1_100_200_file2.count 1_300_400_file2.count 1_500_600_file2.count 1_700_800_file2.count > file2.count
echo -n "1:100-200\t" > 1_100_200_file3.count
samtools view -f 0x02 -q 40 file3.bam 1:100-200 | sort | uniq | wc -l >> 1_100_200_file3.count
echo -n "1:300-400\t" > 1_300_400_file3.count
samtools view -f 0x02 -q 40 file3.bam 1:300-400 | sort | uniq | wc -l >> 1_300_400_file3.count
echo -n "1:500-600\t" > 1_500_600_file3.count
samtools view -f 0x02 -q 40 file3.bam 1:500-600 | sort | uniq | wc -l >> 1_500_600_file3.count
echo -n "1:700-800\t" > 1_700_800_file3.count
samtools view -f 0x02 -q 40 file3.bam 1:700-800 | sort | uniq | wc -l >> 1_700_800_file3.count
cat 1_100_200_file3.count 1_300_400_file3.count 1_500_600_file3.count 1_700_800_file3.count > file3.count
echo -n "1:100-200\t" > 1_100_200_file4.count
samtools view -f 0x02 -q 40 file4.bam 1:100-200 | sort | uniq | wc -l >> 1_100_200_file4.count
echo -n "1:300-400\t" > 1_300_400_file4.count
samtools view -f 0x02 -q 40 file4.bam 1:300-400 | sort | uniq | wc -l >> 1_300_400_file4.count
echo -n "1:500-600\t" > 1_500_600_file4.count
samtools view -f 0x02 -q 40 file4.bam 1:500-600 | sort | uniq | wc -l >> 1_500_600_file4.count
echo -n "1:700-800\t" > 1_700_800_file4.count
samtools view -f 0x02 -q 40 file4.bam 1:700-800 | sort | uniq | wc -l >> 1_700_800_file4.count
cat 1_100_200_file4.count 1_300_400_file4.count 1_500_600_file4.count 1_700_800_file4.count > file4.count

This is along the lines of what I'm thinking, but in the end there will be a countless number of files. The scratch space on my cluster can slow down (and has slowed down) because of the I/O burden.

I think I can do this but with a single file per sample, writing out lines like

echo -n "1:500-600\tfile1.bam\t"

Is there a way to open the file once and keep appending to it in bash? Similar to

open OUT, ">".$ofh;
while (<IN>) { print OUT "foo\n"; }
close OUT;

as opposed to

while(<IN>){
    open OUT, ">>".$ofh;
    print OUT "foo\n";
    close OUT;
}
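
A minimal bash sketch of that idea (input.txt and out.txt are placeholders): redirect the whole loop once, or keep a dedicated file descriptor open and append to it.

# the output file is opened once for the entire loop
while read -r line; do
    printf 'foo\n'
done < input.txt > out.txt

# alternatively, keep an append-mode file descriptor open
exec 3>> out.txt
printf 'foo\n' >&3
exec 3>&-        # close it when done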