Optimizing samtools shell scripts: how to make the code run faster

I would like to count the number of reads within a genomic region.

This is the command I would like to use:

samtools view -f 0x02 -q 40 in.bam 1:10000000-10001000 | sort | uniq | wc -l

I add the sort and uniq because in some instances I might otherwise count the same read twice.
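
If the double counting comes from the same read name appearing more than once, counting unique read names (QNAME, the first SAM column) might be closer to what I want. This is just a sketch of that idea, not the exact command I am running:

samtools view -f 0x02 -q 40 in.bam 1:10000000-10001000 | cut -f1 | sort -u | wc -l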

My problem is that I need to count the reads for this position and 500k more positions, for over 200 samples.

As it stands, my code is written like this:

use Parallel::ForkManager;    # @bam_files and @positions are defined elsewhere

my $pm = Parallel::ForkManager->new(16);    # fork up to 16 child processes

foreach my $bam_file (@bam_files) {
    $pm->start and next;                    # fork; the parent moves on to the next BAM
    open my $out, '>', $bam_file . "_out.txt" or die "Cannot open output: $!";

    foreach my $pos (@positions) {
        my $command = "samtools view -f 0x02 -q 40 $bam_file $pos | sort | uniq | wc -l";
        chomp(my $count = `$command`);      # run the pipeline and capture the count
        print $out $pos, "\t", $count, "\n";
    }

    close $out;
    $pm->finish;                            # child exits
}

$pm->wait_all_children;

I am parallelizing over the samples in sets of 16 and writing out to separate files. For some reason my code runs fast at first and then slows down.

I tried working with MPI but to no avail (samtools is not built for MPI on my cluster). I know some C and Python. Any advice?

I want my code to run as fast as possible. Please help!

If you have a cluster, consider an approach that splits work up into smaller pieces.

Let's say you have indexed, non-redundant BAM files (deduplicated with samtools rmdup, or rmdup -s for single-end data, and indexed with samtools index), an SGE/OGE cluster, and a BED file containing the regions of interest.
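
For paired-end data, the pre-processing might look something like this (a sketch; reads.bam stands in for your input file):

$ samtools rmdup reads.bam nodup_reads.bam
$ samtools index nodup_reads.bam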

You can use a parallelized version of bam2starch to make a compressed Starch archive:

$ bam2starch_sge < nodup_reads.bam > nodup_reads.starch

Then you can farm out per-chromosome mapping tasks to nodes on a cluster.

On the first node, for instance:

$ bedmap --chrom 1 --echo --count regions-of-interest.bed nodup_reads.starch > 1.roi.counts.bed

On the second node:

$ bedmap --chrom 10 --echo --count regions-of-interest.bed nodup_reads.starch > 10.roi.counts.bed

And so on, for subsequent nodes and chromosomes.
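
If the scheduler accepts plain qsub submissions, one way to script the farming-out step rather than running each command by hand is a loop over chromosomes. This is only a sketch; it assumes the BEDOPS bedextract utility is available to list the chromosomes in the BED file, and that your site does not require extra qsub options:

for chrom in $(bedextract --list-chr regions-of-interest.bed); do
    echo "bedmap --chrom ${chrom} --echo --count regions-of-interest.bed nodup_reads.starch > ${chrom}.roi.counts.bed" \
        | qsub -N "count_${chrom}" -cwd
done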

Each of the *.roi.counts.bed files contains the regions of interest on the specified chromosome, along with the number of unique reads across each region.

At the end, you can collate this into one BED file via:

$ bedops --everything *.roi.counts.bed > roi.counts.bed

This is a basic map-reduce approach. With parallelization, the overall time cost comes down to pre-processing the BAM file into a non-redundant Starch archive, plus the time to run the map operation on the largest chromosome.

This should give an order-of-magnitude reduction in processing time, compared with searching for one region at a time over the whole BAM file and repeating until all regions are done.

I don't think my cluster is SGE/OGE, but I'm not sure; they say to use MPI for parallel processing across nodes.

What grid or job scheduler software do you use? This could be Sun Grid Engine, Oracle Grid Engine, PBS, etc.

SLURM on the fast cluster and PBS on the slightly slower one; 24 CPUs per node vs. 16.

What you could do is review the SGE script and replace the qsub statements with the equivalent commands for starting jobs via SLURM: http://www.sdsc.edu/~hocks/FG/PBS.slurm.html
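
As a rough illustration (count_chr1.sh here is a placeholder job script; your site may also require options such as a partition or account):

# SGE submission
qsub -N count_chr1 -cwd count_chr1.sh

# approximate SLURM equivalent
sbatch --job-name=count_chr1 count_chr1.sh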

Use a Makefile: the following Makefile defines a set of BAM files and a set of positions, with two loops, one over the BAMs and one over the positions. Invoke this Makefile with make and the option -j num_of_parallel_tasks.

SHELL=/bin/bash
.PHONY:all
BAMS = file1.bam file2.bam file3.bam file4.bam
POSITIONS = 1_100_200  1_300_400 1_500_600 1_700_800

# loop_pos(bam,pos): generate a rule that builds <pos>_<bam basename>.count,
# the read count for one position (encoded as chrom_start_end) in one BAM
define loop_pos

$$(addprefix $(2)_,$$(addsuffix .count,$$(basename $(1)))) : $(1)
    echo -en "$$(word 1,$$(subst _, ,$2)):$$(word 2,$$(subst _, ,$2))-$$(word 3,$$(subst _, ,$2))\t" > $$@
    samtools view -f 0x02 -q 40 $$< $$(word 1,$$(subst _, ,$2)):$$(word 2,$$(subst _, ,$2))-$$(word 3,$$(subst _, ,$2)) | sort | uniq | wc -l >> $$@

endef

# loop_bam(bam): generate a rule that concatenates the per-position counts into
# <bam basename>.count, and instantiate loop_pos for every position
define loop_bam

$$(addsuffix .count,$$(basename $(1))): $(foreach P,${POSITIONS},$$(addsuffix .count,$$(addprefix ${P}_,$$(basename $(1)) )))
    cat $$^ > $$@

$(eval $(foreach P,${POSITIONS},$(call loop_pos,$(1),${P})))

endef

all: $(addsuffix .count,$(basename ${BAMS}))

$(eval $(foreach B,${BAMS},$(call loop_bam,${B})))
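
To run it, something like the following should work (assuming GNU Make; -n previews the generated commands without executing them):

make -j 16        # run up to 16 recipes in parallel
make -n           # dry run: print the commands only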

output:

echo -n "1:100-200\t" > 1_100_200_file1.count
samtools view -f 0x02 -q 40 file1.bam 1:100-200 | sort | uniq | wc -l >> 1_100_200_file1.count
echo -n "1:300-400\t" > 1_300_400_file1.count
samtools view -f 0x02 -q 40 file1.bam 1:300-400 | sort | uniq | wc -l >> 1_300_400_file1.count
echo -n "1:500-600\t" > 1_500_600_file1.count
samtools view -f 0x02 -q 40 file1.bam 1:500-600 | sort | uniq | wc -l >> 1_500_600_file1.count
echo -n "1:700-800\t" > 1_700_800_file1.count
samtools view -f 0x02 -q 40 file1.bam 1:700-800 | sort | uniq | wc -l >> 1_700_800_file1.count
cat 1_100_200_file1.count 1_300_400_file1.count 1_500_600_file1.count 1_700_800_file1.count > file1.count
echo -n "1:100-200\t" > 1_100_200_file2.count
samtools view -f 0x02 -q 40 file2.bam 1:100-200 | sort | uniq | wc -l >> 1_100_200_file2.count
echo -n "1:300-400\t" > 1_300_400_file2.count
samtools view -f 0x02 -q 40 file2.bam 1:300-400 | sort | uniq | wc -l >> 1_300_400_file2.count
echo -n "1:500-600\t" > 1_500_600_file2.count
samtools view -f 0x02 -q 40 file2.bam 1:500-600 | sort | uniq | wc -l >> 1_500_600_file2.count
echo -n "1:700-800\t" > 1_700_800_file2.count
samtools view -f 0x02 -q 40 file2.bam 1:700-800 | sort | uniq | wc -l >> 1_700_800_file2.count
cat 1_100_200_file2.count 1_300_400_file2.count 1_500_600_file2.count 1_700_800_file2.count > file2.count
echo -n "1:100-200\t" > 1_100_200_file3.count
samtools view -f 0x02 -q 40 file3.bam 1:100-200 | sort | uniq | wc -l >> 1_100_200_file3.count
echo -n "1:300-400\t" > 1_300_400_file3.count
samtools view -f 0x02 -q 40 file3.bam 1:300-400 | sort | uniq | wc -l >> 1_300_400_file3.count
echo -n "1:500-600\t" > 1_500_600_file3.count
samtools view -f 0x02 -q 40 file3.bam 1:500-600 | sort | uniq | wc -l >> 1_500_600_file3.count
echo -n "1:700-800\t" > 1_700_800_file3.count
samtools view -f 0x02 -q 40 file3.bam 1:700-800 | sort | uniq | wc -l >> 1_700_800_file3.count
cat 1_100_200_file3.count 1_300_400_file3.count 1_500_600_file3.count 1_700_800_file3.count > file3.count
echo -n "1:100-200\t" > 1_100_200_file4.count
samtools view -f 0x02 -q 40 file4.bam 1:100-200 | sort | uniq | wc -l >> 1_100_200_file4.count
echo -n "1:300-400\t" > 1_300_400_file4.count
samtools view -f 0x02 -q 40 file4.bam 1:300-400 | sort | uniq | wc -l >> 1_300_400_file4.count
echo -n "1:500-600\t" > 1_500_600_file4.count
samtools view -f 0x02 -q 40 file4.bam 1:500-600 | sort | uniq | wc -l >> 1_500_600_file4.count
echo -n "1:700-800\t" > 1_700_800_file4.count
samtools view -f 0x02 -q 40 file4.bam 1:700-800 | sort | uniq | wc -l >> 1_700_800_file4.count
cat 1_100_200_file4.count 1_300_400_file4.count 1_500_600_file4.count 1_700_800_file4.count > file4.count

This is along the lines of what I'm thinking, but in the end there will be a countless number of files. The scratch space on my cluster can slow down (and has slowed down) because of the I/O burden.

I think I can do this but with a single file per sample, writing out lines like

echo -n "1:500-600\tfile1.bam\t"

Is there a way to open the file once and keep appending to it in bash? Similar to

open OUT, ">".$ofh;
while (<IN>) { print OUT "foo\n"; }
close OUT;

as opposed to

while(<IN>){
    open OUT, ">>".$ofh;
    print OUT "foo\n";
    close OUT;
}
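
A minimal bash sketch of that idea (input.txt and out.txt are placeholders): redirect the whole loop once, or keep a dedicated file descriptor open and append to it.

# the output file is opened once for the entire loop
while read -r line; do
    printf 'foo\n'
done < input.txt > out.txt

# alternatively, keep an append-mode file descriptor open
exec 3>> out.txt
printf 'foo\n' >&3
exec 3>&-        # close it when done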