Question: How To Extract Reads From Bam That Overlap With Specific Regions?
Stew (1.4k) wrote, 9.6 years ago:

This question is related to this one, but I would like to know if anyone knows of any methods of quickly extracting reads from a BAM file that overlap with a list of many regions (e.g. a BED file), but I would like the reads separately for each region.

My eventual requirements are that for each region individually, I would like to know:

  1. The number of overlapping reads
  2. The location of the highest pileup
  3. The height of the highest pileup

However I am happy to calculate these, I just need a tool to return the reads for each region in a way that I can parse.
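For illustration, all three per-region numbers can be computed from plain (start, end) read intervals with a sweep over pileup events. This is a minimal pure-Python sketch; the function name and the interval representation are my own, not tied to any BAM library:

```python
def region_stats(reads, region_start, region_end):
    """For reads overlapping [region_start, region_end), return
    (read count, position of highest pileup, height of highest pileup).
    `reads` is a list of (start, end) half-open intervals."""
    overlapping = [(s, e) for (s, e) in reads
                   if s < region_end and e > region_start]
    # Sweep-line pileup: +1 at each read start, -1 at each read end.
    # Ties sort ends (-1) before starts (+1), matching half-open intervals.
    events = sorted([(max(s, region_start), 1) for s, e in overlapping] +
                    [(min(e, region_end), -1) for s, e in overlapping])
    depth = best_depth = 0
    best_pos = region_start
    for pos, delta in events:
        depth += delta
        if depth > best_depth:
            best_depth = depth
            best_pos = pos
    return len(overlapping), best_pos, best_depth
```

Any tool that returns per-region read coordinates (Rsamtools, samtools view, etc.) could feed a loop like this.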

I am currently using Rsamtools, which does everything I require, as it returns a list with each element containing the reads that overlap one region. However, it is really slow for a large number of regions. Its run time depends on the number of regions and is largely independent of the number of reads in the BAM. I am currently using Rsamtools like this:

which <- RangedData(space=bed[,1],IRanges(bed[,2],bed[,3]))
param <- ScanBamParam(which=which,what=c("pos"))
bam <- scanBam(file=bamFile,param=param)

I can split the list of regions, run it as multiple parallel jobs, and merge at the end; however, first I wanted to check whether there was another tool that was quicker for this task.

I have looked at intersectBed, but I cannot get it to return reads in a way I can easily parse to give information for each region. The run time of that method depends on the number of reads in the BAM and is largely independent of the number of regions, which would suit my needs.

modified 6.1 years ago by Alex Reynolds (30k) • written 9.6 years ago by Stew (1.4k)

I coded this in python using pysam. For 166k regions and a bamfile with 100 million alignments, runtime was 412 seconds to count the number of reads in each region.

written 9.6 years ago by Sean Davis (26k)

Hi, it's slow because scanBam with a param is not meant to be called as repeatedly as you are doing. For many regions it's advisable to use countOverlaps.

written 9.6 years ago by Michael Dondrup (47k)

The Java code by Pierre is incomplete. Hi Pierre, is there any chance you could post the entire Java code? Thanks!

modified 4 months ago by RamRS (27k) • written 6.1 years ago by roscas090
Michael Dondrup (47k) wrote, 9.6 years ago, from Bergen, Norway:

Hi Stew,

As Sean suggested, you should take advantage of the efficient overlap computation in the IRanges package. What is slowing down your computation is probably the post-processing of the list you are making, not the BAM parsing itself. I'm going to provide a little recipe here.

Please have a look at the documentation of countOverlaps and Views in particular. Also note that you can import BED files using the rtracklayer package.

I will edit this answer as soon as I find some more time to test the examples.

Edit: I promised some recipes in R, so here is the first one. You will need:

  • a GFF file with the genomic regions
  • a BAM file with the aligned reads
  • a fairly recent R & Bioconductor, with the ShortRead and rtracklayer libraries plus dependencies

reads = readAligned("~/small.bam", type="BAM")
genes = import("~/bsubtilis.gff")
# easiest way to make a ranged data object out of the reads
reads = as(as(reads, "GRanges"), "RangedData")
# make sure the chromosome names in reads and genes are identical
# otherwise you will get an error
countOverlaps(genes, reads)
CompressedIntegerList of length 1
[["AL009126.3"]] 93 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
0 0 0 0 0 0 0  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Result: A list with an entry for each chromosome ('space') with the number of reads mapped for each gene in the chromosome.

ADD COMMENTlink modified 8 months ago by RamRS27k • written 9.6 years ago by Michael Dondrup47k

Thanks for the reply and the code, I will try this. Since posting I have tried a new strategy: a Perl script that calculates overlaps from the output of samtools view piped to stdin. The overlaps are calculated as the file is streamed, which has the advantage of using almost no memory and only taking as long as one pass over the file. It will be interesting to compare your method to my Perl hack. I like the idea of your method better, as I end up having to save the results of my Perl code to a file and read it into R anyway. Thanks again.

written 9.5 years ago by Stew (1.4k)
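The streaming idea Stew describes can be sketched in Python as a hypothetical stand-in for the Perl script: read SAM text (e.g. from `samtools view my.bam` on stdin) in one pass and tally overlaps per region. For simplicity the read span is taken naively from the SEQ field, ignoring CIGAR clipping and indels:

```python
from collections import defaultdict

def count_stream(sam_lines, regions):
    """Count reads overlapping each region in a single pass over SAM text.
    `regions` is a list of (chrom, start, end) tuples."""
    counts = defaultdict(int)
    # Index regions by chromosome so each read only checks its own chromosome.
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    for line in sam_lines:
        if line.startswith('@'):
            continue  # skip SAM header lines
        fields = line.rstrip('\n').split('\t')
        chrom, pos, seq = fields[2], int(fields[3]), fields[9]
        if chrom == '*':
            continue  # skip unmapped reads
        read_start, read_end = pos, pos + len(seq)  # naive span from SEQ
        for start, end in by_chrom[chrom]:
            if read_start <= end and read_end >= start:
                counts[(chrom, start, end)] += 1
    return counts
```

With many regions per chromosome, the inner linear scan should be replaced by a binary search over sorted region starts, but the one-pass, near-constant-memory shape is the same.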

I would assume that, comparing the two, 'my' solution might be faster while also using much more memory. This is due to the R philosophy of loading everything into memory. Also, make sure you have the latest Bioconductor; import.gff has been updated recently and is now more efficient.

written 9.5 years ago by Michael Dondrup (47k)

This is really good. Is there any way of getting the same result for Novoalign output, that is in the native novoalign format?

written 8.7 years ago by Sameet (290)

"import" is changing the order of chromosomes (not the same order in genes and reads) and I am getting incorrect overlaps. Is there any way to fix this? Thanks!

written 8.3 years ago by Kavitha (0)

Kavitha, please provide some evidence for that, so we can reproduce the problem.

written 8.3 years ago by Michael Dondrup (47k)

@Sameet, as long as you can convert the output of novoalign to SAM or BAM it is equally easy; isn't there a switch in novoalign that allows that? Otherwise, parsing the textual output into a GRanges object should work.

written 8.3 years ago by Michael Dondrup (47k)

I used the exact same code provided; however, when I import the GFF file it reorders the spaces (genes) to follow this order:

names(genes) = "chr1" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chrX" "chrY"

But my BAM file reads are in this order:

names(reads) = "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chrX" "chrY"

Upon doing countOverlaps, I get correct overlap counts for chr1, chrX and chrY; the others are wrong.

modified 8 months ago by RamRS (27k) • written 8.3 years ago by Kavitha (0)

Here is the snippet of the output

CompressedIntegerList of length 21
[["chr1"]] 84 54 162 65 38 42 142 95 185 31 91 49 67 92 33 53 92 80 22 203 40 85 19 49 43 19 49 21 ... 142 711 140 365 44 119 74 202 144 85 227 67 32 44 79 125 50 63 39 53 92 473 402 168 75 45 51
[["chr10"]] 2 2 1 4 2 40 4 2 5 12 5 10 5 4 1 1 5 3 7 0 2 3 2 3 1 1 1 1 1 0 0 6 3 3 6 3 0 2 2 2 0 2 0 4 ... 1 4 3 2 4 4 3 9 1 1 5 2 1 0 0 2 4 2 0 2 2 2 1 0 2 0 3 1 4 2 2 2 3 3 3 4 1 6 1 1 1 0 0 0
modified 8 months ago by RamRS (27k) • written 8.3 years ago by Kavitha (0)

It has been fixed. The RangedData object (genes) was reordered to match the read names using the command genes[names(reads)].

modified 8 months ago by RamRS (27k) • written 8.3 years ago by Kavitha (0)
Sean Davis (26k) wrote, 9.6 years ago, from the National Institutes of Health, Bethesda, MD:

Using a BAM file as the source of data will be efficient for some tasks, but for others it is perhaps more efficient to read in a larger chunk of the file and process it in memory. For example, you may want to consider processing all the reads and regions of each chromosome in memory:

  1. Read in all regions
  2. Read in large sections of bamfile (chromosome at a time)
  3. Process all regions against the section read in step 2
  4. Repeat 2-3 for all chromosomes

Particularly since you are talking about pileups, these are computed efficiently by the IRanges operations in R for a whole chromosome, and you can then use Views to extract your regions of interest.

I would think that this method would be largely independent of the number of regions and, instead, would depend on the number of reads.
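The steps above can be sketched as a small driver loop. Here `read_fetcher` is a hypothetical callable standing in for whatever loads one chromosome's reads at a time (scanBam, pysam, etc.):

```python
from collections import defaultdict

def process_by_chromosome(read_fetcher, regions):
    """Per-chromosome strategy: load one chromosome's reads, then resolve
    every region on that chromosome in memory before moving on.
    `read_fetcher(chrom)` returns a list of (start, end) read intervals."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:           # step 1: all regions, grouped
        by_chrom[chrom].append((start, end))
    results = {}
    for chrom, chrom_regions in by_chrom.items():
        reads = read_fetcher(chrom)             # step 2: one chromosome of reads
        for start, end in chrom_regions:        # step 3: its regions, in memory
            n = sum(1 for s, e in reads if s < end and e > start)
            results[(chrom, start, end)] = n
    return results                              # step 4: loop covers all chroms
```

The work per chromosome is bounded by the reads on that chromosome, which is what makes the run time depend on read count rather than region count.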

modified 9.6 years ago • written 9.6 years ago by Sean Davis (26k)
Pierre Lindenbaum (128k) wrote, 9.6 years ago, from France/Nantes/Institut du Thorax - INSERM UMR1087:

I'm going to answer the first part of your question by using the picard library. The following java program reads a list of chrom/start/end values from stdin or from a file and outputs the number of reads in each region:


import java.io.*;
import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.SAMFileReader.ValidationStringency;
import net.sf.samtools.util.CloseableIterator;

public class BioStar3414 {
    private File bamFile = null;
    private SAMFileReader inputSam = null;
    // false: the returned SAMRecords need only overlap the interval of interest
    private boolean containsbamRecord = false;

    public void open(File bamFile) {
        this.bamFile = bamFile;
        this.inputSam = new SAMFileReader(this.bamFile);
        this.inputSam.setValidationStringency(ValidationStringency.SILENT);
    }

    public void close() {
        if (this.inputSam != null) this.inputSam.close();
    }

    private int scan(String chromosome, int start, int end) throws IOException {
        int nCount = 0;
        CloseableIterator<SAMRecord> iter = null;
        try {
            // query the BAM index for reads overlapping the interval
            iter = this.inputSam.query(chromosome, start, end, this.containsbamRecord);
            while (iter.hasNext()) {
                iter.next();
                ++nCount;
            }
            return nCount;
        } catch (Exception e) {
            throw new IOException(e);
        } finally {
            if (iter != null) iter.close();
        }
    }

    public void run(BufferedReader in) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) continue;
            String tokens[] = line.split("[\t]");
            String chrom = tokens[0];
            int start = Integer.parseInt(tokens[1]);
            int end = Integer.parseInt(tokens[2]);
            int count = scan(chrom, start, end);
            System.out.println(chrom + "\t" + start + "\t" + end + "\t" + count);
        }
    }

    public static void main(String[] args) throws IOException {
        File bamFile = null;
        int optind = 0;
        while (optind < args.length && args[optind].equals("-bam")) {
            bamFile = new File(args[++optind]);
            ++optind;
        }
        BioStar3414 app = new BioStar3414();
        app.open(bamFile);
        if (optind == args.length) {
            // read chrom/start/end triplets from stdin
            app.run(new BufferedReader(new InputStreamReader(System.in)));
        } else {
            for (int i = optind; i < args.length; ++i) {
                BufferedReader r = new BufferedReader(new FileReader(args[i]));
                app.run(r);
                r.close();
            }
        }
        app.close();
    }
}

javac -cp sam-1.16.jar:picard-1.16.jar BioStar3414.java


echo -e "chr1\t1000000\t2000000\nchr1\t2000000\t3000000\n" |\
  java -cp sam-1.16.jar:picard-1.16.jar:. BioStar3414  -bam my.sorted.bam
chr1    1000000    2000000    39604
chr1    2000000    3000000    14863
modified 8 months ago by RamRS (27k) • written 9.6 years ago by Pierre Lindenbaum (128k)
Ian (5.6k) wrote, 9.6 years ago, from the University of Manchester, UK:

You might find bedtools useful. You can convert BAM to BED or work with the BAM file directly. There are many tools for performing intersections, calculating coverage statistics, etc.

But this would be outside of R!

written 9.6 years ago by Ian (5.6k)
Alex Reynolds (30k) wrote, 6.1 years ago, from Seattle, WA USA:

With BEDOPS, the bedmap tool can be used to report the answer to all three questions in one command. Also, if you have access to a computational grid, BEDOPS can split work into smaller, per-node tasks very easily. This answer will review two approaches: one serial, the other parallel.

Let's assume you have a sorted BED file containing your regions-of-interest, called Regions.bed.

Let's also assume that you have run a pileup operation on your BAM file. This operation converts reads to piles of tags over a given window size, formatted either as a BED file, or a highly compressed BED file in a format called Starch. Windows of bins may or may not overlap - this is up to you. We have a writeup here that explains how to collapse reads to tag counts over windows. Let's say the output of this pileup step is a compressed BED file called BinnedReads.starch.
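As a rough illustration of the binning step (not BEDOPS itself, just a stand-in showing the shape of the data), read start positions can be collapsed into per-window tag counts:

```python
from collections import Counter

def bin_reads(read_starts, window_size):
    """Collapse read start positions into per-window tag counts, yielding
    (window_start, window_end, count) rows ordered by position - a plain
    equivalent of the BED/Starch pileup described above, with
    non-overlapping windows."""
    counts = Counter(pos // window_size for pos in read_starts)
    return [(w * window_size, (w + 1) * window_size, n)
            for w, n in sorted(counts.items())]
```

Each output row corresponds to one bin element that bedmap would then count or compare against your regions of interest.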

To display the count of binned reads over each region of interest, along with the bin element with the highest read count, add the --count and --max-element operators to the following bedmap operation:

$ bedmap --echo --count --max-element Regions.bed BinnedReads.starch > Answer.bed

The file Answer.bed will be a sorted BED file containing:

[ region-1 ] | [ count-of-binned-reads-over-reg-1 ] | [ maximum-sized-bin-over-reg-1 ]
[ region-2 ] | [ count-of-binned-reads-over-reg-2 ] | [ maximum-sized-bin-over-reg-2 ]
...
[ region-n ] | [ count-of-binned-reads-over-reg-n ] | [ maximum-sized-bin-over-reg-n ]

Sorted inputs will make this operation very fast.

Even further, you can parallelize this operation very easily with bedmap --chrom and bedextract --list-chr.

Let's say you have a Sun Grid Engine environment. Here is some bash-like pseudocode to explain how you might use BEDOPS to split tasks up by chromosome:

for chromosomeName in `bedextract --list-chr Regions.bed`; do \
    qsub <options> bedmap --chrom ${chromosomeName} --echo --count --max-element Regions.bed BinnedReads.starch > Answer.${chromosomeName}.bed; \
done

qsub -hold_jid <list_of_per_chrom_bedmap_job_names> <options> bedops --everything Answer.*.bed > Answer.bed

This map-reduce approach splits or maps work into per-chromosome jobs, where the final job reduces or concatenates all the results into one file. This shortens the time taken to perform your analysis to the time cost of the chromosome with the largest total region space - BEDOPS can make this task an order of magnitude faster, using distributed computation.

modified 8 months ago by RamRS (27k) • written 6.1 years ago by Alex Reynolds (30k)
Powered by Biostar version 2.3.0