How To Extract Reads From Bam That Overlap With Specific Regions?
6
18
Entering edit mode
10.9 years ago
Stew ★ 1.4k

This question is related to this one, but I would like to know if anyone knows of any method for quickly extracting reads from a BAM file that overlap a list of many regions (e.g. a BED file), with the reads returned separately for each region.

My eventual requirements are that for each region individually, I would like to know:

1. The number of overlapping reads
2. The location of the highest pileup
3. The height of the highest pileup

However, I am happy to calculate these myself; I just need a tool that returns the reads for each region in a form I can parse.
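Once the reads for a region are in hand, the three statistics are straightforward to compute. Here is a minimal pure-Python sketch using a coverage array (the read and region coordinates are made up for illustration):

```python
def region_stats(region_start, region_end, reads):
    """Count reads overlapping [region_start, region_end) and find the
    position and height of the highest pileup via a simple coverage array."""
    coverage = [0] * (region_end - region_start)
    n_overlapping = 0
    for read_start, read_end in reads:  # half-open intervals
        if read_start < region_end and read_end > region_start:
            n_overlapping += 1
            for pos in range(max(read_start, region_start),
                             min(read_end, region_end)):
                coverage[pos - region_start] += 1
    height = max(coverage) if coverage else 0
    location = region_start + coverage.index(height)  # leftmost maximum
    return n_overlapping, location, height

# three toy reads over region [100, 110)
print(region_stats(100, 110, [(98, 104), (101, 106), (105, 112)]))  # (3, 101, 2)
```

A real implementation would of course take read intervals straight from the BAM parser rather than tuples.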

I am currently using Rsamtools, which does everything I require: it returns a list, with each element containing the reads that overlap one region. However, it is really slow for a large number of regions. Its run time depends on the number of regions and is largely independent of the number of reads in the BAM. I am currently using Rsamtools like this:

which <- RangedData(space=bed[,1],IRanges(bed[,2],bed[,3]))
param <- ScanBamParam(which=which,what=c("pos"))
bam <- scanBam(file=bamFile,param=param)

I can split the list of regions, run many parallel jobs, and merge at the end; however, I first wanted to check whether there is another tool that is quicker for this task.

I have looked at intersectBed, but I cannot get it to return reads in a way that I can easily parse into per-region information. The run time of this method depends on the number of reads in the BAM and is largely independent of the number of regions, which would suit my needs.
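For what it's worth, per-region groupings can be recovered from intersectBed by running it on bamToBed output with -wa -wb and grouping the lines yourself. A sketch (assuming 6-column read BED records, so the region starts at column 7; the exact layout depends on your inputs):

```python
from collections import defaultdict

def reads_per_region(lines):
    """Group read intervals by region from the tab-delimited output of
    `intersectBed -a reads.bed -b regions.bed -wa -wb`, assuming the
    reads are 6-column BED (from bamToBed) and the region follows."""
    by_region = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        read = (fields[0], int(fields[1]), int(fields[2]))
        region = (fields[6], int(fields[7]), int(fields[8]))
        by_region[region].append(read)
    return by_region

# toy intersectBed output: read columns 1-6, region columns 7-9
demo = [
    "chr1\t100\t136\tread1\t255\t+\tchr1\t90\t200",
    "chr1\t150\t186\tread2\t255\t-\tchr1\t90\t200",
    "chr2\t10\t46\tread3\t255\t+\tchr2\t5\t60",
]
groups = reads_per_region(demo)
print(len(groups[("chr1", 90, 200)]))  # 2
```

Each value in the returned dict is the list of reads for one region, ready for the counting and pileup calculations above.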

next-gen sequencing bam bed coverage read • 28k views
0

I coded this in python using pysam. For 166k regions and a bamfile with 100 million alignments, runtime was 412 seconds to count the number of reads in each region.

0

Hi, it's slow because scanBam with param is not meant to be called as repeatedly as you are doing. For many regions it's advisable to use countOverlaps.

0

The java code by Pierre is incomplete. Hi Pierre, is there any chance that you could post the entire java code? Thx!

8
10.9 years ago

Hi Stew,

As Sean suggested, you should take advantage of the efficient overlap computation in the IRanges package. What is slowing down your computation is probably the post-processing of the list you are making, not the BAM parsing itself. I'm going to provide a little recipe here.

Please have a look at the documentation of countOverlaps and Views in particular. Also note that you can import BED files using the rtracklayer package.

I will edit this answer as soon as I find some more time to test the examples.

Edit: I promised some recipes in R so here is the first one.

Ingredients:

• gff file with the genomic regions
• bam file with the aligned reads
• fairly recent R & bioconductor with libraries: ShortRead, rtracklayer + dependencies

library(rtracklayer)
genes = import("~/bsubtilis.gff")
# easiest way to make a ranged data object out of the reads
# ('reads' below is assumed to be such an object, e.g. built from a scanBam result)
# make sure the chromosome names in reads and genes are identical
names(genes)
# otherwise you will get an error
counts = countOverlaps(genes, reads)
counts
CompressedIntegerList of length 1
[["AL009126.3"]] 93 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0

Result: A list with an entry for each chromosome ('space') with the number of reads mapped for each gene in the chromosome.

0

Thanks for the reply and the code, I will try this. Since posting, I have tried a new strategy: a Perl script that calculates overlaps from the output of samtools view piped to stdin. The overlaps are calculated as the file is streamed, which has the advantage of using almost no memory and taking only as long as it takes to view the file once. It will be interesting to compare your method to my Perl hack. I like the idea of your method better, as I end up having to save the results of my Perl code to a file and read it into R anyway. Thanks again.
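The streaming idea can be sketched in a few lines. This pure-Python stand-in for the Perl script makes one pass over SAM lines (as produced by samtools view) and tallies overlaps per region; the linear scan over regions per read is for brevity only:

```python
def count_stream(sam_lines, regions):
    """One pass over SAM-formatted lines, counting reads that overlap each
    (chrom, start, end) region. The read end is approximated from the
    sequence length rather than the CIGAR string, for simplicity."""
    counts = {region: 0 for region in regions}
    for line in sam_lines:
        if line.startswith("@"):  # skip any header lines
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, seq = fields[2], int(fields[3]), fields[9]
        start, end = pos, pos + len(seq) - 1  # 1-based, inclusive
        for region in regions:
            r_chrom, r_start, r_end = region
            if chrom == r_chrom and start <= r_end and end >= r_start:
                counts[region] += 1
    return counts

# two toy SAM records (5 bp reads at chr1:100 and chr1:300)
demo = ["r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
        "r2\t0\tchr1\t300\t60\t5M\t*\t0\t0\tACGTA\tIIIII"]
counts = count_stream(demo, [("chr1", 90, 110), ("chr1", 500, 600)])
print(counts)  # {('chr1', 90, 110): 1, ('chr1', 500, 600): 0}
```

In the real pipeline you would pass sys.stdin instead of the demo list, e.g. `samtools view my.bam | python count_stream.py regions.bed`.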

0

I would assume that when comparing both ways, 'my' solution might be faster while also using much more memory. This is due to the R philosophy of loading everything into memory. Also, make sure you have the latest Bioconductor. import.gff has been updated recently and is now more efficient.

0

This is really good. Is there any way of getting the same result for Novoalign output, that is in the native novoalign format?

0

"import" is changing the order of chromosomes (so they are not in the same order in genes and reads) and I am getting incorrect overlaps. Is there any way to fix this? Thanks!

0

Kavita, please provide some evidence for that, then we can reproduce the problem.

0

@Sameet, as long as you can convert the output of novoalign to SAM or BAM it is equally easy, isn't there a switch in novoalign that allows that? Otherwise, parsing the textual output into a GRanges object should work.

0

I used the exact same code provided; however, when I import the GFF file it reorders the spaces (genes) to follow this order:

names(genes) = "chr1" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chrX" "chrY"

But my BAM file reads are in this order:

names(reads) = "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chrX" "chrY"

Upon doing countOverlaps, I get correct overlap counts for chr1, chrX and chrY; the others are wrong.

0

Here is the snippet of the output

CompressedIntegerList of length 21
[["chr1"]] 84 54 162 65 38 42 142 95 185 31 91 49 67 92 33 53 92 80 22 203 40 85 19 49 43 19 49 21 ... 142 711 140 365 44 119 74 202 144 85 227 67 32 44 79 125 50 63 39 53 92 473 402 168 75 45 51
[["chr10"]] 2 2 1 4 2 40 4 2 5 12 5 10 5 4 1 1 5 3 7 0 2 3 2 3 1 1 1 1 1 0 0 6 3 3 6 3 0 2 2 2 0 2 0 4 ... 1 4 3 2 4 4 3 9 1 1 5 2 1 0 0 2 4 2 0 2 2 2 1 0 2 0 3 1 4 2 2 2 3 3 3 4 1 6 1 1 1 0 0 0
0

It has been fixed. The RangedData object (genes) was reordered to match the read names using the command genes[names(reads)]

6
10.9 years ago

Using a BAM file as the source of data will be efficient for some tasks, but for others it may be more efficient to read in a larger chunk of the file and process it in memory. For example, you may want to consider processing all the reads and regions of each chromosome in memory:

1. Read in all regions
2. Read in a large section of the BAM file (one chromosome at a time)
3. Process all regions for the section read in step (2)
4. Repeat steps 2-3 for all chromosomes

Particularly since you are talking about pileups, etc., these are going to be done efficiently by the IRanges operations in R for a whole chromosome and then you can use Views to extract your regions of interest.

I would think that this method would be largely independent of the number of regions and, instead, would depend on the number of reads.
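The steps above can be sketched in pure Python. The data layout (dicts of half-open intervals keyed by chromosome) is invented for illustration; in R the coverage and slicing steps would be done by IRanges coverage() and Views:

```python
def per_region_pileups(reads_by_chrom, regions_by_chrom, chrom_lengths):
    """Per chromosome, build a whole-chromosome coverage vector from the
    reads, then take a 'view' of it for each region to get the pileup
    peak position and height."""
    results = {}
    for chrom, length in chrom_lengths.items():
        coverage = [0] * length
        for start, end in reads_by_chrom.get(chrom, []):  # half-open intervals
            for pos in range(start, end):
                coverage[pos] += 1
        for start, end in regions_by_chrom.get(chrom, []):
            view = coverage[start:end]      # the region's slice of the chromosome
            height = max(view)
            results[(chrom, start, end)] = (start + view.index(height), height)
    return results

# two overlapping toy reads, one region covering the whole toy chromosome
demo = per_region_pileups({"chr1": [(2, 6), (4, 8)]},
                          {"chr1": [(0, 10)]},
                          {"chr1": 10})
print(demo)  # {('chr1', 0, 10): (4, 2)}
```

Because the coverage vector is built once per chromosome, the cost scales with the number of reads, not the number of regions, as the answer suggests.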

4
10.9 years ago

I'm going to answer the first part of your question using the Picard library. The following Java program reads a list of chrom/start/end values from stdin or from a file and outputs the number of reads in each region:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;

import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.util.CloseableIterator;

public class BioStar3414
    {
    private File bamFile=null;
    private SAMFileReader inputSam=null;
    private boolean containsbamRecord=false;//false: the alignment of the returned SAMRecords need only overlap the interval of interest.

    public BioStar3414()
        {
        }

    public void open(File bamFile)
        {
        close();
        this.bamFile=bamFile;
        this.inputSam=new SAMFileReader(this.bamFile);
        this.inputSam.setValidationStringency(SAMFileReader.ValidationStringency.SILENT);
        }

    public void close()
        {
        if(inputSam!=null)
            {
            inputSam.close();
            }
        this.inputSam=null;
        this.bamFile=null;
        }

    private int scan(String chromosome,int start,int end) throws IOException
        {
        int nCount=0;
        CloseableIterator<SAMRecord> iter=null;
        try
            {
            iter=this.inputSam.query(chromosome, start, end, this.containsbamRecord);
            while(iter.hasNext())
                {
                iter.next();
                ++nCount;
                }
            return nCount;
            }
        catch (Exception e)
            {
            throw new IOException(e);
            }
        finally
            {
            if(iter!=null) iter.close();
            }
        }

    public void run(BufferedReader in) throws IOException
        {
        String line;
        while((line=in.readLine())!=null)
            {
            if(line.isEmpty()) continue;
            String tokens[]=line.split("[\t]");
            String chrom=tokens[0];
            int start=Integer.parseInt(tokens[1]);
            int end=Integer.parseInt(tokens[2]);
            int count=scan(chrom,start,end);
            System.out.println(chrom+"\t"+start+"\t"+end+"\t"+count);
            }
        }

    public static void main(String[] args)
        {
        File bamFile=null;
        BioStar3414 app=null;
        try
            {
            app=new BioStar3414();
            int optind=0;
            while(optind < args.length && args[optind].startsWith("-"))
                {
                if(args[optind].equals("-bam"))
                    {
                    bamFile=new File(args[++optind]);
                    }
                else
                    {
                    System.err.println("Unknown option: "+args[optind]);
                    return;
                    }
                ++optind;
                }
            if(bamFile==null)
                {
                System.err.println("Usage: BioStar3414 -bam file.sorted.bam < regions.bed");
                return;
                }
            app.open(bamFile);
            if(optind==args.length)
                {
                app.run(new BufferedReader(new InputStreamReader(System.in)));
                }
            else
                {
                for(;optind < args.length;++optind)
                    {
                    BufferedReader in=new BufferedReader(new FileReader(args[optind]));
                    app.run(in);
                    in.close();
                    }
                }
            }
        catch(Exception e)
            {
            e.printStackTrace();
            }
        finally
            {
            if(app!=null) app.close();
            }
        }
    }

Compilation:

javac -cp sam-1.16.jar:picard-1.16.jar BioStar3414.java

Execute

echo -e "chr1\t1000000\t2000000\nchr1\t2000000\t3000000\n" |\
java -cp sam-1.16.jar:picard-1.16.jar:. BioStar3414  -bam my.sorted.bam
chr1    1000000    2000000    39604
chr1    2000000    3000000    14863
2
10.9 years ago
Ian 5.8k

You might find bedtools useful http://code.google.com/p/bedtools/. You can convert BAM to BED or work with the BAM file directly. There are many tools for performing intersects, calculating coverage stats, etc.

But this would be outside of R!

2
7.5 years ago

With BEDOPS, the bedmap tool can be used to report the answer to all three questions in one command. Also, if you have access to a computational grid, BEDOPS can split work into smaller, per-node tasks very easily. This answer will review two approaches: one serial, the other parallel.

Let's assume you have a sorted BED file containing your regions-of-interest, called Regions.bed.

Let's also assume that you have run a pileup operation on your BAM file. This operation converts reads to piles of tags over a given window size, formatted either as a BED file, or a highly compressed BED file in a format called Starch. Windows or bins may or may not overlap; this is up to you. We have a writeup here that explains how to collapse reads to tag counts over windows. Let's say the output of this pileup step is a compressed BED file called BinnedReads.starch.

To display the count of binned reads over each region of interest, along with the bin element with the highest read count, add the --count and --max-element operators to the following bedmap operation:

$ bedmap --echo --count --max-element Regions.bed BinnedReads.starch > Answer.bed

The file Answer.bed will be a sorted BED file containing:

[ region-1 ] | [ count-of-binned-reads-over-reg-1 ] | [ maximum-sized-bin-over-reg-1 ]
[ region-2 ] | [ count-of-binned-reads-over-reg-2 ] | [ maximum-sized-bin-over-reg-2 ]
...
[ region-n ] | [ count-of-binned-reads-over-reg-n ] | [ maximum-sized-bin-over-reg-n ]

Sorted inputs will make this operation very fast. Even further, you can parallelize this operation very easily with bedmap --chrom and bedextract --list-chr. Let's say you have a Sun Grid Engine environment. Here is some bash-like pseudocode to explain how you might use BEDOPS to split tasks up by chromosome:

for chromosomeName in `bedextract --list-chr Regions.bed`; do \
    qsub <options> bedmap --chrom ${chromosomeName} --echo --count --max-element Regions.bed BinnedReads.starch > Answer.${chromosomeName}.bed; \
done

qsub -hold_jid <list_of_per_chrom_bedmap_job_names> <options> bedops --everything Answer.*.bed > Answer.bed

This map-reduce approach splits or maps work into per-chromosome jobs, where the final job reduces or concatenates all the results into one file. This shortens the time taken to perform your analysis to the time cost of the chromosome with the largest total region space - BEDOPS can make this task an order of magnitude faster, using distributed computation.