Question: Intersecting A Set Of Bam Reads With A Set Of Coordinates
5
gravatar for User 9996
9.8 years ago by
User 9996810
User 9996810 wrote:

I have a set of regions in the genome (about 4 million regions, parametrized by start and end genomic cooridnates) that I want to align BAM reads into. I want to get all reads that overlap that set of coordinates.

The naive way to do this is very slow and I'd like to use interval trees. However, to use interval trees, I'd have to first index all the BAM reads into a tree, and then, for each coordinate in my set of regions, look for all the reads within the tree.

My concern is that this requires loading all of the BAM reads into memory since I have to index them into the interval tree data structure.

Is there a way around this, or a Python library that does this already? Any help would be appreciated.

Thanks.

p.s. I'd prefer to do this within Python and not use BEDTools unless there's an easy way to integrate Python with BEDTools that someone can recommend?

python next-gen sam bam sequencing • 4.8k views
ADD COMMENTlink written 9.8 years ago by User 9996810
1

Can you explain why you need an interval tree? If you have an aligned BAM file, you can sort and index it, then retrieve reads in a region using 'fetch' in pysam: http://code.google.com/p/pysam/

ADD REPLYlink written 9.8 years ago by Brad Chapman9.5k

fetch will be horribly inefficient. If you iterate through a list of coordinates and fetch the reads from each one, it will take forever.

ADD REPLYlink written 9.8 years ago by User 9996810
7
gravatar for Aaronquinlan
9.8 years ago by
Aaronquinlan11k
United States
Aaronquinlan11k wrote:

With plenty of help from Brent Pedersen, I added a small Python/Cython interface to the BEDTools overlap detection rotines called bedtools-python. Check out this brief example which combines pysam and bedtools-python for the task you describe. Also, Ryan Dale's pybedtools is a near complete Python interface to BEDTools.

ADD COMMENTlink modified 9.8 years ago • written 9.8 years ago by Aaronquinlan11k
1

Both python interfaces to bedtools are very useful! Thanks a lot for creating these tools.

ADD REPLYlink written 9.8 years ago by Istvan Albert ♦♦ 86k

to be clear, Ryan Dale deserves all of the credit for pybedtools. The initial release is quite good and he is working on substantial improvements.

ADD REPLYlink written 9.8 years ago by Aaronquinlan11k
5
gravatar for Casey Bergman
9.8 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

See pybedtools: https://github.com/daler/pybedtools

pybedtools is a Python wrapper for Aaron Quinlan's BEDtools and is designed to leverage the "genome algebra" power of BEDtools from within Python scripts.

ADD COMMENTlink written 9.8 years ago by Casey Bergman18k

I cannot get it to install. After running python setup.py install --home=$HOME, I try to "import pybedtools" and I get the error "File "/srv/pkg/python/python-2.6.4/lib/python2.6/subprocess.py", line 1126, in _execute_child raise child_exception"

ADD REPLYlink written 9.8 years ago by User 9996810

Is there more to the error message? Sometimes you can get this if a command can't be found, which could happen if bedtools isn't on the path. Most helpful would be if you pasted the full traceback either here or perhaps more appropriately in an email to the bedtools mailing list.

ADD REPLYlink modified 15 months ago by _r_am31k • written 9.8 years ago by Ryan Dale4.9k

For anyone else with the above error for pybedtools: make sure you've installed BEDTools and that it's available on your path

ADD REPLYlink written 9.8 years ago by Ryan Dale4.9k
4
gravatar for Martin Morgan
9.8 years ago by
Martin Morgan1.6k
United States
Martin Morgan1.6k wrote:

The question implies python, but R / Bioconductor has the IRanges and related packages. You might (1) input your reference coordinates with the rtracklayer package (for gff and similar formats) or GenomicFeatures::makeTranscriptDb (for UCSF / biomaRt tracks) or rolled by hand; (b) count overlaps between the regions and bam files using Rsamtools::countBam(), which relies on the definition of overlap in samtools. Alternatively you might use IRanges::countOverlaps or findOverlaps for more flexibility in what counts as an overlap. A worked example is in section three of the vignette "GenomicRanges Use Cases" with a few more appropriate to counting approaches sketched in "Counting Alignment Overlaps with GenomicRanges". R works with objects in memory; the reference ranges fit easily and the bam files are easy to process in reasonably sized chunks.

ADD COMMENTlink modified 15 months ago by _r_am31k • written 9.8 years ago by Martin Morgan1.6k
1
gravatar for Brad Chapman
9.8 years ago by
Brad Chapman9.5k
Boston, MA
Brad Chapman9.5k wrote:

To follow up more on my comment above, at some point you are going to have to extract the reads from your BAM file to push them into an IntervalTree or BED file. Instead of doing this extraction, then querying, you can just do the extraction directly from the indexed file. This is fairly speedy, for disk access. For instance, I extracted 4 million random 500bp positions from a human genome BAM and it takes about 2 hours:

% ls -1sh
total 3.7G
3.7G 6_110216_FC638NPAAXX-sort.bam
6.9M 6_110216_FC638NPAAXX-sort.bam.bai
4.0K fetch_regions.py
% python2.6 fetch_regions.py 6_110216_FC638NPAAXX-sort.bam 
6717.29s user 123.74s system 99% cpu 1:54:03.33 total

The code is below:

import sys
import random

import pysam

def random_regions(in_bam, size):
    regions = []
    for chrom in in_bam.header["SQ"]:
        regions.append((chrom["SN"], 1, int(chrom["LN"]) - size))
    while 1:
        space, start, end = random.choice(regions)
        pos = random.randint(start, end)
        yield space, pos, pos+size

in_file = sys.argv[1]
in_bam = pysam.Samfile(in_file, "rb")
size = 500
n = 4e6
region_gen = random_regions(in_bam, size)
for i in range(int(n)):
    if i % 10000 == 0:
        print i
    space, start, end  = region_gen.next()
    read_size = 0
    for read in in_bam.fetch(space, start, end):
        read_size += read.rlen

This is not to take away from the other excellent tools mentioned for quick overlap comparison with IntervalTrees. I use these all the time for intersection of regions, but for your question about dealing directly with reads it seems reasonable to utilize direct lookup.

(As an aside, you've been here for a while as 'unknown (google)' and asked several questions. Why not add your name and become part of the community?)

ADD COMMENTlink modified 15 months ago by _r_am31k • written 9.8 years ago by Brad Chapman9.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1124 users visited in the last hour