Counting read spacing efficiently
9.3 years ago

I'm looking for the most efficient way (or at least one more efficient than mine) to count distances between reads in a BAM file, as in the following figure (Winter et al. 2013, Genome Research).

[Figure: distribution of distances between reads, from Winter et al. 2013, Genome Research]

Since I'm only interested in distances smaller than 200 bp, my current attempt takes a window around each unique read and computes the absolute distance of every read in that window to the window center.

Although I've embedded this in parallel code, it is still quite inefficient.

Any ideas on how to improve this?

Here's a gist with my code:
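
Roughly, the approach described above could be sketched with pysam like this (a sketch only, not the actual gist, which uses HTSeq; the 200 bp cutoff and all names are placeholders):

from collections import Counter
import pysam

MAX_DIST = 200  # only spacings up to this many bp are of interest
counts = Counter()

# two handles so the inner region fetch doesn't disturb the outer iterator
outer = pysam.AlignmentFile("reads.bam", "rb")
inner = pysam.AlignmentFile("reads.bam", "rb")
for read in outer.fetch():
    if read.is_unmapped:
        continue
    center = read.reference_start
    window_start = max(0, center - MAX_DIST)
    # every read overlapping the window contributes its absolute distance
    # to the window center
    for other in inner.fetch(read.reference_name, window_start, center + MAX_DIST):
        dist = abs(other.reference_start - center)
        if 0 < dist <= MAX_DIST:
            counts[dist] += 1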

sequencing alignment DNase-seq parallel • 2.1k views

Are your BAM files sorted? If so, the whole process can be done in a much more efficient manner. Even if they're not sorted, it could well be faster to sort them and then compute what you need.


Yes, they're sorted and indexed. HTSeq uses the index when I do bam[window] to retrieve the reads in that window. However, I suspect I should take more advantage of the sorted/indexed BAM to count the distances. How would I go about doing that?


I wouldn't bother with HTSeq at all. Just use pysam to iterate over each alignment. You only need to know (A) the bounds of each alignment and (B) the bounds of the alignments before/after it. From that, you can determine the distance to the nearest alignment and increment a counter vector accordingly. This should require only a single pass through the file and only a few bytes of memory.

The slowness of your current method comes from storing each window and then checking, for each alignment, whether its window is already stored. That's a huge number of comparisons that simply don't need to be made.
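
A minimal sketch of that single-pass idea with pysam (assuming a coordinate-sorted, indexed BAM, distances measured between read start positions, and a 200 bp cap; untested, names are placeholders):

import numpy as np
import pysam

MAX_DIST = 200
counts = np.zeros(MAX_DIST + 1, dtype=int)  # counts[d]: reads whose nearest neighbour is d bp away

def tally(dist):
    if dist is not None and dist <= MAX_DIST:
        counts[dist] += 1

prev_chrom, prev_start, prev_gap = None, None, None  # state for the previous alignment
bam = pysam.AlignmentFile("reads.bam", "rb")
for read in bam.fetch():  # reads come back in coordinate order
    if read.is_unmapped:
        continue
    chrom, start = read.reference_name, read.reference_start
    if chrom == prev_chrom:
        gap = start - prev_start
        # the previous read's nearest neighbour is the smaller of its gaps
        # to its predecessor and to the current read
        tally(gap if prev_gap is None else min(gap, prev_gap))
        prev_gap = gap
    else:
        tally(prev_gap)  # last read of the previous chromosome
        prev_gap = None
    prev_chrom, prev_start = chrom, start
tally(prev_gap)  # last read of the final chromosome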


Counting the distance to the nearest read (for all reads) is obviously the simplest way of doing it. However, you'd lose information about other reads that fall within a window of the distance of interest (~200 bp), and I'm not sure that's how the figure above was made.

I think I'll do it in this simpler way and see how it compares. Thanks for the input.


Ah, yeah, if you want a count of all alignments within a window around all other alignments, then your implementation isn't that bad. You could make it more efficient in a manner similar to what I previously suggested, but honestly the time it'd take to write and debug that is probably longer than the time required to run what you already have. So unless you have to do this a lot, or want to publish it as part of something else, you probably already have a good-enough solution.

9.3 years ago

For future reference (in case anyone faces the same problem) here's an improved version:

  • Make genome-wide windows of the required width (400 bp in my case).
  • Get pairwise combinations of reads in each window (making use of the indexed BAM file) and count the absolute distances between them.

Subsampling windows or reads for a first look is of course reasonable, since the data start to look like the figure above at somewhere around 2-10 million reads.

Code is here:
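
The linked code isn't reproduced above; a rough sketch of the two steps (pysam instead of HTSeq; the 400 bp window and all names are placeholders, not the actual code) could look like:

from collections import Counter
from itertools import combinations
import pysam

WINDOW = 400
counts = Counter()

bam = pysam.AlignmentFile("reads.bam", "rb")
for chrom, length in zip(bam.references, bam.lengths):
    # genome-wide, non-overlapping windows of the required width
    for win_start in range(0, length, WINDOW):
        starts = [r.reference_start
                  for r in bam.fetch(chrom, win_start, min(win_start + WINDOW, length))
                  if not r.is_unmapped]
        # absolute pairwise distances between reads in the window
        for a, b in combinations(starts, 2):
            counts[abs(a - b)] += 1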

9.3 years ago

Here's another approach that uses parallelization and sorting (and then more parallelization!). Splitting the work up into smaller and more efficient pieces should make this job much, much faster.

If your BAM file is indexed and you have an SGE or OGE computational grid, you could parallelize conversion of BAM to sorted BED:

$ bam2bed_sge reads.bam reads.bed

If you prefer to use GNU Parallel, there is also bam2bed_gnuParallel.
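
Its invocation should mirror the SGE version (I haven't re-checked the wrapper's exact arguments):

$ bam2bed_gnuParallel reads.bam reads.bed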

Splitting the conversion work by chromosome generally reduces the execution time to that of the chromosome with the most reads.

Then look for the closest read to each read and only report distances:

$ closest-features --closest --dist --no-ref --no-overlaps reads.bed reads.bed > dists.txt

The --no-ref and --no-overlaps options leave out the read element itself and report only the closest signed distance between a read and its nearest read, up- or downstream.

Sorting with BEDOPS eliminates wasteful pairwise comparisons between all reads and makes this task fast.

You can again parallelize this work with bedextract and the addition of the --chrom operator to closest-features. Again, supposing an SGE computational grid:

for chrName in `bedextract --list-chr reads.bed`; do \
    qsub <grid_options> -cwd -b y -o dists.${chrName}.txt closest-features --chrom ${chrName} --closest --dist --no-ref --no-overlaps reads.bed reads.bed; \
done

(The -b y option submits the command directly without a wrapper script, and -o writes the job's standard output to the per-chromosome distances file; a plain shell redirection would capture qsub's own output rather than the job's.)

As with the BAM conversion step, this generally reduces the search time to that of the chromosome with the most reads.

Then process dists.txt (or each per-chromosome distances file, if parallelized) to filter out distances larger than some threshold (say, 200 bp), remove the sign, and report, for example, the mean of the remaining distances:

$ awk '
    function abs(x) {
        return ((x < 0.0) ? -x : x)
    }
    BEGIN {
        s = 0;
        n = 0;
    }
    {
        # keep only distances within the 200 bp threshold, sign removed
        if (abs($0) < 200) { s += abs($0); n++; }
    }
    END {
        if (n > 0) { print s / n; } else { print "NaN"; }
    }' dists.txt