Quick Programming Challenge: How Do I Calculate Reference Coverage From A Table Of Alignment Starts And Ends?
8
7
Entering edit mode
14.2 years ago

Imagine I have a reference sequence and some query sequences align to it

length=|end-start|+1


when end < start the query sequence has aligned to the negative strand

query   start stop
query1   100   120
query2    50    99
query3   130   110


I want to know the total coverage of the reference sequence, ignoring overlaps, so the number of reference bases covered by one or more alignments.

In this case it is 21+50+10=81

Any language is fine. Assume the input is tab delimited.

Note:This is not an invitation to recommend your favorite third-party alignment tool - you must deal with the data as I have presented it.

code alignment programming • 7.2k views
1
Entering edit mode

This problem becomes a lot easier if the start stop are in the form of start<stop, (you'll then need a strand column of course) then in addition you sort the file by the start. That's what I would do first rather than come up with a more complex solution.

1
Entering edit mode

Is it sorted or not?

1
Entering edit mode

We need more of these challenges :)

0
Entering edit mode

that would be considered part of the solution

0
Entering edit mode

that could be considered part of the solution

0
Entering edit mode

not sorted -i'll make that clearer

0
Entering edit mode

Correction: "In this case it is 21+50+21=91" since |130-110|+1=21, not 10.

0
Entering edit mode

Correction: "In this case it is 21+50+21=92" since |130-110|+1=21, not 10.

6
Entering edit mode
14.2 years ago

Here is another approach in Python making few assumptions and using very little memory:

in_file = "queries.txt"
coverage = set()

with open(in_file) as f:
for line in f:
try:
query, v1, v2 = line.strip().split("\t")
v_min, v_max = sorted([int(v1), int(v2)])
except:
continue
coverage.update(range(v_min, v_max + 1)) # as suggested by brentp

print "Reference coverage:", len(coverage)


Here is what I get when I run it:

time python golf-course.py
Reference coverage: 81

real    0m0.026s
user    0m0.016s
sys     0m0.008ss


It should also be pretty 'Pythonic' :)

Cheers

0
Entering edit mode

that is probably the most readable yet. great job

0
Entering edit mode

pythonic would use coverage.update(xrange(v_min, vmax + 1)) instead of using a loop.

0
Entering edit mode

That's true, I made the change! Thanks @brentp

0
Entering edit mode

@eric upvoted, i like this solution. just to note that try/except/continue for all errors is going to make it hard to debug in the case of weird input.

0
Entering edit mode

@brentp Yes, I considered making 2 or 3 try/except cases, each with an possible 'error message' and that would have made it easier to debug. I just wanted a very concise and comprehensible code that did the trick, given the input data of this problem. Cheers

0
Entering edit mode

I don't use python but it's a very elegant code... +1

5
Entering edit mode
14.2 years ago
brentp 24k

Here's a quick solution, and should run pretty fast. you can easily make it work per reference as well, though here i just assume you have only a single reference as you suggest.

import numpy as np
import sys
# since we dont know the max len beforehand,
# just use a big number and then chop it later.
MAX_LEN = 3e8

coverage = np.zeros((MAX_LEN,), dtype=np.uint8)

for i, line in enumerate(open(sys.argv[1])):
# logic for 1st line here.
if i == 0: continue
line = line.rstrip().split()
ref = line[0]
start, end = sorted(map(int, line[1:]))

coverage[start - 1:end] |= 1

print len(coverage[coverage == 1])


and run from the commandline with your alignment file as the 1st argument.

1
Entering edit mode

@nuin: well, numpy is pretty usual for a bioinformatics program

0
Entering edit mode

btw, the above uses memory proportional to the size of the reference sequence, independent of the number and content of reads. i think that's important for NGS.

0
Entering edit mode

that's quite elegant

0
Entering edit mode

+1 How can one not love Python when it makes things so easy, beautiful, AND understandable!

0
Entering edit mode

+1 for elegance

0
Entering edit mode

... until you find out that numpy is not installed in your system ...

0
Entering edit mode

The program only consumes about 250MB memory yet it creates a 300 million long array - that surprised me; then I noticed the one byte long datatype, neat!

0
Entering edit mode

Ideally, imports should be limited to modules distributed with Python.

0
Entering edit mode

it is on a Mac ;-)

5
Entering edit mode
14.2 years ago
Malcolm.Cook ★ 1.5k

If you fancy perl, and have faith in the bazaar or are generally lazy, then install Set::IntSpan (a module I find generally useful for genomic interval hacking in Perl)

allowing you to write this one-liner:

perl -MSet::IntSpan -an -e 'BEGIN{$c=Set::IntSpan->new} next if$. == 1; $c += [[sort @F[1,2]]]; END{print$c->size}'

0
Entering edit mode

cool. Set::IntSpan looks like a nice abstraction. might look nicer on multiple lines. :-)

0
Entering edit mode

This works fine. Set::IntSpan allows us to avoid reinventing the wheel. The code is both clear and brief. It allows us to think in terms of the domain, rather than the implementation. Thank you!

0
Entering edit mode

Fine! Then you might also like its extension, Set::IntSpan::Island, especially if you're hacking over spliced transcripts / gapped alignments in perl.

0
Entering edit mode

Fine! Then you might also like its extension, Set::IntSpan::Island, especially if you're hacking over spliced transcripts / gapped alignments in perl.

0
Entering edit mode

Fine! Then especially if you're hacking over spliced transcripts / gapped alignments in perl you might also like its extension Set::IntSpan::Island

4
Entering edit mode
14.2 years ago

My C++ solution (does NOT need to store all the data in memory. You just need to keep the contigs.

Compiling/Testing

all:
g++ coverage.cpp
echo -e "query\tstart\tstop\nquery1\t100\t120\nquery2\t50\t99\nquery3\t130\t110" | ./a.out
time mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18  \
-e 'select name as "query",if(strand="-",txEnd+1,txStart) as "start",if(strand="-",txStart,txEnd+1) as "stop" from knownGene where chrom="chr1"' | ./a.out


Output:

g++ coverage.cpp
echo -e "query\tstart\tstop\nquery1\t100\t120\nquery2\t50\t99\nquery3\t130\t110" | ./a.out
coverage=81 pb
time mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18  \
-e 'select name as "query",if(strand="-",txEnd+1,txStart) as "start",if(strand="-",txStart,txEnd+1) as "stop" from knownGene where chrom="chr1"' | ./a.out
coverage=114657440 pb

real    0m3.403s
user    0m0.076s
sys     0m0.008s

0
Entering edit mode

@pierre, by "it does not need to store the results in memory", that's only for overlaps, right? if there are no overlaps, it will store everything, or am i misreading your code?

0
Entering edit mode

you're right. It just stores the contigs 'on the fly'. No need to (1) first load all the data (2) remove the overlaps

0
Entering edit mode

@brentp I fixed the sentence.

2
Entering edit mode
14.2 years ago
Bilouweb ★ 1.1k

I made it in C++ because it is my "everyday life" language ;)

The idea is :

1. read queries and swap if start > stop
2. sort queries based on "start"
3. merge queries if they overlap
4. sum the coverage of each query

#include <vector>
#include <iostream>
#include <algorithm>

using namespace std;

// Class to store start and stop positions
class Query
{
public :
int start;
int stop;
Query(int i, int j):start(i),stop(j){
if (start > stop)
swap();
}

void swap(){
int i(start);
start = stop;
stop = i;
}
};

// function to compare start positions of two queries
bool CompQuery (Query i,Query j){return (i.start < j.start);}

int main(int argc, char * argv)
{
// Create query tab with start and stop positions
vector<Query> tab;
vector<Query>::iterator it;
tab.push_back(Query(100,120));
tab.push_back(Query(130,110));
tab.push_back(Query(50,99));
tab.push_back(Query(165,140));

// sort query tab by increasing start positions
sort(tab.begin(), tab.end(), CompQuery);

// Merge Queries if they cover adjacent segments
// example : 50->99 and 100->120 are merged in 50->120
it = tab.begin() + 1;
int cover = 0;
while(it != tab.end()){
if ((*it).start <= (*(it-1)).stop + 1){
if ((*it).stop > (*(it-1)).stop){
(*(it-1)).stop = (*it).stop;
}
it = tab.erase(it);
} else {
cout << (*it).start<<","<<(*it).stop<< "\n";
cover += (*(it-1)).stop - (*(it-1)).start + 1;
it++;
}
}
cover += (*(it-1)).stop -(*(it-1)).start + 1;

// Print Coverage
cout << "\nCoverage = "<<cover<<"\n";
return 0;
}

0
Entering edit mode

The last two steps can be made in one loop

0
Entering edit mode

advice: swap is a standard function of C++, you could overload the '<' operator rather than declaring ComQuery, you could also avoid to put everything in memory

0
Entering edit mode

Yes, I saw this in your code. Good job !

0
Entering edit mode

Rennes ? Nantes :-)

0
Entering edit mode

The answer lies in Brittany ;-)

2
Entering edit mode
14.2 years ago

Here is a version I wrote using R and Bioconductor's IRanges

#!/usr/bin/Rscript
library("IRanges",warn.conflicts=FALSE)
args<-commandArgs(TRUE)
mytab$starts<-apply(mytab[,c(2,3)],1,min) mytab$stops<-apply(mytab[,c(2,3)],1,max)
myrange<-IRanges(start=mytab$starts,end=mytab$stops)
sum(width(reduce(myrange)))

1
Entering edit mode
14.2 years ago

I swear I've written this code before somewhere, I'll see if I can dig it up later.

In the meantime, try this:

• In one quick pass, flip any start/stop values and sort the file:

[?]

• Now, you step through the outfile, merging segments.

Rough algorithm:

• coverage = 0
• store the first segment in a buffer
• check to see if the current segment's start is <= old segment stop
• if so, update the oldSeg's stop position
• if not, add the length of the old segment to coverage, then set the oldSeg = currSeg
• at the end of the loop, clear the last segment out of the buffer and add it's length to coverage.
• coverage/referenceSize = coverage percentage
0
Entering edit mode

To be clear, other algs may be faster, but this requires almost no memory and scales to as many reads as you've got.

0
Entering edit mode

i think it would be great to see a ruby version

1
Entering edit mode
14.2 years ago
Yuri ★ 1.7k

Just for the collection, the code in MATLAB (without using any library/toolbox):

q = dlmread('queries.txt','\t',1,1); %# read the file
q = sort(q,2); %# start < end
qidx = ones(size(q)); %# start = 1
qidx(:,2) = -1; %# end = -1
[qsorted, si] = sort(q(:));
sidx = qidx(si);
gap = find(cumsum(sidx)==0); %# find start indices of gaps
boundaryidx = reshape(unique([1; gap; gap(1:end-1)+1]),2,[]);
coverage = sum(diff(qsorted(boundaryidx))+1);
disp(coverage)