Question: Quick Programming Challenge: How Do I Calculate Reference Coverage From A Table Of Alignment Starts And Ends?
7
8.0 years ago by
Jeremy Leipzig17k wrote:

Imagine I have a reference sequence and some query sequences align to it

length=|end-start|+1

when `end<start` the query sequence has aligned to the negative strand

``````query   start stop
query1   100   120
query2    50    99
query3   130   110
``````

I want to know the total coverage of the reference sequence, ignoring overlaps, so the number of reference bases covered by one or more alignments.

In this case it is 21+50+10=81

Any language is fine. Assume the input is tab delimited.

Note:This is not an invitation to recommend your favorite third-party alignment tool - you must deal with the data as I have presented it.

alignment code programming • 3.4k views
modified 8.0 years ago by Malcolm.Cook790 • written 8.0 years ago by Jeremy Leipzig17k
1

This problem becomes a lot easier if the start stop are in the form of start<stop, (you'll then need a strand column of course) then in addition you sort the file by the start. That's what I would do first rather than come up with a more complex solution.

1

Is it sorted or not?

1

We need more of these challenges :)

that would be considered part of the solution

that could be considered part of the solution

not sorted -i'll make that clearer

Correction: "In this case it is 21+50+21=91" since |130-110|+1=21, not 10.

Correction: "In this case it is 21+50+21=92" since |130-110|+1=21, not 10.

6
8.0 years ago by
Eric Normandeau9.7k
Eric Normandeau9.7k wrote:

Here is another approach in Python making few assumptions and using very little memory:

``````in_file = "queries.txt"
coverage = set()

with open(in_file) as f:
for line in f:
try:
query, v1, v2 = line.strip().split("\t")
v_min, v_max = sorted([int(v1), int(v2)])
except:
continue
coverage.update(range(v_min, v_max + 1)) # as suggested by brentp

print "Reference coverage:", len(coverage)
``````

Here is what I get when I run it:

``````time python golf-course.py
Reference coverage: 81

real    0m0.026s
user    0m0.016s
sys     0m0.008ss
``````

It should also be pretty 'Pythonic' :)

Cheers

that is probably the most readable yet. great job

pythonic would use coverage.update(xrange(v_min, vmax + 1)) instead of using a loop.

That's true, I made the change! Thanks @brentp

@eric upvoted, i like this solution. just to note that try/except/continue for all errors is going to make it hard to debug in the case of weird input.

@brentp Yes, I considered making 2 or 3 try/except cases, each with an possible 'error message' and that would have made it easier to debug. I just wanted a very concise and comprehensible code that did the trick, given the input data of this problem. Cheers

I don't use python but it's a very elegant code... +1

5
8.0 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

here's a quick solution, and should run pretty fast. you can easily make it work per reference as well, though here i just assume you have only a single reference as you suggest.

``````import numpy as np
import sys
# since we dont know the max len beforehand,
# just use a big number and then chop it later.
MAX_LEN = 3e8

coverage = np.zeros((MAX_LEN,), dtype=np.uint8)

for i, line in enumerate(open(sys.argv[1])):
# logic for 1st line here.
if i == 0: continue
line = line.rstrip().split()
ref = line[0]
start, end = sorted(map(int, line[1:]))

coverage[start - 1:end] |= 1

print len(coverage[coverage == 1])
``````

and run from the commandline with your alignment file as the 1st argument.

1

@nuin: well, numpy is pretty usual for a bioinformatics program

btw, the above uses memory proportional to the size of the reference sequence, independent of the number and content of reads. i think that's important for NGS.

that's quite elegant

+1 How can one not love Python when it makes things so easy, beautiful, AND understandable!

+1 for elegance

... until you find out that numpy is not installed in your system ...

The program only consumes about 250MB memory yet it creates a 300 million long array - that surprised me; then I noticed the one byte long datatype, neat!

Ideally, imports should be limited to modules distributed with Python.

it is on a Mac ;-)

5
8.0 years ago by
Malcolm.Cook790
kansas, usa
Malcolm.Cook790 wrote:

if you fancy perl, and have faith in the bazaar or are generally lazy, then install Set::IntSpan (a module I find generally useful for genomic interval hacking in Perl)

allowing you to write this one-liner:

``````perl  -MSet::IntSpan -an -e 'BEGIN{\$c=Set::IntSpan->new} next if \$. == 1; \$c += [[sort @F[1,2]]]; END{print \$c->size}'
``````

cool. Set::IntSpan looks like a nice abstraction. might look nicer on multiple lines. :-)

This works fine. Set::IntSpan allows us to avoid reinventing the wheel. The code is both clear and brief. It allows us to think in terms of the domain, rather than the implementation. Thank you!

Fine! Then you might also like its extension, Set::IntSpan::Island, especially if you're hacking over spliced transcripts / gapped alignments in perl.

Fine! Then you might also like its extension, Set::IntSpan::Island, especially if you're hacking over spliced transcripts / gapped alignments in perl.

Fine! Then especially if you're hacking over spliced transcripts / gapped alignments in perl you might also like its extension Set::IntSpan::Island

4
8.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum106k wrote:

My C++ solution (does NOT need to store all the data in memory. You just need to keep the contigs.

``````#include <iostream>
#include <vector>
#include <cstdlib>
#include <cerrno>
#include <algorithm>

using namespace std;

class Range
{
public:
int start;
int end;
inline bool operator < (const Range &other) const
{
return start < other.start;
}
inline bool overlap(const Range &other) const
{
return !(end<=other.start || other.end<=start);
}
void merge(const Range &other)
{
start=min(start,other.start);
end=max(end,other.end);
}
};

int main(int argc,char **argv)
{
const char TAB='\t';
vector<Range> ranges;
string line;

if(!getline(cin,line,'\n'))
{
cerr << "first line missing" << endl;
return EXIT_FAILURE;
}
if(line.compare("query\tstart\tstop")!=0)
{
return EXIT_FAILURE;
}

while(getline(cin,line,'\n'))
{
int d1=line.find(TAB);
if(d1==string::npos )
{
cerr << "no tab in "<< line << endl;
return EXIT_FAILURE;
}
else if(d1==0)
{
cerr << "no name in "<< line << endl;
return EXIT_FAILURE;
}
int d2=line.find(TAB,d1+1);
if(d2==string::npos)
{
cerr << "second tab missing in "<< line << endl;
return EXIT_FAILURE;
}
Range range;
char *p=const_cast<char*>(line.c_str());
char *p2;
p[d2]=0;
errno=0;
range.start=strtod(&p[d1+1],&p2);
if(range.start<0 || errno!=0 || *p2!=0)
{
cerr << "bad start in "<< line << endl;
return EXIT_FAILURE;
}
errno=0;
range.end=strtod(&p[d2+1],&p2);
if(range.end<0 || errno!=0 || *p2!=0)
{
cerr << "bad end in "<< line << endl;
return EXIT_FAILURE;
}
if(range.start>range.end)
{
swap(range.start,range.end);
}

if(ranges.empty())
{
ranges.push_back(range);
}
else
{
vector<Range>::iterator iter= lower_bound(
ranges.begin(),
ranges.end(),
range
);

int index= distance(ranges.begin(),iter);
ranges.insert( iter, range );

bool done=false;
do
{
done=true;

//remove right side
while(index+1 < ranges.size() &&
ranges[index].overlap(ranges[index+1]))
{

ranges[index].merge(ranges[index+1]);
ranges.erase(ranges.begin()+(index+1));
done=false;
}
//remove left side
while(index>0 &&
ranges[index-1].overlap(ranges[index]))
{

ranges[index-1].merge(ranges[index]);
ranges.erase(ranges.begin()+index);
done=false;
--index;
}

} while(!done);
}

}
int sum=0;
for(vector<Range>::const_iterator i=ranges.begin();
i!=ranges.end();
++i)
{
sum+=((*i).end - (*i).start)+1;
}
cout << "coverage="<< sum << " pb" << endl;
return 0;
}
``````

Compiling/Testing

``````all:
g++ coverage.cpp
echo -e "query\tstart\tstop\nquery1\t100\t120\nquery2\t50\t99\nquery3\t130\t110" | ./a.out
time mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18  \
-e 'select name as "query",if(strand="-",txEnd+1,txStart) as "start",if(strand="-",txStart,txEnd+1) as "stop" from knownGene where chrom="chr1"' | ./a.out
``````

Output:

``````g++ coverage.cpp
echo -e "query\tstart\tstop\nquery1\t100\t120\nquery2\t50\t99\nquery3\t130\t110" | ./a.out
coverage=81 pb
time mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18  \
-e 'select name as "query",if(strand="-",txEnd+1,txStart) as "start",if(strand="-",txStart,txEnd+1) as "stop" from knownGene where chrom="chr1"' | ./a.out
coverage=114657440 pb

real    0m3.403s
user    0m0.076s
sys     0m0.008s
``````

@pierre, by "it does not need to store the results in memory", that's only for overlaps, right? if there are no overlaps, it will store everything, or am i misreading your code?

you're right. It just stores the contigs 'on the fly'. No need to (1) first load all the data (2) remove the overlaps

@brentp I fixed the sentence.

2
8.0 years ago by
Bilouweb1.1k
Saclay, France
Bilouweb1.1k wrote:

I made it in C++ because it is my "everyday life" language ;) the idea is :

read queries and swap if start > stop

sort queries based on "start"

merge queries if they overlap

sum the coverage of each query

``````#include <vector>
#include <iostream>
#include <algorithm>

using namespace std;

// Class to store start and stop positions
class Query
{
public :
int start;
int stop;
Query(int i, int j):start(i),stop(j){
if (start > stop)
swap();
}

void swap(){
int i(start);
start = stop;
stop = i;
}
};

// function to compare start positions of two queries
bool CompQuery (Query i,Query j){return (i.start < j.start);}

int main(int argc, char * argv)
{
// Create query tab with start and stop positions
vector<Query> tab;
vector<Query>::iterator it;
tab.push_back(Query(100,120));
tab.push_back(Query(130,110));
tab.push_back(Query(50,99));
tab.push_back(Query(165,140));

// sort query tab by increasing start positions
sort(tab.begin(), tab.end(), CompQuery);

// Merge Queries if they cover adjacent segments
// example : 50->99 and 100->120 are merged in 50->120
it = tab.begin() + 1;
int cover = 0;
while(it != tab.end()){
if ((*it).start <= (*(it-1)).stop + 1){
if ((*it).stop > (*(it-1)).stop){
(*(it-1)).stop = (*it).stop;
}
it = tab.erase(it);
} else {
cout << (*it).start<<","<<(*it).stop<< "\n";
cover += (*(it-1)).stop - (*(it-1)).start + 1;
it++;
}
}
cover += (*(it-1)).stop -(*(it-1)).start + 1;

// Print Coverage
cout << "\nCoverage = "<<cover<<"\n";
return 0;
}
``````

The last two steps can be made in one loop

advice: swap is a standard function of C++, you could overload the '<' operator rather than declaring ComQuery, you could also avoid to put everything in memory

Yes, I saw this in your code. Good job !

Rennes ? Nantes :-)

The answer lies in Brittany ;-)

2
8.0 years ago by
Jeremy Leipzig17k wrote:

Here is a version I wrote using R and Bioconductor's IRanges

``````#!/usr/bin/Rscript
library("IRanges",warn.conflicts=FALSE)
args<-commandArgs(TRUE)
mytab\$starts<-apply(mytab[,c(2,3)],1,min)
mytab\$stops<-apply(mytab[,c(2,3)],1,max)
myrange<-IRanges(start=mytab\$starts,end=mytab\$stops)
sum(width(reduce(myrange)))
``````
1
8.0 years ago by
Chris Miller19k
Washington University in St. Louis, MO
Chris Miller19k wrote:

I swear I've written this code before somewhere, I'll see if I can dig it up later.

In the meantime, try this:

• In one quick pass, flip any start/stop values and sort the file:

[?]

• Now, you step through the outfile, merging segments.

Rough algorithm:

• coverage = 0
• store the first segment in a buffer
• check to see if the current segment's start is <= old segment stop
• if so, update the oldSeg's stop position
• if not, add the length of the old segment to coverage, then set the oldSeg = currSeg
• at the end of the loop, clear the last segment out of the buffer and add it's length to coverage.
• coverage/referenceSize = coverage percentage

To be clear, other algs may be faster, but this requires almost no memory and scales to as many reads as you've got.

i think it would be great to see a ruby version

1
8.0 years ago by
Yuri1.5k
Bethesda, MD
Yuri1.5k wrote:

Just for the collection, the code in MATLAB (without using any library/toolbox):

``````q = dlmread('queries.txt','\t',1,1); %# read the file
q = sort(q,2); %# start < end
qidx = ones(size(q)); %# start = 1
qidx(:,2) = -1; %# end = -1
[qsorted, si] = sort(q(:));
sidx = qidx(si);
gap = find(cumsum(sidx)==0); %# find start indices of gaps
boundaryidx = reshape(unique([1; gap; gap(1:end-1)+1]),2,[]);
coverage = sum(diff(qsorted(boundaryidx))+1);
disp(coverage)
``````