Question: Quick Programming Challenge: How Do I Calculate Reference Coverage From A Table Of Alignment Starts And Ends?
7
gravatar for Jeremy Leipzig
8.2 years ago by
Philadelphia, PA
Jeremy Leipzig17k wrote:

Imagine I have a reference sequence and some query sequences align to it

length=|end-start|+1

when end<start the query sequence has aligned to the negative strand

query   start stop
query1   100   120
query2    50    99
query3   130   110

I want to know the total coverage of the reference sequence, ignoring overlaps, so the number of reference bases covered by one or more alignments.

In this case it is 21+50+10=81

Any language is fine. Assume the input is tab delimited.

Note:This is not an invitation to recommend your favorite third-party alignment tool - you must deal with the data as I have presented it.

alignment code programming • 3.5k views
ADD COMMENTlink modified 8.2 years ago by Malcolm.Cook800 • written 8.2 years ago by Jeremy Leipzig17k
1

This problem becomes a lot easier if the start stop are in the form of start<stop, (you'll then need a strand column of course) then in addition you sort the file by the start. That's what I would do first rather than come up with a more complex solution.

ADD REPLYlink written 8.2 years ago by Istvan Albert ♦♦ 77k
1

Is it sorted or not?

ADD REPLYlink written 8.2 years ago by Paulo Nuin3.7k
1

We need more of these challenges :)

ADD REPLYlink written 8.2 years ago by Eric Normandeau9.9k

that would be considered part of the solution

ADD REPLYlink written 8.2 years ago by Jeremy Leipzig17k

that could be considered part of the solution

ADD REPLYlink written 8.2 years ago by Jeremy Leipzig17k

not sorted -i'll make that clearer

ADD REPLYlink written 8.2 years ago by Jeremy Leipzig17k

Correction: "In this case it is 21+50+21=91" since |130-110|+1=21, not 10.

ADD REPLYlink written 6.9 years ago by Nikolay Vyahhi1.1k

Correction: "In this case it is 21+50+21=92" since |130-110|+1=21, not 10.

ADD REPLYlink written 6.9 years ago by Nikolay Vyahhi1.1k
6
gravatar for Eric Normandeau
8.2 years ago by
Eric Normandeau9.9k
Quebec, Canada
Eric Normandeau9.9k wrote:

Here is another approach in Python making few assumptions and using very little memory:

in_file = "queries.txt"
coverage = set()

with open(in_file) as f:
    for line in f:
        try:
            query, v1, v2 = line.strip().split("\t")
            v_min, v_max = sorted([int(v1), int(v2)])
        except:
            continue
        coverage.update(range(v_min, v_max + 1)) # as suggested by brentp

print "Reference coverage:", len(coverage)

Here is what I get when I run it:

time python golf-course.py 
Reference coverage: 81

real    0m0.026s
user    0m0.016s
sys     0m0.008ss

It should also be pretty 'Pythonic' :)

Cheers

ADD COMMENTlink modified 8.2 years ago • written 8.2 years ago by Eric Normandeau9.9k

that is probably the most readable yet. great job

ADD REPLYlink written 8.2 years ago by Jeremy Leipzig17k

pythonic would use coverage.update(xrange(v_min, vmax + 1)) instead of using a loop.

ADD REPLYlink written 8.2 years ago by brentp22k

That's true, I made the change! Thanks @brentp

ADD REPLYlink written 8.2 years ago by Eric Normandeau9.9k

@eric upvoted, i like this solution. just to note that try/except/continue for all errors is going to make it hard to debug in the case of weird input.

ADD REPLYlink written 8.2 years ago by brentp22k

@brentp Yes, I considered making 2 or 3 try/except cases, each with an possible 'error message' and that would have made it easier to debug. I just wanted a very concise and comprehensible code that did the trick, given the input data of this problem. Cheers

ADD REPLYlink written 8.2 years ago by Eric Normandeau9.9k

I don't use python but it's a very elegant code... +1

ADD REPLYlink written 8.2 years ago by Bilouweb1.1k
5
gravatar for brentp
8.2 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

here's a quick solution, and should run pretty fast. you can easily make it work per reference as well, though here i just assume you have only a single reference as you suggest.

import numpy as np
import sys
# since we dont know the max len beforehand,
# just use a big number and then chop it later.
MAX_LEN = 3e8

coverage = np.zeros((MAX_LEN,), dtype=np.uint8)

for i, line in enumerate(open(sys.argv[1])):
    # logic for 1st line here.
    if i == 0: continue
    line = line.rstrip().split()
    ref = line[0]
    start, end = sorted(map(int, line[1:]))

    coverage[start - 1:end] |= 1

print len(coverage[coverage == 1])

and run from the commandline with your alignment file as the 1st argument.

ADD COMMENTlink modified 8.2 years ago • written 8.2 years ago by brentp22k
1

@nuin: well, numpy is pretty usual for a bioinformatics program

ADD REPLYlink written 8.2 years ago by Giovanni M Dall'Olio26k

btw, the above uses memory proportional to the size of the reference sequence, independent of the number and content of reads. i think that's important for NGS.

ADD REPLYlink written 8.2 years ago by brentp22k

that's quite elegant

ADD REPLYlink written 8.2 years ago by Jeremy Leipzig17k

+1 How can one not love Python when it makes things so easy, beautiful, AND understandable!

ADD REPLYlink written 8.2 years ago by Eric Normandeau9.9k

+1 for elegance

ADD REPLYlink written 8.2 years ago by Pierre Lindenbaum110k

... until you find out that numpy is not installed in your system ...

ADD REPLYlink written 8.2 years ago by Paulo Nuin3.7k

The program only consumes about 250MB memory yet it creates a 300 million long array - that surprised me; then I noticed the one byte long datatype, neat!

ADD REPLYlink written 8.2 years ago by Istvan Albert ♦♦ 77k

Ideally, imports should be limited to modules distributed with Python.

ADD REPLYlink written 8.2 years ago by Paulo Nuin3.7k

it is on a Mac ;-)

ADD REPLYlink written 8.2 years ago by Istvan Albert ♦♦ 77k
5
gravatar for Malcolm.Cook
8.2 years ago by
Malcolm.Cook800
kansas, usa
Malcolm.Cook800 wrote:

if you fancy perl, and have faith in the bazaar or are generally lazy, then install Set::IntSpan (a module I find generally useful for genomic interval hacking in Perl)

allowing you to write this one-liner:

perl  -MSet::IntSpan -an -e 'BEGIN{$c=Set::IntSpan->new} next if $. == 1; $c += [[sort @F[1,2]]]; END{print $c->size}'
ADD COMMENTlink modified 8.2 years ago • written 8.2 years ago by Malcolm.Cook800

cool. Set::IntSpan looks like a nice abstraction. might look nicer on multiple lines. :-)

ADD REPLYlink written 8.2 years ago by brentp22k

This works fine. Set::IntSpan allows us to avoid reinventing the wheel. The code is both clear and brief. It allows us to think in terms of the domain, rather than the implementation. Thank you!

ADD REPLYlink written 7.2 years ago by Timur Shtatland0

Fine! Then you might also like its extension, Set::IntSpan::Island, especially if you're hacking over spliced transcripts / gapped alignments in perl.

ADD REPLYlink written 7.2 years ago by Malcolm.Cook800

Fine! Then you might also like its extension, Set::IntSpan::Island, especially if you're hacking over spliced transcripts / gapped alignments in perl.

ADD REPLYlink written 7.2 years ago by Malcolm.Cook800

Fine! Then especially if you're hacking over spliced transcripts / gapped alignments in perl you might also like its extension Set::IntSpan::Island

ADD REPLYlink written 7.2 years ago by Malcolm.Cook800
4
gravatar for Pierre Lindenbaum
8.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum110k wrote:

My C++ solution (does NOT need to store all the data in memory. You just need to keep the contigs.

#include <iostream>
#include <vector>
#include <cstdlib>
#include <cerrno>
#include <algorithm>

using namespace std;

class Range
    {
    public:
        int start;
        int end;
        inline bool operator < (const Range &other) const
            {
            return start < other.start;
            }
        inline bool overlap(const Range &other) const
            {
            return !(end<=other.start || other.end<=start);
            }
        void merge(const Range &other)
            {
            start=min(start,other.start);
            end=max(end,other.end);
            }
    };

int main(int argc,char **argv)
    {
    const char TAB='\t';
    vector<Range> ranges;
    string line;

    if(!getline(cin,line,'\n'))
        {
        cerr << "first line missing" << endl;
        return EXIT_FAILURE;
        }
    if(line.compare("query\tstart\tstop")!=0)
        {
        cerr << "bad header \""<< line << "\"" << endl;
        return EXIT_FAILURE;
        }

    while(getline(cin,line,'\n'))
        {
        int d1=line.find(TAB);
        if(d1==string::npos )
            {
            cerr << "no tab in "<< line << endl;
            return EXIT_FAILURE;
            }
        else if(d1==0)
            {
            cerr << "no name in "<< line << endl;
            return EXIT_FAILURE;
            }
        int d2=line.find(TAB,d1+1);
        if(d2==string::npos)
            {
            cerr << "second tab missing in "<< line << endl;
            return EXIT_FAILURE;
            }
        Range range;
        char *p=const_cast<char*>(line.c_str());
        char *p2;
        p[d2]=0;
        errno=0;
        range.start=strtod(&p[d1+1],&p2);
        if(range.start<0 || errno!=0 || *p2!=0)
            {
            cerr << "bad start in "<< line << endl;
            return EXIT_FAILURE;
            }
        errno=0;
        range.end=strtod(&p[d2+1],&p2);
        if(range.end<0 || errno!=0 || *p2!=0)
            {
            cerr << "bad end in "<< line << endl;
            return EXIT_FAILURE;
            }
        if(range.start>range.end)
            {
            swap(range.start,range.end);
            }

        if(ranges.empty())
            {
            ranges.push_back(range);
            }
        else
            {
            vector<Range>::iterator iter= lower_bound(
                ranges.begin(),
                ranges.end(),
                range
                );

            int index= distance(ranges.begin(),iter);
            ranges.insert( iter, range );

            bool done=false;
            do
                {
                done=true;

                //remove right side
                while(index+1 < ranges.size() &&
                    ranges[index].overlap(ranges[index+1]))
                    {

                    ranges[index].merge(ranges[index+1]);
                    ranges.erase(ranges.begin()+(index+1));
                    done=false;
                    }
                //remove left side
                while(index>0 &&
                    ranges[index-1].overlap(ranges[index]))
                    {

                    ranges[index-1].merge(ranges[index]);
                    ranges.erase(ranges.begin()+index);
                    done=false;
                    --index;
                    }

                } while(!done);
            }

        }
    int sum=0;
    for(vector<Range>::const_iterator i=ranges.begin();
        i!=ranges.end();
        ++i)
        {
        sum+=((*i).end - (*i).start)+1;
        }
    cout << "coverage="<< sum << " pb" << endl;
    return 0;
    }

Compiling/Testing

all:
    g++ coverage.cpp
    echo -e "query\tstart\tstop\nquery1\t100\t120\nquery2\t50\t99\nquery3\t130\t110" | ./a.out
    time mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18  \
      -e 'select name as "query",if(strand="-",txEnd+1,txStart) as "start",if(strand="-",txStart,txEnd+1) as "stop" from knownGene where chrom="chr1"' | ./a.out

Output:

g++ coverage.cpp
echo -e "query\tstart\tstop\nquery1\t100\t120\nquery2\t50\t99\nquery3\t130\t110" | ./a.out
coverage=81 pb
time mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18  \
          -e 'select name as "query",if(strand="-",txEnd+1,txStart) as "start",if(strand="-",txStart,txEnd+1) as "stop" from knownGene where chrom="chr1"' | ./a.out
coverage=114657440 pb

real    0m3.403s
user    0m0.076s
sys     0m0.008s
ADD COMMENTlink modified 8.2 years ago • written 8.2 years ago by Pierre Lindenbaum110k

@pierre, by "it does not need to store the results in memory", that's only for overlaps, right? if there are no overlaps, it will store everything, or am i misreading your code?

ADD REPLYlink written 8.2 years ago by brentp22k

you're right. It just stores the contigs 'on the fly'. No need to (1) first load all the data (2) remove the overlaps

ADD REPLYlink written 8.2 years ago by Pierre Lindenbaum110k

@brentp I fixed the sentence.

ADD REPLYlink written 8.2 years ago by Pierre Lindenbaum110k
2
gravatar for Bilouweb
8.2 years ago by
Bilouweb1.1k
Saclay, France
Bilouweb1.1k wrote:

I made it in C++ because it is my "everyday life" language ;) the idea is :

read queries and swap if start > stop

sort queries based on "start"

merge queries if they overlap

sum the coverage of each query

#include <vector>
#include <iostream>
#include <algorithm>

using namespace std;

// Class to store start and stop positions
class Query
{
  public :
    int start;
    int stop; 
    Query(int i, int j):start(i),stop(j){
      if (start > stop)
        swap();  
    }

    void swap(){
      int i(start);
      start = stop;
      stop = i;
    }
};

// function to compare start positions of two queries
bool CompQuery (Query i,Query j){return (i.start < j.start);}

int main(int argc, char * argv)
{
  // Create query tab with start and stop positions
  vector<Query> tab;
  vector<Query>::iterator it;
  tab.push_back(Query(100,120));
  tab.push_back(Query(130,110));
  tab.push_back(Query(50,99));
  tab.push_back(Query(165,140));

  // sort query tab by increasing start positions
  sort(tab.begin(), tab.end(), CompQuery);

  // Merge Queries if they cover adjacent segments
  // example : 50->99 and 100->120 are merged in 50->120
  it = tab.begin() + 1;
  int cover = 0;
  while(it != tab.end()){
    if ((*it).start <= (*(it-1)).stop + 1){
      if ((*it).stop > (*(it-1)).stop){
        (*(it-1)).stop = (*it).stop;
      }
      it = tab.erase(it);
    } else {
      cout << (*it).start<<","<<(*it).stop<< "\n";
      cover += (*(it-1)).stop - (*(it-1)).start + 1;
      it++;
    }
  }
  cover += (*(it-1)).stop -(*(it-1)).start + 1;

  // Print Coverage
  cout << "\nCoverage = "<<cover<<"\n";
  return 0;
}
ADD COMMENTlink modified 8.2 years ago • written 8.2 years ago by Bilouweb1.1k

The last two steps can be made in one loop

ADD REPLYlink written 8.2 years ago by Bilouweb1.1k

advice: swap is a standard function of C++, you could overload the '<' operator rather than declaring ComQuery, you could also avoid to put everything in memory

ADD REPLYlink written 8.2 years ago by Pierre Lindenbaum110k

Yes, I saw this in your code. Good job !

ADD REPLYlink written 8.2 years ago by Bilouweb1.1k

Rennes ? Nantes :-)

ADD REPLYlink written 8.2 years ago by Pierre Lindenbaum110k

The answer lies in Brittany ;-)

ADD REPLYlink written 8.2 years ago by Bilouweb1.1k
2
gravatar for Jeremy Leipzig
8.2 years ago by
Philadelphia, PA
Jeremy Leipzig17k wrote:

Here is a version I wrote using R and Bioconductor's IRanges

#!/usr/bin/Rscript
library("IRanges",warn.conflicts=FALSE)
args<-commandArgs(TRUE)
mytab<-read.table(args[1],header=TRUE)
mytab$starts<-apply(mytab[,c(2,3)],1,min)
mytab$stops<-apply(mytab[,c(2,3)],1,max)
myrange<-IRanges(start=mytab$starts,end=mytab$stops)
sum(width(reduce(myrange)))
ADD COMMENTlink written 8.2 years ago by Jeremy Leipzig17k
1
gravatar for Chris Miller
8.2 years ago by
Chris Miller20k
Washington University in St. Louis, MO
Chris Miller20k wrote:

I swear I've written this code before somewhere, I'll see if I can dig it up later.

In the meantime, try this:

  • In one quick pass, flip any start/stop values and sort the file:

[?]

  • Now, you step through the outfile, merging segments.

Rough algorithm:

  • coverage = 0
  • loop through the reads
  • store the first segment in a buffer
  • check to see if the current segment's start is <= old segment stop
    • if so, update the oldSeg's stop position
    • if not, add the length of the old segment to coverage, then set the oldSeg = currSeg
  • at the end of the loop, clear the last segment out of the buffer and add it's length to coverage.
  • coverage/referenceSize = coverage percentage
ADD COMMENTlink written 8.2 years ago by Chris Miller20k

To be clear, other algs may be faster, but this requires almost no memory and scales to as many reads as you've got.

ADD REPLYlink written 8.2 years ago by Chris Miller20k

i think it would be great to see a ruby version

ADD REPLYlink written 8.2 years ago by Jeremy Leipzig17k
1
gravatar for Yuri
8.2 years ago by
Yuri1.5k
Bethesda, MD
Yuri1.5k wrote:

Just for the collection, the code in MATLAB (without using any library/toolbox):

q = dlmread('queries.txt','\t',1,1); %# read the file
q = sort(q,2); %# start < end
qidx = ones(size(q)); %# start = 1
qidx(:,2) = -1; %# end = -1
[qsorted, si] = sort(q(:));
sidx = qidx(si);
gap = find(cumsum(sidx)==0); %# find start indices of gaps
boundaryidx = reshape(unique([1; gap; gap(1:end-1)+1]),2,[]);
coverage = sum(diff(qsorted(boundaryidx))+1);
disp(coverage)
ADD COMMENTlink written 8.2 years ago by Yuri1.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 460 users visited in the last hour