Question: Programming Challenge: Divide The Human Genome Among X Cores, Taking Into Account Gaps
Jeremy Leipzig (Philadelphia, PA) wrote:

Develop a tool that divides the hg19 human genome (chromosomes 1-22, X, Y, M) into intervals to be distributed among different cores or nodes. The goals are:

  • divide the genome such that each core has to deal with a roughly equal number of eligible base pairs (see gap note below)
  • keep these intervals non-overlapping
  • keep these intervals as close to large contiguous blocks as possible; for 100 cores, you can certainly have 120 intervals (someone has to get MT), but 500 intervals would be too many
  • take into account genomic assembly gaps, which are all NNN. You can include the gaps in the intervals, but they do not add to the burden and therefore should not be counted in the size calculation (a sketch of this subtraction follows the table below). Here are some gaps from the UCSC Table Browser -> Mapping and Sequencing tracks -> gap:

    bin  chrom  chromStart  chromEnd   ix    n  size      type             bridge
    0    chr1   124535434   142535434  1271  N  18000000  heterochromatin  no
    23   chr1   121535434   124535434  1270  N  3000000   centromere       no
    76   chr1   3845268     3995268    47    N  150000    contig           no
    85   chr1   13219912    13319912   154   N  100000    contig           no
    89   chr1   17125658    17175658   196   N  50000     clone            yes

(Here is a copy of that table for 1:22XY)
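For illustration, the eligible-bp count per chromosome can be obtained by subtracting the summed gap lengths from the chromosome sizes (a minimal sketch assuming the UCSC chromInfo.txt and gap.txt column layouts; the helper and variable names are made up):

import csv
from collections import defaultdict

def eligible_bp(chrom_info_path, gap_path):
    """Return {chrom: eligible bases}, i.e. chromosome length minus total gap bases."""
    lengths = {}
    with open(chrom_info_path) as fh:          # chromInfo.txt: chrom, size, ...
        for row in csv.reader(fh, delimiter="\t"):
            lengths[row[0]] = int(row[1])
    gap_total = defaultdict(int)
    with open(gap_path) as fh:                 # gap.txt: bin, chrom, chromStart, chromEnd, ...
        for row in csv.reader(fh, delimiter="\t"):
            gap_total[row[1]] += int(row[3]) - int(row[2])
    return {c: size - gap_total[c] for c, size in lengths.items()}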

A carefully considered metric for evaluating the solution is:

Score = (Std dev eligible bp per core) * (Number of intervals) * (Execution time in seconds) * (Lines of Code)
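Read concretely, the score could be computed from a finished assignment like this (a sketch only; assignments is a hypothetical list holding, for each core, the eligible-bp sizes of its intervals):

import statistics

def score(assignments, execution_time_s, lines_of_code):
    """assignments: one list per core, holding the eligible-bp size of each interval."""
    per_core_bp = [sum(core) for core in assignments]
    n_intervals = sum(len(core) for core in assignments)
    return (statistics.pstdev(per_core_bp) * n_intervals
            * execution_time_s * lines_of_code)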

Lowest score wins!

Looking forward to seeing your code!


Comment (Pierre Lindenbaum): do you want the intervals to overlap?

Reply (Jeremy Leipzig): no, updated post
Pierre Lindenbaum (Institut du Thorax - INSERM UMR1087, Nantes, France) wrote:

Here is my solution. I put my code on GitHub: https://github.com/lindenb/jvarkit#biostar77828 (source: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar77828.java). It requires the Picard library.

The Makefile: I use bedtools to subtract the gaps from hg19. I removed the "*hap" chromosomes.

all: tmp.result.txt

tmp.result.txt: dist/biostar77828.jar tmp3.bed
	java -jar dist/biostar77828.jar VERBOSITY=DEBUG N_ITERATIONS=10000000 MIN_CORE=20 MAX_CORE=30 < tmp3.bed > $@

dist/biostar77828.jar: src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar77828.java
	ant biostar77828

tmp3.bed: tmp1.bed tmp2.bed
	/commun/data/packages/BEDTools-Version-2.16.2/bin/subtractBed -a tmp1.bed -b tmp2.bed > $@

tmp1.bed:
	curl "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz" |\
	gunzip -c |\
	grep -v hap |\
	awk -F '\t' '{printf("%s\t0\t%s\n",$$1,$$2);}' |\
	LC_ALL=C sort -k1,1 -k2,2n -k3,3n > $@

tmp2.bed:
	curl "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz" |\
	gunzip -c |\
	grep -v hap |\
	cut -f2,3,4 |\
	LC_ALL=C sort -k1,1 -k2,2n -k3,3n > $@

My strategy is a random solution generator: loop for N generations and keep the best solution seen:

        Solution best=null;
        for(long generation=0;generation< this.N_ITERATIONS;++generation)
            {
            // build a new random solution for this generation
            Solution sol=createSolution();

            // keep it if it beats the best solution seen so far
            if(best==null || sol.compareTo(best)<0)
                {
                best=sol;
                }
            }

Creating a solution: a random number of 'Core' objects is created and filled with a random segment.

    // pick a random number of cores between MIN_CORE and MAX_CORE
    int n_cores=
            MIN_CORE+(MIN_CORE>=MAX_CORE?0:this.random.nextInt(MAX_CORE-MIN_CORE));
    (...)
    // shuffle the BED segments
    Collections.shuffle(segments, this.random);
    (...)
    // seed each core with one segment
    for(int i=0;i< n_cores && i< segments.size();++i)
        {
        Core core=new Core();
        core.segments.add(segments.get(i));
        sol.cores.add(core);
        }

Then each remaining BED segment is added to the Core whose total would deviate least from the mean number of bases per core:

            // find the core whose total length would deviate least from the mean
            // once this segment is added
            for(Core core:sol.cores)
                {
                if(best==null ||
                    (Math.abs((core.length()+seg.size())-mean) < Math.abs((best.length()+seg.size())-mean)))
                    {
                    best=core;
                    }
                }
            best.segments.add(seg);

The result:

(The resulting intervals were posted as a GitHub gist: lindenb/6130880.)

Alex Reynolds (Seattle, WA, USA) wrote:

I haven't calculated scores, but I applied a first-fit algorithm on a regions-minus-gaps file generated with BEDOPS, where the regions are either sorted in descending order by size or shuffled randomly, before binning.

Here are the hg19 gaps I used:

(gist: alexpreynolds/6126976)

Here are the hg19 regions ("extents") I used:

(gist: alexpreynolds/6126993)

To generate the regions-minus-gaps file, adding size values:

$ bedops --difference hg19.extents.bed hg19.gaps.bed \
    | awk '{ print $0"\t"($3-$2) }' - \
    > hg19.gapped_extents.bed

Here is the regions-minus-gaps file that I get:

(gist: alexpreynolds/6127019)

Here is the source for my first-fit approach:

(gist: alexpreynolds/6127008)
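In outline, the first-fit step amounts to the following (a minimal sketch rather than the gist's exact code; the per-bin capacity and the (chrom, start, stop, size) tuple layout are assumptions):

def first_fit(regions, n_bins):
    """Place each (chrom, start, stop, size) region into the first bin with room.
    regions should already be sorted by descending size or shuffled."""
    capacity = sum(r[3] for r in regions) / float(n_bins)
    bins = [[] for _ in range(n_bins)]
    totals = [0] * n_bins
    for region in regions:
        for i in range(n_bins):
            if totals[i] + region[3] <= capacity:
                bins[i].append(region)
                totals[i] += region[3]
                break
        else:
            # nothing fits under the capacity: use the currently lightest bin
            i = totals.index(min(totals))
            bins[i].append(region)
            totals[i] += region[3]
    return bins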

And here is a result from one trial run:

(gist: alexpreynolds/6127041)

On repeated trials, I seem to get more evenly distributed coverage of regions over the nodes when using a randomly shuffled dataset, as compared with bin packing on the descending-sort dataset.

I also get less "waste" per bin with random shuffling, i.e. a smaller standard deviation of total bases across bins.

This might be a jumping-off point for adding more statistics and doing more trials, to see how random shuffling performs, on average, against sorted data. And there are numerous ways to clean up my Python script and make things more "Pythonic", I'm certain.

Some thoughts: it should be possible to get a smoother distribution of elements per bin/node, and accordingly a smaller standard deviation of bases over bins, by putting a maximum size on regions that is smaller than the largest element and splitting any element above that limit. The smaller the limit, the more disjoint elements you create, but packing each bin evenly should become easier; see the sketch below.
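That splitting idea might look like this as a pre-processing step (illustrative only; max_size is a hypothetical cap):

def split_large_regions(regions, max_size):
    """Break any (chrom, start, stop, size) region larger than max_size into
    consecutive chunks of at most max_size bases before binning."""
    out = []
    for chrom, start, stop, size in regions:
        pos = start
        while pos < stop:
            end = min(pos + max_size, stop)
            out.append((chrom, pos, end, end - pos))
            pos = end
    return out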

Interesting problem. I am interested to see how others tackle it.


Comment (zx8754): +1 for providing bed files.

Reply (Jeremy Leipzig): Thanks for this solution. I was not thinking of using fewer than X cores, but that is easily remedied.
Brad Chapman (Boston, MA) wrote:

I know this is overkill for this particular challenge, but here is how I do this in practice for variant-calling parallelization:

  • Identify non-callable regions (no coverage or NNN regions) in each sample.
  • Group regions for all samples in a population called jointly to identify shared non-callable regions.
  • Pick non-callable regions evenly spaced across the genome.
  • Split into regions bounded by these non-callable regions.
  • Parallelize work in those relatively uniform regions.

I make heavy use of pybedtools/bedtools to handle merging all the intervals:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/bam/callable.py
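For a flavor of the pybedtools side, here is a hedged sketch of intersecting per-sample non-callable BED files to get the shared non-callable regions (file names are placeholders; the actual logic lives in callable.py above):

import pybedtools

def shared_noncallable(sample_beds):
    """Intersect per-sample non-callable BED files to keep only regions that
    are non-callable in every sample (placeholder file names)."""
    shared = pybedtools.BedTool(sample_beds[0]).sort().merge()
    for bed in sample_beds[1:]:
        shared = shared.intersect(pybedtools.BedTool(bed).sort().merge())
    return shared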


Comment (Jeremy Leipzig): I take it the regions are so small, and there are so many of them, that you simply don't need to worry about load balancing?

Reply (Brad Chapman): I do load balance by picking split positions that are evenly spaced. It's definitely more of a heuristic approach: divide the genome size by the number of blocks you want, then only include split positions that are far enough from the last split:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/bam/callable.py#L167

Since the splits aren't perfectly even anyway because of the coverage requirement, this works decently to avoid very large regions (or lots of tiny regions).
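That spacing heuristic could be sketched roughly like this (names are illustrative, not the callable.py code):

def choose_splits(candidate_positions, chrom_size, n_blocks):
    """Keep only candidate split points at least chrom_size / n_blocks away
    from the previously accepted split."""
    min_spacing = chrom_size // n_blocks
    splits, last = [], 0
    for pos in sorted(candidate_positions):
        if pos - last >= min_spacing:
            splits.append(pos)
            last = pos
    return splits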

Comment (Ryan Thompson): I think Jeremy is asking if the frequency of assembly gaps is high enough that you can always find one close enough to your desired split point.

Reply (Brad Chapman): Ryan -- makes sense. I'm starting with the additional constraint that split points need to satisfy biological criteria (no coverage), so I agree that my approach won't match the optimal splitting that allows breakpoints anywhere. It tries to do the best load balancing it can under those conditions.
