The bedops and bedmap utilities in the BEDOPS suite do not generate bins, but you could use awk to do this, which might be faster than using R.
To demonstrate how to solve your problem, let's start with a couple of example genes, foo and bar:
$ more genes.bed
chr1 1000 4000 foo
chr1 6000 9000 bar
Here's how we could partition them into, say, five bins for each gene:
$ awk '
BEGIN {
  binNum = 5;                              # bins per gene
}
{
  chr = $1;
  start = $2;
  stop = $3;
  binIdx = 1;
  binSize = (stop - start) / binNum;       # may be fractional for some genes
  # walk across the gene, printing one bin interval per step
  for (i = start; i < stop; i += binSize) {
    print chr"\t"i"\t"(i + binSize)"\t"binIdx;
    binIdx++;
  }
}' genes.bed > bins.bed
Here's what our example bins.bed looks like:
$ more bins.bed
chr1 1000 1600 1
chr1 1600 2200 2
chr1 2200 2800 3
chr1 2800 3400 4
chr1 3400 4000 5
chr1 6000 6600 1
chr1 6600 7200 2
chr1 7200 7800 3
chr1 7800 8400 4
chr1 8400 9000 5
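As a quick sanity check: each example gene spans 3000 bases, so each of its five bins should be 3000 / 5 = 600 bases wide, which is what we see above.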
Note that we're using the ID field to store the bin index; we're not including the name of the gene there.
If I understand your question correctly (and the rest of this answer depends on this understanding), the actual name of the gene doesn't matter for the final answer. We just want an average signal for each bin across all genes in the set, where every gene is divided into the same number of partitions regardless of its size.
(Also, for the purposes of demonstration, I'm "cheating" with my choice of gene BED regions and bins, so that everything partitions cleanly. I'll leave it to you to handle edge cases where things aren't so neatly divisible; one way you might do that is sketched below.)
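For instance, here's a minimal sketch of one possible approach (my own variation, not something the steps below depend on): compute integer bin boundaries with int(), so that rounding differences are pushed into the interior cut points, the last bin absorbs any remainder, and every gene is still covered end to end:

$ awk '
BEGIN {
  binNum = 5;                              # bins per gene
}
{
  chr = $1;
  start = $2;
  stop = $3;
  span = stop - start;
  for (binIdx = 1; binIdx <= binNum; binIdx++) {
    # integer boundaries; binStop for the last bin is always exactly stop
    binStart = start + int(span * (binIdx - 1) / binNum);
    binStop  = start + int(span * binIdx / binNum);
    print chr"\t"binStart"\t"binStop"\t"binIdx;
  }
}' genes.bed > bins.bed

On the cleanly divisible example genes above, this produces exactly the same bins.bed as the simpler loop.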
Next, let's set up an example signal file (coverage data). It doesn't matter what's in the ID field — only that something is there, so that the score or signal value is in the fifth column, thus ensuring that our tools are working with a properly UCSC-formatted BED file:
$ more signal.bed
chr1 1000 1500 id-1 1
chr1 1500 2000 id-2 3
chr1 2000 2500 id-3 4
chr1 2500 3000 id-4 4
chr1 3000 3500 id-5 8
chr1 3500 4000 id-6 20
chr1 4000 4500 id-7 15
chr1 4500 5000 id-8 12
chr1 5000 5500 id-9 30
chr1 5500 6000 id-10 35
chr1 6000 6500 id-11 20
chr1 6500 7000 id-12 10
chr1 7000 7500 id-13 15
chr1 7500 8000 id-14 12
chr1 8000 8500 id-15 23
chr1 8500 9000 id-16 26
Let's first do a naïve bedmap with the --mean operator to see what we get. We'll include the --echo operator, so we can see what mean signal is associated with what reference element (bin):
$ bedmap --echo --mean bins.bed signal.bed
chr1 1000 1600 1|2.000000
chr1 1600 2200 2|3.500000
chr1 2200 2800 3|4.000000
chr1 2800 3400 4|6.000000
chr1 3400 4000 5|14.000000
chr1 6000 6600 1|15.000000
chr1 6600 7200 2|12.500000
chr1 7200 7800 3|13.500000
chr1 7800 8400 4|17.500000
chr1 8400 9000 5|23.000000
Firstly, the precision of the mean result needs adjusting, using the --prec operator. Secondly, we'll remove the --echo operator, because we don't need it for the next step:
$ bedmap --mean --prec 0 bins.bed signal.bed
2
4
4
6
14
15
12
14
18
23
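(If you're wondering why the 3.5 and 12.5 means above became 4 and 12 rather than both rounding up, the zero-precision output looks like round-half-to-even, printf-style rounding; the 13.5 and 17.5 means becoming 14 and 18 are consistent with that.)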
As a first pass, to demonstrate how we'll get a final answer, we'll pass our bedmap-ed stream of mean values to another awk statement, to track cumulative sums of means over each individual bin, while counting the number of genes we traverse:
$ bedmap --mean --prec 0 bins.bed signal.bed \
  | awk '
BEGIN {
  binNum = 5;                              # must match the bin count used to build bins.bed
  binCount = 0;
  geneCount = 1;
}
{
  binIdx = (binCount % binNum) + 1;        # recover the bin index from the line number
  binCumulSums[binIdx] += $0;
  print binIdx"\t"$0"\t"binCumulSums[binIdx]"\t"geneCount;
  binCount++;
  if (binIdx == binNum) {                  # passing the last bin means we finished a gene
    geneCount++;
  }
}' - > intermediateResults.tsv
Let's take a look at these results:
$ more intermediateResults.tsv
1 2 2 1
2 4 4 1
3 4 4 1
4 6 6 1
5 14 14 1
1 15 17 2
2 12 16 2
3 14 18 2
4 18 24 2
5 23 37 2
Let's break this table of intermediate values down by column:
- The first column is the bin ID value (1 through 5).
- The second column is the mean signal across that individual bin.
- The third column is the cumulative sum of signal across bins of that bin ID (for example, for bin 4, the cumulative sum is 6 + 18, or 24).
- The fourth column is the gene index associated with the listed bin, irrespective of the gene name. Here, we just have two genes, foo and bar (see genes.bed), so we have two gene indices, 1 and 2. (In your case, the total number of genes is 30.2K, so you'd see indices 1..30200.)
We now have enough information to calculate a final answer. We'll remove the print statement in the awk command above that shows these intermediate values, and replace it with an END block that calculates and prints a final result, the average signal over each individual bin:
$ bedmap --mean --prec 0 bins.bed signal.bed \
  | awk '
BEGIN {
  binNum = 5;
  binCount = 0;
  geneCount = 1;
}
{
  binIdx = (binCount % binNum) + 1;
  binCumulSums[binIdx] += $0;
  binCount++;
  if (binIdx == binNum) {
    geneCount++;
  }
}
END {
  # geneCount finishes one past the number of genes, hence the -1
  for (idx = 1; idx <= binNum; idx++) {
    print idx"\t"binCumulSums[idx]/(geneCount - 1);
  }
}' - > answer.tsv
Let's look at the answer we'd expect from our sample data:
$ more answer.tsv
1 8.5
2 8
3 9
4 12
5 18.5
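To verify this against intermediateResults.tsv: the bin 1 value is the average of the two per-gene means, (2 + 15) / 2 = 8.5, and likewise bin 5 is (14 + 23) / 2 = 18.5.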
How well will this scale to your dataset? I honestly don't know, but since we're streaming through bedmap results one line at a time, I think the memory overhead is going to be perfectly reasonable on a typical, modern workstation.
If you want to optimize this further, you could even avoid the file I/O expense of creating the intermediate bins.bed file, and just pipe the initial bin-creating awk statement into bedmap (as well as taking out the bin index variable, which is unused by bedmap --mean), e.g.:
$ awk '
BEGIN {
  binNum = 5;
}
{
  chr = $1;
  start = $2;
  stop = $3;
  binSize = (stop - start) / binNum;
  # no bin index printed here, since bedmap --mean does not use the ID field
  for (i = start; i < stop; i += binSize) {
    print chr"\t"i"\t"(i + binSize);
  }
}' genes.bed \
  | bedmap --mean --prec 0 - signal.bed \
  | awk '
BEGIN {
  binNum = 5;
  binCount = 0;
  geneCount = 1;
}
{
  binIdx = (binCount % binNum) + 1;
  binCumulSums[binIdx] += $0;
  binCount++;
  if (binIdx == binNum) {
    geneCount++;
  }
}
END {
  for (idx = 1; idx <= binNum; idx++) {
    print idx"\t"binCumulSums[idx]/(geneCount - 1);
  }
}' - > probablyFasterAnswer.tsv
Here, we replaced bins.bed in the bedmap statement with the hyphen character (-), which denotes standard input. Eliminating disk-based I/O should make this analysis run even faster. Handling standard input and output in this way is one of the major strengths of BEDOPS tools.
Finally, this is one of those "embarrassingly parallel" problems, if you think about splitting bins by ID into separate analysis tasks (which I'll leave to you to implement, though a rough sketch follows).
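As a rough, untested sketch of that idea (the file names means.txt and answer.$k.txt are just ones I've picked for illustration): write the bedmap means to a file once, then average every binNum-th line in a separate background job per bin index:

$ bedmap --mean --prec 0 bins.bed signal.bed > means.txt
$ for k in 1 2 3 4 5; do
    awk -v k=$k -v binNum=5 '
      # lines k, k+binNum, k+2*binNum, ... belong to bin index k
      NR % binNum == k % binNum { sum += $0; n++ }
      END { print k"\t"sum/n }' means.txt > answer.$k.txt &
  done; wait
$ sort -k1,1n answer.?.txt > parallelAnswer.tsv

Note that the bedmap step still runs serially; this only spreads the per-bin averaging across cores, which matters more as the bin count grows.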
Sorry, I have a couple of questions. In the genes_bins.bed example file, the gene Xkr4 is divided into six bins. Is this correct? Second question: if you want to split each gene into 100 bins, how do you handle the fact that different genes may have different lengths?
Giovanni, that's just the head of the file; I will mention that. Secondly, that's the whole point of binning: the genes have unequal lengths. If you want to plot read density/distribution/composite profiles [ChIP-seq], then the genes should be binned, and the average binding per bin is calculated at the end and summed per bin across all genes, to get an average binding over the whole length of the gene. The effect is that some genes contribute more density to a bin and some less. Can you suggest some other strategy?