Question: Binning Over Genes And Calculating The Coverage [Bedops/Bedmap]
5
gravatar for Sukhdeep Singh
6.2 years ago by
Sukhdeep Singh9.7k
Netherlands
Sukhdeep Singh9.7k wrote:

I have a list of genes and a coverage track (generated using BEDOPS-based binning script also called as binReads.sh), modified it to take bed file as input).

What I want do it to divide the each gene into 100 bins and then calculate the mean overlap from my coverage file, so that later all the respective bins can be added for all genes to give a average protein binding/coverage in that bin.

I couldn't understand that how I can make a specific number of bins for the genes though I can regulate the binSize using binReads.sh. I made a custom R code, for taking my list of genes and the dividing each of them into 100 bins.

Then, I used this list as a reference, (added a fake 4th column, which is also not very intuitive) to find the mean overlap for my coverage file.

Problems :

1) This method is not very intutitive (because of R code, it takes ~4-6 mins to complete the process for a list of 30.2K genes producing 30.2K*100 lines as a bed file to be used as a reference). It there a way in bedOps suite to generate the bins on the fly.

2) The output of bedmap --mean genes_bins.bed file.cov > means.tsv is valid numeric for first 42K reference lines out of 3.2M(30.2K*100) and for the rest it produces "NAN". To cross-validate, I took some specific binned genes and it produces a valid output. Is this because of the size of the files (shouldn't be).

Can you suggest some other way to approach this problem.

genes_bins.bed
chr1    3214481    3219051    Xkr4-bin1
chr1    3219051    3223621    Xkr4-bin2
chr1    3223621    3228192    Xkr4-bin3
chr1    3228192    3232762    Xkr4-bin4
chr1    3232762    3237332    Xkr4-bin5
chr1    3237332    3241902    Xkr4-bin6

coverage.bed
chr1    3001500    3001520    1    1
chr1    3001520    3001540    1    1
chr1    3001540    3001560    1    1
chr1    3001560    3001580    1    1
chr1    3001580    3001600    1    1
chr1    3001600    3001620    1    1
chr1    3001620    3001640    1    1

P.S. It was started as comment under Converting BAM to bedGraph for viewing on UCSC and I thought its better to ask as a seperate question. This probably be answered by Alex Reynolds himself. The final aim to make the composite profiles of your protein of interest from the chip-seq datasets.

Thanks

chipseq ngs coverage • 6.6k views
ADD COMMENTlink modified 3.7 years ago by James Ashmore2.6k • written 6.2 years ago by Sukhdeep Singh9.7k

sorry, I have a couple of questions. In the genes_bins.bed example file, the gene Xkr4 is divided into 6 bins. Is this correct? Second question: if you want to split each gene into 100 bins, how do you handle the fact that different genes may have different length?

ADD REPLYlink written 6.2 years ago by Giovanni M Dall'Olio26k

Giovanni, thats just the head of the file, I will mention that. Secondly, thats the whole point of binning, the genes have unequal length, if you want to make a Plotting Read Density/ Distribution/ Composite Profiles [Chip-Seq], then the genes should be be binned and avg binding per bin is calculated in the end to be summed up for per bin for all bins of all genes, to get an average binding over the whole length of the gene. This will effect, that sum genes have more density in a bin and some less. Can you suggest some other strategy!!

ADD REPLYlink written 6.2 years ago by Sukhdeep Singh9.7k
3
gravatar for Alex Reynolds
6.2 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

The bedops and bedmap utilities in the BEDOPS suite do not generate bins, but you could use awk to do this, which might be faster than using R.

To demonstrate how to solve your problem, let's start with a couple example genes foo and bar:

$ more genes.bed
chr1    1000    4000    foo
chr1    6000    9000    bar

Here's how we could partition them into, say, five bins for each gene:

$ awk ' \
BEGIN { \
    binNum=5; \
} \
{ \
    chr=$1; \
    start=$2; \
    stop=$3; \
    binIdx=1; \
    binSize=(stop-start)/binNum; \
    for (i=start;i<stop;i+=binSize) { \
        print chr"\t"i"\t"(i+binSize)"\t"binIdx; \
        binIdx++; \
    } \
}' \
genes.bed > bins.bed

Here's what our example bins.bed looks like:

$ more bins.bed    
chr1    1000    1600    1
chr1    1600    2200    2
chr1    2200    2800    3
chr1    2800    3400    4
chr1    3400    4000    5
chr1    6000    6600    1
chr1    6600    7200    2
chr1    7200    7800    3
chr1    7800    8400    4
chr1    8400    9000    5

Note that we're making the ID field the bin index — we're not including the name of the gene in the bin index.

If I understand your question correctly (and the rest of this answer depends on this understanding), the actual name of the gene doesn't matter for the final answer — we just want an average signal for a bin across all genes in the set, which are divided into equal numbers of partitions regardless of size.

(Also, for the purposes of demonstration, I'm "cheating" with my choice of gene BED regions and bins, so that everything partitions cleanly. I'll leave it to you to handle edge cases where things aren't so neatly divisible.)

Next, let's set up an example signal file (coverage data). It doesn't matter what's in the ID field — only that something is there, so that the score or signal value is in the fifth column, thus ensuring that our tools are working with a properly UCSC-formatted BED file:

$ more signal.bed
chr1    1000    1500    id-1    1
chr1    1500    2000    id-2    3
chr1    2000    2500    id-3    4
chr1    2500    3000    id-4    4
chr1    3000    3500    id-5    8
chr1    3500    4000    id-6    20
chr1    4000    4500    id-7    15
chr1    4500    5000    id-8    12
chr1    5000    5500    id-9    30
chr1    5500    6000    id-10   35
chr1    6000    6500    id-11   20
chr1    6500    7000    id-12   10
chr1    7000    7500    id-13   15
chr1    7500    8000    id-14   12
chr1    8000    8500    id-15   23
chr1    8500    9000    id-16   26

Let's first do a naïve bedmap with the --mean operator to see what we get. We'll include the --echo operator, so we can see what mean signal is associated with what reference element (bin):

$ bedmap --echo --mean bins.bed signal.bed
chr1    1000    1600    1|2.000000
chr1    1600    2200    2|3.500000
chr1    2200    2800    3|4.000000
chr1    2800    3400    4|6.000000
chr1    3400    4000    5|14.000000
chr1    6000    6600    1|15.000000
chr1    6600    7200    2|12.500000
chr1    7200    7800    3|13.500000
chr1    7800    8400    4|17.500000
chr1    8400    9000    5|23.000000

Firstly, the precision of the mean result needs adjusting, using the --prec operator. Secondly, we'll remove the --echo operator, because we don't need it for the next step:

$ bedmap --mean --prec 0 bins.bed signal.bed
2
4
4
6
14
15
12
14
18
23

As a first pass, to demonstrate how we'll get a final answer, we'll pass our bedmap-ed stream of mean values to another awk statement to track cumulative sums of means over an individual bin, while counting the number of genes we traverse:

$ bedmap --mean --prec 0 bins.bed signal.bed \
    | awk ' \
        BEGIN { \
            binNum=5; \
            binCount=0; \
            binCumulSums[$binNum]; \
            geneCount=1; \
        } \
        { \
            binIdx=(binCount%binNum)+1; \
            binCumulSums[binIdx]+=$0; \
            print binIdx"\t"$0"\t"binCumulSums[binIdx]"\t"geneCount; \
            binCount++; \
            if (binIdx==binNum) { \
                geneCount++; \
            } \
        }' \
        - > intermediateResults.tsv

Let's take a look at these results:

$ more intermediateResults.tsv
1   2   2   1
2   4   4   1
3   4   4   1
4   6   6   1
5   14  14  1
1   15  17  2
2   12  16  2
3   14  18  2
4   18  24  2
5   23  37  2

Let's break this table of intermediate values down by column:

  • The first column is the bin ID value (1 through 5).
  • The second column is the mean signal across that individual bin.
  • The third column is the cumulative sum of signal across bins of that bin ID (for example, for bin 4, the cumulative sum is 6 + 18, or 24.
  • The fourth column is gene index associated the listed bin, irrespective of the gene name. Here, we just have two genes foo and bar (see genes.bed), so we have two gene indices 1 and 2. (In your case, the total number of genes is 30.2K, so you'd see indices 1..30200.)

We now have enough information to calculate a final answer. We'll remove the print statement in the awk command above that shows these intermediate values, and replace it with an END block that calculates and prints a final result, the average signal over an individual bin:

$ bedmap --mean --prec 0 bins.bed signal.bed \
    | awk ' \
        BEGIN { \
            binNum=5; \
            binCount=0; \
            binCumulSums[$binNum]; \
            geneCount=1; \
        } \
        { \
            binIdx=(binCount%binNum)+1; \
            binCumulSums[binIdx]+=$0; \
            binCount++; \
            if (binIdx==binNum) { \
                geneCount++; \
            } \
        } \
        END { \
            for (idx=1;idx<=binNum;idx++) { \
                print idx"\t"binCumulSums[idx]/(geneCount-1); \
            } \
        }' \
        - > answer.tsv

Let's look at the answer we'd expect from our sample data:

$ more answer.tsv
1    8.5
2    8
3    9
4    12
5    18.5

How well will this scale to your dataset? I honestly don't know, but since we're streaming through bedmap results one line at a time, I think the memory overhead is going to be perfectly reasonable on a typical, modern workstation.

If you want to optimize this further, you could even avoid the file I/O expense of creating the intermediate bins.bed file and just pipe the initial bin-creating awk statement into bedmap (as well as taking out the bin index variable, which is unused by bedmap --mean) e.g.:

$ awk ' \
    BEGIN { \
        binNum=5; \
    } \
    { \
        chr=$1; \
        start=$2; \
        stop=$3; \
        binSize=(stop-start)/binNum; \
        for (i=start;i<stop;i+=binSize) { \
            print chr"\t"i"\t"(i+binSize); \
        } \
    }' genes.bed \
    | bedmap --mean --prec 0 - signal.bed \
    | awk ' \
        BEGIN { \
            binNum=5; \
            binCount=0; \
            binCumulSums[$binNum]; \
            geneCount=1; \
        } \
        { \
            binIdx=(binCount%binNum)+1; \
            binCumulSums[binIdx]+=$0; \
            binCount++; \
            if (binIdx==binNum) { \
                geneCount++; \
            } \
        } \
        END { \
            for (idx=1;idx<=binNum;idx++) { \
                print idx"\t"binCumulSums[idx]/(geneCount-1); \
            } \
        }' \
        - > probablyFasterAnswer.tsv

Here, we replaced bins.bed in the bedmap statement with the hyphen character (-), which denotes standard input. Eliminating disk-based I/O should make this analysis run even faster. Handling standard input and output in this way is one of the major strengths of BEDOPS tools.

Finally, this is a problem that is one of those "embarrassingly parallel" types, if you think about splitting bins by ID into separate analysis tasks (which I'll leave to you to implement).

ADD COMMENTlink modified 6.1 years ago • written 6.2 years ago by Alex Reynolds28k

Thanks for a detailed answer Alex. Great tool. It works great and finishes under 2 mins. I played with the script to test afew things. Just a small query, in my refernece file some coordinates overlap (same genes with multiple isoforms), so bedmap returns the NAN for the second overlap set. Thats, what I was reporting earlier as well. How to control this default behaviour. There is coverage(reads) in mapped file for that region, I have checked seperately, just it doesn't report it together.

Another thing, I have noticed after plotting was while adding the bins, they should go strand specifically, as in the gene reference file, we always write small co-ordinate in the start(2nd column) and bigger at the nd (3rd column) but it necessarily doesn't means like that. 3rd col would be the start of a gene present on the -ve strand. I can make a script in R to do that or can just reverse the co-ordinates of my gene reference file with a bigger co-ordinate at start for the gene on -ve strand and vice versa. This was not possible with bedtools as it gives an error but with bedmap, its possible(actually not really- with the test below).

Tried with:

map.bed

chr1    5    15      1     1
chr1    25    44      2     2
chr1    46    64      3     3

ref.bed
chr1      45      14      a

bedmap --echo --mean map.bed ref.bed

chr1      45      14      a|NAN

So, I will just a script to add the bins strand specifically.

Cheers

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by Sukhdeep Singh9.7k
1

So there are two problems with your test:

  1. You need to reverse the order of map.bed and ref.bed in the bedmap call
  2. You have a start coordinate that is greater than the stop coordinate in ref.bed

Here is an example of what comes from using the --ec option with bedmap, using your inputs:

$ bedmap --ec --echo --mean ref.bed map.bed
May use bedmap --help for more help.

Error: in ref.bed
End coordinates must be greater than start coordinates.
See row: 1

Using --ec on a small subset of your datasets is useful for catching these sorts of errors. It can double the runtime, however, so you generally don't want to use it on a full analysis, unless you can afford the added time expense.

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by Alex Reynolds28k
1

After a week of analysis and script building, I will say everything is working fine. For cross sample comparison (different chip-seq profiles), I will just normalize the calculated avg column by the read density of the sequenced sample.

Another question was, why it reports NAN, answer is because the file was not sorted properly. --ec tag also helps to track the problem. So, I made two separate lists of the overlapping genes and then calculated the mean abundance in each bin.

Thanks Alex for the support. I am reading the table file in R, reversing the bins for the genes on -ve strands, adding the output from two different gene lists and the plotting using ggplot.

Just for example : This is how H3K4me3 looks over all genes enter image description here

Cheers

ADD REPLYlink modified 4.7 years ago • written 6.1 years ago by Sukhdeep Singh9.7k

I don't think that is a valid BED file, where the start coordinate is greater than the stop coordinate. Use the --ec option to do error checking with bedmap and you should see an error message along these lines, I think.

ADD REPLYlink written 6.2 years ago by Alex Reynolds28k

Hi Alex, I am coming back to same point (about overlapping co-ordinates in a same bed file) here in a different perspective.

I made a list of alternate isoforms of a gene (different transcription start or end but overlapping) as a text file ( splitting each of them into 100 bins). They are lined up according to their respective co-ordinates and thus not sorted.

When I try to calculate mean coverage using bedmap it reports NAN for the second overlapping co-ordinate, as in the first case it already reported it. Using --ec we know that it says the file is not sorted. This is making it hard for me.

Example (In form of 10 bin file) Using bedmap --echo --mean --prec 1

chr1    4854693    4854752    Tcea1-1    +|8.5
chr1    4854752    4854811    Tcea1-2    +|9.0
chr1    4854811    4854871    Tcea1-3    +|9.0
chr1    4854871    4854930    Tcea1-4    +|6.5
chr1    4854930    4854990    Tcea1-5    +|3.0
chr1    4854990    4855049    Tcea1-6    +|6.5
chr1    4855049    4855108    Tcea1-7    +|12.0
chr1    4855108    4855168    Tcea1-8    +|12.5
chr1    4855168    4855227    Tcea1-9    +|12.0
chr1    4855227    4855287    Tcea1-10    +|11.0
chr1    4855327    4855386    Tcea1.1-1    +|NAN
chr1    4855386    4855445    Tcea1.1-2    +|NAN
chr1    4855445    4855505    Tcea1.1-3    +|NAN
chr1    4855505    4855564    Tcea1.1-4    +|NAN
chr1    4855564    4855624    Tcea1.1-5    +|NAN
chr1    4855624    4855683    Tcea1.1-6    +|NAN
chr1    4855683    4855742    Tcea1.1-7    +|NAN
chr1    4855742    4855802    Tcea1.1-8    +|NAN
chr1    4855802    4855861    Tcea1.1-9    +|NAN
chr1    4855861    4855921    Tcea1.1-10    +|NAN

My possible solution would be to either split each line as a separate file, or loop over each line using while in shell or something, but simply can't we change the default behaviour of bedmap of not caring about input file sorting but just reporting the overlap as demanded. Thanks

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by Sukhdeep Singh9.7k

BEDOPS tools require sorted input, but it is not clear to me why you can't sort to get the answer you want, simply by getting the --mean values from sorted bins (regardless of ID value), then post-processing to split into groups-by-ID. I must be missing something — can you explain in more detail why this is an issue?

ADD REPLYlink written 5.1 years ago by Alex Reynolds28k
1

Thanks Alex for your reply, actually It was a small problem with the way I am doing. If I sort the bed file, it messes up my unique id's and then they start overlapping because of the overlapping co-ordinates. What I do is split this whole file after calculating mean, into a list of matrices with 100 lines each. As the file is not synchronised (each 100 lines are not referring to same gene), so I wanted not to sort the file. But as a workaround, I added an serial column to my original file and sorted using bed-sort, calculated mean and then re-sorted the file according to the serial number. It working, Thanks !!

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by Sukhdeep Singh9.7k
2
gravatar for James Ashmore
3.7 years ago by
James Ashmore2.6k
UK/Edinburgh/MRC Centre for Regenerative Medicine
James Ashmore2.6k wrote:

This can also be done using BEDTools.

# Split each gene into 100 windows and label windows by window number
bedtools makewindows -b genes.bed -n 100 -i winnum  > windows.bed

# Compute coverage for each window, BAM input...
bedtools coverage -a windows.bed -b sample.bam -counts > coverage.bedgraph

# Compute coverage for each window, bedGraph input...
bedtools map -a windows.bed -b sample.bedgraph  -c 4 -o mean -null 0 > coverage.bedgraph

# Sort bedGraph by window number
sort -n -k4 coverage.bedgraph > coverage.sorted.bedgraph

# Calculate average coverage for each window number
bedtools groupby -i coverage.sorted.bedgraph -g 4 -c 5 -o mean > average.tsv
ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by James Ashmore2.6k

Thanks, very functional code, haven't tested it yet.

ADD REPLYlink written 3.7 years ago by Sukhdeep Singh9.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1290 users visited in the last hour