Question

Binning Over Genes And Calculating The Coverage [Bedops/Bedmap]

5

Entering edit mode

12.7 years ago

Sukhi Singh 11k

I have a list of genes and a coverage track (generated using BEDOPS-based binning script also called as binReads.sh), modified it to take bed file as input).

What I want do it to divide the each gene into 100 bins and then calculate the mean overlap from my coverage file, so that later all the respective bins can be added for all genes to give a average protein binding/coverage in that bin.

I couldn't understand that how I can make a specific number of bins for the genes though I can regulate the binSize using binReads.sh. I made a custom R code, for taking my list of genes and the dividing each of them into 100 bins.

Then, I used this list as a reference, (added a fake 4th column, which is also not very intuitive) to find the mean overlap for my coverage file.

Problems :

1) This method is not very intutitive (because of R code, it takes ~4-6 mins to complete the process for a list of 30.2K genes producing 30.2K*100 lines as a bed file to be used as a reference). It there a way in bedOps suite to generate the bins on the fly.

2) The output of bedmap --mean genes_bins.bed file.cov > means.tsv is valid numeric for first 42K reference lines out of 3.2M(30.2K*100) and for the rest it produces "NAN". To cross-validate, I took some specific binned genes and it produces a valid output. Is this because of the size of the files (shouldn't be).

Can you suggest some other way to approach this problem.

genes_bins.bed
chr1    3214481    3219051    Xkr4-bin1
chr1    3219051    3223621    Xkr4-bin2
chr1    3223621    3228192    Xkr4-bin3
chr1    3228192    3232762    Xkr4-bin4
chr1    3232762    3237332    Xkr4-bin5
chr1    3237332    3241902    Xkr4-bin6

coverage.bed
chr1    3001500    3001520    1    1
chr1    3001520    3001540    1    1
chr1    3001540    3001560    1    1
chr1    3001560    3001580    1    1
chr1    3001580    3001600    1    1
chr1    3001600    3001620    1    1
chr1    3001620    3001640    1    1

P.S. It was started as comment under Converting BAM to bedGraph for viewing on UCSC and I thought its better to ask as a seperate question. This probably be answered by Alex Reynolds himself. The final aim to make the composite profiles of your protein of interest from the chip-seq datasets.

Thanks

ngs chipseq coverage • 13k views

ADD COMMENT • link updated 10.2 years ago by James Ashmore ★ 3.5k • written 12.7 years ago by Sukhi Singh 11k

0

Entering edit mode

sorry, I have a couple of questions. In the genes_bins.bed example file, the gene Xkr4 is divided into 6 bins. Is this correct? Second question: if you want to split each gene into 100 bins, how do you handle the fact that different genes may have different length?

ADD REPLY • link 12.7 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Giovanni, thats just the head of the file, I will mention that. Secondly, thats the whole point of binning, the genes have unequal length, if you want to make a Plotting Read Density/ Distribution/ Composite Profiles [Chip-Seq], then the genes should be be binned and avg binding per bin is calculated in the end to be summed up for per bin for all bins of all genes, to get an average binding over the whole length of the gene. This will effect, that sum genes have more density in a bin and some less. Can you suggest some other strategy!!

ADD REPLY • link 12.7 years ago by Sukhi Singh 11k

score 7 · Answer 1 · 2015-08-22

This can also be done using BEDTools.

# Split each gene into 100 windows and label windows by window number
bedtools makewindows -b genes.bed -n 100 -i winnum  > windows.bed

# Compute coverage for each window, BAM input...
bedtools coverage -a windows.bed -b sample.bam -counts > coverage.bedgraph

# Compute coverage for each window, bedGraph input...
bedtools map -a windows.bed -b sample.bedgraph  -c 4 -o mean -null 0 > coverage.bedgraph

# Sort bedGraph by window number
sort -n -k4 coverage.bedgraph > coverage.sorted.bedgraph

# Calculate average coverage for each window number
bedtools groupby -i coverage.sorted.bedgraph -g 4 -c 5 -o mean > average.tsv

Ram · Answer 2 · 2013-02-27

6

Entering edit mode

12.7 years ago

Alex Reynolds 36k

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 12.7 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks for a detailed answer Alex. Great tool. It works great and finishes under 2 mins. I played with the script to test afew things. Just a small query, in my refernece file some coordinates overlap (same genes with multiple isoforms), so bedmap returns the NAN for the second overlap set. Thats, what I was reporting earlier as well. How to control this default behaviour. There is coverage(reads) in mapped file for that region, I have checked seperately, just it doesn't report it together.

Another thing, I have noticed after plotting was while adding the bins, they should go strand specifically, as in the gene reference file, we always write small co-ordinate in the start(2nd column) and bigger at the nd (3rd column) but it necessarily doesn't means like that. 3rd col would be the start of a gene present on the -ve strand. I can make a script in R to do that or can just reverse the co-ordinates of my gene reference file with a bigger co-ordinate at start for the gene on -ve strand and vice versa. This was not possible with bedtools as it gives an error but with bedmap, its possible(actually not really- with the test below).

Tried with:

map.bed

chr1    5    15      1     1
chr1    25    44      2     2
chr1    46    64      3     3

ref.bed
chr1      45      14      a

bedmap --echo --mean map.bed ref.bed

chr1      45      14      a|NAN

So, I will just a script to add the bins strand specifically.

Cheers

ADD REPLY • link 12.7 years ago by Sukhi Singh 11k

1

Entering edit mode

So there are two problems with your test:

You need to reverse the order of map.bed and ref.bed in the bedmap call
You have a start coordinate that is greater than the stop coordinate in ref.bed

Here is an example of what comes from using the --ec option with bedmap, using your inputs:

$ bedmap --ec --echo --mean ref.bed map.bed
May use bedmap --help for more help.

Error: in ref.bed
End coordinates must be greater than start coordinates.
See row: 1

Using --ec on a small subset of your datasets is useful for catching these sorts of errors. It can double the runtime, however, so you generally don't want to use it on a full analysis, unless you can afford the added time expense.

ADD REPLY • link 12.7 years ago by Alex Reynolds 36k

1

Entering edit mode

After a week of analysis and script building, I will say everything is working fine. For cross sample comparison (different chip-seq profiles), I will just normalize the calculated avg column by the read density of the sequenced sample.

Another question was, why it reports NAN, answer is because the file was not sorted properly. --ec tag also helps to track the problem. So, I made two separate lists of the overlapping genes and then calculated the mean abundance in each bin.

Thanks Alex for the support. I am reading the table file in R, reversing the bins for the genes on -ve strands, adding the output from two different gene lists and the plotting using ggplot.

Just for example : This is how H3K4me3 looks over all genes enter image description here

Cheers

ADD REPLY • link 11.2 years ago by Sukhi Singh 11k

0

Entering edit mode

I don't think that is a valid BED file, where the start coordinate is greater than the stop coordinate. Use the --ec option to do error checking with bedmap and you should see an error message along these lines, I think.

ADD REPLY • link 12.7 years ago by Alex Reynolds 36k

0

Entering edit mode

Hi Alex, I am coming back to same point (about overlapping co-ordinates in a same bed file) here in a different perspective.

I made a list of alternate isoforms of a gene (different transcription start or end but overlapping) as a text file ( splitting each of them into 100 bins). They are lined up according to their respective co-ordinates and thus not sorted.

When I try to calculate mean coverage using bedmap it reports NAN for the second overlapping co-ordinate, as in the first case it already reported it. Using --ec we know that it says the file is not sorted. This is making it hard for me.

Example (In form of 10 bin file) Using bedmap --echo --mean --prec 1

chr1    4854693    4854752    Tcea1-1    +|8.5
chr1    4854752    4854811    Tcea1-2    +|9.0
chr1    4854811    4854871    Tcea1-3    +|9.0
chr1    4854871    4854930    Tcea1-4    +|6.5
chr1    4854930    4854990    Tcea1-5    +|3.0
chr1    4854990    4855049    Tcea1-6    +|6.5
chr1    4855049    4855108    Tcea1-7    +|12.0
chr1    4855108    4855168    Tcea1-8    +|12.5
chr1    4855168    4855227    Tcea1-9    +|12.0
chr1    4855227    4855287    Tcea1-10    +|11.0
chr1    4855327    4855386    Tcea1.1-1    +|NAN
chr1    4855386    4855445    Tcea1.1-2    +|NAN
chr1    4855445    4855505    Tcea1.1-3    +|NAN
chr1    4855505    4855564    Tcea1.1-4    +|NAN
chr1    4855564    4855624    Tcea1.1-5    +|NAN
chr1    4855624    4855683    Tcea1.1-6    +|NAN
chr1    4855683    4855742    Tcea1.1-7    +|NAN
chr1    4855742    4855802    Tcea1.1-8    +|NAN
chr1    4855802    4855861    Tcea1.1-9    +|NAN
chr1    4855861    4855921    Tcea1.1-10    +|NAN

My possible solution would be to either split each line as a separate file, or loop over each line using while in shell or something, but simply can't we change the default behaviour of bedmap of not caring about input file sorting but just reporting the overlap as demanded. Thanks

ADD REPLY • link 11.6 years ago by Sukhi Singh 11k

0

Entering edit mode

BEDOPS tools require sorted input, but it is not clear to me why you can't sort to get the answer you want, simply by getting the --mean values from sorted bins (regardless of ID value), then post-processing to split into groups-by-ID. I must be missing something - can you explain in more detail why this is an issue?

ADD REPLY • link updated 5.9 years ago by Ram 45k • written 11.6 years ago by Alex Reynolds 36k

1

Entering edit mode

Thanks Alex for your reply, actually It was a small problem with the way I am doing. If I sort the bed file, it messes up my unique id's and then they start overlapping because of the overlapping co-ordinates. What I do is split this whole file after calculating mean, into a list of matrices with 100 lines each. As the file is not synchronised (each 100 lines are not referring to same gene), so I wanted not to sort the file. But as a workaround, I added an serial column to my original file and sorted using bed-sort, calculated mean and then re-sorted the file according to the serial number. It working, Thanks !!

ADD REPLY • link 11.6 years ago by Sukhi Singh 11k