I have a list of genes and a coverage track (generated using BEDOPS-based binning script also called as binReads.sh), modified it to take bed file as input).
What I want do it to divide the each gene into 100 bins and then calculate the mean overlap from my coverage file, so that later all the respective bins can be added for all genes to give a average protein binding/coverage in that bin.
I couldn't understand that how I can make a specific number of bins for the genes though I can regulate the binSize using binReads.sh. I made a custom R code, for taking my list of genes and the dividing each of them into 100 bins.
Then, I used this list as a reference, (added a fake 4th column, which is also not very intuitive) to find the mean overlap for my coverage file.
Problems :
1) This method is not very intutitive (because of R code, it takes ~4-6 mins to complete the process for a list of 30.2K genes producing 30.2K*100 lines as a bed file to be used as a reference). It there a way in bedOps suite to generate the bins on the fly.
2) The output of bedmap --mean genes_bins.bed file.cov > means.tsv is valid numeric for first 42K reference lines out of 3.2M(30.2K*100) and for the rest it produces "NAN". To cross-validate, I took some specific binned genes and it produces a valid output. Is this because of the size of the files (shouldn't be).
Can you suggest some other way to approach this problem.
genes_bins.bed
chr1    3214481    3219051    Xkr4-bin1
chr1    3219051    3223621    Xkr4-bin2
chr1    3223621    3228192    Xkr4-bin3
chr1    3228192    3232762    Xkr4-bin4
chr1    3232762    3237332    Xkr4-bin5
chr1    3237332    3241902    Xkr4-bin6
coverage.bed
chr1    3001500    3001520    1    1
chr1    3001520    3001540    1    1
chr1    3001540    3001560    1    1
chr1    3001560    3001580    1    1
chr1    3001580    3001600    1    1
chr1    3001600    3001620    1    1
chr1    3001620    3001640    1    1
P.S. It was started as comment under Converting BAM to bedGraph for viewing on UCSC and I thought its better to ask as a seperate question. This probably be answered by Alex Reynolds himself. The final aim to make the composite profiles of your protein of interest from the chip-seq datasets.
Thanks

sorry, I have a couple of questions. In the genes_bins.bed example file, the gene Xkr4 is divided into 6 bins. Is this correct? Second question: if you want to split each gene into 100 bins, how do you handle the fact that different genes may have different length?
Giovanni, thats just the
headof the file, I will mention that. Secondly, thats the whole point of binning, the genes have unequal length, if you want to make a Plotting Read Density/ Distribution/ Composite Profiles [Chip-Seq], then the genes should be be binned and avg binding per bin is calculated in the end to be summed up for per bin for all bins of all genes, to get an average binding over the whole length of the gene. This will effect, that sum genes have more density in a bin and some less. Can you suggest some other strategy!!