I would like to do a sliding window using the start position, a non overlapping window size of 20Kb along the chromosome, and calculate the median expression for that window.
I haven't been able to find a question similar to mine, most people seem to have different structured files and use bed/sam/vcf tools
I looked at this question (Binning Of Chromosome Data) but here the answer is only for the number of genes within that window, and not the expression.
The file expression.bed is your signal or map file in a bedmap operation. It contains the score signal in the fifth column, per UCSC specification.
Your sliding windows are then put into a second BED file, which acts as the reference file in a bedmap operation.
Given the bounds of chromosomes, you can easily make sliding windows across each chromosome with bedops.
Say you are working with genome build hg19, which has the following extents:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The file answer.bed contains each sliding window region (via the --echo operator), and the median expression signal over that window (via the --median operator).
Since it makes no meaningful statistical sense to calculate a median over a sample size of 0, consider adding the --skip-unmapped operator to leave out windows over which there is no expression signal:
The revised file answer.bed contains each sliding window region (via the --echo operator), and the median expression signal over that window (via the --median operator), only where there is any overlapping signal across the window (via the --skip-unmapped operator).
How do you want to deal with genes that span more than one window? If a gene overlaps a window just by one base I guess you shouldn't assign to that window the full gene expression, right?
The procedure below assigns gene expression to windows weighing expression by % overlap (so if 50% of a gene overlaps a window, then 50% of that gene expression is assigned to that window).
Convert expression file to bed (first line, header, skipped):
Looks to work well, but for the end file it doesn't seem to be able to calculate the median of the rpkm for each window (I got a median of 0 for each window) after following your steps. I think it has something to do with the intersect line, as the result from that command looks like this:
If a window does not intersect any gene, as in the lines in your comment, then the median is zero as it should be. Make sure chromosome names are the same in your input files. Here you have chr1, in your original post you have Chr1.
Looks to work well, but for the end file it doesn't seem to be able to calculate the median of the rpkm for each window (I got a median of 0 for each window) after following your steps. I think it has something to do with the intersect line, as the result from that command looks like this:
If a window does not intersect any gene, as in the lines in your comment, then the median is zero as it should be. Make sure chromosome names are the same in your input files. Here you have chr1, in your original post you have Chr1.