We observed some RNA-seq coverage in regions outside annotated genes; let's call them intergenic. This expression appears to be more pronounced in, or unique to, a particular condition.
The goal is to find genomic regions with higher coverage than expected from random noise alone, along with a read count (expression) value for each. We are not looking to identify those regions at high resolution, but rather to get a broad overview in order to:
- test whether or not there is a trend for more intergenic expression in some conditions;
- intersect those expressed intergenic regions with other relevant genomic features.
Data: paired-end, total RNA-seq from a vertebrate species (not human).
A, simple approach:
- divide the genome into windows (size?)
- count reads per window
- remove regions containing genes +/- 5kb
- set background: randomly select X regions (e.g. 1000), repeated over 100 permutations, to estimate the background distribution. Define the cut-off as mean (or median) + 2*SD.
- use the cut-off to select intergenic regions with high expression. Merge those within 1kb of each other.
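
To make the steps of option A concrete, here is a minimal Python sketch of how I picture them. It is only one possible reading of the plan: the function names, the half-open coordinates, and the choice to average the per-permutation cut-offs are all my own assumptions, and the per-window read counts are assumed to come from some other tool.

```python
import numpy as np

def make_windows(chrom_len, size):
    """Tile one chromosome with non-overlapping, half-open [start, end) windows."""
    starts = np.arange(0, chrom_len, size)
    ends = np.minimum(starts + size, chrom_len)
    return starts, ends

def intergenic_mask(starts, ends, genes, flank=5_000):
    """True for windows that do not overlap any gene extended by `flank` bp."""
    keep = np.ones(len(starts), dtype=bool)
    for g_start, g_end in genes:
        overlaps = (starts < g_end + flank) & (ends > g_start - flank)
        keep &= ~overlaps
    return keep

def background_cutoff(counts, n_regions=1000, n_perm=100, n_sd=2.0, seed=0):
    """Assumed reading of the plan: draw `n_regions` windows `n_perm` times,
    compute mean + n_sd * SD for each draw, and average those cut-offs."""
    rng = np.random.default_rng(seed)
    size = min(n_regions, len(counts))
    cutoffs = []
    for _ in range(n_perm):
        sample = rng.choice(counts, size=size, replace=False)
        cutoffs.append(sample.mean() + n_sd * sample.std())
    return float(np.mean(cutoffs))

def merge_regions(starts, ends, max_gap=1_000):
    """Merge selected windows whose gap to the previous region is <= max_gap bp."""
    merged = []
    for s, e in sorted(zip(starts.tolist(), ends.tolist())):
        if merged and s - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return [tuple(m) for m in merged]
```

With per-window counts in hand, the selection would be something like `keep = intergenic_mask(...) & (counts > background_cutoff(counts[intergenic_mask(...)]))`, followed by `merge_regions(starts[keep], ends[keep])`.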
B, fancier following a histone mark-style approach:
- use csaw to calculate coverage using sliding windows (size?)
- remove bins containing genes +/- 5kb
- median coverage across those bins used to filter "expressed regions" (I could also use a permutation approach here)
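
The median-based filter in option B could look roughly like the sketch below. This is plain Python to show the logic only, not csaw's API (csaw is an R/Bioconductor package). The fold-change threshold `min_fold` and the pseudocount `pseudo` are hypothetical parameters I made up for illustration.

```python
import numpy as np

def filter_by_median(bin_counts, is_intergenic, min_fold=3.0, pseudo=1.0):
    """Keep intergenic bins whose count is at least `min_fold` times the
    median count of all intergenic bins (pseudocount avoids division by zero)."""
    counts = np.asarray(bin_counts, dtype=float)
    mask = np.asarray(is_intergenic, dtype=bool)
    background = np.median(counts[mask])      # median over intergenic bins only
    fold = (counts + pseudo) / (background + pseudo)
    return mask & (fold >= min_fold)
```

For example, `filter_by_median([0, 1, 1, 2, 20, 50], [True, True, True, True, True, False])` keeps only the fifth bin: the last bin has the highest count but is genic, so it is excluded.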
- Do either of the above options sound reasonable for what I am trying to accomplish?
- Is there some detail missing?
For the window size, I was thinking about using the average size of exons, since using the size of transcripts could lead to really large windows. Also, if the expression is "transcript-like" (short exons separated by variable-length introns), this could lead to large discrepancies in the average coverage and some regions might be missed.
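
As a rough way to pick that number, the exon lengths could be summarised from the annotation, as in the sketch below. It assumes the exons have already been parsed into half-open `(start, end)` pairs; the function name is hypothetical. I report the median as well as the mean because a few very long exons can pull the mean up.

```python
import statistics

def exon_length_summary(exons):
    """Return mean and median exon length from half-open (start, end) pairs."""
    lengths = [end - start for start, end in exons]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }
```

With toy exons `[(0, 100), (200, 350), (1000, 3000)]` the mean is 750 bp but the median is 150 bp, so the two would suggest quite different window sizes.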