I am currently working on RNA-seq data. One of the aims is to scan intergenic regions for transcription of e.g. small RNAs, missing transcripts. I have aligned to the reference genome and filtered the reads (bwa + blat) and can compute overlaps and coverage in R/Bioconductor and IRanges, that works. The problem is methodical: how to best detect a peak in the sequencing coverage in an intergenic region. My procedure now is a rather naive:
- Find regions of coverage > arbitrary threshold (say 30), with width > min.width (e.g. 30 bases)
Remove all regions overlapping with genes/coding sequences(maybe + some 10 bp for promoters/UTR)
I am doing this using
findOverlaps()from the IRanges package in R.
That way I can find something, but the point I do not like is the arbitrary cutoff, because of this I am also missing a lot of candidate peaks or if I set the threshold too low I will get lots of false positive. The data is not strand specific, and I have different samples but no biological replicates.
Do you have a better suggestion for selecting a possibly dynamic cutoff or for finding peaks in sequencing data, doesn't have to be in R, just a hint to an algortihm will do.