Peak Detection In Intergenic Regions In Rna-Seq Data
4
7
Entering edit mode
11.0 years ago

I am currently working on RNA-seq data. One of the aims is to scan intergenic regions for transcription of e.g. small RNAs, missing transcripts. I have aligned to the reference genome and filtered the reads (bwa + blat) and can compute overlaps and coverage in R/Bioconductor and IRanges, that works. The problem is methodical: how to best detect a peak in the sequencing coverage in an intergenic region. My procedure now is a rather naive:

• Find regions of coverage > arbitrary threshold (say 30), with width > min.width (e.g. 30 bases)
• Remove all regions overlapping with genes/coding sequences(maybe + some 10 bp for promoters/UTR)

I am doing this using views(), coverage(), and findOverlaps() from the IRanges package in R.

That way I can find something, but the point I do not like is the arbitrary cutoff, because of this I am also missing a lot of candidate peaks or if I set the threshold too low I will get lots of false positive. The data is not strand specific, and I have different samples but no biological replicates.

Do you have a better suggestion for selecting a possibly dynamic cutoff or for finding peaks in sequencing data, doesn't have to be in R, just a hint to an algortihm will do.

rna peak-calling bioconductor coverage • 7.7k views
5
Entering edit mode
11.0 years ago

A good place to get started would be adapting some of the statistical frameworks using in ChIP-seq analysis. Bioconductor has some higher level tools which will help with the normalization and peak finding. The slides and code from the Seattle workshop late last year. UC Riverside's awesome documentation should be helpful. You'll want to avoid the +/- strand requirements since your underlying data is different, but the general windowing approaches sound similar to what you've attempted.

As far as setting cutoffs, there is always a trade off between false positives and negatives. One approach, given that you are tackling a custom problem, is to categorize a set of regions by hand as yes/no answers according to some by eye criteria. Assuming this test set is relatively representative, you can then use it to measure your sensitivity and specificity at different cutoffs. Biconductor's ROC package might be useful here.

The alternative is to look at the distribution of scores and select a cutoff based on what percentage of the regions you expect to capture based on your biological understanding.

3
Entering edit mode
10.2 years ago

You might have a look at using Scripture.

0
Entering edit mode

Thanks Sean, I will definitely look into this tool, too, though the 'documentation' is very rudimentary, but that might improve. After taking a quick glance, it seems to me that that Scripture contains nothing related my problem I couldn't do easier with R/Bioconductor, but maybe you can point this out?

2
Entering edit mode
4.4 years ago

If you are trying to find deferentially expressed regions that don't necessarily match annotated regions, I would recommend derfinder. It's pretty quick (at least compared to trying to perform a novel assembly with cufflinks), although I typically would at least start by focusing on annotation-based quantification strategies.

However, it won't probably won't work as well with smaller regions (but, unless you are specifically using a small RNA-Seq protocol, your library size is probably larger than 150 bp and isn't really suitable for quantifying smaller RNA sequences).

I also haven't tested derfinder without replicates, and I would generally recommend the use of biological replicates for better quality results.

1
Entering edit mode
10.2 years ago
Alper Yilmaz ▴ 90

The USeq software, which was designed for ChIP-Seq initially, can be used for determining regions that are enriched in RNA-Seq as well. This might eliminate the arbitrary cut-off problem. Then, you can easily find regions that are not intergenic by using bedtools.

I hope this helps.

0
Entering edit mode

Will this software work for Illumina data?