Question: Peak Detection In Intergenic Regions In Rna-Seq Data
gravatar for Michael Dondrup
10.1 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

I am currently working on RNA-seq data. One of the aims is to scan intergenic regions for transcription of e.g. small RNAs, missing transcripts. I have aligned to the reference genome and filtered the reads (bwa + blat) and can compute overlaps and coverage in R/Bioconductor and IRanges, that works. The problem is methodical: how to best detect a peak in the sequencing coverage in an intergenic region. My procedure now is a rather naive:

  • Find regions of coverage > arbitrary threshold (say 30), with width > min.width (e.g. 30 bases)
  • Remove all regions overlapping with genes/coding sequences(maybe + some 10 bp for promoters/UTR)

    I am doing this using views(), coverage(), and findOverlaps() from the IRanges package in R.

That way I can find something, but the point I do not like is the arbitrary cutoff, because of this I am also missing a lot of candidate peaks or if I set the threshold too low I will get lots of false positive. The data is not strand specific, and I have different samples but no biological replicates.

Do you have a better suggestion for selecting a possibly dynamic cutoff or for finding peaks in sequencing data, doesn't have to be in R, just a hint to an algortihm will do.

ADD COMMENTlink modified 3.4 years ago by Charles Warden7.8k • written 10.1 years ago by Michael Dondrup47k
gravatar for Brad Chapman
10.1 years ago by
Brad Chapman9.5k
Boston, MA
Brad Chapman9.5k wrote:

A good place to get started would be adapting some of the statistical frameworks using in ChIP-seq analysis. Bioconductor has some higher level tools which will help with the normalization and peak finding. The slides and code from the Seattle workshop late last year. UC Riverside's awesome documentation should be helpful. You'll want to avoid the +/- strand requirements since your underlying data is different, but the general windowing approaches sound similar to what you've attempted.

As far as setting cutoffs, there is always a trade off between false positives and negatives. One approach, given that you are tackling a custom problem, is to categorize a set of regions by hand as yes/no answers according to some by eye criteria. Assuming this test set is relatively representative, you can then use it to measure your sensitivity and specificity at different cutoffs. Biconductor's ROC package might be useful here.

The alternative is to look at the distribution of scores and select a cutoff based on what percentage of the regions you expect to capture based on your biological understanding.

ADD COMMENTlink modified 22 months ago by RamRS27k • written 10.1 years ago by Brad Chapman9.5k
gravatar for Sean Davis
9.2 years ago by
Sean Davis26k
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

You might have a look at using Scripture.

ADD COMMENTlink written 9.2 years ago by Sean Davis26k

Thanks Sean, I will definitely look into this tool, too, though the 'documentation' is very rudimentary, but that might improve. After taking a quick glance, it seems to me that that Scripture contains nothing related my problem I couldn't do easier with R/Bioconductor, but maybe you can point this out?

ADD REPLYlink written 9.2 years ago by Michael Dondrup47k
gravatar for Charles Warden
3.4 years ago by
Charles Warden7.8k
Duarte, CA
Charles Warden7.8k wrote:

If you are trying to find deferentially expressed regions that don't necessarily match annotated regions, I would recommend derfinder. It's pretty quick (at least compared to trying to perform a novel assembly with cufflinks), although I typically would at least start by focusing on annotation-based quantification strategies.

However, it won't probably won't work as well with smaller regions (but, unless you are specifically using a small RNA-Seq protocol, your library size is probably larger than 150 bp and isn't really suitable for quantifying smaller RNA sequences).

I also haven't tested derfinder without replicates, and I would generally recommend the use of biological replicates for better quality results.

ADD COMMENTlink written 3.4 years ago by Charles Warden7.8k
gravatar for Alper Yilmaz
9.2 years ago by
Alper Yilmaz90
Alper Yilmaz90 wrote:

The question is already answered but still I wanted to offer an alternative for whomever reached this page for his/her needs.

The USeq software, which was designed for ChIP-Seq initially, can be used for determining regions that are enriched in RNA-Seq as well. This might eliminate the arbitrary cut-off problem. Then, you can easily find regions that are not intergenic by using bedtools.

I hope this helps.

ADD COMMENTlink written 9.2 years ago by Alper Yilmaz90

Will this software work for Illumina data?

ADD REPLYlink written 3.7 years ago by vimlakany0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1948 users visited in the last hour