Question: Peak Detection In Intergenic Regions In Rna-Seq Data
gravatar for Michael Dondrup
8.8 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

I am currently working on RNA-seq data. One of the aims is to scan intergenic regions for transcription of e.g. small RNAs, missing transcripts. I have aligned to the reference genome and filtered the reads (bwa + blat) and can compute overlaps and coverage in R/Bioconductor and IRanges, that works. The problem is methodical: how to best detect a peak in the sequencing coverage in an intergenic region. My procedure now is a rather naive:

  • Find regions of coverage > arbitrary threshold (say 30), with width > min.width (e.g. 30 bases)
  • Remove all regions overlapping with genes/coding sequences(maybe + some 10 bp for promoters/UTR)

    I am doing this using views(), coverage(), and findOverlaps() from the IRanges package in R.

That way I can find something, but the point I do not like is the arbitrary cutoff, because of this I am also missing a lot of candidate peaks or if I set the threshold too low I will get lots of false positive. The data is not strand specific, and I have different samples but no biological replicates.

Do you have a better suggestion for selecting a possibly dynamic cutoff or for finding peaks in sequencing data, doesn't have to be in R, just a hint to an algortihm will do.

ADD COMMENTlink modified 2.2 years ago by Charles Warden6.5k • written 8.8 years ago by Michael Dondrup46k
gravatar for Brad Chapman
8.8 years ago by
Brad Chapman9.4k
Boston, MA
Brad Chapman9.4k wrote:

A good place to get started would be adapting some of the statistical frameworks using in ChIP-seq analysis. Bioconductor has some higher level tools which will help with the normalization and peak finding. The slides and code from the Seattle workshop late last year. UC Riverside's awesome documentation should be helpful. You'll want to avoid the +/- strand requirements since your underlying data is different, but the general windowing approaches sound similar to what you've attempted.

As far as setting cutoffs, there is always a trade off between false positives and negatives. One approach, given that you are tackling a custom problem, is to categorize a set of regions by hand as yes/no answers according to some by eye criteria. Assuming this test set is relatively representative, you can then use it to measure your sensitivity and specificity at different cutoffs. Biconductor's ROC package might be useful here.

The alternative is to look at the distribution of scores and select a cutoff based on what percentage of the regions you expect to capture based on your biological understanding.

ADD COMMENTlink modified 7 months ago by RamRS21k • written 8.8 years ago by Brad Chapman9.4k
gravatar for Sean Davis
8.0 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

You might have a look at using Scripture.

ADD COMMENTlink written 8.0 years ago by Sean Davis25k

Thanks Sean, I will definitely look into this tool, too, though the 'documentation' is very rudimentary, but that might improve. After taking a quick glance, it seems to me that that Scripture contains nothing related my problem I couldn't do easier with R/Bioconductor, but maybe you can point this out?

ADD REPLYlink written 8.0 years ago by Michael Dondrup46k
gravatar for Charles Warden
2.2 years ago by
Charles Warden6.5k
Duarte, CA
Charles Warden6.5k wrote:

If you are trying to find deferentially expressed regions that don't necessarily match annotated regions, I would recommend derfinder. It's pretty quick (at least compared to trying to perform a novel assembly with cufflinks), although I typically would at least start by focusing on annotation-based quantification strategies.

However, it won't probably won't work as well with smaller regions (but, unless you are specifically using a small RNA-Seq protocol, your library size is probably larger than 150 bp and isn't really suitable for quantifying smaller RNA sequences).

I also haven't tested derfinder without replicates, and I would generally recommend the use of biological replicates for better quality results.

ADD COMMENTlink written 2.2 years ago by Charles Warden6.5k
gravatar for Alper Yilmaz
8.0 years ago by
Alper Yilmaz90
Alper Yilmaz90 wrote:

The question is already answered but still I wanted to offer an alternative for whomever reached this page for his/her needs.

The USeq software, which was designed for ChIP-Seq initially, can be used for determining regions that are enriched in RNA-Seq as well. This might eliminate the arbitrary cut-off problem. Then, you can easily find regions that are not intergenic by using bedtools.

I hope this helps.

ADD COMMENTlink written 8.0 years ago by Alper Yilmaz90

Will this software work for Illumina data?

ADD REPLYlink written 2.5 years ago by vimlakany0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1231 users visited in the last hour