I have a collection of small RNA-seq datasets from D. melanogaster and I'm interested in de novo identification of any sort of small RNA (miRNA, piRNA, etc), as well as differential expression of new and annotated small RNAs. I'm struggling to find the best way to do this. The best advice I found was at Peak Detection In Intergenic Regions In Rna-Seq Data. However, I have a few additional questions.
The answer from Brad Chapman says to adapt the statistical framework from ChIP-seq algorithms, but how, and specifically, what? The underlying distributional assumptions for identifying DE in RNA-Seq data are often either negative binomial or Poisson, and both also seem to be the distributions used in ChIP-seq peak calling software, which is helpful (although this did surprise me; I would have thought given the locus-specific enrichment of ChIP that another distribution would be used). I've looked particularly at MACS as an example of what statistical frameworks I would need to address.
- -s; Sequence tag size; while MACS doesn't require this as input, it will determine it empirically if I don't. If I am interested in small RNAs of various sizes, do I need to set this parameter to be a different integer for any length I'm interested in?
- --shift; the default is zero and I think that is what I would want to keep it at, but I don't fully understand its use in ChIP-seq data so would like to verify that.
- --keep-dup; since I anticipate many small RNAs will come from repeated regions, I will use this parameter
- --cal-summits; since there may be clusters of transcribed small RNAs.
Is there anything else I should be aware of? I was going to follow this up with DiffBind in R to find DE peaks.
Are there any other methods to find enriched sequences in a sample? This doesn't seem like a novel question or method, so I'm slightly concerned there doesn't seem to be any literature using this technique.
In case it comes up, I can't use anything with precomputed genome files (ie UCSC or Ensembl) as the assembly releases are years out of date. Additionally, ~50% of reads don't align to the reference genome, which is not terribly surprising given I expect reads to arise from repeated/non-assembled/heterochromatic regions.
I really appreciate any insight or advice.