Background: I'm working on finding small RNA (more precisely, piRNA) sequences in a recently sequenced genome. The genome is currently available only as scaffolds. I have 3 biological replicates from small RNA-seq, and I would like to predict piRNA sequences in these samples. I know that there are prediction tools for this (either for individual sequences or for piRNA clusters), but I would like to know whether there is a more "robust" approach (Bioconductor packages or a workflow), or filtering steps that I haven't used yet.
Is it better to collapse all the samples together, or to process them individually and then look for sequences shared between them?
To predict piRNAs, do I first need to identify the sequences that are tRNAs, rRNAs, or miRNAs (this annotation would have to be generated) and then run the prediction on the remaining sequences?
My workflow so far: aligned reads are collapsed and filtered for counts > 9 and a length of 26-34 nt, with a uridine at the 1st base or an adenine at the 10th base (piRNA characteristics). According to this article, piRNAs are not conserved among species, so it would probably be useless to "blast" the sequences against known piRNAs from other species.
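To make the filter unambiguous, here is a minimal sketch of it in plain Python (not my actual pipeline). The function name, the toy sequences, and the counts are made up for illustration; the thresholds are the ones stated above, and I assume reads are stored in the DNA alphabet, so a leading T stands for uridine.

```python
def is_pirna_candidate(seq, count, min_count=10, min_len=26, max_len=34):
    """Return True if a collapsed read passes the filters described above."""
    seq = seq.upper().replace("T", "U")   # treat DNA-alphabet reads as RNA
    if count < min_count:                 # "counts > 9"
        return False
    if not (min_len <= len(seq) <= max_len):
        return False
    has_1u = seq[0] == "U"                # uridine at position 1
    has_10a = seq[9] == "A"               # adenine at position 10 (1-based)
    return has_1u or has_10a


# Hypothetical collapsed reads: sequence -> read count
collapsed = {
    "T" + "GCA" * 9: 25,             # 28 nt, 1U, enough reads        -> kept
    "G" + "C" * 8 + "A" + "G" * 18: 12,  # 28 nt, 10A but no 1U        -> kept
    "T" + "CCA" * 9: 3,              # 28 nt, 1U, too few reads       -> dropped
    "T" + "GCA" * 6: 50,             # 19 nt, too short               -> dropped
    "G" + "C" * 27: 40,              # 28 nt, neither 1U nor 10A      -> dropped
}

candidates = {s: n for s, n in collapsed.items() if is_pirna_candidate(s, n)}
```

Note that the condition is "1U or 10A", so a read with only one of the two features is kept; requiring both would be a stricter variant.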
I'm not sure how to proceed.
I used the function plyranges::join_overlap_self_directed(minoverlap = 20) to see whether there are reads with more or less the same sequence. From that, I got ~1.5 million overlaps (starting from ~300k genomic ranges (GRanges)), which made things more confusing than before.
I also tried to "reduce" the GRanges (using the function plyranges::reduce_ranges_directed()), but that gave me some ranges 2 kb long...
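My guess at what is going on, as a toy model in plain Python rather than plyranges: I assume the self-join reports every pair of ranges that overlap by at least minoverlap, and that reducing merges overlapping or adjacent ranges transitively. The read layout below (28-nt reads tiled every 2 nt on one strand of one scaffold) and all function names are invented for illustration.

```python
from itertools import combinations


def overlap_width(a, b):
    """Width of the overlap between two 1-based inclusive intervals (start, end)."""
    return min(a[1], b[1]) - max(a[0], b[0]) + 1


def pairwise_overlaps(ranges, minoverlap):
    """All unordered pairs of ranges that share at least `minoverlap` positions."""
    return [(a, b) for a, b in combinations(ranges, 2)
            if overlap_width(a, b) >= minoverlap]


def reduce_ranges(ranges):
    """Merge overlapping or adjacent intervals (assumed behaviour of a 'reduce')."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Extend the current merged block; overlaps chain from read to read.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# 100 hypothetical 28-nt reads starting at positions 1, 3, 5, ..., 199
reads = [(start, start + 27) for start in range(1, 200, 2)]

pairs = pairwise_overlaps(reads, minoverlap=20)   # 390 pairs from 100 reads
merged = reduce_ranges(reads)                     # a single range, (1, 226)
```

In this toy case 100 reads give 390 overlapping pairs (a real self-join may also count each pair in both directions), and reducing them yields one 226-nt range even though no read is longer than 28 nt. If this is the right picture, a dense pileup of slightly shifted reads would explain both the 1.5 million overlaps and the 2 kb ranges.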
As I don't have any experience with annotation, I am looking for advice, workflows, or suggestions on the current work.
Thank you for your time,