Hello, I have a problem concerning a differential expression analysis of small non-coding RNAs using sequencing data. I am trying to adapt a pipeline that I use for human small non-coding RNAs to mouse. For human, my set of reference sequences includes piRNA sequences from the piRBase database. The problem is that while piRBase has around 50-60 thousand sequences for human, it has 50 million for mouse.

This makes little sense to me from a biological perspective: why should mouse have so many more piRNAs than human? Consider that rat has around 120 thousand sequences in piRBase, a figure much closer to the human count than to the mouse one.

This is also a problem when I align reads against this reference, since BWA appears to use a lot of memory and then crash. I think the problem is related to the sheer number of sequences in the reference.

Does anyone know anything about this? Should I avoid using piRBase, at least for murine piRNA?

Any help in making sense of this is appreciated, thanks!

RNAcentral has ~73K piRNAs for mouse, if that helps any.

Thanks, that is a possible solution to the technical problem. 73k sequences are much more manageable, but I am still puzzled by the wildly different numbers in these databases...
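One way to look into the discrepancy, offered only as a sketch: count how many records each reference FASTA contains and how many distinct sequences remain once identical ones are collapsed. Whether either database actually contains redundant entries is an assumption to be tested, not something established in this thread. The helper names (`read_fasta`, `summarize`) and the file name in the comment are hypothetical, and the toy FASTA is made-up data.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(handle):
    """Return (total records, distinct sequences) for a FASTA handle.

    Sequences are upper-cased and U is mapped to T so that RNA and DNA
    spellings of the same piRNA count as one sequence.
    """
    total = 0
    unique = set()
    for _, seq in read_fasta(handle):
        total += 1
        unique.add(seq.upper().replace("U", "T"))
    return total, len(unique)


# Toy input (made up) with four records, two of them distinct.
toy = io.StringIO(">a\nACGT\n>b\nACGU\n>c\nacgt\n>d\nTTTT\n")
print(summarize(toy))  # (4, 2)

# On a real download (hypothetical file name):
# with open("mouse_pirna.fa") as fh:
#     print(summarize(fh))
```

Holding every sequence in a set is simple but memory-hungry, so for a file with tens of millions of records this may need several gigabytes of RAM.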

