If I have two small RNA libraries from different (but related) species, what would be the best way of finding the proportion of reads that are shared between libraries? How do I treat, for example, a read that is a subsequence of another read? Should the matching be reciprocal?
Maybe my idea is not the best, but software originally meant for constructing phylogenetic trees could help you. I understand you have no interest in building a tree, but the concept is the same as what you want to do. Software like PhyloBayes would output a likelihood value and would tell you what percent of your contigs overlap each other. So instead of comparing two species, you would fake it by comparing two RNA libraries. RNA libraries are quite small compared to DNA libraries, so that would be done in no time. I'm not sure how you would do this in PhyloBayes, but my idea is to use this kind of software as a starting point. Otherwise, depending on the size of your library, you can use Python or MySQL and write a piece of code to parse your reads. (MySQL is easier to learn, and yes, it DOES parse text and works really well on small libraries.)
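To make the "write a piece of code" route a bit more concrete, here is a minimal Python sketch, not a tested pipeline. It assumes the reads are already loaded as plain strings; the function name `shared_fraction`, the `allow_substring` switch, and the toy reads are all made up for illustration. It computes the fraction of reads in one library that also occur in the other, optionally counting a read as shared when it is a subsequence of a longer read, and it runs in both directions because the two proportions generally differ.

```python
from collections import Counter


def shared_fraction(reads_a, reads_b, allow_substring=False):
    """Fraction of reads in A (duplicates counted) that are found in B.

    With allow_substring=True, a read in A also counts as shared when it
    is contained in a longer read of B (hypothetical policy choice).
    """
    counts_a = Counter(reads_a)   # collapse identical reads, keep abundance
    unique_b = set(reads_b)       # only presence/absence matters for B
    shared = 0
    for seq, n in counts_a.items():
        if seq in unique_b:
            shared += n
        elif allow_substring and any(seq in other for other in unique_b):
            shared += n
    total = sum(counts_a.values())
    return shared / total if total else 0.0


# Toy reads, for illustration only.
lib_a = [
    "TGAGGTAGTAGGTTGTATAGTT",
    "TGAGGTAGTAGGTTGTATAGTT",
    "ACGTACGTACGTACGTACGTA",
    "GGTAGTAGGTTGTATAG",       # contained in the first read
]
lib_b = [
    "TGAGGTAGTAGGTTGTATAGTT",
    "TTTTGGGGCCCCAAAATTTTG",
    "CCCCAAAATTTTGGGGCCCCA",
]

a_in_b = shared_fraction(lib_a, lib_b)                            # A -> B
b_in_a = shared_fraction(lib_b, lib_a)                            # B -> A
a_in_b_sub = shared_fraction(lib_a, lib_b, allow_substring=True)  # A -> B, with containment
print(a_in_b, b_in_a, a_in_b_sub)
```

The substring scan is brute force, so it is only reasonable for small, collapsed libraries; for larger ones an index or an aligner would be the more sensible choice.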
Just my thoughts, I hope it helped.
This is a hard question. Many small RNAs could be conserved, but others will have minor differences. Take microRNAs: some families are highly conserved, with only 1-2 changes even between not-so-close species, while others share only a small portion of the sequence (the "seed" region of 6-8 bases) even between close species.
In this case, maybe you can start with a fixed length (21-24 nt) and check the proportion of reads that match with 0, 1, or 2 mismatches in both libraries.
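A rough Python sketch of that fixed-length check follows, under a few assumptions of mine: reads are trimmed from the 5' end to `length` bases, reads shorter than that are dropped, and each read is scored by its best (lowest) Hamming distance to any read in the other library. The names `hamming` and `mismatch_profile` and the toy reads are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_profile(reads_a, reads_b, length=21, max_mismatches=2):
    """Proportion of A's reads whose best match in B has 0, 1, ... mismatches.

    Reads are cut to `length` bases from the 5' end; shorter reads are
    skipped (an assumption, not the only reasonable choice).
    """
    trimmed_a = [r[:length] for r in reads_a if len(r) >= length]
    trimmed_b = {r[:length] for r in reads_b if len(r) >= length}
    counts = {k: 0 for k in range(max_mismatches + 1)}
    for read in trimmed_a:
        # Best (smallest) distance to any read in B; None if B is empty.
        best = min((hamming(read, other) for other in trimmed_b), default=None)
        if best is not None and best <= max_mismatches:
            counts[best] += 1
    total = len(trimmed_a)
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Toy reads, for illustration only.
lib_a = [
    "TGAGGTAGTAGGTTGTATAGTT",  # identical to a read in lib_b
    "TGAGGTAGTAGGTTGTGTGGTT",  # 2 mismatches to that read
    "ACGTACGTACGTACGTACGTAC",  # unrelated
    "TGAGGTAG",                # too short, skipped
]
lib_b = ["TGAGGTAGTAGGTTGTATAGTT"]

profile_ab = mismatch_profile(lib_a, lib_b)  # A against B
profile_ba = mismatch_profile(lib_b, lib_a)  # B against A
print(profile_ab, profile_ba)
```

Running it in both directions gives the reciprocal view. The all-against-all comparison is quadratic, so for real libraries it makes sense to collapse identical reads first or hand the job to a short-read aligner that allows mismatches.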