I have mouse miRNA RNA-seq data and have been asked to give feedback on the quality of the library prepared for this project. The barcodes have already been trimmed off, and the final reads range from 20 bp to 70 bp. I have split the reads into two groups: one with read lengths of 20 to 35 bp (mature miRNAs) and the other longer than 35 bp (precursors).
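For reference, the length split can be sketched roughly as below. This is only a minimal illustration: it assumes a plain four-line-per-record FASTQ, and the function name `split_by_length` and the demo reads are made up for the example.

```python
import io

def split_by_length(fastq_handle, short_out, long_out, cutoff=35):
    """Send reads of length <= cutoff to short_out and longer reads to long_out.

    Assumes a plain FASTQ with exactly four lines per record.
    Returns a tuple (n_short, n_long).
    """
    n_short = n_long = 0
    while True:
        header = fastq_handle.readline()
        if not header:  # end of file
            break
        seq = fastq_handle.readline()
        plus = fastq_handle.readline()
        qual = fastq_handle.readline()
        record = header + seq + plus + qual
        if len(seq.strip()) <= cutoff:
            short_out.write(record)
            n_short += 1
        else:
            long_out.write(record)
            n_long += 1
    return n_short, n_long

# Tiny in-memory demo with two made-up reads (22 nt and 50 nt).
demo = io.StringIO(
    "@r1\n" + "A" * 22 + "\n+\n" + "I" * 22 + "\n"
    "@r2\n" + "C" * 50 + "\n+\n" + "I" * 50 + "\n"
)
short_buf, long_buf = io.StringIO(), io.StringIO()
counts = split_by_length(demo, short_buf, long_buf)
print(counts)
```

With real data the three handles would be opened files instead of `io.StringIO` buffers.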
Here is what I am planning to do:
a) Estimating the percentage of reads belonging to real miRNAs (based on the currently available annotation).
I will align the reads against known mouse miRNA hairpins and mature/mature-star miRNA sequences downloaded from miRBase. I will also align all the reads to the mouse reference genome and use the UCSC Genome Browser mouse miRNA track to check for contamination from non-miRNA sequences.
b) Calculating the complexity of the library.
Because miRNAs are short and a single miRNA can bind the 3' UTRs of many genes, I am not sure how to do this. What would be the right approach to determine whether our library samples the different types of miRNAs and quantifies them at the right levels?
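One rough starting point, purely as a sketch, would be to look at how the reads are distributed across distinct sequences: the fraction of distinct sequences, the share taken by the most abundant ones, and the Shannon entropy of the counts. These are generic diversity summaries, not an established complexity metric for miRNA libraries, and the function name `complexity_summary` and the toy reads are invented for the example.

```python
import math
from collections import Counter

def complexity_summary(reads, top_n=10):
    """Summarise how reads are spread across distinct sequences."""
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads given")
    distinct = len(counts)
    # Share of all reads taken up by the top_n most abundant sequences.
    top_share = sum(c for _, c in counts.most_common(top_n)) / total
    # Shannon entropy in bits; higher means reads are spread more evenly.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "total": total,
        "distinct": distinct,
        "distinct_fraction": distinct / total,
        "top_share": top_share,
        "entropy_bits": entropy,
    }

# Toy input: one sequence dominates the library.
reads = ["AAAA"] * 6 + ["CCCC"] * 2 + ["GGGG", "TTTT"]
summary = complexity_summary(reads, top_n=1)
print(summary)
```

A library where a handful of sequences take nearly all the reads would show a high `top_share` and low entropy. Whether that reflects biology (a few highly expressed miRNAs) or a library problem cannot be decided from these numbers alone.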
Any other suggestions or links to relevant papers would also be appreciated.
Don't be surprised if only a very small fraction of your sequences align to miRNAs.
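To get a quick feel for that fraction before running a proper aligner, an exact-match check against the mature sequences can be sketched as below. This is a stand-in for real alignment, not a replacement: exact matching misses isomiRs and reads with mismatches, so it will underestimate. The function names and the toy FASTA entries are hypothetical; the only assumption about miRBase is that its FASTA files use the RNA alphabet (U), so sequences are converted to T before comparing with DNA reads.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def mirna_fraction(reads, mature_fasta_lines):
    """Fraction of reads that exactly equal a known mature sequence."""
    # Convert the RNA alphabet (U) to DNA (T) so it matches sequencing reads.
    known = {seq.upper().replace("U", "T")
             for _, seq in parse_fasta(mature_fasta_lines)}
    reads = list(reads)
    if not reads:
        raise ValueError("no reads given")
    hits = sum(1 for r in reads if r.upper() in known)
    return hits / len(reads)

# Toy reference and reads, made up for illustration.
mature = [">toy-miR-1", "UGAGGUAGUAGGUUGUAUAGUU"]
reads = [
    "TGAGGTAGTAGGTTGTATAGTT",
    "ACGTACGTACGTACGTACGT",
    "TGAGGTAGTAGGTTGTATAGTT",
    "GGGGGGGGGGGGGGGGGGGGGG",
]
fraction = mirna_fraction(reads, mature)
print(f"{fraction:.0%} of reads match a known mature miRNA exactly")
```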