I have miRNA short-read data for a few samples. I want to do a differential expression analysis to find out whether the samples differ in the expression level of any miRNA. I know there are a few tools available for this, but I still carried out most of the steps manually. This is what I did:
1) Selected reads between 18 and 32 nt long, aligned them against the reference genome, and kept only the ones that aligned.
2) Filtered out reads that aligned to a database of small RNAs other than miRNAs.
3) Aligned the remaining reads against miRBase with SHRiMP2 in miRNA mode, using the precursor miRNA database. I am not sure why I did this, but I read somewhere that it is advisable to align short reads against precursor miRNAs rather than mature miRNAs. Please feel free to comment on this step; I may not be right.
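For reference, the length selection in step 1 can be sketched roughly as below. This is only an illustration, not the exact script used: it assumes adapter-trimmed reads in plain 4-line FASTQ records, and the function name `filter_fastq_by_length` is hypothetical.

```python
def filter_fastq_by_length(lines, min_len=18, max_len=32):
    """Yield 4-line FASTQ records whose read length is within [min_len, max_len].

    Assumes every record is exactly four lines: header, sequence, '+', qualities.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # record[1] is the sequence line of the current record
            if min_len <= len(record[1]) <= max_len:
                yield tuple(record)
            record = []
```

It accepts any iterable of lines, so an open file handle can be passed directly.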
Now I have the SAM files, which I can use to quantify the expression of the different miRNAs. But since the samples have different numbers of starting reads, I need to normalize the counts. I can use a modified RPKM value that ignores the length factor and uses only the total aligned reads for a sample. My first question is: should I normalize the counts for a miRNA by a) the total number of reads aligned for a sample, including both miRNAs and non-miRNA small RNAs, or b) the total number of reads aligned to the miRNA database?
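To make options a) and b) concrete, here is a minimal sketch of the length-free RPKM (that is, reads per million) with either denominator. It assumes standard tab-separated SAM records and counts one hit per primary alignment. The function names and the toy numbers in the usage note are hypothetical and not taken from my data.

```python
from collections import Counter

def count_sam_hits(sam_lines):
    """Count primary aligned reads per reference name (SAM column 3)."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        # 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary
        if flag & (0x4 | 0x100 | 0x800) or rname == "*":
            continue
        counts[rname] += 1
    return counts

def reads_per_million(counts, library_size):
    """Scale raw counts by a chosen library size (RPKM without the length term)."""
    return {name: n * 1_000_000 / library_size for name, n in counts.items()}
```

With per-miRNA `counts` from one sample, option a) is `reads_per_million(counts, genome_aligned_total)`, where `genome_aligned_total` is the number of reads kept after step 1, and option b) is `reads_per_million(counts, sum(counts.values()))`. The two differ only by a per-sample constant, so the choice matters when the miRNA fraction of aligned reads varies between samples.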
Also, if you have a better idea for differential expression analysis, I would appreciate it.