I have a dataset of small RNA-seq reads that have their 3-prime adapter sequence trimmed such that the remaining seqeunce precisely represents the bases present in the biological fragment that was sequenced. Many of these sequences represent mature miRNAs, and I want to count how many times each known miRNA sequence occurs in the dataset. Obviously I can perform an exact string match and get a good estimate, but that ignores the possibility of sequencing errors and imprecise end cleavage. So I think I want to run an alignment of all my reads against every known human mature miRNA sequence allowing for gaps and mismatches and pick the best match for each one (if there is any match good enough). Does anyone know of a fast and efficient way to do this, preferable using R?
I don't know how to do it in R. But here is how I do it:
1) Select reads that range between 18-30 bp and aligned them against reference genome and only keep those reads that originated from the reference genome or got aligned. You can be little easy here with the mismatches. I will go with 2 mismatches.
2) Filtered reads that got aligned to database of small RNA other than miRNA. Either you can use a gtf file for the location of small RNAs or create a database of small rna and remove reads that mapped to them.
The first two steps can be combined or the first step can be totally skipped. I do it as a part of QC.
3) Align the filtered reads against the miRBase using SHRiMP2 under the miRNA mode using default settings. I used precursor microRNA database. I am not sure why I did this but somewhere I read that is advisable to align short reads against precursor miRNA rather than mature miRNA.
4) Then you can use some RPKM generator tool to get counts of various miRNAs from sam or bam file. I use my own script as sam files for miRNA samples are small. In case, a read gets aligned against two miRNAs , then i choose the alignment with the least mismatches.
You could try to map against the mature.fa sequence file from miRBase. It is easy to find out which of those sequences are human (the hsa-XXX sequences). Note that you may need to substitute Ts for Us in the miRBase sequences. If you want to map your reads using R, you could use the subread aligner via the Rsubread package.