I have been given deep sequencing data (Illumina/Solexa) for the short RNAs from the tissue of an organism whose genome has not been sequenced. From what I was told, the reads come from RNA that was size-selected (< 100 bps) by extracting from a gel.
I would like to create a list of all the microRNAs in the sample. I am most interested in miRNAs that are unique to my organism. I don't know its exact phylogenetic position, but as far as I know, nothing closely related (e.g. as closely related as mouse and rat) has been studied.
There have been a number of studies that ask the same question, but in organisms where the genomic sequence is already known. Programs like miRDeep can then be used to map the reads onto the genome and predict whether the reads come from microRNAs.
One option that was suggested was to run miRDeep against the genomes of organisms that have been sequenced. Because miRNAs are often conserved, I should find some miRNAs among my reads that way. One problem with this approach is that I am unlikely to find what I'm most interested in (miRNAs unique to my species), though I may find some that are 'archaic' (present in the genomes of these other organisms but no longer expressed there, or at least not detected under the conditions tested so far).
To begin, I used FreClu to generate a unique set of reads. I set a minimum read count of 5. I tried filtering out non-miRNA sequence: I BLASTed against Rfam and fRNAdb to identify non-miRNA short RNAs. I BLASTed against an EST-based transcriptome to identify any mRNA contaminants.
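To make the first step concrete, here is a minimal sketch of what "a unique set of reads with a minimum read count of 5" means. It only collapses exact duplicates; as I understand it, FreClu also folds likely sequencing errors into their more frequent parent read, which this sketch does not attempt. The function name and the toy reads are made up for illustration.

```python
from collections import Counter

def collapse_reads(reads, min_count=5):
    """Collapse identical reads and keep those seen at least min_count times.

    Exact-match collapsing only; this is NOT FreClu's error-aware clustering.
    """
    counts = Counter(r.strip().upper() for r in reads if r.strip())
    # most_common() orders by abundance, so the dict is sorted high to low
    return {seq: n for seq, n in counts.most_common() if n >= min_count}

# Toy input: one abundant read and one rare read (illustrative only)
reads = ["TGAGGTAGTAGGTTGTATAGTT"] * 6 + ["ACGTACGTACGTACGTACGT"] * 2
unique = collapse_reads(reads)
print(unique)
```

The threshold of 5 is the one I used; lowering it keeps more candidates but also more noise.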
After running miRDeep as suggested, I did indeed find many conserved miRNAs. I also used the seed sequence (nucleotides 2-7 or 2-8 from the 5' end) to find putative family relationships with miRBase members.
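The seed comparison can be sketched as follows. This is a rough illustration, assuming the known mature sequences have already been loaded into a dict of name to sequence (for instance from a miRBase mature-sequence FASTA); the function names and the placeholder entries `known_A` and `known_B` are hypothetical.

```python
from collections import defaultdict

def seed(seq, end=8):
    """Return nucleotides 2..end (1-based, inclusive) in the RNA alphabet."""
    return seq.upper().replace("T", "U")[1:end]

def seed_families(reads, known, end=8):
    """Map each read to the known miRNAs that share its seed.

    known: dict of name -> mature sequence (placeholder names here).
    """
    by_seed = defaultdict(list)
    for name, mature in known.items():
        by_seed[seed(mature, end)].append(name)
    return {r: sorted(by_seed.get(seed(r, end), [])) for r in reads}

# Placeholder reference entries and reads, for illustration only
known = {
    "known_A": "UGAGGUAGUAGGUUGUAUAGUU",
    "known_B": "UAGCUUAUCAGACUGAUGUUGA",
}
reads = ["TGAGGTAGTAGGCCCCCCCCCC", "ACGTACGTACGTACGTACGTAC"]
for read, hits in seed_families(reads, known).items():
    print(read, hits)
```

Passing `end=7` gives the 6-mer seed (positions 2-7) instead of the 7-mer (positions 2-8). A shared seed only suggests a family relationship; it says nothing about whether the read comes from a hairpin precursor.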
In the end, though, I still have lots of sequences that don't have obvious hits to other small RNAs. Presumably some of them could be the unique miRNAs that I am interested in. At the very least, they're sequences I cannot classify. Can someone suggest another computational approach I could use to try identifying which of these unclassified seqs could be miRNAs? I experimented with trying to match major and minor products among the reads, but what I ended up with was very noisy (too many possible matches to be useful and no luck at finding a match threshold that would give sensible results).
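To show what I mean by matching major and minor products, here is a simplified sketch of the kind of test involved. It assumes the two strands of a Dicer product pair without gaps and that each 3' end overhangs by 2 nt; real duplexes contain bulges and mismatches, so this is an illustration of the idea rather than a validated method. The function names are made up, and `perfect_partner` exists only to build a demonstration input.

```python
# Watson-Crick pairs plus G:U wobble
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
COMP = {"A": "U", "U": "A", "G": "C", "C": "G"}

def to_rna(seq):
    return seq.upper().replace("T", "U")

def duplex_score(five_p, three_p, overhang=2):
    """Fraction of paired positions in an ungapped duplex with 3' overhangs.

    Position i of five_p is tested against the position of three_p that
    sits opposite it when three_p's last `overhang` bases are unpaired.
    """
    a, b = to_rna(five_p), to_rna(three_p)
    n = min(len(a), len(b)) - overhang
    if n <= 0:
        return 0.0
    paired = sum((a[i], b[len(b) - 1 - overhang - i]) in PAIRS for i in range(n))
    return paired / n

def perfect_partner(five_p, overhang=2, tail="AA"):
    """Build an idealised partner strand (demonstration input only)."""
    a = to_rna(five_p)
    core = a[: len(a) - overhang]
    return "".join(COMP[c] for c in reversed(core)) + tail

major = "UGAGGUAGUAGGUUGUAUAGUU"
minor = perfect_partner(major)
print(duplex_score(major, minor))
print(duplex_score(major, "A" * 22))
```

With only a score cutoff and 19-23 nt sequences, many unrelated read pairs pass by chance, which is the noise problem I describe above.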
Thank you in advance for your help.
Edit (2011-11-15): Here are some more details on the deep sequencing data. I apologise to Larry and anyone else who was misled by the original description and its lack of detail. The data was given to me a long time ago and there was initially some confusion about its makeup, which I obviously internalised (I should have reviewed the e-mails again rather than relying on memory; again, I apologise for this mistake). The reads are single-end reads from an Illumina/Solexa (not 454) sequencer and are at most 36 bp long. The extracted gel band was supposed to contain 15-30 bp sequences, though I was told that it is not unusual for larger sequences (> 40 bp) to be extracted as well. Most of my reads are in the 19-23 bp range (some are at the 36 bp maximum, though I was not able to classify most of those).
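For anyone who wants to check the same thing on their own data, the length distribution can be tabulated roughly like this. It is a small sketch that assumes the collapsed reads are held as a dict of sequence to read count; the function names and toy counts are illustrative, not my real data.

```python
from collections import Counter

def length_profile(read_counts):
    """Total read count per read length, sorted by length."""
    profile = Counter()
    for seq, n in read_counts.items():
        profile[len(seq)] += n
    return dict(sorted(profile.items()))

def in_size_range(seq, lo=19, hi=23):
    """True if the read length falls in the range where most of my reads lie."""
    return lo <= len(seq) <= hi

# Toy counts (illustrative only)
read_counts = {"A" * 22: 120, "C" * 21: 40, "G" * 36: 7}
print(length_profile(read_counts))
print([len(s) for s in read_counts if in_size_range(s)])
```

Reads at the 36 bp maximum may be truncated longer RNAs, so their true length is unknown from the read alone.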