I am a mathematical modeller with very limited bioinformatics experience, so please forgive me if the answer is obvious and several accepted algorithms already exist for this.
I have a set, A, of about 250 RNA sequences, each 100 bases long, which are associated with dysfunctional splicing under the action of a certain pathway inhibitor (details unimportant here). As a complementary data set, B, I also have ~10,000 more RNA sequences of the same length which show no observed dysfunctional splicing under the same inhibitor.
What I would like to run is some form of analysis to identify either structural properties that are shared by members of A and absent from B, or 'motifs' which are over-represented within either of the sets.
I can think of general ways to approach this. One is to work out what the distribution of motifs of different lengths should be within a random sample and compare it to my data. Another is to train a machine learning algorithm on the data set in the hope that it identifies patterns shared within each set. However, given how new I am to this area of bioinformatics, I don't want to waste effort if pattern recognition algorithms that could be tweaked to my problem are already rigorously established and accepted by the bioinformatics community.
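To make the first idea concrete, here is a minimal sketch of the kind of test I have in mind. It counts how many sequences in each set contain a given k-mer and asks whether the k-mer occurs in A more often than expected if the 250 sequences were drawn at random from A and B combined. The function names, the choice of k = 6, the one-sided hypergeometric test and the Bonferroni correction are all my own assumptions for illustration, not an established pipeline.

```python
from collections import Counter
from math import comb


def kmer_presence(seqs, k):
    """For each k-mer, count the sequences that contain it at least once."""
    counts = Counter()
    for s in seqs:
        # A set, so a k-mer repeated within one sequence is counted once.
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def hypergeom_sf(x, N, K, n):
    """P(X >= x) when n sequences are drawn from N, K of which hold the k-mer."""
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1))
    return tail / comb(N, n)


def enriched_kmers(set_a, set_b, k=6, alpha=0.05):
    """Return k-mers over-represented in set_a relative to set_a + set_b."""
    in_a = kmer_presence(set_a, k)
    in_b = kmer_presence(set_b, k)
    N, n = len(set_a) + len(set_b), len(set_a)
    n_tests = len(in_a)  # one test per k-mer seen in A
    hits = []
    for kmer, a in in_a.items():
        b = in_b.get(kmer, 0)
        p = hypergeom_sf(a, N, a + b, n)
        p_adj = min(1.0, p * n_tests)  # Bonferroni correction (conservative)
        if p_adj < alpha:
            hits.append((kmer, a, b, p_adj))
    return sorted(hits, key=lambda h: h[3])
```

Calling `enriched_kmers(A, B, k=6)` on two lists of sequence strings would return tuples of (k-mer, count in A, count in B, adjusted p-value), most significant first. The exact-integer tail sum is slow for sets of my size, so in practice I would presumably replace it with a library routine such as the hypergeometric distribution in `scipy.stats`.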
Alternatively, could anyone suggest a good resource for learning pattern recognition or machine learning techniques? I am proficient in Python, C++, MATLAB etc., so I would be happy just to hear of suitable algorithms or techniques rather than specific programming advice (although if anyone is aware of any useful libraries out there...)