I'm a graduate student with experience in wet lab research, and I've recently run into a problem that has led me to make a foray into bioinformatics. Specifically, my goal is to identify exact sequences that are repeated within one region of the genome but do not appear anywhere else in the genome. For example, I may want to find all sequences longer than 5 base pairs that occur more than 10 times on chromosome 2 but are not present on any other chromosome.
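To make the example concrete, here is a naive Python sketch of the kind of computation I have in mind. It is only an illustration under several assumptions: the sequences are already loaded as plain strings, only one fixed length `k` is checked at a time, only the forward strand is considered, and the function names and toy sequences are made up for this post. I don't expect it to scale to whole chromosomes for large `k`.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (forward strand only)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def region_specific_repeats(target, others, k, min_count):
    """Return k-mers seen at least min_count times in target and never in others.

    target: sequence of the region of interest (e.g. one chromosome).
    others: iterable of all remaining sequences to exclude against.
    """
    counts = kmer_counts(target, k)
    # Keep only unambiguous k-mers (no N or other symbols) that are frequent enough.
    candidates = {
        kmer: n
        for kmer, n in counts.items()
        if n >= min_count and set(kmer) <= set("ACGT")
    }
    # Drop any candidate that also occurs somewhere in the other sequences.
    for seq in others:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            candidates.pop(seq[i:i + k], None)
    return candidates


if __name__ == "__main__":
    # Toy stand-ins for "chromosome 2" and "every other chromosome".
    target = "ACGTAC" * 4 + "TTTT"
    others = ["GGGTACGGG"]
    print(region_specific_repeats(target, others, k=3, min_count=3))
```

For my real example I would call this with `min_count=11` ("more than 10 times") and loop over `k = 6, 7, ...` ("longer than 5 base pairs").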
I imagine that others have addressed this problem, but I have no idea where to start looking for preexisting code, or even what language I should be starting out with. I have some experience programming, have taken intro bioinformatics classes, and am not afraid to learn much more, but as a beginner it's difficult for me to judge what areas I should be focusing on.
My questions are:
- Where should I look for existing resources to solve my problem?
- What language is most suited to my question?
- Should I start from scratch and write my own algorithm, or should I start with something more user-friendly, like Galaxy?