Beginner with problem and no idea where to start
8.4 years ago
athena ▴ 50

I'm a graduate student with experience in wet lab research, but I've recently come across a problem for which I am making a foray into bioinformatics. More specifically, my goal is to identify exact sequences that are repeated within one region of the genome but do not appear anywhere else in the genome. For example, I may want to find all sequences longer than 5 base pairs that are repeated more than 10 times on chromosome 2 but are not present on any other chromosome.

I imagine that others have addressed this problem, but I have no idea where to start looking for preexisting code, or even what language I should be starting out with. I have some experience programming, have taken intro bioinformatics classes, and am not afraid to learn much more, but as a beginner it's difficult for me to judge what areas I should be focusing on.

My questions are:

  • Where should I look for existing resources to solve my problem?
  • What language is most suited to my question?
  • Should I be starting from scratch, writing my own algorithm, or should I start with something more user friendly, like Galaxy?
sequence • 1.9k views

Start by searching Google, and you will end up at papers like this one: PubMed


Thanks for the link! Because I'm unfamiliar with the jargon, it's been difficult for me to find relevant papers, but the one you reference looks like a good place for me to begin.


"For example, I may want to find all sequences longer than 5 base pairs that are repeated more than 10 times on chromosome 2 but are not present on any other chromosome."

"I imagine that others have addressed this problem"

Actually, that's very specific, and not at all easy. I can't imagine why someone else would have wanted to solve this problem. Why do you want to solve it? Also, your definition is vague. Do you mean tandem repeats, or repeats in general?


I mean repeats in general. My long-term goal is to design a single sgRNA for CRISPR that will cut at multiple positions within a defined region of a chromosome. I know that sgRNA design and avoiding off-target effects are very complicated, but I first wanted to determine whether there are repeats that are long enough, and unique enough to a region, to merit attempting to design the sgRNAs.


In the human genome, there are probably no 6 bp sequences that are present more than 10 times on one chromosome and absent from every other chromosome. A specific 6 bp sequence has a 1/4^6 (1 in 4096) chance of occurring at any given position in random sequence, so a 100 Mbp chromosome would be expected to contain roughly 24,000 copies of it by chance alone. It's unrealistic to expect it to be completely missing from a chromosome of that size.
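To make that arithmetic concrete, here is a minimal Python sketch. It assumes a uniform random sequence with equal base frequencies and counts one strand only, which real genomes do not obey, so treat the numbers as rough orders of magnitude. The function names are made up for illustration.

```python
def expected_hits(k, length_bp):
    """Expected number of occurrences of one specific k-mer on one strand
    of a uniform random sequence of the given length (an idealization)."""
    positions = length_bp - k + 1   # number of windows of size k
    return positions / 4 ** k       # each window matches with probability 1/4^k


def smallest_k_below_one(length_bp):
    """Smallest k for which a specific k-mer is expected less than once."""
    k = 1
    while expected_hits(k, length_bp) >= 1:
        k += 1
    return k


if __name__ == "__main__":
    chrom_len = 100_000_000  # the 100 Mbp chromosome from the post above
    print(f"expected 6-mer hits: {expected_hits(6, chrom_len):.0f}")
    print(f"smallest k expected < 1 time: {smallest_k_below_one(chrom_len)}")
```

Under these assumptions, a specific sequence only becomes rare by chance on a chromosome of this size once it is well over a dozen bases long, which is why very short repeats are not useful for this question.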

You can use KmerCountExact in the BBMap package to count the occurrences of specific kmers in a genome. For example,

kmercountexact.sh in=genome.fasta out=counts.fasta k=17

This will give you the counts of all 17-mers in the human genome, so you can find the ones that occur only once.
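For small test sequences, the same idea can be sketched in plain Python: count k-mers in the region of interest, then discard any that also appear elsewhere. This is only an illustration, not how KmerCountExact works internally. The function names are hypothetical, and the choice to check the background on both strands is my assumption. It holds everything in memory, so it will not scale to a whole genome.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Reverse complement of an uppercase A/C/G/T string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def count_kmers(seq, k):
    """Count every k-mer on the forward strand, skipping windows that
    contain anything other than A, C, G or T (for example N)."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def region_specific_repeats(target, others, k, min_count):
    """k-mers seen at least min_count times in `target` (forward strand)
    that do not occur on either strand of any sequence in `others`."""
    background = set()
    for seq in others:
        for kmer in count_kmers(seq, k):
            background.add(kmer)
            background.add(reverse_complement(kmer))
    return {
        kmer: n
        for kmer, n in count_kmers(target, k).items()
        if n >= min_count and kmer not in background
    }


if __name__ == "__main__":
    # Toy sequences, invented purely for illustration.
    region = "ACGTACGTACGTTTGA"
    rest_of_genome = ["GGTACGG"]
    print(region_specific_repeats(region, rest_of_genome, k=4, min_count=2))
```

At genome scale the counting step is better left to a dedicated k-mer counter, with a script like this used only to compare the resulting count tables.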

8.4 years ago

"Where should I look for existing resources to solve my problem?"

  • You should search for "mappability", e.g. http://www.ncbi.nlm.nih.gov/pubmed/22276185
  • If you're a real beginner, it's great that you want to learn a programming language, but I'm afraid it will take you a long time to get a result (unless you have a high resistance to frustration :-) ). I would ask a local bioinformatician for a collaboration.

First, thank you for giving me the term mappability; I think it will help get me started in the right direction. You may be right that I should team up with a bioinformatician; there are certainly some at my university who would be willing to help. However, I'd like to have a good enough grasp of the question I'm trying to ask, and of how it should be answered, before I approach anyone else. At the moment I don't even know the terminology or the issues involved in designing such a program, and I want to be able to understand how and why it works.

That being said, this project isn't really pressing and I don't have time constraints. At the moment, I can answer my question with some poorly coded Python and a very small set of sample data, but I know that it is too inefficient to work at the genome scale. Do you think that it would be reasonable for me to learn how to make a more efficient algorithm within, say, a year? I am in a human genetics program that is pretty well split between computational and wet lab research, and I do think it would be valuable for me to learn more about bioinformatics/programming, if only to understand my peers' work.
