Find Sequences That Do Not Match a Target Sequence
10.0 years ago
Rnaer ▴ 120

From all the possible sequences of 30 nt (4 ** 30 possibilities), how can I quickly find the ones that do not match any region of the C. elegans genome? Is there a tool to do that?

alignment • 4.8k views

Do you mean you want to compare all 4^30 sequences against the C. elegans genome?


Yes, basically, but of course not by brute force.


No... I don't think you really want to check every single 30-mer in existence. It would take a thousand years. Why don't you start by generating, say, 1000 random 30-mers, aligning them to your genome, and seeing whether the worst-aligning ones suit your needs?
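
A minimal sketch of the first step suggested here, generating random 30-mers and formatting them as FASTA for whatever aligner is used. The function names, the `cand` prefix, and the fixed seed are illustrative assumptions, not part of any tool.

```python
import random

def random_kmers(n, k=30, seed=0):
    """Return n random DNA strings of length k, uniform over A, C, G, T."""
    rng = random.Random(seed)  # fixed seed only so the run is repeatable
    return ["".join(rng.choice("ACGT") for _ in range(k)) for _ in range(n)]

def to_fasta(seqs, prefix="cand"):
    """Format sequences as FASTA text (hypothetical record names cand0, cand1, ...)."""
    return "".join(f">{prefix}{i}\n{s}\n" for i, s in enumerate(seqs))

candidates = random_kmers(1000)
fasta_text = to_fasta(candidates)
```

Writing `fasta_text` to a file gives an input that a short-read aligner can take as reads.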


I am afraid that your answer is not what he is looking for.


How many of those do you want? (All?)


Well, ideally, the sequences most distant from any region of the target sequence, but one is enough.


If I understand you correctly, then I would probably just generate 100k random sequences of length 30, run the Needleman-Wunsch algorithm between every 30 nt window of C. elegans and all 100k sequences, and report the worst alignment.
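
A rough sketch of that idea, for illustration on tiny inputs only. The scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions, "worst" is taken to mean the candidate whose best score against any window is lowest, and the reverse strand is ignored. The cost is quadratic per pair, which is why this does not scale to a whole genome.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score, keeping only two rows."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def best_score_vs_genome(candidate, genome):
    """Highest score of the candidate against any same-length genome window."""
    k = len(candidate)
    return max(nw_score(candidate, genome[i:i + k])
               for i in range(len(genome) - k + 1))

def most_distant(candidates, genome):
    """Candidate whose best window score is the lowest."""
    return min(candidates, key=lambda c: best_score_vs_genome(c, genome))
```

For example, `most_distant(["ACGT", "TTTT"], "ACGTACGT")` picks `"TTTT"`, because `"ACGT"` has a perfect window match.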


Yes, but that's too brute force. I actually agree with swbarnes2. It's not so trivial to implement, though. I just want to see if any tool exists.


Why don't you use bowtie? It's quite fast, and you can use it with FASTA sequences of the 30-mers. You can set the parameters to require exact matches and allow multi-mapping. Then just look for the 30-mer sequences (reads) that don't align to the genome.


Even if each 30-mer takes 1 sec, testing all possibilities will take 4**30/3600 = 3.20256e+14 hrs.
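
The arithmetic can be checked in a few lines. The 1 second per 30-mer rate is the assumption from the comment above, and the 365-day year is a simplification.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 24 * 365            # simplification: ignores leap years

n_kmers = 4 ** 30                    # all possible 30-mers, equal to 2**60
hours = n_kmers / SECONDS_PER_HOUR   # assuming 1 second per 30-mer
years = hours / HOURS_PER_YEAR

print(f"{n_kmers:.5e} sequences")
print(f"{hours:.5e} hours")
print(f"{years:.2e} years")
```

Even at millions of reads per hour, the full enumeration stays far out of reach, which is why the thread turns to sampling.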


You know that short read aligners work much faster than 1 read every second. Bowtie aligns tens of millions of reads in hours. So aligning 100K test reads would be feasible.


If you want to check for k-mers and which of them occur, just go with Jellyfish.

10.0 years ago

I feel like some kind of indexing software should be able to tell you all the 20-mers present in C.elegans, but I don't know what, exactly. That would be a start.
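
A toy sketch of the indexing idea, assuming the genome fits in memory as one string: collect every k-mer present on both strands, then draw random k-mers until one is absent. All names here are illustrative. A plain Python set is unlikely to be practical for a full genome, where a dedicated k-mer counter is the better fit. Note that this only finds sequences with no exact match, not the most dissimilar ones.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def genome_kmers(genome, k):
    """Set of k-mers on both strands, skipping windows with non-ACGT characters."""
    seen = set()
    forward = genome.upper()
    for strand in (forward, revcomp(forward)):
        for i in range(len(strand) - k + 1):
            kmer = strand[i:i + k]
            if set(kmer) <= set("ACGT"):
                seen.add(kmer)
    return seen

def find_absent_kmer(seen, k, tries=10_000, seed=0):
    """Return a random k-mer not in `seen`, or None if none is found in `tries` draws."""
    rng = random.Random(seed)
    for _ in range(tries):
        cand = "".join(rng.choice("ACGT") for _ in range(k))
        if cand not in seen:
            return cand
    return None
```

Since the C. elegans genome is roughly 100 Mb, it can contain at most about 2 × 10^8 distinct 30-mers across both strands, a tiny fraction of 4^30, so almost any random 30-mer should pass this exact-match test.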

If you get multiple people bewildered by your question, you probably aren't approaching your problem correctly. If you said what problem you were trying to solve, maybe people could give you a more feasible solution.


Bowtie uses indexing; why not try that?


That's along the lines of what I was thinking, but Bowtie is so named because it's a Burrows-Wheeler Transform algorithm, and the transform part might be problematic.

10.0 years ago
a1ultima ▴ 840

Yep, just use the exclude field in NCBI's BLASTn


It does not answer my question. For blast, you have to provide the sequence as query; my question is to find the most dissimilar sequences to the database.


Your question has nothing to do with "most dissimilar sequence" - you wanted to "find unique sequences ... that does not match ... C.elegans genome"


OK. I reworded the question.

10.0 years ago
jackuser1979 ▴ 890

I don't know whether this will help you. Why don't you create two datasets, one containing all the possible sequences of 30 nt (A) and another with the C. elegans genome (B), do a bi-directional BLAST, and find the sequences in A that have not matched set B?