Question

Find Sequences That Do Not Match A Target Sequence

2

10.9 years ago

Rnaer ▴ 120

From all the possible sequences of 30 nt (4 ** 30 possibilities), how can I quickly find the ones that do not match any region of the C. elegans genome? Is there any tool to do that?

alignment • 5.6k views

ADD COMMENT • link updated 10.9 years ago by jackuser1979 ▴ 890 • written 10.9 years ago by Rnaer ▴ 120

0

Does it mean you want to compare 4^30 sequences against the C. elegans genome?

ADD REPLY • link 10.9 years ago by Ashutosh Pandey 12k

0

Yes, basically, but of course not in a brute-force way.

ADD REPLY • link 10.9 years ago by Rnaer ▴ 120

1

No...I don't think you really want to check every single 30-mer in existence. It would take a thousand years. Why don't you start by generating, say, 1000 random 30-mers, and aligning them to your genome, and seeing if the worst aligning ones don't suit your needs.

ADD REPLY • link 10.9 years ago by swbarnes2 14k
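A minimal sketch of the sampling step suggested above, assuming uniform random bases and a FASTA layout that a short-read aligner could take as input. The function names and the `cand_` ID prefix are hypothetical, chosen only for illustration.

```python
import random


def random_kmers(n, k=30, seed=0):
    """Return n random DNA strings of length k, uniform over A, C, G, T."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return ["".join(rng.choice("ACGT") for _ in range(k)) for _ in range(n)]


def to_fasta(seqs, prefix="cand"):
    """Format sequences as FASTA text with IDs prefix_0, prefix_1, ..."""
    return "".join(f">{prefix}_{i}\n{s}\n" for i, s in enumerate(seqs))


# 1000 candidate 30-mers, as in the comment above
candidates = random_kmers(1000)
fasta_text = to_fasta(candidates)
```

Writing `fasta_text` to a file gives the query set to align against the genome; the candidates that align worst (or not at all) are the ones to keep.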

0

What about my answer, is this what you want?

ADD REPLY • link 10.9 years ago by a1ultima ▴ 850

0

I am afraid that your answer is not what he is looking for.

ADD REPLY • link 10.9 years ago by Ashutosh Pandey 12k

0

How many of those do you want? (All?)

ADD REPLY • link 10.9 years ago by Biomonika (Noolean) 3.2k

0

Well, ideally, the sequences most distant from any region of the target sequence, but one is enough.

ADD REPLY • link 10.9 years ago by Rnaer ▴ 120

0

If I understand you correctly, then I would probably just generate 100k random sequences of length 30 and, for every 30 nt window of C. elegans, run the Needleman-Wunsch algorithm against all 100k sequences and report the worst alignment.

ADD REPLY • link 10.9 years ago by Biomonika (Noolean) 3.2k
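The window-by-window idea above can be sketched as follows. This is a toy illustration only: the scoring values (match +1, mismatch -1, gap -1) are assumptions, only the forward strand is scanned, and the cost grows with genome length × candidates × 30², so it would not scale to a whole genome as written.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: b aligned to gaps
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # column 0: a aligned to gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def best_window_score(candidate, genome):
    """Highest alignment score of candidate against any same-length window."""
    k = len(candidate)
    return max(nw_score(candidate, genome[i:i + k])
               for i in range(len(genome) - k + 1))


def most_distant(candidates, genome):
    """Candidate whose best window alignment is the worst of all candidates."""
    return min(candidates, key=lambda c: best_window_score(c, genome))
```

The "worst alignment" here means the candidate whose best-scoring window is still the lowest, which is one reasonable reading of "most distant from any region".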

0

Yes, but that's too brute force. I actually agree with swbarnes2. It's not so trivial to implement, though. I just want to see if any tool exists.

ADD REPLY • link 10.9 years ago by Rnaer ▴ 120

0

Why don't you use bowtie? It's quite fast and you can use it with fasta sequences of the 30-mers. You can set the parameters to be exact matches and allow for multi-mapping. Then just look for the 30-mer sequences (reads) that don't align to the genome.

ADD REPLY • link 10.9 years ago by UnivStudent ▴ 440
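For the last step above (picking out the reads that did not align), here is a small sketch, assuming the aligner was asked to write SAM output. In SAM, bit 0x4 of the FLAG column marks an unmapped read. The function name and the toy records are hypothetical, and the aligner options for exact matching are not shown.

```python
def unaligned_read_names(sam_lines):
    """Yield names of reads whose SAM FLAG has the 0x4 (unmapped) bit set."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):  # skip blanks and headers
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & 0x4:
            yield name


# Toy SAM records: cand_0 aligned, cand_1 did not
sam = [
    "@HD\tVN:1.0\n",
    "cand_0\t0\tchrI\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "cand_1\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII\n",
]
print(list(unaligned_read_names(sam)))
```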

0

Even if each 30-mer takes 1 sec, testing all possibilities will take 4**30/3600 = 3.20256e+14 hrs.

ADD REPLY • link 10.9 years ago by Rnaer ▴ 120
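The arithmetic in this comment can be checked directly. The one-query-per-second rate is the comment's own assumption, not a measured figure.

```python
total = 4 ** 30                     # number of distinct 30-mers
hours_at_1_per_sec = total / 3600   # one query per second, as assumed above

print(total)                        # 1152921504606846976
print(f"{hours_at_1_per_sec:.5e}")  # 3.20256e+14
```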

0

You know that short read aligners work much faster than 1 read every second. Bowtie aligns tens of millions of reads in hours. So aligning 100K test reads would be feasible.

ADD REPLY • link 10.9 years ago by swbarnes2 14k

1

If you want to check for k-mers and which of them occur, just go with Jellyfish.

ADD REPLY • link 10.9 years ago by Phil S. ▴ 700
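To show what a k-mer counter gives you here, a toy sketch in plain Python (not Jellyfish itself): count canonical k-mers, so both strands are covered, then keep the candidates whose canonical form was never seen. All function names are hypothetical, and an in-memory `Counter` like this is only practical for small inputs.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, revcomp(kmer))


def count_kmers(genome, k):
    """Count canonical k-mers, skipping windows with N or other symbols."""
    counts = Counter()
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if set(kmer) <= _BASES:
            counts[canonical(kmer)] += 1
    return counts


def absent(candidates, counts):
    """Candidates with no exact match on either strand."""
    return [c for c in candidates if canonical(c) not in counts]
```

With k = 30, any candidate returned by `absent` has no exact occurrence in the genome; it says nothing about near matches.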

score 1 · Answer 1 · 2013-11-26

1

10.9 years ago

swbarnes2 14k

I feel like some kind of indexing software should be able to tell you all the 20-mers present in C.elegans, but I don't know what, exactly. That would be a start.

If you get multiple people bewildered by your question, you probably aren't approaching your problem correctly. If you said what problem you were trying to solve, maybe people could give you a more feasible solution.

ADD COMMENT • link 10.9 years ago by swbarnes2 14k
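One way to picture the "list the k-mers that are present" starting point: for a small k, all 4^k words can be enumerated and the present ones subtracted. This is a sketch under assumptions (toy input, both strands considered, hypothetical function names), and enumerating every word is only feasible for short k, not for k = 30.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def present_kmers(genome, k):
    """All k-mers occurring on either strand of the genome."""
    both = (genome, genome.translate(_COMPLEMENT)[::-1])
    return {s[i:i + k] for s in both for i in range(len(s) - k + 1)}


def absent_kmers(genome, k):
    """All k-mers over A, C, G, T that occur on neither strand, sorted."""
    present = present_kmers(genome, k)
    words = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted(w for w in words if w not in present)
```

Any longer sequence that contains an absent k-mer cannot have an exact match in the genome, so absent short words could serve as seeds for building non-matching 30-mers.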

0

Bowtie uses indexing; why not try that?

ADD REPLY • link 10.9 years ago by UnivStudent ▴ 440

0

That's along the lines of what I was thinking, but Bowtie is so named because it's a Burrows-Wheeler Transform algorithm, and the transform part might be problematic.

ADD REPLY • link 10.9 years ago by swbarnes2 14k

score 0 · Answer 2 · 2013-11-25

0

10.9 years ago

a1ultima ▴ 850

Yep, just use the exclude field in NCBI's BLASTn

ADD COMMENT • link 10.9 years ago by a1ultima ▴ 850

0

It does not answer my question. For blast, you have to provide the sequence as query; my question is to find the most dissimilar sequences to the database.

ADD REPLY • link 10.9 years ago by Rnaer ▴ 120

1

Your question has nothing to do with "most dissimilar sequence" - you wanted to "find unique sequences ... that does not match ... C.elegans genome"

ADD REPLY • link 10.9 years ago by PoGibas 5.1k

0

Then please consider making that the question asked above.

ADD REPLY • link 10.9 years ago by a1ultima ▴ 850

0

OK. I reworded the question.

ADD REPLY • link 10.9 years ago by Rnaer ▴ 120

score 0 · Answer 3 · 2013-12-02

0

10.9 years ago

jackuser1979 ▴ 890

I don't know if this will help you. Why don't you create two datasets, one containing all the possible sequences of 30 nt (A) and another with the C. elegans genome (B), do a bi-directional BLAST, and find the sequences in (A) that have not matched the (B) set?

ADD COMMENT • link 10.9 years ago by jackuser1979 ▴ 890