Find Sequences That Do Not Match A Target Sequence
10.9 years ago
Rnaer ▴ 120

Of all the possible sequences of 30 nt (4 ** 30 possibilities), how can I quickly find the ones that do not match any region of the C. elegans genome? Is there any tool to do that?

alignment • 5.6k views

Do you mean you want to compare all 4^30 sequences against the C. elegans genome?


Yes, basically, but of course not in a brute-force way.


No...I don't think you really want to check every single 30-mer in existence. It would take a thousand years. Why don't you start by generating, say, 1000 random 30-mers, and aligning them to your genome, and seeing if the worst aligning ones don't suit your needs.


What about my answer, is this what you want?


I am afraid that your answer is not what he is looking for.


How many of those do you want? (All?)


Well, ideally the sequences most distant from any region of the target sequence, but one is enough.


If I understand you correctly, then I would probably just generate 100k random sequences of length 30 and, for every 30 nt window of C. elegans, run the Needleman-Wunsch algorithm against all 100k sequences and report the worst alignment.
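The random-candidate idea can be sketched in a few lines. This is only a rough illustration under stated assumptions, not an existing tool: it swaps Needleman-Wunsch for an ungapped Hamming distance to stay short, ignores the reverse strand, and runs on a made-up toy genome. The helper names `min_hamming_distance` and `most_distant_candidate` are hypothetical.

```python
import random

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def min_hamming_distance(candidate, genome):
    """Smallest Hamming distance between candidate and any genome window."""
    k = len(candidate)
    return min(hamming(candidate, genome[i:i + k])
               for i in range(len(genome) - k + 1))

def most_distant_candidate(genome, k=30, n_candidates=1000, seed=0):
    """Draw random k-mers; return (distance, sequence) for the candidate
    whose closest genome window is farthest away."""
    rng = random.Random(seed)
    best = (-1, None)
    for _ in range(n_candidates):
        cand = "".join(rng.choice("ACGT") for _ in range(k))
        d = min_hamming_distance(cand, genome)
        if d > best[0]:
            best = (d, cand)
    return best

if __name__ == "__main__":
    # ASSUMPTION: a random 1 kb string stands in for the real genome.
    toy_rng = random.Random(1)
    toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(1000))
    dist, seq = most_distant_candidate(toy_genome, k=30, n_candidates=100)
    print(dist, seq)
```

The cost grows as candidates × genome length × k, so pure Python like this would be far too slow on a genome of roughly 100 Mb; it only shows what "report the worst alignment" means.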


Yes, but that's too brute force. I actually agree with swbarnes2. It's not so trivial to implement, though. I just want to see if any tool exists.


Why don't you use bowtie? It's quite fast and you can use it with fasta sequences of the 30-mers. You can set the parameters to be exact matches and allow for multi-mapping. Then just look for the 30-mer sequences (reads) that don't align to the genome.


Even if each 30-mer takes 1 sec, testing all possibilities would take 4**30/3600 = 3.20256e+14 hrs.
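The estimate is easy to reproduce, and it stays hopeless even at much higher speeds. In the quick check below, the rate of 10 million sequences per hour is an assumed round number for a fast aligner, not a measured benchmark.

```python
# Size of the search space: every possible 30-mer over {A, C, G, T}.
n_kmers = 4 ** 30  # 1,152,921,504,606,846,976 (= 2**60)

# At 1 sequence per second, as estimated above.
hours_at_1_per_sec = n_kmers / 3600

# ASSUMPTION: an aligner that processes 10 million sequences per hour.
assumed_rate_per_hour = 10_000_000
hours_at_assumed_rate = n_kmers / assumed_rate_per_hour
years_at_assumed_rate = hours_at_assumed_rate / (24 * 365)

print(f"{n_kmers:.6e} possible 30-mers")
print(f"{hours_at_1_per_sec:.5e} hours at 1 per second")
print(f"{years_at_assumed_rate:.3e} years at the assumed rate")
```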


You know that short read aligners work much faster than 1 read every second. Bowtie aligns tens of millions of reads in hours. So aligning 100K test reads would be feasible.


If you want to check for k-mers and which of them occur, just go with Jellyfish.

10.9 years ago

I feel like some kind of indexing software should be able to tell you all the 20-mers present in C.elegans, but I don't know what, exactly. That would be a start.
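The indexing idea can be illustrated with a plain Python set. This is a minimal sketch assuming a small genome string already held in memory; `present_kmers` and `absent_kmers` are hypothetical helper names, and a dedicated k-mer counter such as Jellyfish (mentioned earlier in the thread) would be the realistic choice for a whole genome.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]

def present_kmers(genome, k):
    """Set of all k-mers found on either strand of genome."""
    genome = genome.upper()
    seen = set()
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        # Skip windows containing N or other ambiguity codes.
        if set(kmer) <= set("ACGT"):
            seen.add(kmer)
            seen.add(reverse_complement(kmer))
    return seen

def absent_kmers(genome, k):
    """All k-mers on neither strand. Enumerates 4**k strings,
    so this is only feasible for small k."""
    seen = present_kmers(genome, k)
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [kmer for kmer in all_kmers if kmer not in seen]

if __name__ == "__main__":
    # ASSUMPTION: a toy sequence stands in for the real genome.
    toy_genome = "ACGTTGCAAGGCTTAACCGGATCGATCGTAGC"
    print(len(present_kmers(toy_genome, 3)), "3-mers present")
    print(absent_kmers(toy_genome, 3))
```

For k = 30 you would not enumerate all 4**30 strings. A genome of length G contains at most 2G distinct k-mers across both strands, so a genome of roughly 100 Mb covers at most about 2e8 of the roughly 1.15e18 possible 30-mers. Almost any random 30-mer is therefore absent as an exact match, and a membership test against the index is enough to confirm it.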

If you get multiple people bewildered by your question, you probably aren't approaching your problem correctly. If you said what problem you were trying to solve, maybe people could give you a more feasible solution.


Bowtie uses indexing; why not try that?


That's along the lines of what I was thinking, but Bowtie is so named because it's a Burrows-Wheeler Transform algorithm, and the transform part might be problematic.

ADD REPLY
0
Entering edit mode
10.9 years ago
a1ultima ▴ 850

Yep, just use the exclude field in NCBI's BLASTn



It does not answer my question. For blast, you have to provide the sequence as query; my question is to find the most dissimilar sequences to the database.


Your question has nothing to do with "most dissimilar sequence" - you wanted to "find unique sequences ... that does not match ... C.elegans genome"


Then please consider making that the question asked above.


OK, I reworded the question.

10.9 years ago
jackuser1979 ▴ 890

I don't know if this will help you. Why don't you create two datasets, one containing all the possible sequences of 30 nt (A) and another with the C. elegans genome (B), do a bi-directional BLAST, and find the sequences in (A) that have no match in (B)?
