Question: Genomic DNA Hash Tables And Ambiguous Bases
5
Gww (2.6k), Canada, wrote 8.4 years ago:

Hi,

I'm hoping to implement a genomic DNA hash table, and I'm unsure how to handle N bases.

Should I skip the k-mers that contain them, or generate the possible sequences up to a limit of X Ns per k-mer?
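
To make the second option concrete, here is a minimal sketch of expanding the Ns in a k-mer up to a cap. The names `expand_ambiguous` and `max_n` are hypothetical and not from any existing tool; the cap plays the role of the "X" above. Note that each N multiplies the number of expanded k-mers by 4.

```python
from itertools import product

def expand_ambiguous(kmer, max_n=2):
    """Return every ACGT k-mer consistent with `kmer`.

    Returns an empty list if the k-mer has more than `max_n` Ns
    (hypothetical cap, standing in for the "X" limit).
    """
    kmer = kmer.upper()
    n_positions = [i for i, base in enumerate(kmer) if base == "N"]
    if len(n_positions) > max_n:
        return []
    expanded = []
    # One combination of concrete bases per way of filling the N positions.
    for combo in product("ACGT", repeat=len(n_positions)):
        bases = list(kmer)
        for pos, base in zip(n_positions, combo):
            bases[pos] = base
        expanded.append("".join(bases))
    return expanded
```

For example, `expand_ambiguous("ANG", max_n=1)` gives the four k-mers AAG, ACG, AGG and ATG, while a k-mer with more Ns than the cap is dropped.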

EDIT:

I'm hoping to use the hash table to find perfect k-mer matches within the human genome (I'm aiming for a k-mer size of 10-12 nucleotides). My query sequences won't have ambiguous bases so I'm not too worried about dealing with those. I assume the best strategy is to just skip sequences that have N's.
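
For k in the 10-12 range, one common approach is to pack each k-mer into an integer using 2 bits per base, so that k = 12 needs at most 4^12 = 16,777,216 distinct keys. The sketch below assumes that encoding; `ENCODE` and `kmer_to_int` are hypothetical names, and returning `None` is just one way to signal "skip this k-mer".

```python
# Assumed 2-bit code per base; any fixed assignment would work.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_to_int(kmer):
    """Pack an ACGT k-mer into an integer, 2 bits per base.

    Returns None if the k-mer contains N or any other non-ACGT
    character, so the caller can skip it.
    """
    value = 0
    for base in kmer.upper():
        code = ENCODE.get(base)
        if code is None:
            return None
        value = (value << 2) | code
    return value
```

With this scheme `kmer_to_int("ACGT")` is 27 (binary 00 01 10 11), and any k-mer containing an N maps to `None`.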

Thanks

alignment genomics • 3.2k views
modified 8.4 years ago by Gingi (330) • written 8.4 years ago by Gww (2.6k)
4
Gingi (330), Irvington, NY, wrote 8.4 years ago:

Yes, if you're looking for perfect matches, don't index k-mers that contain Ns.
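
As a rough illustration of that advice, here is a minimal sketch that uses a Python dict as the hash table and simply leaves out any k-mer with a non-ACGT character. `build_kmer_index` is a hypothetical helper, not part of Tallymer or any other tool.

```python
from collections import defaultdict

VALID_BASES = set("ACGT")

def build_kmer_index(sequence, k):
    """Map each ACGT-only k-mer to the list of its 0-based start positions."""
    index = defaultdict(list)
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Skip windows that overlap an N (or any other ambiguous base).
        if set(kmer) <= VALID_BASES:
            index[kmer].append(start)
    return index

index = build_kmer_index("ACGTNACGT", 4)
print(index["ACGT"])  # [0, 5]
```

Every window overlapping the N is left out, so a query without ambiguous bases can only ever hit exact ACGT matches.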

Are you coding the k-mer index yourself? You might want to take a look at Tallymer, which creates an index similar to what you have in mind.

1

Thanks for the advice, I want to code it myself mostly for the learning experience (I haven't written a hash table before). But that link will be really helpful.

written 8.4 years ago by Gww (2.6k)
2
brentp (23k), Salt Lake City, UT, wrote 8.4 years ago:

You'll probably get more useful answers if you indicate your intended use of the hash table.

Meanwhile, check this thread to see how existing software handles the problem.

As another data point, bowtie just treats non-ACGT characters in a read as mismatches, but it's not using a hash.

1

Strictly speaking, bowtie treats an ambiguous base as a random base in mapping. It corrects for that afterwards, but this is different from building the ambiguity in the index.

written 8.4 years ago by lh3 (31k)

Thanks for the answer, I updated my question with a bit more information regarding my goals. Oh PS. I really enjoyed your blog article on bloom filters :).

written 8.4 years ago by Gww (2.6k)

@lh3, aye, but that's in the reference. At least according to the docs: "Ambiguous characters in the read mismatch all other characters."

written 8.4 years ago by brentp (23k)