I'm hoping to implement a genomic DNA hash table, and I'm unsure how to handle N bases.
Should I skip the k-mers that contain them, or generate all the possible sequences, up to a limit of X Ns per k-mer?
I'm hoping to use the hash table to find perfect k-mer matches within the human genome (I'm aiming for a k-mer size of 10-12 nucleotides). My query sequences won't contain ambiguous bases, so I'm not too worried about handling those on the query side. I assume the best strategy is simply to skip any k-mer that contains an N.
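
To make the "skip" option concrete, here is a rough sketch of what I have in mind, written in Python only for illustration. The function name `build_kmer_index`, the default `k=11`, and the choice to store 0-based start positions are all placeholders of mine. It also assumes that anything other than A, C, G or T should be treated like N, and that soft-masked (lowercase) bases are upper-cased rather than skipped.

```python
from collections import defaultdict


def build_kmer_index(sequence, k=11):
    """Map each k-mer made only of A/C/G/T to its 0-based start positions.

    Any k-mer overlapping an N (or any other non-ACGT symbol) is skipped.
    """
    index = defaultdict(list)
    seq = sequence.upper()  # assumption: treat soft-masked bases as normal bases
    last_bad = -1           # position of the most recent non-ACGT base seen so far
    for i, base in enumerate(seq):
        if base not in "ACGT":
            last_bad = i
        start = i - k + 1   # start of the k-mer that ends at position i
        # Keep the k-mer only if it is full length and begins after the last bad base.
        if start >= 0 and start > last_bad:
            index[seq[start:i + 1]].append(start)
    return index


# Small example with k=4: every window overlapping the N at position 4 is dropped.
example = build_kmer_index("ACGTNACGTA", k=4)
print(dict(example))  # {'ACGT': [0, 5], 'CGTA': [6]}
```

Tracking the position of the last ambiguous base means each k-mer is accepted or rejected in constant time, instead of rescanning every window for an N. For the real genome I would probably pack each k-mer into an integer (2 bits per base) rather than keep strings as keys, but the skipping logic would stay the same.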