Question: In R or Python, How do I implement an allele sequence "autocomplete" tool given a dictionary of possible allele sequence matches?
1
gravatar for kristopherfernandez
14 days ago by
kristopherfernandez10 wrote:

In R or Python, how do I implement an allele sequence "autocomplete" tool given a dictionary of possible allele sequence matches?

That is, given, say an allele sequence:

CC_ATCGATCGTA_GATCGGCAA_GTGA

And given a limited number of dictionary values:

Allele 1: CCGATCGATCGTACGATCGGCAAGGTGA
Allele 2: ACTATCTATCGTAAGATCGGCAACGTGG
Allele 3: CCAATCGATCGTACGATCGGCAACGTGA
Allele 4: TCTATCTATCGTAAGATCGGCAACGTGG

The program will return to me Allele 1 and Allele 3 as possible reconstructions of the original allele sequence. Since:

Original:

CC_ATCGATCGTA_GATCGGCAA_GTGA

Matches:

Allele 1: CCGATCGATCGTACGATCGGCAAGGTGA
Allele 3: CCAATCGATCGTACGATCGGCAACGTGA

Ideally, the tool will search through the dictionary values as a tree search rather than exhaustively comparing the incomplete string with each dictionary entry. In other words, upon comparison of the first letter of alleles 2 and 3, they are automatically eliminated from consideration given that the first nucleotide does not match with the original string.

python assembly snp sequencing R • 172 views
ADD COMMENTlink modified 14 days ago by zx87547.1k • written 14 days ago by kristopherfernandez10
1

I would zip() the strings and iterate over the pairs, break if mismatch, continue if _ and return if reached the end

ADD REPLYlink written 14 days ago by WouterDeCoster38k
3
gravatar for manuel.belmadani
14 days ago by
Canada
manuel.belmadani730 wrote:

There's a lot of ways to do it, depending how large your dictionary is, or how often/quickly those are queried.

My intuition is that you probably want to use a Trie structure where you would insert all your alleles, and then query your trie once it's built.

I wrote a quick solution using LexPy:

from lexpy.trie import Trie
trie = Trie()
input_alleles = [
    "CCGATCGATCGTACGATCGGCAAGGTGA",
    "ACTATCTATCGTAAGATCGGCAACGTGG",
    "CCAATCGATCGTACGATCGGCAACGTGA",
    "TCTATCTATCGTAAGATCGGCAACGTGG"
]

trie.add_all(input_alleles)

query = "CC_ATCGATCGTA_GATCGGCAA_GTGA"
query_dist = query.count("_")

matches = trie.search_within_distance(query, dist=query_dist)

print matches

Results:

python trie-demo.py
['CCAATCGATCGTACGATCGGCAACGTGA', 'CCGATCGATCGTACGATCGGCAAGGTGA']

Further readings:

https://en.wikipedia.org/wiki/Trie
https://en.wikipedia.org/wiki/Finite-state_machine
https://en.wikipedia.org/wiki/Suffix_tree

ADD COMMENTlink written 14 days ago by manuel.belmadani730

Amazing! I am new to computer programming and learning these things really help me add new skills to my toolbox. So for that, thank you!

ADD REPLYlink written 14 days ago by kristopherfernandez10
0
gravatar for zx8754
14 days ago by
zx87547.1k
London
zx87547.1k wrote:

There must be some ready R packages for this specific task, but here is a simple one using agrep (not sure how well this will perform, but works for your example input):

x <- "CC_ATCGATCGTA_GATCGGCAA_GTGA"

y <- c("CCGATCGATCGTACGATCGGCAAGGTGA",
       "ACTATCTATCGTAAGATCGGCAACGTGG",
       "CCAATCGATCGTACGATCGGCAACGTGA",
       "TCTATCTATCGTAAGATCGGCAACGTGG")

y[ agrep(x, y) ]
# [1] "CCGATCGATCGTACGATCGGCAAGGTGA" "CCAATCGATCGTACGATCGGCAACGTGA"
ADD COMMENTlink written 14 days ago by zx87547.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1659 users visited in the last hour