Question

Question regarding searching problems

0

Entering edit mode

6.5 years ago

jingchen1996 • 0

I have completed part Q1 of this question and I am not understanding what Q2 is asking for?

Q1. Consider the searching problem:

Input: A Genome G ∈ Σ∗ = A string (sequence of characters) over an alphabet of characters Σ = {a, t, c, g}; A reverse-complement restriction pattern k-cutters of length k, pk ∈ Σ∗, such that Reverse(Complement(pk)) = pk.

Output: An index i such that pk = G[i..i + k − 1] or the special value NIL if v does not appear in G. Write pseudocode for linear search, which scans through the sequence, looking for pk. How will you optimize it, if you need to search for multiple pk’s?

LinearSearch(G, pk): 
   i = 0 
   while i < length(G): 
      if G[i] == pk[0]: 
      i2 = i 
      j = 0 
      while G[i2] == pk[j] and j < k: 
         i2 = i2 +1 
         j = j + 1  
if j == k: 
   return i 
   i = i + 1 
if i == length(G): 
   return NIL

Q2. In the same setting as the one described above: Let |G| = n and |pk| = k, what is the probability of a pk occurring in G, assuming that G is a random string with all characters occurring equiprobably. Consider the distance between two consecutive occurrences of pk in G: What is its average value? Variance?

genome string • 1.6k views

ADD COMMENT • link updated 6.5 years ago by kloetzl ★ 1.1k • written 6.5 years ago by jingchen1996 • 0

score 1 · Answer 1 · 2017-11-03

Instead of answering your homework for you, let me rephrase Q2:

Given pk and a random string of equal length g what is the probability of them to be the same string? How many gs are in G? Can all of them be considered independent? Accumulate these numbers into the probability of a string pk occurring in a random genome.

The second part seems non-obvious to me, but nonetheless doable if you find the right formula for the average distance.