I have completed part Q1 of this question and I am not understanding what Q2 is asking for?
Q1. Consider the searching problem:
Input: A Genome G ∈ Σ∗ = A string (sequence of characters) over an alphabet of characters Σ = {a, t, c, g}; A reverse-complement restriction pattern k-cutters of length k, pk ∈ Σ∗, such that Reverse(Complement(pk)) = pk.
Output: An index i such that pk = G[i..i + k − 1] or the special value NIL if v does not appear in G. Write pseudocode for linear search, which scans through the sequence, looking for pk. How will you optimize it, if you need to search for multiple pk’s?
LinearSearch(G, pk):
i = 0
while i < length(G):
if G[i] == pk[0]:
i2 = i
j = 0
while G[i2] == pk[j] and j < k:
i2 = i2 +1
j = j + 1
if j == k:
return i
i = i + 1
if i == length(G):
return NIL
Q2. In the same setting as the one described above: Let |G| = n and |pk| = k, what is the probability of a pk occurring in G, assuming that G is a random string with all characters occurring equiprobably. Consider the distance between two consecutive occurrences of pk in G: What is its average value? Variance?