DNA statistical studies - Local Base Composition Code
2.5 years ago

Hi there! :)

I have a question for you all. I am trying to study the local base composition of a DNA sequence using Python. In case you are not familiar with the term, here is what I mean:

Imagine that you have this DNA sequence:

g a g t t t t a t c g c g c t t c c a t g

You want to know how many a, c, g and t there are in one part of it (a window), and you want to repeat this count over the whole sequence, shifting the window by a fixed offset each time.

At the end, you have the base composition of each window taken from the original sequence.
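To make the idea concrete, here is a small sketch on the sequence above. The window of 8 bases and the offset of 4 are picked only for illustration, and `window_counts` is a hypothetical helper name, not part of the code in question.

```python
from collections import Counter

def window_counts(seq, window_len, offset):
    """Yield (start, window, counts) for each full window of the sequence."""
    # Stop at the last start position that still leaves a full window.
    for start in range(0, len(seq) - window_len + 1, offset):
        window = seq[start:start + window_len]
        counts = Counter(window)
        # Report all four bases, including those absent from the window.
        yield start, window, {base: counts[base] for base in "acgt"}

seq = "gagttttatcgcgcttccatg"
for start, window, counts in window_counts(seq, window_len=8, offset=4):
    print(start, window, counts)
```

Each printed line is one window: where it starts, the bases it contains, and how many of each base it holds.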

This is what I am trying to do in Python. Here is my code:

def composicionBasesLocal(seq, window_len = 200, offset = 100, circular = False):
    lowest = 0
    highest = window_len
    res = []

    while highest<=len(seq)-1:
        window = seq[lowest:highest+1]

        if lowest<= len(seq):
            mm = ModeloMultinomial(window)
            res.append(mm)

        else:
            break

            lowest = lowest + offset
            highest = highest + offset

    return(res)


ModeloMultinomial(seq) code:

def ModeloMultinomial(seq):
    ModMul = []
    pa = seq.count('A')/len(seq)
    pc = seq.count('C')/len(seq)
    pg = seq.count('G')/len(seq)
    pt = seq.count('T')/len(seq)

    ModMul.append([pa,pc,pg,pt])

    return(['pa','pc','pg', 'pt'], ModMul)


This code (composicionBasesLocal) doesn't give me any error message, but when I run it, it loops forever and I have to stop it. I also wrote it with a for loop, and that version works without any problem.

What have I done wrong? Thank you!! :D

dna local bases composition stadistics python
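Check the indentation of these two lines:

lowest = lowest + offset
highest = highest + offset

They have to run on every pass of the while loop, at the same indentation level as the if/else. If they don't, highest never changes, the condition highest <= len(seq)-1 stays true, and you are in an infinite loop.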

def composicionBasesLocal(seq, window_len = 200, offset = 100, circular = False):
    lowest = 0
    highest = window_len
    res = []

    while highest <= len(seq)-1:
        window = seq[lowest:highest+1]
        print(window)

        if lowest<= len(seq):
            mm = ModeloMultinomial(window)
            res.append(mm)

        else:
            break

        lowest = lowest + offset
        highest = highest + offset
        print(lowest)
        print(highest)

    return(res)
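Since the for-loop version was reported to work, one way to avoid this class of bug is to let range() advance the window, so there are no counters to update by hand. The following is only a sketch under a few assumptions: `local_base_composition` is a hypothetical name, bases are counted case-insensitively, and each window is exactly window_len bases long (the slice seq[lowest:highest+1] above takes one base more than window_len).

```python
def local_base_composition(seq, window_len=200, offset=100):
    """Return a list of (start, {base: frequency}) for each full window."""
    seq = seq.upper()  # assumption: treat 'a' and 'A' as the same base
    res = []
    # range() moves the window start, so the loop always terminates.
    for lowest in range(0, len(seq) - window_len + 1, offset):
        window = seq[lowest:lowest + window_len]
        freqs = {base: window.count(base) / window_len for base in "ACGT"}
        res.append((lowest, freqs))
    return res

composition = local_base_composition("gagttttatcgcgcttccatg", window_len=8, offset=4)
```

If the sequence is shorter than one window, the range is empty and the function returns an empty list instead of looping.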

22 months ago
schlogl ▴ 110
from collections import Counter, defaultdict

def get_kmers_counts(sequence, k=1):
    """Returns the count of all the contiguous and overlapping
    substrings of length K from a genome."""
    return Counter(sequence[i:i+k] for i in range(len(sequence) - k + 1))

def get_kmers_frequencies(sequence, k=1):
    """Returns the frequencies of all the contiguous and overlapping
    substrings of length K from a genome."""
    kmers = get_kmers_counts(sequence, k)
    total = sum(kmers.values())  # total number of k-mers, computed once
    freq = defaultdict(float)
    for mer, count in kmers.items():
        freq[mer] = round(count / total, 4)
    return freq


You can use these as two separate functions, or use them to build your own function! Paulo
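As one possible way of building on these functions for the local composition question, the sketch below applies the frequency function to each window in turn. The two k-mer functions are restated in a simplified form (returning a plain dict) so the block runs on its own, and `local_kmer_frequencies` is a hypothetical helper, not something from the thread.

```python
from collections import Counter

def get_kmers_counts(sequence, k=1):
    """Count all contiguous, overlapping substrings of length k."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def get_kmers_frequencies(sequence, k=1):
    """Frequency of each k-mer, rounded to 4 decimals."""
    kmers = get_kmers_counts(sequence, k)
    total = sum(kmers.values())
    return {mer: round(count / total, 4) for mer, count in kmers.items()}

def local_kmer_frequencies(sequence, window_len, offset, k=1):
    """Hypothetical helper: k-mer frequencies for each full window."""
    return [
        get_kmers_frequencies(sequence[start:start + window_len], k)
        for start in range(0, len(sequence) - window_len + 1, offset)
    ]

profiles = local_kmer_frequencies("gagttttatcgcgcttccatg", window_len=8, offset=4)
```

With k=1 this gives the base composition per window; a larger k gives local dinucleotide or longer k-mer frequencies from the same loop.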