I'm new to Python and would appreciate any help you could give me. I'm looking for fundamental insertions and deletions in the genomes of two closely related species [E C & E F]. I used sequences from both species to do a FASTA pairwise alignment (glsearch36).
The portion of my Python script below shows how I identified a 7 nucleotide (heptamer) in one sequence (database) that correlates to a gap in the other sequence (query). Here's an example of what I've got:
ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9
GAGGAAG
Let's say the gap is at position 9. I'm attempting to improve the script so that it only selects gaps that are 20 nucleotides or more apart on both sequences and only if the surrounding nucleotides match.
ATGCACAAGTAAGGTTACCG-ACCTGTATGTGAACTCAACA
||| |||
GTGCTCGGGTCACCTTACCGGACCGCCCAGGGCGGCCCAAG
21
CCGGACC
This is the portion of my script; the upper half is concerned with opening various files. At the end, it also outputs a dictionary with the count of each sequence.
list_of_positions = []
for match in re.finditer(r'(?=(%s))' % re.escape("-"), dict_seqs[E_C]):
list_of_positions.append(match.start())
set_of_positions = set(list_of_positions)
for position in list_of_positions:
list_no_indels = []
for number in range(position-20, position) :
list_no_indels.append(number)
for number in range(position+1, position+21) :
list_no_indels.append(number)
set_no_indels = set(list_no_indels)
if len(set_no_indels.intersection(set_of_positions))> 0 : continue
if len(set_no_indels.intersection(set_of_positions_EF))> 0 : continue
print position
#print match.start()
print dict_seqs[E_F][position -3:position +3]
key = dict_seqs[E_F][position -3: position +3]
if nt_dict.has_key(key):
nt_dict[key] += 1
else:
nt_dict[key] = 1
print nt_dict
I was attempting to alter the results of pairwise alignments in order to find the nucleotides opposite the gaps in both the query and database sequences in order to perform some basic Insertion/Deletion analysis.
I was able to enhance my findings by extending the distance between gaps "-" to 20 nt's in an attempt to decrease noise. Above is an altered script.
This is an example of one of my findings, and at the end I have a dictionary that counts the number of times each sequence appears.
ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9 (position on the sequence)
GAGGAA (hexamer)
ATGCACAAGACCTGTATG # query
ATGCAGAG-AAGAGCAAG # database
9 (position)
CAAGAC (hexamer)
But, I am currently attempting to adjust the script such that the nucleotides around the gap match exactly like this, where the | is only to indicate matching nt's on each sequence:
GGTTACCG-ACCTGTATGTGAACTCAACA # query
||| ||
CCTTACCGGACCGCCCAGGGCGGCCCAAG # database
9
ACCGAC
Any assistance would be much appreciated!