Question

A bioinformatics query script written in Python

0

Entering edit mode

2.3 years ago

zainabi8077 ▴ 20

I'm new to Python and would appreciate any help you could give me. I'm looking for fundamental insertions and deletions in the genomes of two closely related species [E C & E F]. I used sequences from both species to do a FASTA pairwise alignment (glsearch36).

The portion of my Python script below shows how I identified a 7 nucleotide (heptamer) in one sequence (database) that correlates to a gap in the other sequence (query). Here's an example of what I've got:

    ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9
GAGGAAG

Let's say the gap is at position 9. I'm attempting to improve the script so that it only selects gaps that are 20 nucleotides or more apart on both sequences and only if the surrounding nucleotides match.

ATGCACAAGTAAGGTTACCG-ACCTGTATGTGAACTCAACA
                 ||| |||
GTGCTCGGGTCACCTTACCGGACCGCCCAGGGCGGCCCAAG
21
CCGGACC

This is the portion of my script; the upper half is concerned with opening various files. At the end, it also outputs a dictionary with the count of each sequence.

list_of_positions = []

for match in re.finditer(r'(?=(%s))' % re.escape("-"), dict_seqs[E_C]): 
    list_of_positions.append(match.start())

set_of_positions = set(list_of_positions)
for position in list_of_positions:
    list_no_indels = []
    for number in range(position-20, position) :
        list_no_indels.append(number)
    for number in range(position+1, position+21) :
        list_no_indels.append(number)
    set_no_indels = set(list_no_indels)
    if len(set_no_indels.intersection(set_of_positions))> 0 : continue

    if len(set_no_indels.intersection(set_of_positions_EF))> 0 : continue


    print position 
    #print match.start()

    print dict_seqs[E_F][position -3:position +3]

    key = dict_seqs[E_F][position -3: position +3]

    if nt_dict.has_key(key):
        nt_dict[key] += 1 
    else:
        nt_dict[key] = 1


print nt_dict

I was attempting to alter the results of pairwise alignments in order to find the nucleotides opposite the gaps in both the query and database sequences in order to perform some basic Insertion/Deletion analysis.

I was able to enhance my findings by extending the distance between gaps "-" to 20 nt's in an attempt to decrease noise. Above is an altered script.

This is an example of one of my findings, and at the end I have a dictionary that counts the number of times each sequence appears.

ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9 (position on the sequence)
GAGGAA (hexamer)


ATGCACAAGACCTGTATG # query
ATGCAGAG-AAGAGCAAG # database
9 (position)
CAAGAC (hexamer)

But, I am currently attempting to adjust the script such that the nucleotides around the gap match exactly like this, where the | is only to indicate matching nt's on each sequence:

GGTTACCG-ACCTGTATGTGAACTCAACA # query
     ||| ||
CCTTACCGGACCGCCCAGGGCGGCCCAAG # database

9
ACCGAC

Any assistance would be much appreciated!

python • 827 views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 2.3 years ago by zainabi8077 ▴ 20

score 0 · Answer 1 · 2023-03-11

It's not clear from your code how to retrieve that database sequence given the sequence query; but since you know the position (and with index = position-1) then something like:

def filter_flanking_matches(query, position, k=3):
    db = get_database_match_for_query(query)
    ix = position-1
    return ix > k and query[(ix-k):ix] == db[(ix-k):ix] and ix < len(query) and \
           query[position:(position+k)] == db[position:(position+k)])