Search for a pattern within a multiple sequence alignment
5 months ago
jmungar2 ▴ 10

Hello,

I need to search for a pattern within a multiple sequence alignment, allowing any number of - or . symbols to occur between the characters of the pattern. For example, I want to search for the string pattern RAGTLQYD (see bold characters) within the alignment below, and to do so I have to ignore any number of - and . symbols that appear between the characters of the pattern. I also want to print the position in the alignment where the first character of the pattern is located. So far I have this:

from re import search, IGNORECASE
import pandas as pd

df1 = pd.read_csv(multiple_sequence_alignment_file, delimiter="\t")
matchseq = pd.read_csv(file_of_patterns)  # all the patterns I want to search
for seq in matchseq:
    if search(seq, df1, IGNORECASE):
        print(seq, df1)


This works only for patterns that have no - or . symbols between their characters. I couldn't find in the re.search documentation how to ignore certain characters during the search. Any guidance would be really helpful.

-..-------------------------------HSLKYDKLYS.SKN..SLCYVLLIWLLTLAAVLPNLRAGTL.--.. QYDPR........IYSCTFAQSV..........SSAYTIAVVVFHFLV.PMIIVIFCYLRIWILVLQV-----------.

python search regex sequence-alignment pandas

Here is a regex that works with grep in bash. Try building the same regex in Python:

$ cat test.txt
PARCTR-----.........A...G-T.......LQYDRDTCG
MRDTCR..A..G..T..lq....--YdRAGTLQYD
$ grep -iPo 'R[-.]*A[-.]*G[-.]*T[-.]*L[-.]*Q[-.]*Y[-.]*D' test.txt
R-----.........A...G-T.......LQYD
R..A..G..T..lq....--Yd
RAGTLQYD


But if you are handling biological sequences, I would recommend using an established bioinformatics library such as Biopython.
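
For the plain-regex route, here is a minimal sketch of how the grep pattern above might be built in Python, assuming each aligned sequence is available as an ordinary string (not a pandas DataFrame, which re.search cannot scan). The helper names gapped_regex and find_gapped are made up for illustration and are not part of any library. The idea is to join the pattern's characters with [-.]* so that any run of gap symbols is tolerated between residues, and to use the match's start offset as the alignment column.

```python
import re

GAP_CHARS = "-."  # gap symbols used in the alignment from the question


def gapped_regex(pattern, gap_chars=GAP_CHARS):
    """Compile `pattern` so any run of gap characters may sit between its residues."""
    # Character class such as [\-\.]* ; re.escape keeps "-" and "." literal.
    gap = "[" + re.escape(gap_chars) + "]*"
    return re.compile(gap.join(re.escape(ch) for ch in pattern), re.IGNORECASE)


def find_gapped(pattern, aligned_seq):
    """Return (0-based start column, matched text) for each non-overlapping match."""
    regex = gapped_regex(pattern)
    return [(m.start(), m.group()) for m in regex.finditer(aligned_seq)]


# Hypothetical input: the two rows from the test.txt example above.
rows = [
    "PARCTR-----.........A...G-T.......LQYDRDTCG",
    "MRDTCR..A..G..T..lq....--YdRAGTLQYD",
]

for row in rows:
    for start, text in find_gapped("RAGTLQYD", row):
        # Add 1 to report a 1-based alignment column.
        print(start + 1, text)
```

Since m.start() indexes into the aligned string with its gaps included, it is already the alignment column of the first pattern character (0-based), which is the position asked for in the question. Matches are non-overlapping and case-insensitive, mirroring grep -io.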