Python script for text matching
3.6 years ago
anasjamshed ▴ 120

I have a FASTA file with 30 sequences and a pattern file that contains different motifs.

I need to match my motifs against the FASTA sequences, and each location where a motif matches should be highlighted, like this:

[image]

The script I have developed so far:

# biopython
from Bio import SeqIO
# regex library
import re

# file with FASTA sequence
infile = r"C:\Users\Lenovo\Desktop\fnl\pyth\Promoter Sequence.fasta"

# pattern to search for
iupac = 'GGCA'


# look through each FASTA sequence in the file
for seq_record in SeqIO.parse(infile, "fasta"):
    print("Sequence ID: ", seq_record.id, "; ", len(seq_record), "bp")
    print(seq_record.seq)
    print(iupac)

    # scan for IUPAC; re.I makes search case-insensitive
    matches = re.findall(iupac, str(seq_record.seq), re.I)
    if matches:
        print("Matches = ", len(matches))

I need help completing this script.
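
One possible way to extend the script to several motifs and to report where each one matches is sketched below. It assumes the motifs are written with IUPAC nucleotide codes and are already loaded into a Python list (the format of the pattern file is not given in the question). The names `IUPAC_TO_REGEX`, `iupac_to_regex` and `find_motif`, and the example sequence, are made up for illustration. Plain strings are used instead of Biopython records so that the sketch runs on its own; `str(seq_record.seq)` would take the place of `sequence`.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC_TO_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif such as 'GGYA' into a regex string."""
    return "".join(IUPAC_TO_REGEX[base] for base in motif.upper())


def find_motif(motif, sequence):
    """Return (start, end, matched_text) for each non-overlapping match.

    Positions are 0-based and end-exclusive, as in Python slicing.
    """
    pattern = re.compile(iupac_to_regex(motif), re.IGNORECASE)
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(sequence)]


# Hypothetical inputs, made up for illustration.
motifs = ["GGCA", "GGYA"]
sequence = "ATGGCATTGGTAGGCA"

for motif in motifs:
    hits = find_motif(motif, sequence)
    print(motif, len(hits), "matches:", hits)
```

Having the start and end of every match, rather than only the count from `re.findall`, is what makes highlighting possible later on.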

Python Regex Fasta • 1.6k views

Please use imgbb to upload images. The service you're using doesn't seem to work with biostars.


The Bio.motifs module in Biopython may help you.

3.6 years ago
fishgolden ▴ 510

I would use "re.sub".

# biopython
from Bio import SeqIO
# regex library
import re

# file with FASTA sequence
infile = r"C:\Users\Lenovo\Desktop\fnl\pyth\Promoter Sequence.fasta"

# pattern to search for
iupac = 'GGCA'

print("<html><body>")
for seq_record in SeqIO.parse(infile, "fasta"):
    print ("Sequence ID: ", seq_record.id, "; ", len(seq_record.seq), "bp")
    print("<br>")
    print(iupac)
    print("<br>")
    # \g<0> re-inserts the matched text; subn also returns the number of replacements
    sseq, n_matches = re.subn(iupac, r'<span style="color:red">\g<0></span>', str(seq_record.seq))
    print(str(n_matches)+" matches!")
    print("<br>")
    print(sseq)
    print("<br>")
print("</body></html>")

Please save the output text as an HTML file and open it in a web browser.

If you use

re.sub(iupac, iupac.lower(), str(seq_record.seq))

every match becomes lowercase, which also makes the matches easy to handle.
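
A minimal sketch of this lowercase trick is shown below. It assumes the sequence is uppercased first, so that lowercase letters unambiguously mark a match, and it uses a replacement function so that the matched text itself is lowercased even when the pattern is a regex with character classes. The helper name `lowercase_matches` is made up for illustration.

```python
import re


def lowercase_matches(pattern, sequence):
    """Lowercase every non-overlapping match of pattern in the uppercased sequence."""
    # The replacement function receives each match object and returns
    # the matched text in lowercase.
    return re.sub(pattern, lambda m: m.group().lower(), sequence.upper())


marked = lowercase_matches("GGCA", "ATGGCATTGGCA")
print(marked)  # ATggcaTTggca
```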


This piece of code replaces matches, doesn't count them (which is what OP wants to do). The HTML part is irrelevant to this question. I don't think this qualifies as an answer.


I thought the final goal was to "highlight". The code has been updated.


OK, let's wait for a detailed explanation from OP...


I want to highlight multiple motifs


Define "highlight". Your picture shows a Word document with WordArt, which is done manually. Decide on a format for this "highlight". In fact, give us example output with a motif and a match in a small input sequence. If you can do that, you'll be 80% of the way to the solution yourself.


Please read the guide on how to post and add images properly:

How to add images to a Biostars post

By the way, the image is of no use to anyone: you're showing MS Word designs, which cannot be automated easily.


You want to change the background color of motifs in a sequence?

Replace "color:red" with "background-color:red".

If you want to handle overlapping matches, it will be very difficult, but you can do it with the lower() trick.

Keep looping until "re.sub" makes no further change, then insert a tag at each boundary between uppercase and lowercase letters.
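
As an alternative to the loop described above (a different technique, not the one from this reply), overlapping matches can be handled by first marking every covered position and then wrapping each run of marked positions in a single tag. The sketch below assumes a plain string sequence and a regex pattern; the names `covered_mask` and `highlight` are made up for illustration.

```python
import re
from itertools import groupby


def covered_mask(pattern, sequence):
    """Mark every position that lies inside any match, overlapping ones included."""
    mask = [False] * len(sequence)
    # A zero-width lookahead consumes no characters, so the scan moves on
    # one base at a time and overlapping occurrences are all found.
    # Group 1 holds the text that the motif actually matched.
    for m in re.finditer(f"(?=({pattern}))", sequence):
        for i in range(m.start(1), m.end(1)):
            mask[i] = True
    return mask


def highlight(pattern, sequence, style="background-color:red"):
    """Wrap each run of covered positions in one <span> tag."""
    mask = covered_mask(pattern, sequence)
    parts = []
    pos = 0
    for is_hit, run in groupby(mask):
        length = len(list(run))
        chunk = sequence[pos:pos + length]
        parts.append(f'<span style="{style}">{chunk}</span>' if is_hit else chunk)
        pos += length
    return "".join(parts)


# "GCG" occurs at positions 2-4 and 4-6, so the two hits overlap.
print(highlight("GCG", "ATGCGCGTT"))
```

Because the tags are placed around runs of covered positions, two overlapping hits produce one continuous highlighted stretch instead of nested or broken tags.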

If you want to show different motifs in different colors, the result for each motif must be shown as a separate entry, since a single letter cannot carry multiple colors.
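
The one-entry-per-motif idea could look roughly like the sketch below. The motif-to-color table, the sequence, and the names `MOTIF_COLORS`, `highlight_entry` and `render` are all made up for illustration, and only non-overlapping matches are highlighted here.

```python
import re
from html import escape

# Hypothetical motif-to-color assignment, made up for illustration.
MOTIF_COLORS = {"GGCA": "red", "TTG": "lightblue"}


def highlight_entry(motif, color, sequence):
    """Return (match_count, html) with matches of one motif highlighted."""
    html, count = re.subn(
        motif,
        lambda m: f'<span style="background-color:{color}">{m.group()}</span>',
        sequence,
        flags=re.IGNORECASE,
    )
    return count, html


def render(seq_id, sequence, motif_colors):
    """Build one HTML entry per motif for a single sequence."""
    lines = []
    for motif, color in motif_colors.items():
        count, html = highlight_entry(motif, color, sequence)
        header = f"{escape(seq_id)} | {escape(motif)} | {count} matches"
        lines.append(f"{header}<br>{html}<br>")
    return "\n".join(lines)


print(render("seq1", "ATGGCATTGGCA", MOTIF_COLORS))
```

As with the answer above, the printed text would need to be wrapped in `<html><body>` tags and saved as an HTML file to be viewed in a browser.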
