Question

Python script for text matching

3.6 years ago

anasjamshed ▴ 120

I have one FASTA file containing 30 sequences and a pattern file that contains different motifs.

I need to match my motifs against the FASTA sequences, and each location where a motif matches should be highlighted, like this:

[image not displayed]

The script I have developed so far:

# biopython
from Bio import SeqIO
# regex library
import re

# file with FASTA sequence
infile = r"C:\Users\Lenovo\Desktop\fnl\pyth\Promoter Sequence.fasta"

# pattern to search for
iupac = 'GGCA'


# look through each FASTA sequence in the file
for seq_record in SeqIO.parse(infile, "fasta"):
    print ("Sequence ID: ", seq_record.id, "; ", len(seq_record), "bp")
    print (seq_record.seq)
    print(iupac)


    # scan for IUPAC; re.I makes search case-insensitive
    matches = re.findall( iupac, str(seq_record.seq), re.I)
    if matches:
        print ("Matches = ", len(matches))

I need help completing this script.

Python Regex Fasta • 1.6k views

ADD COMMENT • link updated 3.6 years ago by fishgolden ▴ 510 • written 3.6 years ago by anasjamshed ▴ 120

Please use imgbb to upload images. The service you're using doesn't seem to work with biostars.

ADD REPLY • link 3.6 years ago by Ram 43k

The Bio.motifs module in the Biopython package may help you.

ADD REPLY • link 3.6 years ago by cpad0112 21k

score 1 · Answer 1 · 2020-09-13

3.6 years ago

fishgolden ▴ 510

I would use "re.sub".

# biopython
from Bio import SeqIO
# regex library
import re

# file with FASTA sequence
infile = r"C:\Users\Lenovo\Desktop\fnl\pyth\Promoter Sequence.fasta"

# pattern to search for
iupac = 'GGCA'

print("<html><body>")
for seq_record in SeqIO.parse(infile, "fasta"):
    print ("Sequence ID: ", seq_record.id, "; ", len(seq_record.seq), "bp")
    print("<br>")
    print(iupac)
    print("<br>")
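    # re.subn works like re.sub but also returns the number of replacements;
    # \g<0> re-inserts the matched text, so regex motifs keep their actual bases
    sseq, n_matches = re.subn(iupac, r'<span style="color:red">\g<0></span>', str(seq_record.seq))
    print(str(n_matches) + " matches!")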
    print("<br>")
    print(sseq)
    print("<br>")
print("</body></html>")

Save the output as an HTML file and open it in a web browser.

If you use

re.sub(iupac, iupac.lower(), str(seq_record.seq))

every match becomes lowercase, which is also easy to handle.
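
As a minimal sketch of the lowercase trick on a plain string (the example sequence and the helper name mark_lowercase are made up for illustration and are not part of the original script):

```python
import re

def mark_lowercase(sequence, motif):
    """Return the sequence with every non-overlapping motif match lowercased."""
    # A function replacement lowercases whatever text was actually matched,
    # so this also works when the motif is a regex rather than a literal.
    return re.sub(motif, lambda m: m.group(0).lower(), sequence)

example = "ATGGCATTGGCAGG"  # hypothetical sequence
marked = mark_lowercase(example, "GGCA")
print(marked)
```

With a Biopython record, the sequence would first need to be converted with str(), because "re.sub" expects a string.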

ADD COMMENT • link 3.6 years ago by fishgolden ▴ 510

This piece of code replaces matches, doesn't count them (which is what OP wants to do). The HTML part is irrelevant to this question. I don't think this qualifies as an answer.

ADD REPLY • link 3.6 years ago by Ram 43k

I thought the final goal was to "highlight" the matches. I have updated the code.

ADD REPLY • link 3.6 years ago by fishgolden ▴ 510

OP's requirement is hidden behind an image that OP did not upload properly: https://scontent.fkhi8-1.fna.fbcdn.net/v/t1.15752-9/119473604_319735005978317_174427337883479388_n.jpg?_nc_cat=100&_nc_sid=b96e70&_nc_eui2=AeE4pNKqKCcYPQwq97EctFrzPu6JEvsjN9I-7okS-yM30uKSg1hSOSUlKNs9yjLxt1-TBxtCAr0c8EsrsrpgeWD2&_nc_ohc=0t3rK-Jz5oEAX_5z02m&_nc_oc=AQlkfPJk6pSnm0kV7cw5gbGPqWOzVagqm9fZZlCygp1OPQMbWPazLi8HaCkS0KQpPMM&_nc_ht=scontent.fkhi8-1.fna&oh=c054daef61c907900829cbfcc0e101a8&oe=5F85B29C

It's a photo of a laptop displaying a Word document, so who knows what OP means by "highlight".

ADD REPLY • link 3.6 years ago by Ram 43k

OK, let's wait for a detailed explanation from OP...

ADD REPLY • link 3.6 years ago by fishgolden ▴ 510

I want to highlight multiple motifs.

ADD REPLY • link 3.6 years ago by anasjamshed ▴ 120

Define "highlight". Your picture shows a Word document with WordArt, which is done manually. Determine a format for this "highlight". In fact, give us example output with a motif and a match in a small input sequence. If you can do that, you'll be 80% of the way to the solution yourself.

ADD REPLY • link 3.6 years ago by Ram 43k

https://ibb.co/r7Yc7Y8

ADD REPLY • link 3.6 years ago by anasjamshed ▴ 120

Please read the how-to post and add images properly.

How to add images to a Biostars post

By the way, the image is of no use to anyone: you're showing MS Word designs, which cannot be automated easily.

ADD REPLY • link 3.6 years ago by Ram 43k

You want to change the background color of motifs in a sequence?

Replace "color:red" with "background-color:red".

If you want to handle overlapping matches, it will be very difficult, but you can do it with the lower() trick.

Keep looping until "re.sub" makes no further change, then insert a tag at each boundary between uppercase and lowercase letters.
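
The overlap handling can be sketched roughly as follows. Note that this sketch swaps the repeated "re.sub" loop for a zero-width lookahead, which reports overlapping matches directly, and then opens or closes a tag wherever the covered/uncovered state changes. The function name highlight_overlapping and the example sequence are assumptions for illustration only.

```python
import re

def highlight_overlapping(sequence, motif, style="background-color:red"):
    """Wrap every position covered by a (possibly overlapping) match in a <span>."""
    covered = [False] * len(sequence)
    # A zero-width lookahead lets finditer report matches that overlap.
    for m in re.finditer("(?=(" + motif + "))", sequence, re.I):
        for i in range(m.start(1), m.end(1)):
            covered[i] = True

    parts = []
    inside = False
    for base, hit in zip(sequence, covered):
        if hit and not inside:
            parts.append('<span style="' + style + '">')  # entering a covered run
        elif not hit and inside:
            parts.append("</span>")  # leaving a covered run
        inside = hit
        parts.append(base)
    if inside:
        parts.append("</span>")  # close a run that reaches the end
    return "".join(parts)

# Hypothetical sequence in which "GGG" matches at two overlapping positions.
print(highlight_overlapping("AGGGGA", "GGG"))
```

Overlapping matches end up merged into a single highlighted run, so the number of spans is not the number of matches.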

If you want to show different motifs in different colors, the result for each motif must be shown as a separate entry, since you cannot set multiple colors on a single letter.
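
A minimal sketch of the one-entry-per-motif idea, assuming a hand-written motif-to-color mapping (the mapping, the function name highlight_entries, and the example sequence are hypothetical; the format of OP's pattern file is not shown in this thread):

```python
import re

# Hypothetical motif-to-color mapping; in practice the motifs would come
# from the pattern file.
MOTIF_COLORS = {"GGCA": "red", "TTG": "yellow"}

def highlight_entries(sequence, motif_colors):
    """Return one highlighted HTML line per motif."""
    entries = []
    for motif, color in motif_colors.items():
        # \g<0> re-inserts the text that was actually matched.
        replacement = '<span style="background-color:' + color + '">\\g<0></span>'
        marked, count = re.subn(motif, replacement, sequence, flags=re.I)
        entries.append(motif + " (" + str(count) + " matches): " + marked)
    return entries

for line in highlight_entries("ATGGCATTGGCAGG", MOTIF_COLORS):
    print(line + "<br>")
```

Each motif gets its own copy of the sequence, so the colors never compete for the same letter. The counts here are for non-overlapping matches only.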

ADD REPLY • link 3.6 years ago by fishgolden ▴ 510