Question: Python script for text matching
6 weeks ago, anasjamshed1994 (20) wrote:

I have a FASTA file with 30 sequences and a pattern file that contains different motifs.

I need to match my motifs against the FASTA sequences, and the locations where they match should be highlighted like this:

enter image description here

The script I have developed so far:

# biopython
from Bio import SeqIO
# regex library
import re

# file with FASTA sequence
infile = r"C:\Users\Lenovo\Desktop\fnl\pyth\Promoter Sequence.fasta"

# pattern to search for
iupac = 'GGCA'


# look through each FASTA sequence in the file
for seq_record in SeqIO.parse(infile, "fasta"):
    print ("Sequence ID: ", seq_record.id, "; ", len(seq_record), "bp")
    print (seq_record.seq)
    print(iupac)


    # scan for IUPAC; re.I makes search case-insensitive
    matches = re.findall( iupac, str(seq_record.seq), re.I)
    if matches:
        print ("Matches = ", len(matches))

I need help completing this script.

regex python fasta • 179 views
modified 6 weeks ago by fishgolden (450) • written 6 weeks ago by anasjamshed1994 (20)
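Since the question asks where each motif matches, not only how many times, here is a minimal sketch of one way to report match positions for several motifs. It assumes the motifs use IUPAC nucleotide codes (the variable name `iupac` in the script suggests this) and that they have already been read into a list, for example one motif per line of the pattern file. The names `IUPAC_TO_REGEX`, `iupac_to_regex`, and `find_motifs` and the toy sequence are hypothetical, not from the original post.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC_TO_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif such as 'GGCW' into a regex string."""
    return "".join(IUPAC_TO_REGEX[base] for base in motif.upper())


def find_motifs(sequence, motifs):
    """Return {motif: [(start, end), ...]}; positions are 0-based, end-exclusive."""
    hits = {}
    for motif in motifs:
        # re.I keeps the search case-insensitive, as in the original script
        pattern = re.compile(iupac_to_regex(motif), re.I)
        hits[motif] = [(m.start(), m.end()) for m in pattern.finditer(sequence)]
    return hits


# Hypothetical toy data, not taken from the original post.
sequence = "ATGGCATTGGCTAGGCA"
print(find_motifs(sequence, ["GGCA", "GGCW"]))
# {'GGCA': [(2, 6), (13, 17)], 'GGCW': [(2, 6), (8, 12), (13, 17)]}
```

In the original script, `sequence` would be `str(seq_record.seq)` inside the `SeqIO.parse` loop. Note that `finditer` reports non-overlapping matches only.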

Please use imgbb to upload images. The service you're using doesn't seem to work with biostars.

written 6 weeks ago by RamRS (30k)

The Bio.motifs module in Biopython may help you.

modified 6 weeks ago • written 6 weeks ago by cpad0112 (14k)
6 weeks ago, fishgolden (450) wrote:

I would use "re.sub".

# biopython
from Bio import SeqIO
# regex library
import re

# file with FASTA sequence
infile = r"C:\Users\Lenovo\Desktop\fnl\pyth\Promoter Sequence.fasta"

# pattern to search for
iupac = 'GGCA'

print("<html><body>")
for seq_record in SeqIO.parse(infile, "fasta"):
    print ("Sequence ID: ", seq_record.id, "; ", len(seq_record.seq), "bp")
    print("<br>")
    print(iupac)
    print("<br>")
    # \g<0> re-inserts the matched text; subn also returns the number of replacements
    sseq, n_matches = re.subn(iupac, r'<span style="color:red">\g<0></span>', str(seq_record.seq))
    print(str(n_matches)+" matches!")
    print("<br>")
    print(sseq)
    print("<br>")
print("</body></html>")

Please save the output text as an HTML file and open it in a web browser.

If you use

re.sub(iupac, iupac.lower(), str(seq_record.seq))

every match becomes lowercase, which is also easy to handle.

modified 6 weeks ago • written 6 weeks ago by fishgolden (450)
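The lowercase trick from the answer above can be sketched on a plain string, without Biopython. This version passes a function to `re.sub` so that the text actually matched is lowercased, which still works when the pattern is a regex and not a literal motif. The function name `lowercase_matches` and the example sequence are hypothetical, and the sketch assumes the motif is written in uppercase.

```python
import re


def lowercase_matches(sequence, motif):
    """Return the sequence in uppercase with every non-overlapping match lowercased."""
    upper = sequence.upper()  # normalise first so lowercase marks matches only
    return re.sub(motif, lambda m: m.group(0).lower(), upper)


# Hypothetical toy sequence, not taken from the original post.
print(lowercase_matches("ATGGCATTGGCA", "GGCA"))
# ATggcaTTggca
```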

This piece of code replaces matches, doesn't count them (which is what OP wants to do). The HTML part is irrelevant to this question. I don't think this qualifies as an answer.

written 6 weeks ago by RamRS (30k)

I thought the final goal was to "highlight" the matches. I have updated the code.

written 6 weeks ago by fishgolden (450)

OP's requirement is hidden behind an image that OP did not upload properly: https://scontent.fkhi8-1.fna.fbcdn.net/v/t1.15752-9/119473604_319735005978317_174427337883479388_n.jpg?_nc_cat=100&_nc_sid=b96e70&_nc_eui2=AeE4pNKqKCcYPQwq97EctFrzPu6JEvsjN9I-7okS-yM30uKSg1hSOSUlKNs9yjLxt1-TBxtCAr0c8EsrsrpgeWD2&_nc_ohc=0t3rK-Jz5oEAX_5z02m&_nc_oc=AQlkfPJk6pSnm0kV7cw5gbGPqWOzVagqm9fZZlCygp1OPQMbWPazLi8HaCkS0KQpPMM&_nc_ht=scontent.fkhi8-1.fna&oh=c054daef61c907900829cbfcc0e101a8&oe=5F85B29C

It's a photo of a laptop displaying a Word document, so who knows what OP means by "highlight".

written 6 weeks ago by RamRS (30k)

OK, let's wait for a detailed explanation from OP...

written 6 weeks ago by fishgolden (450)

I want to highlight multiple motifs

written 6 weeks ago by anasjamshed1994 (20)

Define "highlight". Your picture shows a Word document with WordArt, which is done manually. Decide on a format for this "highlight". In fact, give us example output with a motif and a match in a small input sequence. If you can do that, you'll be 80% of the way to the solution yourself.

written 6 weeks ago by RamRS (30k)

https://ibb.co/r7Yc7Y8

written 6 weeks ago by anasjamshed1994 (20)

Please read the how-to post and add images properly.

How to add images to a Biostars post

The image is of no use to anyone, by the way. You're showing MS Word designs, which cannot be automated easily.

written 6 weeks ago by RamRS (30k)

You want to change the background color of motifs in a sequence?

Replace "color:red" with "background-color:red".

If you want to handle overlapping matches, it will be very difficult, but you can do it with the lower() trick.

Repeat the loop until "re.sub" makes no further change, then insert a tag at each boundary between uppercase and lowercase letters.

If you want to show different motifs in different colors, each motif's result must be shown as a separate entry, since you cannot set multiple colors on a single letter.

written 6 weeks ago by fishgolden (450)
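As an alternative to the repeated lower() loop described above, here is a minimal sketch that handles overlapping matches with a zero-width lookahead, which lets `re.finditer` report matches that share bases. It marks every base covered by at least one match and then emits one background-colored span per covered run, printing each motif as its own entry. The function names, the toy sequence, the motifs, and the colors are all hypothetical.

```python
import re
from html import escape


def overlapping_spans(sequence, motif):
    """Find all matches, including overlapping ones, via a zero-width lookahead."""
    pattern = re.compile("(?=(" + motif + "))", re.I)
    return [(m.start(1), m.end(1)) for m in pattern.finditer(sequence)]


def highlight(sequence, motif, color):
    """Wrap every base covered by at least one match in a colored <span>."""
    covered = [False] * len(sequence)
    for start, end in overlapping_spans(sequence, motif):
        for i in range(start, end):
            covered[i] = True
    parts = []
    inside = False  # True while we are inside an open <span>
    for base, is_hit in zip(sequence, covered):
        if is_hit and not inside:
            parts.append('<span style="background-color:%s">' % color)
        elif not is_hit and inside:
            parts.append("</span>")
        parts.append(escape(base))
        inside = is_hit
    if inside:
        parts.append("</span>")
    return "".join(parts)


# Hypothetical motifs and colors; one output entry per motif.
sequence = "TTGAGAGATT"
for motif, color in [("GAGA", "yellow"), ("ATT", "lightblue")]:
    print(motif + ": " + highlight(sequence, motif, color) + "<br>")
```

Here "GAGA" matches at positions 2-6 and 4-8, so the two overlapping hits are merged into a single highlighted run "GAGAGA".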