Question

How to use for loops to compare each DNA sequence in one file to each sequence in another file?

0

Entering edit mode

6.8 years ago

rmartson ▴ 30

So here is my aim. I've carried out a BLAST search and downloaded the results in two formats: Complete FASTA sequences and aligned FASTA sequences. In the aligned FASTA file, I want to retrieve the ID and sequence for each hit (there are 32 hits). Then I want to take each hit and compare them to the complete sequences (there are 20 genomic regions the hits were found in) IF the hit ID and complete sequence ID match i.e. the aligned sequence belongs to that complete sequence, then I find the start of the hit sequence within the complete sequence and use that to retrieve the sequence of a flanking sequence upstream of the hit.

I can do this manually for one hit which I know matches a particular complete sequence like so:

from Bio import SeqIO
completelist = list(SeqIO.parse("complete.fasta", "fasta"))
hitlist = list(SeqIO.parse("hits.fasta", "fasta"))
start = completelist[0].seq.find(hitlist[0].seq) 
print(completelist[0].seq[start-100:start]) #This returns the 100bp upstream of the hit

I skipped over a bunch of basics to learn Biopython, so this might look stupid but to do this automatically for every hit I've tried:

def flankfind(complete_handle, hit_handle):
    from Bio import SeqIO
    completes = SeqIO.parse(complete_handle, "fasta")
    hits = SeqIO.parse(hit_handle, "fasta")

    for hit_record in hits:
        for complete_record in completes:
            if hit_record.id == complete_record.id:
                start = complete_record.seq.find(hit_record.seq)
                print(complete_record.seq[start-100:start])

flankfind("complete.fasta", "hits.fasta")

EDIT1:

I noticed and fixed one problem in my script. I didn't have ".seq" in the "complete_record.seq.find(hit_record.seq) part. Now the script actually returns the correct upstream 100bp sequence for the very first hit, but stops there.

I have implemented the completes.seek(0) solution, but this returns the error:

Traceback (most recent call last):
  File "<pyshell#147>", line 1, in <module>
    flankfind("complete.fasta", "hits.fasta")
  File "<pyshell#146>", line 11, in flankfind
    completes.seek(0)
AttributeError: 'generator' object has no attribute 'seek'

EDIT2:

Okay, the issue is fixed. I just had to convert "hits" and "completes" into lists based on the solution offered here https://stackoverflow.com/questions/1271320/reseting-generator-object-in-python

My current code is this: def flankfind(complete_handle, hit_handle): from Bio import SeqIO completes = list(SeqIO.parse(complete_handle, "fasta")) hits = list(SeqIO.parse(hit_handle, "fasta"))

    for index, hit_record in enumerate(hits):
        for complete_record in completes:
            if hit_record.id == complete_record.id:
                start = complete_record.seq.find(hit_record.seq)
                print '>', hit_record.id, index
                print complete_record.seq[start-100:start]

flankfind("complete.fasta", "hits.fasta")

This returns the 100bp sequences upstream of the 32 hits with their indices and IDs though the formatting for the print section can be improved.

Biopython BLAST • 3.1k views

ADD COMMENT • link 6.8 years ago by rmartson ▴ 30

score 1 · Answer 1 · 2017-07-05

1

Entering edit mode

6.8 years ago

lakhujanivijay 5.8k

I didn't try your code, but from a first look I can point out that you should set the file reading offset for completes to zero for each iteration of hits. in simple terms, for each iteration of hits, you should force completes to start reading from the beginning of the file. Once the file is read, it reaches the end and hence it is required to set the reader to zero.

You can use the Python seek method. I am changing the code below (see # tag).

def flankfind(complete_handle, hit_handle):
    from Bio import SeqIO
    completes = SeqIO.parse(complete_handle, "fasta")
    hits = SeqIO.parse(hit_handle, "fasta")

    for hit_record in hits:
        return hit_record.id
        return hit_record.seq
        for complete_record in completes:
            if hit_record.id == complete_record.id:
                start = complete_record.find(hit_record.seq)
                print(complete_record[start-100:start])
        #making the change here
        completes.seek(0)
flankfind("complete.fasta", "hits.fasta")

Let me know if it worked.

ADD COMMENT • link 6.8 years ago by lakhujanivijay 5.8k

0

Entering edit mode

I'll be back in 30min or so but I tried this and it returned a single ID for the first FASTA sequence.

flankfind("complete.fasta", "hits.fasta")

'AC226770.1'

Thanks for the help

ADD REPLY • link 6.8 years ago by rmartson ▴ 30

0

Entering edit mode

After fixing an issue in my original script I used completes.seek(0) again and got this:

Traceback (most recent call last):
  File "<pyshell#147>", line 1, in <module>
    flankfind("complete.fasta", "hits.fasta")
  File "<pyshell#146>", line 11, in flankfind
    completes.seek(0)
AttributeError: 'generator' object has no attribute 'seek'

I don't really understand the issue. You can use completes.next() to read through each SeqRecord item just fine. I only read up on seek() a little but doesn't that mean it should be able to set it back to the start of the file?

ADD REPLY • link 6.8 years ago by rmartson ▴ 30

0

Entering edit mode

I only read up on seek() a little but doesn't that mean it should be able to set it back to the start of the file?

That is what it is suppose to do.

ADD REPLY • link 6.8 years ago by lakhujanivijay 5.8k

0

Entering edit mode

But SeqIO.parse returns a generator.

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k