Question

How to group BLAST results by ID and then check the difference in an attribute like sequence length for every possible pair within a group?

0

Entering edit mode

6.8 years ago

rmartson ▴ 30

This is my aim. I am retrieving a list of BLAST results in FASTA format for a single sequence. What I then want to do is sort any hits that are within a certain distance of each other into another list. Currently all I can do is work out the "start" location for any given hit within its genomic context:

def startfind(complete_handle, hit_handle):
    from Bio import SeqIO
    completes = list(SeqIO.parse(complete_handle, "fasta"))
    hits = list(SeqIO.parse(hit_handle, "fasta"))

    for index, hit_record in enumerate(hits): #This is adapted from another function where I actually use the index for something I just forgot to remove it here
        for complete_record in completes:
            if hit_record.id == complete_record.id:
                start = complete_record.seq.find(hit_record.seq)
                print hit_record.id, start

What I need to be able to do is this. For every single possible pair of hits where hit_record.id is equal, I want to subtract the "start" for one from the other and then check whether the absolute value/distance is within a certain range Something that would be along the lines of:

if (hit_record.id1 == hit_record.id2 and 7000 <  abs(start1-start2) &&  abs(start1-start2) < 8000):
    print [some specific file] >> hit_record1
    print [some specific file] >> hit_record2

But for every possible pair of hits. How can I do this?

EDIT: I just figured it out, but if anyone can think of a more efficient way (other formats of data are welcome too) I'd prefer that.

def fullfind(complete_handle, hit_handle):
    from Bio import SeqIO
    completes = list(SeqIO.parse(complete_handle, "fasta"))
    hits = list(SeqIO.parse(hit_handle, "fasta"))
    flanks = open("flanks.fasta", "w")

    for hit_record1 in hits:
        for hit_record2 in hits:
            for complete_record in completes:
                if hit_record1.id == complete_record.id and hit_record2.id == complete_record.id):
                    start1 = complete_record.seq.find(hit_record1.seq)
                    start2 = complete_record.seq.find(hit_record2.seq)              
                    if (8000 <  start1-start2 and  start1-start2 < 9000):
                        #Don't need absolute values because every possible pair is considered and there is always one pair of hits where the difference is positive 
                        print complete_record.id

fullfind("complete.fasta", "hit.fasta")

Biopython BLAST • 1.7k views

ADD COMMENT • link 6.8 years ago by rmartson ▴ 30

0

Entering edit mode

Personally, I think you're actually over-complicating things using BioPython. Use a tab-delimited blast output, which would be easier to parse. The first column is your query id, and the 7-8 cols are your start and end position of the blast result.

ADD REPLY • link 6.8 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Isn't tab-delimited blast output a format you get by converting it from XML format using Biopython? It's not a default format that NCBI BLAST lets you download. And it doesn't give you the actual sequences, so I'd still need a second file containing the complete sequences to eventually retrieve flanks. In the end I need to access two files.

ADD REPLY • link 6.8 years ago by rmartson ▴ 30

0

Entering edit mode

If you're running BLAST on the command-line you can use -outfmt 6 to give you a tab-delimited output. You are correct you will need both files. Your script would need to read the blast output, I would use a dictionary with the query id as the key, and a tuple value containing the start/end positions. Then, read in your fasta, and process the sequences by sequence header (query id) and blast results.

ADD REPLY • link 6.8 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Thanks, I'll consider rewriting the script to automatically carry out a blast search, then convert the xml output into a tab-delimited and FASTA formats so I don't need to access two files.

ADD REPLY • link 6.8 years ago by rmartson ▴ 30