How to group BLAST results by ID and then check the difference in an attribute like sequence length for every possible pair within a group?
0
0
Entering edit mode
6.8 years ago
rmartson ▴ 30

This is my aim. I am retrieving a list of BLAST results in FASTA format for a single sequence. What I then want to do is sort any hits that are within a certain distance of each other into another list. Currently all I can do is work out the "start" location for any given hit within its genomic context:

def startfind(complete_handle, hit_handle):
    from Bio import SeqIO
    completes = list(SeqIO.parse(complete_handle, "fasta"))
    hits = list(SeqIO.parse(hit_handle, "fasta"))

    for index, hit_record in enumerate(hits): #This is adapted from another function where I actually use the index for something I just forgot to remove it here
        for complete_record in completes:
            if hit_record.id == complete_record.id:
                start = complete_record.seq.find(hit_record.seq)
                print hit_record.id, start

What I need to be able to do is this. For every single possible pair of hits where hit_record.id is equal, I want to subtract the "start" for one from the other and then check whether the absolute value/distance is within a certain range Something that would be along the lines of:

if (hit_record.id1 == hit_record.id2 and 7000 <  abs(start1-start2) &&  abs(start1-start2) < 8000):
    print [some specific file] >> hit_record1
    print [some specific file] >> hit_record2

But for every possible pair of hits. How can I do this?

EDIT: I just figured it out, but if anyone can think of a more efficient way (other formats of data are welcome too) I'd prefer that.

def fullfind(complete_handle, hit_handle):
    from Bio import SeqIO
    completes = list(SeqIO.parse(complete_handle, "fasta"))
    hits = list(SeqIO.parse(hit_handle, "fasta"))
    flanks = open("flanks.fasta", "w")

    for hit_record1 in hits:
        for hit_record2 in hits:
            for complete_record in completes:
                if hit_record1.id == complete_record.id and hit_record2.id == complete_record.id):
                    start1 = complete_record.seq.find(hit_record1.seq)
                    start2 = complete_record.seq.find(hit_record2.seq)              
                    if (8000 <  start1-start2 and  start1-start2 < 9000):
                        #Don't need absolute values because every possible pair is considered and there is always one pair of hits where the difference is positive 
                        print complete_record.id

fullfind("complete.fasta", "hit.fasta")

Biopython BLAST • 1.7k views
ADD COMMENT
0
Entering edit mode

Personally, I think you're actually over-complicating things using BioPython. Use a tab-delimited blast output, which would be easier to parse. The first column is your query id, and the 7-8 cols are your start and end position of the blast result.

ADD REPLY
0
Entering edit mode

Isn't tab-delimited blast output a format you get by converting it from XML format using Biopython? It's not a default format that NCBI BLAST lets you download. And it doesn't give you the actual sequences, so I'd still need a second file containing the complete sequences to eventually retrieve flanks. In the end I need to access two files.

ADD REPLY
0
Entering edit mode

If you're running BLAST on the command-line you can use -outfmt 6 to give you a tab-delimited output. You are correct you will need both files. Your script would need to read the blast output, I would use a dictionary with the query id as the key, and a tuple value containing the start/end positions. Then, read in your fasta, and process the sequences by sequence header (query id) and blast results.

ADD REPLY
0
Entering edit mode

Thanks, I'll consider rewriting the script to automatically carry out a blast search, then convert the xml output into a tab-delimited and FASTA formats so I don't need to access two files.

ADD REPLY

Login before adding your answer.

Traffic: 3070 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6