Question

Strategizing Blast Result Parsing

0

Entering edit mode

12.0 years ago

Zach Powers ▴ 340

I have a set of ~200 known sequences of a particular domain that I want to use to fish out similar sequences from NCBI-NT.As you can imagine, my blast results for this dataset are redundant in the sense that multiple sequences from my start set will pick up some of the same sequences. Furthermore, as I was intending upon using the GI or Accession number as the Fasta Name there is the issue that multiple domains may be found for a single GI or Accession so simply reducing by GI number will not work. Therefore, I want to develop a parsing strategy that will return a fasta file with all the significant sequences labelled according to 1) Accession Number, 2) Start and 3) End sites where there are no redundant sequences and where,I keep the longest alignment in the case that a sequence is returned on multiple blast hits.

In strategizing the best approach I can think of a few ways to achieve this results:

Iterate through the Blast XML and make a list/dict of accession numbers with start/end locations. Use the list to find the longest sequence for each domain in an Accession-number-list. Use the new list to reiterate through the blast output for sequences to keep.(what I was planning on doing)
write out a fasta for each accession number, align the sequences and keep only the longest sequence for each domain. then recombine.(also not too difficult)

Okay.. so maybe there are not so many ways when i try to write them down. In any event I am wondering how some of you would approach this problem and if I am overlooking a more obvious way (split/apply/recombine tables? ). I am using BioPython for parsing XML output - but a command line approach would be great as well.

thanks, zach cp

edit: I ended up using the following function which looks at the Accession number and the the alignment start positions. I keep only sequences that have a a unique Accession Number or whose starting alignment is at least 500 base pairs from other alignments. Although I do not get the full length sequence the representative sequences are larely the same (+/1 only a few base pairs on each end)

def write_unique_fasta(xmlfile, length = 200):
    startdict = defaultdict(list)
    blast_records = NCBIXML.parse(open(xmlfile, "r"))
    output_handle = open('{}.fasta'.format(str(xmlfile).split('.')[0]), 'w')
    convert  = 0
    total    = 0
    for blast_record in blast_records:    
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                if len(hsp.sbjct) > length: 
                    total += 1
                    title = ((alignment.title).split('|')[3]).strip()
                    #check to see if Accession number exists in dictionary.if so only write a record if its start site is >1kb
                    if not title in startdict.keys():
                        print "New Accession: {}   Add It".format(title)
                        output_handle.write('> {} {} {} \n'.format((alignment.title).split('|')[3], hsp.sbjct_start, hsp.sbjct_end))
                        output_handle.write('{} \n'.format(hsp.sbjct))
                        startdict[title].append(hsp.sbjct_start)
                        convert += 1
                    else:
                        print "Accession {} Found. Scan for Similar Sequence....".format(title)
                        counter = 0
                        for start in list(set(startdict[title])):
                            if  (start - 500 < hsp.sbjct_start < start + 500):
                                #print counter
                                counter += 1
                        if counter == 0: 
                            output_handle.write('> {} {} {} \n'.format((alignment.title).split('|')[3], hsp.sbjct_start, hsp.sbjct_end))
                            output_handle.write('{} \n'.format(hsp.sbjct))
                            startdict[title].append(hsp.sbjct_start)
                            convert += 1

    output_handle.close()
    blast_records.close()
    print "Converted   = {}".format(convert)
    print "Total = {}".format(total)

biopython blast xml • 5.3k views

ADD COMMENT • link updated 10.8 years ago by Biostar 20 • written 12.0 years ago by Zach Powers ▴ 340

0

Entering edit mode

I might be remembering wrong, but I thought that splice variants had their own unique GI numbers and accession codes. Also, it seems to me (but I might be wrong) that you are misinterpreting HSP's. If you're getting the same accession numbers with different alignment results they could be different HSP's found on the same subject, in which case you shouldn't take the longest alignment but the whole subject sequence.

ADD REPLY • link 12.0 years ago by Niek De Klein ★ 2.6k

0

Entering edit mode

Thanks Niek. Indeed I can have multiple HSP's per sequence. I was going to distinguish them based on their start location. However for a given HSP from as given Accession numbers the alignment site might vary a little based on the query sequence. By non-redundant I meant each unique HSP. And by longest I meant getting the longest version of the HSP as seen by any of my query sequences.

ADD REPLY • link 12.0 years ago by Zach Powers ▴ 340

0

Entering edit mode

Why do you want to distinguish between the HSP's?

ADD REPLY • link 12.0 years ago by Niek De Klein ★ 2.6k

0

Entering edit mode

the more general result I want is to obtain is all unique sequences in NCBI-NT that are related to my start set. I was planning on iteratively BLASTING until I stop turning up new sequences and therefore wanted to keep track of HSPs so I can keep track each round. In order to do this I was keeping track of HSPs by their accession number and their location in that sequence (start/end). If you have a better method, I am all ears.

ADD REPLY • link 12.0 years ago by Zach Powers ▴ 340

0

Entering edit mode

If you just want to blast until you don't get new sequences you only have to save the accession number, no need to know the start position of the HSP.

ADD REPLY • link 12.0 years ago by Niek De Klein ★ 2.6k

Ram · Answer 1 · 2012-05-02

0

Entering edit mode

12.0 years ago

Cabbagesofdoom ▴ 110

Do you have redundancy in both query and search sets of sequences? And are you trying to remove redundancy from both from the output? You want to make sure that you are also handling partially overlapping hits where one might be longer at the start and the other at the end - presumably you would want the full coverage in this case, not just the longest alignment? (I actually have a script for doing this already, which you are welcome to try out, (GABLAM). I can't quite tell if it's what you need but it will summarise BLAST hits and generate hit sequences that are non-redundant with respect to the queries. I have another script that can then get rid of redundancy in the hits themselves.)

ADD COMMENT • link updated 3.5 years ago by Ram 43k • written 12.0 years ago by Cabbagesofdoom ▴ 110

0

Entering edit mode

thanks CabbagesofDoom. What is your approach to get rid of redundancy in the hits themselves? Alignment of the sequences? Comparison of the start/end sites of the hit?

ADD REPLY • link 12.0 years ago by Zach Powers ▴ 340

0

Entering edit mode

I just compare the start and end positions but (in GABLAM) I only get rid of hit redundancy in the sense of multiple queries hitting the very same entry in the search database. At this stage, I do not remove redundancy in terms of multiple entries in the search database that have the same sequence at the location(s) hit by queries. For this, I use direct sequence comparison (for exact matches) or a second round of BLAST (for a threshold < 100%).

ADD REPLY • link 12.0 years ago by Cabbagesofdoom ▴ 110