Question: Getting Top 10 Sequences Of Blast Results Bio Python
3
gravatar for Ankur
2.8 years ago by
Ankur40
Ankur40 wrote:

Hi,

I want to get top 10 sequences of BLAST results (just the sequences, no alignment or score or e-value etc). I am inputting a text file containing 5 fasta file. So my output should be top 10 blast hits of each fasta file.. therefore my output file will have 50 sequences.

I am reading each of my input fasta file through Bio.SeqIO, writing it as temp.faa and then passing it to command line BLAST through subprocess as

blastp -db nr -query temp.faa -out out.faa -evalue 0.001 -gapopen 11 -gapextend 1 -matrix BLOSUM62 -remote -outfmt 2

the output has lots of other information. Should I parse this output now or there's a better way.

Thanks

P.S XML might be the way, but I didn't find a relavant NCBIXML parser syntax

ADD COMMENTlink modified 2.8 years ago by Peter3.8k • written 2.8 years ago by Ankur40
6
gravatar for Peter
2.8 years ago by
Peter3.8k
Scotland, UK
Peter3.8k wrote:

An alternative to filtering the top 10 hits in a script after running BLAST is to ask BLAST to do this for you (using argument -maxtargetseqs). This way the output file will be smaller and the BLAST job should be faster to run.

Also, if you don't like XML and really just want the sequence, specify tabular output with just the matched sequence (field name sseq).

ADD COMMENTlink written 2.8 years ago by Peter3.8k

Makes for much easier/quicker parsing as well in order to grab the full sequences from the relevant sequence file.

ADD REPLYlink written 2.8 years ago by Dan Gaston3.0k
3
gravatar for Hranjeev
2.8 years ago by
Hranjeev1.2k
Singapore
Hranjeev1.2k wrote:

This is a good guide in doing what you want.

You may want to parse like this for getting the sequence.


    from Bio.Blast import NCBIXM

    blast_records = NCBIXML.parse(blast_handle)

    save_file = open("/path/to/file/my_fasta_seq.fasta", 'w')

    for i,blast_record in enumerate(blast_records):
      if i==10: break
      for alignment in blast_record.alignments:
          for hsp in alignment.hsps:
                save_file.write('>%s\n' % (hsp.query))
    save_file.close()

Update: fixed some errors

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Hranjeev1.2k

fixed an error; should work now

ADD REPLYlink written 2.8 years ago by Hranjeev1.2k

@HRanjeev for a multi-fasta file wont this simply loop through the first 10 10 records, taking ALL of the hits for each record?

I think you want the enumerate/break in the alignment part.

ADD REPLYlink written 2.3 years ago by Zach Powers270
Please log in to add an answer.

Help
Access
  • RSS
  • Stats
  • API

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.0.0
Traffic: 687 users visited in the last hour