I want to get top 10 sequences of BLAST results (just the sequences, no alignment or score or e-value etc). I am inputting a text file containing 5 fasta file. So my output should be top 10 blast hits of each fasta file.. therefore my output file will have 50 sequences.
I am reading each of my input fasta file through Bio.SeqIO, writing it as temp.faa and then passing it to command line BLAST through subprocess as
blastp -db nr -query temp.faa -out out.faa -evalue 0.001 -gapopen 11 -gapextend 1 -matrix BLOSUM62 -remote -outfmt 2
the output has lots of other information. Should I parse this output now or there's a better way.
P.S XML might be the way, but I didn't find a relavant NCBIXML parser syntax
An alternative to filtering the top 10 hits in a script after running BLAST is to ask BLAST to do this for you (using argument -maxtargetseqs). This way the output file will be smaller and the BLAST job should be faster to run.
Also, if you don't like XML and really just want the sequence, specify tabular output with just the matched sequence (field name sseq).
This is a good guide in doing what you want.
You may want to parse like this for getting the sequence.
from Bio.Blast import NCBIXM blast_records = NCBIXML.parse(blast_handle) save_file = open("/path/to/file/my_fasta_seq.fasta", 'w') for i,blast_record in enumerate(blast_records): if i==10: break for alignment in blast_record.alignments: for hsp in alignment.hsps: save_file.write('>%s\n' % (hsp.query)) save_file.close()
Update: fixed some errors