Question

Standalone BLAST parameters

6.6 years ago

cvu ▴ 180

Hi All,

I've predicted genes in a genome. Now I want to identify the proteins, so I've blasted all predicted proteins against the UniProt database.

blast parameters

blastp -query proteins.fasta -db Uniprotdb -max_target_seqs 1 -max_hsps 1 -out output.blastp -outfmt 6 -evalue 0.001

What should my blastp parameters be to get only significant matches?

Thank you in advance!!

genome gene blast • 1.8k views

ADD COMMENT • link updated 6.6 years ago by Matteo Schiavinato ★ 3.6k • written 6.6 years ago by cvu ▴ 180

What counts as a significant match for you? If you expect to always find nearly perfect matches in UniProt, then restrictive parameters would work, but if you also expect imperfect matches you need parameters that accommodate them. It may be easier to let blast report more hits, then filter these with your favourite scripting language.
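As an illustration of the "report more hits, then filter" approach, here is a minimal Python sketch. It assumes the default 12-column layout of `-outfmt 6`. The function names and the cutoff values (`max_evalue`, `min_pident`, `min_length`) are placeholders chosen for the example, not recommended settings.

```python
import csv

# Default column order of blastp `-outfmt 6` (tab-separated).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(lines):
    """Yield one dict per tabular BLAST line, with numeric fields converted."""
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < len(COLUMNS) or row[0].startswith("#"):
            continue  # skip blank, truncated, or comment lines
        hit = dict(zip(COLUMNS, row))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        for key in ("length", "mismatch", "gapopen",
                    "qstart", "qend", "sstart", "send"):
            hit[key] = int(hit[key])
        yield hit


def best_hits(lines, max_evalue=1e-5, min_pident=30.0, min_length=50):
    """Keep, per query, the highest-bitscore hit that passes the cutoffs.

    The default cutoffs are illustrative assumptions only.
    """
    best = {}
    for hit in parse_hits(lines):
        if (hit["evalue"] > max_evalue
                or hit["pident"] < min_pident
                or hit["length"] < min_length):
            continue
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best
```

With the output file from the question, this could be called as `best_hits(open("output.blastp"))`, after rerunning blastp without `-max_target_seqs 1` so that there is something to filter.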

ADD REPLY • link 6.6 years ago by Jean-Karim Heriche 27k

Thanks for the reply. I want perfect matches, but which parameters should I set to get a good match?

ADD REPLY • link 6.6 years ago by cvu ▴ 180

If you're only looking for identical matches, blast is the wrong tool for the job. Just use grep, the string-matching function of a scripting language, or an implementation of a global alignment algorithm (e.g. needle in the EMBOSS suite). If you insist on blast, filter the output on alignment length and percent identity, i.e. only keep alignments (HSPs) that are full length relative to the query and have 100% identity.
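A minimal sketch of that filter in Python, assuming the default `-outfmt 6` columns and that the query ID reported by blast is the first whitespace-delimited word of the FASTA header. Query lengths are read from the query FASTA because the default tabular output does not include them. Both function names are made up for this example.

```python
def read_fasta_lengths(lines):
    """Return {sequence_id: length} from FASTA-formatted lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first word of the header line.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def perfect_full_length_hits(blast_lines, query_lengths):
    """Return (query, subject) pairs whose HSP spans the whole query
    with no mismatches and no gaps."""
    kept = []
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular line
        qseqid, sseqid = fields[0], fields[1]
        pident = float(fields[2])
        length, mismatch, gapopen = (int(x) for x in fields[3:6])
        qstart, qend = int(fields[6]), int(fields[7])
        qlen = query_lengths.get(qseqid)
        if qlen is None:
            continue  # query not found in the FASTA
        full_length = qstart == 1 and qend == qlen and length == qlen
        # pident is rounded in the output, so also require zero
        # mismatches and zero gap openings.
        identical = pident == 100.0 and mismatch == 0 and gapopen == 0
        if full_length and identical:
            kept.append((qseqid, sseqid))
    return kept
```

Note that this only checks coverage of the query. The subject may still be longer than the query, so a truly identical protein would also need the subject length checked.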

ADD REPLY • link 6.6 years ago by Jean-Karim Heriche 27k

score 0 · Answer 1 · 2017-09-15

6.6 years ago

Matteo Schiavinato ★ 3.6k

What should my blastp parameters be to get only significant matches?

This question is in the same category as "what is the cure for cancer".

Functional annotation is a pain in the a**, you have to deal with it. There are no "optimal" parameters, and many results in the databases are either wrong or "unknown", "uncharacterized", "undefined".

From your command I can see that you are limiting the hits to 1 and the high-scoring segment pairs (HSPs) to 1. Why? I assume you are working with a non-model organism (like I do), so you ran a gene prediction because none was available. However, you could allow more than one HSP, because a query often aligns to the same subject sequence in more than one HSP. There is a script (find-best-hit.py) that finds the best combination of HSPs in a blast run from an XML output file. Give it a try!

ADD COMMENT • link 6.6 years ago by Matteo Schiavinato ★ 3.6k

Thanks for the reply. I want only the single best match for each protein, hence I set max_hsps to 1.

ADD REPLY • link 6.6 years ago by cvu ▴ 180

"HSP corresponds to the matching region between the query sequence and the database hit sequence." from High Scoring Pairs (HSP) in BLAST output

There can be many HSPs per match, and limiting your blast run to one per match may underestimate or otherwise misestimate the overall sequence identity.
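To make the effect concrete, here is a rough Python sketch that sums how much of the query is covered by all HSPs of the same query/subject pair, merging overlapping HSPs so that no residue is counted twice. It assumes default `-outfmt 6` columns and a run without `-max_hsps 1`. It is a simplification for illustration, not the optimal-HSP-set algorithm from the linked post, and the function names are made up.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total number of positions covered by 1-based inclusive intervals,
    counting overlapping regions only once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # No overlap with the running interval: close it, start a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlap: extend the running interval if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def query_span_per_pair(blast_lines):
    """Return {(qseqid, sseqid): number of query residues covered by any HSP}."""
    spans = defaultdict(list)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular line
        qstart, qend = sorted((int(fields[6]), int(fields[7])))
        spans[(fields[0], fields[1])].append((qstart, qend))
    return {pair: merged_length(ivs) for pair, ivs in spans.items()}
```

Dividing the result by the query length gives a coverage fraction per pair. A pair whose best single HSP covers only a third of the query may cover most of it once all HSPs are combined.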

I think you should make the effort to read the literature about it before doing your analysis blindly.

EDIT: this is the literature you need to read! http://jeff.wintersinger.org/posts/2014/07/designing-an-algorithm-to-compute-the-optimal-set-of-blast-hits/

ADD REPLY • link 6.6 years ago by Matteo Schiavinato ★ 3.6k