Question: standalone BLAST parameters
3.0 years ago by
cvu170
India
cvu170 wrote:

Hi All,

I've predicted genes in a genome. Now I want to identify the proteins. For that, I've run BLAST for all predicted proteins against the UniProt database.

BLAST parameters:

blastp -query proteins.fasta -db Uniprotdb -max_target_seqs 1 -max_hsps 1 -out output.blastp -outfmt 6 -evalue 0.001

What should my blastp parameters be to get only significant matches?

Thank you in advance!!

blast gene genome • 1.0k views
modified 3.0 years ago by Macspider3.2k • written 3.0 years ago by cvu170

What is a significant match for you? If you expect to always find nearly perfect matches in UniProt, then restrictive parameters would work, but if you also expect imperfect matches you need parameters that accommodate them. It may be easier to let BLAST report more hits and then filter these with your favourite scripting language.
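As a rough sketch of the "report more hits, then filter" approach, the Python below keeps the highest-bitscore row per query from a default tabular file (-outfmt 6, 12 columns). The function name best_hits and the thresholds are illustrative assumptions, not recommended values.

```python
import csv

# Column order of BLAST's default tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-3, min_pident=0.0):
    """Return {query ID: row dict}, keeping the highest-bitscore row per query.

    `handle` is any iterable of tab-separated lines, e.g. an open file.
    The default thresholds are placeholders (assumptions), not advice.
    """
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (as written by -outfmt 7).
        if not fields or fields[0].startswith("#"):
            continue
        row = dict(zip(COLUMNS, fields))
        if float(row["evalue"]) > max_evalue:
            continue
        if float(row["pident"]) < min_pident:
            continue
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


if __name__ == "__main__":
    # output.blastp is the file name used in the command above.
    with open("output.blastp", newline="") as handle:
        for query, row in best_hits(handle).items():
            print(query, row["sseqid"], row["pident"], row["evalue"], sep="\t")
```

Running BLAST with a looser -max_target_seqs and filtering afterwards keeps the thresholds adjustable without rerunning the search.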

written 3.0 years ago by Jean-Karim Heriche23k

Thanks for the reply. I want perfect matches, but which parameters should I set to get a good match?

modified 3.0 years ago • written 3.0 years ago by cvu170

If you're only looking for identical matches, BLAST is the wrong tool for the job. Just use grep, the string matching function of a scripting language, or an implementation of a global alignment algorithm (e.g. needle in the EMBOSS suite). If you insist on BLAST, filter the output on alignment length and percent identity, i.e. only keep alignments (HSPs) that are full length relative to the query and have 100% identity.
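Both options can be sketched in a few lines of Python. The first function does plain exact string matching on sequences already loaded into dictionaries. The second filters tabular BLAST rows, assuming the search was run with -outfmt "6 std qlen" so that the query length is available as a 13th column. The function names are hypothetical, and the check uses the mismatch and gap columns instead of the rounded percent identity.

```python
def exact_matches(queries, database):
    """Map each query ID to the database IDs whose sequence is identical.

    Both arguments are plain {ID: sequence} dictionaries; how they are
    loaded from FASTA is left out of this sketch.
    """
    by_sequence = {}
    for db_id, seq in database.items():
        by_sequence.setdefault(seq.upper(), []).append(db_id)
    return {qid: by_sequence.get(seq.upper(), [])
            for qid, seq in queries.items()}


def is_perfect_full_length(fields):
    """True if one tabular row is a gap-free, mismatch-free, full-length HSP.

    Assumes the 13 columns produced by -outfmt "6 std qlen":
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore qlen
    """
    mismatch = int(fields[4])
    gapopen = int(fields[5])
    qstart, qend = int(fields[6]), int(fields[7])
    qlen = int(fields[12])
    # No mismatches and no gaps means 100% identity over the alignment;
    # qstart == 1 and qend == qlen means the whole query is aligned.
    return mismatch == 0 and gapopen == 0 and qstart == 1 and qend == qlen
```

Note that the filter only checks coverage of the query; a subject that is longer than the query would still pass unless the subject length is checked as well.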

written 3.0 years ago by Jean-Karim Heriche23k
3.0 years ago by
Macspider3.2k
Vienna - BOKU
Macspider3.2k wrote:

what should be my blastp parameters, to get only significant match ?

This question belongs in the same category as "what is the cure for cancer".

Functional annotation is a pain in the a**, and you have to deal with it. There are no "optimal" parameters, and many results in the databases are either wrong or labelled "unknown", "uncharacterized", or "undefined".

From your command I can see that you are limiting the hits to 1 and the high-scoring segment pairs (HSPs) to 1. Why? I assume you are working on a non-model organism (like I do), so you did a gene prediction because there was none. However, you could allow more than one HSP, because a query often has more than one HSP per subject sequence. There is a script (find-best-hit.py) which finds the best combination of HSPs in a BLAST run from an XML output file. Give it a try!

written 3.0 years ago by Macspider3.2k

Thanks for the reply. I want only the single best match for each protein, hence I set -max_hsps to 1.

written 3.0 years ago by cvu170

"HSP corresponds to the matching region between the query sequence and the database hit sequence." from High Scoring Pairs (HSP) in BLAST output

There can be many HSPs per match, and limiting your blast run to one per match may reduce / underestimate / misestimate the overall sequence identity.
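To illustrate the point, here is a minimal Python sketch that merges the query intervals of all HSPs for each query/subject pair and reports the fraction of the query they cover. It assumes the rows were parsed from output that includes the query length (for example -outfmt "6 std qlen"). The function name query_coverage is hypothetical. This is a simple interval union, not the optimal-HSP-set algorithm that find-best-hit.py implements.

```python
from collections import defaultdict


def query_coverage(rows):
    """Fraction of each query covered by the union of its HSPs per subject.

    `rows` is an iterable of (qseqid, sseqid, qstart, qend, qlen) tuples
    with 1-based, inclusive coordinates, as in BLAST tabular output.
    Returns {(qseqid, sseqid): covered fraction of the query}.
    """
    spans_by_pair = defaultdict(list)
    query_lengths = {}
    for qseqid, sseqid, qstart, qend, qlen in rows:
        low, high = sorted((qstart, qend))
        spans_by_pair[(qseqid, sseqid)].append((low, high))
        query_lengths[qseqid] = qlen

    coverage = {}
    for (qseqid, sseqid), spans in spans_by_pair.items():
        covered = 0
        current_low, current_high = None, None
        for low, high in sorted(spans):
            if current_high is None or low > current_high:
                # Disjoint from the running interval: close it, start anew.
                if current_high is not None:
                    covered += current_high - current_low + 1
                current_low, current_high = low, high
            else:
                # Overlapping HSPs: extend the running interval.
                current_high = max(current_high, high)
        covered += current_high - current_low + 1
        coverage[(qseqid, sseqid)] = covered / query_lengths[qseqid]
    return coverage
```

With three HSPs on a 300-residue query at positions 1-100, 90-120 and 151-300, the union covers 270 residues (0.9 of the query), whereas the first HSP alone covers only 100.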

I think you should make the effort to read the literature about it before doing your analysis blindly.

EDIT: this is the literature you need to read! http://jeff.wintersinger.org/posts/2014/07/designing-an-algorithm-to-compute-the-optimal-set-of-blast-hits/

modified 3.0 years ago • written 3.0 years ago by Macspider3.2k