How to make BLAST go on to the next protein after the first hit
4.6 years ago

Hi, I would like to know whether there is a way to do the following: I have a set of protein sequences, and as soon as protein1 is blasted against my dataset and gets a positive result, BLAST moves on to protein2.

I have 4700 protein sequences and my dataset has over 37000. I don't need my proteins to be blasted against ALL 37000 proteins; if a protein matches at least one of those 37000, that's fine. Blasting against all 37000 is very time consuming.

blast • 1.2k views

Look at command line BLAST help and explore the following parameters:

 -num_descriptions <Integer, >=0>
   Number of database sequences to show one-line descriptions for
   Not applicable for outfmt > 4
   Default = `500'
    * Incompatible with:  max_target_seqs

 -num_alignments <Integer, >=0>
   Number of database sequences to show alignments for
   Default = `250'
    * Incompatible with:  max_target_seqs

 -max_target_seqs <Integer, >=1>
   Maximum number of aligned sequences to keep
   (value of 5 or more is recommended)
   Default = `500'
    * Incompatible with:  num_descriptions, num_alignments

Also -num_threads could be useful. If you have multiple CPUs, setting -num_threads to a number larger than 1 will speed up the search.

BLAST already does not do a full alignment against all 37000 reference proteins. It has "pre-filter" steps that work with a k-mer kind of mechanism, so only candidates found there go on to alignment. If you want to know how it works in detail you can look it up. And if your input (query) FASTA contains more than one sequence, BLAST will automatically go through it and blast all the sequences: first protein 1 is searched, then protein 2, and so on.

Maybe this helps: -max_target_seqs 1

https://www.ncbi.nlm.nih.gov/books/NBK279684/

Have you already tried to run BLAST? 4700 and 37000 do not sound like large numbers of sequences to me. How long is it taking?

I'm trying -max_target_seqs 1 right now, and it's going a lot faster.

I used the following command:

blastp -db Gut -query Kp.fasta -out results.txt -outfmt '10 qseqid' -evalue 10e-3

and it has been running for 4h now.

I'm going to try this max target option, because I only want to know which IDs from my proteins had at least one positive result within the dataset.
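
A sketch of how that run could look with both suggestions from this thread combined (-max_target_seqs 1 and -num_threads). The database and file names are the ones from the command above; the thread count of 4 and the `run_blast` wrapper are assumptions for illustration only, so adjust them to your machine.

```shell
# Assumed CPU count; set this to what your machine actually has.
THREADS=4

# Hypothetical wrapper around the command from this thread,
# with -max_target_seqs and -num_threads added.
run_blast() {
    blastp -db Gut -query Kp.fasta -out results.txt \
        -outfmt '10 qseqid' -evalue 10e-3 \
        -max_target_seqs 1 -num_threads "$THREADS"
}

# Run only if BLAST+ is installed and the query file is present.
if command -v blastp >/dev/null 2>&1 && [ -f Kp.fasta ]; then
    run_blast
else
    echo "skipped: blastp or Kp.fasta not found"
fi
```

Note the straight quotes around '10 qseqid'; curly quotes pasted from a web page are not understood by the shell.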

Don't forget to set the num_threads parameter. I'm not sure why you use that -evalue threshold, but if it comes from a paper or something similar, then you know why you use it.

Here is someone else who also used the evalue:

C: Choosing the cutoff for e-value when using very small Sequences

4.6 years ago

I'm afraid there is no such option in BLAST. Moreover, it somewhat conflicts with how BLAST works. BLAST first looks for all potential matches of a query versus the database. In the next step it aligns all of those initial potential matches. As such, it will already have done much more work than you are referring to in your question. There is not much computational (time) difference between finding a single potential initial match and finding several dozen.

What you can try (but this has some consequences as well) is to limit the number of potential initial matches it retains for the computationally heavier alignment step. The option for this is -max_target_seqs, and you can set it to 1, for instance (it will then only report the alignment for one match). Keep in mind that this will not necessarily be the best match you could have got; this has been extensively discussed in other threads on Biostars.

On your specific setup: 37K proteins in the DB is not that much (rather low, to be honest); on a decent machine this should be done in a few hours (all vs all, with default parameter settings). If you really want to speed up the process, split your input file into smaller files (e.g. 470 sequences per file) and blast all those smaller files in parallel against the same DB, preferably on a compute cluster.
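
A minimal sketch of the splitting step, using awk. The demo file, the `chunk_NNN.fasta` naming and the chunk size of 2 are made up so the example runs on its own; for the real input you would use n=470 on Kp.fasta, which gives 10 files for 4700 sequences.

```shell
# Tiny stand-in for the real query file (hypothetical content).
cat > queries_demo.fasta <<'EOF'
>p1
MKT
>p2
MAA
AAA
>p3
MGG
>p4
MLL
>p5
MVV
EOF

# Start a new output file every n records; a record begins at a '>' line.
awk -v n=2 '
    /^>/ {
        if (count % n == 0) {
            if (out) close(out)
            out = sprintf("chunk_%03d.fasta", ++part)
        }
        count++
    }
    { print > out }
' queries_demo.fasta

ls chunk_*.fasta
```

Each chunk can then be given to its own blastp job with the same -db and options as the original command, for example one job per chunk on a cluster, and the per-chunk result files concatenated afterwards.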

4.6 years ago
Mensur Dlakic ★ 28k

This can't be done. But even if it could, would you really want this? Let's say your protein has 10 matches and the worst one of them comes up first during the search. Do you really want your search to stop on the worst match?

BLAST is known for being very fast. What you describe should take a couple of hours at most to do properly. Are you not able or not willing to spend that time? You should know that there are very few programs out there faster than BLAST, and it is very unlikely that you will forever be searching against databases of only 37000 sequences. Most modern protein sequence databases contain tens or hundreds of millions of sequences, so you may want to start building your patience for those tasks.

So, yes, because I don't care whether it's a good match or a bad match. I set an evalue and I just want to see whether each protein matches or not, because I'm going to discard the proteins that match. So I only need the program to tell me which of my proteins match my dataset.
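
For that goal, the ID-only output can be reduced after the search instead of stopping BLAST early. A small sketch, assuming the results file holds one query ID per reported alignment (so an ID can repeat) and that the query ID is the first word of each FASTA header; the demo file names and contents are invented so the example runs on its own.

```shell
# Stand-ins for the real query FASTA and BLAST output (hypothetical content).
cat > Kp_demo.fasta <<'EOF'
>prot1 first protein
MKT
>prot2
MAA
>prot3
MGG
EOF
printf 'prot1\nprot1\nprot3\n' > results_demo.txt

# Queries with at least one hit, each listed once.
sort -u results_demo.txt > matched_ids.txt

# All query IDs: header lines without '>' and without the description.
grep '^>' Kp_demo.fasta | sed 's/^>//; s/ .*//' | sort -u > all_ids.txt

# Queries with no hit at all, i.e. the proteins to keep
# (lines present only in all_ids.txt).
comm -23 all_ids.txt matched_ids.txt > unmatched_ids.txt
cat unmatched_ids.txt
```

Both lists are sorted with the same `sort` call before `comm` is used, since `comm` expects sorted input.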

But when is it a match? When does it match or not, in your opinion? Is 30% identity a match, or 40%, or 50%?

That's unfortunate (or bad practice at best), because you should care about the hit quality. If not, just pick some proteins at random.

As gb also points out, there is no such thing as 'a match'; it's only a match given the thresholds and parameter settings.

4.6 years ago
ashish ▴ 680

Why don't you try DIAMOND? It's faster than BLAST. There is a slight compromise in accuracy in exchange for the speed, but from the discussion above it seems accuracy is not the first priority here. I think DIAMOND will be good for this job.
