Question: BLAST, setting maximum number of hits
apelin20 wrote:

Hello,

I am trying to set the maximum number of hits to 5 so that the run finishes sooner, but I still get hundreds of hits:

# TBLASTX 2.2.29+
# Query: Locus_40_Transcript_185/186_Confidence_0.224_Length_4778
# Database: ../../Genome/Genome
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 714 hits found

I am running:

tblastx -db ../../Genome/Genome -query all_merged_k125.fa -evalue 1e-10 -outfmt 7 -out tblastx/all_merged_k125.fmt7 -num_threads 16 -max_target_seqs 5

Any idea why it's still reporting so many hits?

Adrian

blast tblastx parallel • 15k views
written 4.3 years ago by apelin20

5heikki wrote:

Even with -max_target_seqs 1 you can get, e.g., 10 hits from one long target sequence. In other words, -max_target_seqs does not set the maximum number of hits per query; it sets the maximum number of target sequences for which hits are reported, per query.

written 4.3 years ago by 5heikki
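
To see this distinction in an existing report, one can count hit lines and distinct subject IDs per query. This is a minimal sketch, assuming tab-separated -outfmt 6 or 7 output (query ID in column 1, subject ID in column 2); the function name `count_hits_and_subjects` is hypothetical.

```shell
# For each query, print: query id, number of hit lines (HSPs), number of distinct subjects.
# Assumes tab-separated BLAST tabular output: column 1 = query id, column 2 = subject id.
count_hits_and_subjects() {
    awk -F'\t' '
        /^#/ { next }                       # skip the comment lines written by -outfmt 7
        {
            hits[$1]++                      # every remaining line is one reported HSP
            if (!seen[$1 SUBSEP $2]++)      # first time this query/subject pair appears
                subjects[$1]++
        }
        END {
            for (q in hits)                 # note: awk does not guarantee output order here
                printf "%s\t%d\t%d\n", q, hits[q], subjects[q]
        }
    ' "$1"
}
```

If the third column never exceeds the -max_target_seqs value while the second column does, the option is behaving as described above.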

OK, that makes sense. How do I limit the number of hits?

written 4.3 years ago by apelin20

Setting -max_target_seqns to 1 will give only 1 subject/hit but several HSPs if they are present.

Setting -max_hsps to 1 will give only 1 HSP per subject but for all subject/hits in the database.

If you really want only 5 HSPs per subject, set the -max_target_seqns to 1 and -max_hsps to 5.

written 4.3 years ago by Siva
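
Put together, the two options might be combined as in the sketch below. The wrapper name `limited_tblastx` and its `DRY_RUN` switch are hypothetical, the e-value is the one from the question, and `-max_hsps` may be named differently in older BLAST+ releases (check `tblastx -help` for your version).

```shell
# Hypothetical wrapper: cap both the subjects per query and the HSPs per subject.
# Usage: limited_tblastx QUERY DB OUT MAX_SUBJECTS MAX_HSPS
limited_tblastx() {
    # ${DRY_RUN:+echo} expands to "echo" when DRY_RUN is set and non-empty,
    # so the command line is printed instead of being run.
    ${DRY_RUN:+echo} tblastx -db "$2" -query "$1" -evalue 1e-10 -outfmt 6 \
        -max_target_seqs "$4" -max_hsps "$5" -out "$3"
}

# Example (not run here): at most 5 subjects per query with 1 HSP each.
# limited_tblastx all_merged_k125.fa ../../Genome/Genome tblastx/all_merged_k125.fmt6 5 1
```

With 5 subjects and 1 HSP per subject, each query should produce at most 5 lines of tabular output.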

Makes sense, thank you!

written 4.3 years ago by Adrian Pelin

(typo: should be -max_target_seqs instead of -max_target_seqns)

written 13 months ago by al-ash

I would do it after BLAST has finished (on -outfmt 6 output):

Make sure the file is sorted by query and then by best hit (here bit score > e-value > percent identity):

export LC_ALL=C LC_LANG=C; sort -k1,1 -k12,12gr -k11,11g -k3,3gr outputFile > sortedFile

Then get the top 5 hits for every query:

for next in $(cut -f1 sortedFile | sort -u); do grep -w -m 5 "$next" sortedFile; done > topFivePerQuery
written 4.3 years ago by 5heikki
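
The loop above rescans sortedFile once per query and matches the query ID as a pattern anywhere in the line. A single-pass variant can be sketched with awk, assuming the same sorted, tab-separated file; the function name `top_n_per_query` is hypothetical.

```shell
# Keep the first N lines for each query id (column 1) of an already sorted file.
# Usage: top_n_per_query N sortedFile > topFivePerQuery
top_n_per_query() {
    # count[$1] tracks how many lines have been seen for this query;
    # a line is printed only while that count is still <= N.
    awk -F'\t' -v n="$1" '++count[$1] <= n' "$2"
}
```

Because the comparison is on column 1 only, a query ID that also occurs as a subject ID or contains regex characters such as "." does not affect the result.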

Yes, the above command made my day!!!!!!!!!

Thanks

written 4 weeks ago by archana.bioinfo87120

I guess you can always give a relatively stringent e-value and filter the resulting hits later.

written 4.3 years ago by RamRS

What I wanted is to speed up the blasting.
 

written 4.3 years ago by apelin20

I doubt that limiting the number of hits like that would speed up your BLAST run significantly. It still has to search the whole db for every query, so the only difference would be the time it takes to write 5 or 10 lines (or however many) to the output file. Instead, if your db is small (or you have a ton of RAM), you should parallelize BLAST (e.g. with GNU Parallel) by running multiple single-threaded searches on split input rather than using -num_threads X.

written 4.3 years ago by 5heikki
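
The split-input approach might look like the sketch below. It assumes GNU Parallel is installed and that paths contain no spaces; the function name, block size, and job count are illustrative choices, not values from this thread. Each job opens the database on its own, hence the RAM caveat above.

```shell
# Hypothetical helper: split a FASTA query file on record boundaries and run
# one single-threaded tblastx per chunk, 16 jobs at a time.
# Usage: parallel_tblastx QUERY DB OUT
parallel_tblastx() {
    # --pipe feeds chunks of roughly --block size to each job on stdin;
    # --recstart '>' makes every chunk start at a FASTA header, so records stay whole;
    # "-query -" tells tblastx to read its chunk from stdin;
    # -k keeps the combined output in the order of the input.
    parallel -k -j 16 --block 100k --recstart '>' --pipe \
        tblastx -db "$2" -query - -evalue 1e-10 -outfmt 6 -num_threads 1 \
        < "$1" > "$3"
}

# Example (not run here):
# parallel_tblastx all_merged_k125.fa ../../Genome/Genome tblastx/all_merged_k125.fmt6
```

Tabular -outfmt 6 output has no header lines, so the per-chunk results can simply be concatenated into one file.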

^True. You will benefit from multi-threading, and from trying both tblastx and blastall -p tblastx before choosing one of them. For shorter query sequences, I've seen the latter be significantly faster than the former.

written 4.3 years ago by RamRS

RamRS wrote:

There seems to be a problem with -outfmt 7. Can you check whether the problem persists if you use the default output format?

written 4.3 years ago by RamRS

I am trying -outfmt 6. Same thing: it gives many results, more than 5.

written 4.3 years ago by apelin20

Try not giving the -outfmt parameter. If it works, then it could be a problem specific to that group of output formats.

written 4.3 years ago by RamRS

dago wrote:

If I am not wrong, -max_target_seqs is ignored with -outfmt > 4. At least this is true for psiblast.

written 4.3 years ago by dago

max_target_seqs - Number of aligned sequences to keep. Use with report formats that do not have separate definition line and alignment sections such as tabular (all outfmt > 4). Not compatible with num_descriptions or num_alignments.

written 4.3 years ago by apelin20

No, it is recommended to be used with outfmt>4. See here: http://www.ncbi.nlm.nih.gov/books/NBK1763/#_CmdLineAppsManual_Appendix_C_Options_for_

written 4.3 years ago by RamRS

You are right. I checked my code and it refers to `-num_descriptions`

written 4.3 years ago by dago