Question: how to extract best hits of blast in terms of e-value, identity,...?
3
gravatar for seta
3.9 years ago by
seta1.1k
Sweden
seta1.1k wrote:

Hi everybody,

I used the following code to do blastx (by last version of ncbi-blast+) of my transcriptome assembly against the proteome reference,

blastx -query file1 -db pr_database  -out file1_1.out -e-value 1e-3 -max_target_seqs 20 -outfmt 6

Now, I have two questions; given my command, are all blast hit is the best or I look at also other parameters, like identity and alignment length?.  Sharing your factors to select the best hit would be highly appreciated.

Also, could you please provide me how to extract desired hit in terms of e-value and identity or alignment length (if it is necessary)?

Thanks

rna-seq blast alignment • 5.9k views
ADD COMMENTlink modified 9 months ago by chriswymant0 • written 3.9 years ago by seta1.1k
7
gravatar for 5heikki
3.9 years ago by
5heikki8.4k
Finland
5heikki8.4k wrote:
export LANG=C; export LC_ALL=C; sort -k1,1 -k12,12gr -k11,11g -k3,3gr output | sort -u -k1,1 --merge > bestHits

 

Sort by 1. query name, 2. bitscore, 3. evalue, 4. nucleotide identity, and extract the best line for each query (bitscore more important than evalue, evalue more important than nucleotide identity). I've been using en_US.UTF-8 locale, but I think this should also work with C (and be somewhat faster).

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by 5heikki8.4k

Just giving this a try and it doesn't output the correct columns for my last BLAST output -- is this for standard BLAST output?  What flags did you set in your BLAST for this to work?

ADD REPLYlink written 3.9 years ago by Josh Herr5.6k
1

Standard outfmt 6 plus some extra fields after that. I'm guessing it's not really working if you used blastx (instead of IMO the far superior protein prediction + blastp approach), meaning that all your query names from the same contigs (or whatever) are identical. Tabular blastx output is pretty much unsuitable for automated sorting, unless input is short reads and you're not really expecting more than 1 protein per query sequence..

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by 5heikki8.4k

+1, You're right -- I did use blastx.  I'll check on the next BLAST run.  Thanks for all your help.

ADD REPLYlink written 3.9 years ago by Josh Herr5.6k
Don't forget the outfmt 6 flag. I recommend prodigal for protein prediction..
ADD REPLYlink written 3.9 years ago by 5heikki8.4k

Thanks 5heikki,

It works for me, for me -outfmt 6.

just for clarification, as I used -max_target_seqs 20 in my blast command, I expected that all hits were best hit, but using suggested command I got about 27000 hit from 32000 hits as best hit. Please let me know how to explain this difference? sorry for this question, I'm a new this filed and may be have a stupid question in your professional view! thanks

ADD REPLYlink written 3.9 years ago by seta1.1k

20 -max_target_seqs for each query contig/read.

ADD REPLYlink written 3.9 years ago by 5heikki8.4k

thanks dear friend

ADD REPLYlink written 3.9 years ago by seta1.1k
0
gravatar for chriswymant
9 months ago by
chriswymant0 wrote:

WARNING: sort -u --merge appears to fail spectacularly and silently on MacOS (at least for me):

$ sort --version
2.3-Apple (99)
$ echo -e "foo\nbar\neggs" | sort -u --merge
$

i.e. it produces no output. Compare with the sort on (my) Ubuntu:

$ sort --version
sort (GNU coreutils) 8.21
[other details suppressed]
$ echo -e "foo\nbar\neggs" | sort -u --merge
foo
bar
eggs

This means the sort -u --merge-based commands for keeping the top blast hit, i.e. for doing this kind of thing

$ echo -e "foo,1\nbar,\nfoo,2\neggs," | sort -t, -k1,1 -k2,2 | sort -t, -k1,1 -u --merge
bar,
eggs,
foo,1

(here keeping the line with the smallest value of the second comma-separated field) work with that linux sort but simply produce nothing with the MacOS sort.

To do what we want robustly I wrote https://github.com/ChrisHIV/shiver/blob/master/tools/KeepBestLinesInDataFile.py, which you call from the command line telling it which fields to sort & merge by.

ADD COMMENTlink written 9 months ago by chriswymant0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 769 users visited in the last hour