How Can I Sort Blast Output Based On E-Value?
2
2
Entering edit mode
7.7 years ago
biostar ▴ 150

Hello: I have a blast output with about 50000 sequence list. The problem I have is that I need to sort those results based on the smallest e-value to greater e-value. Could you guys please suggest me anything that I can resort to? Thanks! The blast output looks like this:

    Tx_332323_c0_seq1    110815944    RecName: Full=Gag-Pol polyprotein; AltName: Full=Pr160Gag-Pol; Contains: RecName: Full=Matrix protein p17; Short=MA; Contains: RecName: Full=Capsid protein p24; Short=CA; Contains: RecName: Full=Spacer peptide p2; Contains: RecName: Full=Nucleocapsid protein p7; Short=NC; Contains: RecName: Full=Transframe peptide; Short=TF; Contains: RecName: Full=p6-pol; Short=p6*; Contains: RecName: Full=Protease; AltName: Full=PR; AltName: Full=Retropepsin; Contains: RecName: Full=Reverse transcriptase/ribonuclease H; AltName: Full=Exoribonuclease H; AltName: Full=p66 RT; Contains: RecName: Full=p51 RT; Contains: RecName: Full=p15; Contains: RecName: Full=Integrase; Short=IN    1446    5    457    107    1028    1138    4e-06    66    35.00    -3/0
Tv_333332_c0_seq2    206558251    RecName: Full=Zinc finger MYM-type protein 5    656    8    41    796    405    615    6e-14    92    27.63    2/0
Tv_144391_c0_seq1    75057844    RecName: Full=Nitrogen permease regulator 2-like protein; Short=NPR2-like protein; AltName: Full=Tumor suppressor candidate 4    380    4    1410    304    7    380    6e-118    65    48.27    -3/0
Ty_116400_c0_seq1    73920872    RecName: Full=Longitudinals lacking protein, isoforms F/I/K/T    970    1    29    190    902    956    4e-09    33    45.45    2/0
Ty_144400_c0_seq2    73920872    RecName: Full=Longitudinals lacking protein, isoforms F/I/K/T    970    1    29    190    902    956    2e-08    20    45.45    2/0
Tx_444402_c0_seq1    74834619    RecName: Full=Cathepsin L-like proteinase; Flags: Precursor    324    10    57    1034    1    323    9e-75    91    42.09    3/0
blast output sort • 5.9k views
ADD COMMENT
2
Entering edit mode

you know, there is a linux tool named 'sort': It sorts.

ADD REPLY
1
Entering edit mode

Thanks, but I tried that and gives faulty result.

ADD REPLY
2
Entering edit mode

Try this way: sort -k 10,10g input -o output.

ADD REPLY
1
Entering edit mode

Also, there is always the manual way...

ADD REPLY
4
Entering edit mode
7.7 years ago
PoGibas 4.9k

As Pierre suggested you can try sort (take a look at -g option (for general numerical value)).

With a given example this works:

awk '{print $(NF-3),$0}' INPUT | sort -gk1  
# Used awk as original INPUT needs a little bit munging
ADD COMMENT
1
Entering edit mode

Thanks Pgibas! The awk command works as needed!

ADD REPLY
3
Entering edit mode
7.7 years ago

Don't sort by e-value, sort by score instead! The reason is that there can be many more ties in e-values due to numerical accuracy (e-value is exponentially decreasing wrt score) and database sizes. Often, one will for example get many alignments with e-value of 0 but different scores. Based on the numerical value of e-value these will be ranked randomly. Except for ties yielding random ranking, sorting by bit-score of HSPs of the same query and subject, will yield the same ranking of HSPs as sorting by e-values.

ADD COMMENT
0
Entering edit mode

Thanks Mike, good to know that as well.

ADD REPLY

Login before adding your answer.

Traffic: 1698 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6