How to sort blast results by a minimum % of identical matches?
1
1
Entering edit mode
3.4 years ago
A_heath ▴ 160

Hi all,

I made BLASTp analysis on a specific genome against large databases. Here is my command line for example:

blastp -query SEQUENCE.fasta -db database -out blast_results -outfmt 6 -evalue 0.01 -max_target_seqs 5

As expected, my output files are very large and I would like to select only the matches with a percentage of identical matches of 60 or higher.

Here is an example of an output file with only the first few lines:

Prokka_00003    VFDB_prot_DB_VFG012837(gi:82524546)_(ipaH2.5)_invasion_plasmid_antigen,_fragment_[Mxi-Spa_TTSS_effectors_controlled_by_MxiE_(CVF465)]_[Shigella_dysenteriae_Sd197]  28.947  114 77  3   134 247 342 451 7.17e-05    44.7
Prokka_00007    VFDB_prot_DB_VFG042740(gi:15599489)_(pprA)_two-component_sensor_PprA_[type_IV_pili_(AI098)]_[Pseudomonas_aeruginosa_PAO1]   29.496  278 175 5   463 735 654 915 8.18e-20    94.7
Prokka_00007    VFDB_prot_DB_VFG038759(gb|YP_857329)_(fleS/flrB)_two-component_sensor_kinase_[ND_(AI144)]_[Aeromonas_hydrophila_subsp._hydrophila_ATCC_7966]    24.194  372 224 11  366 734 17  333 4.81e-19    89.4
Prokka_00007    VFDB_prot_DB_VFG038760(gb|YP_008044159)_(flrB)_flagellar-related_two-component_sensor_kinase_[Polar_flagella_(VF0473)]_[Aeromonas_hydrophila_ML09-119]  24.194  372 224 11  366 734 17  333 6.71e-19    89.0
Prokka_00007    VFDB_prot_DB_VFG038762(gi:330830289)_(flrB)_two_component_system_flagellar_sensor_histidine_kinase_FlrB_[Polar_flagella_(CVF786)]_[Aeromonas_veronii_B565]  26.141  241 163 5   497 734 105 333 7.11e-18    85.9
Prokka_00007    VFDB_prot_DB_VFG038761(gi:145298503)_(flrB)_two-component_system_flagellar_sensor_histidine_kinase_FlrB_[Polar_flagella_(CVF786)]_[Aeromonas_salmonicida_subsp._salmonicida_A449]   25.311  241 165 5   497 734 105 333 1.28e-17    85.1

Do you know how would I be able to do that?

Thank you so much for your much appreciated help!!

Have a nice day

blast • 1.4k views
ADD COMMENT
1
Entering edit mode

word of caution: these values are on a per match/HSP basis (so do not confuse them with a per hit basis). This means that in some cases you might get two (or more) values for the same query-hit pair.

getting the % identity for each query-hit pair is quite tricky, but in most cases the proxy- approach of Michael Dondrup will do indeed (and I would do the same tbh)

ADD REPLY
3
Entering edit mode
3.4 years ago
Michael 54k

According to the blast documentation, percent-identity is given in the third column of the standard output format "6". So you can simply sort or filter by the third column. In your example, there is no entry that satisfies your threshold.

awk -F "\t"  '{if ($3 >=60) print $0}' blast-output.txt

Something like the above might work.

ADD COMMENT
0
Entering edit mode

Thank you so much Michael Dondrup! That's exactly what I needed

ADD REPLY
1
Entering edit mode

Always happy to help 🙂

ADD REPLY

Login before adding your answer.

Traffic: 1781 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6