All vs all blast - unexpected results
Entering edit mode
15 months ago


I am running an all-vs-all blastp search using BLAST+ version 2.10.1+ via the subprocess module as part of a reciprical best hit python software. I am aggregating all of the proteins from various genomes, writing them to a fasta that is used as both the query and to make a db, then finding the best hit within each genome (as opposed to the overall best hit) for each query.

Interestingly some of the queries do not have hits in the genome from which they were sourced (1 query from a 7038-protein genome) and others do not have themselves as a top hit for their host genome (5-10 proteins per genome for 4 genomes of a similar size). Does anyone know why this is?

The blast call parameters are:

'-out', fr'{results_out_path}', #filepath
'-query', fr'{query_user}', #filepath
'-db', fr'{db_user}', #filepath
'-evalue', fr'{evalue_user}', #0.001
'-outfmt', '5', #xml
'-num_threads', fr'{thread_num}', #4

#Gabriel Moreno-Hagelsieb, Kristen Latimer, Choosing BLAST options for better detection of 
#orthologs as reciprocal best hits, Bioinformatics, Volume 24, Issue 3, February 2008, Pages 
'-seg', 'yes',
'-soft_masking', 'true',

#biopython defaults 
'-num_alignments', '200'

The parameters are equivalent to those on the command line - but if it would help for me to write it in command line format please let me know.

One thing I had considered was that the proteins in question could be low complexity, and seg could be filtering them to the extent that they are no longer recognisable as the best hit. Has anyone seen this before, or can think of another reason?

Thanks, Tim

alignment seg blastp • 714 views
Entering edit mode
15 months ago
Mensur Dlakic ★ 27k

Very short sequences - proteins or DNA/RNA - may not hit themselves in the database unless BLAST is specifically running in a short mode (-task blastp-short). Off the top of my head, very short would be < 35 residues.

The order of hits is determined by the length of the matching sequence in addition to absolute scores. Also, in the case of identical matches any one of them can be a top hit. If your protein has 2 other identical matches in the database, the hit to itself could be placed anywhere between spots 1-3. BLAST doesn't look into protein headers to decide that a self-hit should always be #1 - it looks only at sequences.

And yes, filtering low-complexity regions could also give unexpected results.

Entering edit mode

Hi Mensur, thanks for your reply. I was parsing the results in the xml file myself (get the top hsp for each alignment via coveage, blast score and evalue, then pick the best alignment) so I'm not sure that the ordering would be an issue. But the other points are definitely something I'll look into, many thanks!


Login before adding your answer.

Traffic: 1186 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6