I'm trying to figure out why I am seeing different blast results based on database size. Here's what I'm trying to accomplish:
I have a thousand or so sequences of approximately 2000 bp in length. I want to subdivide the sequences into groups that don't share more than 13 bp of homology to other sequences in the group. I want to use blast to identify homologies between sequences.
First I create a local database from a fasta file of all sequences. Then I run a blastn query of the fasta file against the database.
makeblastdb -in seqlist.fasta -dbtype 'nucl' -out seqlist
blastn -task blastn -db seqlist -query seqlist.fasta -word_size 13 -out results.txt -outfmt 6
In the resulting output, the smallest region of homology found is 15 bp. Based on the blast result, I make subgroups of 20 or so sequences that do not share homology. To check my work, I repeat the blast process using only a single subgroup.
makeblastdb -in seqlist2.fasta -dbtype 'nucl' -out seqlist2
blastn -db seqlist2 -query seqlist2.fasta -word_size 13 -out results2.txt -outfmt 6
Now when I run the same query on a database of 20 or so sequences instead of 1000 or so sequences, the blast output finds all sorts of regions of homology of length 13 and 14 bp. I'm trying to understand why these outputs did not appear in the original blast query. Does blast use a different algorithm based on database size? Is there a parameter I can pass to change this?
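For reference, this is roughly how I compare the two result files. It is a minimal sketch that assumes the default 12-column -outfmt 6 layout (column 1 = query ID, column 2 = subject ID, column 4 = alignment length, column 11 = E-value). The function names min_hit_length and short_hits are my own helpers, not part of BLAST+.

```shell
# Sketch only: assumes the default 12-column -outfmt 6 layout
# (col 1 = qseqid, col 2 = sseqid, col 4 = alignment length, col 11 = evalue).
# Function names are illustrative helpers, not part of BLAST+.

# Print the shortest alignment length among hits between different sequences
# (self-hits, where query ID equals subject ID, are skipped).
min_hit_length() {
    awk -F'\t' '$1 != $2 && (min == "" || $4 + 0 < min) { min = $4 + 0 }
                END { if (min != "") print min }' "$1"
}

# Print query, subject, alignment length and E-value for every non-self hit
# whose alignment length is at most max_len (second argument).
short_hits() {
    awk -F'\t' -v max_len="$2" '$1 != $2 && $4 + 0 <= max_len + 0 {
        print $1 "\t" $2 "\t" $4 "\t" $11
    }' "$1"
}

# Example calls (file names as in the commands above):
#   min_hit_length results.txt
#   short_hits results2.txt 14
```

Printing the E-value column next to the length makes it easier to see how the short hits from the small database are scored.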
Per some forum posts I have found on using blastn to search for short alignments, I have tried including parameters like
-dust no -soft_masking false -task blastn-short
None of these parameters gets the large-database query to report the 13 bp regions of homology found in the small-database query. Reducing -word_size below 13 doesn't help either. Any advice or information on this would be appreciated.