Question: Blastn Gives Different Results Based on Database Size
khv wrote (3 months ago):

I'm trying to figure out why I am seeing different blast results based on database size. Here's what I'm trying to accomplish:

I have a thousand or so sequences of approximately 2000 bp in length. I want to subdivide the sequences into groups that don't share more than 13 bp of homology to other sequences in the group. I want to use blast to identify homologies between sequences.

First I create a local database from a fasta file of all sequences. Then I run a blastn query of the fasta file against the database.

makeblastdb -in seqlist.fasta -dbtype 'nucl' -out seqlist
blastn -task blastn -db seqlist -query seqlist.fasta -word_size 13 -out results.txt -outfmt 6

In the resulting output, the smallest region of homology found is 15 bp. Based on the blast result, I make subgroups of 20 or so sequences that do not share homology. To check my work, I repeat the blast process using only a single subgroup.

makeblastdb -in seqlist2.fasta -dbtype 'nucl' -out seqlist2
blastn -db seqlist2 -query seqlist2.fasta -word_size 13 -out results2.txt -outfmt 6
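
The post does not say how the subgroups were built from the first BLAST output. One simple possibility, shown here only as a sketch, is a greedy pass that puts each sequence into the first group where it has no reported hit against any existing member. The function name `group_sequences` and the file `ids.txt` (one sequence ID per line) are made up for illustration; the hit file is assumed to be the tab-separated `-outfmt 6` output, where columns 1 and 2 are the query and subject IDs.

```shell
# group_sequences HITS_TSV IDS_FILE
# Prints "id<TAB>group" for every ID in IDS_FILE.
# Assumes HITS_TSV is non-empty -outfmt 6 output (col 1 = query, col 2 = subject).
group_sequences() {
  awk -F'\t' '
    # First file: remember every pair of distinct sequences that share a hit.
    NR == FNR {
      if ($1 != $2) { conflict[$1, $2] = 1; conflict[$2, $1] = 1 }
      next
    }
    # Second file: place each ID into the first group without a conflict.
    {
      id = $1
      for (g = 1; g <= ngroups; g++) {
        ok = 1
        for (i = 1; i <= size[g]; i++)
          if ((id, member[g, i]) in conflict) { ok = 0; break }
        if (ok) break
      }
      if (g > ngroups) ngroups = g      # no existing group fits: open a new one
      member[g, ++size[g]] = id
      print id "\t" g
    }' "$1" "$2"
}

# Example (not run here), assuming plain FASTA headers:
#   grep '^>' seqlist.fasta | sed 's/^>//; s/ .*//' > ids.txt
#   group_sequences results.txt ids.txt > groups.tsv
```

A greedy pass like this does not minimise the number of groups, and it can only be as complete as the hit list it is given, which is exactly where the database-size effect discussed in this thread matters.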

Now when I run the same query on a database of 20 or so sequences instead of 1000 or so sequences, the blast output finds all sorts of regions of homology of length 13 and 14 bp. I'm trying to understand why these outputs did not appear in the original blast query. Does blast use a different algorithm based on database size? Is there a parameter I can pass to change this?

Per some forum posts I have found on using blastn to search for short alignments, I have tried including parameters like

-dust no
-soft_masking false
-task blastn-short

None of these parameters gets the large database query to output the 13 bp regions of homology found in the small database query. Reducing the word size (-word_size) doesn't help either. Any advice or information on this would be appreciated.
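
To compare the two runs side by side, it can help to tally alignment lengths in each output file. The sketch below assumes the standard `-outfmt 6` column order, where column 4 is the alignment length, and skips self-hits (query ID equal to subject ID). The function name `length_tally` is illustrative only.

```shell
# length_tally HITS_TSV
# Prints "alignment_length<TAB>count" for hits between distinct sequences,
# shortest alignment first.
length_tally() {
  awk -F'\t' '
    $1 != $2 { n[$4]++ }                       # column 4 = alignment length
    END { for (len in n) print len "\t" n[len] }
  ' "$1" | sort -n
}

# Example (not run here):
#   length_tally results.txt  | head
#   length_tally results2.txt | head
```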


An initial comment (on a point I'm quite sensitive about):

I want to subdivide the sequences into groups that don't share more than 13 bp of homology to other sequences in the group. I want to use blast to identify homologies between sequences.

What you are looking for is similarity! You cannot have 13 bp of homology.

Homology is a 'boolean' thing (yes or no); there is no such thing as more or less homology, or a percentage of homology.

(comment written 3 months ago by lieven.sterck)
lieven.sterck (Belgium, Ghent, VIB) wrote (3 months ago):

This all makes perfect sense (except the homology part, cf. the comment above ;)

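The database size influences the HSP statistics and, even more, the e-value calculation: the e-value of a hit grows with the size of the search space, so the same short match gets a larger e-value against the big DB than against the small one. It is very likely that doing the blast on the small DB gives more (or other) hits than the big one, especially since you use the same threshold (the default e-value cutoff) in both runs.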
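
As a rough illustration of that scaling, the usual relation between bit score and e-value is E ≈ m · n · 2^(−S'), with m the query length, n the database length and S' the bit score. The sketch below ignores BLAST's effective-length corrections, so its numbers will not match real output exactly. The bit score of 24.7 is a placeholder for a short perfect match, and the database sizes are back-of-the-envelope figures taken from the question (about 1000 versus about 20 sequences of roughly 2000 bp).

```shell
# evalue BITSCORE QUERY_LEN DB_LEN
# Prints E = m * n * 2^(-S'), without effective-length corrections.
evalue() {
  awk -v s="$1" -v m="$2" -v n="$3" \
    'BEGIN { printf "%.3g\n", m * n * exp(-s * log(2)) }'
}

# Same hypothetical hit, scored against the two database sizes:
evalue 24.7 2000 2000000   # ~1000 sequences x ~2000 bp
evalue 24.7 2000 40000     # ~20 sequences x ~2000 bp
```

Whatever bit score is plugged in, the two e-values differ by the ratio of the database sizes (50-fold here), which is enough to push a borderline hit over the e-value cutoff in the large run while it still passes in the small one.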

Yes, there are parameters you can set to avoid this behavior, namely the following two:

 -dbsize <Int8>
   Effective length of the database
 -searchsp <Int8, >=0>
   Effective length of the search space

These fix the DB size, so you end up with the same scoring statistics regardless of the actual size of the DB. To set them, look at the output of the large-DB blast where it reports the database size (blastdb-size or similar) and use the same value when doing the small-DB blast. Personally, I would also set the -ungapped parameter.
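
If the size reported by the large-DB run is not at hand, a stand-in value can be obtained by counting the residues in the full FASTA file. This is only a sketch: the helper names `fasta_total_bases` and `blast_subgroup` are made up, and the count is an approximation of what BLAST itself reports for the database (`blastdbcmd -db seqlist -info` prints the total number of bases of a formatted database).

```shell
# fasta_total_bases FASTA
# Prints the number of sequence characters, ignoring header lines and whitespace.
fasta_total_bases() {
  awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# blast_subgroup DB QUERY_FASTA DBSIZE OUTFILE
# Hypothetical wrapper: same search as in the question, with the database
# size pinned via -dbsize so the statistics match the large-DB run.
blast_subgroup() {
  blastn -task blastn -db "$1" -query "$2" -word_size 13 \
         -dbsize "$3" -outfmt 6 -out "$4"
}

# Example (not run here):
#   blast_subgroup seqlist2 seqlist2.fasta "$(fasta_total_bases seqlist.fasta)" results2.txt
```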


Thanks for the reply; this solved the issue.

(comment written 3 months ago by khv)

-max_target_seqs (default 500) can also play a role here.
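
One way to check whether that limit could be involved is to count how many distinct subjects each query hits in the tabular output. If some queries sit right at the limit, raising `-max_target_seqs` may be worth a try. This is a sketch assuming `-outfmt 6` column order (column 1 = query, column 2 = subject); the function name `subjects_per_query` is illustrative.

```shell
# subjects_per_query HITS_TSV
# Prints "query<TAB>number_of_distinct_subjects", largest count first.
# Self-hits are counted too, since they also take up a target slot.
subjects_per_query() {
  awk -F'\t' '
    !seen[$1, $2]++ { n[$1]++ }                # count each query/subject pair once
    END { for (q in n) print q "\t" n[q] }
  ' "$1" | sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Example (not run here):
#   subjects_per_query results.txt | head
```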

(comment written 3 months ago by 5heikki)