Different blast results between CLCBio and local blast
10.0 years ago
biobio ▴ 50

Hi,

I've been using CLCBio to blast assembled contigs, but it's really slow. I decided to try setting up a local blast database and using that to blast my contigs, but I'm getting different results even though I'm using the same parameters. The parameters are below:

CLCBio:

Query genetic code: 1 Standard
Limit by entrez query: All organisms
Filter low complexity
Expect: 10
Word Size: 3
Matrix: BLOSUM62
Gap cost: Existence 11, Extension 1
Max number of hit sequences: 3

Local Blast:

blastx -db nr \
  -query ../results/contigs/CLC-contigs.fa \
  -evalue 10 \
  -matrix 'BLOSUM62' \
  -word_size 3 \
  -gapopen 11 \
  -gapextend 1 \
  -max_target_seqs 3 \
  -outfmt "10 std stitle" \
  -out ../results/blast/blast-005.csv \
  -num_threads 4

Since CLCBio and blast+ are using the same parameters, the same query sequence and the same database, I should get the same results, right? But I'm getting 226 hits in CLCBio that aren't in the local blast. Two of the species among these hits are extremely important and are known to be in the query sequence.

Any ideas?

Thanks!

clcbio nr blast blast-plus • 3.6k views

Check that your database is indeed the same; it could be a different version.


How can I check that? Is there a list of when the databases are updated? The CLCBio blast was done remotely a few days before the local blast so this could be the problem.
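
One way to check the local side, as a minimal sketch: BLAST+ ships `blastdbcmd`, and its `-info` option prints a summary of a database, which should include the title, the number of sequences and a date. The wrapper name `show_db_info` is made up for this example, and it assumes `nr` can be found via `BLASTDB` or the current directory. It says nothing about which version the remote CLCBio search used.

```shell
# Hypothetical wrapper: print the summary of a local BLAST database.
# Assumes BLAST+ is installed and the database name resolves (e.g. via $BLASTDB).
show_db_info() {
  if command -v blastdbcmd >/dev/null 2>&1; then
    blastdbcmd -db "$1" -info
  else
    echo "blastdbcmd not found on PATH" >&2
    return 127
  fi
}

# Report on the local nr database; do not abort if it cannot be read.
show_db_info nr || echo "could not read database info for nr" >&2
```

Comparing the reported date with the date of the remote search gives a rough idea of whether the two databases could differ.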

10.0 years ago
biobio ▴ 50

I think I've figured out what the issue is. One of my contigs returns this result in CLCBio: RecName: Full=Coat protein >gi|3212290|pdb|1A34|A Chain A, Satellite Tobacco Mosaic VirusRNA COMPLEX >gi|530203|gb|AAA47785.1| coat protein [Tobacco necrosis satellite virus]

However, in my local blast results, this is the description: RecName: Full=Coat protein

It seems that blast+ cuts the description off after the first >, while the CLCBio blast does not. blast+ does this in several instances, so the two blasts give different organisms when I count them.
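
The effect on organism counts can be reproduced on the description strings themselves. This is only a sketch, assuming merged titles are joined with `>` and organism names sit in square brackets. `list_organisms` is a hypothetical helper, not part of BLAST+ or CLCBio. As far as I know, BLAST+ also has a `salltitles` output field that reports all titles of a merged entry rather than only the first, so swapping `stitle` for `salltitles` in `-outfmt` may be worth trying.

```shell
# Hypothetical helper: list every organism named in a merged nr description.
# Assumes titles are separated by '>' and organisms appear as [Organism name].
list_organisms() {
  printf '%s\n' "$1" \
    | tr '>' '\n' \
    | grep -o '\[[^]]*\]' \
    | tr -d '[]' \
    | sort -u
}

# The full description as returned by CLCBio (copied from the post above).
desc='RecName: Full=Coat protein >gi|3212290|pdb|1A34|A Chain A, Satellite Tobacco Mosaic VirusRNA COMPLEX >gi|530203|gb|AAA47785.1| coat protein [Tobacco necrosis satellite virus]'

list_organisms "$desc"
```

With the full CLCBio description this prints `Tobacco necrosis satellite virus`. With the truncated blast+ description (`RecName: Full=Coat protein`) it prints nothing, which matches the organism going missing from the local counts.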

10.0 years ago
hpmcwill ★ 1.2k

The information from the CLCBio BLAST has:

  • "Query genetic code: 1 Standard" - default for NCBI BLAST blastx
  • "Limit by entrez query: All organisms" - a basic search uses all sequences in the database, so this would be the default behaviour; entrez filtering would only apply when using '-remote'
  • "Filter low complexity" - suggests the use of '-seg' to filter low complexity regions
  • "Expect: 10" - default for NCBI BLAST or set explicitly with '-evalue 10.0'
  • "Word Size: 3" - set with '-word_size 3'
  • "Matrix: BLOSUM62" - default for NCBI BLAST protein searches.
  • "Gap cost: Existence 11, Extension 1" - default for BLOSUM62 scoring matrix
  • "Max number of hit sequences: 3" - probably using '-max_target_seqs 3'

So it looks like you are missing the low complexity filtering parameter, which might explain the missing hits.

Given that the CLCBio BLAST uses the NCBI's BLAST Web Service, you could try running the search with the '-remote' flag to see if it gives different results on the NCBI's service vs. a local search.


Thanks for the reply! My understanding is that blast+ filters low complexity by default: http://rothlab.ucdavis.edu/genhelp/blast+.html#_FILTERING_OUT_LOW_COMPLEXITY_SEQUEN so that shouldn't be a problem here.


The default behaviour for filtering is specific to the BLAST program being used. Check the defaults for the program you are using by running with '-help'.


Here's what blastx says:

*** Query filtering options
 -seg <String>
   Filter query sequence with SEG (Format: 'yes', 'window locut hicut', or
   'no' to disable)
   Default = `12 2.2 2.5'
 -soft_masking <Boolean>
   Apply filtering locations as soft masks
   Default = `false'
 -lcase_masking
   Use lower case filtering in query and subject sequence(s)?

None of these really look like "Filter low complexity" to me but I could be wrong. Is there a place that explains what these are?

Edit: actually this seems to suggest that soft_masking is what I am looking for: "masking - Also known as filtering. The removal of repeated or low complexity regions from a sequence in order to improve the sensitivity of sequence similarity searches performed with that sequence."
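
Going by the help text quoted above, `-seg` is the option that filters the query with SEG (a low-complexity filter), and its default of `12 2.2 2.5` suggests filtering is already on for blastx. `-soft_masking` appears only to control how the masked regions are applied. To rule filtering out as the cause, the original command can be run with `-seg yes` stated explicitly. This is a sketch: `run_blastx` and the `DRY_RUN` switch are made up for this example, so the command can be inspected without BLAST+ installed.

```shell
# DRY_RUN is a hypothetical switch: 1 prints the command, 0 runs it.
DRY_RUN="${DRY_RUN:-1}"

run_blastx() {
  # In dry-run mode, prefix the command with echo so it is only printed
  # (the printed form does not show the quotes around the -outfmt value).
  if [ "$DRY_RUN" = "1" ]; then runner=echo; else runner=; fi
  $runner blastx -db nr \
    -query ../results/contigs/CLC-contigs.fa \
    -evalue 10 \
    -matrix BLOSUM62 \
    -word_size 3 \
    -gapopen 11 \
    -gapextend 1 \
    -seg yes \
    -max_target_seqs 3 \
    -outfmt "10 std stitle" \
    -out ../results/blast/blast-005.csv \
    -num_threads 4
}

run_blastx
```

If the hits are identical with and without the explicit `-seg yes`, the filtering setting is probably not what separates the two result sets.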
