Question: Different blast results between CLCBio and local blast
biobio (United States) wrote, 5.9 years ago:

Hi,

I've been using CLCBio to blast assembled contigs, but it's really slow. I decided to try setting up a local blast database and using that to blast my contigs, but I'm getting different results even though I'm using the same parameters. The parameters are below:

CLCBio:

Query genetic code: 1 Standard

Limit by entrez query: All organisms

Filter low complexity

Expect: 10

Word Size: 3

Matrix: BLOSUM62

Gap cost: Existence 11, Extension 1

Max number of hit sequences: 3

Local Blast: blastx -db nr -query ../results/contigs/CLC-contigs.fa -evalue 10 -matrix 'BLOSUM62' -word_size 3 -gapopen 11 -gapextend 1 -max_target_seqs 3 -outfmt "10 std stitle" -out ../results/blast/blast-005.csv -num_threads 4

Since CLCBio and blast+ are using the same parameters, the same query sequences, and the same database, I should get the same results, right? But I'm getting 226 hits in CLCBio that aren't in the local blast. Two of the species among these hits are extremely important and are known to be in the query sequences.

Any ideas?

Thanks!


Check that your database is indeed the same; it could be a different version.

written 5.9 years ago by Istvan Albert

How can I check that? Is there a list of when the databases are updated? The CLCBio blast was done remotely a few days before the local blast so this could be the problem.

written 5.9 years ago by biobio
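One way to check the local copy is to ask BLAST+ itself: `blastdbcmd -db nr -info` prints the database title, the number of sequences, and the date it was built. The sketch below wraps that call in Python. The helper names `db_info_command` and `read_db_info` are made up for illustration, and it assumes `blastdbcmd` is on the PATH and can find the `nr` database. The version of nr used by the remote CLCBio search cannot be checked this way.

```python
import shutil
import subprocess


def db_info_command(db="nr"):
    """Build the blastdbcmd call that reports title, size and date of a database."""
    return ["blastdbcmd", "-db", db, "-info"]


def read_db_info(db="nr"):
    """Return the text printed by blastdbcmd, or None if BLAST+ is not installed."""
    cmd = db_info_command(db)
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True)
    # On failure (e.g. database not found) blastdbcmd reports the problem on stderr.
    return result.stdout if result.returncode == 0 else result.stderr


# Example usage (requires a local BLAST+ installation):
# print(read_db_info("nr"))
```

If the reported date is later than the day of the remote search, the two searches most likely ran against different releases of nr.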
biobio (United States) wrote, 5.9 years ago:

I think I've figured out what the issue is. One of my contigs returns this result in CLCBio: RecName: Full=Coat protein >gi|3212290|pdb|1A34|A Chain A, Satellite Tobacco Mosaic VirusRNA COMPLEX >gi|530203|gb|AAA47785.1| coat protein [Tobacco necrosis satellite virus]

However, in my local blast results, this is the description: RecName: Full=Coat protein

 

It seems that blast+ cuts the description off at the first '>', whereas the CLCBio blast does not. It looks like blast+ does this in several instances, so the two sets of results list different organisms when I count organisms from the descriptions.
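For what it's worth, in reasonably recent BLAST+ releases the `stitle` output field holds only the first title of a merged nr entry, while `salltitles` lists all of them separated by '<>'. If the counts are based on descriptions, a rough Python sketch like the one below can split a merged description into its individual titles and pull out the organism names in square brackets. The function names are hypothetical, and the splitting rule (a '<>' separator, or a '>' preceded by whitespace) is an assumption based on the example above, not a documented format.

```python
import re

# Assumed separators: '<>' (as in salltitles) or whitespace followed by '>'
# (as in the merged description shown above).
TITLE_SEPARATOR = re.compile(r"<>|\s+>")
# An organism name is taken to be a trailing "[...]" at the end of a title.
ORGANISM = re.compile(r"\[([^\[\]]+)\]\s*$")


def split_titles(description):
    """Split a merged description into its individual titles."""
    return [t.strip() for t in TITLE_SEPARATOR.split(description) if t.strip()]


def organisms(description):
    """Collect the organism names found at the end of each title, in order."""
    found = []
    for title in split_titles(description):
        match = ORGANISM.search(title)
        if match:
            found.append(match.group(1))
    return found


merged = (
    "RecName: Full=Coat protein "
    ">gi|3212290|pdb|1A34|A Chain A, Satellite Tobacco Mosaic VirusRNA COMPLEX "
    ">gi|530203|gb|AAA47785.1| coat protein [Tobacco necrosis satellite virus]"
)
print(split_titles(merged))
print(organisms(merged))
```

Comparing the two result sets by subject identifier instead of by description avoids the problem altogether, since the identifier does not depend on how many titles are printed.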

hpmcwill (United Kingdom) wrote, 5.9 years ago:

The CLCBio BLAST settings map onto NCBI BLAST+ options as follows:

  • "Query genetic code: 1 Standard" - default for NCBI BLAST blastx
  • "Limit by entrez query: All organisms" - a basic search uses all sequences in the database, so this is the default behaviour; Entrez filtering only applies when using '-remote'
  • "Filter low complexity" - suggests the use of '-seg' to filter low complexity regions
  • "Expect: 10" - default for NCBI BLAST or set explicitly with '-evalue 10.0'
  • "Word Size: 3" - set with '-word_size 3'
  • "Matrix: BLOSUM62" - default for NCBI BLAST protein searches.
  • "Gap cost: Existence 11, Extension 1" - default for BLOSUM62 scoring matrix
  • "Max number of hit sequences: 3" - probably using '-max_target_seqs 3'

So it looks like you are missing the low complexity filtering parameter, which might explain the missing hits.

Given that the CLCBio BLAST uses the NCBI's BLAST Web Service, you could try running the search with the '-remote' flag to see if it gives different results on the NCBI's service vs. a local search.
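As a rough sketch of both suggestions, the Python snippet below builds the blastx call from the question with low complexity filtering spelled out via '-seg yes', and optionally swaps in '-remote'. The function name `blastx_command` is made up for this example. It drops '-num_threads' for remote searches on the assumption that the two options are incompatible, which is how the blastx help describes them as far as I know.

```python
import shlex


def blastx_command(query, out, db="nr", remote=False, threads=4):
    """Build a blastx call mirroring the CLCBio settings listed above."""
    cmd = [
        "blastx", "-db", db, "-query", query,
        "-evalue", "10", "-matrix", "BLOSUM62", "-word_size", "3",
        "-gapopen", "11", "-gapextend", "1",
        "-seg", "yes",  # make low complexity filtering explicit
        "-max_target_seqs", "3",
        "-outfmt", "10 std stitle",
        "-out", out,
    ]
    if remote:
        # Run the search on NCBI's servers instead of the local database.
        cmd.append("-remote")
    else:
        # Assumption: -num_threads only applies to local searches.
        cmd += ["-num_threads", str(threads)]
    return cmd


# Print a copy-and-paste ready command line for the local search.
print(shlex.join(blastx_command("../results/contigs/CLC-contigs.fa",
                                "../results/blast/blast-005.csv")))
```

Running a handful of contigs both ways should show whether the difference comes from the database or from the search settings. A remote search of every contig is likely to be slow.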

 


Thanks for the reply! My understanding is that blast+ filters low complexity by default: http://rothlab.ucdavis.edu/genhelp/blast+.html#_FILTERING_OUT_LOW_COMPLEXITY_SEQUEN so that shouldn't be a problem here.

written 5.9 years ago by biobio

The default behaviour for filtering is specific to the BLAST program being used. Check the defaults for the program you are using by running with '-help'.

written 5.9 years ago by hpmcwill

Here's what blastx says:

*** Query filtering options
 -seg <String>
   Filter query sequence with SEG (Format: 'yes', 'window locut hicut', or
   'no' to disable)
   Default = `12 2.2 2.5'
 -soft_masking <Boolean>
   Apply filtering locations as soft masks
   Default = `false'
 -lcase_masking
   Use lower case filtering in query and subject sequence(s)?

 

None of these really look like "Filter low complexity" to me but I could be wrong. Is there a place that explains what these are?

 

Edit: actually this [1] seems to suggest that soft_masking is what I am looking for: "masking - Also known as filtering. The removal of repeated or low complexity regions from a sequence in order to improve the sensitivity of sequence similarity searches performed with that sequence."

[1] http://www.ncbi.nlm.nih.gov/books/NBK1763/

written 5.9 years ago by biobio