Question

Alternative to local blastplus to blast 10,000s of sequences on nr, swissprot and nt

13.5 years ago

Eric Normandeau 11k

Hi,

Here is a case scenario that happens quite often to me: I need to blast from 1,000 to 20,000 sequences in order to find the proteins these sequences code for. These sequences come from fish cDNA libraries, so I expect most of them, although not all, to code for proteins.

I presently use 'blastplus' locally to query both swissprot and nr, but this approach is not so satisfactory for a few reasons:

- It is very slow (up to a few days for nr)
- I would also like to query nt (I did not succeed, it took much too long)
- With a faster method, I would consider blasting a number of sequences a few orders of magnitude higher

I was investigating the Usearch set of tools, but the ublast method cannot do the equivalent of a blastx, that is, searching a protein database with translated nucleotide queries.

What method would you suggest?

Cheers!

sequence blast software • 6.9k views

ADD COMMENT • link updated 13.5 years ago by Darked89 4.6k • written 13.5 years ago by Eric Normandeau 11k

Have you tried translating each cDNA sequence into protein and then using just the longest ORF as the BLAST query? This should already speed things up about 3 times.
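
A minimal sketch of the longest-ORF idea, in plain Python with no external libraries. It assumes the standard genetic code and treats an ORF as the longest stop-free stretch in any of the six frames (no start codon required, since cDNA reads are often partial). The function names are illustrative only. Keep in mind that frameshifts from sequencing errors can cut the real coding region short, so the longest ORF will not always be the right protein.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame; codons with ambiguous bases become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def longest_orf(dna):
    """Return the longest stop-free peptide found in any of the six frames."""
    dna = dna.upper()
    best = ""
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            # Stops are translated to '*', so splitting on it yields open stretches.
            for peptide in translate(strand[offset:]).split("*"):
                if len(peptide) > len(best):
                    best = peptide
    return best
```

The resulting peptides can be written to a FASTA file and searched with blastp instead of blastx.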

ADD REPLY • link 13.5 years ago by Michael Schubert ★ 7.1k

score 4 · Answer 1 · 2010-10-29

Salut Eric,

I would stick with blast if possible. It's the one standard thing everyone (reviewers!) is familiar with.

1. Get access to a "big" server (there must be some in Laval!). I'm running that kind of blast on a 24-core machine all the time. It makes things a lot faster (and keeps my MacBook from overheating!)
2. Keep only the top hit: the more you output, the more details blast needs to calculate (e.g. I think it optimizes the local alignment if displayed)
3. Make the e-value cutoff more stringent (same reason as 2.)
4. Do you need to search against nr at all? How about "only" swissprot + some fish datasets? It's unlikely that the 12th Dipteran proteome will add much info you don't already have from the other 11.
5. Changing the word size has a huge impact on blast speed (longer = faster), but you'll also lose some sensitivity.
6. Do you need to query nt with all of your sequences, or only with those that didn't have a protein-db match?
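
As a rough illustration of the tuning suggestions above, here is one way to assemble a blastx call from Python. The flags are standard BLAST+ options. The file names, the database name and the default values (24 threads, e-value cutoff of 1e-10, one hit per query) are placeholders, not values recommended in this thread. Valid word sizes depend on the program and BLAST+ version, so none is set by default.

```python
import shlex


def build_blastx_command(query_fasta, database, output_tsv,
                         threads=24, evalue=1e-10, max_hits=1, word_size=None):
    """Build a blastx argument list with a strict cutoff and minimal output."""
    cmd = [
        "blastx",
        "-query", query_fasta,
        "-db", database,
        "-out", output_tsv,
        "-outfmt", "6",                     # tabular output, one line per hit
        "-evalue", str(evalue),             # stricter cutoff = fewer hits to process
        "-max_target_seqs", str(max_hits),  # limit the number of subjects reported
        "-num_threads", str(threads),       # use the cores of a big server
    ]
    if word_size is not None:
        # Larger words are faster but less sensitive.
        cmd += ["-word_size", str(word_size)]
    return cmd


if __name__ == "__main__":
    # Hypothetical file and database names.
    cmd = build_blastx_command("fish_cdna.fasta", "swissprot", "fish_vs_swissprot.tsv")
    print(shlex.join(cmd))
    # To actually launch the search: subprocess.run(cmd, check=True)
```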

++ y

score 1 · Answer 2 · 2010-10-28

13.5 years ago

Rm 8.3k

To scale up blast runs, you can use Timelogic "Tera Blast" on DeCypher® FPGA Biocomputing Systems.

It's a commercial one, though.

We recently implemented one such system at our department with multiple Acceleration cards.

(If you use Ublast: translate the nucleotide sequences and then run against protein database.)

Adding: FastHMM and FastBLAST: Tools for Analyzing Large Protein Sequence Databases

I haven't tried it.

ADD COMMENT • link 13.5 years ago by Rm 8.3k

Hi RaghuM, I would much prefer a free solution, but I'll have a look at your proposed software. Concerning Ublast, you suggest that I make all the 6 possible proteins out of my sequences and then ublast them on my protein database in fasta format? Cheers

ADD REPLY • link 13.5 years ago by Eric Normandeau 11k

Yes, translate to six frames and search. I have added the FastBLAST link; see if it is useful to you.

ADD REPLY • link 13.5 years ago by Rm 8.3k

Ram · Answer 3 · 2010-10-29

1. Reduce the query set by:
   - filtering, e.g. using seqclean
   - checking for possible retroelements and ribosomal RNA in your EST set
   - clustering them, e.g. at 90% identity using uclust, or doing a quick and dirty assembly using e.g. cap3
2. Reduce the database size (see Yannic's post). Use e.g. UniRef instead of nr, possibly reduced further.
3. Perform a two-step search: first search against a clustered set of all known fish or vertebrate proteins, set a threshold, then blast everything that did not find a strong hit against a larger database. This is suitable for an EST set not contaminated by other DNA. I have seen plant(?) ESTs hitting genomic bacterial contigs.
4. Consider using a cluster and possibly another implementation of blast. See here.
5. Not sure if it works, but according to this page, you may use -q=dnax and -t=prot for blastx-like blat searches.
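
The two-step search could be sketched like this: after a first pass against the small database, collect the queries that already have a strong hit and pass only the remaining ones on to the larger database. This assumes the first pass was saved in default BLAST+ tabular format (-outfmt 6), where column 1 is the query ID and column 11 the e-value. The 1e-20 threshold and the function names are arbitrary placeholders.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def queries_with_strong_hit(tabular_lines, max_evalue=1e-20):
    """Return the IDs of queries with at least one hit at or below max_evalue."""
    matched = set()
    for line in tabular_lines:
        if line.startswith("#"):
            continue  # skip comment lines, if any
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete 12-column record
        if float(fields[10]) <= max_evalue:
            matched.add(fields[0])
    return matched


def unmatched_records(fasta_lines, matched_ids):
    """Yield the FASTA records whose query ID is not in matched_ids."""
    for header, sequence in read_fasta(fasta_lines):
        # BLAST reports the first whitespace-delimited word of the header as the ID.
        query_id = header.split()[0] if header else ""
        if query_id not in matched_ids:
            yield header, sequence
```

The records returned by unmatched_records can then be written to a new FASTA file and used as the query set for the second, slower search.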

Edit: reformatted for clarity