Question: Alternative to local blastplus to blast 10,000s of sequences on nr, swissprot and nt
Eric Normandeau (Quebec, Canada) wrote, 7.1 years ago:

Hi,

Here is a scenario I run into quite often: I need to blast 1,000 to 20,000 sequences to find the proteins they code for. These sequences come from fish cDNA libraries, so I expect most of them, although not all, to code for proteins.

I presently use 'blastplus' locally to query both swissprot and nr, but this approach is unsatisfactory for a few reasons:

  1. It is very slow (up to a few days for nr)
  2. I would also like to query nt (I tried, but it took far too long)
  3. With a faster method, I would consider blasting a number of sequences a few orders of magnitude higher

I was investigating the Usearch set of tools, but the ublast method cannot do the equivalent of a blastx, that is, search translated nucleotide queries against a protein database.

What method would you suggest?

Cheers!

Tags: software, sequence, blast

Have you tried translating each cDNA sequence into protein and then using just the longest ORF to BLAST? This should already speed things up about 3 times.

Reply written 7.1 years ago by Michael Schubert
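
The longest-ORF idea in the reply above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: it uses the standard genetic code, treats an "ORF" as the longest stop-free stretch in any of the six frames (ESTs are often partial, so no start codon is required), and the function names are made up for this example. The resulting peptides would then be searched with a protein-vs-protein search instead of blastx.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame; unknown codons (e.g. with N) become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def longest_orf(dna):
    """Longest stop-free peptide over all six frames (hypothetical helper)."""
    best = ""
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            for segment in translate(strand[offset:]).split("*"):
                if len(segment) > len(best):
                    best = segment
    return best
```

For example, `longest_orf("ATGGCTTTAAAAGGT")` picks the forward frame-1 peptide, because every other frame is either shorter or interrupted by a stop codon.
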
Yannick Wurm (Queen Mary University London) wrote, 7.1 years ago:

Salut Eric,

I would stick with blast if possible. It's the one standard thing everyone (reviewers!) is familiar with.

  1. Get access to a "big" server (there must be some in Laval!). I run this kind of blast on a 24-core machine all the time. It makes things a lot faster (and keeps my macbook from overheating!)
  2. Keep only the top hit: the more you output, the more details blast needs to calculate (e.g. I think it optimizes the local alignment only if it is displayed)
  3. Use a more stringent e-value cutoff, i.e. lower the maximum e-value that is reported (same reason as 2.)
  4. Do you need to search against nr? How about "only" swissprot + some fish datasets? It's unlikely that the 12th Dipteran proteome will add much info you don't already have from the other 11...
  5. Changing the word size has a huge impact on blast speed (longer = faster), but you'll also lose some sensitivity.
  6. Do you need to query nt with all of your sequences, or only with those that didn't have a protein-db match?

++ y
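
As a rough sketch of how points 1, 2, 3 and 5 translate into a BLAST+ call, the Python snippet below assembles a blastx command line. The flags used (`-num_threads`, `-max_target_seqs`, `-evalue`, `-word_size`, `-outfmt 6`) are standard BLAST+ options; the file names, the database name and the default values are placeholders chosen for illustration, not recommendations from the thread.

```python
import subprocess


def build_blastx_command(query, db, out, threads=24, evalue=1e-5,
                         max_target_seqs=1, word_size=None):
    """Assemble a blastx command line as a list of arguments.

    All default values here are illustrative assumptions.
    """
    cmd = [
        "blastx",
        "-query", query,
        "-db", db,
        "-out", out,
        "-outfmt", "6",                            # tabular output
        "-num_threads", str(threads),              # point 1: use many cores
        "-max_target_seqs", str(max_target_seqs),  # point 2: few hits per query
        "-evalue", str(evalue),                    # point 3: stringent cutoff
    ]
    if word_size is not None:
        cmd += ["-word_size", str(word_size)]      # point 5: longer = faster
    return cmd


def run_blastx(**kwargs):
    """Run blastx; requires BLAST+ to be installed and on the PATH."""
    subprocess.run(build_blastx_command(**kwargs), check=True)
```

A call such as `run_blastx(query="ests.fasta", db="swissprot", out="ests_vs_sp.tsv")` would then launch the search, assuming those (hypothetical) files and a formatted database exist.
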


Hi Yannick. All very sensible suggestions that I'll implement. I'm in the process of gaining access to a new supercomputer we got on campus, so maybe I'll try to use it for that purpose; otherwise I'll use the 48 old cores we have at the Institute to do the job. I am reblasting everything (even the sequences with matches) on nr, but I'll follow your suggestion and mask it to keep only the vertebrates, at most. Thanks again!

Reply written 7.1 years ago by Eric Normandeau
Rm (Danville, PA) wrote, 7.1 years ago:

To scale up blast runs, you can use Timelogic "Tera Blast" (DeCypher® FPGA Biocomputing Systems).

It's a commercial one, though.

We recently implemented one such system at our department with multiple acceleration cards.

(If you use Ublast: translate the nucleotide sequences and then run them against the protein database.)

Adding: FastHMM and FastBLAST: Tools for Analyzing Large Protein Sequence Databases

I haven't tried it.


Hi RaghuM, I would much prefer a free solution, but I'll have a look at your proposed software. Concerning Ublast, are you suggesting that I translate my sequences in all six possible reading frames and then ublast the resulting proteins against my protein database in fasta format? Cheers

Reply written 7.1 years ago by Eric Normandeau

Yes, translate to six frames and search. I have added a FastBLAST link; see if it is useful to you.

Reply written 7.1 years ago by Rm
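
The six-frame translation discussed in this exchange could look something like the Python sketch below. It assumes the standard genetic code; the frame labels (`+1` to `-3`) and the FASTA header layout are arbitrary choices made for this example. Stop codons are left as `*` in the output, and whether a given search tool accepts them, or needs the peptides split at stops first, would have to be checked.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def translate(dna):
    """Translate one reading frame; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return a list of (frame_label, protein) pairs for all six frames."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = []
    for sign, strand in (("+", forward), ("-", reverse)):
        for offset in range(3):
            frames.append((f"{sign}{offset + 1}", translate(strand[offset:])))
    return frames


def fasta_records(name, dna):
    """Yield one FASTA-formatted record per reading frame."""
    for label, protein in six_frame_translations(dna):
        yield f">{name}_frame{label}\n{protein}"
```

Writing `"\n".join(fasta_records("est1", sequence))` for every input sequence gives a protein FASTA file that can be searched against a protein database.
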
Darked89 (Barcelona, Spain) wrote, 7.1 years ago:

1) Reduce the query set by:

  • filtering, e.g. using seqclean
  • checking for possible retroelements and ribosomal RNA in your EST set
  • clustering the sequences, e.g. at 90% identity using uclust, or doing a quick and dirty assembly using e.g. cap3

2) Reduce the database size (see Yannick's post). Use e.g. UniRef instead of nr, possibly reduced further.

3) Perform a two-step search: first search against a clustered set of all known fish or vertebrate proteins, set a threshold, then blast everything that did not find a strong hit against a larger database. This is suitable for an EST set that is not contaminated by other DNA. I have seen plant(?) ESTs hitting genomic bacterial contigs.

4) Consider using a cluster and possibly another implementation of blast. See e.g.:

http://openwetware.org/wiki/Wikiomics:BLAST_tutorial#BLAST_implementations

5) Not sure if it works, but according to this page:

http://falcon.roswellpark.org:9090/goldenPath/help/blatSpec.html

you may use -q=dnax and -t=prot for blastx-like blat searches.

Edit: reformatted for clarity
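
The two-step search in point 3 needs a way to decide which queries go on to the second, larger database. A minimal Python sketch is given here, assuming the first search was written in BLAST tabular format (`-outfmt 6`, where column 1 is the query ID and column 11 the e-value). The threshold of 1e-10 and the function names are arbitrary assumptions made for this example.

```python
def queries_with_strong_hit(tabular_lines, max_evalue=1e-10):
    """Collect query IDs having at least one hit at or below max_evalue.

    Expects BLAST tabular lines (outfmt 6): qseqid is column 1,
    evalue is column 11.
    """
    strong = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            strong.add(fields[0])
    return strong


def needs_second_pass(query_ids, tabular_lines, max_evalue=1e-10):
    """Return, in input order, the queries lacking a strong first-pass hit."""
    strong = queries_with_strong_hit(tabular_lines, max_evalue)
    return [qid for qid in query_ids if qid not in strong]
```

The returned IDs can be used to write a reduced FASTA file for the second search. Queries with only weak hits and queries with no hits at all are treated the same way.
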


Hi darked89. Thanks for the additional info. I'll look into the other blast implementations if the other suggestions are not totally satisfying. Cheers

Reply written 7.1 years ago by Eric Normandeau