Are there some faster alternatives to BLAST (specifically nucleotide BLAST)? I like that I can search through all GenBank+EMBL+DDBJ+PDB+RefSeq sequences (nt collection), but I feel like there must be a faster way. If I wanted to identify thousands or millions of sequences, it's somewhat inefficient.
If you have millions of query sequences, it's not a bad idea to cluster them and only BLAST the representative sequences. Furthermore, with millions of query sequences and no cluster at hand, it might be a good idea to select a smaller reference database such as UniRef90 (a protein database, so that would be for a translated search), but this depends on your research questions. I think DIAMOND is one of the most recent BLAST alternatives. As far as I recall, they review some other alternatives in the article (don't have access from home). If you want to do just nucleotide-nucleotide, another option would be BLAT.
Well, DIAMOND is blastx-like, so nucleotide-vs-protein, which is almost always better than nucleotide-nucleotide if you want to detect putative homologs. You haven't really told us anything about your research questions or the type of your query sequences (length, source, etc.), so it's hard to say. Also, BLAT has an output option that is similar to tabular BLAST output, which is the way to go IMO.
Sorry if I was being too vague. I am trying to identify contaminants in raw sequencing data. For example, the reads should be human, but only 50% align to human. What are the other reads? I can check some likely contaminants, but I'd like to check against all known sequences.
Unfortunately, DIAMOND is for protein (not nucleotide) alignment.
Blat is a good suggestion. Not sure how easy it would be to summarize the results.
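On summarizing: if the hits are written in the 12-column tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), a few lines of Python can reduce them to a per-subject count. This is only a rough sketch. The function names and the example rows are made up for illustration, and it assumes tab-separated output with the bit score in the last column.

```python
import csv
from collections import Counter

def best_hits(lines):
    """Keep the highest-bitscore hit per query from 12-column tabular rows."""
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment headers
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best

def count_subjects(best):
    """Count how many queries have each subject as their best hit."""
    return Counter(subject for subject, _ in best.values())

# Made-up rows, for illustration only (not real results).
example = [
    "read1\tsubjA\t98.0\t100\t2\t0\t1\t100\t500\t599\t1e-45\t180",
    "read1\tsubjB\t90.0\t100\t10\t0\t1\t100\t20\t119\t1e-30\t130",
    "read2\tsubjA\t99.0\t100\t1\t0\t1\t100\t900\t999\t1e-48\t185",
    "read3\tsubjB\t97.0\t100\t3\t0\t1\t100\t40\t139\t1e-44\t176",
]
counts = count_subjects(best_hits(example))
print(counts.most_common())
```

For a real run you would pass an open file handle instead of the example list. Mapping subject IDs to organism names is a separate step that this sketch does not attempt.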
If I were you, I would take a small subsample of the reads that do not map to human and BLAST them against nt to see what is going on.
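One possible way to draw that subsample, assuming the unmapped reads are already in a plain four-line-per-record FASTQ file, is reservoir sampling, which picks k reads uniformly in a single pass without loading the whole file. The helper names below are hypothetical and the input is a toy string rather than real data.

```python
import io
import random

def read_fastq(handle):
    """Yield FASTQ records as 4-tuples (assumes unwrapped 4-line records)."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            return  # end of file
        yield tuple(line.rstrip("\n") for line in record)

def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample

# Toy input standing in for a file of unmapped reads.
fastq_text = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(10))
records = list(read_fastq(io.StringIO(fastq_text)))
subset = reservoir_sample(records, k=3, seed=1)

# Print the subsample as FASTA, ready to use as a BLAST query.
for name, seq, _, _ in subset:
    print(f">{name[1:]}\n{seq}")
```

With a real file you would pass `read_fastq(open(path))` straight into `reservoir_sample` so the reads are never all held in memory. A few hundred to a few thousand reads is usually enough to see which organisms dominate.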
See this thread: http://seqanswers.com/forums/showthread.php?t=60696
Hopefully you are not the same person as the originator of that thread.