6.8 years ago
maciwuk
I have 60,000 sequences that I want to BLAST against the default 'Nucleotide collection (nt/nr)' database.
Is it possible to do this without setting up a standalone, local version of BLAST? (I do of course have BLAST (blast+ 2.6.0) installed, but I wonder if it is possible to run the search non-locally.)
blastn -db nt -query input-sequences.fasta -remote -out blast_output.out
I get quite a huge list of errors that contain strings such as: Unavailable feature GNUTLS, Failed to initialize secure session, Service not found, stack is empty, etc.
QUESTIONS:
- Am I doing something wrong in my command?
- Is it faster to build a local database and search locally on my own computer for such a large number of sequences?
I don't recollect if v. 2.6.0 moved to using https connections. NCBI has completely moved to using https for all connectivity, so upgrading to the latest blast (v. 2.7.1) may not be a bad idea.

If you need to blast 60K sequences then consider doing those in chunks. You don't want to abuse your privileges at NCBI by sending a massive amount of blast searches their way. Consider using a loop, building in sleep times, etc.
If you have enough local resources available then doing the search locally will give you more control over things.
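To make the chunking suggestion concrete, here is a minimal sketch of splitting the query file and looping over the pieces with a pause in between. The function names, the chunk file naming, and the default pause are assumptions for illustration only, not NCBI-recommended values; check NCBI's current usage guidelines before picking numbers.

```shell
# Split a FASTA file into chunks of at most N records each.
# Output files are named chunk_0001.fasta, chunk_0002.fasta, ... (naming is an assumption).
split_fasta() {
    local input="$1" per_chunk="$2" outdir="$3"
    mkdir -p "$outdir"
    awk -v n="$per_chunk" -v dir="$outdir" '
        /^>/ {
            # Start a new chunk file every n-th header line.
            if (count % n == 0) {
                if (out) close(out)
                out = sprintf("%s/chunk_%04d.fasta", dir, count / n + 1)
            }
            count++
        }
        out { print > out }
    ' "$input"
}

# Run a remote blastn search for each chunk, sleeping between submissions.
# The 60-second default pause is a placeholder, not an official limit.
blast_chunks_remote() {
    local chunk_dir="$1" pause_seconds="${2:-60}"
    local chunk
    for chunk in "$chunk_dir"/chunk_*.fasta; do
        [ -e "$chunk" ] || continue
        blastn -db nt -query "$chunk" -remote -out "${chunk%.fasta}.out"
        sleep "$pause_seconds"
    done
}
```

Something like `split_fasta input-sequences.fasta 100 chunks` followed by `blast_chunks_remote chunks 60` would then work through the file in batches of 100 sequences.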
Great. I will update to 2.7.1 and will try again. I have access to a computer with 192 GB RAM and 12 physical CPU cores (each @ 2.2 GHz). Do you think BLASTing 60 thousand sequences will take a substantial amount of time?

What kind of sequences are these? NGS or regular fasta? You may want to use DIAMOND (since you have enough resources available locally) instead of blast. That can speed things up significantly.
These are short DNA sequences (all between 15 and 30 nt) extracted directly from the UCSC hg38 and mm10 fasta (chromosome) files. They have some modifications introduced, where usually one nucleotide is replaced by either 'H' (not G) or 'N' (any nucleotide). Supposing a certain sequence is from chromosome 1 of hg38, I want to know whether my modified sequence can also be found on a chromosome other than chr1. I simply want to do a BLAST search to see if I can match any of these sequences to any other chromosome with 100% similarity, where the matched hit is NOT on the chromosome my sequence was originally taken from. The reasons BLAST fits this situation so well are that (1) it can optimize the alignment and cut a few nucleotides from each end (which is exactly what I want, because I am also interested in shorter arms at both ends of the sequence, so trimming a few nucleotides from each end is more than fine), and (2) BLAST is fine with the 'N' and 'H' nucleotides I have introduced in my sequences and can deal with them in a way that is highly applicable to my end goal. For this reason, I thought BLAST would be even faster than a regular expression search. Though I am still not sure whether I should do it locally!
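The "hit must be on a different chromosome" requirement can be applied as a post-filter on tabular BLAST output (`-outfmt 6`). The sketch below assumes, hypothetically, that each query ID starts with its chromosome of origin followed by a colon (for example `chr1:100-120`) and that subject IDs are plain chromosome names. How BLAST counts 'N' or 'H' positions toward percent identity is not confirmed here, so the identity threshold is left as a parameter rather than fixed at 100.

```shell
# Keep only hits whose subject chromosome differs from the query's chromosome
# of origin and whose percent identity is at least the given threshold.
# Reads -outfmt 6 lines on stdin: column 1 = query ID, 2 = subject ID, 3 = % identity.
# Assumes (hypothetically) query IDs look like "chr1:100-120".
filter_other_chromosome() {
    local min_identity="${1:-100}"
    awk -F'\t' -v min="$min_identity" '
        {
            split($1, parts, ":")   # parts[1] = chromosome the query came from
            if ($2 != parts[1] && $3 + 0 >= min + 0) print
        }
    '
}
```

For example, `filter_other_chromosome 100 < blast_output.tsv > off_chromosome_hits.tsv`, where both file names are placeholders.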
Ah sorry, then DIAMOND would not be an option. I suggest doing a blastn search locally against a smaller subset (the mouse and human genomes) rather than the entire nt. That will help speed things up.

Remember to use -task blastn-short since you have short sequences.

Edit: Blat from UCSC may be very fast, but I am not sure if it will handle IUPAC codes. Look into it as well.
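A minimal sketch of the local route suggested above: build a nucleotide database from a genome FASTA file, then search it with the short-query task. The function names and any file or database names passed to them are placeholders; the thread count would be chosen to match the machine.

```shell
# Build a nucleotide BLAST database from a genome FASTA file.
build_genome_db() {
    local genome_fasta="$1" db_name="$2"
    makeblastdb -in "$genome_fasta" -dbtype nucl -out "$db_name"
}

# Search short queries against a local database and write tabular output.
# BLAST+ options take a single dash, e.g. -task rather than --task.
search_short() {
    local query="$1" db_name="$2" out_file="$3" threads="${4:-1}"
    blastn -task blastn-short \
        -db "$db_name" \
        -query "$query" \
        -outfmt 6 \
        -num_threads "$threads" \
        -out "$out_file"
}
```

With a hypothetical genome file `hg38.fa`, this could be run as `build_genome_db hg38.fa hg38_db` and then `search_short input-sequences.fasta hg38_db hits_hg38.tsv 12`, and repeated for the mouse genome.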