Efficient strategy for remote blasting of 3000 sequences
9.7 years ago

Hi,

I have a single fasta file containing 3136 partial 18S rRNA gene sequences (on average 235 nucleotides long and never longer than 260 or shorter than 185 nt) for which I would like to get the top 10 blast hits against the nt database in table format. Preferably I would also like to get their source organism taxonomy (GB field) in the same table, but let's consider that optional for now.

I am contemplating what would be the best strategy for this.

I would prefer not to have to download the entire nr database to my computer.

I am therefore currently considering two strategies:

  1. Use the NCBI BLAST+ suite as described here: http://www.ncbi.nlm.nih.gov/books/NBK1763/ under "BLAST+ remote blast". The issue is that it is described there only for a single sequence submission. With my number of sequences, a large number of RIDs get generated automatically, and I do not want to format those manually afterwards. But I am afraid there is no way to avoid this?
  2. Alternatively, I could use BioPerl or Biopython to run a remote BLAST loop and try to format the output appropriately.

Which of these two strategies would be most efficient?

Any pointers are warmly welcome...

Kind regards.

FM

edit: the sequences are fungal 18S reads from which consensus sequences for OTUs were obtained with mothur. Half of the original dataset could not be classified deeper than "eukaryotes" with SILVA or RDP. Therefore, I am looking for the closest BLAST matches in the nt database. Maybe I could also consider the env_nt database, but first I'd like to check out nt.

blast • 5.2k views
ADD COMMENT
9.7 years ago
pld 5.1k

From the NCBI website:

http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=DeveloperInfo

Your best bet is to break your queries into multiple files with <= 10 kbp each and submit each one as an individual query. You have almost 1 Mbp, so I'd break it into 10 separate files. I'm guessing BLAST has optimizations for running multiple queries at once.
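A minimal sketch of the splitting step in plain Python, in case it helps. The function names, the `batch_NNN.fasta` file naming and the 10,000 bp default are my own choices for illustration, not anything prescribed by NCBI.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batch_by_length(records, max_bp=10_000):
    """Group records into batches whose summed sequence length stays <= max_bp.

    A single record longer than max_bp still gets a batch of its own.
    """
    batch, total = [], 0
    for header, seq in records:
        if batch and total + len(seq) > max_bp:
            yield batch
            batch, total = [], 0
        batch.append((header, seq))
        total += len(seq)
    if batch:
        yield batch


def write_batches(fasta_path, prefix="batch", max_bp=10_000):
    """Write prefix_001.fasta, prefix_002.fasta, ... and return the file names."""
    names = []
    with open(fasta_path) as handle:
        batches = batch_by_length(read_fasta(handle), max_bp)
        for i, batch in enumerate(batches, 1):
            name = f"{prefix}_{i:03d}.fasta"
            with open(name, "w") as out:
                for header, seq in batch:
                    out.write(f">{header}\n{seq}\n")
            names.append(name)
    return names
```

For 3136 sequences of about 235 nt each (roughly 737 kbp in total), `max_bp=10_000` gives around 75 files. Raise `max_bp` if you want fewer, larger batches.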

You may also want to select the tabular output format unless you need XML formatted results. BLAST has options to further adjust the output format, just check the documentation.
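Once you have tabular results, picking the top 10 hits per query takes a few lines of Python. This is only a sketch: it assumes the default 12-column tabular layout (`-outfmt 6`), skips the `#` comment lines that `-outfmt 7` adds, and ranks by bit score, which is my choice of metric. `parse_tabular` and `top_hits` are illustrative names.

```python
from collections import defaultdict

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_tabular(lines, columns=COLUMNS):
    """Yield one dict per hit line; blank lines and '#' comments are skipped."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        hit = dict(zip(columns, line.split("\t")))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        yield hit


def top_hits(hits, n=10):
    """Return {query: best n hits}, one row per subject, ranked by bit score."""
    per_query = defaultdict(dict)
    for hit in hits:
        best = per_query[hit["qseqid"]]
        prev = best.get(hit["sseqid"])
        # A subject can appear several times (multiple HSPs); keep its best row.
        if prev is None or hit["bitscore"] > prev["bitscore"]:
            best[hit["sseqid"]] = hit
    return {
        query: sorted(subjects.values(),
                      key=lambda h: h["bitscore"], reverse=True)[:n]
        for query, subjects in per_query.items()
    }
```

Usage would be something like `top_hits(parse_tabular(open("results.tsv")))`, where `results.tsv` is a placeholder file name.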

Be very careful to pay attention to the times for large loads and do not send more than one query every three seconds. The remote NCBI tools are great, but when you overload them they simply fail. It can be a massive pain, so be careful. For these bigger jobs I'd wait maybe a few minutes in between sending each query just to be safe.

Outside of that, there's really not much you can do in the way of optimization. Even if your jobs only take a fraction of a second, you can only send one every 3 seconds. For your 3k sequences submitted individually, this is going to take at best 2.5 hours.
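To make the spacing concrete, here is a rough Python sketch of a throttled submission loop plus the time estimate. `submit` is a placeholder for whatever actually sends one query (a Biopython call, a shell command, etc.); it is not a real NCBI function. The `clock` and `sleep` parameters exist only so the loop can be tried out without really waiting.

```python
import time


def run_throttled(jobs, submit, min_interval=3.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Call submit(job) for each job, starting calls >= min_interval s apart."""
    results = []
    last_start = None
    for job in jobs:
        if last_start is not None:
            # Wait out whatever remains of the interval since the last call.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(submit(job))
    return results


def minimum_hours(n_requests, min_interval=3.0):
    """Lower bound on wall-clock time if every request is sent separately."""
    return n_requests * min_interval / 3600


if __name__ == "__main__":
    print(f"{minimum_hours(3000):.1f} h for 3000 single-sequence requests")
```

Sending batches instead of single sequences shrinks `n_requests`, which is the whole point of splitting the file into a few larger queries.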

ADD COMMENT
9.7 years ago
5heikki 11k

Why would you want to blast 16S rRNA sequences against a protein database (nr)? There would be no point whatsoever in blasting them against the non-redundant nucleotide database either (it's called nt). Your reference database should obviously be a 16S database, e.g. GreenGenes or Silva. Also, blast is a very poor way to assign taxonomy to short 16S reads. You should look into installing QIIME or Mothur on your computer. Basically, you first cluster your reads into OTUs at some percent identity (like 97% similarity), and then you assign taxonomy to the OTUs. There are easy-to-follow guides for both QIIME and Mothur on their respective sites.

ADD COMMENT

Hi, I clearly made a few mistakes in writing down my question. I am actually blasting fungal 18S consensus sequences obtained by denoising and cleaning up raw 454 flowgrams with mothur and getting the representative sequence for each OTU. The issue is that the common databases such as SILVA, GreenGenes and RDP fail to properly classify half of my dataset, so I am looking for the closest blast match in the nt database, not nr (my bad, I must have typed it wrongly).

I am very well aware of the mothur and QIIME guides and prefer mothur myself for analysis. I am also aware that I can use blast as an alignment engine in mothur, but not for classification or remote blasting.

But thank you for helping me make my question more accurate!

ADD REPLY

Ah, ok. With fungal ITS best-hit blast assignment makes sense. Your reference db should then be UNITE. There's a QIIME tutorial for fungal ITS here. Mothur has a dedicated site for UNITE as well.

ADD REPLY

Thanks for the tip but my primers don't target ITS, they target 18S (The primers used were: NS1 5’-GTA GTC ATA TGC TTG TCTC and Fung 5’-ATT CCC CGT TAC CCG TTG). Hence I need a SSU reference database such as SILVA, which I already tried and failed to classify half of my sequences.

ADD REPLY

If you've already run a typical analysis for your goals and you are seeing less than 50% classification, it might be worth checking that there isn't any contamination or problems in your data.

BLAST might be a good start: you could search the nt database and restrict the search to fungal sequences. Remove reads without significant hits (e.g. expect > 1e-10) and run your analysis again to see if the rate improves. It would be better to search against the whole nt database and exclude anything with either non-significant hits against fungi or significant hits against non-fungal species; however, this might be problematic when running BLAST remotely.

Are you sure you can't run it locally? As long as you have enough RAM to load the database, you should not have any trouble. If you have enough memory, you can run multiple instances of BLAST simultaneously.
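For reference, a sketch of what the command line could look like, built from Python so it is easy to script. The options used (`-query`, `-db`, `-out`, `-outfmt`, `-max_target_seqs`, `-evalue`, `-num_threads`, `-remote`) are standard BLAST+ options; the file names, the thread count and the thresholds are placeholders. I am assuming the taxonomy columns (`sscinames`, `sskingdoms`) only get filled in when BLAST can find its taxonomy database, so check the output before relying on them.

```python
import shlex

# Standard 12 tabular fields plus taxonomy fields for each subject.
FIELDS = ("qseqid sseqid pident length mismatch gapopen qstart qend "
          "sstart send evalue bitscore staxids sscinames sskingdoms")


def blastn_command(query, out, db="nt", max_hits=10, evalue=1e-10,
                   threads=4, remote=False):
    """Build a blastn argument list for a local (default) or remote search."""
    cmd = ["blastn", "-query", query, "-db", db, "-out", out,
           "-outfmt", f"6 {FIELDS}",
           "-max_target_seqs", str(max_hits),
           "-evalue", str(evalue)]
    if remote:
        # Remote searches run on NCBI's servers, so no thread option is passed.
        cmd.append("-remote")
    else:
        cmd += ["-num_threads", str(threads)]
    return cmd


if __name__ == "__main__":
    # Print the command for inspection; run it with subprocess.run(cmd, check=True).
    print(shlex.join(blastn_command("otus.fasta", "otus_vs_nt.tsv")))
```

Passing `remote=True` gives the equivalent remote search with the same output format, which is handy for comparing the two approaches on a small test file first.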

ADD REPLY

I will investigate this possibility. I have 32 GB of RAM; I hope that will be enough to load the nt database.

How would you encode your conditions (i.e. exclude anything with either non-significant hits against fungi or significant hits against non-fungi species) in a local blastn search syntax-wise?

Thanks in advance.

ADD REPLY

I forget how large the nt database is. Just start a local query against the nt database and see how much memory it uses.

BLAST doesn't have much in the way of filtering results, I usually handle this later in SQL.

In the documentation, for tabular and CSV formatted results, you can have taxonomic information for the subject stored with the results. It looks like you can store the super kingdoms for hits, as well as the scientific names. So this should provide an axis to select on, then rank by your blast hit metric of choice.
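If you would rather do the filtering in Python than in SQL, the two conditions could be sketched like this. It assumes each hit is a dict with `qseqid`, `evalue` (as a float) and `staxids` (the taxonomy ID column, semicolon-separated when a subject has several). Note that fungi fall under the Eukaryota superkingdom, so the superkingdom column alone will not separate them; `is_fungal` is therefore a placeholder you have to supply, e.g. a lookup in a list of fungal taxids that you prepare yourself.

```python
def classify_queries(hits, is_fungal, max_evalue=1e-10):
    """Split query IDs into (kept, rejected).

    A query is kept only if it has at least one significant fungal hit
    and no significant non-fungal hit.
    """
    fungal, non_fungal, seen = set(), set(), set()
    for hit in hits:
        query = hit["qseqid"]
        seen.add(query)
        if hit["evalue"] > max_evalue:
            continue  # non-significant hits are ignored entirely
        taxids = hit["staxids"].split(";")
        if all(is_fungal(taxid) for taxid in taxids):
            fungal.add(query)
        else:
            non_fungal.add(query)
    kept = fungal - non_fungal
    return kept, seen - kept


if __name__ == "__main__":
    # Tiny illustrative taxid set: 4932 = Saccharomyces cerevisiae,
    # 5476 = Candida albicans. A real list would be far longer.
    fungal_taxids = {"4932", "5476"}
    example = [
        {"qseqid": "otu1", "evalue": 1e-50, "staxids": "4932"},
        {"qseqid": "otu2", "evalue": 1e-50, "staxids": "9606"},
    ]
    print(classify_queries(example, fungal_taxids.__contains__))
```

Queries with no hits at all never show up in tabular output, so compare the kept set against your full list of OTU IDs to catch those.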

ADD REPLY
9.7 years ago
Ram 43k

Hi,

Maybe use Galaxy to upload your query sequence and tweak parameters to get your results? That way you'll avoid downloading as well as sending multiple queries. I am not 100% sure it is possible, but it should theoretically work.

Do let me know!

EDIT: This thread might be of help.

ADD COMMENT

I am not very familiar with the Galaxy framework. Can I blast with it? How so? Is there a tutorial on blasting multiple sequences with Galaxy?

ADD REPLY

Install Galaxy locally on your computer. Install BLAST+ locally as well. Then download and install the BLAST+ wrappers for Galaxy, and use Galaxy to automate your BLAST runs. That is just the general overview; you will have to work out the details yourself, but it's not hard. Let me know if you run into any problems.

ADD REPLY
