Question: NCBI BLAST on AWS
0
gravatar for philipp
4 days ago by
philipp10
philipp10 wrote:

Hi,

I need to BLAST 75k sequences against refseg_genomic. I set up an Amazon Server (c3.8xlarge). The instance is running and I uploaded all files of the refseq_genomic database to the /blast/blastdb_custom folder and my sequences in TXT files in the folder /home.

Then I run the following code on my local machine:

import paramiko
from openpyxl import Workbook
from openpyxl import load_workbook

workbook = load_workbook("Sequences.xlsx")
worksheet = workbook.get_sheet_by_name(name = "Sequences")

num_rows = worksheet.max_row

i = 2
k = 1

# Connect to the AWS and BLAST
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(hostname = "XXX.amazonaws.com", port = 22, username = "ubuntu", key_filename = "AWS_KEY.pem", compress = False)

while i < num_rows + 1:

    SequenceID = worksheet.cell(row = i, column = k).value
    Command = "blastn -query " + str(SequenceID) + ".txt -db refseq_genomic -evalue 1 -word_size 11 -gapopen 5 -gapextend 2 -penalty -3 -reward 2 -max_target_seqs 100 -num_threads 32 -outfmt 5 -out RefSeq-" + str(SequenceID) + ".xml"

    print(Command)

    # Execute command on AWS
    ssh_stdin, ssh_stdout, ssh_stderr = ssh.exec_command(str(Command))

    ssh_stdout.readlines()

    i = i+1

print("Finished")

Unfortunately, only the first command in the loop is started but then the computation runs for hours and whether an output is calculated (although the output file is created) nor is there an error message. I think with the computational power BLASTing a sequence should not take longer than a few seconds. Probably something went wrong but I don't know what.

I appreciate any hints what I have to change to get the BLAST running.

Thanks! Philipp

blast aws ncbi • 72 views
ADD COMMENTlink written 4 days ago by philipp10

I don't know a lot of python, but is there a chance "while i < num_rows + 1" is doing the < before the +1, and converting while(false+1) into an infinite loop?

Anyway, why not log into the remote instance with a terminal and look at what it's doing instead of using any local python at all?

ADD REPLYlink written 4 days ago by karl.stamm3.1k

I would ssh in and run the blast with a set ID. For example:

SequenceID = worksheet.cell(row = 2, column = k).value

That could tell you if the blast works?

Just curious why you're looping? For each call to blastn the whole database is loaded. It might be faster if cat your inputs and use that the query. You can then split/ parse the XML as needed.

ADD REPLYlink written 3 days ago by Jacob Warner290

I'm very curious what you mean with "cat your inputs and use that the query". Could you briefly elaborate that? I have 75k sequences that I have to blast against the refseq_genomic database. I only know the sequence but have no further information like species etc.

For me (not an IT specialist) the logical approach was to loop through all the sequences. If you have a better solution that would be great since I'm stuck at this point for quite a while now.

Thanks!

ADD REPLYlink written 1 day ago by philipp10

Hi Philipp, It appears that you have one text file per sequence since you are looping over the files with + str(SequenceID) + ".txt. I am assuming that these are in fasta format:

">Sequence_iD

ATCGCCGA..."

If that's the case, you could concatenate (cat) the files together into one multifasta, then you can pass this file to Blast.

ADD REPLYlink modified 1 day ago • written 1 day ago by Jacob Warner290
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 984 users visited in the last hour