Question: How To Blast Hundreds Of Fasta Files Using Local Database And Perl
3
gravatar for zhaolin20042008
5.6 years ago by
zhaolin2004200830 wrote:

I have to blast a zip file including hundreds of protein fasta files. Since it is impossible to blast them one by one, I plan to use perl to blast in a local database. I am new to perl and look for models online. My question is how to unite a unzip perl with a blast perl.

  1. Is $name in unzip.pl the result of unzipped fasta? what should I do at # Do something here ?
  2. $f in BlastList.pl is the query. How could I change it to connect result of unzip.pl.
  3. Is any other solution better than this one? Thank you

Here is unzip.pl .

my $zipfile = "fasta.zip";
my $u = new IO::Uncompress::Unzip $zipfile
or die "Cannot open $zipfile: $UnzipError";
my $status;
for ($status = 1; $status > 0; $status = $u->nextStream())
{
my $name = $u->getHeaderInfo()->{Name};
warn "Processing member $name\n" ;
my $buff;
while (($status = $u->read($buff)) > 0) {
**# Do something here**
}
last if $status < 0;
}
die "Error processing $zipfile: $!\n"
if $status < 0 ;

Here is the blast model using local database: BlastList.pl.

$DBNAME = "mydb";   
$dirtoget="C:/Blast";
opendir(IMD, $dirtoget) || die("Cannot open directory");
# delete the old "DosBlast.bat" 
$dosfile = "DosBlast.bat";
unlink($dosfile);
# Get the list of the new sequence files to blast
@thefiles= readdir(IMD);
closedir(IMD);
 # Create a new file "DosBlast.bat" 
 open(OUT,">DosBlast.bat") || die "cannot open file for writing: $!";
 foreach $f (@thefiles) 
 {
  unless ( ($f eq ".") || ($f eq "..")  || ($f eq "DosBlast.bat") || ($f eq "BlastList.pl") || ($f eq $DBNAME)||
 ($f eq "BlastOut") || ($f eq "blastall.exe") || ($f eq "formatdb.exe")|| 
 ($f eq "formatdb.log")|| ($f eq "ReadMe.doc")){ 
  @myarray = split(/\./,$f);   # Old file name
  $extension =".txt";         # This is the new file extension
  @newname=@myarray[0].$extension;
   print(OUT "blastp -db $DBNAME -query $f -out @newname -num_descriptions 1");
   } # end of unless
   } # end of foreach
perl blast+ • 4.6k views
ADD COMMENTlink modified 5.6 years ago by Dan Gaston7.1k • written 5.6 years ago by zhaolin2004200830
3

"Since it is impossible to blast them one by one" : why ?

ADD REPLYlink written 5.6 years ago by Pierre Lindenbaum121k

From the looks of this code I think you are really lost. I don't mean that to be negative, but I'm being honest when I say you cannot improve this in any way except taking another approach. Try to simplify things. Take a step back and try to get the blast command working on a single file. When that is working, try to do a simple procedure on the zipped directory like counting the files. If you can accomplish those things, then it should be possible to put things together like Eric Normandeau suggested using a shell command. This way you can build up your command and not worry about also getting the Perl syntax right.

ADD REPLYlink written 5.6 years ago by SES8.2k

Looks like they are making a BAT file because they are in a windows environment. The perl script is just a way of generating that, which is what they will actually run. Its not a terrible idea, but it would probably just be simpler to, instead of writing out to a bat file, make a direct system call to blastp. But to be honest I haven't done this sort of thing in a Windows setting other than cygwin.

ADD REPLYlink written 5.6 years ago by Dan Gaston7.1k

It's not a bad idea, but instead of dealing with two monsters at once, blast and perl, I think it would be advisable to just work out the blast steps since that is the goal. It looks like the Perl aspect of this task is really a stumbling block and it may be better to remove that from the equation for now. At least, until it's clear the sequences can be processed and blasted correctly. Another possibility on windows would be to use bioperl's run package to handle the blast and use bioperl to parse the results.

ADD REPLYlink written 5.6 years ago by SES8.2k

Thank you all for the suggestions. Frankly, I was lost when I met the problem. What I could do was to blast them one by one manually.

ADD REPLYlink written 5.6 years ago by zhaolin2004200830
4
gravatar for Eric Normandeau
5.6 years ago by
Quebec, Canada
Eric Normandeau10k wrote:

Like Pierre suggests, you certainly can blast all the files separately. Also, your approach seems way too convoluted. Why not simply use the blast program directly?

EDIT: These solutions will work under Linux or MacOSX but not Windows.

If you can unzip your sequence archive and have a folder full of fasta files, the following will blast all the files:

for i in *.fasta; do blastp -db /path/do/db -query $i -out $i.blasted; done

This supposes only one folder with all the files with a .fasta extension.

You can do better with gnu parallel if you want (here is Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them by its creator):

find . -iname "*.fasta" | parallel blastp -db /path/do/db -query {} -out {.}.blasted

This still supposes you have .fasta extensions (which you can change in the command) but will blast all sequence files in your directory, even if they are nested in subdirectories. (It may also suppose you don't have spaces in your directory or file names)

I'm sure someone here will know how to do the same without unziping your archive. (Could somebody show us how to do this?)

ADD COMMENTlink modified 5.6 years ago • written 5.6 years ago by Eric Normandeau10k

This is the approach I would take although it looks like the user is in a Windows environment. Unless he is using cygwin I'm not sure how command-line blast may differ on a straight windows system. But it could probably be wrapped inside a perl script basically how you did above as well.

I think there is no way around unzipping the archive. Also it would be good to know whether the fasta files contain only a single sequence, or multiple sequences. If they are only a single sequence the OP is better off putting them in to one file and running a parallel BLAST from the getgo and parsing the results out as needed afterwards.

ADD REPLYlink written 5.6 years ago by Dan Gaston7.1k

I have overlooked the OS in the question. My mistake. I don't know what the solution would be in Windows. If it has to be half as complicated as the OP has started to produce, and if he/she is likely to have to play with sequences like that in his/her project, I think it should be enough incentive for that person to start learning and using Linux for his project.

ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by Eric Normandeau10k

Yes. my os is windows. I tried the Eric's suggestions. Neither works. I unzipped my sequence archive and have a folder full of fasta files. If I don't use perl and use blastp directly, how blastp -query could recognized hundreds of fasta files in a folder?

By the way, I really think of changing a computer of Linux or Unix system. Any other suggestions?

ADD REPLYlink written 5.6 years ago by zhaolin2004200830

Install and try Ubuntu or try a Mac computer. Ubuntu is one of the easiest Linux distributions to start with. It is not a limited Linux in any way, only simple to use. This should give you much more power to do what you are trying to do. It will require some learning though, but in the medium and long term it should improve your capacities greatly.

ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by Eric Normandeau10k

Aside from Eric's suggestion, another option is to run Ubuntu within a VM. This depends on how many resources you have on your system; if it's fairly decent then set up a VM (maybe using virtualbox).

You can also set up and run Ubuntu very quickly on Amazon EC2 (AWS). I thought someone out there had an AMI with everything set up on it, such as CloudBioLinux.

ADD REPLYlink written 5.6 years ago by Chris Fields2.1k

I don't see how a VM is 'aside from my suggestion', still counts as an install to me :) From my perspective of getting into AWS in the last year and not finding it transparent or straight forward, I suspect suggesting to someone with no programming or 'NIX experience to set up and run Ubuntu on EC2, could potentially sound scary.

ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by Eric Normandeau10k

Eric, apologies, but your suggestion re: Ubuntu above makes no mention of using a VM, and so at least to me this indicated a complete reinstallation or at the very least a dual-boot.

So, from my vantage point I'm not sure why the suggestion of jumping right in and installing Ubuntu over one's current OS wouldn't sound scary to someone who likely has little UNIX/scripting experience, yet using a VM or suggesting AWS (neither of which has a downside beyond some possible cost for AWS) is? Maybe I'm reading too much into this?

Just so it's clear, I'm really not disagreeing with you, I just think the options need to be clarified. I completely agree, they should try Ubuntu (or Fedora or <insert linux="" flavor="" here="">) out. However, based on all the modern options available, I don't think there is an immediate need to jump into the deep end of the pool until they test the waters a bit and feel comfortable enough to manage a full linux install on their own.

ADD REPLYlink written 5.6 years ago by Chris Fields2.1k

Hi Chris, no apologies are needed at all. I was just feeling mischievous and wanted to bend reality a bit by including installing Ununtu inside a VM in the bigger category of installing Ubuntu, just to pretend I had already made the suggestion. :) I agree that jumping into the Linux pool can be scary. The easiest thing to do I think is to use a bootable CD or even better USB stick. Even creating these booting media can be scary for some though. I do not advise replacing one's OS with Linux as a first move either, so it looks like we agree 100% here!

ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by Eric Normandeau10k

blastp will only handle one file at a time. However, my earlier comments asked about why the files need to be separate fasta files in the first place? Are there more than one sequence per fasta file? Why are they divided into multiple files in the first place if you need to run blast on all of the sequences against the same database? Sometimes there are very good reasons for this of course, but they might not need to be split up in to hundreds of files.

ADD REPLYlink written 5.6 years ago by Dan Gaston7.1k
0
gravatar for Dan Gaston
5.6 years ago by
Dan Gaston7.1k
Canada
Dan Gaston7.1k wrote:

Out of curiosity are there multiple sequences per fasta file or just one? If there is no compelling reason for having them all as individual files you can put them into one large file and let blastp handle the parallelism itself and just run the blast. If you are using BioPerl you can use that to very easily parse your results on a query by query basis however you like.

ADD COMMENTlink written 5.6 years ago by Dan Gaston7.1k

I am new to bioinformatics. How to use bioperl to solve the problem? Thank you.

ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by zhaolin2004200830
2

If you are new to bioinformatics but have no scripting experience (e.g. perl, python, or ruby) or UNIX background then the various bio* won't help (I speak as one of the core developers of bioperl). If you are doing anything high-throughput and plan on doing this on your own, you need to look into taking some short course on the above. Might I recommend Software Carpentry?

ADD REPLYlink written 5.6 years ago by Chris Fields2.1k

I agree. Get your feet wet with tutorials on generic coding with Software Carpentry or Code Academy if you lack experience. Then perhaps try the BioPerl tutorials for processing different types of data. Parsing BLAST output is one of the beginning tutorials if I recall correctly.

ADD REPLYlink written 5.6 years ago by Dan Gaston7.1k

Thank you. I 'll try.

ADD REPLYlink written 5.6 years ago by zhaolin2004200830
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 921 users visited in the last hour