I am running the pgenthreader.sh and runpsipred scripts on a single protein FASTA file. runpsipred takes approximately 3 to 4 hours and pgenthreader.sh takes approximately 8 hours.
Is this the usual time these scripts take to predict the fold and secondary structure of a single protein, or am I making a mistake in how I run them? After running for that long, both scripts do produce accurate result files.
If this run time is normal, can anyone suggest how to reduce it? I need to run 7000 proteins.
The runpsipred script is
#!/bin/tcsh
# This is a simple script which will carry out all of the basic steps
# required to make a PSIPRED prediction. Note that it assumes that the
# following programs are in the appropriate directories:
# blastpgp - PSIBLAST executable (from NCBI toolkit)
# makemat - IMPALA utility (from NCBI toolkit)
# psipred - PSIPRED V4 program
# psipass2 - PSIPRED V4 program
# NOTE: Script modified to be more cluster friendly (DTJ April 2008)
# The name of the BLAST data bank
set dbname = /media/kakarot/ppi/genthreader/uniref_test_db/uniref100.fasta
# Where the NCBI programs have been installed
# NOTE: ensure you omit any trailing / from this setting or some terminals
# may seg fault
set ncbidir = /media/kakarot/ppi/ncbi/blast-2.2.26/bin
# Where the PSIPRED V4 programs have been installed
set execdir = /media/kakarot/ppi/psipred/bin
# Where the PSIPRED V4 data files have been installed
set datadir = /media/kakarot/ppi/psipred/data
set basename = $1:r
set rootname = $basename:t
# Generate a "unique" temporary filename root
set hostid = `hostid`
set tmproot = psitmp$$$hostid
\cp -f $1 $tmproot.fasta
echo "Running PSI-BLAST with sequence" $1 "..."
$ncbidir/blastpgp -b 0 -j 3 -h 0.001 -v 5000 -d $dbname -i $tmproot.fasta -C $tmproot.chk >& $tmproot.blast
if ($status != 0) then
tail $tmproot.blast
echo "FATAL: Error whilst running blastpgp - script terminated!"
exit $status
endif
echo "Predicting secondary structure..."
echo $tmproot.chk > $tmproot.pn
echo $tmproot.fasta > $tmproot.sn
$ncbidir/makemat -P $tmproot
if ($status != 0) then
echo "FATAL: Error whilst running makemat - script terminated!"
exit $status
endif
echo Pass1 ...
$execdir/psipred $tmproot.mtx $datadir/weights.dat $datadir/weights.dat2 $datadir/weights.dat3 > $rootname.ss
if ($status != 0) then
echo "FATAL: Error whilst running psipred - script terminated!"
exit $status
endif
echo Pass2 ...
$execdir/psipass2 $datadir/weights_p2.dat 1 1.0 1.0 $rootname.ss2 $rootname.ss > $rootname.horiz
if ($status != 0) then
echo "FATAL: Error whilst running psipass2 - script terminated!"
exit $status
endif
# Remove temporary files
echo Cleaning up ...
#\rm -f $tmproot.* error.log
echo "Final output files:" $rootname.ss2 $rootname.horiz
echo "Finished."
The result of runpsipred, in the file seq.horiz, is:
# PSIPRED HFORMAT (PSIPRED V4.0)
Conf: 998887899999999999999982144201402887799986530177874598887799
Pred: CCCCCHHHHHHHHHHHHHHHHHHCCCCCCHHHCCCCCCCCCCCCCCHHHHCCCCCCCCCC
AA: MQLRNPELHLGCALALRFLALVSWDIPGARALDNGLARTPTMGWLHWERFMCNLDCQEEP
10 20 30 40 50 60
Conf: 757899999999999999029991964899737556986789999658900089517999
Pred: CCCCCHHHHHHHHHHHHHHCHHHHCCCEEEECCCCCCCCCCCCCCCCCCCCCCCCCHHHH
AA: DSCISEKLFMEMAELMVSEGWKDAGYEYLCIDDCWMAPQRDSEGRLQADPQRFPHGIRQL
70 80 90 100 110 120
Conf: 999998698036463445578889997379289999999997999999769999961668
Pred: HHHHHHCCCCEEEEECCCCCCCCCCCCCCCHHHHHHHHHHHHCCCEEEECCCCCCCHHHH
AA: ANYVHSKGLKLGIYADVGNKTCAGFPGSFGYYDIDAQTFADWGVDLLKFDGCYCDSLENL
130 140 150 160 170 180
Conf: 889999999999859996005884011499689964578775350007777567566767
Pred: HHHHHHHHHHHHHHCCCCCCCCCCCHHCCCCCCCCCHHHHHHCCCCCCCCCCCCCCCCHH
AA: ADGYKHMSLALNRTGRSIVYSCEWPLYMWPFQKPNYTEIRQYCNHWRNFADIDDSWKSIK
190 200 210 220 230 240
Conf: 750353503576899869998889215340799999999999999999997176750796
Pred: HHHCHHHCCHHHHHHHCCCCCCCCCCCEECCCCCCCHHHHHHHHHHHHHHHHHHHHHCCH
AA: SILDWTSFNQERIVDVAGPGGWNDPDMLVIGNFGLSWNQQVTQMALWAIMAAPLFMSNDL
250 260 270 280 290 300
Conf: 669999999976997884147988776689770797799999879997899999899998
Pred: HHCCHHHHHHHCCHHHHEECCCCCCCCCEEEEECCCEEEEEEECCCCCEEEEEEECCCCC
AA: RHISPQAKALLQDKDVIAINQDPLGKQGYQLRQGDNFEVWERPLSGLAWAVAMINRQEIG
310 320 330 340 350 360
Conf: 118999999992987668884499998868953564642403899979996899999968
Pred: CCEEEEEEHHHHCCCCCCCCCEEEEEECCCCCEEEEEECCCEEEEEECCCCEEEEEEEEC
AA: GPRSYTIAVASLGKGVACNPACFITQLLPVKRKLGFYEWTSRLRSHINPTGTVLLQLENT
370 380 390 400 410 420
Conf: 866525629
Pred: CCCCHHHHC
AA: MQMSLKDLL
The pgenthreader.sh script is
#!/bin/bash
#------------------------------------------------------------------------------
# pGenTHREADER
#
# simple shell script for running pGenTHREADER code
#
# Jan 2009 alobley@cs.ucl.ac.uk
#------------------------------------------------------------------------------
#JOB name
JOB=sequence
#input file
FSA=/media/kakarot/ppi/hprd_fasta/seq.fasta
#path to psiblast
PSIB=/media/kakarot/ppi/ncbi/blast-2.2.26/bin
#path to database
DB=/media/kakarot/ppi/genthreader/uniref_test_db/uniref100.fasta
#path to PSIPRED data files
PDATA=/media/kakarot/ppi/psipred/data
#path to pGenTHREADER data files
DATA=/media/kakarot/ppi/genthreader/data
#path to PGenThreader binary directory
PGT=/media/kakarot/ppi/genthreader/bin
#path to PSIPRED binary directory
PSIP=/media/kakarot/ppi/psipred/bin
#path to fold library
TDB=/media/kakarot/ppi/genthreader/tdb/cath_domain_tdb
#TDB=/scratch0/NOT_BACKED_UP/dbuchan/tdb
#make a masked copy of input file
#$PSIP/pfilt -b $FSA > $JOB.fsa
cp $FSA $JOB.fsa
export TDB_DIR=$TDB
export THREAD_DIR=/media/kakarot/ppi/genthreader/data
echo started `date` $HOST > $JOB.pgt.log
#Run 3 iterations of PSI-BLAST
$PSIB/blastpgp -a 4 -F T -t 1 -j 3 -v 5000 -b 0 -h 0.001 -i $JOB.fsa -d $DB -C $JOB.chk -F T > /dev/null
echo "Finished 3 iterations PSI-BLAST"
echo $JOB.fsa > $JOB.sn
echo $JOB.chk > $JOB.pn
#make a profile matrix
$PSIB/makemat -P $JOB
#rename checkpoint file
mv $JOB.chk $JOB.iter3.chk
mv $JOB.mtx $JOB.iter3.mtx
# Run PSI-PRED for PGT
# $PSIP/psipred $JOB.iter3.mtx $PDATA/weights.dat $PDATA/weights.dat2 $PDATA/weights.dat3 > $JOB.pgen.ss
# #$PSIP/psipred $JOB.iter3.mtx $PDATA/weights.dat $PDATA/weights.dat2 $PDATA/weights.dat3 $PDATA/weights.dat4 > $JOB.pgen.ss
# $PSIP/psipass2 $PDATA/weights_p2.dat 1 1.0 1.0 $JOB.pgen.ss2 $JOB.pgen.ss > $JOB.horiz
if [ ! -s "$JOB.pgen.ss2" ]
then
echo "PSIPRED failed ... exiting early"
echo "PSIPRED failed ... exiting early" >> $JOB.pgt.log
exit;
fi
echo "Finished PSI-PRED"
#Run further 3 iterations of psi-blast
$PSIB/blastpgp -a 4 -F T -t 1 -i $JOB.fsa -R $JOB.iter3.chk -d $DB -j 3 -v 5000 -b 0 -h 0.001 -C $JOB.chk > /dev/null
echo "Finished 6 iterations of PSI BLAST"
#make PSSM from 6th iteration
echo $JOB.fsa > $JOB.sn
echo $JOB.chk > $JOB.pn
#make a profile matrix
$PSIB/makemat -P $JOB
#rename checkpoint file
mv $JOB.chk $JOB.iter6.chk
mv $JOB.mtx $JOB.iter6.mtx
# Run PGenThreader process
$PGT/pseudo_bas -c11.0 -C20 -h0.2 -F$JOB.pgen.ss2 $JOB.iter6.mtx $JOB.pgen.pseudo $DATA/psichain.lst
if [ ! -s "$JOB.pgen.pseudo" ]
then
echo "pseudo_bas failed"
echo "pseudo_bas failed" >> $JOB.pgt.log
exit;
fi
$PGT/svm_prob $JOB.pgen.pseudo | sort -k 2,2rn -k 6,6rn -k 5,5g > $JOB.pgen.presults
if [ ! -s "$JOB.pgen.presults" ]
then
echo "sortprob failed"
echo "sortprob failed" >> $JOB.pgt.log
exit;
fi
$PGT/pseudo_bas -S -p -c11.0 -C20 -h0.2 -F$JOB.pgen.ss2 $JOB.iter6.mtx $JOB.pgen.pseudo $JOB.pgen.presults > $JOB.pgen.align
echo Finished `date` >> $JOB.pgt.log
The result of pgenthreader.sh, in the file sequence.pgen.presults, is:
CERT 236.652 6e-23 -533.3 -16.9 1383.0 395 395 429 3hg3A0
CERT 227.154 6e-22 -478.9 -17.7 1330.5 385 388 429 1ktbA0
CERT 196.975 7e-19 -443.1 -11.7 1134.0 361 362 429 1uasA0
CERT 189.506 4e-18 -449.7 -21.5 1078.0 369 397 429 3a5vA0
CERT 167.082 7e-16 -404.7 -5.4 934.0 374 452 429 3lrkA0
CERT 160.299 3e-15 -440.5 -9.2 879.0 368 417 429 1sznA0
CERT 145.860 9e-14 -426.8 -12.3 781.0 374 614 429 3a21A0
CERT 127.787 6e-12 -364.7 -12.6 671.0 363 433 429 3cc1A0
HIGH 56.007 1e-04 -229.9 -1.6 209.0 334 742 429 3mi6A0
LOW 30.643 0.041 -205.8 -7.9 38.0 332 585 429 1wzlA0
LOW 30.609 0.041 -115.6 -9.7 61.0 314 441 429 2zxdA0
LOW 30.497 0.042 -188.4 -1.6 45.0 314 597 429 3edfA0
LOW 29.872 0.049 -165.1 0.9 46.0 316 522 429 3ii1A0
LOW 29.829 0.049 -228.9 -3.9 31.0 311 684 429 1cgtA0
LOW 29.058 0.059 -191.4 -2.7 37.0 293 469 429 2wc7A0
LOW 28.982 0.060 -147.7 -2.9 43.0 318 449 429 2osxA0
LOW 28.743 0.064 -238.7 -5.1 24.0 295 476 429 2aaaA0
LOW 28.510 0.067 -129.3 -4.0 48.0 289 588 429 1j0hA0
LOW 27.383 0.087 -147.1 5.0 33.0 319 722 429 3k1dA0
GUESS 24.437 0.173 -202.1 -5.6 27.0 142 230 429 1wv2A0
GUESS 24.437 0.173 -202.1 -5.6 27.0 142 230 429 1wv2A0
GUESS 24.314 0.178 -77.8 -1.3 60.0 106 187 429 1j3qA0
GUESS 24.023 0.191 -162.3 -3.3 32.0 151 208 429 3dciA0
GUESS 23.130 0.234 -111.4 -1.3 45.0 101 145 429 3dobA0
GUESS 23.017 0.241 -23.4 1.1 28.0 331 583 429 1ea9C0
GUESS 22.732 0.257 -112.1 1.9 43.0 98 148 429 3dqgA0
GUESS 22.693 0.260 -10.9 -1.8 66.0 95 487 429 1bpoA0
GUESS 22.665 0.261 -79.4 0.8 50.0 100 320 429 2ichA0
GUESS 22.473 0.273 -41.6 2.2 58.0 93 151 429 1ufgA0
GUESS 22.370 0.280 -129.1 -0.9 37.0 95 120 429 1wj5A0
GUESS 22.365 0.280 -159.3 -4.8 20.0 153 226 429 2tpsA0
That sounds like it's taking too long to me. What size are your database and query sequences? If it has to search a large database, or the protein itself is big, that could add to the time taken.
What hardware do you have? I don't know which, if any, of those programs use multithreading internally, but you could cut down your total time for all 7000 by using GNU parallel to parallelise the workflow.

My query is:
and I am using the UniRef100 database, which is 104.9 GB. But blastpgp doesn't take that much of the time in the script. I am using 64-bit Linux (Ubuntu) with an Intel® Xeon(R) Gold 5118 CPU @ 2.30GHz × 24 (disk space 398.3 GB, plus an external hard disk of 2 TB). Can you please explain more about GNU parallel?

Do you know the exact parts of the script, and the sub-programs, where most of the time is being spent? You need to narrow it down to the specific step or task so that we can understand what the limiting factor is.
Your hardware seems reasonable, so it's unlikely to be the bottleneck. You mention an external hard disk, though: is this where you're storing your database? Is the external disk a spinning-disk drive or solid state, and what USB interface is it using (USB A Gen3, USB C, etc.)?
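One way to narrow down the slow step is to wrap each command in the scripts with a small timing helper. The following is only a sketch: time_step and TIMING_LOG are hypothetical names introduced here, not part of PSIPRED or pGenTHREADER, and the commented-out calls simply reuse the commands from pgenthreader.sh above.

```shell
#!/bin/bash
# Hypothetical helper (not part of PSIPRED/pGenTHREADER): run one pipeline
# step and append its wall-clock time and exit status to a log file.
TIMING_LOG=${TIMING_LOG:-step_times.log}

time_step() {
    label=$1
    shift
    start=$(date +%s)
    "$@"                      # run the step exactly as given
    rc=$?
    end=$(date +%s)
    echo "$label: $((end - start)) s (exit status $rc)" >> "$TIMING_LOG"
    return $rc
}

# Inside pgenthreader.sh the calls might look like this (left commented out
# so that the sketch runs on its own):
# time_step "psiblast iter 1-3" $PSIB/blastpgp -a 4 -F T -t 1 -j 3 -v 5000 \
#     -b 0 -h 0.001 -i $JOB.fsa -d $DB -C $JOB.chk > /dev/null
# time_step "makemat" $PSIB/makemat -P $JOB

# Stand-in step so the helper can be tried without BLAST installed.
time_step "demo sleep" sleep 1
cat "$TIMING_LOG"
```

After one full run, the log shows at a glance whether the two blastpgp calls, makemat, or the pseudo_bas threading steps account for most of the hours.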
GNU parallel is a common Linux tool that can run many input files through a command as one continuous batch. If you have, say, 32 cores, parallel will keep as many jobs running as those cores allow, so in the time it takes you to do 1, you could do as many as 32 or even 64. If your run times are still ~10 hours per sequence though, this won't get you far with 7000.

Sorry, the hard drive is internally mounted. I am trying to use GNU parallel. Thank you so much for your help.
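For reference, a minimal sketch of what a GNU parallel run could look like on the 24-core machine described above. It assumes a hypothetical wrapper, run_pgt.sh, that takes the FASTA path as its first argument and gives every job its own JOB name and working directory; pgenthreader.sh as posted hard-codes FSA= and JOB=sequence, so unmodified copies running side by side would overwrite each other's files. The jobs_for_cores function is also just an illustrative name.

```shell
#!/bin/bash
# Sketch only. run_pgt.sh is a hypothetical per-protein wrapper around
# pgenthreader.sh; it does not exist in the pGenTHREADER distribution.

# Concurrent jobs = total cores / threads used by each job.
jobs_for_cores() {
    cores=$1
    threads_per_job=$2
    echo $(( cores / threads_per_job ))
}

# 24 cores (from the post) and 4 threads per job (blastpgp -a 4) -> 6 jobs.
NJOBS=$(jobs_for_cores 24 4)
echo "Would run $NJOBS jobs at a time"

# Only attempt the batch if GNU parallel and the hypothetical wrapper exist.
if command -v parallel > /dev/null && [ -x ./run_pgt.sh ]; then
    # {} is replaced by each FASTA path; --joblog records per-job run times.
    parallel --jobs "$NJOBS" --joblog pgt_jobs.log \
        ./run_pgt.sh {} ::: /media/kakarot/ppi/hprd_fasta/*.fasta
fi
```

Even with 6 jobs running at once, 7000 proteins at about 8 hours each would need roughly 7000 × 8 / 6 ≈ 9300 hours, which is why the per-sequence time has to come down first. Note also that the blastpgp call in runpsipred has no -a option, so it most likely runs on a single thread, whereas pgenthreader.sh passes -a 4.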
I also tried to install GPU-BLAST (to speed up BLASTp), but running its install script as sh install gives an error.
The error is:
The problem is in the function (written on the 6th line). The install script which I downloaded with GPU-BLAST is:
I downloaded the software from http://archimedes.cheme.cmu.edu/?q=gpublast and referred to the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3018811/. I am stuck on what to do next. Note: I have cut part of the code because I can post a maximum of 5000 words. If you need to check the whole code, you can get it by downloading the software from the above link. It's only 1.4 MB.

I'm not familiar with GPU-BLAST myself, but I would guess it's because you're trying to run a bash script with sh.
Try bash install.sh.
Bash supports some syntax that sh doesn't, so it might be complaining about that.
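To illustrate the difference, here is a small self-contained sketch. The generated script is a made-up stand-in, not the GPU-BLAST installer; it only uses two constructs that bash accepts and that are not part of POSIX sh (dash, which is /bin/sh on Ubuntu, rejects them).

```shell
#!/bin/bash
# Illustrative only: the temporary script below is a stand-in written for
# this example, not the real GPU-BLAST install script.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/bash
function greet {                     # the "function" keyword is a bash extension
    local parts=(hello from bash)    # arrays are a bash extension too
    echo "${parts[*]}"
}
greet
EOF

head -n 1 "$demo"        # the shebang line shows which shell the script expects
result=$(bash "$demo")   # run it with bash explicitly rather than with sh
echo "$result"
rm -f "$demo"
```

Checking the first line of the real install script in the same way shows whether it was written for bash, in which case running it with bash rather than sh is the thing to try.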