Question: time taken by pgenthreader.sh script
0
gravatar for shalinikaushik1293
7 months ago by
shalinikaushik12930 wrote:

I am running the pgenthreader.sh script and runpsipred script and it took approx 3 to 4 hours to run the runpsipred script and approx 8 hours to run the pgenthreader.sh script for a single protein fasta file.

Is this the usual time these scripts take to predict the fold and secondary structure of the single protein or I am doing any mistake while running the program. But after running both the scripts for so long, I am getting the accurate result file.

If it takes that much of time, Can anyone help me if they know ..how to minimize the time duration because I need to run 7000 proteins.

The runpsipred script is

#!/bin/tcsh

# This is a simple script which will carry out all of the basic steps
# required to make a PSIPRED prediction. Note that it assumes that the
# following programs are in the appropriate directories:
# blastpgp - PSIBLAST executable (from NCBI toolkit)
# makemat - IMPALA utility (from NCBI toolkit)
# psipred - PSIPRED V4 program
# psipass2 - PSIPRED V4 program

# NOTE: Script modified to be more cluster friendly (DTJ April 2008)

# The name of the BLAST data bank
set dbname = /media/kakarot/ppi/genthreader/uniref_test_db/uniref100.fasta

# Where the NCBI programs have been installed
# NOTE: ensure you omit any trailing / from this setting or some terminals
#       may seg fault
set ncbidir = /media/kakarot/ppi/ncbi/blast-2.2.26/bin

# Where the PSIPRED V4 programs have been installed
set execdir = /media/kakarot/ppi/psipred/bin

# Where the PSIPRED V4 data files have been installed
set datadir = /media/kakarot/ppi/psipred/data

set basename = $1:r
set rootname = $basename:t

# Generate a "unique" temporary filename root
set hostid = `hostid`
set tmproot = psitmp$$$hostid

\cp -f $1 $tmproot.fasta

echo "Running PSI-BLAST with sequence" $1 "..."

$ncbidir/blastpgp -b 0 -j 3 -h 0.001 -v 5000 -d $dbname -i $tmproot.fasta -C $tmproot.chk >& $tmproot.blast

if ($status != 0) then
    tail $tmproot.blast
    echo "FATAL: Error whilst running blastpgp - script terminated!"
    exit $status
endif

echo "Predicting secondary structure..."

echo $tmproot.chk > $tmproot.pn
echo $tmproot.fasta > $tmproot.sn

$ncbidir/makemat -P $tmproot

if ($status != 0) then
    echo "FATAL: Error whilst running makemat - script terminated!"
    exit $status
endif

echo Pass1 ...

$execdir/psipred $tmproot.mtx $datadir/weights.dat $datadir/weights.dat2 $datadir/weights.dat3 > $rootname.ss

if ($status != 0) then
    echo "FATAL: Error whilst running psipred - script terminated!"
    exit $status
endif

echo Pass2 ...

$execdir/psipass2 $datadir/weights_p2.dat 1 1.0 1.0 $rootname.ss2 $rootname.ss > $rootname.horiz

if ($status != 0) then
    echo "FATAL: Error whilst running psipass2 - script terminated!"
    exit $status
endif

# Remove temporary files

echo Cleaning up ...
#\rm -f $tmproot.* error.log

echo "Final output files:" $rootname.ss2 $rootname.horiz
echo "Finished."

Results of runpsipred is in the file seq.horiz is:

# PSIPRED HFORMAT (PSIPRED V4.0)

Conf: 998887899999999999999982144201402887799986530177874598887799
Pred: CCCCCHHHHHHHHHHHHHHHHHHCCCCCCHHHCCCCCCCCCCCCCCHHHHCCCCCCCCCC
  AA: MQLRNPELHLGCALALRFLALVSWDIPGARALDNGLARTPTMGWLHWERFMCNLDCQEEP
              10        20        30        40        50        60


Conf: 757899999999999999029991964899737556986789999658900089517999
Pred: CCCCCHHHHHHHHHHHHHHCHHHHCCCEEEECCCCCCCCCCCCCCCCCCCCCCCCCHHHH
  AA: DSCISEKLFMEMAELMVSEGWKDAGYEYLCIDDCWMAPQRDSEGRLQADPQRFPHGIRQL
              70        80        90       100       110       120


Conf: 999998698036463445578889997379289999999997999999769999961668
Pred: HHHHHHCCCCEEEEECCCCCCCCCCCCCCCHHHHHHHHHHHHCCCEEEECCCCCCCHHHH
  AA: ANYVHSKGLKLGIYADVGNKTCAGFPGSFGYYDIDAQTFADWGVDLLKFDGCYCDSLENL
             130       140       150       160       170       180


Conf: 889999999999859996005884011499689964578775350007777567566767
Pred: HHHHHHHHHHHHHHCCCCCCCCCCCHHCCCCCCCCCHHHHHHCCCCCCCCCCCCCCCCHH
  AA: ADGYKHMSLALNRTGRSIVYSCEWPLYMWPFQKPNYTEIRQYCNHWRNFADIDDSWKSIK
             190       200       210       220       230       240


Conf: 750353503576899869998889215340799999999999999999997176750796
Pred: HHHCHHHCCHHHHHHHCCCCCCCCCCCEECCCCCCCHHHHHHHHHHHHHHHHHHHHHCCH
  AA: SILDWTSFNQERIVDVAGPGGWNDPDMLVIGNFGLSWNQQVTQMALWAIMAAPLFMSNDL
             250       260       270       280       290       300


Conf: 669999999976997884147988776689770797799999879997899999899998
Pred: HHCCHHHHHHHCCHHHHEECCCCCCCCCEEEEECCCEEEEEEECCCCCEEEEEEECCCCC
  AA: RHISPQAKALLQDKDVIAINQDPLGKQGYQLRQGDNFEVWERPLSGLAWAVAMINRQEIG
             310       320       330       340       350       360


Conf: 118999999992987668884499998868953564642403899979996899999968
Pred: CCEEEEEEHHHHCCCCCCCCCEEEEEECCCCCEEEEEECCCEEEEEECCCCEEEEEEEEC
  AA: GPRSYTIAVASLGKGVACNPACFITQLLPVKRKLGFYEWTSRLRSHINPTGTVLLQLENT
             370       380       390       400       410       420


Conf: 866525629
Pred: CCCCHHHHC
  AA: MQMSLKDLL

The pgenthreader.sh script is

#!/bin/bash

#------------------------------------------------------------------------------
# pGenTHREADER
#
# simple shell script for running pGenTHREADER code
#
# Jan 2009 alobley@cs.ucl.ac.uk
#------------------------------------------------------------------------------

#JOB name 
JOB=sequence
#input file
FSA=/media/kakarot/ppi/hprd_fasta/seq.fasta
#path to psiblast
PSIB=/media/kakarot/ppi/ncbi/blast-2.2.26/bin
#path to database
DB=/media/kakarot/ppi/genthreader/uniref_test_db/uniref100.fasta
#path to PSIPRED data files
PDATA=/media/kakarot/ppi/psipred/data 
#path to pGenTHREADER data files
DATA=/media/kakarot/ppi/genthreader/data
#path to PGenThreader binary directory
PGT=/media/kakarot/ppi/genthreader/bin
#path to PSIPRED binary directory
PSIP=/media/kakarot/ppi/psipred/bin
#path to fold library
TDB=/media/kakarot/ppi/genthreader/tdb/cath_domain_tdb
#TDB=/scratch0/NOT_BACKED_UP/dbuchan/tdb


#make a masked copy of input file
#$PSIP/pfilt -b $FSA > $JOB.fsa
cp $FSA $JOB.fsa
export TDB_DIR=$TDB
export THREAD_DIR=/media/kakarot/ppi/genthreader/data


echo started `date` $HOST > $JOB.pgt.log

#Run 3 iterations of PSI-BLAST


$PSIB/blastpgp -a 4 -F T -t 1 -j 3 -v 5000 -b 0 -h 0.001 -i $JOB.fsa -d $DB -C $JOB.chk -F T > /dev/null


 echo "Finished 3 iterations PSI-BLAST"
 echo $JOB.fsa > $JOB.sn
 echo $JOB.chk > $JOB.pn

#make a profile matrix
 $PSIB/makemat -P $JOB

#rename checkpoint file
  mv $JOB.chk $JOB.iter3.chk
  mv $JOB.mtx $JOB.iter3.mtx

# Run PSI-PRED for PGT
# $PSIP/psipred $JOB.iter3.mtx $PDATA/weights.dat $PDATA/weights.dat2 $PDATA/weights.dat3 > $JOB.pgen.ss
# #$PSIP/psipred $JOB.iter3.mtx $PDATA/weights.dat $PDATA/weights.dat2 $PDATA/weights.dat3 $PDATA/weights.dat4 > $JOB.pgen.ss

# $PSIP/psipass2 $PDATA/weights_p2.dat 1 1.0 1.0 $JOB.pgen.ss2 $JOB.pgen.ss > $JOB.horiz


if [ ! -s "$JOB.pgen.ss2" ] 
then
   echo "PSIPRED failed ... exiting early"
   echo "PSIPRED failed ... exiting early" >> $JOB.pgt.log
   exit;
fi

echo "Finished PSI-PRED"

#Run further 3 iterations of psi-blast

$PSIB/blastpgp -a 4 -F T -t 1 -i $JOB.fsa -R $JOB.iter3.chk -d $DB -j 3 -v 5000 -b 0 -h 0.001 -C $JOB.chk > /dev/null

echo "Finished 6 iterations of PSI BLAST"

#make PSSM from 6th iteration

 echo $JOB.fsa > $JOB.sn
 echo $JOB.chk > $JOB.pn

#make a profile matrix
 $PSIB/makemat -P $JOB

#rename checkpoint file
  mv $JOB.chk $JOB.iter6.chk
  mv $JOB.mtx $JOB.iter6.mtx

# Run PGenThreader process

$PGT/pseudo_bas -c11.0 -C20 -h0.2 -F$JOB.pgen.ss2 $JOB.iter6.mtx $JOB.pgen.pseudo $DATA/psichain.lst

if [ ! -s "$JOB.pgen.pseudo" ]
then
    echo "pseudo_bas failed" 
    echo "pseudo_bas failed" >> $JOB.pgt.log
    exit;
fi

$PGT/svm_prob $JOB.pgen.pseudo | sort -k 2,2rn -k 6,6rn -k 5,5g  > $JOB.pgen.presults

if [ ! -s "$JOB.pgen.presults" ]
then
    echo "sortprob failed"
    echo "sortprob failed" >> $JOB.pgt.log
    exit;
fi

$PGT/pseudo_bas -S -p -c11.0 -C20 -h0.2 -F$JOB.pgen.ss2 $JOB.iter6.mtx $JOB.pgen.pseudo $JOB.pgen.presults > $JOB.pgen.align

echo Finished `date` >> $JOB.pgt.log

Result of pgenthreader.sh is in sequence.pgen.presults is:

CERT   236.652  6e-23  -533.3 -16.9 1383.0  395  395  429 3hg3A0
CERT   227.154  6e-22  -478.9 -17.7 1330.5  385  388  429 1ktbA0
CERT   196.975  7e-19  -443.1 -11.7 1134.0  361  362  429 1uasA0
CERT   189.506  4e-18  -449.7 -21.5 1078.0  369  397  429 3a5vA0
CERT   167.082  7e-16  -404.7  -5.4  934.0  374  452  429 3lrkA0
CERT   160.299  3e-15  -440.5  -9.2  879.0  368  417  429 1sznA0
CERT   145.860  9e-14  -426.8 -12.3  781.0  374  614  429 3a21A0
CERT   127.787  6e-12  -364.7 -12.6  671.0  363  433  429 3cc1A0
HIGH    56.007  1e-04  -229.9  -1.6  209.0  334  742  429 3mi6A0
LOW     30.643  0.041  -205.8  -7.9   38.0  332  585  429 1wzlA0
LOW     30.609  0.041  -115.6  -9.7   61.0  314  441  429 2zxdA0
LOW     30.497  0.042  -188.4  -1.6   45.0  314  597  429 3edfA0
LOW     29.872  0.049  -165.1   0.9   46.0  316  522  429 3ii1A0
LOW     29.829  0.049  -228.9  -3.9   31.0  311  684  429 1cgtA0
LOW     29.058  0.059  -191.4  -2.7   37.0  293  469  429 2wc7A0
LOW     28.982  0.060  -147.7  -2.9   43.0  318  449  429 2osxA0
LOW     28.743  0.064  -238.7  -5.1   24.0  295  476  429 2aaaA0
LOW     28.510  0.067  -129.3  -4.0   48.0  289  588  429 1j0hA0
LOW     27.383  0.087  -147.1   5.0   33.0  319  722  429 3k1dA0
GUESS   24.437  0.173  -202.1  -5.6   27.0  142  230  429 1wv2A0
GUESS   24.437  0.173  -202.1  -5.6   27.0  142  230  429 1wv2A0
GUESS   24.314  0.178   -77.8  -1.3   60.0  106  187  429 1j3qA0
GUESS   24.023  0.191  -162.3  -3.3   32.0  151  208  429 3dciA0
GUESS   23.130  0.234  -111.4  -1.3   45.0  101  145  429 3dobA0
GUESS   23.017  0.241   -23.4   1.1   28.0  331  583  429 1ea9C0
GUESS   22.732  0.257  -112.1   1.9   43.0   98  148  429 3dqgA0
GUESS   22.693  0.260   -10.9  -1.8   66.0   95  487  429 1bpoA0
GUESS   22.665  0.261   -79.4   0.8   50.0  100  320  429 2ichA0
GUESS   22.473  0.273   -41.6   2.2   58.0   93  151  429 1ufgA0
GUESS   22.370  0.280  -129.1  -0.9   37.0   95  120  429 1wj5A0
GUESS   22.365  0.280  -159.3  -4.8   20.0  153  226  429 2tpsA0
blastpgp psipred pgenthreader • 207 views
ADD COMMENTlink modified 7 months ago • written 7 months ago by shalinikaushik12930

Sounds like its taking too long to me. What size databases and query sequences are you using? If its got to search a large database, or the protein itself is big, it could add to the time taken.

What hardware do you have? I don't know which, if any, of those programs is using inherent multithreading, but you could cut down your total time for all 7000 by using GNU parallel to parallelise the workflow.

ADD REPLYlink written 7 months ago by Joe17k

My query is :

>NP_000160.1 alpha-galactosidase A precursor [Homo sapiens]
MQLRNPELHLGCALALRFLALVSWDIPGARALDNGLARTPTMGWLHWERFMCNLDCQEEPDSCISEKLFM
EMAELMVSEGWKDAGYEYLCIDDCWMAPQRDSEGRLQADPQRFPHGIRQLANYVHSKGLKLGIYADVGNK
TCAGFPGSFGYYDIDAQTFADWGVDLLKFDGCYCDSLENLADGYKHMSLALNRTGRSIVYSCEWPLYMWP
FQKPNYTEIRQYCNHWRNFADIDDSWKSIKSILDWTSFNQERIVDVAGPGGWNDPDMLVIGNFGLSWNQQ
VTQMALWAIMAAPLFMSNDLRHISPQAKALLQDKDVIAINQDPLGKQGYQLRQGDNFEVWERPLSGLAWA
VAMINRQEIGGPRSYTIAVASLGKGVACNPACFITQLLPVKRKLGFYEWTSRLRSHINPTGTVLLQLENT
MQMSLKDLL

and I am using Uniref100 database of size 104.9 GB. But Blastpgp doesn't take that much of time in the script. I am using Linux (Ubuntu) with OS type 64 bit, having processor Intel® Xeon(R) Gold 5118 CPU @ 2.30GHz × 24 (disk space 398.3 GB with external hard disk of 2TB).

Can you please explain me more about GNU pannel?

ADD REPLYlink written 7 months ago by shalinikaushik12930

Do you know the exact parts of the script, and the sub programs, where most time is being spent? You need to narrow down the specific step/task so that we can understand what is the limiting factor.

Your hardware seems reasonable, so its unlikely to be a bottleneck there. You say an external hard disk though - is this where you're storing your database? Is the external disk a spinning-disk drive or solid state, and what USB interface is it using (USB A Gen3, USB C etc).

GNU parallel is a common Linux tool which can run multiple files in batches continuously. If you have 32 cores say, parallel will intelligently maximise the number of jobs that can be run on those cores, so in the time it takes you to do 1, you could do as many as 32 or even 64. If your run times are still ~10 hours per sequence though, this won't get you far with 7000.

ADD REPLYlink written 7 months ago by Joe17k

Sorry, the hard drive is internally mounted. I am trying to use GNU parallel. Thank you so much for your help.

I also tried to install GPU-BLAST (to speed up the BLASTp), but on running it's install script as sh install The error is:

install: 6: install: function: not found
-ne .
-ne .

The problem is in the function (written in 6th line).. The install script which I have downloaded with GPU-BLAST is :

#!/bin/bash 

#
# Progress Indicator
#
function progress_ind {
  # Sleep at least 1 second otherwise this algorithm will consume
  # a lot system resources.
  interval=1

  while true
  do
    echo -ne "."
    sleep $interval
  done
}

#
# Stop distraction
#
function stop_progress_ind {
  exec 2>/dev/null
  kill $1
  echo -en "\n"
}

ncbi_blast_version="ncbi-blast-2.2.28+-src"
working_dir=`pwd`
nvcc=`which nvcc`

if [ $? -ne 0 ]; then
    echo -e "\nError: nvcc cannot be found. Exiting..."
    exit 1
fi

echo "Do you want to install GPU-BLAST on an existing installation of \"blastp\" [yes/no]"
echo "yes: you will be asked for the installation directory of the \"blastp\" executable"
echo "no: will download and install \"${ncbi_blast_version}\""
read user_input

if [ $user_input == "yes" ]; then
    echo "Please input the installation directory of \"blastp\" of \"${ncbi_blast_version}\""
    read installed_dir
    #check if the executable "blastp" exists in the given directory
    if [ ! -x $installed_dir/blastp ]; then
    echo "Executable \"blastp\" cannot be found"
    echo "Exiting..."
    exit 0
    else
    blastp_version=`$installed_dir/blastp -version | grep blastp | awk '{print $NF}'`
    if [ $blastp_version == "2.2.28+" ]; then 
        echo "\"blastp\" version $blastp_version is compatible"
    else
        echo "\"blastp\" version $blastp_version is not supported"
        echo "Exiting..."
        exit 0
    fi
    fi
    #check if the source file "blast_engine.c" as a first check that the source code exists
    if [ ! -f "$installed_dir/../../src/algo/blast/core/blast_engine.c" ]; then
    echo "Source code not detected."
    echo "You must have the source code of \"$ncbi_blast_version\" to add GPU-BLAST"
    echo "Exiting..."
    exit 0
    else
    echo "Continuing with the installation of GPU-BLAST..." 
    download=0
    fi
elif [ $user_input == "no" ]; then
    echo "Continuing with the downloading of $ncbi_blast_version..."
    download=1
else
    echo "Please enter \"yes\" or \"no\""
    echo "Exiting.."
    exit 0
fi

if [ $download == 1 ]; then


#Download the version 2.2.28 of NCBI BLAST
    echo "Downloading NBCI BLAST"
    wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.28/${ncbi_blast_version}.tar.gz
    if [ $? -ne 0 ]; then
    echo -e "\nError downloading ${ncbi_blast_version}. Verify that the file exists in the server. Exiting..."
    exit 1
    fi
    echo -e "\nExtracting NCBI BLAST"
    progress_ind &
    pid=$!
    trap "stop_progress_ind $pid; exit" INT TERM EXIT

    tar -xzf ${ncbi_blast_version}.tar.gz

    stop_progress_ind $pid

#Move to the c++ directory which contains the configuration script
    cd ${working_dir}/${ncbi_blast_version}/c++/

#Configure NCBI BLAST
    progress_ind &
    pid=$!
    trap "stop_progress_ind $pid; exit" INT TERM EXIT
    configuration_options="./configure --without-debug --with-mt --without-sybase --without-ftds --without-fastcgi --without-ncbi-c --without-sssdb --without-sss --without-geo --without-sp --without-orbacus --without-boost"

I have downloaded the software from http://archimedes.cheme.cmu.edu/?q=gpublast and refer the paper : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3018811/. I am just stuck what to do. Note: I have cut some part of code because maximum 5000 words I can post. If you need to check the whole code, check by installing the software by above link. It's just of 1.4 MB.

ADD REPLYlink written 7 months ago by shalinikaushik12930

I'm not familiar with GPU-BLAST myself, but I would guess its because you're trying to run a bash script with sh.

Try bash install.sh.

Bash supports some syntax that sh doesn't so it might be complaining about that.

ADD REPLYlink written 7 months ago by Joe17k
0
gravatar for shalinikaushik1293
7 months ago by
shalinikaushik12930 wrote:

Is it necessary to use pfilt on uniref100 database before we use BLAST on the query?

The information related to pfilt is there in the given link:

http://bioinfadmin.cs.ucl.ac.uk/downloads/pfilt/

Anyone please suggest.

ADD COMMENTlink written 7 months ago by shalinikaushik12930
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1598 users visited in the last hour