InterProScan gets stuck at "90% completed"
7.0 years ago
wangdp123 ▴ 340

Hi,

Whenever I run InterProScan on about 10000 protein sequences, it gets stuck at "90% completed".

Could you please help me with this?

Many thanks,

Regards,

Tom


Hi, please be more specific. What does stuck at 90% completed mean? Is there any message, did you check for running processes with top or ps? Which version of Interproscan are you using on which operating system and which infrastructure? Was the disk full? I assume that there were some of the long running tools still running when you interrupted the process. The time to complete 10k sequences can be several days depending on your machine size, and the % indication is not always a good estimate.
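The process and disk checks asked about above can also be collected non-interactively, which helps inside a batch job where `top` is awkward to use. This is a minimal sketch, assuming a Linux host; `ips_status` is a made-up helper name, not part of InterProScan:

```shell
# ips_status: hypothetical helper, not shipped with InterProScan.
# Prints the current user's processes and the free space on the
# filesystem that holds the given directory (default: current directory).
ips_status() {
    dir="${1:-.}"
    echo "== processes of $(id -un) =="
    # pid, elapsed time, %CPU, %MEM and full command line.
    # Do not abort the report if ps is unavailable on this host.
    ps -u "$(id -un)" -o pid,etime,pcpu,pmem,args || true
    echo "== disk usage for $dir =="
    df -h "$dir"
}

# Example (run from the directory the job writes to):
#   ips_status . > status_$(date +%Y%m%d_%H%M).txt
```

Running it a few times, some hours apart, shows whether any analysis processes are still accumulating CPU time or whether everything is idle.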


Hi,

Please have a look at the following log file, taken while InterProScan is running. The InterProScan version is 5.23-62.0 and the operating system is CentOS release 6.6 (Final). I submitted the job script with qsub, requested 16 cores for the job, and used the default interproscan.properties file provided with the InterProScan package. I am sure the disk is not full. I have tried many times with different numbers of protein sequences as input (5000, 10000, 30000); each time, when the run reached "90% completed", the program hung there and no new result files were generated after that point. I expected it to finish within 5 days using 16 cores for no more than 30000 proteins, but it didn't.

It is quite odd, since I have checked with the server administrator that this is not a maximum memory usage issue.

By contrast, when I input 10 protein sequences with the same command line, it works fine. I am wondering whether InterProScan uses two different mechanisms for small and large numbers of proteins, which might lead to this issue?

Many thanks,

Tom


Sat Apr 22 04:48:49 BST 2017
22/04/2017 04:48:53:798 Welcome to InterProScan-5.23-62.0
22/04/2017 04:49:05:053 Running InterProScan v5 in STANDALONE mode... on Linux
22/04/2017 04:49:14:115 Loading file pep.fa
22/04/2017 04:49:14:135 Running the following analyses:
[CDD-3.14,Coils-2.2.1,Gene3D-4.1.0,Hamap-201701.18,MobiDBLite-1.0,Pfam-30.0,PIRSF-3.01,PRINTS-42.0,ProDom-2006.1,ProSitePatterns-20.132,ProSiteProfiles-20.132,SFLD-2,SMART-7.1,SUPERFAMILY-1.75,TIGRFAM-15.0]
Available matches will be retrieved from the pre-calculated match lookup service.
Matches for any sequences that are not represented in the lookup service will be calculated locally.
22/04/2017 04:51:43:636 Uploaded/Stored 10799 sequences for analysis
22/04/2017 05:53:07:009 25% completed
22/04/2017 06:32:50:993 50% completed
22/04/2017 06:39:42:676 75% completed
22/04/2017 06:48:50:980 90% completed



Is pep.fa a single multifasta? Is it reaching 90% of a single file, or does it successfully analyse up to 90% of your proteins? Have you tried splitting the multifasta up and running 10,000 short jobs instead?
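Splitting does not have to mean one file per protein. As an illustration, here is a rough sketch that cuts a multifasta into chunks of at most N records each; `split_fasta` and the `chunk` prefix are hypothetical names, and the chunk size is something to experiment with:

```shell
# split_fasta: hypothetical helper. Writes chunks of at most N records
# from a multi-FASTA file to PREFIX_0001.fa, PREFIX_0002.fa, ...
# Usage: split_fasta INPUT N PREFIX
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Start a new chunk file every n headers.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s_%04d.fa", prefix, ++chunk)
            }
            count++
        }
        # Lines before the first header are skipped (out is still empty).
        out != "" { print > out }
    ' "$1"
}

# Example:
#   split_fasta pep.fa 1000 chunk
#   for f in chunk_*.fa; do ./interproscan.sh -i "$f"; done
```

With chunks, a hang also becomes easier to localize, because only the chunk that stalls needs to be rerun or inspected.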


Yes, pep.fa is a single multifasta file. I have tried a smaller set of 5000 proteins to test the program, but it ran into the same issue. No final results (such as tsv, gff and so on) have been generated up to this step (90% completed); there are only some files in the "temporary" directory. Thus, I think no usable results for any protein will come out unless the run reaches 100%. I don't think splitting the 10000 sequences into 10000 independent single-fasta files is a good idea, since it would mean running InterProScan 10000 times, and I believe InterProScan is designed to support a multifasta file as input. What do you think?


Yeah, it does seem silly. If InterProScan supports a multifasta, it should be capable of running on all of the sequences. My only suggestion would be to try progressively larger datasets, working up from a number of proteins you know will work (maybe try 10, 100, 1000 and 5000 proteins), and see where it breaks. I would expect InterProScan to write an error file if it is encountering any errors, but you could consider redirecting the STDERR stream into a file (2>file.txt), in case it is throwing errors you aren't actually seeing yet. Something else to consider: could one of the protein FASTA records in the file be invalid in some way? Perhaps run your multifasta through some other FASTA parsers and make sure it behaves as expected.
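A rough sketch of both suggestions, using only awk. `fasta_head` and `fasta_check` are made-up helper names, and the check is deliberately crude: it assumes that anything other than plain letters in a sequence line is worth a look, which may be stricter or looser than what InterProScan itself accepts.

```shell
# fasta_head: print the first N records of a multi-FASTA file.
# Usage: fasta_head INPUT N
fasta_head() {
    awk -v n="$2" '/^>/ { if (++seen > n) exit } seen { print }' "$1"
}

# fasta_check: report suspicious lines in a protein FASTA file:
# empty headers, sequence data before the first header, or sequence
# lines containing anything other than letters (digits, '*', spaces,
# Windows line endings, ...). Returns 1 if anything was reported.
fasta_check() {
    awk '
        /^>/ {
            seen = 1
            if ($0 ~ /^>[ \t\r]*$/) { printf "line %d: empty header\n", NR; bad = 1 }
            next
        }
        /^[ \t\r]*$/ { next }   # ignore blank lines
        !seen { printf "line %d: sequence data before first header\n", NR; bad = 1; next }
        /[^A-Za-z]/ { printf "line %d: unexpected character\n", NR; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example: test increasing subsets and keep STDERR from each run.
#   fasta_check pep.fa
#   for n in 10 100 1000 5000; do
#       fasta_head pep.fa "$n" > "pep_$n.fa"
#       ./interproscan.sh -i "pep_$n.fa" 2> "pep_$n.stderr.txt"
#   done
```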


Please try the following:

  • Get an interactive login on one of the hosts, then try the following commands with the test protein file that comes with InterProScan:

    ./interproscan.sh -i test_proteins.fasta # check with remote lookup

    and

    ./interproscan.sh -i test_proteins.fasta -dp # check without remote lookup

    Both commands should terminate without error and produce the result files (tsv, xml, etc.).

  • Then, in case the program hangs again at 90%, check with top -u username which processes are running.


Hi! Have you been able to resolve this issue? I'm in the same situation now: InterProScan ran for a couple of days with 10 threads, and now there is just one "java" thread. It has been stuck at 90% for another five days, and it's still there.


I think that this could be the final summarization of results, which is mainly IO. Note that the running time of InterProScan for a medium-sized genome is normally measured in weeks. So you just have to be patient.
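One way to tell "slow but still working" from "hung" is to see whether any files in the working area have changed recently. A minimal sketch, assuming a `find` that supports `-mmin` (GNU and BSD versions do); `recent_files` is a made-up helper and the directory name in the example is a placeholder for wherever your run keeps its temporary files:

```shell
# recent_files: list regular files under DIR that were modified
# within the last MIN minutes.
# Usage: recent_files DIR MIN
recent_files() {
    find "$1" -type f -mmin "-$2"
}

# Example: anything written in the last 30 minutes?
#   recent_files temp 30
```

If this prints nothing for hours while a single java process sits near 0% CPU, waiting longer is less likely to help.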


Had the same issue with version 5.23-62.0. With version 5.24-63.0, the run finished after a few hours.

Maybe it is because of an increased Java max heap size (-Xmx parameter) in the interproscan.sh script?
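To check that guess, the heap setting in the two launcher scripts can be compared directly. A small sketch, assuming a `grep` with `-o` support; `show_xmx` is a made-up helper and the installation paths in the example are placeholders:

```shell
# show_xmx: print every -Xmx setting found in a launcher script.
# Usage: show_xmx SCRIPT
show_xmx() {
    grep -oE -- '-Xmx[0-9]+[kKmMgG]?' "$1"
}

# Example (paths are placeholders for your two installations):
#   show_xmx interproscan-5.23-62.0/interproscan.sh
#   show_xmx interproscan-5.24-63.0/interproscan.sh
```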
