Question: Problem with ks_debruijn4 on large dataset in KisSplice
6 weeks ago, david.b.rombaut wrote:

Hello,

for differential splicing analysis on human samples, I have been running KisSplice successfully on several datasets. Recently, however, a particularly heavy batch of RNA-seq data has been causing me trouble.

The largest analysis so far had 60 paired-end samples, with about 40M reads per sample (up to 110M reads for 3 samples). I don't remember the exact total size of the uncompressed fastq files, but I think it was around 600 GB. At first, HDD space was an issue (we need free space equal to 5 times the input volume, if I remember correctly), but a new 12 TB HDD solved this. I always run KisSplice with default parameters, and for this batch the timeout value was reached (the final results with KissDE were good nonetheless). The run took 13 days to complete on an Intel Xeon E5-2609 v4 with 64 GB of RAM.
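The disk-space rule of thumb above can be checked before launching a long run. The sketch below is only an illustration: the factor of 5 is the figure recalled in this post, not a documented KisSplice requirement, and the function names are made up for the example.

```python
import shutil

GB = 10**9  # decimal gigabyte, as used for disk sizes


def required_free_bytes(input_bytes, factor=5):
    """Free space needed if the working files take `factor` times the input.

    The default factor of 5 is the poster's recollection, not an official figure.
    """
    return input_bytes * factor


def has_enough_space(path, input_bytes, factor=5):
    """Compare the free space on the filesystem holding `path` with the estimate."""
    free = shutil.disk_usage(path).free
    return free >= required_free_bytes(input_bytes, factor)


if __name__ == "__main__":
    # About 600 GB of uncompressed fastq, as in the 60-sample run.
    needed = required_free_bytes(600 * GB)
    print(f"estimated free space needed: {needed / 10**12:.1f} TB")
    print("enough space in current directory:", has_enough_space(".", 600 * GB))
```

Under that assumption, 600 GB of input calls for about 3 TB of free space.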

kissplice -t 16 -r ... -d ... -o ...

One thing I don't quite understand is that the job still seems to run on only one core even with -t 16, or rather on only one of the 16 CPUs shown in the Ubuntu system monitor. Intel's specifications say this Xeon has 8 cores and 8 threads, so something may be beyond my understanding here, as I am not very experienced in this field.
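One way to reconcile "8 cores and 8 threads" with 16 CPUs in the system monitor is to count sockets: a machine with two such processors would show 16 logical CPUs. Whether this machine has two sockets is an assumption, not something stated in the post. The sketch below counts logical CPUs and distinct sockets from /proc/cpuinfo on Linux; the helper name is hypothetical.

```python
import os


def summarize_cpuinfo(text):
    """Return (logical CPU count, distinct socket count) from /proc/cpuinfo-style text."""
    logical = 0
    sockets = set()
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "processor":
            logical += 1          # one "processor" entry per logical CPU
        elif key == "physical id":
            sockets.add(value.strip())  # one distinct id per physical socket
    return logical, len(sockets)


if __name__ == "__main__":
    print("logical CPUs seen by Python:", os.cpu_count())
    try:
        with open("/proc/cpuinfo") as handle:
            print("(logical CPUs, sockets):", summarize_cpuinfo(handle.read()))
    except FileNotFoundError:
        print("/proc/cpuinfo is not available on this system")
```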

The error comes from the dataset I am currently analysing. There are 32 paired-end samples with ~120M reads per sample, for a total of 1.6 TB of uncompressed fastq. It ran for a whole month before stopping with:

Problem with /usr/local/libexec/kissplice/ks_debruijn4

And that's it. There is nothing more in the log than the input command. The console did show a few things that I don't usually get on smaller datasets ([...] marks where I cut a LOT of content):

[09:59:21 26/08/2019] --> Building de Bruijn graph...
Graph will be written in /[...].[edges/nodes]
taille cell 32 
Sequentially counting ~5655653 MB of kmers with 189 partition(s) and 42 passes using 1 thread(s), ~1024 MB of memory and ~137841 MB of disk space
| First step: Converting input file into Binary format                                               |
[-------------------------------------------------------------------------------------------]
| Counting kmers                                                                                     |
1      %     elapsed:    617 min 46    sec      estimated remaining:  61158 min

[...]

100    %     elapsed:  31632 min 58    sec      estimated remaining:      0 min 0     sec 
-------------------Counted kmers time Wallclock  1.90888e+06 s

------------------ Counted kmers and kept those with abundance >=2,     
 Writing positive Bloom Kmers 2867940000
6867663442 kmers written
-------------------Write all positive kmers time Wallclock  14315.3 s
Build Hash table 26840000End of debloom partition  26843546 / 26843545 

6811627761 false positives written , partition 0 
Build Hash table 53680000End of debloom partition  26843546 / 26843545 

[...]

927413959 false positives written , partition 105 
Build Hash table 2867940000Total nb false positives stored in the Debloom hashtable 880386881 
-------------------Debloom time Wallclock  364595 s
Insert solid Kmers in Bloom 2867940000-------------------build DBG time Wallclock  384405 s
______________________________________________________ 
_______________________________________ minigraph_____ 
______________________________________________________ 

Extrapolating the number of branching kmers from the first 3M kmers: 150153807
Looping through branching kmer n° 431379600 / 431379809     
-------------------nodes construction time Wallclock  31794.9 s

Problem with /usr/local/libexec/kissplice/ks_debruijn4
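To see where the month went, the "Wallclock" lines of the log above can be converted to days. This is a small sketch that only parses the lines quoted in this post. The stage times are not summed, because the log does not say whether the timers overlap (for example, whether "build DBG" includes "Debloom").

```python
import re

# Wallclock lines copied from the console output above.
LOG = """\
-------------------Counted kmers time Wallclock  1.90888e+06 s
-------------------Write all positive kmers time Wallclock  14315.3 s
-------------------Debloom time Wallclock  364595 s
Insert solid Kmers in Bloom 2867940000-------------------build DBG time Wallclock  384405 s
-------------------nodes construction time Wallclock  31794.9 s
"""

WALLCLOCK = re.compile(
    r"-+\s*(?P<stage>[A-Za-z ]+?) time Wallclock\s+(?P<seconds>[0-9.eE+]+) s"
)

SECONDS_PER_DAY = 86400


def stage_days(log_text):
    """Map each stage name to its reported wallclock time, in days."""
    return {
        match["stage"]: float(match["seconds"]) / SECONDS_PER_DAY
        for match in WALLCLOCK.finditer(log_text)
    }


if __name__ == "__main__":
    for stage, days in stage_days(LOG).items():
        print(f"{stage:28s} {days:6.2f} days")
```

k-mer counting alone accounts for roughly 22 days, which matches the 31632 minutes shown on the progress line.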

Starting with the beginning: is "using 1 thread(s), ~1024 MB of memory" normal? None of this extra information ever showed up for the other datasets.
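For anyone comparing runs, the resource figures on that line can be pulled out of the log automatically. The sketch below parses exactly the line quoted above. It does not say anything about whether this step honours -t, and the field names are chosen for the example.

```python
import re

# The k-mer counting line copied from the console output above.
LINE = (
    "Sequentially counting ~5655653 MB of kmers with 189 partition(s) "
    "and 42 passes using 1 thread(s), ~1024 MB of memory "
    "and ~137841 MB of disk space"
)

PATTERN = re.compile(
    r"counting ~(?P<kmers_mb>\d+) MB of kmers "
    r"with (?P<partitions>\d+) partition\(s\) "
    r"and (?P<passes>\d+) passes "
    r"using (?P<threads>\d+) thread\(s\), "
    r"~(?P<memory_mb>\d+) MB of memory "
    r"and ~(?P<disk_mb>\d+) MB of disk space"
)


def parse_counting_line(line):
    """Return the numeric fields of the counting line as a dict, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    print(parse_counting_line(LINE))
```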

I tried a run with only 4 of these samples, and it went without a hitch (although it still took about 50 hours).

I don't see what the problem could be. I have plenty of free space in the temp directory (a little more than 11 TB) and 64 GB of RAM plus 10 GB of swap. The computer never seemed particularly stressed: one CPU was at 100% and the other 15 were below 5%, and RAM usage never went beyond 8 or 9 GB. I can't be sure of that last point, but I never got a message about insufficient RAM, which I did get while running some other tools (not at the same time as KisSplice, of course).

I almost forgot: before each run, I do:

ulimit -s unlimited

This sets the stack size to unlimited; otherwise it is 8192 kB by default. Not doing this was a problem in the first analysis I did with KisSplice, so now I always do it.
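When the run is launched from a script rather than an interactive shell, the same effect can be obtained with Python's resource module. This is a minimal sketch for Linux: it raises the soft stack limit to the hard limit, which equals "unlimited" only if the hard limit itself is unlimited. The function names are made up for the example, and the command passed in is whatever you would normally type.

```python
import resource
import subprocess
import sys


def raise_stack_limit():
    """Raise the soft stack limit to the hard limit and return the new (soft, hard) pair."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_STACK)


def run_with_big_stack(command):
    """Run `command` (a list of arguments) with the raised stack limit applied in the child."""
    return subprocess.run(command, preexec_fn=raise_stack_limit)


if __name__ == "__main__":
    # Harmless demonstration: print the stack limit seen by a child process.
    run_with_big_stack([
        sys.executable, "-c",
        "import resource; print(resource.getrlimit(resource.RLIMIT_STACK))",
    ])
```

Because the limit is set in the child just before it starts, the launching script's own limits are left unchanged.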

EDIT: I tried again with the -z and -C (0.10) options, but a power cut made me lose 3 weeks of analysis... I will try again, maybe.

EDIT2: I now have access to a computer cluster, but its storage is far too small (both for the fastq files and for the working directory). So it's back to square one.

Thank you in advance for any help I could get with this!

7 days ago, vincent.lacroix wrote:

Dear David,

Apologies for the slow reply, I will try to answer your questions.

Indeed, KisSplice 2.4.0 does not scale well to very large datasets. We are aware of this and are working on it.

Version 2.5.0 is not completely ready yet, but I can send you a beta version by email if you want to try it. Graph building is now based on bcalm (https://github.com/GATB/bcalm) and is much faster. I expect it will solve your current problem ("Problem with /usr/local/libexec/kissplice/ks_debruijn4"). I must however warn you that bubble quantification will still be slow. We have some ideas to improve this as well, but have not had time to implement them.

I must say I am impressed you already managed to run KisSplice on 60 samples of 40M reads. The fact that only 1 CPU is working instead of 16 is indeed a shame. The parallelisation would be much better if the graph could be split into biconnected components (bcc) of equal size, since we enumerate bubbles in each bcc in parallel. In practice, however, there is always a giant biconnected component, and our parallelisation is therefore quite inefficient at this step. The rationale for the existence of this giant component is that many mRNAs are not yet fully spliced when sampled (even if you select polyA+ RNAs), and the repetitive sequences contained in their introns (especially Alu sequences) tend to glue genes together, when we would like to have one gene per component.
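The effect of a giant component on parallel speedup can be illustrated with a toy model. The sketch below is not KisSplice code: it assumes the work for each component is proportional to its size, uses invented sizes, and assigns components to workers greedily (largest first, each to the least loaded worker).

```python
import heapq


def makespan(component_sizes, workers):
    """Time for the busiest worker when components are assigned largest-first."""
    loads = [0] * workers
    heapq.heapify(loads)
    for size in sorted(component_sizes, reverse=True):
        lightest = heapq.heappop(loads)       # least loaded worker so far
        heapq.heappush(loads, lightest + size)
    return max(loads)


def speedup(component_sizes, workers):
    """Speedup over a single worker: total work divided by the parallel makespan."""
    return sum(component_sizes) / makespan(component_sizes, workers)


if __name__ == "__main__":
    balanced = [125] * 16                # 16 equal components (hypothetical sizes)
    giant = [1000] + [10] * 100          # one giant component plus many small ones
    print("balanced:", speedup(balanced, 16))
    print("giant:   ", speedup(giant, 16))
```

In this model the speedup can never exceed the total work divided by the size of the largest component, however many threads are requested, which is consistent with seeing one CPU at 100% and the others nearly idle.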

The fact that the timeout was reached for your 60-sample run indicates that some bubbles in the giant component were not enumerated. You may therefore have missed some alternative splicing events. For this issue, I would suggest you use the --experimental option. It corresponds to an alternative approach to enumerating bubbles, which is quite naive and is not expected to perform well on all types of graphs, but appears to be really efficient on all datasets we have tested so far. The --experimental option is the one we used here: http://kissplice.prabi.fr/pipeline_ks_farline/ and will be set by default in KisSplice 2.5.0.

Best regards,

Vincent
