Question: Problem with ks_debruijn4 on large dataset in KisSplice
6 months ago, david.b.rombaut wrote:

Hello,

For differential splicing analysis (on human samples), I have been running KisSplice successfully on several datasets. Recently, however, a particularly heavy batch of RNA-seq data has been causing me trouble.

The largest analysis so far was a total of 60 paired-end samples, with about 40M reads per sample (up to 110M reads for 3 samples). I don't quite remember the total size of the uncompressed fastq files, but I think it was around 600 GB. At first, disk space was an issue (KisSplice needs free space of about 5 times the input volume, if I remember correctly), but a new 12 TB HDD solved this. I always run it with default parameters, but for this batch the timeout value was reached (the final results with KissDE were good nonetheless). This run still took 13 days to complete on an Intel Xeon E5-2609 v4 with 64 GB of RAM.

kissplice -t 16 -r ... -d ... -o ...

One thing I don't quite understand is that the job still seems to run on only one core even with -t 16, or rather, on only one of the 16 CPUs shown in the Ubuntu system monitor. Intel's specifications say this Xeon has 8 cores and 8 threads, so I may be missing something here, as this is not my area of expertise.

The error comes from the dataset I am currently analysing. There are 32 paired-end samples, ~120M reads per sample, for a total of 1.6 TB of uncompressed fastq. It ran for a whole month before stopping with:

Problem with /usr/local/libexec/kissplice/ks_debruijn4

And that's it. There is nothing more in the log than the input command. The console did show a few things that I don't usually get on smaller datasets ([...] marks where I cut a LOT of content):

[09:59:21 26/08/2019] --> Building de Bruijn graph...
Graph will be written in /[...].[edges/nodes]
taille cell 32 
Sequentially counting ~5655653 MB of kmers with 189 partition(s) and 42 passes using 1 thread(s), ~1024 MB of memory and ~137841 MB of disk space
| First step: Converting input file into Binary format                                               |
[-------------------------------------------------------------------------------------------]
| Counting kmers                                                                                     |
1      %     elapsed:    617 min 46    sec      estimated remaining:  61158 min

[...]

100    %     elapsed:  31632 min 58    sec      estimated remaining:      0 min 0     sec 
-------------------Counted kmers time Wallclock  1.90888e+06 s

------------------ Counted kmers and kept those with abundance >=2,     
 Writing positive Bloom Kmers 2867940000
6867663442 kmers written
-------------------Write all positive kmers time Wallclock  14315.3 s
Build Hash table 26840000End of debloom partition  26843546 / 26843545 

6811627761 false positives written , partition 0 
Build Hash table 53680000End of debloom partition  26843546 / 26843545 

[...]

927413959 false positives written , partition 105 
Build Hash table 2867940000Total nb false positives stored in the Debloom hashtable 880386881 
-------------------Debloom time Wallclock  364595 s
Insert solid Kmers in Bloom 2867940000-------------------build DBG time Wallclock  384405 s
______________________________________________________ 
_______________________________________ minigraph_____ 
______________________________________________________ 

Extrapolating the number of branching kmers from the first 3M kmers: 150153807
Looping through branching kmer n° 431379600 / 431379809     
-------------------nodes construction time Wallclock  31794.9 s

Problem with /usr/local/libexec/kissplice/ks_debruijn4

Right at the beginning: "using 1 thread(s), ~1024 MB of memory". Is this normal? None of this extra information ever showed up for the other datasets.

I tried a run with only 4 of these samples, and it went without a hitch (but it still took something like 50 hours).

I don't understand what the problem could be. I have plenty of free space in the temp directory (a little more than 11 TB), and 64 GB of RAM plus 10 GB of swap. The computer never seemed particularly stressed: one CPU was at 100% and the other 15 below 5%, and RAM usage never went beyond 8 or 9 GB (I can't be sure of this, but I never got an out-of-memory message, which I did get while running some other tools, though not at the same time as KisSplice).

I almost forgot: before each run, I do:

ulimit -s unlimited

This sets the stack size to unlimited; otherwise it is 8192 kB by default. (Not doing this was a problem in the first analysis I did with KisSplice, so now it is mandatory.)
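For readers who want to confirm that the stack limit really is in effect before launching a run that lasts weeks, here is a minimal sketch using Python's standard `resource` module (Unix only). The helper name `describe_limit` is made up for this illustration and is not part of KisSplice. Run it from the same shell in which `ulimit -s unlimited` was issued, since the limit is inherited by processes started from that shell.

```python
import resource


def describe_limit(limit_bytes: int) -> str:
    """Format a stack limit given in bytes, in the style of `ulimit -s` (kB)."""
    if limit_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit_bytes // 1024} kB"


if __name__ == "__main__":
    # getrlimit returns a (soft, hard) pair for the current process.
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    print("stack soft limit:", describe_limit(soft))
    print("stack hard limit:", describe_limit(hard))
```

If the soft limit still shows a finite value such as 8192 kB, the `ulimit` call was probably made in a different shell session than the one launching the job.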

EDIT: I have tried again with the -z and -C (0.10) options, but a power cut made me lose 3 weeks of analysis... I will try again, maybe.

EDIT2: I now have access to a computer cluster, but the storage space is largely insufficient (both for the fastq files and for the working directory). So, back to square one.

Thank you in advance for any help I could get with this!

5 months ago, vincent.lacroix wrote:

Dear David,

Apologies for the slow reply, I will try to answer your questions.

Indeed, KisSplice 2.4.0 does not scale well to very large datasets. We are aware of this and are working on it.

Version 2.5.0 is not completely ready yet, but I can send you a beta version by email if you want to try it. Graph building is now based on bcalm (https://github.com/GATB/bcalm) and is much faster. I expect it will solve your current problem ("Problem with /usr/local/libexec/kissplice/ks_debruijn4"). I must however warn you that bubble quantification will still be slow. We have some ideas to improve this as well, but have not had time to implement them.

I must say I am impressed you already managed to run KisSplice on 60 samples of 40M reads. The fact that only 1 CPU is working instead of 16 is indeed a shame. The parallelisation would be much better if the graph could be split into biconnected components (bcc) of equal size, since we enumerate bubbles in each bcc in parallel. In practice, however, there is always a giant biconnected component, and our parallelisation is therefore quite inefficient at this step. The reason this giant component exists is that many mRNAs are not yet fully spliced when sampled (even if you select polyA+ RNAs), and the repetitive sequences contained in their introns (especially Alu sequences) tend to glue genes together, when we would like to have one gene per component.
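To make the effect of a giant component concrete, here is a small illustrative sketch. It assumes, purely for illustration, that the work for a component is proportional to its size and that one component cannot be split across threads; the function name and the toy sizes are hypothetical and are not taken from KisSplice.

```python
def speedup_upper_bound(component_sizes, threads):
    """Upper bound on parallel speedup when each component runs on one thread.

    Assumption (illustrative only): work is proportional to component size.
    The parallel run can finish no sooner than the largest component takes,
    and no sooner than the total work divided evenly over all threads.
    """
    total = sum(component_sizes)
    largest = max(component_sizes)
    best_parallel_time = max(largest, total / threads)
    return total / best_parallel_time


if __name__ == "__main__":
    # One giant component holding 90% of the work, plus two small ones.
    print(speedup_upper_bound([90, 5, 5], threads=16))
    # Sixteen components of equal size: every thread stays busy.
    print(speedup_upper_bound([10] * 16, threads=16))
```

Under these assumptions, a component holding 90% of the work caps the speedup at about 1.1 no matter how many threads are requested, which matches the picture of one CPU at 100% while the others sit idle.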

The fact that the timeout was reached for your 60-sample run indicates that some bubbles in the giant component were not enumerated. You may therefore have missed some alternative splicing events. For this issue, I would suggest you use the --experimental option. It corresponds to an alternative approach to enumerating bubbles, which is quite naive and is not expected to perform well on all types of graphs, but appears to be really efficient on all datasets we have tested so far. The --experimental option is the one we used here: http://kissplice.prabi.fr/pipeline_ks_farline/ and it will be set by default in KisSplice 2.5.0.

Best regards,

Vincent

4 months ago, david.b.rombaut wrote:

Hello,

Sorry for the late answer/update. As stated in my reply to your email, I am definitely interested in an updated version of KisSplice. If I have some free CPU time before I get the chance to test the new version, I will try the --experimental option. I will report here whether either the new version or the experimental option makes a difference.

Thank you!

10 weeks ago, david.b.rombaut wrote:

Hello,

It is time for the update, and the good news: the 2.5.0-beta version of KisSplice was the answer to my problem.

As a reminder, our (large) dataset consists of 32 paired-end samples, 120M reads per sample, and 1.6 TB of uncompressed fastq.

The specs of the computer (Dell Precision 7810) are as follows:

  • CPU: 2 Intel Xeon E5-2609 v4 (16 cores total)
  • RAM: 64 GB
  • HDD: 12 TB of free space for the run.

With KisSplice 2.4, the run took an entire month before finally crashing, and there appeared to be no multithreading support. The 2.5.0-beta version brought a considerable reduction in execution time (with multithreading support), and it finished without crashing. To be fair, I still had to use -c 10 (instead of the default -c 5) to avoid failure (I am currently testing with slightly more RAM to see whether -c 5 is possible, but I don't expect a significant difference in the results between 10 and 5).

Here is a comparison of the two steps that benefit from the update. Keep in mind that it is slightly inaccurate, as the (successful) run of version 2.4 took place on a computer cluster that was clearly more powerful than our local computer. It simply serves to highlight the efficiency of the new beta version on a computer that is not (too) expensive.

  • Building de Bruijn graph...
    • v2.4.0 --> 136h
    • v2.5.0-beta --> 18h
  • Enumerating all bubbles...
    • v2.4.0 --> 77h
    • v2.5.0-beta --> 5h
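As a rough reading of the timings listed above, the ratios can be computed as follows. This is only a sketch: the two versions ran on different machines, so the figures are indicative and not a controlled benchmark. The variable and function names are made up for this example.

```python
# Wall-clock hours reported in this thread: (v2.4.0, v2.5.0-beta).
timings_hours = {
    "Building de Bruijn graph": (136, 18),
    "Enumerating all bubbles": (77, 5),
}


def speedup(old_hours, new_hours):
    """Ratio of old to new wall-clock time; values above 1 mean faster."""
    return old_hours / new_hours


if __name__ == "__main__":
    for step, (old, new) in timings_hours.items():
        print(f"{step}: {speedup(old, new):.1f}x faster")
```

That is roughly 7.6x for graph construction and 15.4x for bubble enumeration on these two steps.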

I will make an update, if I manage to run it with -c 5.

EDIT: I forgot to mention that, in total, the run took about 5 days to complete, which seems reasonable for a (local) assembler on such a large dataset.

EDIT2: I was able to run it with the default -c 5 just by increasing the swap size, so it was just a memory problem. It took approximately 2 more days to complete, so 7 in total.

Thanks!

2 days ago, vincent.lacroix wrote:

Dear David and all,

KisSplice v2.5.0 is now available for all users and can be downloaded here: http://kissplice.prabi.fr. It is indeed much faster for graph construction thanks to the integration of bcalm (https://github.com/GATB/bcalm).

For large datasets, we advise users to increase the value of the -c parameter. This speeds up the run considerably, at the expense of losing rare variants.

By default, -c is set to 2. This means we report variants composed of kmers seen at least twice in the full dataset (i.e. across all samples). This setting optimises sensitivity.

When using 10 samples, setting -c 10 may still be reasonable. It will filter out variants locally supported by fewer than 10 reads. Those 10 reads can all be in one sample, or spread across the 10 samples.
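To illustrate what a minimum k-mer count pooled over all samples means, here is a toy sketch. It is not KisSplice's implementation: the function names and the sample data are hypothetical, everything is held in memory, and reverse complements are ignored, which a real k-mer counter would typically handle.

```python
from collections import Counter


def kmers(read, k):
    """Yield every substring of length k from a read."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def solid_kmers(samples, k, min_count):
    """Keep k-mers seen at least min_count times over ALL samples pooled.

    `samples` maps a sample name to a list of reads. Counts are summed
    across samples, so the supporting reads may all come from one sample
    or be spread over several.
    """
    counts = Counter()
    for reads in samples.values():
        for read in reads:
            counts.update(kmers(read, k))
    return {kmer for kmer, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    toy = {"sample1": ["ACGTAC"], "sample2": ["ACGTTT"]}
    print(sorted(solid_kmers(toy, k=4, min_count=2)))
    print(sorted(solid_kmers(toy, k=4, min_count=1)))
```

In this toy input only ACGT occurs in both reads, so raising the threshold from 1 to 2 discards the other four k-mers. That is the same trade-off described above: fewer k-mers to process, at the cost of rare variants.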

Vincent
