Question

DiscoSNP++ 2.2.0 Segmentation Fault

9.9 years ago

tkitapci ▴ 60

Hi,

I am getting a segmentation fault when running discoSNP++ on a machine with 30 GB of memory. The complete terminal output is at the link below (sorry, it is too long to paste here).

https://docs.google.com/document/d/1jpooJySV1rKTQEronzGyprgdSISPwdhcbBJSDOflR_4/edit?usp=sharing

Is this a memory problem (I had a similar memory-related problem with the older version 2.1.7), meaning I need a machine with more memory, or is it something else?

Thanks a lot

Best Regards
T. Hamdi Kitapci

segmentation-fault memory discosnp • 3.8k views

updated 2.8 years ago by Ram 45k • written 9.9 years ago by tkitapci ▴ 60

Ram · Answer 1 · 2015-09-09

Could this problem be caused by the very large number of temporary files created during the run? Our cluster limits the number of files that can be created in a single directory, and exceeding that limit may be causing the problem. Is there any way to group those temporary files into separate directories (say 1000 files per directory) instead of creating them all in the same directory? Also, how many temporary files are created during a run? Is it a fixed number, or does it depend on the dataset size?

Thanks
Hamdi

Ram · Answer 2 · 2015-09-09

9.9 years ago

edrezen ▴ 730

Hello,

It clearly looks like a file system issue during the DSK k-mer counting step (see the traces and the many HDF5 errors). HDF5 seems to have some issues that need investigating: is the problem in HDF5 itself, or in the way discosnp uses HDF5?

It doesn't look like a full disk: there is plenty of free space (disk_current_dir : 157396.8 => 157 GB) relative to the amount of data to write to the output HDF5 file (kmers_nb_solid : 2143612212).

Note however the following traces:

max_file_nb                              : 32768
nb_partitions                            : 880

The first line tells how many files can be open at the same time. This number is used to compute "nb_partitions". Since "max_file_nb" is huge here (a more typical value is 1024), "nb_partitions" is huge as well, and I think we have never tried such high values.
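To see what the operating system reports on a given machine, the per-process open-file limit can be read directly. This is only a minimal sketch, assuming a Unix-like system and Python's standard resource module; that dbgh5 derives max_file_nb from exactly this limit is an assumption, not something the traces confirm, and the helper name open_file_limits is made up for illustration.

```python
import resource

def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard

soft, hard = open_file_limits()
# The soft limit is the one actually enforced; the hard limit is its ceiling.
# A value equal to resource.RLIM_INFINITY means "no limit".
print(f"soft limit: {soft}, hard limit: {hard}")
```

On a login node and on a compute node of the same cluster the two numbers can differ, so it is worth running this where the job actually runs.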

To try to understand the issue, I would suggest two things:

1. Try to limit the max_file_nb value. This value comes from the operating system. Lowering it for your own shell session should not require administrator rights (only raising it above the hard limit does); I think the ulimit shell command does the job, for instance ulimit -n 1024.
2. Try to limit the disk usage with the -max-disk parameter of the dbgh5 command. I'm not sure that DiscoSnp++ knows this option, so you should first try running dbgh5 directly with something like:

/home/cmb-02/sn1/tkitapci/software/DiscoSNP++-2.2.0-Source/build//ext/gatb-core/bin/dbgh5 \
  -in buffalo_fof.txt_removemeplease \
  -out /staging/sn1/tkitapci/NOHA/buffalo_variant_call/Buffalo_k_31_c_auto \
  -kmer-size 31 \
  -abundance-min auto \
  -abundance-max 2147483647 \
  -solidity-kind one \
  -max-disk 50000

With the second solution, you should get a lower value for nb_partitions and potentially a bigger value for nb_passes. If dbgh5 succeeds with this parameter, we will still have to understand the actual issue.
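The first suggestion (a lower open-file limit) can also be tried for a single run without touching the rest of the session. The following is a rough sketch, assuming a Unix-like system: a process is normally allowed to lower its own soft limit without special privileges. The helper name run_with_fd_limit and the probe command are hypothetical and only there for illustration; for a real test, cmd would be the dbgh5 command line quoted above.

```python
import resource
import subprocess
import sys

def run_with_fd_limit(cmd, max_open_files=1024):
    """Run cmd in a child process whose soft open-file limit is lowered."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Never request more than the current soft limit: this sketch only lowers it.
    if soft != resource.RLIM_INFINITY:
        max_open_files = min(max_open_files, soft)

    def lower_limit():
        # Executed in the child just before the command starts;
        # the parent process keeps its original limit.
        resource.setrlimit(resource.RLIMIT_NOFILE, (max_open_files, hard))

    return subprocess.run(cmd, preexec_fn=lower_limit,
                          capture_output=True, text=True)

# Harmless demonstration: the child simply reports the soft limit it sees.
probe = [sys.executable, "-c",
         "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"]
result = run_with_fd_limit(probe, max_open_files=1024)
print("soft limit seen by the child:", result.stdout.strip())
```

If the max_file_nb trace of a dbgh5 run launched this way still shows the old value, the assumption that it follows this limit is wrong and the limit has to be changed some other way.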

Can you tell us whether either of the two suggestions works, and provide the output as you did before?

updated 2.8 years ago by Ram 45k • written 9.9 years ago by edrezen ▴ 730

I tried forcing the "max_file_nb" value to 32768 myself and tested on some reads, but I could not reproduce the problem.

By the way, do you get exactly the same error if you relaunch your command?

9.9 years ago by edrezen ▴ 730

Why do you think it is a DSK problem, Erwan? It seems that the DSK step completes fine and that the HDF5 errors appear during the cascading step. This line in particular is suspicious:

H5FD_sec2_write(): file write failed:  [..] filename = 'trashme_48702_debloom_partitions.h5', [..] error message = 'No such file or directory', [..] bytes actually written = 18446744073709551615, offset = 0

157 GB of free space seems low for an analysis of 2 billion k-mers. Could you try freeing more space (around 300-400 GB free in total)?

updated 5.7 years ago by Ram 45k • written 9.9 years ago by Rayan Chikhi ★ 1.6k

Hi Rayan, you're right, it was misleading to talk about DSK; I meant the de Bruijn graph construction in general.

Normally, the size of the DSK contribution in the final HDF5 file is about 16*nbSolidKmers bytes, so in this case the DSK contribution is about 32 GB, less than the 157 GB available disk space. It means that the steps after DSK should have in theory 157-32=125GB available, which should be enough.
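The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch of the rule of thumb quoted here (16 bytes per solid k-mer); it assumes that the disk_current_dir trace is expressed in MB, which the traces do not state explicitly.

```python
BYTES_PER_SOLID_KMER = 16        # rule of thumb quoted above
kmers_nb_solid = 2_143_612_212   # from the traces
free_disk_mb = 157_396.8         # disk_current_dir from the traces (assumed to be MB)

dsk_bytes = BYTES_PER_SOLID_KMER * kmers_nb_solid
dsk_gb = dsk_bytes / 1e9         # decimal gigabytes
dsk_gib = dsk_bytes / 2**30      # binary gibibytes
remaining_gb = free_disk_mb / 1000 - dsk_gb

print(f"DSK contribution: {dsk_gb:.1f} GB ({dsk_gib:.1f} GiB)")
print(f"left for the later steps: about {remaining_gb:.0f} GB")
```

The DSK contribution comes to about 34 GB in decimal units (about 32 in binary units), so somewhere between 123 and 125 GB remain depending on the convention. The conclusion is the same either way.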

The strange part is that the issue occurs just "between" two steps of dbgh5 (debloom and branching). A lack of disk space would be expected to show up during an HDF5 write operation in the middle of a dbgh5 step, not just between two of them. The first HDF5 error, "H5Gclose(): unable to close group", also seems to indicate that the debloom step tries to release its resources correctly (including HDF5 resources) but that something then goes wrong.

@tkitapci, can you tell us how much disk space is left after the issue occurred? As Rayan suggests, you can also try to free some disk space and relaunch the command.

9.9 years ago by edrezen ▴ 730

Hi Erwan, also, I thought that the "bytes actually written" value was a red flag, but it is actually -1 printed as an unsigned 64-bit integer, which is the value expected when a write fails.
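A quick way to convince yourself of this is to reinterpret the reported number as a signed 64-bit integer. The snippet is just an illustration of the bit pattern; it relies only on the fact that POSIX write() returns -1 on error, and does not reproduce anything HDF5 does internally.

```python
reported = 18446744073709551615  # "bytes actually written" from the HDF5 trace

# It is the largest value an unsigned 64-bit integer can hold...
assert reported == 2**64 - 1

# ...and the same 8 bytes, read back as a signed 64-bit integer, give -1.
raw = reported.to_bytes(8, byteorder="little", signed=False)
as_signed = int.from_bytes(raw, byteorder="little", signed=True)
print(as_signed)  # prints -1
```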

9.9 years ago by Rayan Chikhi ★ 1.6k

Hi,

Thanks for the reply. I re-ran the command on a machine with 128 GB of memory and got the same (or a similar) error:

https://docs.google.com/document/d/18tejd1ems_CJXhnzhij9uanToJFMfe1YsDescDB3_y4/edit?usp=sharing

I am running this on our cluster. I will see how I can free more space or tell the program to use a separate disk.

9.9 years ago by tkitapci ▴ 60

One more question: how can I check where this disk_current_dir : 157396.8 => 157 GB is located? The folder where I ran the command has more than 10 TB of free space, so the program must be writing these files somewhere else. I don't know where that 157 GB of free disk comes from. Maybe it is the default directory where temporary files are written?

Thanks a lot

9.9 years ago by tkitapci ▴ 60

157 GB should correspond to the directory where the dbgh5 command is launched, in your case:

/home/cmb-02/sn1/tkitapci/software/DiscoSNP++-2.2.0-Source/build/ext/gatb-core/bin

so there is something odd if you checked that this directory has 10 TB of free space.
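One way to check how much free space the file system behind a given directory really has is sketched below, using Python's standard shutil.disk_usage. The helper name free_space_gb is made up for illustration, and the value is reported in decimal GB, which may not be the unit dbgh5 uses in its traces.

```python
import shutil

def free_space_gb(path):
    """Free space, in decimal GB, on the file system that holds path."""
    return shutil.disk_usage(path).free / 1e9

# "." is the directory the script is launched from; replace it with any
# directory of interest (the dbgh5 bin directory, the -out directory, ...).
print(f"{free_space_gb('.'):.1f} GB free")
```

Running it on the directory above and on the output directory should show quickly whether the two sit on different file systems.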

By default, all temporary files will be created in this directory. It is possible to force dbgh5 to use a specific directory for temporary files (option -out-tmp X). So, you could try the following line, where XXX is a directory that has a lot of free disk space.

/home/cmb-02/sn1/tkitapci/software/DiscoSNP++-2.2.0-Source/build//ext/gatb-core/bin/dbgh5 \
  -in buffalo_fof.txt_removemeplease \
  -out /staging/sn1/tkitapci/NOHA/buffalo_variant_call/Buffalo_k_31_c_auto \
  -kmer-size 31 \
  -abundance-min auto \
  -abundance-max 2147483647 \
  -solidity-kind one \
  -out-tmp XXX

Once you have launched the command, you could check where the temporary files are actually written (they look like trashme_PID_dsk_partitions.parts, where PID is the process id).
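A small sketch for that check, which also shows how many temporary entries a run creates in one directory. It assumes only the trashme_ prefix mentioned above; the helper name find_temp_files is hypothetical.

```python
from pathlib import Path

def find_temp_files(directory, prefix="trashme_"):
    """Return the entries of directory whose names start with prefix, sorted."""
    return sorted(p for p in Path(directory).iterdir()
                  if p.name.startswith(prefix))

# Look in the current directory; pass the -out-tmp directory instead if one is used.
entries = find_temp_files(".")
print(f"{len(entries)} temporary entries found")
for entry in entries:
    print(entry)
```

Running it a few times while dbgh5 is working gives an idea of how the count grows during the run.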

updated 2.8 years ago by Ram 45k • written 9.9 years ago by edrezen ▴ 730

Thanks for the reply. In my case all the trashme_* files are created in the directory that I specify with the -p option. There is about 150 TB of free space on that disk, so space is clearly not an issue. I think some sort of file system problem was causing this error (which may be related to the number of files allowed in a directory). I changed my output directory to another disk and so far it is running fine. Thanks!

updated 2.8 years ago by Ram 45k • written 9.9 years ago by tkitapci ▴ 60

I have solved the problem. It was most likely related to the file system I was using, because of the large number of files open at the same time. I moved my output to a different file system and now it runs fine.

Thanks
Hamdi

updated 5.7 years ago by Ram 45k • written 9.8 years ago by tkitapci ▴ 60