Question: ABySS - Out of memory using 64 cores (520GB)
3.7 years ago by wjar6718 • Australia

wjar6718 wrote:

Hi ABySS,

I used a HiSeq X Ten to generate M001_R1.fastq (170 GB, 450,000,000 reads) and M001_R2.fastq (170 GB, 450,000,000 reads), giving 37× coverage; 95% of reads are > Q30 and most are 150 bp long.

My compute resource is only a single machine with 64 cores and 520 GB of total memory. My simple question here is whether I can run the two FASTQ files through ABySS 1.9.0.

I ran some tests of the data using 4 and 8 cores on my computer and got the following results:

1. Used 4 cores

$ cat abysspe91.sh.o3767790
/opt/openmpi/bin/mpirun -np 4 ABYSS-P -k64 -q3 -v   --coverage-hist=coverage.hist -s FR07886691_Human_WEEJAR_R1R2-bubbles.fa  -o FR07886691_Human_WEEJAR_R1R2-1.fa

M001_R1.fastq

M001_R2.fastq 
ABySS 1.9.0
ABYSS-P -k64 -q3 -v --coverage-hist=coverage.hist -s FR07886691_Human_WEEJAR_R1R2-bubbles.fa -o FR07886691_Human_WEEJAR_R1R2-1.fa

M001_R1.fastq

M001_R2.fastq
Running on 4 processors
0: Running on host omega-0-9.local
1: Running on host omega-0-9.local
2: Running on host omega-0-9.local
3: Running on host omega-0-9.local
0: Reading `HCCJFCCXX_2_150527_FR07886691_Human__R_150526_WEEJAR_FGS_M001_R1.fastq'...
1: Reading `HCCJFCCXX_2_150527_FR07886691_Human__R_150526_WEEJAR_FGS_M001_R2.fastq'...

[cut]

0: Read 6900000 reads. 0: Hash load: 229724096 / 536870912 = 0.428 using 8.01 GB
1: Read 7000000 reads. 1: Hash load: 229433787 / 536870912 = 0.427 using 8 GB

[The job stopped here without an error message. I think it ran out of memory.]
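From the `Hash load` lines above one can estimate the hash's per-k-mer cost (a rough sketch; it assumes the reported `8.01 GB` is binary gigabytes and that the hash dominates the process's memory):

```shell
# bytes per distinct k-mer ~= memory in use / hash entries (numbers from the log)
awk 'BEGIN { printf "%.1f bytes per k-mer\n", 8.01 * 2^30 / 229724096 }'
```

At roughly 37 bytes per distinct 64-mer, the ~3 billion genomic k-mers of a human genome alone would need on the order of 100 GB, before counting the usually far more numerous sequencing-error k-mers.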

2. Used 8 cores

$ cat abysspe81_eager.sh.o3769063
/opt/openmpi/bin/mpirun -np 8 ABYSS-P -k64 -q3 -v   --coverage-hist=coverage.hist -s FR07886681_Human_WEEJAR_R1R2_T8-bubbles.fa  -o FR07886681_Human_WEEJAR_R1R2_T8-1.fa M001_R1.fastq

M001_R2.fastq 
ABySS 1.9.0
ABYSS-P -k64 -q3 -v --coverage-hist=coverage.hist -s FR07886681_Human_WEEJAR_R1R2_T8-bubbles.fa -o FR07886681_Human_WEEJAR_R1R2_T8-1.fa

M001_R1.fastq

M001_R2.fastq
Running on 8 processors
0: Running on host omega-0-17.local
1: Running on host omega-0-17.local
2: Running on host omega-0-17.local
3: Running on host omega-0-17.local
4: Running on host omega-0-17.local
5: Running on host omega-0-17.local
6: Running on host omega-0-17.local
7: Running on host omega-0-17.local
0: Reading `HCCJFCCXX_1_150527_FR07886681_Human__R_150526_WEEJAR_FGS_M001_R1.fastq'...
1: Reading `HCCJFCCXX_1_150527_FR07886681_Human__R_150526_WEEJAR_FGS_M001_R2.fastq'...

[cut]

0: Read 16800000 reads. 0: Hash load: 232538291 / 536870912 = 0.433 using 8.11 GB
1: Read 16500000 reads. 1: Hash load: 231000632 / 536870912 = 0.43 using 8.06 GB
0: Read 16900000 reads. 0: Hash load: 233529399 / 536870912 = 0.435 using 8.15 GB

[The job stopped here without an error message. I think it ran out of memory.]

From this I estimated that even with 64 cores, ABySS could load only about 132,000,000 reads of each FASTQ file, so my job would fail.
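A crude linear scaling from the 4-core run (the 4 processes had loaded about 28 million reads in total using about 32 GB) points the same way. This is only a sketch; it assumes memory grows linearly with reads loaded, which overestimates once most genomic k-mers have been seen:

```shell
# estimated total GB = (total reads / reads loaded so far) * GB in use so far
awk 'BEGIN { printf "%.0f GB\n", 900e6 / 28e6 * 32 }'
```

That is roughly double the 520 GB available on the node.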

Again, can you guys help me get a genome assembly? I would really like to use ABySS because of its high accuracy. I tried splitting each large file into 60 smaller files, but ABySS still uses the same total memory to load them. Thank you in advance.

Cheers, Weerachai

Tags: abyss, assembly
written 3.7 years ago by wjar6718

It is not clear whether "my computer", on which you ran the tests, is the same machine as the one with 64 cores and 520 GB of memory. How much memory was available for the test runs?

Also, did you perform quality checking and trimming, adapter trimming, error correction, maybe digital normalization? These steps should lower memory requirements.

written 3.7 years ago by h.mon

Thanks for this. Actually, I am thinking of doing these now. I did trim adapters, but not the other steps, as I can see that ABySS does its own quality checking.

That is interesting! In your case, how much can error correction and digital normalisation cut down the number of reads?

Cheers, weerachai

written 3.7 years ago by wjar6718

I wonder what the output of `free -g` is when ABySS hangs. That would tell you whether it actually runs out of memory, whether it has started writing to swap (the latter would explain why it seemingly hangs; it just takes forever once it starts writing to disk), or whether there is actually still enough memory left.
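A minimal sketch of such a monitor, assuming the node has procps's `free`; start it in the background from the job script and kill it when the job finishes:

```shell
# Print one timestamped snapshot of RAM and swap usage (in GB).
mem_snapshot() {
    free -g | awk -v ts="$(date '+%F %T')" '
        /^Mem:/  { ram  = "ram_used="  $3 "G/" $2 "G" }
        /^Swap:/ { swap = "swap_used=" $3 "G/" $2 "G" }
        END      { print ts, ram, swap }'
}

# In the job script:
#   while true; do mem_snapshot >> mem.log; sleep 60; done &
```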

written 3.7 years ago by Philipp Bayer

Hi Philipp,

I would like to know too, but to my knowledge it is difficult to check. I am using a cluster shared by many people, and waiting in the queue for 64 cores is reasonable for me. I have submitted my job to SGE using qsub. Actually, all compute nodes have 64 cores and 520 GB. They are probably set up to dedicate all assigned CPU resources to a job until it finishes. The details I know are as follows:

machine_type    x86_64
os_name         Linux
os_release      2.6.32-504.8.1.el6.x86_64
sys_clock       Thu, 13 Aug 2015 04:15:58 +1000
Uptime          16659 days, 18:16:09

Constant Metrics
cpu_num         64 CPUs
cpu_speed       2599 MHz
mem_total       529414720 KB
swap_total      268435456 KB
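Incidentally, mem_total divided evenly over the 64 slots comes out suspiciously close to the ~8 GB at which each ABYSS-P process died. This is only a rough check; whether SGE actually enforces a per-slot memory limit (e.g. via h_vmem) depends on how the admins configured it:

```shell
# mem_total / 64 slots, converted from KB to GiB
awk 'BEGIN { printf "%.2f GiB per slot\n", 529414720 / 64 / 1048576 }'
```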

Cheers, weerachai 

written 3.7 years ago by wjar6718

SGE lets you redirect stdout and stderr to files. Did you check them for errors?

written 3.7 years ago by h.mon

I think I had deleted the STDERR files, but I checked both; they were empty (0 bytes). Weerachai

written 3.7 years ago by wjar6718

I have never used SGE, but Torque sends these messages to the output; in fact, I got one today:

=>> PBS: job killed: mem job total 10549688 kb exceeded limit 10485760 kb

Are you redirecting the ABySS output to a file? It is difficult to troubleshoot with no output and no clues.
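For example, something like this in the submission script would merge stderr into stdout so that nothing gets lost. The `#$` directives are standard SGE; the log file name and the abyss-pe parameters here are just placeholders:

```shell
#!/bin/bash
#$ -cwd                 # run in the submission directory
#$ -j y                 # merge stderr into stdout
#$ -o abyss_run.log     # everything, including any kill messages, lands here
./abyss-pe np=$NSLOTS k=64 name=test in='M001_R1.fastq M001_R2.fastq'
```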


written 3.7 years ago by h.mon

Thanks h.mon for your great support,

1. I reran the test and got these:

-rw-r--r-- 1 weejar HumanComparativeandProstateCancerGe            0 Aug 13 16:56 abysspe81_64_nslots_Q10q15k70.sh.e3781043
-rw-r--r-- 1 weejar HumanComparativeandProstateCancerGe            0 Aug 13 18:07 abysspe81_64_nslots_Q10q15k70.sh.e3781361
-rw-r--r-- 1 weejar HumanComparativeandProstateCancerGe        10910 Aug 13 18:41 abysspe81_64_nslots_Q10q15k70.sh.o3781361
-rw-r--r-- 1 weejar HumanComparativeandProstateCancerGe            0 Aug 13 18:07 abysspe81_64_nslots_Q10q15k70.sh.pe3781361
-rw-r--r-- 1 weejar HumanComparativeandProstateCancerGe            0 Aug 13 18:07 abysspe81_64_nslots_Q10q15k70.sh.po3781361

-bash-4.1$ cat abysspe81_64_nslots_Q10q15k70.sh
#!/bin/bash
#

#

RUN_DIR=/share/Temp/weejar/fulcrum_v_043

./abyss-pe -C $RUN_DIR np=$NSLOTS v=-v k=30 n=10 name=FR07886681_R1R2 \
in='Clean2_FR07886681_Human_WEEJAR_R1_fix_kmer_q15_N0_L70_fastx_maskq10.fasta Clean2_FR07886681_Human_WEEJAR_R2_fix_kmer_q15_N0_L70_fastx_maskq10.fasta' \
aligner=bwa

As you can see, the *.pe and *.e files were empty, and the *.o file again stopped while loading reads.

2. I ran another test on the login node, without submitting to SGE, and got the following:

-bash-4.1$ ./abyss-pe -C $RUN_DIR np=$NSLOTS v=-v k=64 n=5 name=FR07886681_R1R2_login_clean3 in='Clean3_FR07886681_Human_WEEJAR_R1_fix_kmer_q15_N0_L70_fastx_maskq10_N0.fasta Clean3_FR07886681_Human_WEEJAR_R2_fix_kmer_q15_N0_L70_fastx_maskq10_N0.fasta' aligner=bwa
make: Entering directory `/share/Temp/weejar/fulcrum_v_043'
ABYSS -k64 -q3 -v   --coverage-hist=coverage.hist -s FR07886681_R1R2_login_clean3-bubbles.fa  -o FR07886681_R1R2_login_clean3-1.fa Clean3_FR07886681_Human_WEEJAR_R1_fix_kmer_q15_N0_L70_fastx_maskq10_N0.fasta Clean3_FR07886681_Human_WEEJAR_R2_fix_kmer_q15_N0_L70_fastx_maskq10_N0.fasta 
ABySS 1.9.0
ABYSS -k64 -q3 -v --coverage-hist=coverage.hist -s FR07886681_R1R2_login_clean3-bubbles.fa -o FR07886681_R1R2_login_clean3-1.fa Clean3_FR07886681_Human_WEEJAR_R1_fix_kmer_q15_N0_L70_fastx_maskq10_N0.fasta Clean3_FR07886681_Human_WEEJAR_R2_fix_kmer_q15_N0_L70_fastx_maskq10_N0.fasta
Reading `Clean3_FR07886681_Human_WEEJAR_R1_fix_kmer_q15_N0_L70_fastx_maskq10_N0.fasta'...
Read 100000 reads. Hash load: 7964774 / 1073741824 = 0.00742 using 721 MB
Read 200000 reads. Hash load: 15531331 / 1073741824 = 0.0145 using 1.04 GB
[cut]
Read 14500000 reads. Hash load: 880548730 / 4294967296 = 0.205 using 32.5 GB
Read 14600000 reads. Hash load: 885521364 / 4294967296 = 0.206 using 32.7 GB
Read 14700000 reads. Hash load: 890496836 / 4294967296 = 0.207 using 32.8 GB
Read 14800000 reads. Hash load: 895432212 / 4294967296 = 0.208 using 33 GB
Read 14900000 reads. Hash load: 900378123 / 4294967296 = 0.21 using 33.2 GB

make: *** [FR07886681_R1R2_login_clean3-1.fa] Killed
make: Leaving directory `/share/Temp/weejar/fulcrum_v_043'
-bash-4.1$

It seems clear to me that this is about memory usage, considering the total memory of the login node:
machine_type    x86_64
os_name    Linux
os_release    2.6.32-504.8.1.el6.x86_64
sys_clock    Fri, 14 Aug 2015 11:59:43 +1000
Uptime    44 days, 23:18:50
Constant Metrics
cpu_num    8 CPUs
cpu_speed    2599 MHz
mem_total    33014604 KB
swap_total    16777212 KB

Cheers, Weerachai

 

written 3.7 years ago by wjar6718

It could be that the system administrators set limits on the memory available to users of the login node, considering that it's just for submitting jobs. So the node may have enough memory, but you're not allowed to use it. Back at UQ I got angry automated emails when I ran tasks on login nodes...

`ulimit -a` may tell you more about your allowed limits, but I wouldn't use the login node for anything.

Can you ssh into your computing node while the job is running?

written 3.7 years ago by Philipp Bayer

I have no idea of the IP addresses or ssh-able hostnames of the compute nodes, and I don't think the admin would give them to me. Thanks anyway, Philipp.

written 3.7 years ago by wjar6718


Last login: Thu Aug 13 18:36:05 2015 from 129.94.14.94
Rocks 6.0 (Mamba)
Profile built 12:24 30-Jun-2015

Kickstarted 12:37 30-Jun-2015
-bash-4.1$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 257782
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
-bash-4.1$ 

written 3.7 years ago by wjar6718

You can try to start an interactive job, but I do not know how to do that on SGE. You will attract the admins' ire if you keep running ABySS on the login node; depending on local policies and how much you abuse it, you could be blocked from using the cluster.

P.S.: have you been using the login node all this time?

modified 3.7 years ago • written 3.7 years ago by h.mon

Login nodes are okay for testing here, and the ABySS tests ran for only a few hours before they stopped. Weerachai

written 3.7 years ago by wjar6718

Have you solved your problem?

I am facing the same problem. I am going to try splitting the FASTQ files into smaller ones, but I don't know whether it is feasible to join the resulting assemblies.

written 3.4 years ago by jerviedog