Running metaSPAdes with 71M paired-end reads: RAM and swap memory concerns
7 weeks ago

Hi, I am a beginner in shotgun metagenomic assembly and am currently trying to run metaSPAdes.

Input data: One sample with ~71,251,478 paired-end reads.
System: 125 GB RAM + 2 TB swap memory.
Command used:

metaspades.py \
  -1 sample_R1.fastq.gz \
  -2 sample_R2.fastq.gz \
  -t 16 \
  -m 250 \
  -k 21,33,55,77,99,127 \
  -o output_dir

Note: I only used the assembler module (without error correction).

My questions are:

  1. Will this command run properly with my dataset and resources?
  2. Do I need to enable error correction explicitly for better assembly results?
  3. Are there recommended parameter changes for large shotgun datasets like mine?

Any suggestions or guidance will be greatly appreciated.

metaspades assembly metagenomics
7 weeks ago
Mensur Dlakic ★ 30k

The answers, in the order you asked:

  • Most likely it will run fine, but it is generally unrealistic to expect anyone to tell you whether a set of commands will run on a given configuration. There is no harm in trying even without asking, as you are not going to hurt the computer.
  • Yes.
  • Your dataset isn't that large - I have at least 20 datasets on my computer with 300-500 million reads each. All of them assembled fine on a computer with 256 GB, though I use MEGAHIT. Generally, there is no point in setting the memory limit (-m) to twice the RAM you actually have, as you are explicitly telling the program to go ahead and use swapping. Speaking of which, 2 TB of swap is enormous. How much disk space do you have to be able to afford that kind of swap?

My advice is to run this with error correction, set the memory max to what you actually have (presumably 128 GB instead of 125), and let it go. If it fails due to memory problems, I suggest you do the same but with MEGAHIT, as it has significantly lower memory requirements.
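
For reference, that might look something like the following; this is only a sketch based on the command in your post (same input files, threads, and k-mer list), with -m lowered to the physical RAM and no --only-assembler so that error correction stays enabled:

metaspades.py \
  -1 sample_R1.fastq.gz \
  -2 sample_R2.fastq.gz \
  -t 16 \
  -m 128 \
  -k 21,33,55,77,99,127 \
  -o output_dir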


Thank you very much, sir, for your helpful suggestions. I really appreciate your guidance. I still have a few more questions, if you don’t mind clarifying:

I have 8 more samples of similar size, but at the moment I am testing with just one sample. When I tried MEGAHIT with default parameters on this dataset, the assembly seemed highly fragmented: around 18 million contigs, with an N50 of 475 bp.

Currently, I am running metaSPAdes with both assembly and error correction enabled, but it has already been running for ~38 hours and is still in the error correction phase. Is it expected for error correction to take this long with a dataset of ~71 million paired-end reads? Should I continue with this run, and are the k-mer values I chose appropriate, or should I change them?

Your advice would be very valuable for me to understand whether I should wait or try adjusting my approach.

Thanks again for your time and support!


It is a weekend, and I think you may want to let the computer work through it instead of you. No, it is not normal for error correction to take that long, but I know nothing about your computer resources except the memory size.

If you are worried about MEGAHIT getting a poor assembly, I can assure you that's not because this program can't assemble. So something is off with your data, or with the way you did it. Maybe add --presets meta-sensitive and run again? Or it could be that your dataset is not very complex and was sequenced at a very high depth (500x or more), which will cause fragmentation. Error correction in SPAdes will help a little if that's the issue, but in addition you may need to downsample your reads to 100x or at most 200x.
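
If downsampling does turn out to be necessary, one common way (just an illustration, not the only option) is seqtk, using the same seed for both files so that read pairs stay matched; the fraction below is a placeholder that you would choose based on your estimated coverage:

seqtk sample -s100 sample_R1.fastq.gz 0.25 | gzip > sub_R1.fastq.gz
seqtk sample -s100 sample_R2.fastq.gz 0.25 | gzip > sub_R2.fastq.gz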

This will likely require patience and research on your part as you probably have to systematically compare several options. Looks like you are already doing this, but just in case: it is best to figure this out on a single sample before you move on to others.

PS: How is error correction running for 38 hours when you said you were using a command without error correction?


Thank you so much, sir, for patiently clarifying all my queries. Your explanations are very helpful. I am going to try all the suggestions you shared and incorporate them into my work.

Regarding your question about how error correction could be running for 38 hours when I said I used a command without it: you are absolutely right, and I sincerely apologize for the confusion. That was a typo in my first post. I actually ran error correction along with assembly; in a hurry while typing, I mistakenly wrote “without error correction.” As you can see from the command I shared earlier, I did not restrict the run to assembly only. I'm really sorry for the mistake and the confusion it caused.

Thank you again for your valuable guidance, sir. I will definitely contact you if I have more queries after trying out your suggestions.


Hello Sir,

Thank you for your earlier suggestion. I wanted to share what actually happened with my assembly runs and also ask for your advice.

I was running the assembly on a system with 125 GB RAM and 2 TB of swap. However, during the run with error correction enabled, I noticed that the swap was not being used at all, and the run seemed stuck at one point. I have shared an excerpt of the log below. Could you kindly help me understand whether this issue is due to (i) the error correction step, (ii) memory allocation, or (iii) just the long runtime?

"2:18:51.036    63G / 76G   INFO    General                 (main.cpp                  : 149)   Clustering Hamming graph.
  3:20:14.071    63G / 76G   INFO    General                 (main.cpp                  : 156)   Extracting clusters:
  3:20:14.075    63G / 76G   INFO    General                 (concurrent_dsu.cpp        :  19)   Connecting to root
  3:20:27.511    63G / 76G   INFO    General                 (concurrent_dsu.cpp        :  35)   Calculating counts
mimalloc: warning: unable to allocate aligned OS memory directly, fall back to over-allocation (73018638336 bytes, address: 0x752a2bc00000, alignment: 67108864, commit: 1)
  4:05:15.718   131G / 131G  INFO    General                 (concurrent_dsu.cpp        :  64)   Writing down entries
mimalloc: warning: unable to allocate aligned OS memory directly, fall back to over-allocation (32069648384 bytes, address: 0x7522b0800000, alignment: 67108864, commit: 1)"

Before your suggestion of using --presets meta-sensitive, I tried two runs with MEGAHIT:

First run:
megahit -1 Sample_R1.fastq.gz -2 Sample_R2.fastq.gz --presets meta-large
Results (after filtering <500 bp):

num_seqs: 463,127
sum_len: 359,022,984 bp
min_len: 500 bp
avg_len: 775.2 bp
max_len: 45,722 bp
N50: 730 bp
L50: 157,485 contigs

Second run:
megahit -1 Sample_R1.fastq.gz -2 Sample_R2.fastq.gz \
--k-list 21,29,39,59,79 -t 24 -o ./megahit_output --keep-tmp-files --tmp-dir ./megahit_tmp

Results (after filtering <500 bp):

num_seqs: 264,761
sum_len: 207,659,761 bp
min_len: 500 bp
avg_len: 784.3 bp
max_len: 48,248 bp
N50: 735 bp
L50: 88,370 contigs
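
(For anyone who wants to reproduce this kind of length filtering and summary, one option is seqkit; this is only an illustration, and final.contigs.fa is MEGAHIT's default output name, so adjust as needed:

seqkit seq -m 500 final.contigs.fa | seqkit stats -a )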

Currently, I am running with --presets meta-sensitive and expect it to finish by tomorrow.

My main question: Are these MEGAHIT assembly results (short N50, high fragmentation, very large number of contigs) expected for a rhizospheric soil metagenome, or should I try to further optimize?

Thank you again for your time and guidance. I would be very grateful for a reply, Sir, as it would be a great help for my learning.


Please use ADD REPLY/ADD COMMENT when responding to existing comments. SUBMIT ANSWER should only be used for new answers to the original question.


I am sorry for the mistake, sir. I will follow this instruction and make sure to use ADD REPLY/ADD COMMENT correctly from now on.


I don't know what those messages mean exactly, but it is very likely that you don't have enough memory.

If this is a complex community - as I would expect from a soil sample - then you likely don't have enough reads to get a better assembly. I already told you that 70 million reads is not a big dataset, and for a complex community it is definitely not big. I suggest you go back and check the contig coverage for the MEGAHIT assembly. If the average is in the single- or low double-digit range, that would explain the fragmentation: the assembler simply does not have enough information to extend the contigs.
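
If it helps, a rough way to check this from the MEGAHIT output is to average the multi= field in the contig headers; this assumes the usual final.contigs.fa header format (e.g. ">k141_12 flag=1 multi=3.2 len=1055"), where multi roughly tracks contig coverage:

grep ">" final.contigs.fa | sed -n 's/.*multi=\([0-9.]*\).*/\1/p' | \
  awk '{ sum += $1 } END { if (NR > 0) print NR, "contigs, mean multi (~coverage):", sum/NR }'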

Please do not expect that I will keep replying to your questions. I am not here to guide anyone through the whole process step by step - I have my students for that.


I sincerely apologize, Sir, for asking too many questions and taking up your time. I am very grateful for the guidance and suggestions you have already given me - they have been very helpful for my learning. Thank you once again for your patience and replies.

