Hi, I am a beginner in shotgun metagenomic assembly and am currently trying to run metaSPAdes.
Input data: One sample with ~71,251,478 paired-end reads.
System: 125 GB RAM + 2 TB swap memory.
Command used:
metaspades.py \
-1 sample_R1.fastq.gz \
-2 sample_R2.fastq.gz \
-t 16 \
-m 250 \
-k 21,33,55,77,99,127 \
-o output_dir
Note: I only used the assembler module (without error correction).
My doubts are:
- Will this command run properly with my dataset and resources?
- Do I need to enable error correction explicitly for better assembly results?
- Are there recommended parameter changes for large shotgun datasets like mine?
Any suggestions or guidance will be greatly appreciated.
Thank you very much, sir, for your helpful suggestions. I really appreciate your guidance. I still have a few more questions, if you don’t mind clarifying:
I have 8 more samples of similar size, but at the moment I am testing with just one sample. When I tried MEGAHIT with default parameters on this dataset, the assembly came out highly fragmented: around 18 million contigs with an N50 of 475 bp.
Currently, I am running metaSPAdes with both error correction and assembly enabled, but it has already been running for ~38 hours and is still in the error correction phase. Is it expected for error correction to take this long with ~71 million paired-end reads? Should I let it continue, and are the k-mer values I chose reasonable, or should I change them?
Your advice would be very valuable for me to understand whether I should wait or try adjusting my approach.
Thanks again for your time and support!
It is the weekend, so you may want to just let the computer keep working through it for now. No, it is not normal for error correction to take that long, but I know nothing about your computer resources other than the memory size.
If you are worried about MEGAHIT producing a poor assembly, I can assure you that's not because this program can't assemble. So something is off with your data, or with the way you ran it. Maybe add --presets meta-sensitive and run again? Or it could be that your dataset is not very complex and was sequenced at very high depth (500x or more), which will cause fragmentation. Error correction in SPAdes will help a little if that's the issue, but in addition you may need to downsample your reads to 100x or at most 200x. This will likely require patience and research on your part, as you will probably have to systematically compare several options. Looks like you are already doing this, but just in case: it is best to figure this out on a single sample before you move on to the others.
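In case it helps, here is a rough sketch of both suggestions as commands (the file names are from your first post; the output directory names, thread count, and the 0.25 sampling fraction are placeholders you would adjust to your own data and target coverage):

# re-run MEGAHIT with the more sensitive preset
megahit -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
  --presets meta-sensitive -t 16 -o megahit_sensitive_out

# downsample both read files with seqtk, using the same seed (-s100) so the pairs stay in sync
seqtk sample -s100 sample_R1.fastq.gz 0.25 | gzip > sub_R1.fastq.gz
seqtk sample -s100 sample_R2.fastq.gz 0.25 | gzip > sub_R2.fastq.gz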
PS How is error correction running for 38 hours when you said you were using a command without error correction?
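For reference, actually skipping error correction would mean adding the --only-assembler flag, roughly like this (same inputs as your post; note that -m is a limit in GB, so on a 125 GB machine a value below the physical RAM, e.g. 120, makes more sense than 250):

metaspades.py --only-assembler \
  -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
  -t 16 -m 120 -k 21,33,55,77,99,127 -o output_only_asm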
Thank you so much, sir, for patiently clarifying all my queries. Your explanations are very helpful. I am going to try all the suggestions you shared and incorporate them into my work.
Regarding the "How is the error correction running for 38 hours when you said to be using a command without error correction?" you mentioned, you are absolutely right, and I sincerely apologize for the confusion. That was a typo in my first post. I actually used error correction along with assembly, but in a hurry while typing I mistakenly wrote “without error correction.” You can also see from the command I shared earlier that I did not restrict it to only assembly. I’m really sorry for the mistake and the confusion it caused.
Thank you again for your valuable guidance, sir. I will definitely contact you if I have more queries after trying out your suggestions.
Hello Sir,
Thank you for your earlier suggestion. I wanted to share what actually happened with my assembly runs and also ask for your advice.
I was running the assembly on a system with 125 GB RAM and 2 TB swap. However, during the run with error correction enabled, I noticed that the swap was not being used at all, and the run seemed stuck at one point. I have shared the log file below. Could you kindly help me understand whether this issue is due to (i) the error correction step, (ii) memory allocation, or (iii) simply the long runtime?
Before your suggestion of using --presets meta-sensitive, I tried two runs with MEGAHIT. Currently, I am running with --presets meta-sensitive and expect it to finish by tomorrow.
My main question: Are these MEGAHIT assembly results (short N50, high fragmentation, very large number of contigs) expected for a rhizospheric soil metagenome, or should I try to further optimize?
Thank you again for your time and guidance. I kindly request you to reply, Sir. It will be a great help for my learning, and I will be truly grateful to you.
Please use ADD REPLY/ADD COMMENT when responding to existing comments. SUBMIT ANSWER should only be used for new answers to the original question.
I am sorry for the mistake, sir. My apologies - I will follow this instruction and make sure to use ADD REPLY/ADD COMMENT correctly from now on.
I don't know exactly what the messages above mean, but it is very likely that you don't have enough memory.
If this is a complex community - as I would expect from a soil sample - then you likely don't have enough reads to get a better assembly. I already told you that 70 million reads is not a big dataset, and for a complex community it is definitely not big. I suggest you go back and check the contig coverage for the MEGAHIT assembly. If the average number is in the single- or low double-digit range, that would explain why you get the fragmentation. The assembler simply has no information to extend the contigs.
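A quick way to get a rough number, assuming the default MEGAHIT output file final.contigs.fa (the multi= value in each contig header approximates k-mer coverage; mapping the reads back with bowtie2 and samtools would give a more accurate estimate):

grep '^>' final.contigs.fa \
  | sed 's/.*multi=\([0-9.]*\).*/\1/' \
  | awk '{ s += $1; n++ } END { if (n > 0) printf "mean multi = %.1f over %d contigs\n", s/n, n }'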
Please do not expect that I will keep replying to your questions. I am not here to guide anyone through the whole process step by step - I have my students for that.
I sincerely apologize, Sir, for asking too many questions and taking up your time. I am very grateful for the guidance and suggestions you have already given me - they have been very helpful for my learning. Thank you once again for your patience and replies.