Question

MIRA runs yielding different assemblies with same input data and parameters

6 weeks ago

btc347 • 0

Hi everyone,

I am currently doing de novo assembly of metavirome samples, and I've been using MIRA to assemble contigs from some samples with lots of repeats where metaviralSPAdes was failing to assemble anything. However, I'm running into an issue with MIRA where multiple runs of my pipeline with the same input data and parameters are giving me different assemblies. Is this normal behavior?

The following has been my pipeline using my sample "Farm-3" as an example. First, I used bbnorm to downsample my data to approximately the average depth recommended for Illumina data by MIRA (~80x). I then ran MIRA with the following config file:

project = Farm-3
job = genome,denovo,accurate
readgroup = DataIlluminaPairedEnd500Lib
data = Farm-3_R1_001_clean_norm.fastq Farm-3_R2_001_clean_norm.fastq
technology = solexa
rename_prefix = M06453:23:000000000-K253L Farm-3_
template_size = 5 2000 autorefine
segment_placement = ---> <---
parameters = COMMON_SETTINGS -NW:cac=warn

Using these parameters, MIRA was able to generate contigs from my samples that were not being assembled by metaviralSPAdes. Below are some statistics from the "info_assembly.txt" file output from MIRA for "Farm-3".

Num. reads assembled: 76056
Num. singlets: 0
Large contigs:
Number of contigs: 4
Total consensus: 202436
Largest contig: 195702
N50 contig size: 195702

Additionally, one of the contigs belonged to a large virus (~195 kb) that I was told was likely to be present in these samples, so everything looked good. However, when I later reran MIRA with the same parameters on the same downsampled data (with output written to a different directory), I got a different number of contigs for "Farm-3" than in my prior run, and the ~195 kb contig was no longer present in the output:

Num. reads assembled: 76029
Num. singlets: 0
Large contigs:
Number of contigs: 6
Total consensus: 203027
Largest contig: 102502
N50 contig size: 102502
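
To make the comparison between runs explicit, here is a minimal Python sketch that parses the two excerpts above and lists the fields that differ. It assumes simple "Key: value" lines as shown in this post; a full "info_assembly.txt" has further sections that may repeat the same keys, so it would need per-section handling. The names `parse_stats`, `diff_stats`, `RUN1`, and `RUN2` are just illustrative.

```python
def parse_stats(text):
    """Parse 'Key: value' lines into a dict; integer values are converted."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if not sep or not value:
            continue  # skip blank lines and headers such as "Large contigs:"
        stats[key.strip()] = int(value) if value.isdigit() else value
    return stats


def diff_stats(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Excerpts copied from the two "Farm-3" runs described in this post.
RUN1 = """Num. reads assembled: 76056
Num. singlets: 0
Large contigs:
Number of contigs: 4
Total consensus: 202436
Largest contig: 195702
N50 contig size: 195702"""

RUN2 = """Num. reads assembled: 76029
Num. singlets: 0
Large contigs:
Number of contigs: 6
Total consensus: 203027
Largest contig: 102502
N50 contig size: 102502"""

if __name__ == "__main__":
    for key, (v1, v2) in diff_stats(parse_stats(RUN1), parse_stats(RUN2)).items():
        print(f"{key}: run 1 = {v1}, run 2 = {v2}")
```

Every field except the singlet count differs between the two runs, including the number of reads assembled.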

Looking through the MIRA logfiles for both runs of "Farm-3", I see that both runs performed 5 passes. Notably, the ~195 kb contig is identified in the first pass of both runs. In the first, "successful" run it is identified twice more in subsequent passes, but in the second run it is never identified again. Additionally, for both runs, the input data numbers (reads, used bases, GC content, etc.) are basically identical before the first pass begins. After the first pass, these numbers begin to diverge slightly.
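
To rule out any difference in the inputs themselves, one option is to record a checksum of each FASTQ file and of the manifest alongside every run's output. The sketch below uses only the Python standard library; the function names `sha256_of` and `same_content` are hypothetical helpers, not part of MIRA.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if the two files are byte-for-byte identical according to SHA-256."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, `sha256_of("Farm-3_R1_001_clean_norm.fastq")` could be logged before each run. Matching digests across runs would confirm that the divergence arises inside the assembly rather than from the input files.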

So, long story short, I'm getting different results from MIRA despite using the same input data and parameters. Based on my (very limited) understanding of MIRA, I didn't think there was anything in its algorithm that would cause differences between runs (though again, I could be completely wrong there). Does anyone have experience with this, or any recommendations for how to proceed? I know I could just take my assembled contigs and use them downstream, but obviously I'd like my analysis to be reproducible.

I can also try to link the MIRA logfiles for both runs somewhere if that would be useful to anyone.

Thanks!

MIRA replication denovo assembly • 118 views
