Question: MUMmer alignments: Trying to understand the average identity result
caro-ca wrote, 23 days ago:

Hi, community! I am de novo assembling Nanopore long reads and comparing my draft genome assembly against the reference genome that is available online. First, some details about the inputs. The reference genome was built with a hybrid method that combined Illumina and PacBio data: the short reads were assembled and the long reads (~30x coverage) were used for gap filling. In my assembly I could find complete chromosomes, but none in the reference genome. The annotation of the reference genome contains more protein-coding genes than the annotation of my own assembly. Both assemblies are from the same species but from different strains, and they differ in genome size.

When I ran nucmer from MUMmer with the -maxmatch option, followed by delta-filter, I got an average identity of 93%. How is this possible? I find it difficult to understand because:

1) My assembly had ~47x coverage, whereas according to the literature I needed 70x coverage to overcome the systematic errors of Nanopore, so my assembly still has errors even though I did a lot of error-correction, consensus and polishing steps.

2) With the Nanopore assembly, I could span more repetitive regions than the reference genome.

So, in general, there are several reasons why I would have expected a somewhat lower identity percentage.

Let me know if I made myself clear. Thank you in advance for your help.


"My assembly had ~47x coverage which according to the literature I needed 70x coverage to overcome the systematic errors of Nanopore"

It's important to realize here how fast the field changes. Nanopore errors have become a lot less problematic than some older literature would have you think. I don't know which paper you are referring to, but keep in mind that the technology was less mature when those authors generated their data.

Comment written 23 days ago by WouterDeCoster

Hi! Thank you for your answer. Actually, that coverage value is exactly what I am struggling with. What counts as low or high coverage for overcoming systematic Nanopore errors? Could you suggest a paper that clarifies this? The one I read is from 2015.
Thank you!

Comment written 21 days ago by caro-ca
Istvan Albert (University Park, USA) wrote, 23 days ago:

The question is not that well-formed - that is one reason you are not getting more answers. The most obvious explanation is quite straightforward, but most likely will not directly provide you with the information you seek:

  • You get a 93% average identity because that is how many bases are identical, on average.

You are also saying: I would have expected a little bit less of the identity percentage.

First, I would say that it is quite challenging to accurately predict how similar two strains are. An average identity of 93% does not mean the sequences are 93% identical everywhere. There will be long regions of identical sequence punctuated by shorter regions of great diversity; that is what closely related sequences should look like.
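To make the point about averages concrete, here is a minimal Python sketch. The block lengths and identities are invented for illustration and are not taken from the poster's data, and `weighted_identity` is a hypothetical helper, not something MUMmer provides. It assumes the average is weighted by aligned length, which may differ from how a particular tool reports it.

```python
def weighted_identity(blocks):
    """Length-weighted mean identity (%) over (aligned_length, identity_percent) pairs."""
    total_len = sum(length for length, _ in blocks)
    if total_len == 0:
        raise ValueError("no aligned bases")
    return sum(length * idy for length, idy in blocks) / total_len


# Invented example: a long, nearly identical region plus a shorter divergent one.
blocks = [
    (800_000, 99.0),  # conserved region
    (200_000, 69.0),  # divergent region
]

print(f"average identity: {weighted_identity(blocks):.1f}%")
for length, idy in blocks:
    print(f"  block of {length} bp at {idy:.1f}%")
```

With these made-up numbers the overall figure comes out at 93%, even though no individual block is 93% identical.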

A 7% difference could be a lot, or very little; it all depends on how densely the information is packed in the genomes. My gut feeling is that a 7% difference across strains is already a little too much.

If you suspect your alignments to be the culprit, then look at and evaluate the alignments themselves. One of the easiest ways to compare two similar sequences is to use the minimap2 aligner. Basically, align the draft genome to the reference, then visualize the resulting BAM file in IGV. It can be very informative.
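Besides looking at the alignments in IGV, the per-alignment identity can be tabulated directly. When minimap2 is run without SAM output it writes PAF, a tab-separated format in which column 10 is the number of matching bases and column 11 is the alignment block length. The sketch below is only an illustration: `paf_identities` is a hypothetical helper, and the two PAF records are invented, not real output.

```python
def paf_identities(lines):
    """Yield (query, target, target_start, target_end, identity_percent) per PAF record."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        n_match = int(fields[9])   # column 10: number of matching bases
        aln_len = int(fields[10])  # column 11: alignment block length
        yield (fields[0], fields[5], int(fields[7]), int(fields[8]),
               100.0 * n_match / aln_len)


# Invented records: one well-conserved alignment and one divergent alignment.
example_paf = [
    "contig_1\t50000\t0\t40000\t+\tchrI\t230000\t1000\t41000\t39800\t40000\t60",
    "contig_1\t50000\t40000\t50000\t+\tchrI\t230000\t41000\t51000\t7000\t10000\t60",
]

records = list(paf_identities(example_paf))
for query, target, start, end, identity in records:
    print(f"{query} -> {target}:{start}-{end}  identity {identity:.1f}%")
```

Sorting such a table by identity is a quick way to find the regions that pull the average down before opening them in a viewer.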


Another cool tool for comparing assemblies or genomes is D-genies, which uses minimap2 for the alignment and produces a nice visualization. Another tool for evaluating assemblies is QUAST.

Comment written 23 days ago by WouterDeCoster

Thank you! To assess my assemblies I used QUAST, Tapestry, BUSCO and dnadiff from MUMmer, but I will take a close look at D-genies! Thank you for your answers.

Comment written 21 days ago by caro-ca