Question: How can an assembled genome have Ns in it?
3.0 years ago by
predeus1.3k wrote:

Hello all,

Maybe it's a silly question, but I have very little experience with genome assembly, so I was hoping somebody could help me out. A colleague of mine has pointed out a certain number of N nucleotides in contiguous parts of some genome assemblies (as in, in the middle of a chromosome). They are not present in human or mouse assemblies, but are seen quite often in other genomes.

What confuses me is that these are not the hard-masked versions of the genomes, or at least so they said. They are the unmasked versions.

Could you have a certain number of Ns in an assembled scaffold? How could you know the number of Ns for sure if you never got the sequence?

Thank you for any input

genome assembly ngs wgs

The human genome has quite a few regions with "N" nucleotides, mainly repetitive content such as telomeres and centromeres.

written 3.0 years ago by WouterDeCoster42k

Yeah, I know that - I know that they are usually masked. That I can easily understand.

What I don't understand is how you can have undefined sequences of known length in the assembly.

If you have stretches that are very repetitive and cannot be assembled, you have a hole in the assembly there, don't you? You can't just put the two scaffolds together and put some N's in between, since you won't know the length of the linker.

That's what I don't get.

written 3.0 years ago by predeus1.3k

If you have paired-end or mate-pair reads that align on either side of the gap, then, since you know the expected distance between the reads of a pair, you can use that to infer the gap size and insert a corresponding number of Ns. Otherwise, unknown gap sizes

written 3.0 years ago by cmdcolin1.3k
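
As a rough illustration of the paired-read reasoning in the reply above, here is a minimal Python sketch. It assumes a simplified setup that is not from the thread: each spanning pair is given as the leftmost position of the forward read on the left contig and the rightmost (exclusive) position of its mate on the right contig, and the library has a known mean insert size. The names estimate_gap and join_with_gap are hypothetical, and real scaffolders use more careful statistics than a median.

```python
from statistics import median


def estimate_gap(pairs, left_contig_len, mean_insert):
    """Estimate the gap between two contigs from read pairs that span it.

    pairs: list of (start_on_left, end_on_right) tuples, 0-based.
        start_on_left: leftmost base of the forward read on the left contig.
        end_on_right: rightmost base (exclusive) of the mate on the right contig.
    left_contig_len: length of the left contig.
    mean_insert: expected outer distance between the two reads of a pair.
    """
    estimates = []
    for start_on_left, end_on_right in pairs:
        # Part of the fragment that lies on assembled sequence.
        covered = (left_contig_len - start_on_left) + end_on_right
        # Whatever is left of the expected fragment length must be gap.
        estimates.append(mean_insert - covered)
    # The median damps outlier pairs; a gap cannot be negative.
    return max(0, round(median(estimates)))


def join_with_gap(left_contig, right_contig, gap_size):
    """Join two contigs into a scaffold with gap_size Ns in between."""
    return left_contig + "N" * gap_size + right_contig


# Toy data: a 1000 bp left contig and a library with a 500 bp mean insert.
toy_pairs = [(800, 100), (850, 160), (900, 190)]
gap = estimate_gap(toy_pairs, left_contig_len=1000, mean_insert=500)
scaffold = join_with_gap("ACGT", "TTAA", 3)
```

The number of Ns obtained this way is an estimate based on the insert-size distribution, not a measured length, which is why the bases themselves remain unknown even though the run has a definite size.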

Thank you for the link, that answered it for me!

written 3.0 years ago by predeus1.3k

It depends on the stage of your assembly. At the contig stage, there should be no Ns, as far as I am aware. At the scaffold or pseudomolecule stage, you will have blocks of Ns between contigs (scaffold stage) or between scaffolds (pseudomolecule stage) wherever you know the order of the contigs or scaffolds (e.g. from long mate-pair libraries) but were not able to actually overlap them.

written 3.0 years ago by cschu1811.9k
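
To look at the blocks of Ns described above in an actual sequence, a small Python sketch like the following could list each run of Ns with its coordinates and split a scaffold back into its contigs. The helper names find_n_runs and split_contigs are made up for this example, and the input is assumed to be a plain sequence string that has already been read from a FASTA file.

```python
import re

# One or more consecutive N characters, in either case.
N_RUN = re.compile(r"[Nn]+")


def find_n_runs(sequence):
    """Return (start, end, length) for every run of Ns; 0-based, end exclusive."""
    return [(m.start(), m.end(), m.end() - m.start())
            for m in N_RUN.finditer(sequence)]


def split_contigs(sequence):
    """Split a scaffold at its N runs, dropping empty pieces at the ends."""
    return [piece for piece in N_RUN.split(sequence) if piece]


toy_scaffold = "ACGTNNNNACGTNNAC"
runs = find_n_runs(toy_scaffold)
contigs = split_contigs(toy_scaffold)
```

Run on a real scaffold, the lengths reported by find_n_runs are the gap sizes the assembler chose to write, whether they were estimated from read pairs or not.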