Question

Does gap filling introduce duplicated sequence in a reference genome?

5.1 years ago

brogroh • 0

I see that there are many tools for filling gaps in a reference genome. I understand why this is a useful step in order to have as much contiguous sequence as possible. However, one thing I have been wondering about, which I haven't seen discussed anywhere, is whether this has the downside of introducing duplicate sequence into the reference, particularly for a highly fragmented genome, and whether there are any tools for dealing with this. For example, I am working with a genome with ~60,000 contigs which together sum to roughly the expected size of the genome. These have been scaffolded into ~45,000 scaffolds. I am wondering whether it makes sense to perform gap filling at this stage, because, if all of the genome is contained within the set of scaffolds + unscaffolded contigs, filling in gaps will generate sequence which is already contained within contigs that could not be scaffolded. My intuition is that this would actually worsen the quality of read mapping, as reads could then map ambiguously either to the gap-filled sequence or to a contig which simply did not get scaffolded but truly belongs in a scaffold gap. Can anyone comment on whether this is likely to be a problem and whether there are common solutions to it?

assembly genome • 1.3k views

score 0 · Answer 1 · 2019-03-28

5.1 years ago

Carambakaracho ★ 3.2k

In brief: it depends on the sequencing technology, but also on how you define "duplicated" sequence.

Though gaps often occur due to repetitive or low-complexity sequence, scaffolding (usually) fills the gaps between contigs with ambiguous sequence (N, that is). Scaffolding is mostly done to order sequence, so you'll be able to tell that the order of contigs in your genome is, for example, "contig1 - contig3 - contig2". To order and connect the contigs you needed "long jump" or "mate pair" libraries, a technology for sequencing the ends of long fragments, which therefore have large insert sizes. Based on the known length of the fragments, scaffolding software can estimate the gap size.

[Figure: different library sizes]
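The gap-size estimate described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how any particular scaffolder does it: the function name `estimate_gap_size`, the tuple layout, and the numbers are all made up for the example, and real tools model the full insert-size distribution rather than a single mean.

```python
from statistics import median

def estimate_gap_size(mean_insert_size, spanning_pairs):
    """Estimate the gap between two neighbouring contigs.

    spanning_pairs is a list of (bases_on_left, bases_on_right) tuples
    (hypothetical layout): for each mate pair that links the two contigs,
    how much of the fragment lies on the left contig and how much lies
    on the right contig. Whatever is left of the fragment must be gap.
    """
    estimates = [
        mean_insert_size - left - right
        for left, right in spanning_pairs
    ]
    # The median is less sensitive to outlier pairs than the mean.
    return median(estimates)

# Three made-up mate pairs from a library with a 5 kb mean insert size.
pairs = [(1200, 1500), (900, 1900), (2000, 800)]
print(estimate_gap_size(5000, pairs))
```

A negative estimate would suggest that the two contigs overlap rather than being separated by a gap.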

Note that some of this is obsolete with long-read technologies like PacBio and ONT (Oxford Nanopore), where some gaps can be filled with actual sequence. In this case you certainly sort of "duplicate" sequence. Sort of, because the sequence is very likely low-complexity and/or repetitive, so it is very likely to occur at many positions in the genome; see for example Wikipedia on non-coding DNA and transposable elements.
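One rough way to put a number on how "duplicated" a filled gap is would be to ask what fraction of its k-mers already occur in the unscaffolded contigs. The sketch below is a toy illustration under that assumption; the function names, the choice of k, and the in-memory strings are all hypothetical, and for a real genome you would more likely align the filled sequence against the contigs with a proper aligner.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical_kmers(seq, k):
    """Set of strand-independent k-mers in seq, skipping any k-mer with N."""
    seq = seq.upper()
    result = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue
        # Use the lexicographically smaller strand as the representative.
        result.add(min(kmer, reverse_complement(kmer)))
    return result

def shared_kmer_fraction(filled_gap, unplaced_contigs, k=21):
    """Fraction of the filled gap's k-mers that also occur in unplaced contigs."""
    gap_kmers = canonical_kmers(filled_gap, k)
    if not gap_kmers:
        return 0.0
    contig_kmers = set()
    for contig in unplaced_contigs:
        contig_kmers |= canonical_kmers(contig, k)
    return len(gap_kmers & contig_kmers) / len(gap_kmers)

# Toy example: the "filled gap" is fully contained in an unplaced contig.
filled = "ACGTTGCAAGGCTTAGC"
print(shared_kmer_fraction(filled, ["TTT" + filled + "GGG"], k=5))
```

A fraction close to 1 would mean the filled sequence is essentially already present elsewhere in the assembly; for repetitive sequence a high value is expected even when the gap fill is correct.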

To clarify, the contigs are from a PacBio assembly, and these have already been scaffolded using mate pair links. My question is about gap filling using mate pair sequence i.e. building consensus sequence for the gap based on the 'overhang' read from a mate pair in which one read maps within a contig within a scaffold and the other is expected to fall within the gap. The conversion of Ns to nucleotide sequence is going to create sequence which already exists within the contigs, but just isn't in the scaffold already, no?

5.1 years ago by brogroh • 0

"contigs are from a PacBio assembly, and these have already been scaffolded using mate pair links"

Okay. I never did it this way; I used contigs assembled from short paired-end (PE) fragments, which I then scaffolded using long reads (or, formerly, long mate-pair (MP) fragments).

"[Will this] create sequence which already exists within the contigs, but just isn't in the scaffold already?"

Given the context above, I'm not sure I understand your question here. Let me try to answer what I think I understand: if your assembly can't be scaffolded any better with long reads, chances are the gaps in the sequence are too long to bridge with PacBio reads. In that case, your MP library requires enormous coverage to reliably fill the region. The variation in the insert size distribution of MP libraries is usually quite large, so you can't just estimate a read's position in the gap from the distance to its mate; you need to stack overlapping reads.

However, I guess there's repetitive sequence in the gap (that's most likely why the gap is there). In that case short reads will create ambiguous connections between the fragments. Imagine aligning multiple reads consisting only of AT repeats to span an AT region longer than a read.
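The AT example can be made concrete with a toy sketch. The sequences and the helper name `placements` are invented for illustration, and exact string matching stands in for a real aligner:

```python
def placements(read, region):
    """All start positions where read matches region exactly."""
    return [
        i for i in range(len(region) - len(read) + 1)
        if region.startswith(read, i)
    ]

# Made-up region: unique flank, 40 bp AT repeat, unique flank.
region = "GATTACA" + "AT" * 20 + "CCGTA"

repeat_read = "AT" * 5          # 10 bp read lying entirely inside the repeat
anchored_read = "TACAATATAT"    # 10 bp read overlapping the unique left flank

print(len(placements(repeat_read, region)))   # many equally good positions
print(placements(anchored_read, region))      # a single position
```

The read that lies entirely within the repeat fits at every second position along it, so it says nothing about how long the repeat really is; only reads anchored in unique flanking sequence can be placed unambiguously.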

5.1 years ago by Carambakaracho ★ 3.2k