I see that there are many tools for filling gaps in a reference genome. I understand why this is a useful step in order to have as much contiguous sequence as possible. However, one thing I have been wondering about, which I haven't seen discussed anywhere, is whether this has the downside of introducing duplicate sequence into the reference, particularly for a highly fragmented genome, and whether there are any tools for dealing with this. For example, I am working with a genome with ~60,000 contigs which together sum to roughly the expected size of the genome. These have been scaffolded into ~45,000 scaffolds. I am wondering whether it makes sense to perform gap filling at this stage, because, if all of the genome is contained within the set of scaffolds + unscaffolded contigs, filling in gaps will generate sequence which is already contained within contigs that could not be scaffolded. My intuition is that this would actually worsen the quality of read mapping, as reads could then map ambiguously either to the gap-filled sequence or to a contig which simply did not get scaffolded but truly belongs in a scaffold gap. Can anyone comment on whether this is likely to be a problem and whether there are common solutions to it?
In brief: it depends on the sequencing technology, but also on how you define "duplicated" sequence.
Though gaps often occur due to repetitive or low-complexity sequence, scaffolding (usually) fills the gaps between contigs with ambiguous sequence (N, that is). Scaffolding is mostly done for ordering sequence, so you'll be able to tell that the order of contigs in your genome is, for example, "contig1 - contig3 - contig2". To order and connect contigs into scaffolds you need "long jump" or "mate pair" libraries, a technology that sequences the ends of long fragments and hence gives large insert sizes. Based on the known length of the fragments, scaffolding software can estimate the gap size.
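To make the gap-size estimate concrete, here is a minimal sketch, not how any particular scaffolder actually does it. It assumes each mate pair has one read on contig A and its mate on contig B, that the library insert size is known, and that positions are given as simple 0-based coordinates. The function names (estimate_gap, join_with_gap) and the input layout are made up for illustration.

```python
from statistics import median

def estimate_gap(insert_size, pairs, len_a):
    """Estimate the gap between contig A and contig B from spanning mate pairs.

    pairs: list of (pos_a, pos_b), where pos_a is the 0-based start of the
    read on contig A and pos_b is the 0-based end (exclusive) of its mate
    on contig B. These coordinates are an assumption of this sketch.
    """
    estimates = []
    for pos_a, pos_b in pairs:
        on_a = len_a - pos_a   # fragment bases lying on contig A
        on_b = pos_b           # fragment bases lying on contig B
        estimates.append(insert_size - on_a - on_b)
    # The median is less sensitive to a few chimeric or mis-mapped pairs.
    return max(0, round(median(estimates)))

def join_with_gap(contig_a, contig_b, gap):
    """Join two ordered contigs with a run of N of the estimated length."""
    return contig_a + "N" * gap + contig_b

# Three pairs from a 3 kb library spanning the end of a 5 kb contig.
gap = estimate_gap(3000, [(4000, 500), (4200, 700), (3900, 400)], len_a=5000)
print(gap)                                 # 1500
print(join_with_gap("ACGT", "TTGA", 3))    # ACGTNNNTTGA
```

The N run only records order and approximate distance; no real sequence is added at this stage, so nothing can be duplicated yet.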
Note that some of this is obsolete with long-read technologies like PacBio and Oxford Nanopore (ONT), where some gaps can be filled with actual sequence. In this case you certainly do "duplicate" sequence, sort of. Sort of, because the gap sequence is very likely low-complexity and/or repetitive, so it is very likely to occur at many positions in the genome anyway; see for example Wikipedia on non-coding DNA and transposable elements.
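If you want to check whether a filled gap really duplicates one of your unscaffolded contigs, a toy version of that check could look like the sketch below. It only looks for exact matches on either strand, which is an assumption made to keep the example short; for real data you would align the contigs against the gap-filled scaffolds with a proper aligner instead. The function names and the dictionary of contigs are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N stays N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]

def contigs_duplicated_in_fill(filled_gap, unscaffolded):
    """Return names of unscaffolded contigs found verbatim in a filled gap.

    filled_gap: sequence that replaced the N run in a scaffold.
    unscaffolded: dict mapping contig name to sequence.
    Only exact matches (forward or reverse strand) are reported.
    """
    fill = filled_gap.upper()
    hits = []
    for name, seq in unscaffolded.items():
        s = seq.upper()
        if s in fill or reverse_complement(s) in fill:
            hits.append(name)
    return hits

contigs = {"c1": "GGATCC", "c2": "ATTGG", "c3": "GGGGGG"}
print(contigs_duplicated_in_fill("TTACGGATCCAAT", contigs))  # ['c1', 'c2']
```

Contigs flagged this way are candidates for removal from the unscaffolded set, since they are now represented inside a scaffold and would otherwise attract ambiguous read mappings.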