When I learned about de novo assembly algorithms for paired-end sequencing in school, the data was presented as a set of large DNA fragments, each with two sequenced ends and a region between them that was not sequenced but whose exact length was known. The sequences of the two ends and the exact length of the un-sequenced region were then used to assemble the genome.
Now that I'm working in a lab with NGS data from Illumina paired-end sequencing, it does not appear to work that way. I'm guessing this is because we are aligning to a reference genome. The pipeline does not appear to know the length of the un-sequenced region before aligning to the reference genome. I think it can be estimated after the fact, but that is another point of discussion. It also appears to merge the two ends if they overlap and to treat them separately if they do not. Here are some questions that are confusing me.
All questions pertain to the no-overlap case.
- Does Illumina know the exact length of the un-sequenced region for a pair?
- If Illumina does not know the exact length of the un-sequenced region for a pair, are there algorithms that make use of the fact that one end comes after the other when aligning to a reference genome (or even for de novo assembly)?
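To make the "estimated after the fact" idea concrete, here is a minimal sketch of what I have in mind. It assumes both ends of a pair align to the same reference sequence with no indels or clipping. The function name and the coordinates are hypothetical and not taken from any Illumina tool.

```python
def unsequenced_gap(pos1, len1, pos2, len2):
    """Estimate fragment length and un-sequenced gap from two aligned ends.

    pos1, pos2: 0-based leftmost reference coordinates of each end (hypothetical inputs).
    len1, len2: number of reference bases each end covers.
    Assumes no indels or clipping and that both ends map to the same reference.
    """
    # The fragment spans from the leftmost aligned base to the rightmost one.
    start = min(pos1, pos2)
    end = max(pos1 + len1, pos2 + len2)
    fragment_length = end - start

    # The gap is the distance between the end of the left read and the start
    # of the right read. A negative value means the two ends overlap.
    left_end = min(pos1 + len1, pos2 + len2)
    right_start = max(pos1, pos2)
    gap = right_start - left_end
    return fragment_length, gap


if __name__ == "__main__":
    # Two 100 bp ends aligned 350 bp apart on the reference.
    print(unsequenced_gap(1000, 100, 1350, 100))
```

With this picture the gap is only known once both ends have been placed on the reference, which is the part that does not match what I was taught.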
Also, any additional insights you can offer are appreciated. I'm already familiar with the Illumina documentation in detail, so you don't need to repeat that. I'm more interested in a wider understanding of this, and in how what I learned in school relates to how sequencing actually works in labs.
Thanks for any help provided.