When I learned about de novo paired-end sequencing algorithms in school, the data were presented as a set of large DNA fragments, each with two sequenced ends and a region between them that was not sequenced but whose exact length was known. The sequences of the two ends and the exact length of the un-sequenced region were then used to assemble the genome.
Now that I'm working in a lab with NGS data from Illumina paired-end sequencing, it does not appear to work that way. I'm guessing this is because we are aligning to a reference genome. The pipeline does not appear to know the length of the un-sequenced region before aligning to the reference genome. I think it can be estimated after the fact, but that is another point of discussion. It also appears to merge the two ends if they overlap and to treat them separately if they do not. Here are the questions that are confusing me.
All questions pertain to the no-overlap case.
- Does Illumina know the exact length of the un-sequenced region for a pair?
- If Illumina does not know the exact length of the un-sequenced region for a pair, are there algorithms that make use of the fact that one read comes after the other when aligning to a reference genome (or even for de novo assembly)?
Also, any additional insights you can offer are appreciated. I'm already familiar with the Illumina documentation in detail, so you don't need to repeat that. I'm more interested in a wider understanding of this and of how what I learned in school relates to how sequencing actually works in labs.
Thanks for any help provided.
Only if you are assembling using a reference-based assembly strategy.
Not unless there is a reference that you can align to. The length of the "unsequenced" part is inferred from the alignments of the pair of reads to a reference.
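To make the inference concrete, here is a minimal sketch of how the fragment length and the un-sequenced gap could be computed once both reads of a pair are placed on a reference. The function name and arguments are hypothetical, coordinates are assumed to be 1-based leftmost positions, and the alignments are assumed to be ungapped (no indels or clipping), which real aligners do not guarantee.

```python
def inferred_fragment_and_gap(fwd_start, fwd_len, rev_start, rev_len):
    """Infer fragment length and un-sequenced gap from two aligned reads.

    Assumptions (illustrative only): 1-based leftmost reference
    coordinates, ungapped alignments, forward read upstream of reverse read.
    """
    fwd_end = fwd_start + fwd_len - 1      # last reference base covered by the forward read
    rev_end = rev_start + rev_len - 1      # last reference base covered by the reverse read
    fragment_length = rev_end - fwd_start + 1
    gap = rev_start - fwd_end - 1          # negative when the two reads overlap
    return fragment_length, gap


# Two 150 bp reads whose leftmost positions are 1000 and 1300.
fragment, gap = inferred_fragment_and_gap(1000, 150, 1300, 150)
print(fragment, gap)  # 450 150
```

A negative gap simply means the two reads overlap on the reference, which is the case where merging the ends becomes possible.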
For the first one, I think you meant to say, "only if you are not ..." since that is when the exact length of the un-sequenced region is needed the most.
For the second one, I now understand more clearly that it can only be inferred after the fact. However, do you think the aligner uses the lesser information that they are read pairs from the same strand and that one comes after the other in its alignment to the reference genome? Are you aware of an algorithm that uses this lesser information?
I don't know for sure that assembly uses the lengths of paired-end reads, but assemblers will expect a distribution of sizes (e.g. 350-550 for standard libraries). I meant above that the exact length of the unsequenced region will only be known if you are assembling using a known reference and can place the reads on that reference. Many assemblers are based on k-mers (which are derived from the reads) and use those to create de Bruijn graphs (one example paper).
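As a rough illustration of the k-mer idea, the sketch below builds a toy de Bruijn graph in which every k-mer from the reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The function name is hypothetical, and real assemblers additionally handle reverse complements, sequencing errors, and coverage, none of which is modelled here.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a toy de Bruijn graph: (k-1)-mer prefix -> list of (k-1)-mer suffixes."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of width k along the read to enumerate its k-mers.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


# Two overlapping toy reads; the shared sequence links them in the graph.
graph = de_bruijn_edges(["ACGTAC", "GTACGG"], k=3)
for prefix, suffixes in sorted(graph.items()):
    print(prefix, "->", suffixes)
```

Note that pairing information plays no role at this stage: the graph is built from the read sequences alone.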
Small correction: read pairs are from the same fragment, but they are not from the same strand. Someone more algorithmically oriented will comment in detail, but paired-end reads appear to be useful for orienting nodes later on in the scaffolding process. This looks like a good read on how Velvet (a de Bruijn assembler) works.
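One way the "lesser information" can be used is as a consistency check: the two reads should map to opposite strands, face each other, and imply a fragment length inside the expected library range. The sketch below is only an illustration of that idea, not the method of any particular aligner or assembler. The function name and the `(start, length, strand)` tuple layout are hypothetical, the 350-550 range is the example figure quoted above, and alignments are assumed to be ungapped with 1-based leftmost coordinates.

```python
def is_concordant_pair(read1, read2, min_fragment=350, max_fragment=550):
    """Check whether two aligned reads look like a plausible pair.

    Each read is a hypothetical (start, length, strand) tuple with a
    1-based leftmost start and strand "+" or "-".
    """
    strand1, strand2 = read1[2], read2[2]
    if strand1 == strand2:
        return False  # mates of one fragment map to opposite strands
    plus, minus = (read1, read2) if strand1 == "+" else (read2, read1)
    if minus[0] < plus[0]:
        return False  # reads point away from each other
    fragment_length = (minus[0] + minus[1] - 1) - plus[0] + 1
    return min_fragment <= fragment_length <= max_fragment


print(is_concordant_pair((1000, 150, "+"), (1300, 150, "-")))  # True
print(is_concordant_pair((1000, 150, "+"), (5000, 150, "-")))  # False
```

Pairs that fail such a check are not necessarily wrong; they can also point to structural differences between the sample and the reference.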