Question

Paired-End Sequencing Algorithm Confusion (Illumina)

0

3.4 years ago

rayoub ▴ 110

When I learned about de novo paired-end sequencing algorithms in school, the data was presented as a set of large DNA fragments, each with two sequenced ends and a region between them that was not sequenced but whose exact length was known. The sequences of the two ends and the exact length of the un-sequenced region were then used to assemble the genome.

Now that I'm working in a lab with NGS sequence data from Illumina paired-end sequencing, it does not appear to work that way. I'm guessing this is because we are aligning to a reference genome. It does not appear to know the un-sequenced region length before aligning to the reference genome. I think it can be estimated after the fact but that is another point of discussion. It also appears to merge the two ends if there is overlap and consider them separately if there is not overlap. Here are some questions that I have which are confusing me.

All questions pertain to the no-overlap case.

Does Illumina know the exact length of the un-sequenced region for a pair?
If Illumina does not know the exact length of the un-sequenced region for a pair, are there algorithms that make use of the fact that one read comes after the other when aligning to a reference genome (or even for de novo assembly)?

Also, any additional insights you can offer are appreciated. I'm very familiar with the Illumina documentation in detail, so you don't need to repeat that. I'm more interested in a wider understanding of this and of the relationship between what I learned in school and how sequencing actually works in labs.

Thanks for any help provided.

Sequencing • 1.1k views

ADD COMMENT • link updated 3.4 years ago by swbarnes2 14k • written 3.4 years ago by rayoub ▴ 110

1

"The sequences for the two ends and the exact length of the un-sequenced region were then used to assemble the genome."

Only if you are assembling using a reference-based assembly strategy.

"Does Illumina know the exact length of the un-sequenced region for a pair?"

Not unless there is a reference that you can align to. The length of the "unsequenced" part is inferred from the alignments of the pair of reads to a reference.
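To make the "inferred from the alignments" part concrete, here is a minimal sketch of the arithmetic. It assumes 0-based leftmost alignment positions, equal-length reads, and gap-free alignments; the function name `inferred_fragment` and the numbers are made up for illustration and do not come from any particular aligner.

```python
def inferred_fragment(fwd_start, rev_start, read_len):
    """Return (fragment_length, unsequenced_gap) for one aligned read pair.

    fwd_start: 0-based leftmost reference position of the forward-strand read
    rev_start: 0-based leftmost reference position of the reverse-strand read
    read_len:  length of each read (assumed equal, aligned without indels)
    """
    # The fragment spans from the start of the forward read
    # to the end of the reverse read.
    fragment_length = rev_start + read_len - fwd_start
    # The unsequenced middle is whatever lies between the two reads;
    # a negative value means the reads overlap.
    gap = rev_start - (fwd_start + read_len)
    return fragment_length, gap


# Hypothetical pair: 150 bp reads placed at positions 1000 and 1300.
print(inferred_fragment(1000, 1300, 150))  # (450, 150)
```

None of these numbers exist before alignment; they only fall out once both reads have been placed on the reference.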

ADD REPLY • link 3.4 years ago by GenoMax 147k

0

For the first one, I think you meant to say, "only if you are not ..." since that is when the exact length of the un-sequenced region is needed the most.

For the second one, I understand more clearly now that it can only be inferred after the fact. However, do you think it uses the lesser information that they are read pairs from the same strand and that one comes after the other in its alignment to the reference genome? Are you aware of an algorithm that uses this lesser information?

ADD REPLY • link 3.4 years ago by rayoub ▴ 110

1

I don't know for sure that assembly uses the lengths of paired-end reads, but assemblers will expect a distribution of sizes (e.g. 350-550 bp for standard libraries). What I meant above is that the exact length of the unsequenced region will only be known if you are assembling against a known reference and can place the reads on that reference. Many assemblers are based on k-mers (which are derived from the reads) and use those to create de Bruijn graphs (one example paper).
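As a rough illustration of the k-mer idea, here is a toy sketch that builds a de Bruijn graph from reads. It is not how any specific assembler is implemented (real ones also handle reverse complements, sequencing errors, and coverage); the function name `de_bruijn` and the two reads are invented for the example.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of size k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-mer prefix to its (k-1)-mer suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two hypothetical overlapping reads.
graph = de_bruijn(["ACGTAC", "GTACGG"], k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Note that nothing in this graph records which reads were mates; pairing information only comes in afterwards, for example during scaffolding.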

"they are read pairs from the same strand"

Small correction: read pairs are from the same fragment, but they are not from the same strand. Someone more algorithmically oriented will comment in detail, but paired-end reads appear to be useful for orienting nodes later on in the scaffolding process. This looks like a good read on how Velvet (a de Bruijn assembler) works.

ADD REPLY • link 3.4 years ago by GenoMax 147k

score 1 · Answer 1 · 2021-06-04

There are stages in the library prep where the library is run on a gel; you can get an idea of what most of the insert sizes are by looking at the peak in the gel that corresponds to your desired library products. Alignment programs like bwa will also look at what maps where and, after looking at a few thousand or so pairs, will have an idea of the distribution of insert sizes.
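A rough sketch of that "look at the first few thousand pairs" idea is below. It is not bwa's actual procedure, just a plain mean and standard deviation over the first batch of observed insert sizes; the function name, the `sample` default, and the input values are assumptions for illustration.

```python
import statistics


def estimate_insert_sizes(insert_sizes, sample=5000):
    """Estimate (mean, standard deviation) from the first `sample` insert sizes."""
    head = insert_sizes[:sample]
    return statistics.mean(head), statistics.stdev(head)


# Hypothetical insert sizes (bp) from confidently mapped pairs.
observed = [400, 420, 380, 410, 390]
mean, sd = estimate_insert_sizes(observed)
print(mean, round(sd, 1))
```

A real aligner would want something more robust to outliers (chimeric pairs, structural variants), which is one reason to treat this as a sketch only.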

And sure, alignment algorithms will use the fact that pairs have an expected orientation and distance from each other to assist in mapping. I don't know how old the last good popular DNA mapper is, but they are pretty old, and the same goes for assemblers; no one is leaving potentially useful information on the table here.
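As a hedged sketch of how orientation and distance could be used, the check below asks whether a candidate placement of a pair looks "as expected". It assumes a standard forward/reverse library, 0-based leftmost positions, and an already estimated insert-size mean and standard deviation; the function name `is_concordant` and the 3-SD cutoff are illustrative choices, not taken from any particular mapper.

```python
def is_concordant(fwd_start, rev_start, read_len, mean, sd, n_sd=3):
    """Return True if the pair has the expected orientation and spacing.

    fwd_start: 0-based leftmost position of the read mapped to the forward strand
    rev_start: 0-based leftmost position of the read mapped to the reverse strand
    """
    # Expected orientation: the forward-strand read sits upstream of its mate.
    if rev_start < fwd_start:
        return False
    # Expected distance: implied fragment length close to the library mean.
    fragment_length = rev_start + read_len - fwd_start
    return abs(fragment_length - mean) <= n_sd * sd


# Hypothetical library: mean insert 450 bp, SD 50 bp, 150 bp reads.
print(is_concordant(1000, 1300, 150, mean=450, sd=50))  # True
print(is_concordant(1000, 5000, 150, mean=450, sd=50))  # False
```

When one read has several equally good candidate locations, a mapper can prefer the one that makes the pair pass a check like this.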