Question: What is the difference between paired end reads and overlapping reads, and then why merge overlapping reads before assembly?
7 weeks ago by
robert.murphy10 wrote:

What is the difference between paired-end reads and overlapping reads? After reading around, I am struggling to find an answer as to what overlapping reads actually are. From what I can see, they just seem to be another term for paired-end reads, but I feel this must be wrong, or I would have found it stated somewhere.

Secondly, why merge overlapping reads prior to assembly?

Any help would be appreciated :)

sequencing assembly genome • 196 views
modified 7 weeks ago by lieven.sterck6.8k • written 7 weeks ago by robert.murphy10

Paired-end reads: reads that are produced in pairs with an approximately known insert distance (they are designed this way). Overlapping reads: paired-end reads for which the insert distance becomes negative (they physically overlap). Why merge them: to avoid mis-assembly. Imagine a read with the sequence ACGT and an overlapping read TACG. You can merge them into ACGTACG. Or you can leave them unmerged, and then another read, e.g. CGTCC, may occur in your data, and you will get the assembly path ACGTCC instead of the correct ACGTACG. I'm not saying that's how the assembly algorithm chooses where to put a path (there are various considerations), but it may happen in approximately this way.

UPD: "insert distance" above should be "inner distance", as clarified by the answer below.
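To make the ACGT / TACG example concrete, here is a minimal Python sketch of merging two reads on their longest exact suffix/prefix overlap. The function name `merge_overlapping` and the `min_overlap` parameter are made up for illustration. Real read mergers also reverse-complement the second read, tolerate mismatches and use base qualities, none of which is modelled here.

```python
from typing import Optional


def merge_overlapping(read1: str, read2: str, min_overlap: int = 1) -> Optional[str]:
    """Merge read2 onto the end of read1 using the longest exact overlap.

    Hypothetical helper for illustration only: exact matching, no
    reverse-complementing, no quality scores.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest_possible = min(len(read1), len(read2))
    # Try the longest candidate overlap first, then shorter ones.
    for k in range(longest_possible, min_overlap - 1, -1):
        if read1[-k:] == read2[:k]:
            # Keep read1 whole and append the non-overlapping tail of read2.
            return read1 + read2[k:]
    return None  # no overlap of the required length


merged = merge_overlapping("ACGT", "TACG")
print(merged)
```

With a one-base overlap like this toy case the merge is obviously not trustworthy, which is why a merger would normally require a much longer overlap before joining a pair.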

modified 7 weeks ago • written 7 weeks ago by German.M.Demidov1.3k
7 weeks ago by
lieven.sterck6.8k
VIB, Ghent, Belgium
lieven.sterck6.8k wrote:

Well, overlapping reads are a "special case" of paired-end reads.

Just as the name says, they overlap each other (which in normal cases should not happen). It's all linked to the fragment/insert size of your library and the read length of the sequencing mode you run.

E.g. say your insert is 200 bp and you sequence 150 bp PE reads: those reads will overlap by 50-100 bases (== overlapping reads). If you do 75 bp PE sequencing on the same library, they will not overlap.

Looking at the image above, if the inner distance becomes negative, the reads will overlap.
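The arithmetic behind the 200 bp example can be sketched as below. This is only an illustration: the function names are made up, "insert" is taken to mean the fragment between the adapters (as in this answer), and a single exact insert size is assumed, whereas a real library has a spread of insert sizes, so the overlap varies from pair to pair.

```python
def inner_distance(insert_size: int, read_length: int) -> int:
    """Unsequenced gap between the two mates; negative means they overlap."""
    return insert_size - 2 * read_length


def overlap_length(insert_size: int, read_length: int) -> int:
    """Number of bases covered by both mates (0 if they do not overlap)."""
    # The overlap can never be longer than the insert itself, which matters
    # when the insert is shorter than a single read.
    return max(0, min(insert_size, 2 * read_length - insert_size))


for read_length in (150, 75):
    print(
        f"insert=200 read={read_length} "
        f"inner={inner_distance(200, read_length)} "
        f"overlap={overlap_length(200, read_length)}"
    )
```

For an insert of exactly 200 bp, 150 bp reads give an inner distance of -100 (the mates share 100 bases), while 75 bp reads give an inner distance of +50 and no overlap.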

In some cases (e.g. assembly, amplicon sequencing, ...) it's advantageous to have overlapping reads, as you will end up with longer sequences which you know form a single biological entity.

modified 7 weeks ago • written 7 weeks ago by lieven.sterck6.8k

Ah okay, so paired-end reads don't always overlap. Why is this the case, and would you then not be losing a lot of information in your sequencing?

Secondly, why then do paired-end over single-end, or vice versa?

Thank you for the help :)

written 7 weeks ago by robert.murphy10

Exactly, in theory paired-end reads should not overlap.

Why do you assume you would be losing information? In the best case you will actually gain info (all bases of your input DNA/RNA will have been sequenced, and you will even have a double check on the overlapping ones).

PE over SE because that still gives you more sequence data (and will give higher specificity when doing read mapping, for instance). You might opt for SE if you don't really need the extra info of PE reads, because SE reads are cheaper, take less space to store, and so on. Typically one would do SE sequencing for gene expression analysis (no need for the extra info of PE, one just has to be able to map the reads accurately) and PE for assembly, for instance (this is often de novo, so you will benefit from having more data).

written 7 weeks ago by lieven.sterck6.8k

Ah okay, that all makes sense. I assumed information was lost because of the inner distance shown in the image above. Nothing is capturing the sequence data there, right? I assume that is why Illumina uses long and short fragment sizes, to capture the maximum information and to be able to deal with repeat regions? Thank you very much for all the help!

written 7 weeks ago by robert.murphy10

You just sequence with high coverage, so there will be other reads that cover the "inner distance" gap.
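As a rough illustration of why high coverage takes care of the gap: under the idealised assumption that reads land uniformly at random on the genome (a Poisson model, not something real libraries follow exactly), the chance that a given base is hit by no read at all is about e^(-coverage). The function name below is made up for this sketch.

```python
import math


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of bases with zero reads under a Poisson model.

    Assumes reads are placed uniformly at random; real data has biases
    (GC content, repeats), so treat this as an optimistic estimate.
    """
    return math.exp(-coverage)


for coverage in (1, 5, 10, 30):
    print(f"{coverage}x coverage: {fraction_uncovered(coverage):.2e} of bases uncovered")
```

So the unsequenced middle of one fragment is very likely to be covered by reads from other fragments once the coverage is reasonably deep.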

written 7 weeks ago by German.M.Demidov1.3k

in theory paired end reads should not overlap.

Not necessarily. There are library designs where they need to overlap for specific reasons.

written 7 weeks ago by genomax77k

True, there are indeed cases where overlapping is desired (or even needed), but I still stand by the statement that they should in theory not overlap. ;)

I'm supported in this by, for instance, Trimmomatic, which (in default mode) will discard overlapping reads if they overlap substantially. Also, gsnap (yes, old software, I know) will not map them, since they do not comply with its default standards.

written 7 weeks ago by lieven.sterck6.8k