Tool: Tools to merge overlapping paired-end reads
3.8 years ago by
Karachi, PK
Abdul Rafay Khan1.1k wrote:


In very simple terms, current sequencing technology begins by breaking up long pieces of DNA into many shorter pieces of DNA. The resultant set of DNA is called a "library" and the short pieces are called "fragments". Each of the fragments in the library is then sequenced individually and in parallel. There are two ways of sequencing a fragment: from just one end, or from both ends. If only one end is sequenced, you get a single read. If your technology can sequence both ends, you get a "pair" of reads for each fragment. These "paired-end" reads are standard practice on Illumina instruments like the GAIIx, HiSeq and MiSeq.

Now, for single-end reads, you need to make sure your read length (L) is shorter than your fragment length (F), otherwise the sequencer will run out of DNA to read! Typical Illumina fragment libraries would use F ~ 450bp, but this is variable. For paired-end reads, you want to make sure that F is long enough to fit two reads, which means F needs to be at least 2L. As L=100 or 150bp these days for most people, using F~450bp is fine: there is still a safety margin in the middle.
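To make the arithmetic above concrete, here is a minimal Python sketch. The function name `pair_layout` and its labels are made up for illustration; it only applies the F versus 2L rule described above to a single idealised fragment and ignores adaptors and sequencing errors.

```python
def pair_layout(fragment_len, read_len):
    """Classify how two reads of read_len sit on a fragment of fragment_len.

    Returns (layout, size), where size is the unsequenced gap in the middle,
    the number of overlapping bases, or the number of bases each read runs
    past the end of the fragment, depending on the layout.
    """
    if fragment_len <= 0 or read_len <= 0:
        raise ValueError("lengths must be positive")
    if read_len > fragment_len:
        # L > F: the read is longer than the fragment itself.
        return ("read-through", read_len - fragment_len)
    gap = fragment_len - 2 * read_len
    if gap >= 0:
        # F >= 2L: the two reads fit with `gap` unknown bases between them.
        return ("gap", gap)
    # F < 2L: the two reads cover some of the same bases.
    return ("overlap", -gap)


if __name__ == "__main__":
    for f in (450, 250, 100):
        print(f, pair_layout(f, 150))
```

With L=150, a 450bp fragment leaves a gap in the middle, while a 250bp fragment gives a pair that overlaps.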

However, some things have changed in the Illumina ecosystem this year. Firstly, read lengths are now moving to >150bp on the HiSeq (and have already been on the GAIIx), and to >250bp on the MiSeq, with possibilities of longer ones coming soon! This means that the standard library size F~450bp has become too small, and paired-end reads will overlap. Secondly, the new enzymatic Nextera library preparation system produces a much wider spread of F sizes than the previous TruSeq system. With Nextera, we see F ranging from 100bp to 900bp in the same library. So some reads will overlap, and others won't. It's starting to get messy.

The whole point of paired-end reads is to get the benefit of longer reads without actually being able to sequence reads that long. A paired-end read (two reads of length L) from a fragment of length F is a bit like a single read of length F, except that a stretch of bases in the middle is unknown, and the length of that stretch is only roughly known (libraries are only nominally of length F; each fragment will vary). This gives the reads a longer context, which particularly helps in de novo assembly and in aligning more reads unambiguously to a reference genome. However, many software tools get confused if you give them overlapping pairs. If we could merge such pairs into longer single-end reads, many tools would produce better results, and faster.
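As a rough illustration of what "overlapping" a pair means, here is a naive Python sketch. It is not the algorithm used by any of the tools in this post; the function names, the `min_overlap` and `max_mismatch_frac` parameters, and the "longest acceptable overlap wins" rule are all assumptions made for the example. It reverse-complements read 2, then looks for a suffix of read 1 that matches a prefix of it.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a read pair into one sequence, or return None if no overlap fits.

    read2 is given as sequenced (i.e. from the opposite strand), so it is
    reverse-complemented before comparison. Overlaps are tried from longest
    to shortest, and the first one within the mismatch budget is accepted.
    Inside the overlap the bases of read1 are kept unchanged.
    """
    r2 = reverse_complement(read2)
    longest = min(len(read1), len(r2))
    for olap in range(longest, min_overlap - 1, -1):
        tail = read1[-olap:]   # end of read 1
        head = r2[:olap]       # start of reverse-complemented read 2
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olap:
            return read1 + r2[olap:]
    return None


if __name__ == "__main__":
    fragment = "ACGACGACGACGACG" + "TTGCATGCAA" + "TTCCGGATCGATAGC"
    r1 = fragment[:25]
    r2 = reverse_complement(fragment[-25:])
    print(merge_pair(r1, r2, max_mismatch_frac=0.0) == fragment)
```

A sketch this simple will mis-merge repetitive sequence and does not handle fragments shorter than the read length, which is part of why the dedicated tools below exist.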

The tools

Here is a list of tools which can do the overlapping procedure. I am NOT going to review them all here. I've used one tool (FLASH) to overlap some MiSeq 2x150 PE reads, and then assembled them using Velvet, and the merged reads produced a "better" assembly than with the paired reads. But that's it. I write this post to inform people of the problem, and to collate all the tools in one place to save others effort. Enjoy!

PEAR (Paired-End Read Merger)

COPE (Connecting Overlapping Paired End reads)


FLASH (Fast Length Adjustment of Short Reads to Improve Genome Assemblies)

fastq-join (part of ea-utils)


stitch (now defunct, merged into PANDAseq)

Features to look for

Keeps original IDs in merged reads

Outputs the un-overlapped paired reads

Ability to strip adaptors first

Rescores the Phred qualities across the overlapped region

Parameters to control the overlap sensitivity

Handle .gz and .bz2 compressed files

Multi-threading support

Written in C/C++ (faster compiled) rather than Python/Perl (slower)
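To illustrate the "rescores the Phred qualities" feature above, here is a small Python sketch of one simple heuristic. It is an assumption for illustration only, not the scoring scheme of any particular tool: agreeing base calls add their qualities (capped at a hypothetical `q_max`), and disagreeing calls keep the higher-quality base with the difference of the two qualities. It assumes Phred+33 encoded quality strings and that read 2's overlap has already been reverse-complemented and aligned to read 1's.

```python
PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding


def consensus_base(base1, q1, base2, q2, q_max=41):
    """Combine two calls for the same position into (base, phred_score)."""
    if base1 == base2:
        # Two independent observations agree: more confidence, capped.
        return base1, min(q1 + q2, q_max)
    # Disagreement: trust the better call, but with reduced confidence.
    if q1 >= q2:
        return base1, q1 - q2
    return base2, q2 - q1


def rescore_overlap(seq1, qual1, seq2, qual2, q_max=41):
    """Return (consensus_seq, consensus_qual) for an aligned overlap region.

    seq2/qual2 must already be oriented like seq1 (reverse-complemented
    bases, reversed qualities) and all four strings must be the same length.
    """
    if not (len(seq1) == len(qual1) == len(seq2) == len(qual2)):
        raise ValueError("overlap strings must all have the same length")
    bases, quals = [], []
    for b1, c1, b2, c2 in zip(seq1, qual1, seq2, qual2):
        q1 = ord(c1) - PHRED_OFFSET
        q2 = ord(c2) - PHRED_OFFSET
        base, q = consensus_base(b1, q1, b2, q2, q_max)
        bases.append(base)
        quals.append(chr(q + PHRED_OFFSET))
    return "".join(bases), "".join(quals)


if __name__ == "__main__":
    print(rescore_overlap("ACGT", "II5#", "ACGA", "5I5I"))
```

Real tools differ in how they treat disagreements and in what cap they apply, so check each tool's documentation before relying on merged qualities downstream.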

modified 2.1 years ago by kate.j.mckenzie • written 3.8 years ago by Abdul Rafay Khan

This was originally posted by Torsten Seemann to his blog The Genome Factory. This was back in 2012, so some of the recommendations may be out of date. Two more recent tools worth looking at are leeHom and AdapterRemoval v2.

written 2.1 years ago by kate.j.mckenzie

This is a nice overview :) - You should also take a look at BBMap, which is usually very good at these sorts of read manipulation things.

written 3.8 years ago by John

Specifically from BBMap. @Brian has an extended post available here.

written 3.8 years ago by genomax