Question

Tool: Tools to merge overlapping paired-end reads

39

7.4 years ago

Abdul Rafay Khan ★ 1.2k

Introduction

In very simple terms, current sequencing technology begins by breaking up long pieces of DNA into many shorter pieces of DNA. The resultant set of DNA is called a "library" and the short pieces are called "fragments". Each of the fragments in the library is then sequenced individually and in parallel. There are two ways of sequencing a fragment: either from just one end, or from both ends. If only one end is sequenced, you get a single read. If your technology can sequence both ends, you get a "pair" of reads for each fragment. These "paired-end" reads are standard practice on Illumina instruments like the GAIIx, HiSeq and MiSeq.

Now, for single-end reads, you need to make sure your read length (L) is shorter than your fragment length (F), otherwise the sequencer will run out of DNA to read! Typical Illumina fragment libraries would use F ~ 450bp but this is variable. For paired-end reads, you want to make sure that F is long enough to fit two reads. This means you need F to be at least 2L. As L=100 or 150bp these days for most people, using F~450bp is fine; there is still a safety margin in the middle.
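The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `pair_geometry` is made up for this example, and it assumes both reads have the same length and ignores adapter read-through when the fragment is shorter than the read.

```python
def pair_geometry(fragment_len, read_len):
    """Return (overlap, gap) in bp for two reads of read_len taken from
    opposite ends of a fragment of fragment_len."""
    # A read cannot cover more bases than the fragment contains.
    effective = min(read_len, fragment_len)
    covered = 2 * effective
    overlap = max(0, covered - fragment_len)  # bases seen by both reads
    gap = max(0, fragment_len - covered)      # unsequenced bases in the middle
    return overlap, gap


for read_len in (100, 150, 250):
    overlap, gap = pair_geometry(450, read_len)
    print(f"F=450 L={read_len}: overlap={overlap}bp gap={gap}bp")
```

With F=450 the pair has a 150bp gap at L=150, but a 50bp overlap once L reaches 250.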

However, some things have changed in the Illumina ecosystem this year. Firstly, read lengths are now moving to >150bp on the HiSeq (and have already been on the GAIIx), and to >250bp on the MiSeq, with possibilities of longer ones coming soon! This means that the standard library size F~450bp has become too small, and paired-end reads will overlap. Secondly, the new enzymatic Nextera library preparation system produces a wide spread of F sizes compared to the previous TruSeq system. With Nextera, we see F ranging from 100bp to 900bp in the same library. So some reads will overlap, and others won't. It's starting to get messy.

The whole point of paired-end reads is to get the benefit of longer reads without actually being able to sequence reads that long. A paired-end read (two reads of length L) from a fragment of length F is a bit like a single read of length F, except that a bunch of bases in the middle of it are unknown, and how many of them there are is only roughly known (libraries are only nominally of length F, so the actual length of each fragment will vary). This gives the reads a longer context, which particularly helps in de novo assembly and in aligning more reads unambiguously to a reference genome. However, many software tools will get confused if you give them overlapping pairs, and if we could overlap them and turn them into longer single-end reads, many tools would produce better results, and faster.
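To make the idea of "overlapping" a pair concrete, here is a deliberately naive Python sketch. It is not how any of the tools listed in this post work internally; the function names, `min_overlap` and `max_mismatch_frac` are assumptions chosen for illustration, and it ignores quality values entirely.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a forward read with its reverse mate.

    Returns the merged sequence, or None if no overlap of at least
    min_overlap bases passes the mismatch threshold.
    """
    # Put the mate on the same strand as read1.
    mate = reverse_complement(read2)
    # Try the longest candidate overlap first: suffix of read1 vs prefix of mate.
    for olen in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        suffix = read1[-olen:]
        prefix = mate[:olen]
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatch_frac * olen:
            # Keep read1 whole and append the non-overlapping tail of the mate.
            return read1 + mate[olen:]
    return None


fragment = "ACGTTGCAATCGGATCCTAGGCTTAACGTAGCTAGGATCC"
r1 = fragment[:25]
r2 = reverse_complement(fragment[15:])
print(merge_pair(r1, r2) == fragment)
```

Real mergers score candidate overlaps more carefully, and this sketch does not handle fragments shorter than the read length, where the mate starts before read1 does.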

The tools

Here is a list of tools which can do the overlapping procedure. I am NOT going to review them all here. I've used one tool (FLASH) to overlap some MiSeq 2x150 PE reads, and then assembled them using Velvet, and the merged reads produced a "better" assembly than with the paired reads. But that's it. I write this post to inform people of the problem, and to collate all the tools in one place to save others effort. Enjoy!

PEAR (Paired-End Read Merger) http://sco.h-its.org/exelixis/web/software/pear/doc.html

COPE (Connecting Overlapping Paired End reads) http://sourceforge.net/projects/coperead/

SeqPrep https://github.com/jstjohn/SeqPrep

FLASH (Fast Length Adjustment of Short Reads to Improve Genome Assemblies) http://www.cbcb.umd.edu/software/flash

fastq-join (part of ea-utils) http://code.google.com/p/ea-utils/wiki/FastqJoin

PANDAseq https://github.com/neufeld/pandaseq

stitch (now defunct, merged into PANDAseq) https://github.com/audy/stitch

mergePairs.py http://code.google.com/p/standardized-velvet-assembly-report/source/browse/trunk/mergePairs.py

Features to look for

Keeps original IDs in merged reads

Outputs the un-overlapped paired reads

Ability to strip adaptors first

Rescores the Phred qualities across the overlapped region

Parameters to control the overlap sensitivity

Handle .gz and .bz2 compressed files

Multi-threading support

Written in C/C++ (faster compiled) rather than Python/Perl (slower)
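As an illustration of what "rescoring" the overlapped region can mean, here is one simple heuristic in Python. It is an assumption made for this example, not the scheme used by any particular tool above: agreeing bases get the sum of the two Phred scores (capped), and disagreeing bases keep the higher-quality call with the difference of the scores. The names `consensus_base`, `phred_to_char` and the cap `max_qual=41` are hypothetical.

```python
def consensus_base(base1, qual1, base2, qual2, max_qual=41):
    """Combine two overlapping base calls into one (base, Phred quality)."""
    if base1 == base2:
        # Two independent observations agree: confidence goes up.
        return base1, min(qual1 + qual2, max_qual)
    # The calls disagree: trust the better one, but with reduced confidence.
    if qual1 >= qual2:
        return base1, qual1 - qual2
    return base2, qual2 - qual1


def phred_to_char(qual, offset=33):
    """Encode a Phred score as a FASTQ quality character (Phred+33 by default)."""
    return chr(qual + offset)


print(consensus_base("A", 30, "A", 20))
print(consensus_base("A", 30, "C", 10))
```

Tools that rescore qualities generally use a more principled probabilistic model, so check each tool's documentation for what it actually does.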

ngs Assembly fastq • 43k views

ADD COMMENT • link updated 12 months ago by Charles-Alexandre Roy ▴ 50 • written 7.4 years ago by Abdul Rafay Khan ★ 1.2k

1

This was originally posted by Torsten Seemann to his blog The Genome Factory. This was back in 2012, so some of the recommendations may be out of date. Two more recent tools worth looking at are leeHom and AdapterRemoval v2.

ADD REPLY • link 5.7 years ago by kate.j.mckenzie ▴ 10

1

Thanks GenoMax. I have used the BBMap toolkit a lot, and I agree it's really great. But bbmerge does not use a reference to merge the reads, right? My concern is that some reads will have small overlaps, and thus the overall fidelity of this approach will not be as high as it could be with a reference. But I will give it a try!

ADD REPLY • link 3.2 years ago by lechu ▴ 20

1

bbmerge has plenty of options. You can play with them and see if you are able to get the kind of overlaps you are looking for.

ADD REPLY • link 3.2 years ago by GenoMax 141k

0

This is a nice overview :) - You should also take a look at BBMap, which is usually very good at this sort of read manipulation.

ADD REPLY • link 7.4 years ago by John 13k

2

Specifically bbmerge.sh from BBMap. @Brian has an extended post available here.

ADD REPLY • link 7.4 years ago by GenoMax 141k

1

Is there a tool to combine overlapping PE reads that uses a reference alignment for merging? I only found aftermerge but had no success due to CIGAR string problems (Merging overlapping mates in a BAM / SAM file into one read.)

ADD REPLY • link 3.2 years ago by lechu ▴ 20

1

This is not a commonly used analysis method, which is why there is a dearth of tools. You should try bbmerge out on your original data. You may be pleasantly surprised.

ADD REPLY • link 3.2 years ago by GenoMax 141k

0

Take a look at this article: "Optimizing Information in Next-Generation-Sequencing (NGS) Reads for Improving De Novo Genome Assembly"

Liu T, Tsai C-H, Lee W-B, Chiang J-H (2013) Optimizing Information in Next-Generation-Sequencing (NGS) Reads for Improving De Novo Genome Assembly. PLoS ONE 8(7): e69503. doi:10.1371/journal.pone.0069503

The authors present the ARF-PE tool, which looks amazing but seems to be discontinued.

ADD REPLY • link 19 months ago by Asan Emirsale • 0

0

NGmerge (2018) is another option. According to the paper, it performs better than other popular tools like FLASH and PEAR, particularly with respect to the estimation of quality scores for consensus bases.

ADD REPLY • link 12 months ago by Charles-Alexandre Roy ▴ 50