Tool: Tools to merge overlapping paired-end reads
4.4 years ago

Introduction

In very simple terms, current sequencing technology begins by breaking up long pieces of DNA into many shorter pieces. The resulting set of DNA is called a "library" and the short pieces are called "fragments". Each fragment in the library is then sequenced individually and in parallel. There are two ways of sequencing a fragment: from just one end, or from both ends. If only one end is sequenced, you get a single read. If your technology can sequence both ends, you get a "pair" of reads for each fragment. These "paired-end" reads are standard practice on Illumina instruments like the GAIIx, HiSeq and MiSeq.

Now, for single-end reads, you need to make sure your read length (L) is shorter than your fragment length (F), otherwise the sequencer will run out of DNA to read! Typical Illumina fragment libraries use F ~ 450bp, but this is variable. For paired-end reads, you want to make sure that F is long enough to fit two reads, which means F should be at least 2L. Since most people use L = 100 or 150bp these days, F ~ 450bp is fine: there is still a safety margin in the middle.

However, some things have changed in the Illumina ecosystem this year. Firstly, read lengths are now moving to >150bp on the HiSeq (and have already been on the GAIIx), and to >250bp on the MiSeq, with the possibility of longer ones coming soon! This means that the standard library size of F ~ 450bp has become too small, and paired-end reads will overlap. Secondly, the new enzymatic Nextera library preparation system produces a wider spread of F sizes than the previous TruSeq system. With Nextera, we see F ranging from 100bp to 900bp in the same library. So some read pairs will overlap, and others won't. It's starting to get messy.
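The arithmetic behind "F should be at least 2L" can be written down in a few lines. This is only an illustrative sketch: the function names `expected_overlap` and `unsequenced_gap` are made up for this example, and it assumes both reads have the same length L and that F is known exactly, which is never quite true for a real library.

```python
def expected_overlap(read_len: int, frag_len: int) -> int:
    """Number of fragment bases covered by BOTH reads of a pair.

    Returns 0 when the fragment is long enough (F >= 2L) that the
    two reads do not meet. When F < L, each read runs off the end of
    the fragment, so the overlap is the whole fragment.
    """
    return max(0, min(frag_len, 2 * read_len - frag_len))


def unsequenced_gap(read_len: int, frag_len: int) -> int:
    """Number of bases in the middle of the fragment covered by NEITHER read."""
    return max(0, frag_len - 2 * read_len)


if __name__ == "__main__":
    # 2x150 reads on a 450bp fragment, then 2x250 reads on the same fragment,
    # then 2x250 reads across the Nextera-like range of fragment sizes.
    for L, F in [(150, 450), (250, 450), (250, 100), (250, 900)]:
        print(f"L={L} F={F}: overlap={expected_overlap(L, F)} gap={unsequenced_gap(L, F)}")
```

With 2x150 reads and F = 450bp there is a gap and no overlap; with 2x250 reads on the same fragment the reads overlap, which is the situation the merging tools below are meant to handle.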

The whole point of paired-end reads is to get the benefit of longer reads without actually being able to sequence reads that long. A paired-end read (two reads of length L) from a fragment of length F is a bit like a single read of length F, except that a bunch of bases in the middle are unknown, and the number of unknown bases is only roughly known (libraries are only nominally of length F; each fragment will vary). This gives the reads a longer context, which particularly helps in de novo assembly and in aligning more reads unambiguously to a reference genome. However, many software tools get confused if you give them overlapping pairs. If we could overlap the pairs and turn them into longer single-end reads, many tools would produce better results, and faster.
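To make "overlapping a pair" concrete, here is a deliberately naive sketch of the idea. It is not how FLASH or any of the other tools actually works; `merge_pair`, `min_overlap` and `max_mismatch_frac` are names and defaults invented for this example. It assumes read 2 comes from the opposite strand (so it is reverse-complemented first), ignores base qualities, and does not handle fragments shorter than the read length.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse-complement an uppercase DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10,
               max_mismatch_frac: float = 0.1):
    """Try to merge a read pair into one longer read.

    Slides the reverse complement of read2 along the 3' end of read1,
    trying the longest candidate overlap first, and accepts the first
    overlap whose mismatch fraction is within max_mismatch_frac.
    Returns the merged sequence, or None if no acceptable overlap exists.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for olap in range(longest, min_overlap - 1, -1):
        tail = read1[-olap:]   # end of read 1
        head = mate[:olap]     # start of the reoriented read 2
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olap:
            # Keep read1's bases in the overlap, then append the rest of the mate.
            return read1 + mate[olap:]
    return None


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTACGATCGGATCCTAGA"       # toy 30bp fragment
    r1 = fragment[:20]                                # forward read
    r2 = reverse_complement(fragment[10:])            # reverse read
    print(merge_pair(r1, r2) == fragment)
```

Real tools differ mainly in how they score candidate overlaps (using qualities and expected fragment sizes) and in how they guard against false merges in repetitive sequence.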

The tools

Here is a list of tools which can do the overlapping procedure. I am NOT going to review them all here. I've used one tool (FLASH) to overlap some MiSeq 2x150 PE reads and then assembled them using Velvet, and the merged reads produced a "better" assembly than the paired reads did. But that's it. I wrote this post to inform people of the problem, and to collate all the tools in one place to save others the effort. Enjoy!

FLASH (Fast Length Adjustment of Short Reads to Improve Genome Assemblies) http://www.cbcb.umd.edu/software/flash

stitch (now defunct, merged into PANDAseq) https://github.com/audy/stitch

Features to look for

Keeps original IDs in merged reads

Rescores the Phred qualities across the overlapped region

Parameters to control the overlap sensitivity

Handle .gz and .bz2 compressed files

Written in a compiled language such as C/C++ (usually faster) rather than Python/Perl (usually slower)
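As an illustration of what "rescoring the Phred qualities across the overlapped region" can mean, here is one simple heuristic. It is an assumption made for this sketch, not the scheme used by any particular tool in the list: where the two reads agree, the qualities are added (capped at `max_qual`); where they disagree, the higher-quality base wins and its quality is reduced by the other read's quality (floored at `min_qual`). The function names and the cap/floor values are hypothetical.

```python
def decode_phred33(qual_string: str) -> list:
    """Convert a FASTQ quality string (Phred+33 encoding) to integer scores."""
    return [ord(ch) - 33 for ch in qual_string]


def consensus_base(base1: str, qual1: int, base2: str, qual2: int,
                   max_qual: int = 41, min_qual: int = 2):
    """Combine two observations of the same fragment position.

    Returns a (base, quality) tuple for the merged read.
    """
    if base1 == base2:
        # Two independent observations agree: more confident than either alone.
        return base1, min(qual1 + qual2, max_qual)
    if qual1 >= qual2:
        # Disagreement: trust the better base, but less than before.
        return base1, max(qual1 - qual2, min_qual)
    return base2, max(qual2 - qual1, min_qual)


if __name__ == "__main__":
    print(decode_phred33("I5"))
    print(consensus_base("A", 30, "A", 20))
    print(consensus_base("A", 30, "C", 10))
```

When comparing tools, it is worth checking which rule each one applies in the overlap, since downstream assemblers and variant callers may rely on those quality values.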

overlapping Assembly ngs fastq Tool • 25k views

This was originally posted by Torsten Seemann to his blog The Genome Factory. This was back in 2012, so some of the recommendations may be out of date. Two more recent tools worth looking at are leeHom and AdapterRemoval v2.


This is a nice overview :) - You should also take a look at BBMap, which is usually very good at these sorts of read manipulation things.


Specifically bbmerge.sh from BBMap. @Brian has an extended post available here.


Is there a tool for combining overlapping PE reads that uses a reference alignment for the merging? I only found aftermerge, but had no success with it due to CIGAR string problems (Merging overlapping mates in a BAM / SAM file into one read.)


This is not a commonly used analysis method, which is why there is a dearth of tools. You should try bbmerge out on your original data. You may be pleasantly surprised.


Thanks Genomax. I have used the bbmap toolkit a lot, and I agree it's really great. But bbmerge does not use a reference to merge the reads, right? My concern is that some reads will have small overlaps, and thus the overall fidelity of this approach will not be as high as it could be with a reference. But I will give it a try!


bbmerge has plenty of options. You can play with them and see if you are able to get the kind of overlaps you are looking for.