Question

Joining vs merging paired-end reads

1

5.7 years ago

drikaul ▴ 20

Hi community!

I have paired-end amplicon sequences for a batch of samples, with very little overlap (<10%).

Conceptually, I was wondering whether it makes sense to join the forward and reverse reads into a single read for downstream processing, instead of interleaving/merging them on their overlapping sequence, since merging isn't the best solution in this particular case.

Or perhaps concatenate R1 and R2 so that they are read in as a single read?

Thanks!

sequencing merging paired-ended • 8.4k views

ADD COMMENT • link updated 5.7 years ago by h.mon 35k • written 5.7 years ago by drikaul ▴ 20

0

Or perhaps just keeping them as two separate paired-end reads? Why would you want to merge or join them?

ADD REPLY • link 5.7 years ago by WouterDeCoster 48k

0

The idea is to call OTUs on them, so I'm trying to figure out the best way to make use of the forward and reverse reads, since the overlap is minimal. For now, I'm leaning towards using only the forward reads, since their quality is pretty okay in comparison, but I was wondering whether, conceptually, it makes sense to join the two.

ADD REPLY • link 5.7 years ago by drikaul ▴ 20

1

No, in my opinion it would not make sense to just concatenate the forward and reverse reads. That has to do with the downstream analyses: if you BLAST the concatenated sequences, there is a chance that you don't get the right biological hit, which is a must in this kind of study. Have you already tried to merge them to see how good or bad it is?

ADD REPLY • link 5.7 years ago by gb ★ 2.2k

0

Thanks, that makes sense and is along the lines of what I was thinking! If by merging the reads you mean checking the overlap, then yes, I already did that and it's minimal. I haven't tried joining them yet.

ADD REPLY • link 5.7 years ago by drikaul ▴ 20

0

Check out 'PANDAseq'.

"PANDASEQ is a program to align Illumina reads, optionally with PCR primers embedded in the sequence, and reconstruct an overlapping sequence."

You can also find many other similar tools on the web.

ADD REPLY • link 5.7 years ago by mbk0asis ▴ 700

score 2 · Answer 1 · 2019-10-30

2

5.7 years ago

Carambakaracho ★ 3.3k

In addition to PANDAseq, you may want to look into vsearch, FLASH2 and PEAR, all of which can do overlap merging. A classic approach is to merge pairs by overlap where applicable and to join/concatenate the pairs without sufficient overlap. The overlap depends on the fragment size distribution and varies considerably between pairs. vsearch can join the reads, too; see for example Torbjørn Rognes' and Frédéric Mahé's pipeline

Joining non-overlapping reads may or may not make sense, depending on what you plan for downstream processing. Some k-mer-based classification pipelines need the reads joined; in other cases it may not make sense.
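
To make the "merge where possible, join otherwise" idea concrete, here is a minimal Python sketch. It is not how PANDAseq, PEAR, FLASH2 or vsearch work internally: those tools use base qualities and statistical tests, while this toy version only counts mismatches. The function names, the `min_overlap` and `max_mismatch_frac` parameters, and the 8 N padding are all assumptions made for illustration.

```python
# Toy illustration only: real mergers score overlaps with base qualities.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def best_overlap(r1, r2_rc, min_overlap=10, max_mismatch_frac=0.1):
    """Longest overlap between the end of r1 and the start of r2_rc.

    Returns 0 when no overlap of at least min_overlap bases passes
    the mismatch threshold.
    """
    for length in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        tail, head = r1[-length:], r2_rc[:length]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * length:
            return length
    return 0


def merge_or_join(r1, r2, min_overlap=10, max_mismatch_frac=0.1, pad="NNNNNNNN"):
    """Merge a read pair on its overlap, or join it with N padding.

    r2 is given as sequenced (reverse strand) and is reverse-complemented
    first, so the result is always in forward orientation.
    """
    r2_rc = reverse_complement(r2)
    overlap = best_overlap(r1, r2_rc, min_overlap, max_mismatch_frac)
    if overlap:
        # Keep the r1 bases in the overlap; a real tool would pick per-base
        # consensus using quality scores.
        return "merged", r1 + r2_rc[overlap:]
    return "joined", r1 + pad + r2_rc


if __name__ == "__main__":
    fragment = "ATGCGTACGTTAGCCTAGGATCCGATTACAGGCTTAACGT"  # made-up 40 nt amplicon
    r1 = fragment[:25]
    r2 = reverse_complement(fragment[15:])
    print(merge_or_join(r1, r2, max_mismatch_frac=0.0))
    print(merge_or_join("ACGTACGTAC", "TTTTTTTTTT"))
```

Note that a joined read contains an artificial gap of unknown true length, so merged and joined reads from the same template will not be identical sequences. That is the reason to think about what the downstream step (clustering, BLAST, k-mer classification) does with them.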

ADD COMMENT • link 5.7 years ago by Carambakaracho ★ 3.3k

0

Yes, I used PEAR for this particular analysis, but I have used vsearch in the past too. I'm trying to call OTUs on the merged reads; would joining the non-overlapping reads make sense for that purpose? The way I understand it, this might generate erroneous OTU sequences that would skew downstream clustering analysis.

ADD REPLY • link 5.7 years ago by drikaul ▴ 20

0

It depends a bit, but sure, clustering merged and joined sequences from the same organism would yield two OTUs. On the other hand, when the differences within one pair don't justify merging, how likely is it that the pairs would end up in two separate OTUs?

ADD REPLY • link 5.7 years ago by Carambakaracho ★ 3.3k

score 0 · Answer 2 · 2019-10-30

My personal recommendation would be to use primers appropriate to the sequencing platform, so that the amplicon is shorter than the combined length of R1 and R2 and you get a good overlap, allowing unambiguous merging of pairs. This also reduces the error rate at the ends of the reads, where quality is lower and most errors occur.
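
As a quick sanity check on the "amplicon shorter than R1+R2" rule, the expected overlap is simply the two read lengths minus the amplicon length. The sketch below is a rough estimate only: the function name and the example numbers (2x250 reads, 460 bp and 550 bp amplicons) are made up for illustration, and it assumes reads start at the amplicon ends and are not trimmed.

```python
def expected_overlap(read1_len, read2_len, amplicon_len):
    """Expected overlap in bases between R1 and R2 for a given amplicon.

    A positive value is the overlap length; a negative value is the size
    of the unsequenced gap between the two reads.
    """
    return read1_len + read2_len - amplicon_len


if __name__ == "__main__":
    # Hypothetical 2x250 run with a 460 bp amplicon: 40 bases of overlap.
    print(expected_overlap(250, 250, 460))
    # Same run with a 550 bp amplicon: a 50-base gap, so no merging possible.
    print(expected_overlap(250, 250, 550))
```

Quality trimming at the 3' ends shortens the effective read lengths, so the usable overlap is usually smaller than this estimate.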

However, there is a tool developed to use both reads even when there is no overlap:

IM-TORNADO: A Tool for Comparison of 16S Reads from Paired-End Libraries