Question

Paired-End Overlapping With Error Correction - Which Is Better?

2

10.6 years ago

Rohit ★ 1.5k

Hello.

I have recently started a new de novo assembly project, and I want a good pipeline to start with.

I have to overlap my reads and also filter them by quality. I was thinking of using FLASH or COPE for overlapping, with a slight preference for COPE because I have used FLASH before and want to try a new tool.

I also need to do error correction and am planning to use Musket for it.

My questions are:

1) Should I run Musket for error correction before overlapping with COPE, or overlap the reads with COPE first and then run Musket?

2) Is FLASH still better than COPE for overlapping, or would you suggest something else?

ngs genome • 4.4k views

ADD COMMENT • link 10.5 years ago by Rohit ★ 1.5k

score 3 · Answer 1 · 2013-10-25

I have now used the tools I mentioned, so here are the results.

First, error correction should be done before merging, and Musket performs really well for the correction.

For merging, FLASH gave one-third more merged reads than COPE when using the same overlap values, but one-fourth of the FLASH merged reads were shorter than the length (read + overlap) cut-off. COPE produced better-quality overlaps and a higher number of long overlaps than FLASH. COPE was also more precise, as it applies a quality cut-off and removes ambiguous overlaps.
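To check what share of merged reads falls below a length cut-off, something like the following could be used. It is only a minimal sketch: it assumes plain four-line FASTQ records, and the function names and the cut-off value are hypothetical, not part of FLASH or COPE.

```python
from typing import Iterable, Iterator


def fastq_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the sequence length of every record in a 4-line-per-record FASTQ stream."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the second line of each record holds the sequence
            yield len(line.strip())


def fraction_shorter(lines: Iterable[str], cutoff: int) -> float:
    """Return the fraction of reads whose length is strictly below `cutoff`."""
    lengths = list(fastq_lengths(lines))
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n < cutoff) / len(lengths)
```

Passing an open file handle for the merged FASTQ of each tool, with the same cut-off, gives two fractions that can be compared directly.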

When error correction was done before merging, the number of merged reads increased by one-fifth compared with merging uncorrected reads.

I worked on primate data, and based solely on my results I would conclude:

1) don't trim, just error-correct (if your read data is not too bad), and 2) error-correct first, then merge.

score 2 · Answer 2 · 2013-09-29

10.6 years ago

rtliu ★ 2.2k

For question 2), I would suggest using abyss-mergepairs:

ABySS 1.3.5 included a new program, "abyss-mergepairs". Source code: https://github.com/bcgsc/abyss/blob/master/Align/mergepairs.cc

The program is described in the white spruce genome paper in Bioinformatics:

"2.4 Read merging Reads from the HiSeq 2000 PET 250 bp libraries and the MiSeq PET 500 bp libraries were merged using abyss mergepairs (Supplementary Fig. S3). This utility performed a pair-wise Smith Waterman overlapped alignment (Smith and Waterman, 1981) between reads pairs, and selected the best quality base where alignments returned mismatching bases. An arbitrary base was selected when qualities were identical. In cases of read-to-read alignment ambiguity, read pairs were not merged."
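The merging idea in the quoted passage can be illustrated roughly as follows. This is not the abyss-mergepairs implementation: it swaps the Smith-Waterman alignment for a simple ungapped suffix/prefix scan, treats "more than one acceptable overlap" as ambiguity, and assumes Phred+33 quality strings. The function names and the `min_overlap` and `max_mismatches` parameters are hypothetical.

```python
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(fwd: str, fwd_qual: str, rev: str, rev_qual: str,
               min_overlap: int = 10,
               max_mismatches: int = 1) -> Optional[Tuple[str, str]]:
    """Merge a read pair, or return None if no unique overlap is found."""
    rc = reverse_complement(rev)
    rc_qual = rev_qual[::-1]

    # Collect every ungapped overlap length within the mismatch budget.
    hits = []
    for ov in range(min_overlap, min(len(fwd), len(rc)) + 1):
        mismatches = sum(a != b for a, b in zip(fwd[-ov:], rc[:ov]))
        if mismatches <= max_mismatches:
            hits.append(ov)

    # No overlap, or several competing overlaps (ambiguity): do not merge.
    if len(hits) != 1:
        return None

    ov = hits[0]
    start = len(fwd) - ov
    seq = list(fwd[:start])
    qual = list(fwd_qual[:start])
    for i in range(ov):
        f_base, f_q = fwd[start + i], fwd_qual[start + i]
        r_base, r_q = rc[i], rc_qual[i]
        # Keep the higher-quality base; on a tie keep the forward base
        # (an arbitrary choice, as in the quoted description).
        if r_q > f_q:
            seq.append(r_base)
            qual.append(r_q)
        else:
            seq.append(f_base)
            qual.append(f_q)
    seq.extend(rc[ov:])
    qual.extend(rc_qual[ov:])
    return "".join(seq), "".join(qual)
```

A real merger would also handle indels in the overlap and pairs where the insert is shorter than the read length, which this sketch ignores.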

ADD COMMENT • link 10.6 years ago by rtliu ★ 2.2k

Have the results been compared with those of other existing tools? How different are they?

ADD REPLY • link 10.6 years ago by Rohit ★ 1.5k

I have not done any comparison, but I trust the ABySS authors.

ADD REPLY • link 10.6 years ago by rtliu ★ 2.2k

score 0 · Answer 3 · 2013-09-26

10.6 years ago

Istvan Albert 100k

I haven't used this combination of tools yet, but my gut says that, since the overlap relies on the sequences matching, you'd get better results if you corrected first and merged in the second step.

But the gut can be wrong, so maybe the best would be to try both and tell us what turned out to be better :-)

ADD COMMENT • link 10.6 years ago by Istvan Albert 100k

Will I be able to tell which way was better only after I am done with the assembly by comparing the N50 values and number of contigs, or do you think there is a checkpoint somewhere in the middle?

ADD REPLY • link 10.6 years ago by Rohit ★ 1.5k

Not quite sure. You may be able to see that from the number of reads that you can merge successfully.
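One way to use the merged-read count as an intermediate checkpoint is to compute the merge rate for each ordering and compare them. The sketch below assumes plain four-line FASTQ files; the helper names are hypothetical and the numbers in the usage note are made up for illustration.

```python
from typing import Iterable


def count_fastq_records(lines: Iterable[str]) -> int:
    """Count records in a 4-line-per-record FASTQ stream, ignoring blank lines."""
    return sum(1 for line in lines if line.strip()) // 4


def merge_rate(n_merged: int, n_pairs: int) -> float:
    """Fraction of input read pairs that were merged."""
    if n_pairs <= 0:
        raise ValueError("n_pairs must be positive")
    return n_merged / n_pairs


def relative_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old` (0.2 means 20% more)."""
    return (new - old) / old
```

For example, if 500 of 1000 pairs merge without correction and 600 of 1000 merge after correction, `relative_change(merge_rate(600, 1000), merge_rate(500, 1000))` is about 0.2, i.e. one-fifth more merged reads. A higher merge rate alone does not prove the merges are correct, so it is worth checking alongside the final assembly statistics.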

ADD REPLY • link 10.6 years ago by Istvan Albert 100k

But don't you think that errors in the data could also produce more merged reads, and that more aggressive correction could mean more data loss? I think I have to try both orders on my data, but I probably can't be sure which one worked better. A vicious circle of quality filtering, I guess :(

ADD REPLY • link 10.6 years ago by Rohit ★ 1.5k