Assembly Strategy
10.2 years ago
Panos ★ 1.8k

We're sequencing some big insect genomes (>2 Gbp) and, as the first data comes out, I'm trying to find a way to tackle them.

MaSuRCA crashed on just one lane of unfiltered sequences (~140 million read pairs, i.e. ~2 x 140 million reads), so I started looking for alternatives, since for some of our insects we have 10 lanes (paired-end and mate-pair).

One of the suggestions was to split our reads into smaller subsets and assemble each subset separately, then assemble the resulting assemblies with each other, and so on, until we get to the final assembly. One problem I see with this approach, however, is that you may end up merging contigs from different assemblies that have very different sequencing coverage (and hence different apparent copy number).
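For reference, the splitting step could look something like the sketch below. It is only a minimal illustration, not a tested pipeline: it assumes plain 4-line FASTQ records (no wrapped sequences), that R1 and R2 list the mates in the same order, and the function and file names (`split_pairs`, `subsetN_R1.fastq`) are made up for this example. Pairs are dealt out round-robin so that both mates always land in the same subset and each subset is an even sample of the lane, which should at least keep the depth similar between subsets.

```python
import gzip
import itertools
from pathlib import Path


def open_maybe_gzip(path):
    """Open a FASTQ file for text reading, transparently handling .gz files."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def fastq_records(handle):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped 4-line records)."""
    while True:
        record = tuple(itertools.islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_pairs(r1_path, r2_path, n_subsets, out_dir):
    """Deal read pairs round-robin into n_subsets pairs of FASTQ files.

    Returns the number of pairs written to each subset.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outs = [
        (open(out_dir / f"subset{i}_R1.fastq", "w"),
         open(out_dir / f"subset{i}_R2.fastq", "w"))
        for i in range(n_subsets)
    ]
    counts = [0] * n_subsets
    try:
        with open_maybe_gzip(r1_path) as h1, open_maybe_gzip(r2_path) as h2:
            pairs = itertools.zip_longest(fastq_records(h1), fastq_records(h2))
            for idx, (rec1, rec2) in enumerate(pairs):
                if rec1 is None or rec2 is None:
                    raise ValueError("R1 and R2 contain different numbers of reads")
                k = idx % n_subsets  # round-robin assignment keeps mates together
                outs[k][0].writelines(rec1)
                outs[k][1].writelines(rec2)
                counts[k] += 1
    finally:
        for o1, o2 in outs:
            o1.close()
            o2.close()
    return counts
```

A call such as `split_pairs("lane1_R1.fastq.gz", "lane1_R2.fastq.gz", 8, "subsets")` (file names hypothetical) would write eight paired subsets that can then be assembled independently.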

What are your thoughts about this approach, and which assembly strategy would you follow in general? I know that some plant genomes are a lot bigger than our insects, so maybe there is already a solution!

Also, the machine I ran MaSuRCA on had 256 GB of RAM, which I think is not small; maybe I can find a machine with 512 GB, but definitely not more than that. So please keep that in mind when suggesting solutions!

Last, I saw that there's a very similar question, but it was more than a year ago so some things may have changed since then...

Thanks!

illumina • 2.1k views
10.2 years ago

Splitting up the reads is likely a necessary strategy. I remember having to do this with some herpesvirus sequences. You can use a secondary assembler (such as the one in Staden) to assemble contigs produced with different parameters and/or subsets of the data (or even different assembly programs). Here is an example workflow for that sort of strategy:

http://genomics-pubs.princeton.edu/prv/scripts.shtml
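Before handing contigs from several runs to a secondary assembler, it usually helps to pool them into one FASTA file with the source of each contig recorded in its name, so that duplicate IDs (for example `contig_1` from every run) don't collide. The snippet below is just a rough sketch of that bookkeeping step, not part of the linked workflow; the function name `pool_contigs`, the `label|` prefix, and the file names are assumptions made for illustration.

```python
def pool_contigs(assemblies, out_path):
    """Concatenate contig FASTA files, prefixing each header with a run label.

    assemblies: dict mapping a run label (e.g. "subset0_k31") to a FASTA path.
    Returns the total number of contigs written.
    """
    total = 0
    with open(out_path, "w") as out:
        for label, path in assemblies.items():
            with open(path) as fh:
                for line in fh:
                    if not line.endswith("\n"):
                        line += "\n"  # guard against a missing final newline
                    if line.startswith(">"):
                        # Tag the contig ID with the run it came from
                        out.write(f">{label}|{line[1:]}")
                        total += 1
                    else:
                        out.write(line)
    return total
```

For example, `pool_contigs({"subset0": "subset0/contigs.fa", "subset1": "subset1/contigs.fa"}, "pooled_contigs.fa")` (paths hypothetical) gives a single input file in which every contig can still be traced back to its original run.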

You can also see if velvetOptimiser can help:

http://bioinformatics.net.au/software.velvetoptimiser.shtml

I know that Oases automatically collects Velvet runs with different parameters, so you could use Oases and just take the final Velvet contigs (and ignore the transcripts - which I would recommend even if you were working with RNA-Seq data; I've actually found the normal assembly tools, like CLC Bio de novo, to be more accurate).

That said, I think you are going to have gaps and issues with repetitive / homologous sequences no matter what. So, I just wanted to make sure that you weren't expecting to get full chromosomes out of the assembler.
