Question: Looking For A Recommendation To Perform A De Novo Assembly With MiSeq Data Of Lengths 2x250bp
4
6.7 years ago by
Leszek4.0k
IIMCB, Poland
Leszek4.0k wrote:

I have overlapping MiSeq 2x250bp reads (after merging, single reads of 400-450bp). The genome size is ~20Mb. I think de Bruijn graph based assemblers are not the way to proceed with such a dataset, are they?
Do you have any experience assembling this kind of data? Maybe some 'good-old-times' (overlap-based) assembler can handle it better?

assembly miseq denovo • 5.8k views
ADD COMMENTlink modified 6.7 years ago by 141341254653464453.5k • written 6.7 years ago by Leszek4.0k
1

What is the sequencing library insert size? If it is 500-600bp or below, you may try to find overlaps within each pair of reads with Quake. You may also get better overlaps if you error-correct prior to Quake.
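To make "finding overlaps within a pair" concrete, here is a toy Python sketch of the idea: reverse-complement the second read and look for the longest suffix of read 1 that matches a prefix of it. This is only an illustration and is not how Quake or any real read-merging tool works internally; the function names, `min_overlap` and `max_mismatch_frac` are assumptions made up for the example, and base qualities are ignored.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = str.maketrans("ACGTN", "TGCAN")
    return seq.translate(complement)[::-1]


def merge_pair(read1, read2, min_overlap=20, max_mismatch_frac=0.1):
    """Merge a read pair at the longest acceptable suffix/prefix overlap.

    read2 is expected in its sequenced orientation (reverse strand), so it
    is reverse-complemented first. Returns the merged sequence, or None if
    no overlap of at least min_overlap bases passes the mismatch threshold.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    # Try the longest candidate overlap first, then shrink.
    for olen in range(longest, min_overlap - 1, -1):
        tail = read1[-olen:]   # end of read 1
        head = mate[:olen]     # start of the reverse-complemented mate
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olen:
            # Keep read 1 in full and append the non-overlapping part of the mate.
            return read1 + mate[olen:]
    return None


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTACCGATGGATCCTAGA"   # hypothetical 30bp insert
    r1 = fragment[:20]
    r2 = reverse_complement(fragment[10:])
    print(merge_pair(r1, r2, min_overlap=5, max_mismatch_frac=0.0))
```

With 2x250bp reads and a 400-450bp insert, the expected overlap is roughly 50-100bp, so a real tool has plenty of signal to work with; pairs from longer inserts simply will not overlap and stay unmerged.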

ADD REPLYlink written 6.7 years ago by Darked894.2k
1

I did, so what I'm playing with is single reads of 350-450bp (100x) and paired reads (2x250bp) that didn't merge correctly (50x).

ADD REPLYlink written 6.7 years ago by Leszek4.0k

What organism do you have that is 20Mb? Small end of the eukaryotes? If you have good coverage (this is key), then any of the usual assemblers will do. I like Velvet for a genome this size. Someone else might like another assembler. Depending on your sequencing depth and gene space, you'll probably have to do some post-assembly cleanup. I like PAGIT for that.

ADD REPLYlink modified 6.7 years ago • written 6.7 years ago by Josh Herr5.7k

It's an average fungal genome. The thing is, the genome is quite heterozygous, so de Bruijn graph assemblers (Velvet, SOAP, ABySS) are having a hard time and fragmenting it a lot... I'm leaning more toward older-style assemblers like Newbler or Celera. Has anyone tried them with MiSeq data?

ADD REPLYlink written 6.7 years ago by Leszek4.0k

I work with fungi too. Sounds like 20Mb is in yeast territory, so it's on the smaller side. I am working on assembly of a few in the 40 to 60 Mb range and they also have high heterozygosity. We still use de Bruijn style assemblers mainly, but I also use Newbler on occasion. I think coverage is key. My suggestion is to try Newbler and see how the assembly compares to a de Bruijn assembler like Velvet. Good luck and let me know if you want to commiserate with me about it!

ADD REPLYlink written 6.7 years ago by Josh Herr5.7k
1

Thanks Josh. I quickly tried SOAPdenovo some time ago and it performed below my expectations... this is why I want to try something old-style. Maybe I will give ALLPATHS a try, as the Broad designed it with overlapping reads in mind... Anyway, I will keep you posted.

ADD REPLYlink written 6.7 years ago by Leszek4.0k
1

@Leszek, to my knowledge you can't use ALLPATHS this way (with just one library), unless there is some hack I don't know about. With a genome this size you should be able to benchmark numerous methods in a reasonable amount of time. I agree with Josh's approach: I'd run Newbler and VelvetOptimiser and see how they compare, given your read lengths.

ADD REPLYlink modified 6.7 years ago • written 6.7 years ago by SES8.2k
5
6.7 years ago by
Bach550
Bach550 wrote:

For trying overlap based assemblers: absolutely do reduce the data set. Maybe to something like ~80x, but not really much more. Say, 40x from your merged reads, 40x from still paired reads. Then try out any of the usual suspects.
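As a rough sketch of the arithmetic behind that downsampling (plain Python, not part of MIRA or any assembler; all function names and parameters here are made up for illustration): coverage is total sequenced bases divided by genome size, so the number of reads for a target coverage follows directly, and a random subset can then be drawn with reservoir sampling.

```python
import random


def reads_needed(genome_size, target_coverage, bases_per_read):
    """Number of reads (or pairs) giving roughly target_coverage.

    For paired reads, pass the combined length of both mates as
    bases_per_read and interpret the result as a number of pairs.
    """
    return int(genome_size * target_coverage / bases_per_read)


def subsample_fraction(current_coverage, target_coverage):
    """Fraction of the data to keep to go from current to target coverage."""
    return min(1.0, target_coverage / current_coverage)


def reservoir_sample(records, k, seed=0):
    """Draw k records uniformly at random from an iterable in one pass.

    For paired data, each record should hold both mates so that pairs
    are kept or dropped together.
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            # Replace an existing element with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample


if __name__ == "__main__":
    genome = 20_000_000  # ~20Mb, as stated in the question
    print(reads_needed(genome, 40, 400))      # merged reads, assuming ~400bp mean
    print(reads_needed(genome, 40, 2 * 250))  # unmerged pairs, 2x250bp
    print(subsample_fraction(100, 40))        # merged reads are at ~100x
```

Under these assumptions, 40x of ~400bp merged reads is about 2,000,000 reads and 40x of 2x250bp pairs is about 1,600,000 pairs. The mean merged length of 400bp is a guess within the 350-450bp range mentioned above.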

My first try would be with MIRA, but that is just because I wrote it. In case you use MIRA: make sure the merged reads have all adaptors clipped away. The unmerged reads should not be preprocessed at all, MIRA will clip them just right (adaptors, quality, simple sequencing errors, etc.)

ADD COMMENTlink written 6.7 years ago by Bach550

Hi Bastien. I'm trying MIRA. It has been running since yesterday - we'll see. Right now I'm running ~60x (the reads that merged successfully). Good point, I was also considering running the reads that didn't merge correctly as a second library (these likely have too large an insert size or low qualities toward the ends, so they didn't merge correctly).

ADD REPLYlink written 6.7 years ago by Leszek4.0k

MIRA has been running since Thursday (third pass) on 10 cores and using 50GB of RAM. Is that fine? Reported coverage is 51x. Another point: how well does MIRA handle heterozygous regions (3-4% divergence)? It reports 24Mb, while I'm expecting ~13Mb...

ADD REPLYlink written 6.7 years ago by Leszek4.0k
2
6.7 years ago by
United Kingdom
141341254653464453.5k wrote:

SGA is an overlap-based assembler that works well with Illumina datasets: http://github.com/jts/sga

ADD COMMENTlink written 6.7 years ago by 141341254653464453.5k