Question

de novo assembly and SNP discovery

7.3 years ago

rnaseq2017 • 0

Which NGS platform and sequencing depth are suitable for de novo assembly and SNP discovery in a non-model fungus with a genome size of 20 Mb? Is "sequencing on HiSeq 2500, 100bp PE; 100M reads (10 Gb)" suitable? Could I reduce the sequencing depth to 30M or 50M reads?

SNP • 2.0k views

updated 7.3 years ago by Brian Bushnell 20k • written 7.3 years ago by rnaseq2017 • 0

Are you sequencing two or more strains, or comparing to an existing genome? De novo assembly and SNP detection don't usually go hand in hand.

7.3 years ago by Asaf 10k

Actually, we kind of do that sometimes at JGI... the goal is generally to find out which strain of an organism, that is capable of metabolizing X, is the best at metabolizing X. Or something similar.

7.3 years ago by Brian Bushnell 20k

I didn't want to give an elaborate answer like you gave so I tried to narrow it down to one :)

7.3 years ago by Asaf 10k

score 3 · Answer 1 · 2017-01-21

Assembly and variant-calling are different. For assembly, you need higher depth and longer reads than for variant-calling, so you should separate your needs into two parts:

1) What kind of data do I need to assemble this organism?

2) What kind of data do I need to call variants on this organism?

I'll assume it's haploid, which simplifies things. I'd suggest at least 100x coverage for assembly. The HiSeq 2500 platform at 2x150bp is great; however, the MiSeq is better because it offers longer reads (2x250bp), although it costs more. Since you only assemble once, if you are restricted to Illumina platforms, you should go with MiSeq 2x250bp for assembly. MiSeq also allows 2x300bp sequencing, but Illumina has a history of producing corrupted 2x300bp kits, so I don't recommend that. The 2x250bp kits seem to be good, and MiSeq is Illumina's highest-quality platform. For assembly, using an unamplified library is critical.

For subsequent variant-calling on lots of samples, read length is less important. So, just sequence at 2x150 on a HiSeq 2500 (which is Illumina's second-highest-quality platform), or 2x100 (only if it is substantially cheaper per bp). If your organism is haploid, 20x coverage is more than enough for an unamplified library.

If you want an optimal assembly, you should sequence at ~100x on PacBio. That can often yield a near-perfect assembly for genomes in this size range, and it will always be dramatically better than an assembly from Illumina reads. PacBio gives the best assemblies, period, and again, you only assemble once. This would probably be 2 SMRT cells, so around $1000 for a near-perfect assembly. Definitely the best option!

Illumina is still better for variant-calling, though, at least for small variants like SNPs. PacBio is better for structural variants and phasing, but you don't need phasing with a haploid. So, I suggest PacBio at >=100x depth for assembly and Illumina HiSeq 2500 at 2x150bp (or 2x100bp if it is substantially cheaper) for variant-calling, at ~20x depth. Both should be non-PCR-amplified.

If cost is a big issue, you can also drop the coverage for variant-calling down to 10x, assuming the fungus is haploid and the libraries are unamplified. Note that lower coverage calls for longer reads: 20x with 100bp reads is strictly inferior to 20x with 150bp reads, and at 10x the disadvantage of 100bp reads is even greater.
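To put the numbers in this thread side by side, here is a minimal sketch of the usual depth arithmetic (mean depth = total sequenced bases / genome size). The function names are made up for illustration, "100M reads" is taken to mean individual reads (which is what matches the quoted 10 Gb), and the result is nominal raw depth only: it ignores trimming, duplicates, contamination and uneven coverage.

```python
import math

GENOME_SIZE_BP = 20_000_000  # 20 Mb fungal genome from the question


def coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Nominal mean depth: total sequenced bases divided by genome size."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Individual reads (not pairs) needed to reach a target nominal depth."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


# Depth implied by the options in the question (100bp reads).
for n_reads in (100_000_000, 50_000_000, 30_000_000):
    depth = coverage(n_reads, 100, GENOME_SIZE_BP)
    print(f"{n_reads:>11,} reads x 100bp -> {depth:.0f}x")

# Reads required for the depths suggested in the answer.
print("Assembly, 100x at 250bp reads:", reads_needed(100, 250, GENOME_SIZE_BP))
print("Variants, 20x at 150bp reads: ", reads_needed(20, 150, GENOME_SIZE_BP))
print("Variants, 10x at 100bp reads: ", reads_needed(10, 100, GENOME_SIZE_BP))
```

By this arithmetic, the proposed 100M x 100bp run is about 500x for a 20 Mb genome, and even 30M reads is about 150x, so a single lane would normally be shared across many multiplexed samples rather than spent on one.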