Question: Deciding the number and type of libraries to generate for a plant genome assembly
2.8 years ago by
MSM55 (140) wrote:

I want to sequence a plant genome with an estimated size of 600 Mb. My question is how to decide:

  • number and type of libraries (paired end and mate pair) on different platforms (PacBio, Illumina)
  • amount of data required
  • read length
  • insert size (mate pair)
  • coverage

An example of such a plan is given below.

[image not preserved]

The idea is to perform a hybrid assembly, as the plant genome is complex and contains repeats. I have read several papers on genome assembly of related plant species; however, I don't know how to decide how many different types of libraries should be prepared.

The objective of this post is to get a general understanding of how to make these decisions. I know this largely depends on cost as well; however, I am interested in the technical part (the calculation).

modified 2.8 years ago • written 2.8 years ago by MSM55 (140)

Whilst I have no experience with this, I believe that for a genome of that size you can get a decent assembly by combining long reads from Nanopore or PacBio with high-quality reads from Illumina.

written 2.8 years ago by WouterDeCoster (45k)
2.8 years ago by
Hannover Medical School
colindaven (2.6k) wrote:

600 Mb is not too large these days.

I would go for PacBio at 40-50X coverage, depending on your budget. The longest reads are really important. Secondly, you are going to need an Illumina paired-end library, at about 30X coverage, to correct the short errors in the PacBio assembly.
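As a rough sketch of the arithmetic behind these numbers: coverage is simply total sequenced bases divided by genome size, so the amount of data to order is genome size times target coverage. The snippet below applies that to the 600 Mb genome from the question. The function names and the 2 x 150 bp Illumina read length are assumptions chosen for illustration, not something stated in this thread.

```python
def bases_needed(genome_size_bp: int, coverage: int) -> int:
    """Total bases to sequence for a target coverage (coverage = bases / genome size)."""
    return genome_size_bp * coverage


def reads_needed(genome_size_bp: int, coverage: int, bases_per_read: int) -> int:
    """Number of reads (or read pairs) needed, rounded up."""
    total = bases_needed(genome_size_bp, coverage)
    return -(-total // bases_per_read)  # integer ceiling division


GENOME_SIZE = 600_000_000  # 600 Mb, from the question

# PacBio long reads at 40-50X
pacbio_low = bases_needed(GENOME_SIZE, 40)
pacbio_high = bases_needed(GENOME_SIZE, 50)

# Illumina paired-end at 30X for polishing
illumina = bases_needed(GENOME_SIZE, 30)

# Assumption: 2 x 150 bp paired-end reads, i.e. 300 bases per pair
illumina_pairs = reads_needed(GENOME_SIZE, 30, 2 * 150)

print(f"PacBio:   {pacbio_low / 1e9:.0f}-{pacbio_high / 1e9:.0f} Gb")
print(f"Illumina: {illumina / 1e9:.0f} Gb ({illumina_pairs:,} read pairs)")
```

In practice you would order somewhat more than this, since some reads are lost to quality filtering, and the genome size itself is only an estimate.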

I would really avoid a true hybrid assembly (e.g. 10X PacBio, 100X Illumina), since the tools and results are very poor in comparison.

I like Nanopore too, but it is not as proven for genome assembly and still gives more consensus errors than PacBio. It is still the best sequencing technology for structural variation in my book, though.

written 2.8 years ago by colindaven (2.6k)

You are right, long reads are important in assembly, both to handle repeat regions and to get a good-quality assembly. How is the calculation of the number of libraries and the amount of data to be produced done?

Can you please explain this calculation, regardless of budget?

written 2.8 years ago by MSM55 (140)

I don't think there is a rule set to calculate this. It's usually based on experience and gut feeling ;) (and some prior knowledge of that specific genome/species).

written 2.8 years ago by lieven.sterck (10k)

Also, this is more of a job for your friendly local genomics service provider. Ask them for a quote and they'll happily crunch some numbers for you.

written 2.8 years ago by colindaven (2.6k)

And where does that "experience" come from? There must be some things we need to assess before calculating the data required. Any ideas?

written 2.8 years ago by lakhujanivijay (5.4k)

From having done similar projects previously. People tend to transfer what worked for genome X to genome Y. Anyway, I think estimated genome size is the main factor here (tightly linked to available funding).

modified 2.8 years ago • written 2.8 years ago by lieven.sterck (10k)