I want to sequence a plant genome with an estimated size of 600 Mb. My question is how to decide on the following:

  • number and type of libraries (paired end and mate pair) on different platforms (PacBio, Illumina)
  • amount of data required
  • read length
  • insert size (mate pair)
  • coverage

An example of the same is given below

The idea is to perform a hybrid assembly, as the plant genome is complex and contains repeats. I went through several papers on genome assembly of related plant species; however, I don't know how to decide how many different types of libraries should be prepared.

The objective of this post is to get a general understanding of how to make these decisions. I know that this largely depends on cost as well; however, I am interested in the technical part (the calculation).

Whilst having no experience with this, I believe that for a genome of that size you can get a decent assembly by combining long reads from Nanopore or PacBio with high-quality reads from Illumina.

600 Mb is not too huge these days.

I would go for PacBio at 40-50X coverage, depending on your budget. The longest reads are really important. Secondly, you are going to need an Illumina paired-end library, at about 30X coverage, to correct the short errors in the PacBio assembly.
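For what it's worth, the arithmetic behind these numbers is just coverage = total sequenced bases / genome size, so the data you need is genome size times target coverage. Below is a rough sketch of that calculation for the 600 Mb genome. The 2x150 bp Illumina read length is my assumption (it is not fixed anywhere above), and the function names are only illustrative.

```python
def bases_required(genome_size_bp: int, coverage: int) -> int:
    """Total bases to sequence for a target average coverage."""
    return genome_size_bp * coverage


def reads_required(genome_size_bp: int, coverage: int, bases_per_read: int) -> int:
    """Number of reads (or read pairs) needed, rounded up."""
    total = bases_required(genome_size_bp, coverage)
    return -(-total // bases_per_read)  # integer ceiling division


GENOME_SIZE_BP = 600_000_000  # estimated genome size from the question

# Long reads: 40-50X PacBio, as suggested above
pacbio_min_bp = bases_required(GENOME_SIZE_BP, 40)
pacbio_max_bp = bases_required(GENOME_SIZE_BP, 50)

# Short reads: 30X Illumina paired-end for polishing
illumina_bp = bases_required(GENOME_SIZE_BP, 30)

# ASSUMPTION: 2x150 bp paired-end reads, i.e. 300 bases per read pair
BASES_PER_PAIR = 2 * 150
illumina_pairs = reads_required(GENOME_SIZE_BP, 30, BASES_PER_PAIR)

print(f"PacBio:   {pacbio_min_bp / 1e9:.0f}-{pacbio_max_bp / 1e9:.0f} Gb")
print(f"Illumina: {illumina_bp / 1e9:.0f} Gb, {illumina_pairs / 1e6:.0f} M read pairs")
```

This gives an average depth only. Usable yield after quality filtering will be lower, and the genome size itself is an estimate, so it is probably wise to order some margin on top.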

I would really avoid a true hybrid assembly (e.g. 10X PacBio, 100X Illumina), since the tools for it and the results from it are very poor in comparison.

I like Nanopore too, but it is not as proven for genome assembly and still gives more consensus errors than PacBio. It is still the best sequencing technology for structural variation in my book, though.

You are right, long reads are important in assembly, both to handle repeat regions and to get a good-quality assembly. How is the number of libraries and the amount of data to be produced calculated?

Can you please explain this calculation, regardless of budget?

I don't think there is a rule set to calculate this. It's usually based on experience and gut feeling ;) (and some prior knowledge of that specific genome/species).

Also, this is more of a job for your friendly local genomics service provider. Ask them for a quote and they'll happily crunch some numbers for you.

And where does that "experience" come from? There must be some things that we need to judge before calculating the data required. Any ideas?

From having done similar projects previously. People tend to transfer what worked for genome X to genome Y. Anyway, I think estimated genome size is the main factor here (tightly linked to available funding).

