Question

Deciding the number and type of libraries to generate for a plant genome assembly

2

7.2 years ago

MSM55 ▴ 160

I want to sequence a plant genome with an estimated size of 600 Mb. My question is how to decide on the following:

number and type of libraries (paired-end and mate-pair) on different platforms (PacBio, Illumina)
amount of data required
read length
insert size (mate-pair)
coverage

An example of such a plan is given below:

[image not available]

The idea is to perform a hybrid assembly, as the plant genome is complex and contains repeats. I went through several papers on genome assembly of related plant species; however, I don't know how to decide how many different types of libraries should be prepared.

The objective of this post is to get a general understanding of how to make these decisions. I know that this largely depends on cost as well; however, I am interested in the technical part (the calculation).

assembly genome library illumina pacbio • 1.9k views

ADD COMMENT • link 7.1 years ago by MSM55 ▴ 160

0

Whilst I have no experience with this, I believe that for a genome of that size you can get a decent assembly by combining long reads from Nanopore or PacBio with high-quality reads from Illumina.

ADD REPLY • link 7.2 years ago by WouterDeCoster 48k

score 0 · Answer 1 · 2018-05-21

0

7.2 years ago

colindaven 7.7k

600 Mb is not too huge these days.

I would go for PacBio at 40-50X coverage, depending on your budget. The longest reads are really important. Secondly, you are going to need an Illumina paired-end library, at about 30X coverage, to correct the short errors in the PacBio assembly.
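As a rough sketch of the arithmetic behind those coverage figures: mean coverage is total sequenced bases divided by genome size, so the data volume to order is genome size times target coverage. The function names below are hypothetical, and the 2 x 150 bp Illumina read length is an assumption for illustration only; it is not stated anywhere in this thread.

```python
import math

def required_bases(genome_size_bp: int, coverage: int) -> int:
    """Total bases to sequence for a target mean coverage.

    Uses the basic relation: coverage = total bases / genome size.
    """
    return genome_size_bp * coverage

def required_read_pairs(genome_size_bp: int, coverage: int, read_length_bp: int) -> int:
    """Number of paired-end read pairs needed; each pair yields 2 * read_length_bp bases."""
    return math.ceil(required_bases(genome_size_bp, coverage) / (2 * read_length_bp))

GENOME_SIZE_BP = 600_000_000  # 600 Mb estimate from the question

# Coverage targets suggested in the answer above.
for label, cov in [("PacBio 40X", 40), ("PacBio 50X", 50), ("Illumina 30X", 30)]:
    gigabases = required_bases(GENOME_SIZE_BP, cov) / 1e9
    print(f"{label}: {gigabases:.0f} Gb")

# ASSUMPTION: 2 x 150 bp paired-end reads (not specified in the thread).
pairs = required_read_pairs(GENOME_SIZE_BP, 30, 150)
print(f"Illumina 30X at 2 x 150 bp: {pairs:,} read pairs")
```

This only gives raw yield. In practice some margin is usually added for low-quality reads, contamination, and uncertainty in the genome size estimate, and how much is a judgment call.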

I would really avoid a true hybrid assembly (e.g. 10X PacBio, 100X Illumina), since the tools for this, and the results from them, are very poor in comparison.

I like Nanopore too, but it is not as proven for genome assembly and still gives more consensus errors than PacBio. It is still the best sequencing technology for structural variation in my book, though.

ADD COMMENT • link 7.2 years ago by colindaven 7.7k

0

You are right that long reads are important in assembly, both to handle repeat regions and to get a good-quality assembly. How are the number of libraries and the amount of data to be produced calculated?

Can you please explain this calculation, leaving budget aside?

ADD REPLY • link 7.1 years ago by MSM55 ▴ 160

0

I don't think there is a rule set to calculate this. It's usually based on experience and gut feeling ;) (and some prior knowledge of that specific genome/species).

ADD REPLY • link 7.1 years ago by lieven.sterck 15k

0

Also, this is more of a job for your friendly local genomics service provider. Ask them for a quote and they'll happily crunch some numbers for you.

ADD REPLY • link 7.1 years ago by colindaven 7.7k

0

And where does that "experience" come from? There must be some things we need to assess before calculating the data required. Any ideas?

ADD REPLY • link 7.1 years ago by lakhujanivijay 5.9k

0

From having done similar kinds of projects previously. People tend to transfer what worked for genome X to genome Y. Anyway, I think estimated genome size is the main factor here (tightly linked to available funding).

ADD REPLY • link 7.1 years ago by lieven.sterck 15k