Question

How Does Merging The Multiple Replicate Data Of A Single Genome Affect The Genome Assembly

10.6 years ago

Rohit ★ 1.5k

Hello again.

I guess I decided to ask all of this week's questions today. I have a model organism which has not yet been sequenced, so I will go for de novo assembly. I will be running the same sample on multiple lanes as replicates.

How will this affect my Genome assembly?

Are there chances of getting erroneous contigs?

Will most of my reads be discarded because the k-mer filter treats k-mers that recur across replicates as repetitive or false-positive k-mers?

Is there any specific assembler which might avoid the above problems?

Or will it just affect the computational resources?

genome assembly ngs contigs • 4.0k views

ADD COMMENT • link updated 9.9 years ago by Biostar 20 • written 10.6 years ago by Rohit ★ 1.5k

score 1 · Answer 1 · 2013-09-26

10.6 years ago

johnstantongeddes ▴ 410

A lot of questions in one email. Maybe it would help if you could clarify why you "will be running the data on multiple lanes as replicates for the same sample". Do you expect this species to have a large genome so that you will need multiple lanes to achieve adequate coverage? Naively, more lanes simply give you more reads.

For genome assembly, you should probably look into ALLPATHS-LG. The combined use of Illumina and PacBio sequencing libraries is especially promising. See here

ADD COMMENT • link 10.6 years ago by johnstantongeddes ▴ 410

We are using Illumina sequencing libraries at present, and hybrid libraries are not something we are planning for now, as we need fewer errors in the reads. As you guessed, the genome is large (larger than human) and highly repetitive.

ADD REPLY • link 10.6 years ago by Rohit ★ 1.5k

score 1 · Answer 2 · 2013-09-26

10.6 years ago

cts ★ 1.7k

I suspect that you're going to get very similar assemblies whether the replicates are assembled separately or together, since reproducibility between technical replicates of a single sample should be very high (one would hope so anyway!). So if all of the replicates come from the same library construction, I don't see them changing the overall assembly much other than providing more coverage of your genome, which would be a good thing. However, there is such a thing as too much data: if you sequence to an extreme read depth, sequencing errors will build up, which may confuse the assembler. You don't say which model organism you're using, but I'm guessing you're sequencing a eukaryote based on the amount of data you're talking about, so I doubt it would be an issue.

Since the replicates come from the same sample, I doubt that k-mers would be filtered as repetitive; instead, merging is most likely to just increase the coverage of each k-mer.

Choice of assembler depends on how big the genome is. For small microbial genomes I would suggest SPAdes, but for eukaryotes I'm not sure, since it is not my field. I would suggest looking at the Assemblathon 2 paper, which compares genome assemblers on eukaryotic genomes.
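The point about k-mer coverage can be illustrated with a toy example. This is only a sketch under simplifying assumptions: the `genome` string, read length, and `k` are made up, the reads are error-free, and reverse complements are ignored. It shows that adding a second technical replicate raises the count of each k-mer without introducing new k-mers.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Hypothetical toy genome; both "replicates" sample reads from the same sequence.
genome = "ACGTTGCAAGGCTTAC"
read_len = 8
replicate_1 = [genome[i:i + read_len]
               for i in range(0, len(genome) - read_len + 1, 2)]
replicate_2 = [genome[i:i + read_len]
               for i in range(1, len(genome) - read_len + 1, 2)]

k = 5
single = count_kmers(replicate_1, k)
merged = count_kmers(replicate_1 + replicate_2, k)

# Same set of distinct k-mers, but higher counts after merging.
print("distinct k-mers:", len(single), "->", len(merged))
print("total k-mer observations:", sum(single.values()), "->", sum(merged.values()))
```

With real data, sequencing errors add low-count erroneous k-mers on top of this, which is why very deep sequencing can start to hurt.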

ADD COMMENT • link 10.6 years ago by cts ★ 1.7k

It is a eukaryotic marine organism with a huge genome. Since you say that k-mers would not be filtered as repetitive, I think I can proceed. I have one more sample data set, from primates, which I am using to test my assembly method. It has multiple replicates too, so I will just be testing it with the pipeline I have at hand.

ADD REPLY • link 10.6 years ago by Rohit ★ 1.5k

You need to be pretty explicit in the information you give. Define huge. Do you know if it's polymorphic? Has some survey sequencing already been done? What were the results?

How repetitive is it likely to be (10% little -> 90% quite repetitive)? If it is repetitive, what are the repeats? Are they tandem repeats or large structured transposable elements?

ADD REPLY • link 10.6 years ago by gammyknee ▴ 210

The expected genome size is about 10 Gb, and it is expected to be about 90% repetitive. We have no information about the transposable elements or tandem repeats.
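To put that genome size in terms of sequencing effort, expected depth is simply total sequenced bases divided by genome size. The sketch below is back-of-the-envelope arithmetic only: the 10 Gb genome size comes from this thread, while the per-lane read count, read length, and target depth are hypothetical placeholders to be replaced with your own run figures.

```python
import math

GENOME_SIZE = 10_000_000_000   # ~10 Gb, as estimated in this thread
READS_PER_LANE = 300_000_000   # assumption: hypothetical reads per lane
READ_LENGTH = 100              # assumption: hypothetical read length in bases


def expected_depth(n_lanes, reads_per_lane, read_length, genome_size):
    """Average depth = total sequenced bases / genome size."""
    return n_lanes * reads_per_lane * read_length / genome_size


def lanes_needed(target_depth, reads_per_lane, read_length, genome_size):
    """Smallest whole number of lanes that reaches the target average depth."""
    bases_per_lane = reads_per_lane * read_length
    return math.ceil(target_depth * genome_size / bases_per_lane)


one_lane = expected_depth(1, READS_PER_LANE, READ_LENGTH, GENOME_SIZE)
lanes_50x = lanes_needed(50, READS_PER_LANE, READ_LENGTH, GENOME_SIZE)
print(f"depth from one lane: {one_lane:.1f}x")
print(f"lanes for 50x (hypothetical target): {lanes_50x}")
```

Under these placeholder numbers a single lane gives only a few-fold depth, which is why pooling lanes from the same library is usually a necessity for a genome this size rather than a risk.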

ADD REPLY • link 10.6 years ago by Rohit ★ 1.5k

Is it possible to get a few BAC sequences done? This would give you a lot of information which will dramatically improve your assembly.

That said, with a genome that is 90% repetitive you don't have much chance of a good assembly. What are you interested in capturing? Maybe transcriptome sequencing would be more effective if you're just after the genes....

ADD REPLY • link 10.6 years ago by gammyknee ▴ 210

Yes... Transcriptome sequencing was done... Genome was sequenced as these organisms are really ancient...

ADD REPLY • link 10.5 years ago by Rohit ★ 1.5k

score 0 · Answer 3 · 2013-10-28

A few observations to follow up on this work:

Replicates do not change the assembled contigs much (you are just wasting memory if you merge all replicates into one).
Transcriptome sequencing is the better option when the genome is 70-90% repetitive and gene information is mainly what you are after.
JR-Assembler is good for big, meaningful assemblies, but with such high repetitive content it needs more filtering. The best strategy would be to:

i) First assemble the mitochondrial genome and remove all the reads that belong to it

ii) Set a k-mer abundance cutoff and remove all the reads containing k-mers above it, i.e. the highly repetitive k-mers
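Step ii) can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the filtering actually used here: the function names, the toy reads, and the cutoff `max_count` are all hypothetical, and reverse complements are ignored. For a 10 Gb genome a dedicated k-mer counter would be needed instead of a Python `Counter`.

```python
from collections import Counter


def kmers(read, k):
    """Yield every k-mer (substring of length k) in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def filter_repetitive_reads(reads, k, max_count):
    """Drop reads that contain any k-mer seen more than max_count times.

    Counts are taken over the whole read set, so a k-mer from a
    high-copy repeat ends up far above the cutoff.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return [read for read in reads
            if all(counts[km] <= max_count for km in kmers(read, k))]


# Hypothetical toy reads: two come from a simple repeat, two are unique.
reads = ["ACGTACGA", "TTTTTTTT", "TTTTTTTT", "GGCATCAG"]
kept = filter_repetitive_reads(reads, k=4, max_count=3)
print(kept)
```

The cutoff has to be chosen relative to the sequencing depth, because merging replicates raises the count of every k-mer, including the single-copy ones you want to keep.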