Question: Unexpectedly Small Genome Assembly
Fabian Bull1.3k wrote:

I have an 86 Mbp plant genome. It was assembled with very high coverage and some mate-pair information for scaffolding.

After the assembly finished, the total contig length was only 50 Mbp.

Does anyone have an idea why it is so much smaller? In general, the contigs should add up to a size larger than the genome. Could it be caused by diploidy?

Or am I completely wrong?

EDIT:

Reads: Illumina PE, 100 bp; Illumina MP with 2 kbp, 10 kbp, and 20 kbp inserts
Assembler: CLC
Scaffolder: SSPACE
Coverage: 50x

genome assembly scaffolding plant • 2.4k views
ADD COMMENTlink modified 7.0 years ago by Haibao Tang3.0k • written 7.0 years ago by Fabian Bull1.3k

Can you give us more information? Which sequencing platform did you use, how many reads do you have, what is the coverage, and which assembler did you use?

ADD REPLYlink written 7.0 years ago by Damian Kao15k
Nick Loman610 wrote:

Repeats. Repetitive sequence often gets collapsed down into contigs with very high depth of coverage. Plot a graph of contig size against contig depth of coverage to test this theory.
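As a rough numeric companion to that plot, the sketch below scales each contig's length by how far its depth exceeds the typical single-copy depth. It assumes you can export (length, mean depth) per contig from your assembler; the function names and the example numbers are hypothetical, and rounding depth ratios to whole copy numbers is only a crude approximation.

```python
def weighted_median_depth(contigs):
    """Length-weighted median depth, taken here as the single-copy depth.

    contigs: list of (length_bp, mean_depth) tuples.
    """
    total = sum(length for length, _ in contigs)
    running = 0
    for length, depth in sorted(contigs, key=lambda c: c[1]):
        running += length
        if running * 2 >= total:
            return depth
    raise ValueError("no contigs given")


def estimate_expanded_size(contigs):
    """Return (assembled_bp, estimated_bp).

    Each contig is counted once per estimated copy, where the copy number
    is its depth divided by the single-copy depth, rounded, and at least 1.
    """
    single_copy = weighted_median_depth(contigs)
    assembled = sum(length for length, _ in contigs)
    estimated = sum(
        length * max(1, round(depth / single_copy))
        for length, depth in contigs
    )
    return assembled, estimated


# Made-up example: two single-copy contigs and two collapsed repeats.
example = [(40_000, 50.0), (30_000, 52.0), (5_000, 400.0), (2_000, 250.0)]
assembled, estimated = estimate_expanded_size(example)
print(assembled, estimated)
```

If the estimated size lands much closer to the expected genome size than the assembled size does, collapsed repeats are a plausible explanation for the gap.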

ADD COMMENTlink written 7.0 years ago by Nick Loman610

Hmmm, I'll definitely do that, but an 86 Mbp plant genome can't have many repeats.

ADD REPLYlink written 7.0 years ago by Fabian Bull1.3k

Why do you say that?

ADD REPLYlink written 7.0 years ago by Nick Loman610

To follow up on peri4n's response, plant genome size correlates strongly with transposon (LTR retrotransposon, to be specific) copy number. There are also differences in rates of DNA removal between species, but you can bet a small plant genome will have fewer LTR elements than a closely related species with a large genome. There will certainly be some repeats, but 86 Mb is extremely small for a plant genome. Given that LTR elements in plants are ca. 5-10 kb each, there can't be many of them. http://www.hindawi.com/journals/jb/2010/382732/ http://www.ncbi.nlm.nih.gov/pubmed/20064738

ADD REPLYlink written 7.0 years ago by SES8.1k
Haibao Tang3.0k wrote:

The CLC de novo assembler is fast but not very accurate, and its scaffolding ability is limited. The default contig size cutoff for CLC is 200 bp. In my experience, a result like yours is within what I would expect.

For your data, I encourage you to try a different assembler: SOAPdenovo, or ALLPATHS if you happen to have a PE library with ~200 bp inserts. Both should run fast on the amount of data you have.

If it were a problem of diploidy (heterozygosity), the final assembly should be larger than you anticipated, since the two alleles may be assembled into separate contigs. A smaller assembly means sequence is getting collapsed, as Nick suggested, or that there are regions CLC has difficulty walking through.

Does your species have an unusual GC content?
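One way to check this on the contigs you already have is to compute the GC fraction per contig. This is a minimal stdlib-only sketch; `contigs.fasta` is a hypothetical file name, and keep in mind that the assembled contigs may understate the true GC content if GC-rich regions are exactly the ones that failed to assemble.

```python
def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Keep only the first word of the header as the name.
                name, chunks = (line[1:].split() or [""])[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (N and other codes ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python gc.py contigs.fasta
    if len(sys.argv) > 1:
        for name, seq in read_fasta(sys.argv[1]):
            print(f"{name}\t{len(seq)}\t{gc_fraction(seq):.3f}")
```

Comparing the per-contig GC values with depth of coverage can also show whether high-GC contigs are systematically under-covered.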

ADD COMMENTlink written 7.0 years ago by Haibao Tang3.0k

@peri4n: If I were you, I would definitely try another assembler such as SGA or SOAPdenovo. Even the best assembler may behave unexpectedly on data with unusual features. CLC's assembler, from what I have heard, is fast but not the best in terms of accuracy and contiguity. BTW, 60% GC is very high. Illumina will suffer.

ADD REPLYlink written 7.0 years ago by lh331k

@lh3: You are right about ALLPATHS-LG. It sounds like the OP has a range of insert sizes of PEs and mate pairs, but it is not clear if he has the "overlapping paired-end library" required by ALLPATHS.

ADD REPLYlink written 7.0 years ago by Haibao Tang3.0k

Sorry, I forgot to mention that I used SSPACE for scaffolding. If I remember correctly, the GC content is 60%.

ADD REPLYlink written 7.0 years ago by Fabian Bull1.3k

+1. 60% GC is very high. Illumina will suffer.

ADD REPLYlink written 7.0 years ago by lh331k

@Haibao: I think ALLPATHS-LG always requires libraries with at least 2 distinct insert sizes?

ADD REPLYlink written 7.0 years ago by lh331k