Unexpected Small Genome Assembly
2
1
10.3 years ago
Fabian Bull ★ 1.3k

I have an 86 Mbp plant genome. It was assembled from very high coverage data, with some mate-pair information for scaffolding.

After the process finished, I ended up with a total contig size of only 50 Mbp.

Does anyone have an idea why it is so much smaller? In general, contigs should add up to a size larger than the genome. Could it be caused by diploidy?

Or am I completely wrong?

EDIT:

Reads: PE Illumina 100 bp; MP Illumina 2 kbp, 10 kbp, 20 kbp
Assembler: CLC
Scaffolder: SSPACE
Coverage: 50x

genome assembly scaffolding plant • 3.1k views
0

Can you give us more information? Which sequencing platform, how many reads, what coverage, and which assembler did you use?

4
10.3 years ago
Nick Loman ▴ 610

Repeats. Repetitive sequence often gets collapsed down into contigs with very high depth of coverage. Plot a graph of contig size against contig depth of coverage to test this theory.
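A minimal sketch of that check, assuming you can export a (length, mean depth) pair per contig from your assembler. The function names and the copy-number heuristic (depth divided by the length-weighted median depth) are illustrative assumptions, not part of CLC or SSPACE. If repeats are being collapsed, the depth-corrected total should come out well above the raw contig total.

```python
def weighted_median_depth(contigs):
    """Length-weighted median depth, used as a proxy for single-copy depth.

    contigs: list of (length_bp, mean_depth) tuples (hypothetical input format).
    """
    ordered = sorted(contigs, key=lambda c: c[1])
    total = sum(length for length, _ in ordered)
    running = 0
    for length, depth in ordered:
        running += length
        if running * 2 >= total:
            return depth
    raise ValueError("no contigs given")


def corrected_assembly_size(contigs):
    """Return (assembled_bp, depth_corrected_bp).

    Each contig is counted max(1, depth / single_copy_depth) times, so a
    contig at 10x the typical depth contributes 10x its length.
    """
    single_copy = weighted_median_depth(contigs)
    assembled = sum(length for length, _ in contigs)
    corrected = sum(
        length * max(1.0, depth / single_copy) for length, depth in contigs
    )
    return assembled, corrected


if __name__ == "__main__":
    # Toy numbers only, not real data.
    toy = [(1000, 50.0), (2000, 50.0), (500, 500.0)]
    print(corrected_assembly_size(toy))
```

The same (length, depth) pairs can be fed to any scatter plot; contigs sitting far above the main depth band are the candidates for collapsed repeats.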

0

Hmm, I'll definitely do that, but an 86 Mbp plant genome can't contain many repeats.

0

Why do you say that?

0

To follow up on peri4n's response: plant genome size correlates strongly with transposon (specifically, LTR retrotransposon) copy number. Rates of DNA removal also differ between species, but you can bet that a small plant genome will have fewer LTR elements than a closely related species with a large genome. There will certainly be some repeats, but 86 Mb is extremely small for a plant genome. Given that LTR elements in plants are ca. 5-10 kb each, there can't be many of them. http://www.hindawi.com/journals/jb/2010/382732/ http://www.ncbi.nlm.nih.gov/pubmed/20064738

2
10.3 years ago

The CLC de novo assembler is fast but not very accurate, and its scaffolding ability is limited. The default contig size cutoff for CLC is 200 bp. Based on my experience, your result is within what I would expect.

For your data setup, I encourage you to try a different assembler: SOAPdenovo, or ALLPATHS if you happen to have a ~200 bp insert PE library. Both should run fast on the amount of data you have.

If it's a problem of diploidy (heterozygosity), then the final assembly should be larger than you anticipated, since the two alleles may end up in two different contigs. A smaller assembly means sequence is getting collapsed, as Nick suggested, or there are regions that are difficult for CLC to walk through.

Does your species have an unusual GC content?
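If you want to check this on the contigs themselves, here is a small sketch for per-sequence GC fraction. It assumes a plain FASTA file; `parse_fasta` and `gc_fraction` are names made up for this example, and ambiguous bases such as N are simply left out of the denominator.

```python
def parse_fasta(lines):
    """Parse FASTA text lines into a dict of {record_name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def gc_fraction(seq):
    """Fraction of G and C among unambiguous A/C/G/T bases (0.0 if none)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


if __name__ == "__main__":
    # Toy input only; replace with open("contigs.fa") for a real file
    # ("contigs.fa" is a placeholder file name).
    toy = ">c1\nGGCC\nAT\n>c2\nAAAA\n".splitlines()
    for name, seq in parse_fasta(toy).items():
        print(name, round(gc_fraction(seq), 3))
```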

1

@peri4n: If I were you, I would definitely try another assembler such as SGA or SOAPdenovo. Even the best assembler may behave unexpectedly on data with unusual features. CLC's assembler, from what I have heard, is fast but not the best in terms of accuracy and contiguity. BTW, 60% GC is very high. Illumina will suffer.

1

@lh3: You are right about ALLPATHS-LG. It sounds like the OP has PE and mate-pair libraries with a range of insert sizes, but it is not clear if he has the "overlapping paired-end library" required by ALLPATHS.

0

Sorry, I forgot to mention that I used SSPACE for scaffolding. If I remember correctly, the GC content is 60%.

0

+1. 60% GC is very high. Illumina will suffer.

0

@Haibao: I think ALLPATHS-LG always requires libraries with at least 2 distinct insert sizes?