Question

Genome assembly: larger genome size than expected

6.8 years ago

yifan wang ▴ 20

Hi folks. I am trying to assemble a marine fish genome. I have 120 G of reads from two short paired-end libraries and one large mate-pair library. The expected genome size is about 600 Mb, but I got 1200 Mb after gap closing with SOAPdenovo. I then read a paper ("The oyster genome reveals stress adaptation and complexity of shell formation") and, searching online, found the phrase "Remove Redundancy From Assembly". Does anyone have an idea how to reduce this kind of error, or any other advice?

Assembly genome • 3.6k views

ADD COMMENT • link updated 6.8 years ago by Chris Fields ★ 2.2k • written 6.8 years ago by yifan wang ▴ 20

score 2 · Answer 1 · 2017-06-26

6.8 years ago

Chris Fields ★ 2.2k

I have seen this in particularly heterozygous genome assemblies, where you may have alternative haplotypes assembled (though this also tends to shatter the genome assembly by breaking up the graph). It's also possible the genome size is underestimated.

Have you run a k-mer abundance analysis on your shotgun data? This normally gives you an idea of (1) overall k-mer coverage and (2) other issues you may have with the assembly, such as heterozygosity, repetitive sequences, high sequence error rate, etc. Lots of tools can do this, such as khmer and KAT. If heterozygosity is an issue you can always try Platanus, though it can be a little tricky to use.
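To make the idea behind a k-mer abundance analysis concrete, here is a minimal Python sketch. It is not how khmer or KAT work internally; the function names (`count_kmers`, `abundance_histogram`, `estimate_genome_size`) and the `min_abundance` cutoff are assumptions made up for this illustration, and it only counts forward-strand k-mers held in memory. For real shotgun data a dedicated k-mer counter is the practical choice.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in the reads (forward strand only, for simplicity)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def abundance_histogram(kmer_counts):
    """Map abundance -> number of distinct k-mers seen that many times."""
    return Counter(kmer_counts.values())


def estimate_genome_size(histogram, min_abundance=2):
    """Rough estimate: total k-mer occurrences divided by the peak abundance.

    k-mers seen fewer than min_abundance times are treated as sequencing
    errors and ignored (an assumption of this sketch).
    Returns (estimated_size_in_kmers, peak_abundance).
    """
    solid = {a: n for a, n in histogram.items() if a >= min_abundance}
    if not solid:
        raise ValueError("no k-mers at or above min_abundance")
    # The peak is the abundance shared by the largest number of distinct k-mers.
    peak = max(solid, key=lambda a: (solid[a], a))
    total = sum(a * n for a, n in solid.items())
    return total / peak, peak


# Toy usage: ten identical "reads" covering a 20 bp sequence.
toy_reads = ["ACGTTGCATGCCGATAGCTA"] * 10
toy_hist = abundance_histogram(count_kmers(toy_reads, k=5))
print(estimate_genome_size(toy_hist))
```

In a strongly heterozygous genome the histogram usually shows two peaks, with the heterozygous peak at roughly half the depth of the homozygous one. This sketch only picks the single tallest peak and does not model that case.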

ADD COMMENT • link 6.8 years ago by Chris Fields ★ 2.2k

Thanks for your suggestions. I tried Platanus for the assembly and got the following: contig.fa is 1013 Mb, and there is a 370 Mb contigBubble.fa.

ADD REPLY • link 6.8 years ago by yifan wang ▴ 20

So if Platanus and SOAPdenovo roughly agree on genome size then I think your original genome size estimation is way off.

I'd run the k-mer abundance analysis Chris suggested. Here's an online tool that I found easiest to use: http://qb.cshl.edu/genomescope/

ADD REPLY • link 6.8 years ago by Philipp Bayer 8.3k

What do the basic overall stats look like when comparing the two assemblies, NG50 for example? I suggest using an arbitrarily high estimated genome size when calculating these (maybe 1-1.2 Gb), just for comparison purposes, because the N50 values will not be directly comparable. Also, I recommend looking at MEGAHIT over SOAPdenovo2 (note that the GitHub docs for SOAPdenovo2 also state this). Don't include the bubble file with the Platanus data; those are generally the redundant sequences (possible allelic variations).

Also, like most assemblers, Platanus and SOAPdenovo2/MEGAHIT have options at the contig and scaffold steps that can be used to reduce redundancy and to play with linkage parameters.
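The N50 versus NG50 comparison suggested above can be sketched in a few lines of Python. The function names (`nx0`, `n50`, `ng50`) are made up for this illustration and the lengths are toy values, not taken from either assembly. N50 is computed against the assembly's own total length, while NG50 uses a fixed genome size, which is what makes two assemblies of different total size comparable.

```python
def nx0(lengths, total, fraction=0.5):
    """Length of the sequence at which the running sum of lengths
    (sorted longest first) reaches `fraction` of `total`.
    Returns None if the sequences never reach that target."""
    target = total * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None


def n50(lengths):
    """N50: the target is half of the assembly's own total length."""
    return nx0(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """NG50: the target is half of a fixed (estimated) genome size."""
    return nx0(lengths, genome_size)


# Toy usage with made-up contig lengths.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths), ng50(toy_lengths, genome_size=400))
```

Passing the same `genome_size` for both assemblies (for example a value in the 1-1.2 Gb range mentioned above) keeps the two NG50 values on the same scale.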

ADD REPLY • link 6.8 years ago by Chris Fields ★ 2.2k

I am still working on the Platanus scaffolding. Here are the stats so far:

Total sequences 569886
Total bases 755283165
Min sequence length 100
Max sequence length 412374
Average sequence length 1325.32
Median sequence length 200.00
N25 length 47550
N50 length 12865
N75 length 2006
N90 length 561
N95 length 210
As 29.04 %
Ts 28.69 %
Gs 20.22 %
Cs 20.19 %
(A + T)s 57.73 %
(G + C)s 40.41 %
Ns 1.86 %

Still not good enough.

ADD REPLY • link 6.8 years ago by yifan wang ▴ 20

score 1 · Answer 2 · 2017-06-25

6.8 years ago

Philipp Bayer 8.3k

Is it possible that you're adding the size of the contigs and the size of the scaffolds? Depending on the assembler, these two files represent the same genome, but the scaffold file contains Ns that connect contigs by using, for example, mate-pair reads. That would explain why you have double the expected size.

I have never seen a genome assembly that doubled the expected genome size. Only once did I have a bigger assembly, and that was because the genome size estimation was wrong. Is that a possibility here?

Which assembler did you use, what was the exact command? Which files are you adding up?
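One way to check the double-counting idea above is to summarise each FASTA file separately and compare the non-N base counts. The following is a minimal Python sketch; `fasta_base_counts` is a name invented for this illustration, and the toy records stand in for real contig and scaffold files.

```python
def fasta_base_counts(lines):
    """Return (sequences, total_bases, n_bases) for FASTA-formatted lines."""
    sequences = total = n_bases = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):  # header line starts a new record
            sequences += 1
        else:
            total += len(line)
            n_bases += line.upper().count("N")
    return sequences, total, n_bases


# Toy usage: two contigs, and one scaffold joining them with a gap of Ns.
toy_contigs = [">c1", "ACGTACGT", ">c2", "GGCC"]
toy_scaffolds = [">s1", "ACGTACGTNNNNGGCC"]

for name, records in (("contigs", toy_contigs), ("scaffolds", toy_scaffolds)):
    seqs, total, n_bases = fasta_base_counts(records)
    print(name, seqs, total, total - n_bases)
```

If the contig and scaffold files describe the same genome, their non-N base counts should be close to each other, and only one of the two should be counted towards the assembly size. For real files, an open file handle can be passed in place of the toy lists.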

ADD COMMENT • link 6.8 years ago by Philipp Bayer 8.3k

Here is the info:

Total sequences 2402525
Total bases 1524700266
Min sequence length 100
Max sequence length 157402
Average sequence length 634.62
Median sequence length 201.00
N25 length 7101
N50 length 2063
N75 length 524
N90 length 199
N95 length 128
As 26.62 %
Ts 26.16 %
Gs 18.51 %
Cs 18.77 %
(A + T)s 52.78 %
(G + C)s 37.28 %
Ns 9.94 %

I used SOAPdenovo for the assembly. Here are the commands:

/wangyf/soft/SOAPdenovo2-bin-LINUX-generic-r240/SOAPdenovo-127mer pregraph \
    -s ./lib.cfg -K 99 -R -p 20 \
    -o syc109mer \
    1>./syc109_pregraph.err 2>./syc109_pregraph.log

/wangyf/soft/SOAPdenovo2-bin-LINUX-generic-r240/SOAPdenovo-127mer contig \
    -g ./syc109mer -R -p 20 \
    1>./syc109mer-contig.err 2>./syc109mer-contig.log

/wangyf/soft/SOAPdenovo2-bin-LINUX-generic-r240/SOAPdenovo-127mer map -s lib.cfg.02 \
    -g ./syc109mer -p 20 \
    1>./syc109mer-map.err 2>./syc109mer-map.log

/wangyf/soft/SOAPdenovo2-bin-LINUX-generic-r240/SOAPdenovo-127mer scaff \
    -g ./syc109mer -F -p 20 \
    1>./syc109mer-scaffold.err 2>./syc109mer-scaffold.log

/wangyf/soft/Gapcloser/GapCloser \
    -a ./syc109mer.scafSeq \
    -b ./lib.cfg.02 \
    -o ./syc.gapclose \
    -l 140 \
    -t 40 \
    1>./syc109mer-gapcloser-2.err 2>./syc109mer-gapcloser-2.log

The genome size of another fish in the same family is about 650 Mb.

ADD REPLY • link 6.8 years ago by yifan wang ▴ 20