Question: genome assembly, larger genome size than expected
yifan wang wrote (2.0 years ago):

Hi folks. I am trying to assemble a marine fish genome. I have 120 G of reads from two short paired-end libraries and one large mate-pair library. The expected genome size is about 600 Mb, but I got 1200 Mb after gap closing with SOAPdenovo. I then read a paper (The oyster genome reveals stress adaptation and complexity of shell formation) and, searching online, found the phrase "Remove Redundancy From Assembly". Does anyone have an idea how to reduce this error, or any other advice?

assembly genome • 1.3k views
Chris Fields (University of Illinois Urbana-Champaign) wrote (2.0 years ago):

I have seen this in particularly heterozygous genome assemblies, where you may have alternative haplotypes assembled (though this also tends to shatter the genome assembly by breaking up the graph). It's also possible the genome size is underestimated.

Have you run a k-mer abundance analysis on your shotgun data? This normally gives you an idea of (1) overall k-mer coverage and (2) other issues you may have with the assembly, such as heterozygosity, repetitive sequences, high sequence error rate, etc. Lots of tools can do this, such as khmer and KAT. If heterozygosity is an issue, you can always try Platanus, though it can be a little tricky to use.
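To make the k-mer abundance idea concrete, here is a minimal sketch of how a k-mer depth histogram can be turned into a rough genome size estimate and a list of depth peaks. It assumes the histogram is already available as a plain dictionary (depth -> number of distinct k-mers); that input format, the function names and the `min_depth` cutoff are illustrative assumptions, not the output format or API of khmer or KAT. A second peak at roughly half the depth of the main peak is the usual hint of heterozygosity. Note that in a very heterozygous genome the half-depth peak can be the taller one, in which case this simple "tallest peak" rule would overestimate the genome size.

```python
def find_peaks(histogram, min_depth=5):
    """Return depths that are local maxima, ignoring the low-depth error region.

    histogram: dict mapping k-mer depth -> number of distinct k-mers (assumed format).
    """
    depths = sorted(d for d in histogram if d >= min_depth)
    peaks = []
    for i in range(1, len(depths) - 1):
        prev_n = histogram[depths[i - 1]]
        cur_n = histogram[depths[i]]
        next_n = histogram[depths[i + 1]]
        if cur_n > prev_n and cur_n >= next_n:
            peaks.append(depths[i])
    return peaks


def estimate_genome_size(histogram, min_depth=5):
    """Return (peak_depth, estimated_size) from a k-mer depth histogram.

    Estimated size = total k-mer instances / depth of the tallest peak.
    K-mers below min_depth are treated as sequencing errors and dropped.
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(d * n for d, n in solid.items())
    return peak_depth, total_kmers / peak_depth


# Toy histogram (made-up numbers): error k-mers at low depth,
# a small peak at depth 20 and the main peak at depth 40.
toy_histogram = {
    1: 900, 2: 300, 3: 100, 4: 40, 5: 10, 10: 20, 15: 60, 20: 120,
    25: 70, 30: 150, 35: 300, 40: 500, 45: 280, 50: 90, 55: 20, 60: 5,
}

peak, size = estimate_genome_size(toy_histogram)
print("main peak depth:", peak)
print("estimated genome size (in k-mers):", size)
print("peaks:", find_peaks(toy_histogram))
```

If the estimate from real data comes out near 1.2 Gb rather than 600 Mb, that would point to an underestimated genome size; two clear peaks would point to heterozygosity instead.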


Thanks for your suggestions. I tried Platanus for the assembly, and here is what I got: contig.fa is 1013 Mb and contigBubble.fa is 370 Mb.

Reply written 24 months ago by yifan wang

So if Platanus and SOAPdenovo roughly agree on genome size then I think your original genome size estimation is way off.

I'd run Chris' suggested kmer abundance analysis, here's an online tool which I found easiest to use:

Reply written 24 months ago by Philipp Bayer

What do the basic overall stats look like when comparing the two assemblies, NG50 for example? I suggest using an arbitrarily high estimated genome size when calculating these (maybe 1-1.2 Gb), just for comparison purposes, because the N50 values will not be directly comparable. Also, I recommend looking at MEGAHIT over SOAPdenovo2 (note that the GitHub docs for SOAPdenovo2 also state this). Don't include the bubble file with the Platanus data; those are generally the redundant sequences (possible allelic variants).
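As a rough illustration of why NG50 is the fairer comparison, here is a minimal sketch that computes N50 (relative to the assembly's own total length) and NG50 (relative to a fixed genome size supplied by the user). The function name and the toy contig lengths are made up for illustration; the fixed genome size would be whatever common value is chosen for both assemblies, for example something in the 1-1.2 Gb range mentioned above.

```python
def nx_length(lengths, fraction=0.5, genome_size=None):
    """Return the N50-style length of a set of sequence lengths.

    Without genome_size, the target is a fraction of the assembly total (N50).
    With genome_size, the target is a fraction of that fixed size (NG50).
    Returns None if the assembly never reaches the target.
    """
    reference = genome_size if genome_size is not None else sum(lengths)
    target = fraction * reference
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None


# Toy contig lengths (made-up numbers).
contig_lengths = [100, 80, 60, 40, 20]

print("N50:", nx_length(contig_lengths))
print("NG50 at genome size 400:", nx_length(contig_lengths, genome_size=400))
```

Because both assemblies are measured against the same fixed target, a larger but more fragmented assembly does not get an artificially different value just because its total length differs.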

Also, like most assemblers, Platanus and SOAPdenovo2/MEGAHIT have options at the contig and scaffold steps. These can be used to reduce redundancy and to play with linkage parameters.

Reply written 24 months ago by Chris Fields

I am still working on Platanus scaffolding. Here are the stats:

Total sequences: 569886
Total bases: 755283165
Min sequence length: 100
Max sequence length: 412374
Average sequence length: 1325.32
Median sequence length: 200.00
N25 length: 47550
N50 length: 12865
N75 length: 2006
N90 length: 561
N95 length: 210
As: 29.04 %
Ts: 28.69 %
Gs: 20.22 %
Cs: 20.19 %
(A + T)s: 57.73 %
(G + C)s: 40.41 %
Ns: 1.86 %

Still not good enough.

Reply written 24 months ago by yifan wang
Philipp Bayer wrote (2.0 years ago):

Is it possible that you're adding the size of the contigs and the size of the scaffolds? Depending on the assembler, these two files represent the same genome, but the scaffold file contains Ns that connect contigs using, for example, mate-pair reads. That would explain why you have double the expected size.

I have never seen a genome assembly that doubled the expected genome size. Only once did I have a bigger assembly, because the genome size estimation was wrong. Is that a possibility here?

Which assembler did you use, what was the exact command? Which files are you adding up?
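One quick way to check for this kind of double counting is to count total bases and N bases per file separately, rather than summing across files. The sketch below is only an illustration: the function name is made up, and the two in-memory FASTA strings stand in for a contig file and a scaffold file (real files would be opened with `open(path)` and passed in the same way).

```python
import io


def fasta_base_counts(handle):
    """Count sequences, total bases and N bases in a FASTA file-like object."""
    sequences = 0
    total_bases = 0
    n_bases = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            sequences += 1
            continue
        total_bases += len(line)
        n_bases += line.upper().count("N")
    return {"sequences": sequences, "total_bases": total_bases, "n_bases": n_bases}


# Toy data: two contigs, and one scaffold joining them with a 4-N gap.
toy_contigs = io.StringIO(">c1\nACGTACGT\n>c2\nGGCC\n")
toy_scaffolds = io.StringIO(">s1\nACGTACGTNNNNGGCC\n")

contig_stats = fasta_base_counts(toy_contigs)
scaffold_stats = fasta_base_counts(toy_scaffolds)

print("contigs:", contig_stats)
print("scaffolds:", scaffold_stats)
print("scaffold non-N bases:", scaffold_stats["total_bases"] - scaffold_stats["n_bases"])
```

If the non-N bases in the scaffold file come out close to the total of the contig file, the two files describe the same sequence and should not be added together.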


Here is the info:

Total sequences: 2402525
Total bases: 1524700266
Min sequence length: 100
Max sequence length: 157402
Average sequence length: 634.62
Median sequence length: 201.00
N25 length: 7101
N50 length: 2063
N75 length: 524
N90 length: 199
N95 length: 128
As: 26.62 %
Ts: 26.16 %
Gs: 18.51 %
Cs: 18.77 %
(A + T)s: 52.78 %
(G + C)s: 37.28 %
Ns: 9.94 %

I used SOAPdenovo for the assembly. Here are the commands:

/wangyf/soft/SOAPdenovo2-bin-LINUX-generic-r240/SOAPdenovo-127mer pregraph \
    -s ./lib.cfg -K 99 -R -p 20 \
    -o syc109mer \
    1>./syc109_pregraph.err 2>./syc109_pregraph.log

/wangyf/soft/SOAPdenovo2-bin-LINUX-generic-r240/SOAPdenovo-127mer contig \
    -g ./syc109mer -R -p 20 \
    1>./syc109mer-contig.err 2>./syc109mer-contig.log

/wangyf/soft/SOAPdenovo2-bin-LINUX-generic-r240/SOAPdenovo-127mer map -s lib.cfg.02 \
    -g ./syc109mer -p 20 \
    1>./syc109mer-map.err 2>./syc109mer-map.log

/wangyf/soft/SOAPdenovo2-bin-LINUX-generic-r240/SOAPdenovo-127mer scaff \
    -g ./syc109mer -F -p 20 \
    1>./syc109mer-scaffold.err 2>./syc109mer-scaffold.log

/wangyf/soft/Gapcloser/GapCloser \
    -a ./syc109mer.scafSeq \
    -b ./lib.cfg.02 \
    -o ./syc.gapclose \
    -l 140 \
    -t 40 \
    1>./syc109mer-gapcloser-2.err 2>./syc109mer-gapcloser-2.log

Also, the genome size of another fish in the same family is about 650 Mb.

Reply written 24 months ago by yifan wang