Hi folks. I am trying to assemble a marine fish genome. I have 120 G of reads from two short paired-end libraries and one large mate-pair library. The expected genome size is about 600 Mb, but I got 1200 Mb after gap closing with SOAPdenovo. I then read a paper (The oyster genome reveals stress adaptation and complexity of shell formation) and, searching online, found the phrase "Remove Redundancy From Assembly". Does anyone have an idea how to reduce this kind of error, or any other advice?
I have seen this in particularly heterozygous genome assemblies, where you may have alternative haplotypes assembled separately (though this also tends to shatter the genome assembly by breaking up the graph). It's also possible that the genome size is underestimated.
Have you run a k-mer abundance analysis on your shotgun data? This normally gives you an idea of (1) the overall k-mer coverage and (2) other issues you may have with the assembly, such as heterozygosity, repetitive sequences, or a high sequencing error rate. Lots of tools can do this, such as khmer and KAT. If heterozygosity is an issue, you can always try Platanus, though it can be a little tricky to use.
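To make the idea concrete, here is a rough Python sketch of what a k-mer abundance histogram is and how a genome size estimate can be read off it. This is not how khmer or KAT work internally; the function names, k=21 and the `min_abundance` cutoff are my own assumptions for illustration, and a pure-Python counter like this is only practical on a small subsample of reads.

```python
from collections import Counter

# Translation table used to build reverse complements.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def kmer_histogram(reads, k=21):
    """Map abundance -> number of distinct canonical k-mers seen that many times."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" in kmer:  # skip k-mers containing ambiguous bases
                continue
            counts[canonical(kmer)] += 1
    return Counter(counts.values())


def estimate_genome_size(hist, min_abundance=2):
    """Rough estimate: total k-mer occurrences divided by the main peak abundance.

    K-mers below min_abundance are ignored as likely sequencing errors
    (the cutoff is an assumption; in practice, pick it from the plotted histogram).
    """
    usable = {a: n for a, n in hist.items() if a >= min_abundance}
    if not usable:
        raise ValueError("no k-mers at or above min_abundance")
    peak = max(usable, key=lambda a: usable[a])  # abundance with most distinct k-mers
    total = sum(a * n for a, n in usable.items())
    return total // peak
```

In a strongly heterozygous genome the histogram usually shows two peaks, one at roughly half the abundance of the other. If the half-coverage peak is the taller one, this naive estimate comes out too large, so plot the histogram and check it by eye before trusting any number.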
Is it possible that you're adding the size of the contigs to the size of the scaffolds? Depending on the assembler, these two files represent the same genome, but the scaffold file contains Ns that connect contigs using, for example, mate-pair reads. That would explain why you have double the expected size.
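A quick way to check this is to get the size of each file separately, with and without Ns. Below is a minimal sketch in plain Python; the function name is made up for illustration, and you would pass whatever contig and scaffold FASTA files your assembler wrote.

```python
import sys


def fasta_stats(lines):
    """Return (total_bases, n_bases, num_sequences) for an iterable of FASTA lines."""
    total = n_bases = num_seqs = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            num_seqs += 1  # header line: one more sequence
        else:
            total += len(line)
            n_bases += line.upper().count("N")  # gap/ambiguous bases
    return total, n_bases, num_seqs


if __name__ == "__main__":
    # Report each file given on the command line on its own line,
    # so contig and scaffold totals are never summed together.
    for path in sys.argv[1:]:
        with open(path) as handle:
            total, n_bases, num_seqs = fasta_stats(handle)
        print(f"{path}: {num_seqs} sequences, {total} bp total, "
              f"{total - n_bases} bp excluding Ns")
```

If the scaffold total on its own is already around 1200 Mb, then double counting is not the explanation and the heterozygosity or genome size questions above are the more likely ones.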
I have never seen a genome assembly that was double the expected genome size. Only once did I get a bigger assembly than expected, and that was because the genome size estimate was wrong. Is that a possibility here?
Which assembler did you use, and what was the exact command? Which files are you adding up?