Hello,
We are trying to assemble the genome of the common loon, and I have used abyss (v. 1.5.2) to produce de novo assemblies with the following output for different values of k:
n | n:500 | n:N50 | min | N80 | N50 | N20 | E-size | max | sum | name |
k37 | |||||||||||
4689853 | 152662 | 53975 | 500 | 579 | 761 | 1177 | 935 | 10767 | 1.19E+08 | test-unitigs.fa | |
4689829 | 152665 | 53982 | 500 | 579 | 761 | 1177 | 935 | 10767 | 1.19E+08 | test-contigs.fa | |
4689653 | 152727 | 53871 | 500 | 579 | 762 | 1178 | 935 | 10767 | 1.19E+08 | test-scaffolds.fa | |
k55 | |||||||||||
2564599 | 25203 | 9944 | 500 | 542 | 639 | 867 | 795 | 14461 | 1.70E+07 | test-unitigs.fa | |
2564423 | 25179 | 9885 | 500 | 543 | 641 | 877 | 799 | 14461 | 1.71E+07 | test-contigs.fa | |
2564028 | 25142 | 9769 | 500 | 543 | 645 | 902 | 811 | 14461 | 1.72E+07 | test-scaffolds.fa | |
k32 | |||||||||||
5038033 | 198105 | 67641 | 500 | 591 | 802 | 1287 | 1005 | 7812 | 1.61E+08 | test-unitigs.fa | |
5038000 | 198106 | 67653 | 500 | 591 | 802 | 1287 | 1005 | 7812 | 1.61E+08 | test-contigs.fa | |
5037795 | 198153 | 67499 | 500 | 591 | 803 | 1287 | 1005 | 7812 | 1.62E+08 | test-scaffolds.fa | |
k48 | |||||||||||
3736945 | 62667 | 24079 | 500 | 554 | 678 | 955 | 804 | 9769 | 4.42E+07 | test-unitigs.fa | |
3736733 | 62628 | 24040 | 500 | 554 | 679 | 961 | 806 | 9769 | 4.43E+07 | test-contigs.fa | |
3735950 | 62435 | 23669 | 500 | 555 | 684 | 986 | 817 | 9769 | 4.45E+07 | test-scaffolds.fa | |
k64 | |||||||||||
1636437 | 5055 | 1730 | 500 | 542 | 655 | 1196 | 1133 | 12872 | 3.72E+06 | test-unitigs.fa | |
1636380 | 5054 | 1717 | 500 | 542 | 657 | 1203 | 1142 | 12872 | 3.83E+06 | test-contigs.fa | |
1636124 | 5088 | 1698 | 500 | 545 | 669 | 1282 | 1159 | 12872 | 3.83E+06 | test.scaffolds | |
k25 | |||||||||||
6946557 | 228359 | 83689 | 500 | 578 | 747 | 1096 | 873 | 5000 | 1.74E+08 | test-unitigs.fa | |
6946544 | 228358 | 83694 | 500 | 578 | 747 | 1096 | 873 | 5000 | 1.74E+08 | test-contigs.fa | |
6946414 | 228386 | 83762 | 500 | 578 | 747 | 1096 | 873 | 5000 | 1.74E+08 | test.scaffolds | |
k31 | |||||||||||
5114778 | 207133 | 70364 | 500 | 593 | 809 | 1301 | 1015 | 7999 | 1.70E+08 | test-unitigs.fa | |
5114751 | 207137 | 70181 | 500 | 593 | 810 | 1301 | 1015 | 7999 | 1.70E+08 | test-contigs.fa | |
5114566 | 207200 | 70239 | 500 | 593 | 810 | 1302 | 1015 | 7999 | 1.70E+08 | test.scaffolds | |
k30 | |||||||||||
5192389 | 216073 | 73119 | 500 | 595 | 814 | 1312 | 1022 | 7998 | 1.78E+08 | test-unitigs.fa | |
5192361 | 216073 | 73130 | 500 | 595 | 814 | 1312 | 1022 | 7998 | 1.78E+08 | test-contigs.fa | |
5192194 | 216128 | 72984 | 500 | 595 | 814 | 1313 | 1022 | 7998 | 1.78E+08 | test.scaffolds |
For the assembly with the highest N50 (814 bp), the contigs are small and highly fragmented (and essentially no scaffolds are produced) even after mapping these contigs to the available red-throated loon genome:
Minimum Scaffold Length | Number of Scaffolds | Number of Contigs | Total Scaffold Length | Total Contig Length | Scaffold Contig Coverage
All | 5,237,924 | 5,238,436 | 767,438,425 | 767,326,331 | 99.99%
50 | 3,616,441 | 3,616,953 | 710,236,525 | 710,124,431 | 99.98%
100 | 2,146,720 | 2,147,232 | 604,271,394 | 604,159,300 | 99.98%
250 | 743,885 | 744,397 | 394,016,485 | 393,904,391 | 99.97%
500 | 247,247 | 247,755 | 223,350,732 | 223,238,838 | 99.95%
1 KB | 62,044 | 62,409 | 98,533,822 | 98,431,583 | 99.90%
2.5 KB | 5,725 | 5,731 | 18,713,830 | 18,710,728 | 99.98%
5 KB | 231 | 231 | 1,310,589 | 1,310,589 | 100.00%
What I am wondering is whether anyone has any ideas about why our assembly is so fragmented, and whether there are any techniques we can use to improve contig length. The sequence data are paired-end reads (291,098,878 after filtering) drawn from a single insert library size (8 kb). Could the fact that we do not have multiple library sizes be to blame for the small contigs? I do not have an estimate of genome size, but it should be in the range of 1 Gb, and the species is diploid.
Here is the command I used to run abyss for different k-mer sizes:
nohup abyss-pe k=29 name=test29 np=48 in='/share/apps/Data/Loon/COLO1527-8kb_1.filtered.fastq.gz /share/apps/Data/Loon/COLO1527-8kb_2.fastq.gz' &
I am really hoping to find a way to improve contig length, but so far I have not found a way to do this or produce viable scaffolds. Thanks very much for any suggestions.
Zach
Thank you. The k-mer sizes are on the far right-hand side of the first table (they range from 25 to 64, but I did not run every value in between). The values in the sum column are apparently the sum of the lengths of the contigs at least 100 bp in length.
I am not sure about the heterozygosity, and I had not heard of Platanus, so thanks for that. I can't imagine why the heterozygosity would be higher than in other bird genomes assembled using abyss or SOAPdenovo, though.
The company that did the sequencing for us did an initial SOAPdenovo assembly, and we moved on to abyss since their assembly had a contig N50 of approximately 200. So at 814 we have improved slightly with abyss, but not as much as could be hoped.
I believe the read depth was about 33X.
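As a rough cross-check of that figure, here is a minimal sketch of the usual depth estimate (total sequenced bases divided by genome size). It uses the 291,098,878 filtered reads and the ~1 Gb genome size from the original post. The 100 bp read length is an assumption (it is only mentioned by a responder in this thread), and so is treating the count as individual reads rather than read pairs; the function name is purely illustrative.

```python
def read_depth(num_reads: int, read_length: int, genome_size: int) -> float:
    """Approximate sequencing depth: total bases sequenced / genome size."""
    return num_reads * read_length / genome_size


# Values from the thread; read length (100 bp) and genome size (~1 Gb) are assumptions.
NUM_READS = 291_098_878
READ_LENGTH = 100
GENOME_SIZE = 1_000_000_000

depth = read_depth(NUM_READS, READ_LENGTH, GENOME_SIZE)
print(f"Estimated depth: {depth:.1f}x")
```

Under those assumptions the estimate lands a little under 30x, in the same ballpark as the figure quoted above. If the count were read pairs, or the genome were larger or smaller than 1 Gb, the number would shift accordingly.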
Ah, didn't see the horizontal scrolling bar.. alright.
33x is rather low coverage for de novo assembly. You'd probably get higher contiguity if you sequenced an extra paired-end library.
Follow-up questions:
I concur; 33x is way too low for 100bp reads on a large diploid genome. With even K=55 you already have only 46 kmers per read, or 33*(46/100)/2 = 7.59x kmer coverage per ploidy, and for a good assembly, you would want an even higher K. You need a lot more coverage for a good assembly, preferably from longer reads (150bp at least; 250bp would be better).
Try to aim for at least 15x-20x kmer coverage per ploidy minimum. 30x is better. Sometimes you can improve things by merging your paired reads, if they are overlapping. That can increase the number of long kmers yielded per pair.
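The arithmetic above can be sketched in a few lines of Python, in case it helps to try other values of K. This is only an illustration of the same back-of-the-envelope formula (read depth × kmers per read / read length / ploidy); the function names are made up for this example, and the 33x depth and 100 bp read length are the figures assumed in this thread.

```python
def kmers_per_read(read_length: int, k: int) -> int:
    """Number of k-mers a single read yields: L - k + 1 (zero if k > L)."""
    return max(read_length - k + 1, 0)


def kmer_coverage(read_depth: float, read_length: int, k: int, ploidy: int = 2) -> float:
    """K-mer coverage per haplotype for a given base-level read depth."""
    return read_depth * kmers_per_read(read_length, k) / read_length / ploidy


def depth_needed(target_kmer_cov: float, read_length: int, k: int, ploidy: int = 2) -> float:
    """Base-level read depth required to reach a target k-mer coverage per haplotype."""
    n = kmers_per_read(read_length, k)
    if n == 0:
        raise ValueError("k is longer than the read; no k-mers are produced")
    return target_kmer_cov * ploidy * read_length / n


# Assumed inputs from the thread: ~33x depth, 100 bp reads, diploid genome.
for k in (25, 32, 48, 55, 64):
    cov = kmer_coverage(33, 100, k)
    need = depth_needed(20, 100, k)
    print(f"k={k}: {cov:.2f}x kmer coverage per ploidy; "
          f"~{need:.0f}x read depth needed for 20x")
```

For K=55 this reproduces the 7.59x figure quoted above, and it shows how quickly the required read depth grows as K approaches the read length, which is part of why longer reads help.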
I think it is great that you've tried many kmer values, but you should try other assemblers too. I have had great luck with MaSuRCA compared with all the other assemblers that I've tested (e.g., ALLPATHS N50 was 27kb, SOAP-denovo N50 was 16KB, RAY, I don't want to mention, and MaSuRCA was 2.4Mb!!).
Also, repeat content. What is your genome's estimated repeat content? I know animals have relatively low repeat content, but it will definitely matter!
Thanks very much to all of you for your help. I really appreciate it. I had suspected for a while that our sequencing coverage was too low, as many of you have said, but I have just been trying to make the best of the existing data. Your comments have helped me decide that we probably can't improve contig length much without additional sequencing with another insert library size and higher coverage, so that is what we are going to look into next.
To answer, I do believe that adapters were removed from reads, and error correction has been done, although I was not familiar with BFC. That might be something to use if we get another paired-end library. I am pretty new to this type of work so I haven't estimated repeat content, but most bird genomes have low repeat content for amniotes.
Abyss has worked well for us compared to SOAPdenovo, but these are the only assemblers we have tried. I'll look into MaSuRCA. Any suggestions on what an appropriate insert size for another library would be?
Thanks,
Zach
Overlapping 2x250bp reads