Question

Fragmented ABySS assembly

8.9 years ago

zgayk ▴ 90

Hello,

We are trying to assemble the genome of the common loon, and I have used abyss (v. 1.5.2) to produce de novo assemblies with the following output for different values of k:

n         n:500    n:N50   min   N80   N50   N20    E-size   max     sum        name                 k
4689853   152662   53975   500   579   761   1177   935      10767   1.19E+08   test-unitigs.fa      37
4689829   152665   53982   500   579   761   1177   935      10767   1.19E+08   test-contigs.fa      37
4689653   152727   53871   500   579   762   1178   935      10767   1.19E+08   test-scaffolds.fa    37
2564599   25203    9944    500   542   639   867    795      14461   1.70E+07   test-unitigs.fa      55
2564423   25179    9885    500   543   641   877    799      14461   1.71E+07   test-contigs.fa      55
2564028   25142    9769    500   543   645   902    811      14461   1.72E+07   test-scaffolds.fa    55
5038033   198105   67641   500   591   802   1287   1005     7812    1.61E+08   test-unitigs.fa      32
5038000   198106   67653   500   591   802   1287   1005     7812    1.61E+08   test-contigs.fa      32
5037795   198153   67499   500   591   803   1287   1005     7812    1.62E+08   test-scaffolds.fa    32
3736945   62667    24079   500   554   678   955    804      9769    4.42E+07   test-unitigs.fa      48
3736733   62628    24040   500   554   679   961    806      9769    4.43E+07   test-contigs.fa      48
3735950   62435    23669   500   555   684   986    817      9769    4.45E+07   test-scaffolds.fa    48
1636437   5055     1730    500   542   655   1196   1133     12872   3.72E+06   test-unitigs.fa      64
1636380   5054     1717    500   542   657   1203   1142     12872   3.83E+06   test-contigs.fa      64
1636124   5088     1698    500   545   669   1282   1159     12872   3.83E+06   test.scaffolds       64
6946557   228359   83689   500   578   747   1096   873      5000    1.74E+08   test-unitigs.fa      25
6946544   228358   83694   500   578   747   1096   873      5000    1.74E+08   test-contigs.fa      25
6946414   228386   83762   500   578   747   1096   873      5000    1.74E+08   test.scaffolds       25
5114778   207133   70364   500   593   809   1301   1015     7999    1.70E+08   test-unitigs.fa      31
5114751   207137   70181   500   593   810   1301   1015     7999    1.70E+08   test-contigs.fa      31
5114566   207200   70239   500   593   810   1302   1015     7999    1.70E+08   test.scaffolds       31
5192389   216073   73119   500   595   814   1312   1022     7998    1.78E+08   test-unitigs.fa      30
5192361   216073   73130   500   595   814   1312   1022     7998    1.78E+08   test-contigs.fa      30
5192194   216128   72984   500   595   814   1313   1022     7998    1.78E+08   test.scaffolds       30
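
For anyone reading the N50 column for the first time, here is a minimal Python sketch of how that statistic is usually defined: sort the contigs from longest to shortest and report the length at which the running total first reaches half of the summed length. The function name `n50` and the `min_length` parameter are made up for illustration. The "min" column of 500 suggests the table only counts sequences of at least 500 bp, but that is an assumption, and this is not ABySS's own code.

```python
def n50(lengths, min_length=0):
    """Return the N50 of the contig lengths that are >= min_length.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    # Keep only contigs above the (hypothetical) reporting threshold,
    # longest first.
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        raise ValueError("no contigs at or above min_length")

    half = sum(kept) / 2
    running = 0
    for length in kept:
        running += length
        if running >= half:
            return length


# Toy contig lengths, not taken from the loon assembly.
toy = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy))  # 8
```

With a threshold, the short contigs are dropped before the halfway point is computed, so `n50(lengths, min_length=500)` can be much larger than the N50 over all contigs.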

For the assembly with the highest N50 (814 bp), the contigs are small and highly fragmented (and essentially no scaffolds are produced) even after mapping these contigs to the available red-throated loon genome:

Minimum     Number            Number            Total             Total             Scaffold
Scaffold    of                of                Scaffold          Contig            Contig  
Length      Scaffolds         Contigs           Length            Length            Coverage
--------    --------------    --------------    --------------    --------------    --------
    All          5,237,924         5,238,436       767,438,425       767,326,331      99.99%
     50          3,616,441         3,616,953       710,236,525       710,124,431      99.98%
    100          2,146,720         2,147,232       604,271,394       604,159,300      99.98%
    250            743,885           744,397       394,016,485       393,904,391      99.97%
    500            247,247           247,755       223,350,732       223,238,838      99.95%
   1 KB             62,044            62,409        98,533,822        98,431,583      99.90%
 2.5 KB              5,725             5,731        18,713,830        18,710,728      99.98%
   5 KB                231               231         1,310,589         1,310,589     100.00%

What I am wondering is whether anyone has any ideas why our assembly is so fragmented and whether there are any techniques we can use to improve contig length. The sequence data are paired-end reads (291,098,878 after filtering) drawn from a single insert library size (8 kb). Could the fact that we do not have multiple library sizes be to blame for the small contigs? I do not have an estimate of genome size, but it should be in the range of 1 Gb, and the species is diploid.

Here is the command I used to run abyss for different k-mer sizes:

nohup abyss-pe k=29 name=test29 np=48 in='/share/apps/Data/Loon/COLO1527-8kb_1.filtered.fastq.gz /share/apps/Data/Loon/COLO1527-8kb_2.fastq.gz' &

I am really hoping to find a way to improve contig length, but so far I have not found a way to do this or produce viable scaffolds. Thanks very much for any suggestions.

Zach

Assembly • 3.0k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 8.9 years ago by zgayk ▴ 90

Indeed, if contigs are short, it will be challenging to scaffold using a single large insert library.
In the first table, what were the kmer sizes? Also, is the "sum" column the size of the assembly? Seems low.
Is your sequencing coverage well above 30x?
Perhaps the heterozygosity is high? There are specific assemblers for that, e.g. Platanus. Anyhow, it's always a good idea to try another assembler.

ADD REPLY • link updated 14 months ago by Ram 43k • written 8.9 years ago by Rayan Chikhi ★ 1.5k

Thank you. The k-mer sizes are on the far right-hand side of the first table (they range from 25 to 64, but I did not run every value in between). The sum column is apparently the sum of the lengths of the contigs at least 100 bp in length.

I am not sure about the heterozygosity, and I had not heard of Platanus, so thanks for that. That said, I can't imagine why the heterozygosity would be higher than in other bird genomes assembled using abyss or SOAPdenovo.

The company that did the sequencing for us did an initial SOAPdenovo assembly, and we moved on to abyss since their assembly had a contig N50 of approximately 200. So at 814 we have improved slightly with abyss, but not as much as could be hoped.

I believe the read depth was about 33X.
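
As a rough sanity check on that figure, depth is usually estimated as (number of reads x read length) / genome size. Below is a minimal Python sketch; the 100 bp read length and the 1 Gb genome size are assumptions (the genome size is only my ballpark guess from the question), and I am also assuming the 291,098,878 figure counts individual reads rather than pairs.

```python
def base_coverage(num_reads, read_length, genome_size):
    """Estimate average sequencing depth as total bases / genome size."""
    return num_reads * read_length / genome_size


reads = 291_098_878       # filtered read count quoted in the question
read_len = 100            # ASSUMPTION: read length in bp
genome = 1_000_000_000    # ASSUMPTION: genome size of roughly 1 Gb

print(round(base_coverage(reads, read_len, genome), 1))  # 29.1
```

Under these assumptions the estimate comes out near 29x, in the same ballpark as 33X; a somewhat smaller genome or longer reads would close the gap, and if the count were pairs rather than reads the depth would double.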

ADD REPLY • link updated 14 months ago by Ram 43k • written 8.9 years ago by zgayk ▴ 90

Ah, didn't see the horizontal scrolling bar.. alright.

33x is rather low coverage for de novo assembly. You'd probably get higher contiguity if you sequenced an extra paired-end library.

Follow-up questions:

Did you check for adapters in reads?
How about error correction? See a recent paper (which also reviews other correction tools).

ADD REPLY • link updated 14 months ago by Ram 43k • written 8.9 years ago by Rayan Chikhi ★ 1.5k

I concur; 33x is way too low for 100bp reads on a large diploid genome. Even with K=55 you have only 46 kmers per read (100 - 55 + 1), or 33*(46/100)/2 = 7.59x kmer coverage per ploidy, and for a good assembly you would want an even higher K. You need a lot more coverage for a good assembly, preferably from longer reads (150bp at least; 250bp would be better).

Try to aim for at least 20x kmer coverage per ploidy as a minimum; 30x is better.
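
The arithmetic above can be sketched in a few lines of Python, which makes it easy to try other read lengths and K values. The function names are made up for illustration, and the formula is simply the back-of-the-envelope one used here (kmers per read = read length - K + 1, split evenly across the haplotypes), not the output of any particular tool.

```python
def kmer_coverage(base_cov, read_len, k, ploidy=2):
    """K-mer coverage per haplotype for a given base coverage."""
    kmers_per_read = read_len - k + 1
    return base_cov * kmers_per_read / read_len / ploidy


def base_coverage_needed(target_kmer_cov, read_len, k, ploidy=2):
    """Base coverage required to reach a target per-haplotype k-mer coverage."""
    kmers_per_read = read_len - k + 1
    if kmers_per_read <= 0:
        raise ValueError("k must not exceed the read length")
    return target_kmer_cov * ploidy * read_len / kmers_per_read


# The case discussed in this thread: 33x of 100bp reads at K=55, diploid.
print(round(kmer_coverage(33, 100, 55), 2))          # 7.59

# Base coverage needed for 20x kmer coverage per ploidy at K=55.
print(round(base_coverage_needed(20, 100, 55), 1))   # 100bp reads: 87.0
print(round(base_coverage_needed(20, 250, 55), 1))   # 250bp reads: 51.0
```

Longer reads lose a smaller fraction of their bases to the K - 1 overhead, which is why the same target needs far less raw coverage with 250bp reads.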

Sometimes you can improve things by merging your paired reads, if they are overlapping. That can increase the number of long kmers yielded per pair.
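
To illustrate why merging helps with long kmers, here is a small Python sketch that counts kmers from a pair before and after merging. The function names, the 100bp read length, and the 20bp overlap are hypothetical values chosen for the example, and it assumes the two reads of a pair really do overlap.

```python
def kmers_unmerged(read_len, k):
    """K-mers yielded by the two separate reads of a pair."""
    return 2 * max(read_len - k + 1, 0)


def kmers_merged(read_len, overlap, k):
    """K-mers yielded by one merged read built from an overlapping pair."""
    if not 0 <= overlap <= read_len:
        raise ValueError("overlap must be between 0 and read_len")
    merged_len = 2 * read_len - overlap
    return max(merged_len - k + 1, 0)


# Hypothetical pair: 2x100bp reads overlapping by 20bp (merged length 180bp).
for k in (55, 91, 127):
    print(k, kmers_unmerged(100, k), kmers_merged(100, 20, k))
# 55 92 126
# 91 20 90
# 127 0 54
```

At K=127 the separate reads yield nothing at all, because K is longer than either read, while the merged read still contributes kmers. The gain shrinks as the overlap grows, since a large overlap makes the merged read only slightly longer than a single read.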

ADD REPLY • link updated 14 months ago by Ram 43k • written 8.9 years ago by Brian Bushnell 20k

I think it is great that you've tried many kmer values, but you should try other assemblers too. I have had great luck with MaSuRCA compared to all the other assemblers that I've tested (e.g., ALLPATHS N50 was 27 kb, SOAPdenovo N50 was 16 kb, Ray I don't want to mention, and MaSuRCA was 2.4 Mb!).

Also, repeat content. What is your genome's estimated repeat content? I know animals have relatively low repeat content, but it will definitely matter!

ADD REPLY • link updated 14 months ago by Ram 43k • written 8.9 years ago by arnstrm ★ 1.8k

Thanks very much to all of you for your help. I really appreciate it. I have suspected for a while that our sequencing coverage was too low, as many of you have said, but I have been trying to make the best of the existing data. Your comments have helped me decide that we probably can't improve contig length much without additional sequencing with another insert library size and higher coverage, so that is what we are going to look into next.

To answer, I do believe that adapters were removed from reads, and error correction has been done, although I was not familiar with BFC. That might be something to use if we get another paired-end library. I am pretty new to this type of work so I haven't estimated repeat content, but most bird genomes have low repeat content for amniotes.

Abyss has worked well for us compared to SOAPdenovo, but these are the only assemblers we have tried. I'll look into MaSuRCA. Any suggestions on what an appropriate insert size for another library would be?

Thanks,
Zach

ADD REPLY • link updated 14 months ago by Ram 43k • written 8.9 years ago by zgayk ▴ 90

Overlapping 2x250bp reads

ADD REPLY • link 8.9 years ago by Rayan Chikhi ★ 1.5k