Question: Fragmented ABySS assembly
3.9 years ago, zgayk (United States) wrote:

Hello,

We are trying to assemble the genome of the common loon, and I have used abyss (v. 1.5.2) to produce de novo assemblies with the following output for different values of k:

n        n:500   n:N50  min  N80  N50  N20   E-size  max    sum       name               k
4689853  152662  53975  500  579  761  1177  935     10767  1.19E+08  test-unitigs.fa    37
4689829  152665  53982  500  579  761  1177  935     10767  1.19E+08  test-contigs.fa    37
4689653  152727  53871  500  579  762  1178  935     10767  1.19E+08  test-scaffolds.fa  37
2564599  25203   9944   500  542  639  867   795     14461  1.70E+07  test-unitigs.fa    55
2564423  25179   9885   500  543  641  877   799     14461  1.71E+07  test-contigs.fa    55
2564028  25142   9769   500  543  645  902   811     14461  1.72E+07  test-scaffolds.fa  55
5038033  198105  67641  500  591  802  1287  1005    7812   1.61E+08  test-unitigs.fa    32
5038000  198106  67653  500  591  802  1287  1005    7812   1.61E+08  test-contigs.fa    32
5037795  198153  67499  500  591  803  1287  1005    7812   1.62E+08  test-scaffolds.fa  32
3736945  62667   24079  500  554  678  955   804     9769   4.42E+07  test-unitigs.fa    48
3736733  62628   24040  500  554  679  961   806     9769   4.43E+07  test-contigs.fa    48
3735950  62435   23669  500  555  684  986   817     9769   4.45E+07  test-scaffolds.fa  48
1636437  5055    1730   500  542  655  1196  1133    12872  3.72E+06  test-unitigs.fa    64
1636380  5054    1717   500  542  657  1203  1142    12872  3.83E+06  test-contigs.fa    64
1636124  5088    1698   500  545  669  1282  1159    12872  3.83E+06  test.scaffolds     64
6946557  228359  83689  500  578  747  1096  873     5000   1.74E+08  test-unitigs.fa    25
6946544  228358  83694  500  578  747  1096  873     5000   1.74E+08  test-contigs.fa    25
6946414  228386  83762  500  578  747  1096  873     5000   1.74E+08  test.scaffolds     25
5114778  207133  70364  500  593  809  1301  1015    7999   1.70E+08  test-unitigs.fa    31
5114751  207137  70181  500  593  810  1301  1015    7999   1.70E+08  test-contigs.fa    31
5114566  207200  70239  500  593  810  1302  1015    7999   1.70E+08  test.scaffolds     31
5192389  216073  73119  500  595  814  1312  1022    7998   1.78E+08  test-unitigs.fa    30
5192361  216073  73130  500  595  814  1312  1022    7998   1.78E+08  test-contigs.fa    30
5192194  216128  72984  500  595  814  1313  1022    7998   1.78E+08  test.scaffolds     30
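
For readers less familiar with the N50 column, here is a minimal sketch of the usual definition: the length of the contig at which the running total, counted from the longest contig down, first reaches half of the summed length. The 500 bp cutoff is an assumption chosen to match the min column above, and the contig lengths are made up for illustration. This is not the code the assembler's own statistics tool runs.

```python
def n50(lengths, min_length=500):
    """Return the N50 of the sequences that are at least min_length long."""
    # Keep only sequences passing the cutoff, longest first.
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    half = sum(kept) / 2
    running = 0
    for length in kept:
        running += length
        if running >= half:
            return length
    return 0  # no sequence passed the length filter


# Toy contig lengths (hypothetical, not taken from the loon assembly).
toy = [2000, 1500, 900, 700, 600, 400, 120]
print(n50(toy))  # 1500: 2000 + 1500 is the first running total >= 5700 / 2
```

Because short contigs are filtered out before the statistic is computed, an N50 just above the cutoff (as in the table) says that most of the retained sequence sits barely over 500 bp.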

For the assembly with the highest N50 (814 bp), the contigs are small and highly fragmented (and essentially no scaffolds are produced) even after mapping these contigs to the available red-throated loon genome:

Minimum     Number            Number            Total             Total             Scaffold
Scaffold    of                of                Scaffold          Contig            Contig  
Length      Scaffolds         Contigs           Length            Length            Coverage
--------    --------------    --------------    --------------    --------------    --------
    All          5,237,924         5,238,436       767,438,425       767,326,331      99.99%
     50          3,616,441         3,616,953       710,236,525       710,124,431      99.98%
    100          2,146,720         2,147,232       604,271,394       604,159,300      99.98%
    250            743,885           744,397       394,016,485       393,904,391      99.97%
    500            247,247           247,755       223,350,732       223,238,838      99.95%
   1 KB             62,044            62,409        98,533,822        98,431,583      99.90%
 2.5 KB              5,725             5,731        18,713,830        18,710,728      99.98%
   5 KB                231               231         1,310,589         1,310,589     100.00%

 

What I am wondering is whether anyone has any ideas why our assembly is so fragmented and if there are any techniques we can use to improve contig length. Sequence data are in the form of paired-end reads (291,098,878 after filtering) drawn from a single library with one insert size (8 kb). Could the fact that we do not have multiple library sizes be to blame for the small contigs? I do not have an estimate of genome size, but it should be in the range of 1 Gb, and the species is diploid.

Here is the command I used to run abyss for different k-mer sizes: nohup abyss-pe k=29 name=test29 np=48 in='/share/apps/Data/Loon/COLO1527-8kb_1.filtered.fastq.gz /share/apps/Data/Loon/COLO1527-8kb_2.fastq.gz' &
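
A sweep over k values like the one in the table can be scripted rather than typed by hand. The sketch below only builds the command strings, using the same abyss-pe parameters that appear in the command above (k, name, np, in). The helper name, the read file names, and the list of k values are hypothetical placeholders.

```python
import shlex


def abyss_pe_command(k, reads, prefix="test", np=48):
    """Build one abyss-pe command line for a given k (does not run it)."""
    in_arg = "in=" + " ".join(reads)
    args = ["abyss-pe", f"k={k}", f"name={prefix}{k}", f"np={np}", in_arg]
    # shlex.quote wraps the in=... argument in quotes because it contains a space.
    return " ".join(shlex.quote(a) for a in args)


# Hypothetical read files and k values; substitute your own.
reads = ["reads_1.fastq.gz", "reads_2.fastq.gz"]
for k in (25, 30, 37, 48, 55, 64):
    print(abyss_pe_command(k, reads))
```

Giving each run its own name keeps the outputs of different k values apart, as name=test29 does in the original command.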

I am really hoping to find a way to improve contig length, but so far I have not found a way to do this or produce viable scaffolds. Thanks very much for any suggestions.

Zach

 

assembly • 1.7k views
written 3.9 years ago by zgayk
  • Indeed, if contigs are short, it will be challenging to scaffold using a single large insert library.
  • In the first table, what were the kmer sizes? Also, is the "sum" column the size of the assembly? Seems low.
  • Is your sequencing coverage well above 30x?
  • Perhaps the heterozygosity is high? There are specific assemblers for that, e.g. Platanus. Anyhow, it's always a good idea to try another assembler.
written 3.9 years ago by Rayan Chikhi

Thank you. The k-mer sizes are on the far right-hand side of the first table (they range from 25 to 64, but I did not run every value in between). The sum column is apparently the sum of the lengths of the contigs at least 100 bp in length.

I am not sure about the heterozygosity, and I had not heard of Platanus, so thanks for that. That said, I can't imagine why the heterozygosity would be higher than in other bird genomes assembled using abyss or SOAPdenovo.

The company that did the sequencing for us did an initial SOAPdenovo assembly, and we moved on to abyss since their assembly had a contig N50 of approximately 200. So at 814 we have improved slightly with abyss, but not as much as could be hoped.

I believe the read depth was about 33X.

written 3.9 years ago by zgayk

Ah, didn't see the horizontal scrolling bar.. alright. 

33x is rather low coverage for de novo assembly. You'd probably get higher contiguity if you sequenced an extra paired-end library.

Follow-up questions:

  • Did you check for adapters in reads?
  • How about error correction? See a recent paper (which also reviews other correction tools). 
written 3.9 years ago by Rayan Chikhi

I concur; 33x is way too low for 100bp reads on a large diploid genome.  With even K=55 you already have only 46 kmers per read, or 33*(46/100)/2 = 7.59x kmer coverage per ploidy, and for a good assembly, you would want an even higher K.  You need a lot more coverage for a good assembly, preferably from longer reads (150bp at least; 250bp would be better).

Try to aim for at least 15x to 20x kmer coverage per ploidy as a minimum. 30x is better.
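
The arithmetic above can be checked with a short sketch. A read of length L contains L - k + 1 kmers, so kmer coverage per ploidy is read coverage times (L - k + 1) / L, divided by the ploidy. The function names are made up for illustration, and the 100 bp read length is the figure assumed in this reply.

```python
def kmer_coverage(read_coverage, read_length, k, ploidy=2):
    """Kmer coverage per ploidy for a given read coverage."""
    kmers_per_read = read_length - k + 1
    return read_coverage * kmers_per_read / read_length / ploidy


def read_coverage_needed(target_kmer_coverage, read_length, k, ploidy=2):
    """Read coverage required to reach a target kmer coverage per ploidy."""
    kmers_per_read = read_length - k + 1
    return target_kmer_coverage * ploidy * read_length / kmers_per_read


# 33x of 100 bp reads at k=55 on a diploid genome.
print(round(kmer_coverage(33, 100, 55), 2))  # 7.59

# Read coverage needed for 20x kmer coverage per ploidy at the same k.
print(round(read_coverage_needed(20, 100, 55)))  # 87
```

Under these assumptions, reaching 20x kmer coverage per ploidy at k=55 with 100 bp reads would take roughly 87x read coverage. Longer reads lower that figure, because a smaller fraction of each read is lost to the k - 1 overhead.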

Sometimes you can improve things by merging your paired reads, if they are overlapping.  That can increase the number of long kmers yielded per pair.
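
To see why merging helps at large k, count the kmers a pair yields before and after merging. This is a back-of-the-envelope sketch with hypothetical numbers (100 bp reads overlapping by 30 bp). It ignores sequencing errors and assumes the merge itself is correct.

```python
def kmers_unmerged(read_length, k):
    """Kmers yielded by two separate reads of the same length."""
    return 2 * max(read_length - k + 1, 0)


def kmers_merged(read_length, overlap, k):
    """Kmers yielded by one read merged from a pair with the given overlap."""
    merged_length = 2 * read_length - overlap
    return max(merged_length - k + 1, 0)


# Hypothetical pair: 2 x 100 bp reads overlapping by 30 bp (merged length 170 bp).
for k in (55, 90):
    print(k, kmers_unmerged(100, k), kmers_merged(100, 30, k))
# k=55: 92 unmerged vs 116 merged
# k=90: 22 unmerged vs 81 merged
```

When k is no longer than the read, the difference works out to k - 1 - overlap, so the gain grows as k increases. Merging only applies to libraries whose insert size is short enough for the two reads to overlap.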

modified 3.9 years ago • written 3.9 years ago by Brian Bushnell

I think it is great that you've tried many kmer values, but you should try other assemblers too. I have had great luck with MaSuRCA compared to all the other assemblers that I've tested (e.g., ALLPATHS N50 was 27 kb, SOAPdenovo N50 was 16 kb, Ray I don't want to mention, and MaSuRCA was 2.4 Mb!!).

Also, repeat content. What is your genome's estimated repeat content? I know animals have relatively low repeat content, but it will definitely matter!

written 3.9 years ago by arnstrm

Thanks very much to all of you for your help. I really appreciate it. I suspected that our sequencing coverage was too low for a while as many of you have said, but I have just been trying to make the best of the existing data. But your comments have helped me decide that we probably can't improve contig length too much without additional sequencing with another insert library size and higher coverage, so that is what we are going to look into next. 

To answer, I do believe that adapters were removed from reads, and error correction has been done, although I was not familiar with BFC. That might be something to use if we get another paired-end library. I am pretty new to this type of work so I haven't estimated repeat content, but most bird genomes have low repeat content for amniotes. 

Abyss has worked well for us compared to SOAPdenovo, but these are the only assemblers we have tried. I'll look into MaSuRCA. Any suggestions on what an appropriate insert size for another library would be?

Thanks,

Zach

 

written 3.9 years ago by zgayk

Overlapping 2x250bp reads

written 3.9 years ago by Rayan Chikhi