Question: question about canu outfiles
2.7 years ago by
celine.petitjean30 wrote:

Hi all,

This is my second post, again about sequencing reads and genome assembly. I have assembled reads from a protist genome sequenced with PacBio. As suggested by PacBio, I used the canu package to do it, and I ended up with a list of files, but I am not sure which file corresponds to what. I have read the canu documentation, but as a complete beginner in the field, I have a lot of doubts!

In particular, canu gave me 4 different FASTA output files:

x.bubbles.fasta (0 sequences - size: 0B)

x.contigs.fasta (2559 sequences - size: 66M)

x.unassembled.fasta (111507 sequences - size: 521M)

x.unitigs.fasta (6927 sequences - size: 90M)

If I understood what I read correctly, the contigs contain all the reads that could be assembled, and the unassembled file contains the remaining reads, which could not be integrated into the assembly? In that case, does it mean that the totality of my "genome" is contained in these two files? But I am concerned about the unitigs file, for which I can't find a proper description. Based on other posts I have read here, I understood that it contains all the single reads that have been integrated into the contigs, but in a unique version (if a sequence is present twice in the contigs, it will be present only once in the unitigs). But if that is the case, I don't understand why I end up with a unitigs file 1.5x bigger than the contigs one...

Also, for information, the genome size has been estimated at around 176.5 Mb.

If this is redundant with another post, seems naive, or if I am not using the right vocabulary, I apologize in advance and will be grateful to be corrected! Thank you in advance!

next-gen assembly genome • 2.0k views
2.7 years ago by
United States
igor8.8k wrote:

You can find an explanation of those files here:

'contigs' will span repeats, as long as the repeat is unambiguous.

'unitigs' are derived from contigs. Wherever a contig end intersects the middle of another contig, the contig is split.

'bubbles' are deprecated and will be removed in the next release. Treat them as contigs for now.

'unassembled' contains mostly reads that failed to assemble into a contig. There will be some assembled sequences, but these will be short and nearly the same as the longest read in them.
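File sizes on disk are only a rough proxy for how much sequence each file holds, so it may help to count sequences and bases directly and compare the totals with the estimated genome size of about 176.5 Mb. The following is a minimal sketch in plain Python, not part of canu; the function names (fasta_lengths, n50, summarize, summarize_file) are hypothetical, and it assumes ordinary uncompressed FASTA files such as x.contigs.fasta.

```python
def fasta_lengths(lines):
    """Yield the length of each record in an iterable of FASTA lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the N50: the length L such that records of length >= L
    together cover at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for value in sorted(lengths, reverse=True):
        running += value
        if 2 * running >= total:
            return value
    return 0  # no records


def summarize(lines):
    """Return sequence count, total bases, and N50 for FASTA lines."""
    lengths = list(fasta_lengths(lines))
    return {
        "sequences": len(lengths),
        "total_bases": sum(lengths),
        "n50": n50(lengths),
    }


def summarize_file(path):
    """Summarize one uncompressed FASTA file given its path."""
    with open(path) as handle:
        return summarize(handle)
```

For example, summarize_file("x.contigs.fasta") and summarize_file("x.unitigs.fasta") would give totals that can be compared with each other and with the genome size estimate. The file names are the ones from the question; a 0-byte file such as x.bubbles.fasta simply reports zero sequences.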
