Question

Question about Canu output files

7.2 years ago

celine.petitjean ▴ 30

Hi all,

This is my second post, again about sequencing reads and genome assembly. I have assembled reads from a protist genome sequenced with PacBio. As recommended by PacBio, I used the Canu package to do it, and I ended up with a list of files, but I am not sure which file corresponds to what. I have read the Canu documentation, but as a complete beginner in the field, I have a lot of doubts!

In particular, Canu gave me 4 different FASTA output files:

x.bubbles.fasta (0 sequences - size: 0B)

x.contigs.fasta (2559 sequences - size: 66M)

x.unassembled.fasta (111507 sequences - size: 521M)

x.unitigs.fasta (6927 sequences - size: 90M)

If I understood correctly what I read, the contigs contain all the reads that could be assembled, and the unassembled file contains the remaining reads, which could not be integrated into the assembly? In that case, does it mean that the totality of my "genome" is contained in these two files? But I am concerned about the unitigs file, for which I can't find a proper description. Based on other posts I have read here, I understood that it contains all the single reads that have been integrated into the contigs, but in a unique version (if a sequence is present twice in the contigs, it will be present only once in the unitigs). But if that is the case, I don't understand why I end up with a unitigs file roughly 1.4x bigger than the contigs one (90M vs. 66M)...

Also, for information, the genome size has been estimated at around 176.5 Mb.
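Since the sizes above are file sizes in bytes (which also include header lines and newlines), it can help to count sequences and bases directly before comparing against the 176.5 Mb estimate. Below is a minimal sketch in plain Python; the function names `fasta_stats`, `fasta_file_stats`, and `fraction_of_genome` are made up for illustration and are not part of Canu.

```python
from typing import Iterable, Tuple


def fasta_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of sequences, total bases) for FASTA-formatted lines."""
    n_seqs = 0
    total_bases = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_seqs += 1  # each header line starts a new record
        else:
            total_bases += len(line)  # sequence lines may be wrapped
    return n_seqs, total_bases


def fasta_file_stats(path: str) -> Tuple[int, int]:
    """Same as fasta_stats, but reads from a file on disk."""
    with open(path) as handle:
        return fasta_stats(handle)


def fraction_of_genome(total_bases: int, genome_size: int) -> float:
    """Total assembled bases divided by the estimated genome size."""
    return total_bases / genome_size
```

For example, `fasta_file_stats("x.contigs.fasta")` would return the sequence count and total length of the contigs file, and passing that length to `fraction_of_genome(..., 176_500_000)` would show how much of the estimated genome size the contigs add up to.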

If this is redundant with another post, seems naive, or if I am not using the right vocabulary, I apologize in advance and will be grateful to be corrected! Thank you in advance!

Assembly genome next-gen • 3.6k views

updated 7.2 years ago by igor 13k • written 7.2 years ago by celine.petitjean ▴ 30

score 1 · Answer 1 · 2017-02-15

You can find an explanation of those files here: https://github.com/marbl/canu/issues/286

'contigs' will span repeats, as long as the repeat is unambiguous.

'unitigs' are derived from contigs. Wherever a contig end intersects the middle of another contig, the contig is split.
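The splitting rule quoted above can be pictured with a toy sketch. This is only an illustration of "split a contig wherever another contig end intersects it", not Canu's actual algorithm; the function `split_at` and the breakpoint positions are hypothetical.

```python
from typing import List


def split_at(sequence: str, breakpoints: List[int]) -> List[str]:
    """Split a sequence at the given 0-based positions.

    Positions at the very start or end (or outside the sequence)
    are ignored, so no empty pieces are produced.
    """
    pieces = []
    start = 0
    for pos in sorted(set(breakpoints)):
        if 0 < pos < len(sequence):
            pieces.append(sequence[start:pos])
            start = pos
    pieces.append(sequence[start:])
    return pieces


# A toy "contig" intersected by other contig ends at positions 4 and 8.
toy_unitigs = split_at("AAAACCCCGGGG", [4, 8])
print(toy_unitigs)
```

In this toy model, splitting raises the number of sequences but leaves the total length unchanged, which is consistent with seeing more unitigs than contigs but would not by itself explain a larger file.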

'bubbles' are deprecated and will be removed in the next release. Treat them as contigs for now.

'unassembled' contains mostly reads that failed to assemble into a contig. There will be some assembled sequences, but these will be short and nearly the same as the longest read in them.