Question: Question about Canu output files
9 months ago, celine.petitjean wrote:

Hi all,

This is my second post, again about sequencing reads and genome assembly. I have assembled reads from a protist genome sequenced with PacBio. As recommended by PacBio, I used the Canu package to do it, and I ended up with a list of files, but I am not sure which corresponds to what. I have read the Canu documentation, but as a complete beginner in the field, I have a lot of doubts!

In particular, Canu gave me 4 different FASTA output files:

x.bubbles.fasta (0 sequences - size: 0B)

x.contigs.fasta (2559 sequences - size: 66M)

x.unassembled.fasta (111507 sequences - size: 521M)

x.unitigs.fasta (6927 sequences - size: 90M)

If I understood what I read correctly, the contigs contain all the reads that could be assembled, and the unassembled file contains the remaining reads, which could not be integrated into the assembly? In that case, would the totality of my "genome" be contained in these two files? But I am concerned about the unitigs file, for which I can't find a proper description. Based on other posts I have read here, I understood that it contains all the single reads that have been integrated into the contigs, but in a unique version (if a sequence is present twice in the contigs, it will be present only once in the unitigs). But if that is the case, I don't understand why I end up with a unitigs file 1.5x bigger than the contigs one...

Also, for information, the genome size has been estimated at around 176.5 Mb.

If this is redundant with another post, seems naive, or if I am not using the right vocabulary, I apologize in advance and will be grateful to be corrected! Thank you in advance!

next-gen assembly genome • 475 views
9 months ago, igor (United States) wrote:

You can find an explanation of those files here: https://github.com/marbl/canu/issues/286

'contigs' will span repeats, as long as the repeat is unambiguous.

'unitigs' are derived from contigs. Wherever a contig end intersects the middle of another contig, the contig is split.

'bubbles' are deprecated and will be removed in the next release. Treat them as contigs for now.

'unassembled' contains mostly reads that failed to assemble into a contig. There will be some assembled sequences, but these will be short and nearly the same as the longest read in them.
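To compare these files by sequence content rather than by file size on disk, one option is to count sequences and bases directly. The following is only a minimal sketch: it uses a plain-Python FASTA reader rather than any Canu tooling, the function names (`fasta_lengths`, `n50`, `summarize`) are made up for illustration, and the file names and the 176.5 Mb genome size are simply taken from the question.

```python
from pathlib import Path


def fasta_lengths(path):
    """Return a list with the length of every sequence in a FASTA file."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(path, genome_size=None):
    """Sequence count, total bases and N50, plus total bases / genome_size if given."""
    lengths = fasta_lengths(path)
    total = sum(lengths)
    summary = {"sequences": len(lengths), "total_bases": total, "n50": n50(lengths)}
    if genome_size:
        summary["fraction_of_genome"] = total / genome_size
    return summary


if __name__ == "__main__":
    GENOME_SIZE = 176_500_000  # estimate quoted in the question (assumption)
    for name in ("x.contigs.fasta", "x.unitigs.fasta", "x.unassembled.fasta"):
        if Path(name).exists():
            print(name, summarize(name, GENOME_SIZE))
```

Comparing `total_bases` for the contigs file against the estimated genome size gives a rough idea of how much of the genome the assembly covers.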
