Nonconforming FASTG files produced by SPAdes
4.4 years ago
Nick Stoler ▴ 70

I'm working with the SPAdes assembler (version 3.1.1), and trying to use its FASTG graph output.

Looking at the contigs.fastg file it produces, I find header lines I can't understand, even after carefully reading the FASTG specification. The spec says headers are supposed to be in this format:

>edge1:edge2,edge3,edge4:properties;


That is, three colon-separated fields: the name of the current sequence, the list of sequences it connects to, and some properties. The list of connecting sequences is supposed to be comma-separated.

But the headers in my contigs.fastg file look like this instead:

>edge1:edge2:edge3:edge4:edge5;


That is, the sequence name followed by a colon-separated list of what look like connecting sequences. The thing is, Bandage reads this file fine, and I know the SPAdes developers use it to view these FASTG files. Also, from reading the parser Heng Li wrote specifically for SPAdes' FASTG output, it's clear that the list of edges is what it appears to be (connecting sequences).
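To make the difference between the two layouts concrete, here is a minimal Python sketch of how I read each one. The function names `parse_spec_header` and `parse_colon_header` are hypothetical, and the second function only encodes my assumption that everything after the first colon is a neighbor; it is not taken from the spec or from any existing parser.

```python
def parse_spec_header(line):
    """Parse a spec-style header: >name:neighbor1,neighbor2:properties;"""
    fields = line.strip().lstrip(">").rstrip(";").split(":")
    name = fields[0]
    # Second field: comma-separated neighbors (may be absent or empty).
    neighbors = fields[1].split(",") if len(fields) > 1 and fields[1] else []
    # Third field: free-form properties (may be absent).
    properties = fields[2] if len(fields) > 2 else ""
    return name, neighbors, properties


def parse_colon_header(line):
    """Parse the colon-only layout: >name:neighbor1:neighbor2:...;

    Assumption: the first field is the sequence name and every
    remaining field is a neighbor.
    """
    fields = line.strip().lstrip(">").rstrip(";").split(":")
    return fields[0], fields[1:]


print(parse_spec_header(">edge1:edge2,edge3,edge4:properties;"))
print(parse_colon_header(">edge1:edge2:edge3:edge4:edge5;"))
```

Note that feeding the colon-only header to the spec-style parser would quietly treat `edge3` as the properties field and drop the rest, which is one way a strict reader could end up with a different graph.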

However, when I view the file in Bandage and try to match the graphs to the headers in the file, I can't understand how Bandage arrives at the graph I observe from the input FASTG. Edges that are listed in the same header (i.e. supposedly adjacent) will show up in completely separate, disconnected graphs in Bandage.
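One way to check this independently of Bandage is to compute the connected components straight from the headers and compare them with what the viewer draws. The sketch below is only that, a sketch: `read_adjacency` and `connected_components` are hypothetical helpers, it assumes the colon-only layout (first field is the name, the rest are neighbors), it treats links as undirected, and it compares names as exact strings.

```python
def read_adjacency(lines):
    """Map each sequence name to the set of neighbors named in its header."""
    graph = {}
    for line in lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line.strip().lstrip(">").rstrip(";").split(":")
        name = fields[0]
        neighbors = [f for f in fields[1:] if f]  # ignore empty fields
        graph.setdefault(name, set())
        for neighbor in neighbors:
            # Assumption: only connectivity matters, so links are undirected.
            graph[name].add(neighbor)
            graph.setdefault(neighbor, set()).add(name)
    return graph


def connected_components(graph):
    """Return the connected components as a list of sorted name lists."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:  # iterative depth-first search
            node = stack.pop()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components
```

For example, calling `connected_components(read_adjacency(open("contigs.fastg")))` lists the groups of sequences that the headers claim are connected. If two names share a group here but sit in separate graphs in Bandage, the two tools are reading the headers differently.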

Can anyone help me understand what's going on? Is there some other, unpublished update to the FASTG spec that everyone but me knows?

Assembly • 1.4k views

The old, SPAdes 3.1.1 colon-delimited format does seem to be what I suggested above, where the first edge is the name of the sequence, and the following edges are its neighbors.

Also, the Bandage visualization is weird because it turns out Bandage doesn't support the old format. It's looking increasingly like SPAdes 3.1.1 just didn't write correct FASTG.