Hi everyone
Hi!
I am trying to understand pangenomic data/file formats. In vg's
description of its file formats, it says gam is similar to sam/bam and
usually in binary (compressed) form. Is there a difference between
sam and gam in terms of format?
yes. a .sam or .bam is a linear alignment map. a .gam is a graph-based alignment map. this accounts for the differences in the structure/organization of the file; that is, .gam files contain additional fields that describe the paths a read takes through the pangenome graph, including nodes visited and edge traversals - these extra fields allow the GAM to store complex paths that are non-linear in structure.
are gam files always compressed?
a .bam is a binary sam file; a .sam file should be plain text. by contrast, a .gam is usually binary, but can be converted to a readable format (like json). there are also further compressions of .bam files. suppose you are running a clinical sequencing operation, and you need to keep hundreds of (very large) .bam files in storage long-term. in this case, you may want to compress them further. CRAM files and other compression methods are used for this.
If not, how can we know if it is binary or not?
there are lots of ways. probably the simplest thing is something like cat myfile.gam. if that outputs nonsense, it's binary, but this isn't the most exact method. it's better, if you're in a linux environment, to use something like:
file myfile.gam
which will indicate either that it's a "data" file (binary) or "ASCII text". alternatively, you can use vg itself; something like vg view -a myfile.gam will print the contents of the file as json
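if you need the same kind of check from inside a program rather than at the shell, here is a rough C++ sketch of the idea behind the file command: read the first few hundred bytes and look for signs of binary content. the names FileKind, classify_bytes and classify_file are made up for this example and are not part of vg or any sam/bam library. it assumes that a leading 0x1f 0x8b (the gzip magic number) means compressed data and that control bytes other than tab/newline/carriage return mean binary; it is only a heuristic and does not validate the file as sam or gam.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical labels for this sketch; not part of vg or any SAM/BAM library.
enum class FileKind { Gzip, Binary, Text, Unreadable };

// Classify a buffer holding the first bytes of a file.
FileKind classify_bytes(const std::string& head) {
    // Assumption: gzip-compatible streams start with the magic bytes 0x1f 0x8b.
    if (head.size() >= 2 &&
        static_cast<unsigned char>(head[0]) == 0x1f &&
        static_cast<unsigned char>(head[1]) == 0x8b) {
        return FileKind::Gzip;
    }
    for (unsigned char c : head) {
        // Tab, newline and carriage return are normal in text files such as SAM.
        const bool allowed_control = (c == '\t' || c == '\n' || c == '\r');
        // Any other control byte (including NUL) or DEL suggests binary data.
        if ((c < 0x20 && !allowed_control) || c == 0x7f) {
            return FileKind::Binary;
        }
    }
    // Bytes >= 0x80 are tolerated here so that UTF-8 text still counts as text.
    return FileKind::Text;
}

// Read up to max_bytes from the start of the file and classify them.
FileKind classify_file(const std::string& path, std::size_t max_bytes = 512) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return FileKind::Unreadable;
    }
    std::string head(max_bytes, '\0');
    in.read(&head[0], static_cast<std::streamsize>(max_bytes));
    // Keep only the bytes that were actually read (the file may be shorter).
    head.resize(static_cast<std::size_t>(in.gcount()));
    return classify_bytes(head);
}
```

calling classify_file("myfile.gam") and comparing the result against FileKind::Text is then enough to decide whether a plain-text parser is even worth trying. an empty file is reported as Text, so you may want to handle that case separately.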
hope it helps!
Thanks a lot, it is really helpful. Actually, I am trying to detect whether a file is binary from within C++ code, so which method would be better to use there? Can I use the same parsing in the code as for the sam format?