Question: BAM to CRAM to BAM
2
gravatar for ElUretsky
20 months ago by
ElUretsky20
Germany
ElUretsky20 wrote:

I would like to know, can someone post a working example of converting a BAM file to CRAM and BAM again?

I tried samtools, cramtools but each time, the BAM tags get mangled. Can someone show me how to get:

samtools view file1.bam | md5sum 
ed071dff8d60c74eefd1d14eb1f4bb21

then convert to CRAM, back to bam file2.bam such that:

samtools view file2.bam | md5sum 
ed071dff8d60c74eefd1d14eb1f4bb21

Sorry if this is trivial but I have spent a few hours on this.

cram bam • 2.0k views
ADD COMMENTlink modified 20 months ago by James Bonfield130 • written 20 months ago by ElUretsky20

Unfortunately, due to the number of different ways the same data can be stored in SAM/BAM/CRAM, checksums can not be used as a proxy for knowing what data is stored. Slight variations in compression would change the checksum without touching the data at all. SAM -> BAM and BAM -> SAM are not entirely lossless, and neither is BAM -> CRAM. I doubt it was even ever intended to be so.

Typically with user-definable properties you ensure these properties are sorted to prevent this problem (and others) from occurring, but bioinformatics is anything but typical ^_^

ADD REPLYlink written 20 months ago by John12k
3
gravatar for Istvan Albert
20 months ago by
Istvan Albert ♦♦ 79k
University Park, USA
Istvan Albert ♦♦ 79k wrote:

I think the source of the problem is that the BAM specification does not require or prescribe an order to the tags, hence these can be reordered and still represent the same information. I would contact the SAM developers and see what could be done about this:

https://github.com/samtools/

ADD COMMENTlink written 20 months ago by Istvan Albert ♦♦ 79k
4
gravatar for James Bonfield
20 months ago by
James Bonfield130 wrote:

CRAM doesn't entirely know whether to generate MD and NM tags or not. It'll either always generate them or never generate them. The RG tag is also stored differently in CRAM and gets added on at the end, which causes it to change position. This is valid as the order is stated to be meaningless in SAM, but it makes it hard to validate.

It is possible, just about, to get RG/NM/MD working correctly, but it's a fudge. You can get the RG, MD and NM tags stored verbatim in the CRAM file and then ask for CRAM not to autogenerate them. I don't think we have a way to do this in samtools though and it's a bit of a contrived example too. If you really want to try that though (I don't know what the difference you are seeing is - an example would help!) then try "scramble -P" (scramble from Staden io_lib), and either "scramble" or "samtools view --input-fmt-options decode_md=0" to decode without the MD/NM tags.

There are also scramble -q (don't add @PG header lines) and -p (preserve BAM i vs s vs c (etc) aux size) options, which may help too, but you wouldn't be able to notice these differences in your view | md5sum above.

Unfortunately there are other issues too that CRAM lacks perfect round-trip information on. It always fills out the pnext/rnext/tlen fields when it can and can also in some situations fix invalid information in the flags, much like samtools fixmates would. It's arguable whether that is a bug or not!

Finally, if you're just trying to validate the round-trip, perhaps it is better to use a format aware comparison function instead of md5sum so that changes that are valid within the specification (reordering) don't show up as a problem. Htslib tests come with a compare_sam.pl script for this purpose. It may also help to canonicalise your input first - eg running samtools fixmates to ensure it doesn't contain invalid values.

ADD COMMENTlink modified 20 months ago • written 20 months ago by James Bonfield130
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1886 users visited in the last hour