Entering edit mode
7.0 years ago
ElUretsky ▴ 30

I would like to know, can someone post a working example of converting a BAM file to CRAM and BAM again?

I tried samtools, cramtools but each time, the BAM tags get mangled. Can someone show me how to get:

samtools view file1.bam | md5sum 

then convert to CRAM, back to bam file2.bam such that:

samtools view file2.bam | md5sum 

Sorry if this is trivial but I have spent a few hours on this.

bam cram • 5.7k views
Entering edit mode

Unfortunately, due to the number of different ways the same data can be stored in SAM/BAM/CRAM, checksums can not be used as a proxy for knowing what data is stored. Slight variations in compression would change the checksum without touching the data at all. SAM -> BAM and BAM -> SAM are not entirely lossless, and neither is BAM -> CRAM. I doubt it was even ever intended to be so.

Typically with user-definable properties you ensure these properties are sorted to prevent this problem (and others) from occurring, but bioinformatics is anything but typical ^_^

Entering edit mode
7.0 years ago

I think the source of the problem is that the BAM specification does not require or prescribe an order to the tags, hence these can be reordered and still represent the same information. I would contact the SAM developers and see what could be done about this:

Entering edit mode
6.9 years ago

CRAM doesn't entirely know whether to generate MD and NM tags or not. It'll either always generate them or never generate them. The RG tag is also stored differently in CRAM and gets added on at the end, which causes it to change position. This is valid as the order is stated to be meaningless in SAM, but it makes it hard to validate.

It is possible, just about, to get RG/NM/MD working correctly, but it's a fudge. You can get the RG, MD and NM tags stored verbatim in the CRAM file and then ask for CRAM not to autogenerate them. I don't think we have a way to do this in samtools though and it's a bit of a contrived example too. If you really want to try that though (I don't know what the difference you are seeing is - an example would help!) then try "scramble -P" (scramble from Staden io_lib), and either "scramble" or "samtools view --input-fmt-options decode_md=0" to decode without the MD/NM tags.

There are also scramble -q (don't add @PG header lines) and -p (preserve BAM i vs s vs c (etc) aux size) options, which may help too, but you wouldn't be able to notice these differences in your view | md5sum above.

Unfortunately there are other issues too that CRAM lacks perfect round-trip information on. It always fills out the pnext/rnext/tlen fields when it can and can also in some situations fix invalid information in the flags, much like samtools fixmates would. It's arguable whether that is a bug or not!

Finally, if you're just trying to validate the round-trip, perhaps it is better to use a format aware comparison function instead of md5sum so that changes that are valid within the specification (reordering) don't show up as a problem. Htslib tests come with a script for this purpose. It may also help to canonicalise your input first - eg running samtools fixmates to ensure it doesn't contain invalid values.


Login before adding your answer.

Traffic: 1462 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6