Question: BAM to CRAM to BAM
gravatar for ElUretsky
2.2 years ago by
ElUretsky20 wrote:

I would like to know, can someone post a working example of converting a BAM file to CRAM and BAM again?

I tried samtools, cramtools but each time, the BAM tags get mangled. Can someone show me how to get:

samtools view file1.bam | md5sum 

then convert to CRAM, back to bam file2.bam such that:

samtools view file2.bam | md5sum 

Sorry if this is trivial but I have spent a few hours on this.

cram bam • 2.5k views
ADD COMMENTlink modified 2.2 years ago by James Bonfield130 • written 2.2 years ago by ElUretsky20

Unfortunately, due to the number of different ways the same data can be stored in SAM/BAM/CRAM, checksums can not be used as a proxy for knowing what data is stored. Slight variations in compression would change the checksum without touching the data at all. SAM -> BAM and BAM -> SAM are not entirely lossless, and neither is BAM -> CRAM. I doubt it was even ever intended to be so.

Typically with user-definable properties you ensure these properties are sorted to prevent this problem (and others) from occurring, but bioinformatics is anything but typical ^_^

ADD REPLYlink written 2.2 years ago by John12k
gravatar for Istvan Albert
2.2 years ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

I think the source of the problem is that the BAM specification does not require or prescribe an order to the tags, hence these can be reordered and still represent the same information. I would contact the SAM developers and see what could be done about this:

ADD COMMENTlink written 2.2 years ago by Istvan Albert ♦♦ 81k
gravatar for James Bonfield
2.2 years ago by
James Bonfield130 wrote:

CRAM doesn't entirely know whether to generate MD and NM tags or not. It'll either always generate them or never generate them. The RG tag is also stored differently in CRAM and gets added on at the end, which causes it to change position. This is valid as the order is stated to be meaningless in SAM, but it makes it hard to validate.

It is possible, just about, to get RG/NM/MD working correctly, but it's a fudge. You can get the RG, MD and NM tags stored verbatim in the CRAM file and then ask for CRAM not to autogenerate them. I don't think we have a way to do this in samtools though and it's a bit of a contrived example too. If you really want to try that though (I don't know what the difference you are seeing is - an example would help!) then try "scramble -P" (scramble from Staden io_lib), and either "scramble" or "samtools view --input-fmt-options decode_md=0" to decode without the MD/NM tags.

There are also scramble -q (don't add @PG header lines) and -p (preserve BAM i vs s vs c (etc) aux size) options, which may help too, but you wouldn't be able to notice these differences in your view | md5sum above.

Unfortunately there are other issues too that CRAM lacks perfect round-trip information on. It always fills out the pnext/rnext/tlen fields when it can and can also in some situations fix invalid information in the flags, much like samtools fixmates would. It's arguable whether that is a bug or not!

Finally, if you're just trying to validate the round-trip, perhaps it is better to use a format aware comparison function instead of md5sum so that changes that are valid within the specification (reordering) don't show up as a problem. Htslib tests come with a script for this purpose. It may also help to canonicalise your input first - eg running samtools fixmates to ensure it doesn't contain invalid values.

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by James Bonfield130
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 931 users visited in the last hour