CRAM to BAM using SAMTOOLS
1
0
Entering edit mode
8 weeks ago

Hi, I'm trying to convert a .cram file into a .bam file and then index it using the commands samtools view, samtools sort, and samtools index, but I keep getting these errors:

[W::sanitise_SQ_lines] Header @SQ length mismatch for ref chr1, 248956422 vs 1829166
[W::cram_decode_slice] Slice ends beyond reference end
[E::cram_decode_slice] MD5 checksum reference mismatch for ref 0 pos 1738847..1829166
[E::cram_decode_slice] CRAM: 7967333fff01950056c4d027080b6d64
[E::cram_next_slice] Failure to decode slice
[main_samview] truncated file.


I'm pretty sure that the reference file I'm using is correct so I don't know what the problem is.

Thanks.

SAMTOOLS CRAM BAM • 393 views
0
Entering edit mode

what is the output of

samtools view -H in.cram | grep '^@SQ' | grep -w chr1


and the output of

grep -w chr1 /path/to/reference.fa.fai

0
Entering edit mode

samtools view -H 6026111_23153_0_0.cram | grep '^@SQ' | grep -w chr1

@SQ SN:chr1 LN:248956422 M5:6aef897c3d6ff0c78aff06ac189178dd UR:/home/dnanexus/./GRCh38_full_analysis_set_plus_decoy_hla/GRCh38_full_analysis_set_plus_decoy_hla.fa

and

grep -w chr1 ~/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai

chr1 1829166 112 70 71

1
Entering edit mode

this is NOT the same reference.

while in the fasta file the length of chr 1 is 1829166

0
Entering edit mode
7 weeks ago
jkbonfield ▴ 700

This looks very much like a reference mismatch. Your CRAM header will have M5 tags, so it's best to find references that way than trying to force it to use a specific reference file (and potentially getting it wrong). It's good that it calls this out too, otherwise you'd just decode and get the wrong data back.

Check the samtools.1 man page and search for REF_PATH and REF_CACHE environment variables. If you have them unset and have an internet connection it'll automatically download them from EBI and cache them. If not, you can prepopulate it by downloading them yourself and pointing REF_PATH to that directory structure, or using the samtools/misc/seq_cache_populate.pl script on the (correct!) reference to generate the MD5 files.

Note this is also both a faster way of driving CRAM because it doesn't need any newline / uppercasing on the fasta file as it reads the CRAM, permitting mmap usage, but it also is safer as it's immune to strangeness of Chr1 vs chr1 vs 1 shenanigans.

0
Entering edit mode

Here is what the CRAM file is telling me: @SQ SN:HLA-DRB115:01:01:04 LN:11056 AH: M5:3bb2753595aaf099a490a7ce1e042fce UR:/home/dnanexus/./GRCh38_full_analysis_set_plus_decoy_hla/GRCh38_full_analysis_set_plus_decoy_hla.fa

And the the reference that I'm using is GRCh38_full_analysis_set_plus_decoy_hla.fa so I really don't know what the problem is.