Hi, I tried to transfer cram file to bam file using:
samtools view -T ./hg38new/hg38.fa -b -o 12bj1_withrefernece_bam.bam GTEx_Analysis_2021-02-11_v9_WGS_CRAM_files_GTEX-12BJ1-0001-SM-7DRPO.cram
And I have checked the information from cram file using
samtools view -H GTEx_Analysis_2021-02-11_v9_WGS_CRAM_files_GTEX-12BJ1-0001-SM-7DRPO.cram| grep '^@SQ' | grep -w chr1
it has output like:
@SQ SN:chr1 LN:248956422 M5:6aef897c3d6ff0c78aff06ac189178dd AS:38 UR:/cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta SP:Homo sapiens
However, I received a bug:
[E::cram_decode_slice] MD5 checksum reference mismatch at chr1:248746535-248770115
[E::cram_decode_slice] CRAM : afcc018588dfd618b122bdb507d7925e
[E::cram_decode_slice] Ref : 482590e5c9e3d07b221fe24f34feffe8
[E::cram_decode_slice] @SQ M5: 6aef897c3d6ff0c78aff06ac189178dd
[E::cram_decode_slice] Please check the reference given is correct
[E::cram_next_slice] Failure to decode slice
samtools view: error reading file "GTEx_Analysis_2021-02-11_v9_WGS_CRAM_files_GTEX-12BJ1-0001-SM-7DRPO.cram"
It seems that this bug was cauased by mismatch, however, I used hg38 for decoding. Does anyone have similar experience? Thanks.
That SQ M5 is chr1 in my copy of GRCh38 (Homo_sapiens.GRCh38_full_analysis_set_plus_decoy_hla.fa).
I'm assuming your hg38.fa is not the same as GRCh38 though. This is a curse of the bioinformatics community, with people doing weird stuff to references and blithly assuming it'll work elsewhere. Just download a new reference maybe, or remove it completely and let htslib download it from EBI's reference server (and cache it locally).
Generally speaking, I prefer to always use a caching reference server and let the software figure out which reference is which from the MD5 sums. Attempting to explicitly give it a reference is just an accident waiting to happen due to mismatching files. An example web server configuration for that is here: https://github.com/daviesrob/ref-cache-nginx
Just to be certain, you are saying that if OP runs
samtools view
without the explicit reference it will be automatically downloaded from EBI?Assuming you have network access and the md5sum is registered with the EBI's reference server, yes - it'll be downloaded and cached locally.
If I remove the -T, it can work but I am not sure about the reference genome.
you cannot decode a CRAM without a reference.
Based on the name of the file this appears to be a GTEx dataset.
Download the proper reference based on the version of your file from here: https://gtexportal.org/home/downloads/adult-gtex/reference
Edit: Looks like you may have a v9 dataset and the references only are up to v8 on GTEx portal.
Thanks. I only need the fasta file because I am working on individual sequencing data. Do you have any idea about directly transferring cram file to fasta file?
Looking at this MD5 sum this reference is likely one mentioned in mine or @Wouter's post here: Find reference fasta based on M5/MD5 string