Questions about a bug when transferring cram file to bam file
1
0
Entering edit mode
5 weeks ago
me • 0

Hi, I tried to transfer cram file to bam file using:

samtools view -T ./hg38new/hg38.fa -b -o 12bj1_withrefernece_bam.bam GTEx_Analysis_2021-02-11_v9_WGS_CRAM_files_GTEX-12BJ1-0001-SM-7DRPO.cram

And I have checked the information from cram file using

samtools view -H GTEx_Analysis_2021-02-11_v9_WGS_CRAM_files_GTEX-12BJ1-0001-SM-7DRPO.cram| grep '^@SQ' | grep -w chr1

it has output like:

@SQ     SN:chr1 LN:248956422    M5:6aef897c3d6ff0c78aff06ac189178dd     AS:38   UR:/cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta SP:Homo sapiens

However, I received a bug:

[E::cram_decode_slice] MD5 checksum reference mismatch at chr1:248746535-248770115
[E::cram_decode_slice] CRAM  : afcc018588dfd618b122bdb507d7925e
[E::cram_decode_slice] Ref   : 482590e5c9e3d07b221fe24f34feffe8
[E::cram_decode_slice] @SQ M5: 6aef897c3d6ff0c78aff06ac189178dd
[E::cram_decode_slice] Please check the reference given is correct
[E::cram_next_slice] Failure to decode slice
samtools view: error reading file "GTEx_Analysis_2021-02-11_v9_WGS_CRAM_files_GTEX-12BJ1-0001-SM-7DRPO.cram"

It seems that this bug was cauased by mismatch, however, I used hg38 for decoding. Does anyone have similar experience? Thanks.

sequence samtools bcftools • 1.1k views
ADD COMMENT
1
Entering edit mode

That SQ M5 is chr1 in my copy of GRCh38 (Homo_sapiens.GRCh38_full_analysis_set_plus_decoy_hla.fa).

I'm assuming your hg38.fa is not the same as GRCh38 though. This is a curse of the bioinformatics community, with people doing weird stuff to references and blithly assuming it'll work elsewhere. Just download a new reference maybe, or remove it completely and let htslib download it from EBI's reference server (and cache it locally).

Generally speaking, I prefer to always use a caching reference server and let the software figure out which reference is which from the MD5 sums. Attempting to explicitly give it a reference is just an accident waiting to happen due to mismatching files. An example web server configuration for that is here: https://github.com/daviesrob/ref-cache-nginx

ADD REPLY
1
Entering edit mode

Just to be certain, you are saying that if OP runs samtools view without the explicit reference it will be automatically downloaded from EBI?

ADD REPLY
0
Entering edit mode

Assuming you have network access and the md5sum is registered with the EBI's reference server, yes - it'll be downloaded and cached locally.

ADD REPLY
0
Entering edit mode

If I remove the -T, it can work but I am not sure about the reference genome.

ADD REPLY
0
Entering edit mode

you cannot decode a CRAM without a reference.

ADD REPLY
0
Entering edit mode

Based on the name of the file this appears to be a GTEx dataset.

Download the proper reference based on the version of your file from here: https://gtexportal.org/home/downloads/adult-gtex/reference

Edit: Looks like you may have a v9 dataset and the references only are up to v8 on GTEx portal.

ADD REPLY
0
Entering edit mode

Thanks. I only need the fasta file because I am working on individual sequencing data. Do you have any idea about directly transferring cram file to fasta file?

ADD REPLY
0
Entering edit mode

@SQ M5: 6aef897c3d6ff0c78aff06ac189178dd

Looking at this MD5 sum this reference is likely one mentioned in mine or @Wouter's post here: Find reference fasta based on M5/MD5 string

ADD REPLY
0
Entering edit mode
5 weeks ago

samtools view -H GTEx_Analysis_2021-02-11_v9_WGS_CRAM_files_GTEX-12BJ1-0001-SM-7DRPO.cram| grep '^@SQ' | grep -w chr1

doesn't mean that it is the very same reference.

You could use

samtools samples -h -f hg38.fa  in.cram 

if the column for the REF is a dot, then it's not the same ref (chrom size, number of chroms)

also, try to validate the cram

samtools quickcheck in.cram && echo OK
ADD COMMENT
0
Entering edit mode

Hi, I have checked that the cram file prints ok. And yes, it is a dot... Interesting, does this mean I cannot find the reference of this cram file?

ADD REPLY
0
Entering edit mode

Interesting, does this mean I cannot find the reference of this cram file?

it means that the REF you're using is not the one that was associated to generate the cram

ADD REPLY

Login before adding your answer.

Traffic: 2290 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6