Issue: When trying to extract chrY from a 1000 Genomes CRAM file using samtools view, I encounter a reference MD5 checksum mismatch specifically for chrY. The error persists even after using the official 1000 Genomes reference genome.
Error Message:
[E::cram_decode_slice] MD5 checksum reference mismatch at chrY:2781477-2857046
[E::cram_decode_slice] CRAM : 2b9a5e1407a4a59d4c9d3112ab77fb81
[E::cram_decode_slice] Ref : f0de8a8cc1c1d1a691f4d26a7698f8b0
[E::cram_decode_slice] @SQ M5: ce3e31103314a704255f3cd90369ecce
Steps Taken:
- Reference Genome Download: Used the official 1000 Genomes reference:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
Full file MD5: 64b32de2fc934679c16e83a2bc072064 (matches documentation)
Extracted with gzip -d, no reformatting applied.
- Validation:
Whole reference file MD5 matches expectations.
chrY-specific MD5 mismatch:
samtools faidx GRCh38_full_analysis_set_plus_decoy_hla-20190212.fa chrY | md5sum
# Returns: f0de8a8cc1c1d1a691f4d26a7698f8b0
vs CRAM header expectation: ce3e31103314a704255f3cd90369ecce.
- Command Used:
samtools view -h -b -f 3 \
-T GRCh38_full_analysis_set_plus_decoy_hla.fa \
HG00465.final.cram chrY > chrY.bam
Key Observations:
The issue is chromosome-specific (chrY fails, other chromosomes work).
Official reference files were used without modification.
Reproduced across multiple downloads of the reference genome.
Question:
Why does the chrY sequence in the official 1000 Genomes reference (GRCh38_full_analysis_set_plus_decoy_hla.fa) have a different MD5 (f0de8a8cc1c1d1a691f4d26a7698f8b0) than what the CRAM file expects (ce3e31103314a704255f3cd90369ecce )?
How can I obtain the exact reference version used to generate these CRAM files?
Are there alternative approaches to resolve this version mismatch? Environment:
samtools 1.21
Using htslib 1.21
Reference genome: GRCh38_full_analysis_set_plus_decoy_hla.fa
CRAM source: 1000 Genomes Project Phase 3 data
this is not the chrY sequence checksum. Samtoools faidx prints the fasta header and the 'return' char after each line of fasta.
what is the output of
samtools view -H HG00465.final.cram | grep -F '@SEQ' | grep chrY
?samtools view -H HG00465.final.cram | grep -F '@SEQ' | grep chrY => no output for above command
samtools faidx GRCh38_full_analysis_set_plus_decoy_hla-20190212.fa chrY | md5sum => this is type error, it should be samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.fa chrY | md5sum
ah sorry it's @SQ not @SEQ
again, the checksum cannot be calculated from the raw output of samtools faidx.
output:
output:
AGAIN:
the checksum cannot be calculated from the raw output of samtools faidx. the checksum cannot be calculated from the raw output of samtools faidx. the checksum cannot be calculated from the raw output of samtools faidx. the checksum cannot be calculated from the raw output of samtools faidx. the checksum cannot be calculated from the raw output of samtools faidx.
see my comment above.
Yes. You are right. I check the logs, file GRCh38_full_analysis_set_plus_decoy_hla.fa is OK, and is mapping to HG00465.final.cram. Thanks a lot.
I assume the issue is sorted out now. Which of the two threads here have helped you sort the issue. Once you indicate that @Pierre or my comment can be moved to an answer. You can then select that answer (green check mark) to provide closure to this thread.
If I recall a prior discussion right,
samtools
should download the correct reference from Ensembl per one of thesamtools
dev/maintainer (i.e. no need to provide the-T
option. You don't need to manually do it. Have you tried that?no need to provide the -T option remove -T option I can get include chrY records. But not I want. I need to get original fasta file of HG00465.final.cram. In the end need mapping to T2T reference genome.
That means
samtools
is able to find the correct reference but the one you are downloading manually is incorrect.Why do you need to do this if you just want to get the original reads to map to T2T genome. Which you seem to be able to do by doing the following
The reference genomes GRCh38_full_analysis_set_plus_decoy_hla.fa is download from 1000g which download link designated by official 1000g . And download 2 times confirm its whole file md5 is '64b32de2fc934679c16e83a2bc072064'.
Do you need the original reads to map to the T2T genome or are you interested in the original reference to map to T2T genome. You do not need to get the reference if you just need the original reads.
Note: Please use
ADD REPLY
when replying to existing posts.SUBMIT ANSWER
is for new answers to the original question.