MD5 Mismatch on chrY When Extracting CRAM Region Despite Correct Reference Genome
0
0
Entering edit mode
7 days ago
offset • 0

Issue: When trying to extract chrY from a 1000 Genomes CRAM file using samtools view, I encounter a reference MD5 checksum mismatch specifically for chrY. The error persists even after using the official 1000 Genomes reference genome.

Error Message:
[E::cram_decode_slice] MD5 checksum reference mismatch at chrY:2781477-2857046  
[E::cram_decode_slice] CRAM  : 2b9a5e1407a4a59d4c9d3112ab77fb81  
[E::cram_decode_slice] Ref   : f0de8a8cc1c1d1a691f4d26a7698f8b0  
[E::cram_decode_slice] @SQ M5: ce3e31103314a704255f3cd90369ecce

Steps Taken:

  1. Reference Genome Download: Used the official 1000 Genomes reference:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

Full file MD5: 64b32de2fc934679c16e83a2bc072064 (matches documentation)

Extracted with gzip -d, no reformatting applied.

  1. Validation:

Whole reference file MD5 matches expectations.

chrY-specific MD5 mismatch:

samtools faidx GRCh38_full_analysis_set_plus_decoy_hla-20190212.fa chrY | md5sum
# Returns: f0de8a8cc1c1d1a691f4d26a7698f8b0
vs CRAM header expectation: ce3e31103314a704255f3cd90369ecce.
  1. Command Used:
samtools view -h -b -f 3 \
  -T GRCh38_full_analysis_set_plus_decoy_hla.fa \
  HG00465.final.cram chrY > chrY.bam

Key Observations:

The issue is chromosome-specific (chrY fails, other chromosomes work).

Official reference files were used without modification.

Reproduced across multiple downloads of the reference genome.

Question:

Why does the chrY sequence in the official 1000 Genomes reference (GRCh38_full_analysis_set_plus_decoy_hla.fa) have a different MD5 (f0de8a8cc1c1d1a691f4d26a7698f8b0) than what the CRAM file expects (ce3e31103314a704255f3cd90369ecce )?

How can I obtain the exact reference version used to generate these CRAM files?

Are there alternative approaches to resolve this version mismatch? Environment:

samtools 1.21
Using htslib 1.21

Reference genome: GRCh38_full_analysis_set_plus_decoy_hla.fa

CRAM source: 1000 Genomes Project Phase 3 data

reference-genome • 732 views
ADD COMMENT
0
Entering edit mode

chrY-specific MD5 mismatch: samtools faidx GRCh38_full_analysis_set_plus_decoy_hla-20190212.fa chrY | md5sum

this is not the chrY sequence checksum. Samtoools faidx prints the fasta header and the 'return' char after each line of fasta.

what is the output of samtools view -H HG00465.final.cram | grep -F '@SEQ' | grep chrY ?

ADD REPLY
0
Entering edit mode

samtools view -H HG00465.final.cram | grep -F '@SEQ' | grep chrY => no output for above command

samtools faidx GRCh38_full_analysis_set_plus_decoy_hla-20190212.fa chrY | md5sum => this is type error, it should be samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.fa chrY | md5sum

ADD REPLY
0
Entering edit mode

=> no output for above command

ah sorry it's @SQ not @SEQ

this is type error

again, the checksum cannot be calculated from the raw output of samtools faidx.

$ cat rotavirus_rf.dict  | grep RF01 | grep -Eo "M5\:[a-z0-9]+"
M5:59dccb944425dd61f895a564ad7b56a7
$ samtools faidx rotavirus_rf.fa "RF01"| md5sum
c68a132286b3c7b0600e0e7543cb6d17  -
$ samtools faidx rotavirus_rf.fa "RF01"| grep -v '^>' | tr -d '\n' | tr "a-z" "A-Z" | md5sum
59dccb944425dd61f895a564ad7b56a7  -
ADD REPLY
0
Entering edit mode
  samtools view -H  HG00465/HG00465.final.cram | grep -F '@SQ' | grep chrY

output:

@SQ     SN:chrY LN:57227415     M5:ce3e31103314a704255f3cd90369ecce     UR:/gpfs/internal/sweng/production/Resources/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa
@SQ     SN:chrY_KI270740v1_random       LN:37240        M5:69e42252aead509bf56f1ea6fda91405     UR:/gpfs/internal/sweng/production/Resources/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa

samtools faidx  GRCh38_full_analysis_set_plus_decoy_hla.fa chrY | md5sum

output:

f0de8a8cc1c1d1a691f4d26a7698f8b0  -
ADD REPLY
0
Entering edit mode

samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.fa chrY | md5sum

AGAIN:

the checksum cannot be calculated from the raw output of samtools faidx. the checksum cannot be calculated from the raw output of samtools faidx. the checksum cannot be calculated from the raw output of samtools faidx. the checksum cannot be calculated from the raw output of samtools faidx. the checksum cannot be calculated from the raw output of samtools faidx.

see my comment above.

ADD REPLY
0
Entering edit mode

the checksum cannot be calculated from the raw output of samtools faidx.

Yes. You are right. I check the logs, file GRCh38_full_analysis_set_plus_decoy_hla.fa is OK, and is mapping to HG00465.final.cram. Thanks a lot.

ADD REPLY
0
Entering edit mode

I assume the issue is sorted out now. Which of the two threads here have helped you sort the issue. Once you indicate that @Pierre or my comment can be moved to an answer. You can then select that answer (green check mark) to provide closure to this thread.

ADD REPLY
0
Entering edit mode

If I recall a prior discussion right, samtools should download the correct reference from Ensembl per one of the samtools dev/maintainer (i.e. no need to provide the -T option. You don't need to manually do it. Have you tried that?

ADD REPLY
0
Entering edit mode

no need to provide the -T option remove -T option I can get include chrY records. But not I want. I need to get original fasta file of HG00465.final.cram. In the end need mapping to T2T reference genome.

ADD REPLY
0
Entering edit mode

That means samtools is able to find the correct reference but the one you are downloading manually is incorrect.

I need to get original fasta file of HG00465.final.cram.

Why do you need to do this if you just want to get the original reads to map to T2T genome. Which you seem to be able to do by doing the following

remove -T option I can get include chrY records.

ADD REPLY
0
Entering edit mode

That means samtools is able to find the correct reference but the one you are downloading manually is incorrect.

The reference genomes GRCh38_full_analysis_set_plus_decoy_hla.fa is download from 1000g which download link designated by official 1000g . And download 2 times confirm its whole file md5 is '64b32de2fc934679c16e83a2bc072064'.

ADD REPLY
0
Entering edit mode

Do you need the original reads to map to the T2T genome or are you interested in the original reference to map to T2T genome. You do not need to get the reference if you just need the original reads.

Note: Please use ADD REPLY when replying to existing posts. SUBMIT ANSWER is for new answers to the original question.

ADD REPLY

Login before adding your answer.

Traffic: 1422 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6