Converting CRAM to BAM without reference fasta
2
1
Entering edit mode
3 months ago
Ld_60 ▴ 40

Hello,

I have a couple of CRAM files which I am aiming to convert to BAM files. I tried with these two commands:

samtools view -b -T ref.fa -o output_bam.bam input_cram.cram

and:

samtools view -b -o output_bam.bam input_cram.cram

The second command does not specify a reference file. I had the exact same BAM output from both commands: can someone please explain why this is the case? I read pretty much everywhere that the reference file is required for conversion (since CRAM does not contain bases that match with the reference). Is it really needed?

Of note, I am running samtools v 1.9.

Thanks for your help.

samtools BAM CRAM • 197 views
ADD COMMENT
2
Entering edit mode
3 months ago

samtools is able to find the reference sequence if the path to the fasta is specified in the sam dictionary in the UR field.

@SQ SN:ref  LN:45   M5:7a66cae8ab14aef8d635bc80649e730b UR:/path/to/toy.fa
@SQ SN:ref2 LN:40   M5:1636753510ec27476fdd109a6684680e UR:/path/to/toy.fa
ADD COMMENT
0
Entering edit mode

Hi Pierre, many thanks for the very quick reply! I also suspected the UR field in the CRAM header, however what I do not understand is that the UR path is not local to my machine: indeed, I got these CRAM files from a publication. The UR is something like this: UR:/lustre/.../fasta/hs37d5.fa. There is also an AS field that reads as: AS:NCBI37. Is there a way for samtools to get the reference even when it is not local to my working space at all? Thanks!

ADD REPLY
2
Entering edit mode

yes: if samtools finds the MD5 SUM and is able to find/fetch the reference from REF_PATH or REF_CACHE . see : http://www.htslib.org/doc/samtools.html

The search order to obtain a reference is:

Use any local file specified by the command line options (eg -T).

Look for MD5 via REF_CACHE environment variable.

Look for MD5 in each element of the REF_PATH environment variable.

Look for a local file listed in the UR: header tag.

ADD REPLY
0
Entering edit mode
3 months ago
Ld_60 ▴ 40

Mystery solved! I had a quick read at http://www.htslib.org/doc/samtools.html#REFERENCE_SEQUENCES (see also environment variables). To quickly summarise what's happening: if no reference is specified in the command, samtools would look at the fields present in the @SQ headers, including the M5: tags. samtools uses the M5 tag of each contig in the header and would look for it under http://www.ebi.ac.uk/ena/cram/md5/, then download its corresponding sequence to $HOME/.cache if $XDG_CACHE_HOME is not set.

I checked my $HOME/.cache directory and could find the references downloaded there, according to their MD5, which explains why the second command above worked without specifying the reference.

ADD COMMENT

Login before adding your answer.

Traffic: 1949 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6