Is it possible to directly convert FASTQ to CRAM?
2.4 years ago

Hi,

I am looking for an efficient storage solution for sequencing data. Based on this data, I believe I could save a lot of space by compressing my FASTQ or BAM files into CRAM (lossless).

While I can easily convert BAM to CRAM (via samtools), I am wondering whether it is possible to convert FASTQ to CRAM directly, or whether I will have to go via:

1. fastq -> unaligned BAM

2. unaligned BAM -> CRAM

For step 1, I think I can use Picard.

Any clues, or better ideas than this? I welcome open discussion based on your experiences with CRAM.
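The two steps above could be sketched roughly as follows. This is a minimal sketch rather than a tested pipeline: picard.jar, the file names, and the sample name are placeholders, and it assumes Picard's FastqToSam and samtools view are installed. The helper only prints the commands so they can be reviewed before running anything.

```shell
#!/usr/bin/env bash
# Sketch: print the two commands for FASTQ -> unaligned BAM -> CRAM.
# picard.jar, the file names and the sample name are placeholders (assumptions).
two_step_cmds() {
    local fastq="$1" sample="$2" prefix="$3"
    # Step 1: FASTQ -> unaligned BAM with Picard FastqToSam
    printf 'java -jar picard.jar FastqToSam FASTQ=%s OUTPUT=%s.unaligned.bam SAMPLE_NAME=%s\n' \
        "$fastq" "$prefix" "$sample"
    # Step 2: unaligned BAM -> CRAM with samtools (-C selects CRAM output)
    printf 'samtools view -C -o %s.cram %s.unaligned.bam\n' "$prefix" "$prefix"
}

two_step_cmds reads.fastq sample1 reads
```

Once the printed commands look right for your data, they can be pasted into a terminal or piped to bash.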

cram bam fastq storage • 1.9k views

I would say 'no' because CRAM needs a reference genome to store its bases, hence the reads need to be mapped.

what ATpoint said :-)

2.4 years ago
jkbonfield ▴ 720

While CRAM is indeed much smaller than BAM, its primary benefit is when using aligned data so you can use a reference sequence (whether external or embedded). It does however still work without a reference.

I wouldn't recommend creating a fake reference just to get it to swallow things. You shouldn't need -T at all as with unaligned data there are no references to compare against anyway. Also note for aligned data, if you really need to, there is a way to enable referenceless encoding using "--output-fmt-option no_ref=1", although it's not going to be hugely beneficial.
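As a rough illustration of the two cases described above, the sketch below prints a samtools view command either with no -T at all (unaligned input) or with the no_ref=1 format option (aligned input, referenceless encoding). The mode names and file names are made up for the example; only -C, -o and --output-fmt-option no_ref=1 come from samtools itself.

```shell
#!/usr/bin/env bash
# Sketch: choose samtools view options for writing CRAM without an external reference.
# The mode names ("unaligned", "aligned-noref") and file names are hypothetical.
cram_cmd() {
    local mode="$1" in_bam="$2" out_cram="$3"
    case "$mode" in
        unaligned)
            # Unaligned reads: there is no reference to compare against, so no -T
            echo "samtools view -C -o $out_cram $in_bam" ;;
        aligned-noref)
            # Aligned reads with reference-based encoding explicitly disabled
            echo "samtools view -C --output-fmt-option no_ref=1 -o $out_cram $in_bam" ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1 ;;
    esac
}

cram_cmd unaligned reads.unaligned.bam reads.cram
cram_cmd aligned-noref aligned.bam aligned.noref.cram
```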

Incidentally, FASTQ compression has hotted up in recent years and there are far better tools out there, albeit by doing mini de novo assemblies (via bloom filters, graphs, or k-mer counting strategies). They're often very CPU and memory hungry, but obviously yield far smaller files than CRAM as they're essentially doing reference-based compression with the reference computed on the fly.

FaStore, Spring and FQSqueezer are modern tools for this process.


For reference, the manual page appears to say otherwise. Or should that additionally say "not required" when working with unaligned data?

-T option is required whenever writing CRAM output.


I just noticed that the -T reference in the manual has now been changed to:

-T FILE

A FASTA format reference FILE, optionally compressed by bgzip and ideally indexed by samtools faidx. If an index is not present, one will be generated for you

2.4 years ago
ATpoint 55k

CRAM is indeed smaller than BAM thanks to its superior compression, and according to James Bonfield (https://www.sanger.ac.uk/people/directory/bonfield-james) it gets even smaller once an alignment is present; see his response to that tweet.

We typically get sequencing data in uBAM format, and the facility uses FastqToSam from Picard, if that helps you. I would check whether it can send its output to stdout (it probably can) and then simply pipe it into samtools view, like fastqtosam (...) | samtools view -T ref.fa -o out.cram.

Still, writing CRAM is pretty slow (I have not benchmarked it, but it really takes a while, notably slower than BAM), so I personally only use it for storage purposes, mainly of the uBAM raw data.

Edit: By the way, because we just had this discussion on Slack: based on my short testing, it is irrelevant which sequence you provide as -T, so if you don't have a reference you can just use any random FASTA (even one with a single 1 bp chromosome) and the resulting CRAM should be exactly the same size.

Edit2: As jkbonfield says, when compressing aligned BAM, the compression (for me) typically reduces the file size to roughly 30% of the original BAM, which is quite impressive and useful for storage purposes such as long-term archiving.
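The piping idea above might look something like the sketch below. Treat it as an assumption-laden outline: picard.jar and all file names are placeholders, and writing FastqToSam output to /dev/stdout is exactly the part the post says would need checking. The -T argument is added only if a reference FASTA is passed in. The helper prints the pipeline instead of running it.

```shell
#!/usr/bin/env bash
# Sketch: print a "FastqToSam | samtools view" pipeline.
# picard.jar and all file names are placeholders; OUTPUT=/dev/stdout is an assumption.
pipe_cmd() {
    local fastq="$1" sample="$2" out_cram="$3" ref="${4:-}"
    local view="samtools view -C"
    # Add -T only when a reference FASTA is supplied
    if [ -n "$ref" ]; then
        view="$view -T $ref"
    fi
    # QUIET=true suppresses Picard's job-summary messages; COMPRESSION_LEVEL=0 avoids
    # compressing an intermediate BAM stream that samtools re-encodes straight away.
    # The trailing "-" tells samtools view to read from stdin.
    echo "java -jar picard.jar FastqToSam FASTQ=$fastq OUTPUT=/dev/stdout SAMPLE_NAME=$sample QUIET=true COMPRESSION_LEVEL=0 | $view -o $out_cram -"
}

pipe_cmd reads.fastq sample1 reads.cram
pipe_cmd reads.fastq sample1 reads.cram ref.fa
```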


This would be a very nice tutorial. I've used the tool SPRING to convert FASTQ directly and written helper scripts for this, but I can't bring myself to commit all my institution's FASTQs to this and delete the originals. Why? Because Spring is just the work of one talented developer who will likely not be able to maintain it forever.

CRAM on the other hand is an international standard, even if it has taken ~10(?) years to start taking off. I would feel a lot safer using this, especially if reinstating the BAM/FASTQ is not completely dependent on the reference (another big risk factor ... ).
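Before deleting any originals, one way to gain some confidence is to regenerate FASTQ from the archive and compare it with the source. The sketch below is one possible check, not an established procedure: it checksums only the sequence and quality lines, because read names can be altered by conversion tools, and it assumes plain 4-line FASTQ records. The fastq_digest helper and the file names in the comments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: checksum only the sequence and quality lines of a FASTQ stream,
# so that read-name differences do not affect the comparison.
# Assumes plain 4-line FASTQ records (no wrapped sequences).
fastq_digest() {
    awk 'NR % 4 == 2 || NR % 4 == 0' | cksum
}

# Tiny demonstration: same bases and qualities, different read names
a="$(printf '@r1\nACGT\n+\nIIII\n' | fastq_digest)"
b="$(printf '@read_one\nACGT\n+\nIIII\n' | fastq_digest)"
if [ "$a" = "$b" ]; then echo "match"; else echo "MISMATCH"; fi

# With real data (file names are placeholders):
#   orig="$(fastq_digest < reads.fastq)"
#   back="$(samtools fastq reads.cram | fastq_digest)"
#   [ "$orig" = "$back" ] && echo "round trip preserved bases and qualities"
```

This only shows that bases and qualities survive in the same order; it says nothing about read names or other metadata.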


Thanks for the suggestion


Oh, thanks for that information. That was a really helpful and important piece of information.

2.4 years ago

While some programs require uncompressed .fastq files, many/most will accept .fastq.gz files.

Creating .fastq.gz files will save considerable space for long-term storage.

I'm not sure how this compares to CRAM, but it gives considerable storage savings with greater functionality (since very few programs will accept a .cram file as an input for an alignment).
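To get a feel for the saving on your own files, something like the following could be used. It is only a sketch: it generates a small synthetic FASTQ in a temporary directory, and because that data is highly repetitive the ratio it reports is not representative of real reads.

```shell
#!/usr/bin/env bash
# Sketch: compare the size of a FASTQ file before and after gzip.
# The synthetic records below are stand-in data, not real reads.
tmpdir="$(mktemp -d)"
fq="$tmpdir/demo.fastq"

# Write 1000 four-line records that differ only in the read name
for ((i = 1; i <= 1000; i++)); do
    printf '@read%d\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n' "$i"
done > "$fq"

# -c writes the compressed data to stdout, so the original file is kept
gzip -c "$fq" > "$fq.gz"

# tr strips the padding some wc implementations add around the number
raw_bytes="$(wc -c < "$fq" | tr -d ' ')"
gz_bytes="$(wc -c < "$fq.gz" | tr -d ' ')"
echo "raw: $raw_bytes bytes, gzip: $gz_bytes bytes"

rm -r "$tmpdir"
```

Pointing fq at a real FASTQ file (and dropping the generation loop and the final rm) gives a more meaningful number.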


Thanks for this comment. However, see this post where we have long discussions on fastq file format.

The struggle between fastq and fastq.gz, compressed v/s uncompressed file formats

http://omicsomics.blogspot.com/2012/12/the-trouble-with-fastq.html


Thank you for pointing out the .fastq versus .fastq.gz discussion.

.cram functionality is kind of important for both reads (which you are discussing) as well as alignment (since I believe there are programs that don't accept .cram as an input). However, the feedback about the reference being important for the compression is good for other people to know about (and .fastq.gz is not relevant for discussions of .sam versus .bam versus .cram).

Thank you very much for your contribution!