Question: Is it possible to directly convert FASTQ to CRAM?
lakhujanivijay (India) wrote 12 months ago (2 votes):

Hi,

I am looking for an efficient storage solution for sequencing data. Based on this data, I believe that I could save a lot of space if I compress my FASTQ or BAM files into CRAM (lossless).

[Image: screenshot dated 2019-07-16, not preserved]

SOURCE: https://www.uppmax.uu.se/support/user-guides/using-cram-to-compress-bam-files/

While I can easily convert BAM to CRAM (via samtools), I am wondering whether it is possible to convert FASTQ to CRAM directly, or whether I will have to go via:

  1. fastq -> unaligned BAM

  2. unaligned BAM -> CRAM

For 1. I think I can use picard

Any clues, or better ideas than this? I welcome open discussion based on your experiences with CRAM.

Tags: storage, cram, bam, fastq
I would say 'no' because CRAM needs a reference genome to store its bases, hence the reads need to be mapped.

what ATpoint said :-)

Reply written 12 months ago by Pierre Lindenbaum
jkbonfield wrote 11 months ago (8 votes):

While CRAM is indeed much smaller than BAM, its primary benefit is when using aligned data so you can use a reference sequence (whether external or embedded). It does however still work without a reference.

I wouldn't recommend creating a fake reference just to get it to swallow things. You shouldn't need -T at all as with unaligned data there are no references to compare against anyway. Also note for aligned data, if you really need to, there is a way to enable referenceless encoding using "--output-fmt-option no_ref=1", although it's not going to be hugely beneficial.
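As a rough illustration of the two cases above, here is a minimal Python sketch that only builds the samtools command lines (it does not run them). The function and file names are hypothetical. `-C` is the samtools view switch for CRAM output, and `no_ref=1` is the format option quoted above. Whether `-T` can be left out may depend on your samtools version, so treat that as something to verify locally.

```python
import shlex


def ubam_to_cram_cmd(ubam, cram):
    """Command for unaligned BAM -> CRAM.

    No -T is passed: with unaligned reads there is no reference
    to compare against (assumption: your samtools accepts this).
    """
    return ["samtools", "view", "-C", "-o", cram, ubam]


def aligned_bam_to_cram_noref_cmd(bam, cram):
    """Command for aligned BAM -> CRAM with referenceless encoding."""
    return [
        "samtools", "view", "-C",
        "--output-fmt-option", "no_ref=1",
        "-o", cram, bam,
    ]


if __name__ == "__main__":
    # Print the commands so they can be inspected or pasted into a shell.
    print(shlex.join(ubam_to_cram_cmd("reads.unaligned.bam", "reads.cram")))
    print(shlex.join(aligned_bam_to_cram_noref_cmd("aln.bam", "aln.noref.cram")))
```

To actually run one of them, pass the returned list to `subprocess.run(cmd, check=True)`.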

Incidentally, FASTQ compression has heated up in recent years and there are far better tools out there, albeit ones that work by doing mini de novo assemblies (via Bloom filters, graphs, or k-mer counting strategies). They're often very CPU- and memory-hungry, but they yield far smaller files than CRAM, as they're essentially doing reference-based compression with the reference computed on the fly.

FaStore, Spring and FQSqueezer are modern tools for this process.


For reference, the manual page appears to say otherwise. Or should it additionally say "not required" when working with unaligned data?

    -T option is required whenever writing CRAM output.

Reply written 11 months ago by genomax
I just noticed that the -T entry in the manual has now been changed to:

-T FILE

    A FASTA format reference FILE, optionally compressed by bgzip and ideally indexed by samtools faidx. If an index is not present, one will be generated for you.

Reply written 7 months ago by genomax
ATpoint (Germany) wrote 12 months ago (6 votes):

CRAM is indeed smaller than BAM thanks to its superior compression, and according to James Bonfield (https://www.sanger.ac.uk/people/directory/bonfield-james) it gets even smaller once an alignment is present; see his response to that tweet.

We typically get sequencing data in uBAM format, and the facility uses FastqToSam from Picard, if that helps you. I would check whether the output can be sent to stdout (it probably can) and then simply pipe it into samtools view, like fastqtosam (...) | samtools view -C -T ref.fa -o out.cram. Still, writing CRAM is pretty slow (I have not benchmarked it, but it really takes a while, notably slower than BAM), so I personally only use it for storage purposes, mainly of the uBAM raw data.
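A minimal sketch of that pipe in Python, under several assumptions that are not confirmed in this thread: a `picard` launcher is on the PATH (many installs use `java -jar picard.jar` instead), FastqToSam is called with the classic KEY=VALUE arguments FASTQ, FASTQ2, SAMPLE_NAME and OUTPUT, writing to /dev/stdout works, and Picard keeps its log messages on stderr. The function names, file names and sample name are hypothetical.

```python
import subprocess


def fastq_to_cram_cmds(fastq1, cram, sample, fastq2=None, reference=None,
                       picard="picard"):
    """Build the two halves of: FastqToSam ... | samtools view -C ..."""
    to_sam = [picard, "FastqToSam", f"FASTQ={fastq1}"]
    if fastq2:
        to_sam.append(f"FASTQ2={fastq2}")  # second mate for paired-end data
    to_sam += [f"SAMPLE_NAME={sample}", "OUTPUT=/dev/stdout"]

    to_cram = ["samtools", "view", "-C"]
    if reference:
        to_cram += ["-T", reference]  # optional reference FASTA
    to_cram += ["-o", cram, "-"]  # "-" means read from stdin
    return to_sam, to_cram


def run_pipeline(to_sam, to_cram):
    """Run both commands connected by a pipe; raise if either one fails."""
    producer = subprocess.Popen(to_sam, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(to_cram, stdin=producer.stdout)
    producer.stdout.close()  # so the producer sees SIGPIPE if samtools exits
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    if producer_rc != 0 or consumer_rc != 0:
        raise RuntimeError(f"pipeline failed: {producer_rc=}, {consumer_rc=}")


if __name__ == "__main__":
    sam_cmd, cram_cmd = fastq_to_cram_cmds("r1.fastq.gz", "out.cram", "sample1",
                                           fastq2="r2.fastq.gz")
    print(" ".join(sam_cmd), "|", " ".join(cram_cmd))
```

Calling `run_pipeline(sam_cmd, cram_cmd)` would execute the pipe; the sketch only prints it, so nothing is written unless you opt in.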

Edit: By the way, because we just had this discussion on Slack: based on my brief testing, it does not matter what sequence you provide as -T, so if you don't have a reference you can just use any random FASTA (even one with a single 1 bp chromosome) and the resulting CRAM should be exactly the same size.

Edit2: As jkbonfield says, when compressing aligned BAM, the compression (for me) typically reduces the file size to roughly 30% of the original BAM, which is quite impressive and useful for storage purposes such as long-term archiving.

This would be a very nice tutorial. I've used the tool SPRING to convert FASTQ directly and written helper scripts for this, but I can't bring myself to commit all my institution's FASTQs to this and delete the originals. Why? Because SPRING is just the work of one talented developer who will likely not be able to maintain it forever.

CRAM, on the other hand, is an international standard, even if it has taken ~10(?) years to start taking off. I would feel a lot safer using this, especially if reinstating the BAM/FASTQ is not completely dependent on the reference (another big risk factor ... ).

Reply written 12 weeks ago by colindaven

Thanks for the suggestion

Reply written 12 months ago by lakhujanivijay

Oh, thanks for that information. That was a really helpful and important piece of information.

Reply written 12 months ago by lakhujanivijay
Charles Warden (Duarte, CA) wrote 11 months ago (0 votes):

While some programs require uncompressed .fastq files, many/most will accept .fastq.gz files.

Creating .fastq.gz files will considerably save space for long-term storage.

I'm not sure how this compares to CRAM, but it has considerable storage savings with greater functionality (since very few programs will accept a .cram file as an input for an alignment)
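To get a feel for the savings on your own data, a minimal sketch using only the Python standard library could gzip a FASTQ file and report the size ratio. The function name and the default compression level are illustrative choices, and the ratio will vary with read length, quality-score diversity and read names.

```python
import gzip
import os
import shutil


def gzip_fastq(src, dst, level=6):
    """Gzip-compress src into dst and return compressed/original size ratio."""
    with open(src, "rb") as fin, gzip.open(dst, "wb", compresslevel=level) as fout:
        shutil.copyfileobj(fin, fout)  # stream in chunks, no full read into memory
    return os.path.getsize(dst) / os.path.getsize(src)
```

For example, `gzip_fastq("reads.fastq", "reads.fastq.gz")` returns a value below 1 when the compressed file is smaller than the original; the original file is left untouched.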

Hi Charles Warden,

Thanks for this comment. However, see this post, where we had a long discussion on the FASTQ file format:

The struggle between fastq and fastq.gz, compressed vs. uncompressed file formats

http://omicsomics.blogspot.com/2012/12/the-trouble-with-fastq.html

Reply written 11 months ago by lakhujanivijay

Thank you for pointing out the .fastq versus .fastq.gz discussion.

.cram functionality is kind of important for both reads (which you are discussing) as well as alignment (since I believe there are programs that don't accept .cram as an input). However, the feedback about the reference being important for the compression is good for other people to know about (and .fastq.gz is not relevant for discussions of .sam versus .bam versus .cram).

Thank you very much for your contribution!

Reply written 11 months ago by Charles Warden