Question: Is it possible to directly convert FASTQ to CRAM?
lakhujanivijay (India) wrote, 5 weeks ago:

Hi,

I am looking for an efficient storage solution for sequencing data. Based on the data below, I believe I could save a lot of space by compressing my FASTQ or BAM files into CRAM (lossless).

[Image: Screenshot-from-2019-07-16-11-24-02]

SOURCE: https://www.uppmax.uu.se/support/user-guides/using-cram-to-compress-bam-files/

While I can easily convert BAM to CRAM (via samtools), I am wondering whether it is possible to convert FASTQ to CRAM directly, or whether I have to go via:

  1. fastq -> unaligned BAM

  2. unaligned BAM -> CRAM

For step 1, I think I can use Picard.

Any clues, or better ideas than this? I welcome open discussion based on your experiences with CRAM.

Tags: storage, cram, bam, fastq

I would say 'no', because CRAM needs a reference genome to store its bases, hence the reads need to be mapped.

what ATpoint said :-)

Reply written 5 weeks ago by Pierre Lindenbaum

jkbonfield wrote, 5 weeks ago:

While CRAM is indeed much smaller than BAM, its primary benefit is when using aligned data so you can use a reference sequence (whether external or embedded). It does however still work without a reference.

I wouldn't recommend creating a fake reference just to get it to swallow things. You shouldn't need -T at all as with unaligned data there are no references to compare against anyway. Also note for aligned data, if you really need to, there is a way to enable referenceless encoding using "--output-fmt-option no_ref=1", although it's not going to be hugely beneficial.
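A minimal sketch of the two cases described above, assuming samtools is on the PATH. The file names `reads.unaligned.bam` and `aligned.bam` are hypothetical placeholders, and each conversion is skipped when samtools or the input file is missing.

```shell
#!/bin/sh
# Hypothetical input names; substitute your own files.
ubam="reads.unaligned.bam"
aligned="aligned.bam"

# Derive output names by swapping the .bam suffix.
ubam_cram="${ubam%.bam}.cram"
noref_cram="${aligned%.bam}.noref.cram"

if command -v samtools >/dev/null 2>&1; then
    if [ -f "$ubam" ]; then
        # Unaligned reads: -C selects CRAM output. No -T is given because
        # unaligned records have no reference to be compared against.
        samtools view -C -o "$ubam_cram" "$ubam"
    fi
    if [ -f "$aligned" ]; then
        # Aligned reads, forcing referenceless encoding with no_ref=1.
        samtools view -C --output-fmt-option no_ref=1 -o "$noref_cram" "$aligned"
    fi
else
    echo "samtools not found; nothing was converted" >&2
fi
```

Comparing the size of the `no_ref=1` output against a normal reference-based CRAM of the same file is an easy way to see how much the reference contributes.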

Incidentally, FASTQ compression has heated up in recent years and there are far better tools out there, albeit because they do mini de novo assemblies (via Bloom filters, graphs, or k-mer counting strategies). They're often very CPU and memory hungry, but obviously yield far smaller files than CRAM as they're essentially doing reference-based compression with the reference computed on the fly.

FaStore, Spring and FQSqueezer are modern tools for this process.


For reference, the manual page appears to say otherwise. Or should it additionally say "not required" when working with unaligned data?

-T option is required whenever writing CRAM output.

Reply written 5 weeks ago by genomax

ATpoint (Germany) wrote, 5 weeks ago:

CRAM is indeed smaller than BAM thanks to its superior compression, and according to James Bonfield (https://www.sanger.ac.uk/people/directory/bonfield-james) it gets even smaller once an alignment is present; see his response to that tweet:

We typically get sequencing data in uBAM format, and the facility uses FastqToSam from Picard, if that helps you. I would check whether one can send the output to stdout (probably one can) and then simply pipe it into samtools view, like fastqtosam (...) | samtools view -C -T ref.fa -o out.cram -. Still, writing CRAM is pretty slow (I have not benchmarked it, but it really takes a while, notably slower than BAM), so I personally only use it for storage purposes, mainly of the uBAM raw data.
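The pipe could look roughly like the sketch below. It is untested and assumes that Picard is available as a local `picard.jar` and that FastqToSam can write its BAM to `/dev/stdout`; `reads.fastq` and `sample1` are placeholder names. Nothing runs unless the tools and the input exist. `-T` is left out on the assumption that unaligned reads need no reference; add `-T ref.fa` if your samtools version insists on one.

```shell
#!/bin/sh
# Placeholder names; substitute your own input, sample name and jar path.
fastq="reads.fastq"
sample="sample1"
picard_jar="picard.jar"
out_cram="${fastq%.fastq}.cram"

if command -v java >/dev/null 2>&1 && command -v samtools >/dev/null 2>&1 \
    && [ -f "$picard_jar" ] && [ -f "$fastq" ]; then
    # FastqToSam emits an unaligned BAM stream on stdout (Picard logs to
    # stderr); COMPRESSION_LEVEL=0 avoids compressing data that is only
    # passed through the pipe. samtools reads the stream from "-".
    java -jar "$picard_jar" FastqToSam \
        F1="$fastq" O=/dev/stdout SM="$sample" COMPRESSION_LEVEL=0 \
        | samtools view -C -o "$out_cram" -
else
    echo "java, samtools, $picard_jar or $fastq missing; skipping" >&2
fi
```

For paired-end data, FastqToSam also takes the second mate file via `F2=`.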

Edit: By the way, because we just had this discussion on Slack: based on my short testing, it is irrelevant which sequence you provide as -T for unaligned data, so if you don't have a reference you can just use any random FASTA (even one with a single 1 bp chromosome) and the resulting CRAM should be exactly the same size.


Thanks for the suggestion

Reply written 5 weeks ago by lakhujanivijay

Oh, thanks for that information. That was a really helpful and important piece of information.

Reply written 5 weeks ago by lakhujanivijay

Charles Warden (Duarte, CA) wrote, 5 weeks ago:

While some programs require uncompressed .fastq files, many/most will accept .fastq.gz files.

Creating .fastq.gz files saves considerable space for long-term storage.

I'm not sure how this compares to CRAM, but it gives considerable storage savings with greater functionality (since very few programs will accept a .cram file as an input for an alignment).
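As a small illustration of the .fastq.gz route, the sketch below gzips a made-up FASTQ file, reports both sizes, and checks that decompression restores the original bytes. The demo reads are synthetic and highly repetitive, so the ratio it prints is not representative of real sequencing data.

```shell
#!/bin/sh
workdir=$(mktemp -d)
fastq="$workdir/demo.fastq"

# Write 1000 copies of a made-up 4-line FASTQ record.
i=0
while [ "$i" -lt 1000 ]; do
    printf '@read%s\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIII\n' "$i"
    i=$((i + 1))
done > "$fastq"

# -c writes to stdout, so the uncompressed file is kept for comparison.
gzip -c "$fastq" > "$fastq.gz"

raw_size=$(wc -c < "$fastq" | tr -d ' ')
gz_size=$(wc -c < "$fastq.gz" | tr -d ' ')
echo "raw: $raw_size bytes, gzipped: $gz_size bytes"

# Decompress to stdout and compare byte-for-byte with the original.
if gzip -dc "$fastq.gz" | cmp -s - "$fastq"; then
    roundtrip="ok"
else
    roundtrip="mismatch"
fi
echo "round trip: $roundtrip"
```

Running the same size comparison on a real FASTQ and on its CRAM would give a direct answer to how the two options compare for a given data set.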


Hi Charles Warden

Thanks for this comment. However, see this post where we have long discussions on fastq file format.

The struggle between fastq and fastq.gz, compressed v/s uncompressed file formats

http://omicsomics.blogspot.com/2012/12/the-trouble-with-fastq.html

Reply written 5 weeks ago by lakhujanivijay

Thank you for pointing out the .fastq versus .fastq.gz discussion.

Support for .cram matters both for reads (which you are discussing) and for alignments (since I believe there are programs that don't accept .cram as an input). However, the feedback that the reference is important for the compression is good for other people to know (and .fastq.gz is not relevant to discussions of .sam versus .bam versus .cram).
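For programs that only accept FASTQ, one possible workaround is to convert a stored CRAM back on demand. A minimal sketch, assuming samtools is on the PATH and that `reads.cram` (a hypothetical name) holds unaligned single-end reads:

```shell
#!/bin/sh
# Hypothetical unaligned CRAM; substitute your own file.
cram="reads.cram"
out_fastq="${cram%.cram}.fastq.gz"

if command -v samtools >/dev/null 2>&1 && [ -f "$cram" ]; then
    # samtools fastq writes FASTQ records to stdout; gzip compresses them.
    samtools fastq "$cram" | gzip -c > "$out_fastq"
else
    echo "samtools or $cram missing; skipping" >&2
fi
```

For paired-end data, samtools fastq has -1 and -2 options to write the two mates to separate files.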

Thank you very much for your contribution!

Reply written 5 weeks ago by Charles Warden