Is it possible to output uncompressed FastQ files with bcl2fastq2?

I would like to know if it's possible to produce uncompressed FastQ files from raw data (BCL files) using Illumina's software bcl2fastq2.

From the bcl2fastq2 v2.20 documentation, I saw the following:

--no-bgzf-compression: Turn off BGZF and use GZIP to compress FASTQ files. BGZF compression allows downstream applications to decompress in parallel. This option is available for FASTQ data consumers that cannot handle standard GZIP formats.

However, this only turns off BGZF compression; it still produces *.gz files.

Does anyone know if there's a way to produce uncompressed FastQ files with this software?

Otherwise, could you suggest other tools (besides Picard IlluminaBasecallsToFastq) that can perform this conversion using RunInfo.xml and the other files found in the raw run folder produced by the sequencer?

The reason is that the gzipped FastQ output uses the DEFLATE algorithm, so decompressing huge files takes a long time even with a tool like pigz, since gzip decompression can only use a single thread. Imagine having 100 zipped samples: it would take something like 4 days to unzip all of them.

The idea would be to use a better-performing compression tool such as lbzip2, which supports multithreaded compression and decompression and produces files even smaller than bgzipped ones.
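Roughly what I have in mind (a sketch; file names and the thread count are placeholders, and lbzip2 is assumed to be installed):

    # Recompress bcl2fastq output from gzip to bzip2 with lbzip2,
    # streaming so no intermediate uncompressed FastQ hits the disk.
    for fq in Sample_*/*.fastq.gz; do
        pigz -dc "$fq" | lbzip2 -n 16 > "${fq%.gz}.bz2"
    done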

Imagine having 100 zipped samples: it would take something like 4 days to unzip all of them.

But you can do this in parallel (see the sketch below).

Also, most software, like bwa, can read gzipped files directly.
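A minimal sketch of what I mean, assuming GNU parallel is available (file names are placeholders):

    # Decompress many gzipped FastQ files concurrently, one file per
    # core; a single gzip stream itself can only be read by one thread.
    parallel -j "$(nproc)" 'gzip -dk {}' ::: *.fastq.gz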

But you can do this in parallel.

Yeah, it's possible to do it in parallel. I have a cluster that processes these files 15 at a time, but if the files are too big, even parallelizing across samples takes time.

Most software, like bwa, can read gzipped files directly.

The issue is not about reading gzip files. When sequencing lots of data, the files take up too much storage space, and this becomes costly in the long run. So I am looking for a better way to store the data. For that, I first want to produce uncompressed FastQ from the BCL files.

When sequencing lots of data, the files take up too much storage space, and this becomes costly in the long run. So I am looking for a better way to store the data. For that, I first want to produce uncompressed FastQ from the BCL files.

That is a different issue then. How are you planning to use uncompressed fastq files to save space?

There are other ways of compressing fastq files (e.g. storing them as unaligned BAM/CRAM or using alternate fastq compression techniques like SPRING). I am not sure it is worth your time to mess with primary data (which should really be backed up as is). If you have the desire/time to do this then you can certainly pursue that path.
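For illustration, a rough SPRING sketch (flags as given in the SPRING README; verify against your installed version, and file names are placeholders):

    # Compress a gzipped paired-end FastQ set into one .spring archive
    # (-c = compress, -g = gzipped input, -t = threads).
    spring -c -i sample_R1.fastq.gz sample_R2.fastq.gz -o sample.spring -g -t 8

    # Decompress back to FastQ when needed (-d = decompress).
    spring -d -i sample.spring -o sample_R1.fastq sample_R2.fastq -t 8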

Thanks for the SPRING link. That's useful!

How are you planning to use uncompressed fastq files to save space?

I explained before:

The idea would be to use a better-performing compression tool such as lbzip2, which supports multithreaded compression and decompression and produces files even smaller than bgzipped ones.

I will compress the FastQ files with lbzip2, which is the more capable compression/decompression tool. What I do now is: once bcl2fastq2 outputs fastq.gz, I decompress the files and then recompress them with lbzip2, roughly as sketched below.

I am not sure it is worth your time to mess with primary data (which should really be backed up as is).

I am not altering the primary data in any way ... just optimizing the compression.
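For instance (a sketch; file names are placeholders), I recompress and check that the round trip preserves the data before deleting the original:

    # Stream from gzip straight into lbzip2, no intermediate file.
    pigz -dc sample.fastq.gz | lbzip2 -n 16 > sample.fastq.bz2

    # Compare checksums of the two decompressed streams and delete
    # the original .gz only if they match.
    orig=$(pigz -dc sample.fastq.gz | md5sum)
    new=$(lbzip2 -dc -n 16 sample.fastq.bz2 | md5sum)
    [ "$orig" = "$new" ] && rm sample.fastq.gz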

ok, I see.

Why do you want this?

Most bioinformatics tools can read (b)gzipped files. And all of the tools that provide random access expect bgzip; otherwise no random access is possible.
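If decompression speed is the concern, one option is to recompress to multithreaded BGZF and keep the possibility of random access (a sketch; bgzip ships with htslib, and file names are placeholders):

    # Re-encode standard gzip as BGZF using 8 threads; the output is
    # still valid gzip for any downstream reader.
    pigz -dc sample.fastq.gz | bgzip -@ 8 > sample.bgzf.fastq.gz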

Why do you want this?

In the lab, we do a lot of sequencing that produces very large raw data, for example 1.1 TB per run. I have a sequencing analysis pipeline that does everything, and it ends up at 5.6 TB for that same example. Using different compression methods, you could save a lot of space and, above all, time.

Consider (or convince your supervisor of) the need for purchasing additional storage/backup as a cost of doing business. If you are doing a lot of sequencing (which means you are buying reagents that are rather expensive), then plan to add more backup/storage space alongside.

Thank you for your advice.

We have enough storage. But it wouldn't be wise to keep upgrading storage and spending money when we can find solutions that optimize compression and save both money and time.

Storage, money and time - pick two to save. You cannot save all three. No matter how optimized your compression, it will not stop your storage requirement from ballooning. Compression algorithms can save space, not time.

I am wondering if there is a way to have bcl2fastq output uncompressed fastq files, and I am not seeing that addressed in the answers and comments here. Is it possible to go directly to uncompressed fastq files using bcl2fastq?

Some parts of my pipeline only work on uncompressed fastq files. Starting with uncompressed fastq would save hours off my process.
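For pipeline steps that can read from a stream, something like this avoids the uncompressed copy in the meantime (a sketch; some_tool is a placeholder and bash process substitution is assumed):

    # Hand the tool a plain FastQ stream without writing it to disk.
    some_tool --input <(pigz -dc sample.fastq.gz) --output result.txt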

Yes, there is. Use the following option:

    --no-bgzf-compression                           turn off BGZF compression for FASTQ files
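For example (a sketch; paths are placeholders, and note the discussion above reporting that this option still produced *.fastq.gz):

    bcl2fastq --runfolder-dir /path/to/run_folder \
              --output-dir /path/to/fastq_output \
              --no-bgzf-compression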