bedtools bamtofastq generates FASTQ which cannot be gzipped correctly?
1
0
6 months ago
francois ▴ 20

I align forward and reverse fastq reads to a fasta reference using bwa, then sort the alignment file:

bwa mem -M -t 16 ref_galr2a_AA.fa E01_S49_L001_R1_001.fastq.gz E01_S49_L001_R2_001.fastq.gz | samtools sort > test.bam

Now say I want to convert this BAM back to FASTQ files; I usually run:

bedtools bamtofastq -i test.bam -fq  forward.fastq -fq2 reverse.fastq

Then I compress the result with gzip:

gzip forward.fastq

This creates the file forward.fastq.gz.

On macOS, I can typically double-click a file like this, and I'll get the fastq file inside (it uses Archive Utility by default).

However, with this one I get:

Unable to expand "forward.fastq.gz" into "folder". (Error 79 - Inappropriate file type or format.)

This does not happen if I decompress the original FASTQ and gzip it again directly.

For example, say I decompress E01_S49_L001_R1_001.fastq.gz by double-clicking it (Archive Utility), then run:

gzip E01_S49_L001_R1_001.fastq

This regenerates E01_S49_L001_R1_001.fastq.gz, which I am able to open with Archive Utility.

So that tells me bedtools bamtofastq is not innocent? What is going on?

gzip miseq bedtools fastq bamtofastq • 1.6k views
ADD COMMENT
1

what is the output of

file  forward.fastq.gz

and the output of

gunzip -t  forward.fastq.gz
ADD REPLY
0
file forward.fastq.gz

gives

forward.fastq.gz: gzip compressed data, was "forward.fastq", last modified: Thu Mar 25 18:58:43 2021, from Unix, original size modulo 2^32 2839777


gunzip -t forward.fastq.gz

Does not give anything at all, weirdly.
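For reference, gzip -t (and gunzip -t) prints nothing when the archive passes its integrity check; the result is reported through the exit status. A minimal sketch, using a throwaway file named demo.gz as a hypothetical stand-in for forward.fastq.gz:

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
printf 'hello\n' | gzip > "$workdir/demo.gz"   # demo.gz is a stand-in name

# gzip -t is silent on success, so inspect the exit status instead.
if gzip -t "$workdir/demo.gz"; then
    status="intact"
else
    status="corrupt"
fi
echo "demo.gz is $status"
```

Running the same check against the real file should show whether the silence above simply means the archive is fine.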


Of note:

gunzip forward.fastq.gz

Works fine. It gives forward.fastq, which I can read in a text editor.

So the issue is somehow related to Archive Utility? But only for fastq.gz files made from FASTQ generated by bedtools bamtofastq? Awkward.

ADD REPLY
1

My interest is piqued. I'd like to see that file that cannot be decompressed.

It would be quite the achievement and a world first.

If true, however unlikely that is, I find it fitting that a bioinformatics program would create that file.

ADD REPLY
0

Haha, thanks for your interest.

Can you download it here?

ADD REPLY
3
6 months ago

A cool find, though it ends up being caused by a ludicrous IOS bug.

The file may have content that cannot be decompressed by double-clicking on a Mac because it matches the mtree format. The content does not have to be complicated; even simple words can cause problems. A quick demo:

echo hello | gzip > foo.gz

The resulting foo.gz file, when double-clicked on the Mac, will fail to decompress with the same error message:

(Error 79 - Inappropriate file type or format.)

More details can be read here:

ADD COMMENT
0

by a ludicrous IOS bug

This does not happen on macOS Big Sur. I can make the compressed file (gzip forward.fastq) and then click on the compressed file in Finder. It generates the forward.fastq file and keeps the compressed copy. So perhaps this is iOS-specific.

ADD REPLY
0

I also have Big Sur (mine is version 1.11), and it does happen for me when I do:

echo hello | gzip > forward.gz
ADD REPLY
0

I can confirm that this exact command generates the same error on Big Sur v. 11.2.3 via Finder (but gunzip forward.gz on the command line works).

ADD REPLY
0

On Mojave 10.14.6 all code examples of this thread work fine, curious to see what this is all about.

ADD REPLY
0

The problem was introduced with Catalina and is present in Big Sur as well.

ADD REPLY
0

Ha, that sounds like the solution!

Would love to call it a day, but wait... It happens with any file name I think?

Example


bwa mem -M -t 16 ref_galr2a_AA.fa E01_S49_L001_R1_001.fastq.gz E01_S49_L001_R2_001.fastq.gz | samtools sort > E01.bam

bedtools bamtofastq -i E01.bam -fq  E01_R1.fastq -fq2 E01_R2.fastq

gzip E01_R1.fastq

I get E01_R1.fastq.gz, which I still cannot open with Archive Utility (same error). As before, I can open it correctly on the command line (gunzip E01_R1.fastq.gz) or with another unarchiving tool (e.g. The Unarchiver).
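The command-line route can be sketched as follows. This is only an illustration: sample.fastq.gz and its single made-up record are hypothetical stand-ins for E01_R1.fastq.gz. Writing through gzip -dc keeps the compressed copy, much as Archive Utility does when it works.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)

# Stand-in for E01_R1.fastq.gz: one made-up FASTQ record (hypothetical data).
printf '@read1\nACGT\n+\nFFFF\n' | gzip > "$workdir/sample.fastq.gz"

# Decompress to a new file; -c writes to stdout, so the .gz copy is kept.
gzip -dc "$workdir/sample.fastq.gz" > "$workdir/sample.fastq"

# Count the lines recovered (tr strips the padding some wc versions add).
record_lines=$(wc -l < "$workdir/sample.fastq" | tr -d ' ')
echo "decompressed $record_lines lines"
```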


Surely you must be right about the mtree format though... Would there also be something that bedtools bamtofastq writes to the file?

I have added E01.bam here if you want to download it

ADD REPLY
1

My previous explanation was a bit off. It is not the file name but the way the file starts that causes the problem.

Seemingly simple content inside the file can lead to the error. Example:

# This works
echo foo | gzip > foo.gz

# This fails
echo hello | gzip > foo.gz

# This works
echo "a:b" | gzip > foo.gz

# This fails
echo "aa:bb" | gzip > foo.gz

Your FASTQ file starts with @M00865:351:000000000-DBM8J:1:1101:16572:2773 as the read name. Lo and behold, it fails:

echo '@M00865:351:000000000-DBM8J:1:1101:16572:2773' | gzip > foo.gz

Today I learned something new!

When using macOS Catalina or Big Sur, gzipped FASTQ files that use the standard Illumina naming convention cannot be decompressed by double-clicking on them!
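To try these cases side by side, a small sketch like the one below writes each test string to its own file instead of overwriting foo.gz. The case_N.gz names are made up for illustration. Every file should pass gzip -t on the command line, which is consistent with the error coming from how Archive Utility interprets the content rather than from a damaged archive.

```shell
# Write each test string to its own .gz file in a scratch directory.
workdir=$(mktemp -d)
n=0
valid=0
for line in 'foo' 'hello' 'a:b' 'aa:bb' '@M00865:351:000000000-DBM8J:1:1101:16572:2773'; do
    n=$((n + 1))
    printf '%s\n' "$line" | gzip > "$workdir/case_$n.gz"
    # gzip -t checks archive integrity; it says nothing about Archive Utility.
    if gzip -t "$workdir/case_$n.gz"; then
        valid=$((valid + 1))
    fi
done
echo "$valid of $n files pass gzip -t; test files are in $workdir"
```

The files can then be double-clicked one at a time in Finder to see which ones Archive Utility rejects.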

ADD REPLY
1

Gzipped FASTQ files that use the standard Illumina naming convention cannot be decompressed by double-clicking on them!

I am afraid that is not true.

$ more forward.fastq
@A00153:690:HJVN5DSXY:1:1101:1470:1000 1:N:0:GAACGTGA+AACTGGTG
AGGTCAGTCACATGGTTAGGACGCAGATAGACAACGAAAACGAACGGGATAAAATATTTAACTTGCGGGACGGATTCAGCTCTCACTACGACCAGCACTACCTAAGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAACGTGAA
+
F:FFFF:FFFFF:F:,,F:,FF:FFFFFFFFFFFF:FFFF,FFFF,:,FFFFFFFFFFFFF,FFFFF:,F,FFFFFFFF,FFFFFFFFF,FFFF,:F,F::FF,::,F:FFF,FFFFFFF,F::FF,:,FFFFFFF:FFF,F:FF,FF::
@A00153:690:HJVN5DSXY:1:1101:3568:1000 2:N:0:GAACGTGA+AACTGGTG
AGGTCAGTCACATGGTTAGGACGCAGCGAGTAAACGAAAACGAACGGGATAAATACGGTAATCGAAAACCGATACGATCGGCATAGAAAAGGTTGACAAGGAAATTGACGAATTGAAGCAGAAACTGGAAAACTTGGTAAAACAAGAAGC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF
@A00153:690:HJVN5DSXY:1:1101:4056:1000 2:N:0:GAACGTGA+AACTGGTG
AGGTCAGTCACATGGTTAGGACGCAGCGAGTAAACGAAAACGAACGGGATAAATACGGTAATCGAAAACCGATACGATCCGGTCGGGTTAAAGTCGAAATCGGACGGGAACCGGTATTTTTGTTCGGTAAAATCACACATGGCTACGAAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

$ gzip forward.fastq

I can double-click it in Finder and it produces the FASTQ file.

(Screenshot: Finder window)

ADD REPLY
1

Maybe I generalized too soon. Amusingly:

# This works:
echo '@A00153:690:HJVN5DSXY:1:1101:1470:1000 1:N:0:GAACGTGA+AACTGGTG' | gzip > foo.gz 

# This fails:
echo '@A00153:690:HJVN5DSXY:1:1101:1470:10001:N:0:GAACGTGA+AACTGGTG' | gzip > foo.gz 

I removed a single space character from the read name and now it fails. Not all FASTQ files will fail, but some sure do :-) For example, those created from a BAM file: the part after the whitespace is not part of the read ID, so it is not stored in the BAM file and is absent from the regenerated FASTQ.
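A quick way to see which kind of header a gzipped FASTQ starts with is sketched below. The helper name header_has_comment and the two file names are made up for illustration. Whether a missing comment field actually predicts the Archive Utility error is only suggested by the examples in this thread, not established.

```shell
# Succeeds if the first header line of a gzipped FASTQ contains a space,
# i.e. it has a comment field after the read name.
header_has_comment() {
    first_line=$(gzip -dc "$1" | head -n 1)
    case "$first_line" in
        *' '*) return 0 ;;
        *)     return 1 ;;
    esac
}

workdir=$(mktemp -d)
# Header as written by the sequencer: read name, space, comment field.
printf '%s\n' '@A00153:690:HJVN5DSXY:1:1101:1470:1000 1:N:0:GAACGTGA+AACTGGTG' | gzip > "$workdir/with_comment.gz"
# Header as it would look after a BAM round trip: read name only.
printf '%s\n' '@A00153:690:HJVN5DSXY:1:1101:1470:1000' | gzip > "$workdir/name_only.gz"

for f in "$workdir/with_comment.gz" "$workdir/name_only.gz"; do
    if header_has_comment "$f"; then
        echo "$(basename "$f"): header has a comment field"
    else
        echo "$(basename "$f"): header is the read name only"
    fi
done
```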

ADD REPLY
