Question

Linux bash: How to pipe several files?

0

Entering edit mode

2.2 years ago

Simon ▴ 10

Hi,

I have several compressed fastq files (basically text files containing genomic data) for which I want to count the occurrence of a string. For this I have been using:

zgrep -c "STRING" *.fq.gz > Results.txt

This gives me a file containing a list of file:count pairs.

However, because I got many fastq files, this took too long time. Now what I want is to:

Read all the files (with wildcard, not in a loop)
Pick out only lines starting with "@"
Split these lines on ":" and select column 10
Count occurrence of the string
Save to file that should contain a list of file:count pairs

What I have tried is the following:

zcat *.fq.gz | grep '^@' | cut -d : -f 10 | grep -c "STRING" > Results.txt

This does not work. The resulting file only contain the count for one of the files. I read about xargs and tried to use it but without any success.

Any suggestions for how to make this work is highly appreciated.

wildcard pipe Linux • 1.5k views

ADD COMMENT • link updated 2.2 years ago by cpad0112 21k • written 2.2 years ago by Simon ▴ 10

0

Entering edit mode

zcat *.fq.gz | grep '^@' | cut -d : -f 10 | grep -c "STRING" > Results.txt

This does not work.

well it should work...

ADD REPLY • link 2.2 years ago by Pierre Lindenbaum 163k

0

Entering edit mode

OP please define "does not work": no result (empty Results.txt) or exits with non-zero exit condition. Did you check the output of the cut command contains the keyword you grep for?

ADD REPLY • link 2.2 years ago by Carambakaracho ★ 3.3k

0

Entering edit mode

Yes, the output of the cut command contains the keyword I grep for. My confusion came from the fact that I had expected my final output file to contain a list of file:count pairs. However, thanks to the comment by Michael I now realize that zcat is the wrong command to use in this case. So I may after all have to use a for loop.

ADD REPLY • link 2.2 years ago by Simon ▴ 10

0

Entering edit mode

 grep '^@'

It's a bad usage. The quality string of a fastq CAN also start with '@'.

ADD REPLY • link 2.2 years ago by Pierre Lindenbaum 163k

0

Entering edit mode

Thank you for the comment. Yes, you are right, the quality string of my fastq files do start with a "@". It is a mistake by me.

ADD REPLY • link 2.2 years ago by Simon ▴ 10

0

Entering edit mode

While it is tempting to use unix utils (and they are certainly useful) use a tool designed to work with sequences: https://bioinf.shenwei.me/seqkit/usage/#grep or https://bioinf.shenwei.me/seqkit/usage/#locate

ADD REPLY • link 2.2 years ago by GenoMax 145k

0

Entering edit mode

Thank you for the suggestions. I will definitely look into those.

ADD REPLY • link 2.2 years ago by Simon ▴ 10

0

Entering edit mode

Don't use a generic tool like zgrep for parsing huge files like fastq. Use tools like seqkit that can grep gz files and headers only. In addition, seqkit supports threads. Post an example file and expected outcome.

ADD REPLY • link 2.2 years ago by cpad0112 21k

score 2 · Accepted Answer · 2022-06-09

2

Entering edit mode

2.2 years ago

Michael 55k

The resulting file only contain the count for one of the files.

No, it contains the counts for "STRING" over all the files concatenated, therefore, only a single value. Once you use zcat or cat like this, the files can no longer be distinguished. Therefore you have to use a for loop to achieve this.