Linux bash: How to pipe several files?
1
0
Entering edit mode
22 months ago
Simon ▴ 10

Hi,

I have several compressed fastq files (basically text files containing genomic data) for which I want to count the occurrence of a string. For this I have been using:

zgrep -c "STRING" *.fq.gz > Results.txt

This gives me a file containing a list of file:count pairs.

However, because I got many fastq files, this took too long time. Now what I want is to:

  1. Read all the files (with wildcard, not in a loop)
  2. Pick out only lines starting with "@"
  3. Split these lines on ":" and select column 10
  4. Count occurrence of the string
  5. Save to file that should contain a list of file:count pairs

What I have tried is the following:

zcat *.fq.gz | grep '^@' | cut -d : -f 10 | grep -c "STRING" > Results.txt

This does not work. The resulting file only contain the count for one of the files. I read about xargs and tried to use it but without any success.

Any suggestions for how to make this work is highly appreciated.

wildcard pipe Linux • 1.3k views
ADD COMMENT
0
Entering edit mode

zcat *.fq.gz | grep '^@' | cut -d : -f 10 | grep -c "STRING" > Results.txt

This does not work.

well it should work...

ADD REPLY
0
Entering edit mode

OP please define "does not work": no result (empty Results.txt) or exits with non-zero exit condition. Did you check the output of the cut command contains the keyword you grep for?

ADD REPLY
0
Entering edit mode

Yes, the output of the cut command contains the keyword I grep for. My confusion came from the fact that I had expected my final output file to contain a list of file:count pairs. However, thanks to the comment by Michael I now realize that zcat is the wrong command to use in this case. So I may after all have to use a for loop.

ADD REPLY
0
Entering edit mode
 grep '^@' 

It's a bad usage. The quality string of a fastq CAN also start with '@'.

ADD REPLY
0
Entering edit mode

Thank you for the comment. Yes, you are right, the quality string of my fastq files do start with a "@". It is a mistake by me.

ADD REPLY
0
Entering edit mode

While it is tempting to use unix utils (and they are certainly useful) use a tool designed to work with sequences: https://bioinf.shenwei.me/seqkit/usage/#grep or https://bioinf.shenwei.me/seqkit/usage/#locate

ADD REPLY
0
Entering edit mode

Thank you for the suggestions. I will definitely look into those.

ADD REPLY
0
Entering edit mode

Don't use a generic tool like zgrep for parsing huge files like fastq. Use tools like seqkit that can grep gz files and headers only. In addition, seqkit supports threads. Post an example file and expected outcome.

ADD REPLY
2
Entering edit mode
22 months ago
Michael 54k

The resulting file only contain the count for one of the files.

No, it contains the counts for "STRING" over all the files concatenated, therefore, only a single value. Once you use zcat or cat like this, the files can no longer be distinguished. Therefore you have to use a for loop to achieve this.

ADD COMMENT
0
Entering edit mode

Thank you Michael. That explains it.

ADD REPLY

Login before adding your answer.

Traffic: 2605 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6