I have several compressed fastq files (basically text files containing genomic data) for which I want to count the occurrence of a string. For this I have been using:
zgrep -c "STRING" *.fq.gz > Results.txt
This gives me a file containing a list of file:count pairs.
However, because I got many fastq files, this took too long time. Now what I want is to:
- Read all the files (with wildcard, not in a loop)
- Pick out only lines starting with "@"
- Split these lines on ":" and select column 10
- Count occurrence of the string
- Save to file that should contain a list of file:count pairs
What I have tried is the following:
zcat *.fq.gz | grep '^@' | cut -d : -f 10 | grep -c "STRING" > Results.txt
This does not work. The resulting file only contain the count for one of the files. I read about xargs and tried to use it but without any success.
Any suggestions for how to make this work is highly appreciated.
well it should work...
OP please define "does not work": no result (empty Results.txt) or exits with non-zero exit condition. Did you check the output of the cut command contains the keyword you grep for?
Yes, the output of the cut command contains the keyword I grep for. My confusion came from the fact that I had expected my final output file to contain a list of file:count pairs. However, thanks to the comment by Michael I now realize that zcat is the wrong command to use in this case. So I may after all have to use a for loop.
It's a bad usage. The quality string of a fastq CAN also start with '@'.
Thank you for the comment. Yes, you are right, the quality string of my fastq files do start with a "@". It is a mistake by me.
While it is tempting to use unix utils (and they are certainly useful) use a tool designed to work with sequences: https://bioinf.shenwei.me/seqkit/usage/#grep or https://bioinf.shenwei.me/seqkit/usage/#locate
Thank you for the suggestions. I will definitely look into those.
Don't use a generic tool like zgrep for parsing huge files like fastq. Use tools like seqkit that can grep gz files and headers only. In addition, seqkit supports threads. Post an example file and expected outcome.