Quick One Liner For Fastq Header Renaming
5
3
9.2 years ago
Josh Herr 5.7k

My annual question here: I received 17 large (each between 5.27 and 7.89 Gb) FASTQ files of transcriptome data from a collaborator. What they sent me has no header information for any of the sequences in the files. Here's the (truncated) first two sequences of the first file:

@No name
CAAGCGTGTCACCTATACCCCTCCGCCGGGGCAAAA
+
????????DDBDDDDDFFFFF9CEEHCECHHHBFHFH
@No name
CCAACTGCTGTTCACACGGAACCTTTCCCCACTTCAG
+
BBBDDBDDDDDFFFFEFIIIIHIIIHHIIHHHHIIIF


Before I QC and do any downstream analyses I would like to give each sequence a number so I can reference each later (i.e. controls and treatments). I want something like this:

@1_1
CAAGCGTGTCACCTATACCCCTCCGCCGGGGCAAAA
+
????????DDBDDDDDFFFFF9CEEHCECHHHBFHFH
@1_2
CCAACTGCTGTTCACACGGAACCTTTCCCCACTTCAG
+
BBBDDBDDDDDFFFFEFIIIIHIIIHHIIHHHHIIIF


etc., etc., for the millions of sequences... but I need to make sure all the sequence data has individual header names that correspond to that particular sequence (so when I rename the next file it needs to be unique for that file).

This shouldn't be that hard, but I googled around on the forum and elsewhere and can't find a solution. I've tried to grep the files and I get a memory error. I tried to use bioawk (with the -c flag) on the giant gzip'ed file but the renaming is erratic.

Admittedly, I really need to work on my awk skills. Can anyone provide a quick awk one liner to rename these headers in the unzipped files that would use minimal memory? Thanks so much for your help.

fastq awk • 17k views
16
9.2 years ago

Pierre's answer is correct, but I think there is a simpler Awk solution:

zcat file.fastq.gz | awk '{print (NR%4 == 1) ? "@1_" ++i : $0}' | gzip -c > another.fastq.gz

With Awk, you can use conditional expressions (expressions with three operands, as in C). A ? B : C reads: if A is true, then B, else C. Here, we print the result of the conditional expression to standard output. With Awk, you do not need to declare your variables (they are automatically initialized to 0). Pierre's solution of adding 1 so that the first value is 1 can be simplified by using pre-increment (++i) instead of post-increment (i++).

2
thank you for improving my awk :-)

0
Thanks for improving my awk too!

9
9.2 years ago

NR is the number of the current row (line). $0 is the current whole line.

gunzip -c file.fastq.gz | awk '{if(NR%4==1) $0=sprintf("@1_%d",(1+i++)); print;}' | gzip -c > another.fastq.gz

0
Thanks! This worked like a charm!

0
6.4 years later still helpful, thanks :)

4
9.2 years ago

With Biopieces (www.biopieces.org):

read_fastq -i in.fq | add_ident -k SEQ_NAME -p 1_ | write_fastq -o out.fq -x

0
Thanks for this and biopieces!

2
9.2 years ago
bioinfo ▴ 810

I guess we can do it with the fastx toolkit as well. Here it is:

$ fastx_renamer -h
usage: fastx_renamer [-n TYPE] [-h] [-z] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.10 by A. Gordon (gordon@cshl.edu)

[-n TYPE]    = rename type:
SEQ - use the nucleotides sequence as the name.
COUNT - use simply counter as the name.
[-h]         = This helpful help screen.
[-z]         = Compress output with GZIP.
[-i INFILE]  = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.

0

Thanks! I was not aware of this.

0
2.8 years ago
ATpoint 62k

Very simple with seqtk rename, see https://github.com/lh3/seqtk

0

Thanks for pointing that one out for posterity! When I asked the question 6.4 years ago the seqtk tool did not exist!
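
For the multi-file case in the original question (17 files, each needing its own prefix), the Awk answers above could be wrapped roughly as follows. This is only a sketch: the function names `rename_fastq` and `rename_all` and the output naming `renamed_<k>.fastq.gz` are made up for illustration, and it assumes strict 4-line FASTQ records (no wrapped sequence or quality lines).

```shell
# rename_fastq PREFIX
# Reads FASTQ on stdin and writes it to stdout, replacing every header
# line (lines 1, 5, 9, ...) with "@PREFIX_<n>", where n counts records from 1.
# Headers are found by position, so a quality line that happens to start
# with "@" is left alone.
rename_fastq() {
    awk -v prefix="$1" 'NR % 4 == 1 { $0 = "@" prefix "_" (++n) } { print }'
}

# rename_all FILE.fastq.gz [FILE.fastq.gz ...]
# Hypothetical wrapper: the k-th file on the command line gets prefix k
# and is written to renamed_<k>.fastq.gz in the current directory.
rename_all() {
    k=0
    for f in "$@"; do
        k=$((k + 1))
        gunzip -c "$f" | rename_fastq "$k" | gzip -c > "renamed_${k}.fastq.gz"
    done
}
```

Calling `rename_all *.fastq.gz` would then give the first file headers `@1_1`, `@1_2`, ... and the second file `@2_1`, `@2_2`, and so on. Everything is streamed line by line, so memory use should not depend on file size.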