Question

Quick One Liner For Fastq Header Renaming

3

11.0 years ago

Josh Herr 5.8k

My annual question here: I received 17 large (each between 5.27 and 7.89 Gb) FASTQ files of transcriptome data from a collaborator. What they sent me has no header information for any of the sequences in the files. Here's the (truncated) first two sequences of the first file:

@No name
CAAGCGTGTCACCTATACCCCTCCGCCGGGGCAAAA
+
????????DDBDDDDDFFFFF9CEEHCECHHHBFHFH
@No name
CCAACTGCTGTTCACACGGAACCTTTCCCCACTTCAG
+
BBBDDBDDDDDFFFFEFIIIIHIIIHHIIHHHHIIIF

Before I QC and do any downstream analyses I would like to give each sequence a number so I can reference each later (i.e. controls and treatments). I want something like this:

@1_1
CAAGCGTGTCACCTATACCCCTCCGCCGGGGCAAAA
+
????????DDBDDDDDFFFFF9CEEHCECHHHBFHFH
@1_2
CCAACTGCTGTTCACACGGAACCTTTCCCCACTTCAG
+
BBBDDBDDDDDFFFFEFIIIIHIIIHHIIHHHHIIIF

etc., etc., for the millions of sequences... but I need to make sure all the sequence data has individual header names that correspond to that particular sequence (so when I rename the next file it needs to be unique for that file).

This shouldn't be that hard, but I googled around on the forum and elsewhere and couldn't find a solution. I tried to grep the files and got a memory error. I also tried bioawk (with the -c flag) on the giant gzipped file, but the renaming was erratic.

Admittedly, I really need to work on my awk skills. Can anyone provide a quick awk one-liner to rename these headers in the unzipped files that would use minimal memory? Thanks so much for your help.

fastq awk • 21k views

ADD COMMENT • link updated 4.6 years ago by ATpoint 81k • written 11.0 years ago by Josh Herr 5.8k

score 16 · Answer 1 · 2013-04-15

16

11.0 years ago

Frédéric Mahé ★ 3.2k

Pierre's answer is correct, but I think there is a simpler Awk solution:

zcat file.fastq.gz | awk '{print (NR%4 == 1) ? "@1_" ++i : $0}' | gzip -c > another.fastq.gz

With Awk, you can use conditional expressions (expressions with three operands, as in C). The form (A ? B : C) reads: if A is true, then B, else C. Here, we print the result of the conditional expression to standard output: a new header for the first line of every four-line record (NR%4 == 1), and the unchanged line ($0) otherwise. Awk variables do not need to be declared; an uninitialized variable evaluates to 0 in a numeric context. Pierre's way of making the first number 1 (1+i++) can be simplified by using pre-incrementation (++i) instead of post-incrementation (i++).
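For anyone who wants the same logic spelled out, here is a minimal sketch of the one-liner as a small shell function with the file prefix passed in as a variable. The function name rename_fastq_headers and the variable names prefix and n are my own illustrative choices, not part of any tool. It assumes strict four-line FASTQ records (no wrapped sequence lines). Because it decides by line position rather than by matching "@", a quality line that happens to start with "@" is left untouched.

```shell
# Hypothetical helper (the name is illustrative): read FASTQ on stdin and
# write it to stdout with headers renumbered as @<prefix>_<n>.
# Assumes every record is exactly four lines long.
rename_fastq_headers() {
    awk -v prefix="$1" '
        NR % 4 == 1 { print "@" prefix "_" (++n); next }  # 1st line of each record: new header
        { print }                                          # sequence, "+" and quality lines unchanged
    '
}
```

It would be used along the lines of: zcat file.fastq.gz | rename_fastq_headers 1 | gzip -c > another.fastq.gz. Awk only keeps the current line and one counter in memory, so file size should not matter.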

ADD COMMENT • link 11.0 years ago by Frédéric Mahé ★ 3.2k

2

thank you for improving my awk :-)

ADD REPLY • link 11.0 years ago by Pierre Lindenbaum 161k

0

Thanks for improving my awk too!

ADD REPLY • link 10.8 years ago by Josh Herr 5.8k

score 9 · Answer 2 · 2013-04-07

9

11.0 years ago

Pierre Lindenbaum 161k

NR is the number of the current line (the count of records read so far). $0 is the whole current line.

gunzip -c file.fastq.gz | awk '{if(NR%4==1) $0=sprintf("@1_%d",(1+i++)); print;}' | gzip -c > another.fastq.gz
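Since the question mentions 17 files that each need their own prefix, the same command can be wrapped in a loop. This is only a sketch under a few assumptions of mine: the inputs are gzip-compressed and end in .fastq.gz, the prefix is simply the position of the file on the command line, and the function name rename_all_fastq and the .renamed.fastq.gz output suffix are made up for illustration.

```shell
# Hypothetical wrapper (names are illustrative): file number k gets prefix k,
# so its headers become @k_1, @k_2, ... Assumes four-line FASTQ records and
# gzip-compressed inputs named *.fastq.gz.
rename_all_fastq() {
    k=0
    for f in "$@"; do
        k=$((k + 1))
        gunzip -c "$f" |
            awk -v p="$k" 'NR % 4 == 1 { $0 = "@" p "_" (++i) } { print }' |
            gzip -c > "${f%.fastq.gz}.renamed.fastq.gz"
    done
}
```

Called as rename_all_fastq *.fastq.gz, it writes one renamed file next to each input and leaves the originals alone. Note that the shell expands the glob in sorted order, so it is worth recording which file ended up with which prefix.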

ADD COMMENT • link 11.0 years ago by Pierre Lindenbaum 161k

0

Thanks! This worked like a charm!

ADD REPLY • link 11.0 years ago by Josh Herr 5.8k

0

6.4 years later still helpful, thanks :)

ADD REPLY • link 4.7 years ago by ATpoint 81k

score 4 · Answer 3 · 2013-04-07

4

11.0 years ago

Martin A Hansen 3.0k

With Biopieces (www.biopieces.org):

read_fastq -i in.fq | add_ident -k SEQ_NAME -p 1_ | write_fastq -o out.fq -x

ADD COMMENT • link 11.0 years ago by Martin A Hansen 3.0k

0

Thanks for this and biopieces!

ADD REPLY • link 11.0 years ago by Josh Herr 5.8k

score 2 · Answer 4 · 2013-04-08

2

11.0 years ago

bioinfo ▴ 830

I think this can be done with the FASTX-Toolkit as well, using fastx_renamer. Here is its help output:

$ fastx_renamer -h
usage: fastx_renamer [-n TYPE] [-h] [-z] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.10 by A. Gordon (gordon@cshl.edu)

   [-n TYPE]    = rename type:
          SEQ - use the nucleotides sequence as the name.
          COUNT - use simply counter as the name.
   [-h]         = This helpful help screen.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.

ADD COMMENT • link 11.0 years ago by bioinfo ▴ 830

0

Thanks! I was not aware of this.

ADD REPLY • link 11.0 years ago by Josh Herr 5.8k

score 1 · Answer 5 · 2019-09-08

1

4.6 years ago

ATpoint 81k

Very simple with seqtk rename, see https://github.com/lh3/seqtk

ADD COMMENT • link 4.6 years ago by ATpoint 81k

0

Thanks for pointing that one out for posterity! When I asked the question 6.4 years ago the seqtk tool did not exist!

ADD REPLY • link 4.6 years ago by Josh Herr 5.8k