Question: Quick One Liner For Fastq Header Renaming
3
6.2 years ago by
Josh Herr5.6k
University of Nebraska
Josh Herr5.6k wrote:

My annual question here: I received 17 large (each between 5.27 and 7.89 Gb) FASTQ files of transcriptome data from a collaborator. What they sent me has no header information for any of the sequences in the files. Here's the (truncated) first two sequences of the first file:

@No name
CAAGCGTGTCACCTATACCCCTCCGCCGGGGCAAAA
+
????????DDBDDDDDFFFFF9CEEHCECHHHBFHFH
@No name
CCAACTGCTGTTCACACGGAACCTTTCCCCACTTCAG
+
BBBDDBDDDDDFFFFEFIIIIHIIIHHIIHHHHIIIF

Before I QC and do any downstream analyses I would like to give each sequence a number so I can reference each later (i.e. controls and treatments). I want something like this:

@1_1
CAAGCGTGTCACCTATACCCCTCCGCCGGGGCAAAA
+
????????DDBDDDDDFFFFF9CEEHCECHHHBFHFH
@1_2
CCAACTGCTGTTCACACGGAACCTTTCCCCACTTCAG
+
BBBDDBDDDDDFFFFEFIIIIHIIIHHIIHHHHIIIF

etc., etc., for the millions of sequences... but I need to make sure all the sequence data has individual header names that correspond to that particular sequence (so when I rename the next file it needs to be unique for that file).

This shouldn't be that hard, but I googled around on the forum and elsewhere and can't find a solution. I've tried to grep the files and I get a memory error. I tried to use bioawk (with the -c flag) on the giant gzip'ed file but the renaming is erratic.

Admittedly, I really need to work on my awk skills. Can anyone provide a quick awk one liner to rename these headers in the unzipped files that would use minimal memory? Thanks so much for your help.

fastq awk • 11k views
modified 6.2 years ago by Frédéric Mahé2.9k • written 6.2 years ago by Josh Herr5.6k
11
6.2 years ago by
France, Montpellier, CIRAD
Frédéric Mahé2.9k wrote:

Pierre's answer is correct, but I think there is a simpler Awk solution:

zcat file.fastq.gz | awk '{print (NR%4 == 1) ? "@1_" ++i : $0}' | gzip -c > another.fastq.gz

Awk supports conditional expressions (three operands, as in C). A ? B : C reads as: if A is true, then B, else C. Here we print the result of the conditional expression to standard output: a new header on the first line of each four-line record, and the unchanged line otherwise. In Awk you do not need to declare variables; an uninitialized variable evaluates to 0 in a numeric context. Pierre adds 1 to a post-incremented counter (1+i++) so that numbering starts at 1; a pre-increment (++i) gives the same result more directly.
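To cover the "unique for that file" part of the question, the prefix can be passed in as an Awk variable instead of being hard-coded as 1. Here is a minimal sketch, assuming strict four-line records (no wrapped sequence or quality lines). The function name rename_fastq and the demo file names are made up for illustration and do not come from the thread:

```shell
# Hypothetical helper: rename each header to "@<prefix>_<n>".
# Assumes strict 4-line FASTQ records (no wrapped sequence or quality lines).
rename_fastq() {
    awk -v p="$1" '{ print ((NR % 4 == 1) ? "@" p "_" (++i) : $0) }'
}

# Demo on two tiny made-up files; real input would be the 17 FASTQ files.
tmp=$(mktemp -d)
printf '@No name\nACGT\n+\nIIII\n@No name\nTTGA\n+\nHHHH\n' > "$tmp/a.fastq"
printf '@No name\nGGCC\n+\nFFFF\n' > "$tmp/b.fastq"

n=0
for f in "$tmp"/a.fastq "$tmp"/b.fastq; do
    n=$((n + 1))                      # the file index becomes the header prefix
    rename_fastq "$n" < "$f" > "${f%.fastq}.renamed.fastq"
done

cat "$tmp"/*.renamed.fastq
```

Awk streams the input line by line, so memory use should stay flat regardless of file size. For gzipped input, the same function can sit between zcat and gzip -c, as in the one-liner above.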

written 6.2 years ago by Frédéric Mahé2.9k
1

thank you for improving my awk :-)

written 6.2 years ago by Pierre Lindenbaum121k

Thanks for improving my awk too!

written 6.0 years ago by Josh Herr5.6k
8
6.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

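NR is the number of the current record (here, the line number), so NR%4==1 is true on the first line of every four-line FASTQ record, i.e. the header. $0 is the whole current line.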

gunzip -c file.fastq.gz |  awk '{if(NR%4==1) $0=sprintf("@1_%d",(1+i++)); print;}' | gzip -c > another.fastq.gz
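Because the NR%4 test relies on every record being exactly four lines, it may be worth checking the renamed file before moving on. The sketch below is one way to do that without holding headers in memory; check_fastq is a hypothetical helper name, and it assumes the headers were written as @<prefix>_<n> with n counting up from 1:

```shell
# Hypothetical helper: exit status 0 if the stream looks like strict 4-line
# FASTQ whose headers are exactly @<prefix>_1, @<prefix>_2, ... in order.
check_fastq() {
    awk -v p="$1" '
        NR % 4 == 1 { if ($0 != ("@" p "_" (++n))) bad = 1 }    # header line
        NR % 4 == 3 { if (substr($0, 1, 1) != "+") bad = 1 }    # separator line
        END { if (NR % 4 != 0) bad = 1; exit (bad + 0) }        # truncated record
    '
}

# Demo on a tiny made-up stream.
printf '@1_1\nACGT\n+\nIIII\n@1_2\nTTGA\n+\nHHHH\n' | check_fastq 1 \
    && echo "headers look consistent"
```

A non-zero exit status means a header is out of sequence, a separator line is missing, or the line count is not a multiple of four.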
written 6.2 years ago by Pierre Lindenbaum121k

Thanks! This worked like a charm!

written 6.2 years ago by Josh Herr5.6k
4
6.2 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

With Biopieces (www.biopieces.org):

read_fastq -i in.fq | add_ident -k SEQ_NAME -p 1_ | write_fastq -o out.fq -x
written 6.2 years ago by Martin A Hansen3.0k

Thanks for this and biopieces!

written 6.2 years ago by Josh Herr5.6k
2
6.2 years ago by
bioinfo710
New Zealand
bioinfo710 wrote:

I guess we can do it with the FASTX-Toolkit as well, using fastx_renamer with -n COUNT (which uses a plain counter as the name). Here is its help screen:

$ fastx_renamer -h
usage: fastx_renamer [-n TYPE] [-h] [-z] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.10 by A. Gordon (gordon@cshl.edu)

   [-n TYPE]    = rename type:
          SEQ - use the nucleotides sequence as the name.
          COUNT - use simply counter as the name.
   [-h]         = This helpful help screen.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
written 6.2 years ago by bioinfo710

Thanks! I was not aware of this.

written 6.2 years ago by Josh Herr5.6k