Question: awk or/and cut command for reformatting sequence header identifiers of fastq file
0
gravatar for qinglong
8 months ago by
qinglong10
qinglong10 wrote:

Hi there,

I have error-corrected reads with header names changed (four lines per one sequence):

@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 BH:changed:3
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 BH:changed:3
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<

I need simplified headers (four lines per one sequence):

@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<

Please noted the ASCII_BASE 33 format (@ is also available in the quality profile) for my Illumina reads.

Thanks!

ADD COMMENTlink modified 8 months ago by c_dampier60 • written 8 months ago by qinglong10
3
gravatar for c_dampier
8 months ago by
c_dampier60
USA/UVA
c_dampier60 wrote:

I think this (edited per h.mon comments) awk code will give you the desired output:

awk 'NR%4==1 {$3=""} NR%4==3 {$0=substr($0,1,1)} {print $0}' <my.fq>

The first rule finds each 4th line starting with line 1, and its associated action changes the third field (separated by " ") to empty.

The next rule finds each 4th line starting with line 3, and its associated action takes a sub-string of the line, starting at position 1, and keeps 1 character.

The third (implicit) rule finds every line, and its associated action prints the full line. Note that this final action prints the lines after they have been acted upon by the first two rules.

Original (unsafe) code for reference:

awk '/^@/ {$3=""} /^+/ {$0=substr($0,1,1)} {print $0}' <my.fq>
ADD COMMENTlink modified 8 months ago • written 8 months ago by c_dampier60
2

@ and + are part of the character set used to encode quality, so they are not safe to use for header / comment line determination. A better solution is to parse based on line numbers, as most fastq files place sequences in groups of four lines. The original fastq specification allow for sequence and quality line wrapping, but this is nearly nonexistent, and the four-line per record fastq became the de facto standard.

ADD REPLYlink modified 8 months ago • written 8 months ago by h.mon28k

Thank you, h.mon. I think the edited code will be safer.

ADD REPLYlink modified 8 months ago • written 8 months ago by c_dampier60
1

Thanks Christopher! Very clean and clear!

ADD REPLYlink modified 8 months ago • written 8 months ago by qinglong10

You're welcome, qinglong. I apologize for potentially leading you astray with my original code. Glad h.mon was there to save us.

ADD REPLYlink written 8 months ago by c_dampier60

I spent half day to figure out a awk command for this purpose; but luckily to have your help.

One stupid thing I met is that awk may not work for compressed files, have to decompress files then manipulate the files. Just a note for others who may be also using this command as.

ADD REPLYlink written 8 months ago by qinglong10
1

You can pipe several commands into one, avoiding the need to create intermediary files:

zcat file.fastq.gz | awk 'NR%4==1 {$3=""} NR%4==3 {$0=substr($0,1,1)} {print $0}' | gzip > file_edited.fastq.gz
ADD REPLYlink written 8 months ago by h.mon28k

Yes, you are right; that is exactly what I have done when I was constructing my in-house pipe.

ADD REPLYlink modified 8 months ago • written 8 months ago by qinglong10
1
gravatar for cpad0112
8 months ago by
cpad011212k
India
cpad011212k wrote:

with gnu-sed ( I copy pasted OP sequence thrice to check the code):

$ sed  '3~4 s/.*/+/; 1~4 s/[^ ]*$//' test.fq

@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<
@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<
@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<
ADD COMMENTlink modified 8 months ago • written 8 months ago by cpad011212k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2030 users visited in the last hour