Manipulate Sequences In Fastq Files
10.4 years ago
lsvijfhuizen ▴ 90

Dear All,

I have 20x Illumina sequencing data in large FASTQ files. Each read is 21 nucleotides long, and I would like to remove the first 4 nucleotides from every read in the files.

For example, I would like to go from

@D5N3XBQ1:129:C0T9LACXX:1:1101:1227:2122 1:N:0:
CATGATTTGATATTTAGGGCTT
+
HIFHIEGHIIFHGIIGHIIIDH
@D5N3XBQ1:129:C0T9LACXX:1:1101:1150:2163 1:N:0:
CATGATGACATAGAAATAATTT
+
IIFIIIIIIIIIIIFIFIIIFI
@D5N3XBQ1:129:C0T9LACXX:1:1101:1155:2248 1:N:0:
CATGAAGACAAAGCCTCTATGA


to

@D5N3XBQ1:129:C0T9LACXX:1:1101:1227:2122 1:N:0:
ATTTGATATTTAGGGCTT
+
HIFHIEGHIIFHGIIGHIIIDH
@D5N3XBQ1:129:C0T9LACXX:1:1101:1150:2163 1:N:0:
ATGACATAGAAATAATTT
+
IIFIIIIIIIIIIIFIFIIIFI
@D5N3XBQ1:129:C0T9LACXX:1:1101:1155:2248 1:N:0:
AAGACAAAGCCTCTATGA


I am new to bioinformatics and would appreciate a few pointers on the best way to get this done with the command line in Linux. Thanks, Lisanne


I have edited my answer: presumably you also want to remove the first 4 characters of each quality string?

10.4 years ago
Neilfws 49k

sed '2~4s/^\(.\{4\}\)//' myfile > newfile


Translated, that says: starting from line 2, on every 4th line (the sequence lines), replace the first 4 characters with nothing, i.e. remove them. Note that the first~step address form (2~4) is a GNU sed extension.

If you don't want to write to newfile, run:

sed -i '2~4s/^\(.\{4\}\)//' myfile


to edit myfile "in place".

EDIT

I think you will also want to remove the corresponding characters from the quality score lines. So you should run:

sed '2~2s/^\(.\{4\}\)//' myfile > newfile


Other useful command line tools for text processing: grep, awk, cut, paste, head, tail.
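
For anyone who would rather script this, the "every 2nd line, starting from line 2" logic can be sketched in Python. This is only a minimal illustration, not a replacement for the sed command: the function name trim_fastq_lines is made up here, and it assumes strict 4-line FASTQ records (no wrapped sequence lines).

```python
from typing import Iterable, Iterator


def trim_fastq_lines(lines: Iterable[str], n: int = 4) -> Iterator[str]:
    """Yield FASTQ lines with the first n characters removed from the
    sequence line (2nd) and quality line (4th) of every record.

    Assumption: each record is exactly 4 lines long.
    """
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        # Even-numbered lines are the sequence and quality lines,
        # the same lines that sed's 2~2 address selects.
        if number % 2 == 0:
            text = text[n:]
        yield text


# First record from the question.
record = [
    "@D5N3XBQ1:129:C0T9LACXX:1:1101:1227:2122 1:N:0:",
    "CATGATTTGATATTTAGGGCTT",
    "+",
    "HIFHIEGHIIFHGIIGHIIIDH",
]
trimmed = list(trim_fastq_lines(record))
print("\n".join(trimmed))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so a large file is processed one line at a time rather than loaded into memory.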

10.4 years ago

As an alternative approach, the FASTX-Toolkit provides a number of command line utilities for manipulating FASTQ and FASTA sequence files.

fastx_trimmer has lots of options for trimming sequences in a variety of ways. To achieve what you are after (trim first 4 bases from each read):

fastx_trimmer -f 5 -z -i infile.fq.gz -o outfile.fq.gz

The -f 5 option is the key one here: it says that the 5th nucleotide of each sequence is the first one you want to keep (i.e. the first 4 are discarded). The -z option compresses the output with gzip.
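
To make the 1-based "first base to keep" convention concrete, here is a small Python sketch of how such a coordinate maps onto 0-based slicing. The helper keep_range is hypothetical and is not part of the FASTX-Toolkit; its last parameter is included only to show an inclusive upper bound, and the same range would be applied to the quality string so that both stay the same length.

```python
from typing import Optional


def keep_range(read: str, first: int = 1, last: Optional[int] = None) -> str:
    """Keep positions first..last of read (1-based, inclusive).

    Hypothetical helper: first=5 means the 5th character is the first one
    kept, so the first 4 are dropped. last=None keeps everything to the end.
    """
    if first < 1:
        raise ValueError("first must be >= 1")
    # 1-based inclusive [first, last] is the 0-based slice [first - 1, last).
    return read[first - 1:last]


seq = "CATGATTTGATATTTAGGGCTT"
qual = "HIFHIEGHIIFHGIIGHIIIDH"

print(keep_range(seq, first=5))
print(keep_range(qual, first=5))
```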

10.4 years ago
JC 13k

A Perl one-liner that trims both the sequence and the quality lines:

perl -pe 's/^.{4}// if $. % 2 == 0' file.fq > trimmed.fq