Question

How To Parse Fastq File?

1

Entering edit mode

11.3 years ago

SK ▴ 110

I am new in NGS. I need to format some fastq files. For example I want to replace characters /1 and /2 with -1 and - 2 in the files having data like @HWUSI-EAS1671_0001:5:1:1022:10290/1. I tried to do it in perl with pattern matching but the files are too big to parse. Could anyone please give me suggestion on share any code?

fastq • 8.2k views

ADD COMMENT • link updated 8.8 years ago by Biostar 20 • written 11.3 years ago by SK ▴ 110

5

Entering edit mode

Well, files are never too big to parse with Perl if you don't keep the file in RAM.

ADD REPLY • link 11.3 years ago by lh3 33k

0

Entering edit mode

note: as far as I can see, all the examples below assume that it takes four lines for one FASTQ record .

ADD REPLY • link 11.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

This is generally the case with fastq parsers. I'd be curious if you have an alternative. One of the flaws of fastq in my opinion is that each record is spread over multiple lines with no guaranteed record separator

ADD REPLY • link 11.3 years ago by Obi Griffith 20k

0

Entering edit mode

see

How common are multi-line fastq files?

ADD REPLY • link 11.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I made a tool to convert the nonstandard multi-line fastq files into 4-line entries. Fastq files should be 4 lines per entry because they are too difficult to parse otherwise. https://sourceforge.net/p/cg-pipeline/code/425/tree/cg_pipeline/branches/lkatz/scripts/run_assembly_convertMultiFastqToStandard.pl

ADD REPLY • link 11.3 years ago by Lee Katz ★ 3.1k

3

Entering edit mode

While I agree multi-line fastq is inconvenient, it is not nonstandard as no official documentations invalidate multi-line fastq. Seqtk is probably the most efficient way so far to convert between multi-line and 4-line fastq.

ADD REPLY • link 11.3 years ago by lh3 33k

0

Entering edit mode

Are there cases where there are fewer lines for a FASTQ record? (I write quickie .fq parsers every now and again, so I'm genuinely curious.)

ADD REPLY • link 11.3 years ago by Alex Reynolds 35k

score 4 · Answer 1 · 2013-02-01

4

Entering edit mode

11.3 years ago

Irsan ★ 7.8k

Many times, you can use UNIX commands do file slicing and dicing tasks.

Consider the file old.fastq

@HWI-ST1019:196:D121WACXX:5:1101:1538:2300/1
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@##############################################
@HWI-ST1019:196:D121WACXX:5:1101:1538:2300/2
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@##############################################

you can use the stream editor (sed) to search and replace /1 for -1

sed 's:\/\([12]\):-\1:g' test.fastq

that will give you the output

@HWI-ST1019:196:D121WACXX:5:1101:1538:2300-1
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@##############################################
@HWI-ST1019:196:D121WACXX:5:1101:1538:2300-2
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@##############################################

ADD COMMENT • link 11.3 years ago by Irsan ★ 7.8k

1

Entering edit mode

What happens here if you have '/1' in the quality strings ?

ADD REPLY • link 11.3 years ago by toni ★ 2.2k

0

Entering edit mode

Hmm, that also gets replaced: then use this to change any sequence id that ens with /1 to -1 assuming you have 1 sequence id every 4 lines

awk '{if(NR%4==1){gsub(/\/1$/,"-1");print $0}else{print $0}}' test.fastq

and for /2 to -2

awk '{if(NR%4==1){gsub(/\/2$/,"-2");print $0}else{print $0}}' test.fastq

ADD REPLY • link 11.3 years ago by Irsan ★ 7.8k

0

Entering edit mode

I personally rather use part of the header, such as "HWI" when trying to match header than relying on 1 sequence always spread over 4 lines.

ADD REPLY • link 11.3 years ago by Biomonika (Noolean) 3.2k

0

Entering edit mode

+1 for use of sed; see my blog post on handling fastq - http://nsaunders.wordpress.com/2011/12/22/sequencing-for-relics-from-the-sanger-era-part-1-getting-the-raw-data/

ADD REPLY • link 11.3 years ago by Neilfws 49k

score 4 · Answer 2 · 2013-02-01

4

Entering edit mode

11.3 years ago

JC 13k

As @lh3 pointed, you don't to read the full file in memory in Perl, just read line by line and process if required:

perl -plane 's/\/(\d)/-$1/ if (m/\@HWI/)' < file.fastq > new_file.fastq

ADD COMMENT • link 11.3 years ago by JC 13k

score 4 · Answer 3 · 2013-02-01

4

Entering edit mode

11.3 years ago

Alastair Kerr 5.3k

Some code golf for a Friday afternoon? Here is a perl one liner,

perl -pi.bak -e '/^@H/ && s#/([12])$#-\1#' filename

ADD COMMENT • link 11.3 years ago by Alastair Kerr 5.3k

0

Entering edit mode

beaten to the send button

ADD REPLY • link 11.3 years ago by Alastair Kerr 5.3k

score 3 · Answer 4 · 2013-02-01

Hi, Whatever language you choose the files will remain big anyway, but I do not think it's worth to write a C code just for this. I do not know on which operating system you are working on, but supposing that you are under Linux (or equivalent) you can have a look at the sed command.

You must pay attention to the fact that the characters '/', '1' and '2' are in the range of those used for encoding quality. So when you do a substitution, you have to make sure that it is indeed a line of read name and not a quality string.

Take this example (file example.fasq):

@HWI-ST1019:196:/121WACXX:5:1101:1538:2300/1
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@######################################/1######

There are 3 '/1' appearances : 2 in the read name, one in the quality string, you only want the one at the end of the read name to be modified.

Then execute (assuming all your read names begin with HW):

sed "/^@HW/ s/\/1$/-1/g" example.fastq

And it produces :

@HWI-ST1019:196:/121WACXX:5:1101:1538:2300-1
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@######################################/1######

Is it REALLY necessary that you read such a huge file just to substitute a '/' with a '-' ?

score 2 · Answer 5 · 2013-02-01

2

Entering edit mode

11.3 years ago

Nikolay Vyahhi ★ 1.3k

In Python:

for line in open('file.fastq'):
  print line.replace('/1', '-1').replace('/2', '-2')

ADD COMMENT • link 11.3 years ago by Nikolay Vyahhi ★ 1.3k

score 1 · Answer 6 · 2013-02-01

1

Entering edit mode

11.3 years ago

Mikael Huss 4.8k

I find that modules like seqIO in BioPython are convenient for this kind of thing. Especially when you start needing to do other things than just change one character in each header. Of course there are analogues in HTSeq, BioPerl etc.

ADD COMMENT • link 11.3 years ago by Mikael Huss 4.8k