Question: How To Parse Fastq File?
1
gravatar for SK
7.1 years ago by
SK110
Germany
SK110 wrote:

I am new in NGS. I need to format some fastq files. For example I want to replace characters /1 and /2 with -1 and - 2 in the files having data like @HWUSI-EAS1671_0001:5:1:1022:10290/1. I tried to do it in perl with pattern matching but the files are too big to parse. Could anyone please give me suggestion on share any code?

fastq • 4.8k views
ADD COMMENTlink modified 4.6 years ago by Biostar ♦♦ 20 • written 7.1 years ago by SK110
5

Well, files are never too big to parse with Perl if you don't keep the file in RAM.

ADD REPLYlink written 7.1 years ago by lh332k

note: as far as I can see, all the examples below assume that it takes four lines for one FASTQ record .

ADD REPLYlink written 7.1 years ago by Pierre Lindenbaum126k

This is generally the case with fastq parsers. I'd be curious if you have an alternative. One of the flaws of fastq in my opinion is that each record is spread over multiple lines with no guaranteed record separator

ADD REPLYlink written 7.1 years ago by Obi Griffith18k

see

How common are multi-line fastq files?

ADD REPLYlink written 7.1 years ago by Pierre Lindenbaum126k

I made a tool to convert the nonstandard multi-line fastq files into 4-line entries. Fastq files should be 4 lines per entry because they are too difficult to parse otherwise. https://sourceforge.net/p/cg-pipeline/code/425/tree/cg_pipeline/branches/lkatz/scripts/run_assembly_convertMultiFastqToStandard.pl

ADD REPLYlink written 7.0 years ago by Lee Katz3.0k
3

While I agree multi-line fastq is inconvenient, it is not nonstandard as no official documentations invalidate multi-line fastq. Seqtk is probably the most efficient way so far to convert between multi-line and 4-line fastq.

ADD REPLYlink written 7.0 years ago by lh332k

Are there cases where there are fewer lines for a FASTQ record? (I write quickie .fq parsers every now and again, so I'm genuinely curious.)

ADD REPLYlink written 7.1 years ago by Alex Reynolds29k
4
gravatar for Irsan
7.1 years ago by
Irsan7.1k
Amsterdam
Irsan7.1k wrote:

Many times, you can use UNIX commands do file slicing and dicing tasks.

Consider the file old.fastq

@HWI-ST1019:196:D121WACXX:5:1101:1538:2300/1
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@##############################################
@HWI-ST1019:196:D121WACXX:5:1101:1538:2300/2
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@##############################################

you can use the stream editor (sed) to search and replace /1 for -1

sed 's:\/\([12]\):-\1:g' test.fastq

that will give you the output

@HWI-ST1019:196:D121WACXX:5:1101:1538:2300-1
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@##############################################
@HWI-ST1019:196:D121WACXX:5:1101:1538:2300-2
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@##############################################
ADD COMMENTlink modified 7.1 years ago • written 7.1 years ago by Irsan7.1k
1

What happens here if you have '/1' in the quality strings ?

ADD REPLYlink written 7.1 years ago by toni2.1k

Hmm, that also gets replaced: then use this to change any sequence id that ens with /1 to -1 assuming you have 1 sequence id every 4 lines

awk '{if(NR%4==1){gsub(/\/1$/,"-1");print $0}else{print $0}}' test.fastq

and for /2 to -2

awk '{if(NR%4==1){gsub(/\/2$/,"-2");print $0}else{print $0}}' test.fastq
ADD REPLYlink modified 7.1 years ago • written 7.1 years ago by Irsan7.1k

I personally rather use part of the header, such as "HWI" when trying to match header than relying on 1 sequence always spread over 4 lines.

ADD REPLYlink written 7.1 years ago by Biomonika (Noolean)3.1k

+1 for use of sed; see my blog post on handling fastq - http://nsaunders.wordpress.com/2011/12/22/sequencing-for-relics-from-the-sanger-era-part-1-getting-the-raw-data/

ADD REPLYlink written 7.1 years ago by Neilfws48k
4
gravatar for JC
7.1 years ago by
JC9.5k
Mexico
JC9.5k wrote:

As @lh3 pointed, you don't to read the full file in memory in Perl, just read line by line and process if required:

perl -plane 's/\/(\d)/-$1/ if (m/\@HWI/)' < file.fastq > new_file.fastq

ADD COMMENTlink written 7.1 years ago by JC9.5k
4
gravatar for Alastair Kerr
7.1 years ago by
Alastair Kerr5.2k
Manchester/UK/Cancer Biomarker Centre at CRUK-MI
Alastair Kerr5.2k wrote:

Some code golf for a Friday afternoon? Here is a perl one liner,

perl -pi.bak -e '/^@H/ && s#/([12])$#-\1#' filename

ADD COMMENTlink written 7.1 years ago by Alastair Kerr5.2k

beaten to the send button

ADD REPLYlink written 7.1 years ago by Alastair Kerr5.2k
3
gravatar for toni
7.1 years ago by
toni2.1k
Lyon
toni2.1k wrote:

Hi, Whatever language you choose the files will remain big anyway, but I do not think it's worth to write a C code just for this. I do not know on which operating system you are working on, but supposing that you are under Linux (or equivalent) you can have a look at the sed command.

You must pay attention to the fact that the characters '/', '1' and '2' are in the range of those used for encoding quality. So when you do a substitution, you have to make sure that it is indeed a line of read name and not a quality string.

Take this example (file example.fasq):

@HWI-ST1019:196:/121WACXX:5:1101:1538:2300/1
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@######################################/1######

There are 3 '/1' appearances : 2 in the read name, one in the quality string, you only want the one at the end of the read name to be modified.

Then execute (assuming all your read names begin with HW):

sed "/^@HW/ s/\/1$/-1/g" example.fastq

And it produces :

@HWI-ST1019:196:/121WACXX:5:1101:1538:2300-1
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@######################################/1######

Is it REALLY necessary that you read such a huge file just to substitute a '/' with a '-' ?

ADD COMMENTlink modified 7.1 years ago • written 7.1 years ago by toni2.1k
2
gravatar for Nikolay Vyahhi
7.1 years ago by
Nikolay Vyahhi1.3k
St. Petersburg, Russia
Nikolay Vyahhi1.3k wrote:

In Python:

for line in open('file.fastq'):
  print line.replace('/1', '-1').replace('/2', '-2')
ADD COMMENTlink written 7.1 years ago by Nikolay Vyahhi1.3k
1
gravatar for Mikael Huss
7.1 years ago by
Mikael Huss4.7k
Stockholm
Mikael Huss4.7k wrote:

I find that modules like seqIO in BioPython are convenient for this kind of thing. Especially when you start needing to do other things than just change one character in each header. Of course there are analogues in HTSeq, BioPerl etc.

ADD COMMENTlink written 7.1 years ago by Mikael Huss4.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1738 users visited in the last hour