Hi people, I have got a Java question for you:
I am working with a large fastq file. I want to read the file, search for sequence entries by their description, and write a new fastq file containing the subset of reads that match. Simple enough.
I have tried BioJava 1.7.1. Its fastq API makes this easy to code (see the code example below), but it's inefficient for real-sized files (say 500 MB to 1 GB). It works for me with small files (up to 100k reads), but with larger files I run out of memory and it gets very slow. I haven't found a sufficient Java heap size yet; it seems to need more than 4 GB for a 500 MB fastq.
The fastq parser attempts to slurp the whole file at once.
Is there another library that does this more efficiently, or another way to read the file in chunks using BioJava? The only requirements: it should fit into less than 2 GB of RAM and be in Java.
Here is my code:
package org.esysbio.fastqparser;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;

import org.biojava.bio.program.fastq.Fastq;
import org.biojava.bio.program.fastq.FastqReader;
import org.biojava.bio.program.fastq.FastqWriter;
import org.biojava.bio.program.fastq.SangerFastqReader;
import org.biojava.bio.program.fastq.SangerFastqWriter;

public class App {
    public static void main(String[] args) throws FileNotFoundException,
            IOException {
        FileInputStream inputFastq = new FileInputStream(args[0]);
        FastqReader qReader = new SangerFastqReader();
        /* use a hash set for fast search */
        HashSet<String> nameset = new HashSet<String>();
        /* some dummy names to search for */
        String[] names = { "NCC-1701-D",
                "NCC-1701-A" };
        nameset.addAll(Arrays.asList(names));
        FileOutputStream outputFastq = new FileOutputStream(args[1]);
        FastqWriter qWriter = new SangerFastqWriter();
        int i = 0, j = 0;
        /*
         * qReader.read(inputFastq) tries to read the whole file into memory,
         * I think this is part of the problem
         */
        for (Fastq fastq : qReader.read(inputFastq)) {
            i++;
            String nam = fastq.getDescription();
            if (nameset.contains(nam)) {
                j++;
                qWriter.write(outputFastq, fastq);
            }
        }
        outputFastq.close();
        System.out.println("read " + i + " sequences, wrote " + j);
    }
}
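For comparison, here is a minimal sketch of what reading the file record by record could look like in plain Java, without BioJava. It is not the BioJava API: the class name StreamingFastqFilter and the method filter are made up for illustration. It assumes strict four-line records (no wrapped sequence or quality lines) and that the text after the leading '@' is the description to match. Only one record is held in memory at a time, so the heap needed should not grow with the file size.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of BioJava.
class StreamingFastqFilter {

    /**
     * Copies every four-line record whose description is in names.
     * Returns { records read, records written }.
     * Assumption: each record is exactly four lines.
     */
    static long[] filter(BufferedReader in, Writer out, Set<String> names)
            throws IOException {
        long read = 0, written = 0;
        String header;
        while ((header = in.readLine()) != null) {
            if (header.isEmpty()) {
                continue; // tolerate blank lines, e.g. at the end of the file
            }
            String seq = in.readLine();
            String plus = in.readLine();
            String qual = in.readLine();
            if (!header.startsWith("@") || qual == null) {
                throw new IOException("Malformed FASTQ record " + (read + 1));
            }
            read++;
            // Assumption: the description is everything after the leading '@'.
            if (names.contains(header.substring(1))) {
                out.write(header + "\n" + seq + "\n" + plus + "\n" + qual + "\n");
                written++;
            }
        }
        return new long[] { read, written };
    }

    public static void main(String[] args) throws IOException {
        Set<String> names = new HashSet<String>(
                Arrays.asList("NCC-1701-D", "NCC-1701-A"));
        // ISO-8859-1 maps every byte to a character, so odd bytes do not abort the run.
        try (BufferedReader in = Files.newBufferedReader(
                     Paths.get(args[0]), StandardCharsets.ISO_8859_1);
             BufferedWriter out = Files.newBufferedWriter(
                     Paths.get(args[1]), StandardCharsets.ISO_8859_1)) {
            long[] counts = filter(in, out, names);
            System.out.println("read " + counts[0] + " sequences, wrote " + counts[1]);
        }
    }
}
```

The trade-off is that this does no validation of the sequence or quality strings, which a proper fastq parser would do.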
Thank you Pierre, yes, it works like a charm out of the box; it's just a bit embarrassing that I didn't think of it myself. For the timing, I had the fun of making a little Unix grep competitor:
If one removes each name from the search set after finding it once, that allows for another speedup.
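The removal idea could be sketched roughly like this. The class name EarlyExitFastqFilter and the method extractOnce are hypothetical, and the sketch assumes four-line records and that each description occurs at most once in the file (otherwise later duplicates would be missed). Set.remove returns true only when the name was present, so it doubles as the lookup, and the loop can stop reading as soon as the set is empty.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.util.Set;

// Hypothetical helper for illustration only.
class EarlyExitFastqFilter {

    /**
     * Writes each wanted record once and removes its name from the set.
     * Stops reading as soon as every wanted name has been found.
     * Note: the wanted set is modified; names left in it were not found.
     */
    static int extractOnce(BufferedReader in, Writer out, Set<String> wanted)
            throws IOException {
        int written = 0;
        String header;
        while (!wanted.isEmpty() && (header = in.readLine()) != null) {
            String seq = in.readLine();
            String plus = in.readLine();
            String qual = in.readLine();
            if (qual == null) {
                throw new IOException("Truncated FASTQ record");
            }
            // remove() is both the membership test and the deletion.
            if (header.startsWith("@") && wanted.remove(header.substring(1))) {
                out.write(header + "\n" + seq + "\n" + plus + "\n" + qual + "\n");
                written++;
            }
        }
        return written;
    }
}
```

How much this helps depends on where the matches sit: if the last wanted read is near the end of the file, nearly the whole file is still scanned.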