Efficient FASTQ Parsing and Manipulation in Java (With or Without BioJava)
13.8 years ago • Michael 54k

Hi people, I have got a Java question for you:

I am working with large FASTQ files. I want to read a file, search for sequence entries by their description, and write a new FASTQ file containing the subset of reads that match. Simple enough.

I have tried BioJava 1.7.1; its fastq API makes this easy to code (see the example below), but it is inefficient for real-sized files (say 500 MB to 1 GB). It works for me with small files (up to 100k reads), but for larger files I run out of memory and it gets very slow. I haven't found the right heap size (-Xmx) for Java yet; it must be >4 GB for a 500 MB FASTQ, presumably because every record is held in memory as Java strings, which alone double the size of the ASCII data.

The FASTQ parser attempts to slurp the whole file into memory at once.

Is there another library that does this more efficiently, or another way to read the file in chunks using BioJava? The only requirements: it should fit into less than 2 GB of RAM and be written in Java.

Here is my code:

package org.esysbio.fastqparser;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;

import org.biojava.bio.program.fastq.Fastq;
import org.biojava.bio.program.fastq.FastqReader;
import org.biojava.bio.program.fastq.FastqWriter;
import org.biojava.bio.program.fastq.SangerFastqReader;
import org.biojava.bio.program.fastq.SangerFastqWriter;

public class App {

    public static void main(String[] args) throws FileNotFoundException,
            IOException {
        FileInputStream inputFastq = new FileInputStream(args[0]);
        FastqReader qReader = new SangerFastqReader();
        /* use a hash set for fast search */
        HashSet<String> nameset = new HashSet<String>();
        /* some dummy names to search for */
        String[] names = { "NCC-1701-D",
                "NCC-1701-A" };

        nameset.addAll(Arrays.asList(names));

        FileOutputStream outputFastq = new FileOutputStream(args[1]);

        FastqWriter qWriter = new SangerFastqWriter();

        int i = 0, j = 0;
        /*
         * qReader.read(inputFastq) tries to read the whole file in into memory,
         * I think this is part of the problem
         */
        for (Fastq fastq : qReader.read(inputFastq)) {
            i++;
            String nam = fastq.getDescription();
            if (nameset.contains(nam)) {
                j++;
                qWriter.write(outputFastq, fastq);
            }
        }
        outputFastq.close();
        System.out.println("read " + i + " sequences, wrote " + j);
    }
}
fastq java biojava parsing
Answer: 13.8 years ago • Pierre
Did you try a simple readLine()? I haven't tested the following program and I have no idea about its speed, but you get the idea. The program just assumes that there are exactly four lines per record. You can also try increasing the value of DEFAULT_BUFFER_SIZE.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.GZIPInputStream;

public class App {
    public static void main(String[] args) throws Exception {
        final int DEFAULT_BUFFER_SIZE = 5096;
        String[] names = { "@NCC-1701-D", "@NCC-1701-A" };

        Set<String> nameset = new HashSet<String>();
        nameset.addAll(Arrays.asList(names));

        /* open plain or gzipped input with a buffered reader */
        BufferedReader r;
        if (args[0].endsWith(".gz")) {
            r = new BufferedReader(new InputStreamReader(new GZIPInputStream(
                    new FileInputStream(args[0]), DEFAULT_BUFFER_SIZE)), DEFAULT_BUFFER_SIZE);
        } else {
            r = new BufferedReader(new FileReader(args[0]), DEFAULT_BUFFER_SIZE);
        }

        long nLine = -1L;
        String line;
        while ((line = r.readLine()) != null) {
            nLine++;
            /* only every 4th line is a header, and only headers are matched */
            if (nLine % 4 != 0 || !nameset.contains(line)) continue;
            /* print the whole record: header, sequence, separator, qualities */
            System.out.println(line);
            System.out.println(r.readLine());
            System.out.println(r.readLine());
            System.out.println(r.readLine());
            nLine += 3;
        }
        r.close();
    }
}
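
The program takes the plain or gzipped FASTQ file as its first argument and prints matching records to stdout, so the subset can be captured with a shell redirect (java App input.fastq.gz > subset.fastq). If the output should be compressed as well, the writing side can be wrapped just like the reading side; a minimal sketch using the JDK's GZIPOutputStream (the openGzipWriter helper is illustrative, not part of the program above):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.zip.GZIPOutputStream;

public class GzipOut {
    /* Returns a PrintWriter that gzip-compresses everything written to it.
       Use out.println(...) in place of System.out.println(...) above,
       and call out.close() at the end so the gzip trailer gets written. */
    static PrintWriter openGzipWriter(String path) throws IOException {
        return new PrintWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(path))));
    }
}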
Thank you Pierre, yes, it works like a charm out of the box; it's just a bit embarrassing that I didn't get the idea myself. For the timing, I had some fun turning it into a little Unix grep competitor:

time grep -w -e "@NGC-1" -e @NGC-2 -A3  input.fastq | grep -ve"--"
...
real    0m11.504s
user    0m0.266s
sys 0m0.449s

time java  org.esysbio.App2
...
real    0m13.058s
user    0m6.227s
sys 0m0.977s

If one removes each name from the set after its first match, the loop can stop as soon as the set is empty, which allows for another speedup.
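
A minimal sketch of that variant, reusing nameset, nLine, and r from the program above (Set.remove returns true only if the name was still present, so each name matches at most once):

        if (nLine % 4 != 0 || !nameset.remove(line)) continue;
        System.out.println(line);
        System.out.println(r.readLine());
        System.out.println(r.readLine());
        System.out.println(r.readLine());
        nLine += 3;
        /* every name has been found, skip the rest of the file */
        if (nameset.isEmpty()) break;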

Answer: 6.6 years ago
Another approach may be to use kscript with the following scriptlet to stream through the file, apply some transformation, and persist the result into another FASTQ file:

#!/usr/bin/env kscript

//DEPS de.mpicbg.scicomp:kutils:0.8.3

package de.mpicbg.scicomp.scratch

import de.mpicbg.scicomp.bioinfo.openFastq
import de.mpicbg.scicomp.kutils.saveAs
import java.io.File

openFastq("some.fastq").
        filter { it.sequence.startsWith("ATT") }.
        saveAs(File("att_fragments.fastq"))

And here's another example, filtering a FASTA file:

kscript - Homo_sapiens.GRCh38.pep.all.fa <<"EOF" > Homo_sapiens.GRCh38.pep.tx_prot_coding.fa
//DEPS de.mpicbg.scicomp:kutils:0.8.3

import de.mpicbg.scicomp.bioinfo.openFasta

openFasta(args[0]).
        filter { it.description?.contains("transcript_biotype:protein_coding") ?: false }.
        map{ it.copy(description = null) }.
        forEach { print(it.toEntryString()) }
EOF
