Question: Efficient Fastq Parsing And Manipulation In Java (With Or Without Biojava)
2
gravatar for Michael Dondrup
9.4 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

Hi people, I have got a Java question for you:

I am working with large fastq file. I want to read the file, search for sequence entries by their description, and write a new fastq file with a subset of reads that match. Simple enough.

I have tried BioJava 1.7.1, the fastq API makes this easy to code (see the code example below), but it's inefficient for real-sized files (say 500MB-1GB). It works for my with small files ( up to 100k reads), but not for larger files, I'm running out of memory, and it will be very slow. I haven't found the right heap size for Java yet, must be >4GB for a 500MB fastq.

The fastq parser attempts to slurp the whole file at once.

Is there another library to do this more efficiently, or another way to read that file in chunks using BioJava? Only requirements, should fit into less than 2 GB of RAM and be in Java.

Here is my code:

package org.esysbio.fastqparser;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;

import org.biojava.bio.program.fastq.Fastq;
import org.biojava.bio.program.fastq.FastqReader;
import org.biojava.bio.program.fastq.FastqWriter;
import org.biojava.bio.program.fastq.SangerFastqReader;
import org.biojava.bio.program.fastq.SangerFastqWriter;

public class App {

    public static void main(String[] args) throws FileNotFoundException,
            IOException {
        FileInputStream inputFastq = new FileInputStream(args[0]);
        FastqReader qReader = new SangerFastqReader();
        /* use a hash set for fast search */
        HashSet<String> nameset = new HashSet<String>();
        /* some dummy names to search for */
        String[] names = { "NCC-1701-D",
                "NCC-1701-A" };

        nameset.addAll(Arrays.asList(names));

        FileOutputStream outputFastq = new FileOutputStream(args[1]);

        FastqWriter qWriter = new SangerFastqWriter();

        int i = 0, j = 0;
        /*
         * qReader.read(inputFastq) tries to read the whole file in into memory,
         * I think this is part of the problem
         */
        for (Fastq fastq : qReader.read(inputFastq)) {
            i++;
            String nam = fastq.getDescription();
            if (nameset.contains(nam)) {
                j++;
                qWriter.write(outputFastq, fastq);
            }
        }
        outputFastq.close();
        System.out.println("read " + i + " sequences, wrote " + j);
    }
}
parsing java fastq biojava • 5.2k views
ADD COMMENTlink modified 14 months ago by RamRS24k • written 9.4 years ago by Michael Dondrup46k
1
gravatar for Pierre Lindenbaum
9.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

Did you try to use a simple readline ? I've not tested the simple following program and I've no idea about its speed, but you get the idea. The program just supposes that there is only 4 rows per record. You can also try to increase the value of DEFAULT_BUFFER_SIZE.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.GZIPInputStream;

public class App {
public static void main(String[] args) throws Exception
    {
    final int DEFAULT_BUFFER_SIZE=5096;
    String[] names = { "@NCC-1701-D",
       "@NCC-1701-A" };

    Set<String> nameset=new HashSet<String>();
    nameset.addAll(Arrays.asList(names));

    BufferedReader r;

if(args[0].endsWith(".gz"))
         {
         r=new BufferedReader(new InputStreamReader(new GZIPInputStream(new FileInputStream(args[0]),DEFAULT_BUFFER_SIZE)),DEFAULT_BUFFER_SIZE);
         }
     else
         {
         r=new BufferedReader(new FileReader(args[0]),DEFAULT_BUFFER_SIZE);
         }
    long nLine=-1L;
    String line;
    while((line=r.readLine())!=null)
        {
        nLine++;
        if(nLine%4!=0 || !nameset.contains(line)) continue;
        System.out.println(line);
        System.out.println(r.readLine());
        System.out.println(r.readLine());
        System.out.println(r.readLine());
        nLine+=3;
        }
    r.close();
    }
}
ADD COMMENTlink modified 14 months ago by RamRS24k • written 9.4 years ago by Pierre Lindenbaum124k

Thank you Pierre, yes it works like a charm out of the box, just a bit embarrassing I didn't get the idea myself. For the timing I to the fun to make a little unix grep competitor:

time grep -w -e "@NGC-1" -e @NGC-2 -A3  input.fastq | grep -ve"--"
...
real    0m11.504s
user    0m0.266s
sys 0m0.449s

time java  org.esysbio.App2
...
real    0m13.058s
user    0m6.227s
sys 0m0.977s

If one removes each entry after finding it once, that allows for another speedup.

ADD REPLYlink modified 14 months ago by RamRS24k • written 9.4 years ago by Michael Dondrup46k
1
gravatar for holger.brandl
2.2 years ago by
Germany
holger.brandl10 wrote:

Another approach may by to use kscript with the following scriptlet to stream through the file, apply some transformation, and to persist the result into another fastq file:

#!/usr/bin/env kscript

//DEPS de.mpicbg.scicomp:kutils:0.8.3

package de.mpicbg.scicomp.scratch

import de.mpicbg.scicomp.bioinfo.openFastq
import de.mpicbg.scicomp.kutils.saveAs
import java.io.File

openFastq("some.fastq").
        map{ it.sequence.startsWith("ATT")}.
        saveAs(File("att_fragments.fastq"))

And here's another example for filtering a fasta file:

kscript - Homo_sapiens.GRCh38.pep.all.fa <<"EOF" > Homo_sapiens.GRCh38.pep.tx_prot_coding.fa
//DEPS de.mpicbg.scicomp:kutils:0.8.3

import de.mpicbg.scicomp.bioinfo.openFasta

openFasta(args[0]).
        filter { it.description?.contains("transcript_biotype:protein_coding") ?: false }.
        map{ it.copy(description = null) }.
        forEach { print(it.toEntryString()) }
EOF
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by holger.brandl10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1172 users visited in the last hour