Efficient FASTQ Parsing and Manipulation in Java (With or Without BioJava)
13.8 years ago • Michael 54k

Hi people, I have got a Java question for you:

I am working with large FASTQ files. I want to read a file, search for sequence entries by their description, and write a new FASTQ file containing the subset of reads that match. Simple enough.

I have tried BioJava 1.7.1; its fastq API makes this easy to code (see the example below), but it is inefficient for real-sized files (say 500 MB to 1 GB). It works for me with small files (up to 100k reads), but for larger files I run out of memory and it gets very slow. I haven't found the right heap size (-Xmx) for Java yet; it must be >4 GB for a 500 MB FASTQ, presumably because every record is held in memory as Java strings, which alone double the size of the ASCII data.

The FASTQ parser attempts to slurp the whole file into memory at once.

Is there another library that does this more efficiently, or another way to read the file in chunks using BioJava? The only requirements: it should fit into less than 2 GB of RAM and be written in Java.

Here is my code:

package org.esysbio.fastqparser;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;

import org.biojava.bio.program.fastq.Fastq;
import org.biojava.bio.program.fastq.FastqReader;
import org.biojava.bio.program.fastq.FastqWriter;
import org.biojava.bio.program.fastq.SangerFastqReader;
import org.biojava.bio.program.fastq.SangerFastqWriter;

public class App {

    public static void main(String[] args) throws FileNotFoundException,
            IOException {
        FileInputStream inputFastq = new FileInputStream(args[0]);
        FastqReader qReader = new SangerFastqReader();
        /* use a hash set for fast search */
        HashSet<String> nameset = new HashSet<String>();
        /* some dummy names to search for */
        String[] names = { "NCC-1701-D",
                "NCC-1701-A" };

        nameset.addAll(Arrays.asList(names));

        FileOutputStream outputFastq = new FileOutputStream(args[1]);

        FastqWriter qWriter = new SangerFastqWriter();

        int i = 0, j = 0;
        /*
         * qReader.read(inputFastq) tries to read the whole file in into memory,
         * I think this is part of the problem
         */
        for (Fastq fastq : qReader.read(inputFastq)) {
            i++;
            String nam = fastq.getDescription();
            if (nameset.contains(nam)) {
                j++;
                qWriter.write(outputFastq, fastq);
            }
        }
        outputFastq.close();
        System.out.println("read " + i + " sequences, wrote " + j);
    }
}
fastq java biojava parsing
Answer: 13.8 years ago • Pierre
Did you try a simple readLine()? I haven't tested the following program and I have no idea about its speed, but you get the idea. The program just assumes that there are exactly four lines per record. You can also try increasing the value of DEFAULT_BUFFER_SIZE.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.GZIPInputStream;

public class App {
    public static void main(String[] args) throws Exception {
        final int DEFAULT_BUFFER_SIZE = 5096;
        String[] names = { "@NCC-1701-D", "@NCC-1701-A" };

        Set<String> nameset = new HashSet<String>();
        nameset.addAll(Arrays.asList(names));

        /* open plain or gzipped input with a buffered reader */
        BufferedReader r;
        if (args[0].endsWith(".gz")) {
            r = new BufferedReader(new InputStreamReader(new GZIPInputStream(
                    new FileInputStream(args[0]), DEFAULT_BUFFER_SIZE)), DEFAULT_BUFFER_SIZE);
        } else {
            r = new BufferedReader(new FileReader(args[0]), DEFAULT_BUFFER_SIZE);
        }

        long nLine = -1L;
        String line;
        while ((line = r.readLine()) != null) {
            nLine++;
            /* only every 4th line is a header, and only headers are matched */
            if (nLine % 4 != 0 || !nameset.contains(line)) continue;
            /* print the whole record: header, sequence, separator, qualities */
            System.out.println(line);
            System.out.println(r.readLine());
            System.out.println(r.readLine());
            System.out.println(r.readLine());
            nLine += 3;
        }
        r.close();
    }
}
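
The program takes the plain or gzipped FASTQ file as its first argument and prints matching records to stdout, so the subset can be captured with a shell redirect (java App input.fastq.gz > subset.fastq). If the output should be compressed as well, the writing side can be wrapped just like the reading side; a minimal sketch using the JDK's GZIPOutputStream (the openGzipWriter helper is illustrative, not part of the program above):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.zip.GZIPOutputStream;

public class GzipOut {
    /* Returns a PrintWriter that gzip-compresses everything written to it.
       Use out.println(...) in place of System.out.println(...) above,
       and call out.close() at the end so the gzip trailer gets written. */
    static PrintWriter openGzipWriter(String path) throws IOException {
        return new PrintWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(path))));
    }
}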
Thank you Pierre, yes, it works like a charm out of the box; it's just a bit embarrassing that I didn't get the idea myself. For the timing, I had some fun turning it into a little Unix grep competitor:

time grep -w -e "@NGC-1" -e @NGC-2 -A3  input.fastq | grep -ve"--"
...
real    0m11.504s
user    0m0.266s
sys 0m0.449s

time java  org.esysbio.App2
...
real    0m13.058s
user    0m6.227s
sys 0m0.977s

If one removes each name from the set after its first match, the loop can stop as soon as the set is empty, which allows for another speedup.
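
A minimal sketch of that variant, reusing nameset, nLine, and r from the program above (Set.remove returns true only if the name was still present, so each name matches at most once):

        if (nLine % 4 != 0 || !nameset.remove(line)) continue;
        System.out.println(line);
        System.out.println(r.readLine());
        System.out.println(r.readLine());
        System.out.println(r.readLine());
        nLine += 3;
        /* every name has been found, skip the rest of the file */
        if (nameset.isEmpty()) break;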

Answer: 6.6 years ago
Another approach may be to use kscript with the following scriptlet to stream through the file, apply some transformation, and persist the result into another FASTQ file:

#!/usr/bin/env kscript

//DEPS de.mpicbg.scicomp:kutils:0.8.3

package de.mpicbg.scicomp.scratch

import de.mpicbg.scicomp.bioinfo.openFastq
import de.mpicbg.scicomp.kutils.saveAs
import java.io.File

openFastq("some.fastq").
        filter { it.sequence.startsWith("ATT") }.
        saveAs(File("att_fragments.fastq"))

And here's another example, filtering a FASTA file:

kscript - Homo_sapiens.GRCh38.pep.all.fa <<"EOF" > Homo_sapiens.GRCh38.pep.tx_prot_coding.fa
//DEPS de.mpicbg.scicomp:kutils:0.8.3

import de.mpicbg.scicomp.bioinfo.openFasta

openFasta(args[0]).
        filter { it.description?.contains("transcript_biotype:protein_coding") ?: false }.
        map{ it.copy(description = null) }.
        forEach { print(it.toEntryString()) }
EOF
