Question: How To Efficiently Parse A Huge Fastq File?
23
4.0 years ago by
Panos1.3k
Geneva, Switzerland
Panos1.3k wrote:

I have a 19GB fastq file (~70 million reads) and I want to find ~10,000 reads (by looking for their name) and pull out their sequence and quality. Looping through the file 10,000 times is obviously not a solution as it would easily take quite a few days!

By searching the net, I found cdbfasta that couldn't, however, index such a big file because the index file got really big and the program crashed. Other than this, there's also a BioPerl module (Bio::Index::fastq) which I didn't try because I read somewhere that you can run into the same problem when dealing with huge fastq files.

How do you usually deal with such huge fastq files?

ADD COMMENTlink modified 10 months ago by daniel.nicorici0 • written 4.0 years ago by Panos1.3k
2

same as before... python :) my 2 cents ;) BTW: check how many times longer the Perl implementation takes than wc -l: the Python implementation takes ~2x longer, C++ ~8x longer, and Python using a list 140x longer. So for 70 mln reads the differences might be in hours :P

ADD REPLYlink written 4.0 years ago by Leszek3.3k
1

You can also have a look at this post : NGS - Huge (FASTQ) file parsing - which language for good efficiency ?

ADD REPLYlink modified 2.9 years ago by Istvan Albert ♦♦ 56k • written 4.0 years ago by toni2.0k

Thanks all of you guys for the answers and for your time! I think I'll go for the Perl-based solution (since I don't know Python or C++). Generally speaking, if instead of having just 10K reads to look for, I had 100K or millions of them, what would be the best solution?

ADD REPLYlink written 4.0 years ago by Panos1.3k

I have a visual app (C: Efficiently process (view, analyze, clip ends, convert, demultiplex, dereplicate)) that shows all reads in a FastQ file. It takes about 100 seconds to open a 0.5GB file (3.3 million reads) on an old Toshiba laptop (but when you open the file the next time, it takes under 1 second). I would like to test it on a 19GB file, but I don't have such a file.

So, is this fast or slow? I couldn't find any real numbers on this forum to compare mine.
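A near-instant reopen like that usually means an on-disk index of read names to byte offsets is built on the first open, after which later lookups just seek. A minimal sketch of the idea in Python (4-line records assumed; `build_index`/`fetch` are illustrative names, not necessarily how the app above works):

```python
def build_index(fastq_path):
    """Map read name -> byte offset of its 4-line FASTQ record."""
    index = {}
    with open(fastq_path, "rb") as fh:
        offset = fh.tell()
        header = fh.readline()
        while header:
            index[header.rstrip(b"\n")] = offset
            fh.readline()  # sequence
            fh.readline()  # '+' separator
            fh.readline()  # quality
            offset = fh.tell()
            header = fh.readline()
    return index

def fetch(fastq_path, index, name):
    """Seek straight to one record instead of rescanning the whole file."""
    with open(fastq_path, "rb") as fh:
        fh.seek(index[name])
        return [fh.readline().rstrip(b"\n") for _ in range(4)]
```

Building the index is the expensive part (one full scan); persisting it to disk is what would make every later open near-instant.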

ADD REPLYlink modified 10 months ago • written 15 months ago by BioApps480

Waiting 100 seconds in a graphical user interface just to open a file is too long. A common mistake people make is thinking that you need to process the entire file in order to characterize it.

Read in 100k lines and report on those, then continue loading in the background and update the results.

ADD REPLYlink written 15 months ago by Istvan Albert ♦♦ 56k

You need 100 seconds only when you open the file for the first time; after that it's less than 1 second. Also, no optimizations have been applied yet; I hope I can cut the time in half.

Anyway, your idea is great. I will do that.
 

ADD REPLYlink modified 15 months ago • written 15 months ago by BioApps480

The time is now, after some optimizations, down to 23.6 seconds (for the same file). Can I post this app here (in a new thread) to get feedback? Are there any BioStars rules against this?

ADD REPLYlink written 15 months ago by BioApps480

Sure just make it a Forum post

ADD REPLYlink written 15 months ago by Istvan Albert ♦♦ 56k

Thanks, Albert. I posted it here: Efficiently parsing and viewing a large fastq file

ADD REPLYlink written 15 months ago by BioApps480

You can always concatenate files downloaded from SRA or generate one with random sequences/qualities to obtain a file of any size.
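For the random-generation route, a minimal sketch (the read length, quality range, and naming scheme here are arbitrary choices):

```python
import random

def write_random_fastq(path, n_reads, read_len=75, seed=0):
    """Write n_reads synthetic 4-line FASTQ records for benchmarking."""
    rng = random.Random(seed)
    with open(path, "w") as out:
        for i in range(n_reads):
            seq = "".join(rng.choice("ACGT") for _ in range(read_len))
            # quality characters in the Phred+33 printable range
            qual = "".join(chr(rng.randint(33, 73)) for _ in range(read_len))
            out.write("@read_%d\n%s\n+\n%s\n" % (i, seq, qual))
```

Calling this with a large enough `n_reads` (or concatenating several outputs) yields a test file of any size.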

ADD REPLYlink written 10 months ago by Raygozak930
49
4.0 years ago by
lh324k
United States
lh324k wrote:

As I routinely waste time on benchmarks... The take-home message is C/C++ still takes the crown of speed, four times as fast as Perl/Python. Among the following programs, only seqtk and brentp's bioperl.pl work with multi-line fastq. The rest assume 4-line fastq. [PS: Python-2.7.2, Ruby-1.9.2p290, LuaJIT-beta8, Java-1.6.0_25, Perl-5.8.8 and gcc-4.1.2. i.e. latest python/ruby/luajit/java but years old C/Perl]

=========================================================================================
 Program        Lang       User  Sys   Mem/MB    Comments
-----------------------------------------------------------------------------------------
 cat            C           0.4  10.1     0.4    Unix cat
 wc -l          C           4.3  10.4     0.4    Unix wc -l
 fqextract.c    C          17.3  10.8     2.0    Hardcoded maximum line length: 4096
 Pierre-r2.cc   C++        19.5  12.2     3.8    Use tr1::unordered_set
 seqtk          C          19.9   8.9     4.1    Multi-line fastq; subregion; gzip input
 Pierre-r1.cc   C++        22.8  12.6     3.9    Replace cin with reading /dev/stdin
 wc             C          32.0  10.6     0.4    Unix wc (no command-line options)
 fqextract.py   Python     54.0   7.9     8.0    Same algorithm as fqextract.c
 fqextract.java Java       55.8   8.9   562.7    Same algorithm as fqextract.c
 fqextract.pl   Perl       71.5  10.5     5.5    Same algorithm as fqextract.c
 Leszek.py      Python     78.0  13.3     8.0    Infinite loop given improper input
 fqextract.lua  LuaJIT    109.6   8.6     6.1    Same algorithm as fqextract.c
 readfq.py      Python    122.9  11.3     8.1    Multi-line fastq
 bp-fqiter      Python    124.5   9.9    15.2    Multi-line fastq; biopython
 drio.rb        Ruby      130.1   6.9    10.0    Same algorithm as fqextract.c
 readfq.lua     Lua       189.5  11.5     6.0    Multi-line fastq
 readfq.pl      Perl      203.1  69.6     5.7    Multi-line fastq
 Pierre.cc      C++       323.6  13.3     3.9    Same algorithm as Leszek.py
 biopython.py   Python    928.8  34.2    15.6    Multi-line fastq
 bioperl.pl     Perl    >2400.0         >15.0    Multi-line fastq; killed
 grep -f -A     C       >2400.0        >414.0    Killed; possible false positives
=========================================================================================


#Discussions#

  1. Readfq is my implementation of a unified multi-line fasta/fastq parser in Perl, Python and Lua.

  2. It is a little surprising that the C/C++ implementations greatly outperform the Perl and Python ones. I would have expected the difference to be less than 2-fold.

  3. Pierre.cc and Leszek.py may go into an infinite loop when the input list contains names absent from the fastq file. The strategy used in fqextract is recommended. It is also faster, at least for python.

  4. BioPerl's fastq parser is impractically slow. Readfq.pl also parses multi-line fastq, but is far faster. BioPython's fastq parser is more efficient than BioPerl's, but still not efficient enough: readfq.py is 7 times faster.

  5. Unix cat is insanely fast. Unix "wc -l" (line counting only) is 8X faster than "wc" without options (which also counts words and bytes). As I remember, Unix fgrep (or grep -f) builds a finite-state automaton from the query strings. The algorithm is close to optimal for general use, but as fgrep ignores the format of fastq, it is far slower than a fastq-aware program.


#Programs#
fqextract.c

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "khash.h" // khash.h from samtools/bwa/seqtk/klib
KHASH_SET_INIT_STR(s)

#define BUF_SIZE 4096

int main(int argc, char *argv[])
{
    char buf[BUF_SIZE];
    FILE *fp;
    int lineno = 0, flag = 0, ret;
    khash_t(s) *h;
    if (argc == 1) {
        fprintf(stderr, "Usage: cat in.fq | fqextract <name.lst>\n");
        return 1;
    }
    h = kh_init(s);
    fp = fopen(argv[1], "rb"); // FIXME: check fp
    while (fgets(buf, BUF_SIZE, fp))
        kh_put(s, h, strdup(buf), &ret); // FIXME: check ret
    fclose(fp);
    while (fgets(buf, BUF_SIZE, stdin)) {
        if (++lineno%4 == 1)
            flag = (kh_get(s, h, buf + 1) != kh_end(h));
        if (flag) fputs(buf, stdout);
    }
    kh_destroy(s, h); // FIXME: free keys before destroy
    return 0;
}

Pierre-r2.cc

#include <iostream>
#include <string>
#include <fstream>
#include <cstdlib>
#include <tr1/unordered_set> // use <unordered_set> for more recent g++
using namespace std;

int main(int argc,char **argv)
{
    string line;
    tr1::unordered_set<string> names;
    ifstream in(argv[1],ios::in);
    if(!in.is_open()) return EXIT_FAILURE;
    while(getline(in,line,'\n'))
        names.insert(line);
    in.close();
    string name, seq, name2, qual;
    tr1::unordered_set<string>::iterator r;
    ifstream in2("/dev/stdin", ios::in);
    while(!names.empty())
    {
        if(!getline(in2,name,'\n')) break;
        if(!getline(in2,seq,'\n')) break;
        if(!getline(in2,name2,'\n')) break;
        if(!getline(in2,qual,'\n')) break;
        r=names.find(name);
        if(r==names.end()) continue;
        names.erase(r);//names are unique we don't need this one anymore
        cout << name << "\n" << seq << "\n" << name2 << "\n" << qual << endl;
    }
    in2.close();
    return 0;
}

fqextract.java

import java.util.HashSet;
import java.io.*;

class fqextract {
    public static void main(String[] args) throws IOException {
        BufferedReader fp = new BufferedReader(new FileReader(args[0]));
        HashSet h = new HashSet();
        String l;
        while ((l = fp.readLine()) != null) h.add('@' + l);
        fp.close();
        int lineno = 0;
        boolean flag = false;
        fp = new BufferedReader(new InputStreamReader(System.in));
        while ((l = fp.readLine()) != null) {
            if (++lineno % 4 == 1) flag = h.contains(l);
            if (flag) System.out.println(l);
        }
    }
}

seqtk

fqextract.pl

die("Usage: cat in.fq | fqextract.pl <name.lst>\n") if (@ARGV == 0);
my $fn = shift(@ARGV);
my %hash = ();
open(FILE, $fn) || die;
$hash{'@'.$_} = 1 while (<FILE>);
close(FILE);
my $flag = 0;
while (<>) {
    $flag = defined($hash{$_})? 1 : 0 if ($.%4 == 1);
    print if ($flag);
}

fqextract.py

import sys 
ids, lineno, flag = set('@' + x for x in open(sys.argv[1])), 0, False
for l in sys.stdin:
    lineno += 1
    if lineno%4 == 1: flag = (l in ids)
    if flag: print l,

biopython.py

from Bio import SeqIO
import sys
ids = set(x[:-1] for x in open(sys.argv[1]))
for seq in SeqIO.parse(sys.stdin, "fastq"):
    if seq.id in ids: print seq.format("fastq"),

bp-fqiter.py

from Bio.SeqIO.QualityIO import FastqGeneralIterator
import sys 
ids = set(x[:-1] for x in open(sys.argv[1]))
for name, seq, qual in FastqGeneralIterator(sys.stdin):
    if (name in ids): print("@%s\n%s\n+\n%s" % (name, seq, qual))

EDIT1 by lh3: Optimized kseq.h for faster FASTQ parsing, though the parser is a little less robust than the previous version (fine for nearly all files). Added cat/wc/grep. Added more discussions.

EDIT2 by lh3: Added Ruby-1.9.2p290

EDIT3 by lh3: Added LuaJIT-beta8. Lua-5.1.4 is 20% slower. Added fqextract.py and fqextract.java

EDIT4 by lh3: Added biopython and readfq parsers.

EDIT5 by lh3: Added BioPython.FastqGeneralIterator (see also here). This parser is as fast as my readfq.py.

ADD COMMENTlink modified 3.0 years ago by Istvan Albert ♦♦ 56k • written 4.0 years ago by lh324k
3

The current Biopython example is using the SeqRecord format method which is not recommended for speed. The documentation and examples try to be clear about this. Instead, use Bio.SeqIO.write() as in

https://gist.github.com/1201025
which should be much faster.

Also, in an example like this where you don't need all the objects created, a low level solution as in

https://gist.github.com/1201033
would be about as good as I would expect using the Biopython parser. See also http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

If you assume 4 lines per sequence things are different.

ADD REPLYlink written 3.9 years ago by Peter4.8k
2

lh3, can you please add this ruby implementation to the benchmark list?: https://gist.github.com/1154539

ADD REPLYlink written 4.0 years ago by Drio880
2

Added. Thanks a lot.

ADD REPLYlink written 3.9 years ago by lh324k

I have implemented a go version. Could you please add it to your benchmarks and update when you have a chance? Thanks!

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Drio880
2

Your Biopython example suffers from you using the sequence's format method. The docs do recommend you use SeqIO.write() instead. Something like this should be much faster:

from Bio import SeqIO
import sys
ids = set(x[:-1] for x in open(sys.argv[1]))
wanted = (rec for rec in SeqIO.parse(sys.stdin, "fastq") if rec.id in ids)
SeqIO.write(wanted, sys.stdout, "fastq")

Of course, if that's all you're doing there is no point making all those objects. See http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

ADD REPLYlink written 3.9 years ago by Peter4.8k

very informative post. thx!

ADD REPLYlink written 4.0 years ago by 2184687-1231-83-4.7k

Here is a lua version: https://gist.github.com/1159195 If you add it to the list could you please mention what version of lua you use? Ruby's performance is horrible. Perhaps you used MRI1.8 ?

ADD REPLYlink written 3.9 years ago by Drio880

I used the latest "official" ruby compiled from source code. I would not be surprised if Ruby is slower. Perl is designed for text processing; my impression is that for text processing, Perl is more efficient than other language implementations.

ADD REPLYlink written 3.9 years ago by lh324k

I tried to improve the speed of the C version by using mmap(2). I failed on my first attempt (https://gist.github.com/1168330). I think the reason is twofold: there is extra overhead in the logic because you only get a pointer to memory instead of a file descriptor, and in addition, reading from stdin is faster than reading directly from a file.

ADD REPLYlink written 3.9 years ago by Drio880

All hail the king of benchmarks!

ADD REPLYlink written 3.9 years ago by Aaronquinlan8.7k
15
4.0 years ago by
Leszek3.3k
IIMCB, Poland
Leszek3.3k wrote:

I would use python (no dependencies):
1. read your read names into a set (it's hash-based, so checking for the presence of an element is much faster than in a list)
2. parse your fastq file and check whether each name is present in the set of read names of interest

#!/usr/bin/python
import sys

#get fname from parameter
idfile=sys.argv[1]

#load ids
ids = set( x.strip() for x in open(idfile) )

#read stdid 
handle=sys.stdin  

while ids:
  #parse fastq
  idline=handle.readline()
  seq   =handle.readline()
  spacer=handle.readline()
  quals =handle.readline()
  #check
  id=idline[:-1]
  if id in ids:
    #print fastq
    sys.stdout.write( '%s%s%s%s' % ( idline, seq, spacer, quals ) )
    #update ids
    ids.remove( id )

My Python implementation is over 3x faster than the C++ one by Pierre. This is due to using a hash-based set instead of an ordinary list.

#c++
time cat PL429_q10_filtered_1.fastq | ./parse PL429_q10_filtered_1.fastq.lids > PL429_q10_filtered_1.fastq.lids.fqcpp

real    4m21.137s
user    4m15.950s
sys 0m10.510s

#python (3.4x faster than c++)
time cat PL429_q10_filtered_1.fastq | python fastq_id_parser.py PL429_q10_filtered_1.fastq.lids > PL429_q10_filtered_1.fastq.fqSet

real    1m11.155s
user    1m5.270s
sys 0m10.920s

#python list instead of set (56x slower than python using set!)
time cat PL429_q10_filtered_1.fastq | python fastq_id_parser.noBioPython.noSet.py PL429_q10_filtered_1.fastq.lids > PL429_q10_filtered_1.fastq.lids.fqNoSet

real    70m54.534s
user    70m18.650s
sys 0m16.740s

#parsing file
time cat PL429_q10_filtered_1.fastq | wc -l
115959528

real    0m30.373s
user    0m1.990s
sys 0m8.250s

My test set was 10k reads: the first and last 5k headers of the fq. The fq was 4.9GB, 115,959,528 lines, with sequences between 31-75 bases. I used a 15k RPM SAS drive: read speed ~200MB/s, access time ~5ms.
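The set-versus-list gap above comes down to O(1) hash lookups versus O(n) scans. A small self-contained sketch to reproduce the effect (the sizes and names here are arbitrary):

```python
import timeit

names = ["read_%d" % i for i in range(100000)]
name_list, name_set = names, set(names)
probe = "read_99999"  # worst case for the list: scans all 100k entries

# time 100 membership tests against each container
t_list = timeit.timeit(lambda: probe in name_list, number=100)
t_set = timeit.timeit(lambda: probe in name_set, number=100)
print("list: %.4fs, set: %.4fs" % (t_list, t_set))
```

The ratio grows linearly with the size of the ID list, which is why the list-based variant above ended up 56x slower.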

ADD COMMENTlink modified 3.0 years ago by Istvan Albert ♦♦ 56k • written 4.0 years ago by Leszek3.3k
8

@lh3: I was bored today so I figured out what's causing the C++ program to be slow; it seems to be an issue with reading piped input through cin. If I change the input to use an ifstream on a file, Pierre's C++ program is about 2x faster than the Python program. If I also replace the std::set with a boost::unordered_set, the program ends up being about 4x faster than the Python one.

ADD REPLYlink written 4.0 years ago by Gww2.5k
3

3X ! I am terribly offended! :-)

ADD REPLYlink written 4.0 years ago by Pierre Lindenbaum73k

Sorry Pierre! I was really surprised, as C++ is considered the fastest language. Do you know of a hash-set implementation in C++? :)

ADD REPLYlink written 4.0 years ago by Leszek3.3k

Do you know the performance for a hashed list implementation in Perl? Best regards.

ADD REPLYlink written 4.0 years ago by David70

http://www.sgi.com/tech/stl/hash_set.html

ADD REPLYlink written 4.0 years ago by Aaronquinlan8.7k

Experienced C++ programmers would not use "iostream" for performance-critical tasks; it is very inefficient, and its "getline" is far slower than C's "fgets". Also, as others commented, "set" is much slower than "hash_set", which is now deprecated in favour of "unordered_set" in boost and the latest C++ spec. Perl has a very fast "getline" equivalent, and its hash table should come very close to C++'s. I do not like C++; it has many pitfalls for average programmers.

ADD REPLYlink written 4.0 years ago by lh324k

David, provide me with a full perl script, so I can run it.

ADD REPLYlink written 4.0 years ago by Leszek3.3k

@lh3 I agree with you. But I thought the STL would be enough for Biostar :-) !

ADD REPLYlink written 4.0 years ago by Pierre Lindenbaum73k

@Leszek, python sets have a remove method so you don't need to convert to a list (and back to a set). Also, you can create the ids like: ids = set(x.strip() for x in open(idfile))

ADD REPLYlink written 4.0 years ago by brentp20k

set.remove() is cool, brentp. I was not aware of that. it gains another 2 seconds;)

ADD REPLYlink written 4.0 years ago by Leszek3.3k

@Pierre. The problem is hash_set is sadly not part of STL and thus not portable even between gcc versions. IMHO, the major fault of STL is not to include hash_set/map. The SGI implementation is actually quite good. BTW, personally I do not quite buy boost and the new C++ spec.

ADD REPLYlink written 4.0 years ago by lh324k

@Pierre: I second that we should not make simple things too complicated. I just want to understand why C++ is much slower. BTW, I am not a fan of boost and the latest C++ spec.

ADD REPLYlink written 4.0 years ago by lh324k

@GWW: Thanks. The stdin issue is an interesting one. You may well be correct. When I tried C++'s getline() last time, I indeed fed input via stdin.

ADD REPLYlink written 4.0 years ago by lh324k

@lh3: nice work! If you are still bored... what about changing the loop condition so as to skip the "names.erase(r)" in every cycle of the loop, and clearing the set only before the "return 0"? It would be interesting to see whether we gain more performance from skipping an index-update operation on every hit or from faster retrieval due to the shrinking size of the index tree.

ADD REPLYlink written 4.0 years ago by ff.cc.cc1.1k

@Pierre and @lh3 the STL is enough. std::tr1::unordered_map std::tr1::unordered_set

ADD REPLYlink written 4.0 years ago by mylons120

@ffcccc: GWW did the hard work. I was just guessing. Credit to him.

ADD REPLYlink written 4.0 years ago by lh324k

@lh3, please give me the updated C++ code so I can update the results. BTW: I'm a bit sceptical about C++ being 4x faster than Python, as a simple cat is only a bit over 2x faster ;)

ADD REPLYlink written 4.0 years ago by Leszek3.3k

@Leszek: please see my answer below. @desaila: thanks for suggesting tr1. I almost forgot. More recent g++ has [?] in its standard library.

ADD REPLYlink written 3.9 years ago by lh324k
11
4.0 years ago by
brentp20k
Salt Lake City, UT
brentp20k wrote:

If you only have 10,000 reads to pull out, you can just iterate over the file once. For each record, check whether the current read header is in the list (or hash) of 10,000. Something like this:

use Bio::SeqIO;
use Bio::Seq::Quality;

# read in 10K headers into a hash with value of 1 or whatever.
# you'll have to add your own code to do this.
my %headers = ();

my $seqio = Bio::SeqIO->new('-format'=>'fastq' , '-file'=>'some.fastq');

my $out_fastq = Bio::SeqIO->new( -format => 'fastq', '-file'=> '>subset.fastq');

while((my $rec = $seqio->next_seq())) {
    # if the current record doesn't exist in the 10,000, skip it.
    if(! $headers{$rec->display_id} ) { next; }
    # otherwise, write it to the output file.
    $out_fastq->write_seq($rec);
}
ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by brentp20k
1

wouldn't it be better to say if (! exists $headers{})?

ADD REPLYlink written 4.0 years ago by Flies100
1

you mean (! exists $headers{$rec->display_id}) ? That is more explicit, but it won't make a difference.

ADD REPLYlink written 4.0 years ago by brentp20k

I took your advice, brentp, and started writing this Perl script. Weirdly, though, I discovered that BioPerl is VERY slow, so I decided to do the file parsing "manually" (assuming there are only 4 lines per sequence) and it turned out to be A LOT faster (more than 60 times faster)!

ADD REPLYlink written 4.0 years ago by Panos1.3k
9
4.0 years ago by
France
Pierre Lindenbaum73k wrote:

Here is my C++ solution:

#include <iostream>
#include <set>
#include <string>
#include <fstream>
#include <cstdlib>
using namespace std;


int main(int argc,char **argv)
    {
    string line;
    set<string> names;
    /* reads your names */
    ifstream in(argv[1],ios::in);
    if(!in.is_open()) return EXIT_FAILURE;
    while(getline(in,line,'\n'))
        {
        names.insert(line);
        }
    in.close();
    /* reads the fastq */
    string name;
    string seq;
    string name2;
    string qual;
    set<string>::iterator r;
    /* assuming 4 lines / read */
    while(!names.empty())
        {
        if(!getline(cin,name,'\n')) break;
        if(!getline(cin,seq,'\n')) break;
        if(!getline(cin,name2,'\n')) break;
        if(!getline(cin,qual,'\n')) break;
        r=names.find(name);
        if(r==names.end()) continue;
        names.erase(r);//names are unique we don't need this one anymore
        cout << name << "\n" << seq << "\n" << name2 << "\n" << qual << endl;
        }
    return 0;
    }

Compile:

  g++ -Wall -O3 -o parse parser.cpp

Example:

$ cat names.txt
@EAS54_6_R1_2_1_540_792
@EAS54_6_R1_2_1_443_348

$ gunzip -c my.fastq.gz 
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333

$ gunzip -c  my.fastq.gz | ./parse names.txt
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
ADD COMMENTlink modified 3.0 years ago by Istvan Albert ♦♦ 56k • written 4.0 years ago by Pierre Lindenbaum73k

This is terrific!!

ADD REPLYlink written 3.0 years ago by Manu Prestat3.3k
4
4.0 years ago by
Adrian580
Cambridge, MA
Adrian580 wrote:

If all you need to do is extract the sequences, you could also just use grep, which will likely be about as fast as writing your own parser.

You can tell it to print out the line after the read name matched (-A 1) and can specify the list of patterns to search for in a file.

ADD COMMENTlink written 4.0 years ago by Adrian580
2

I have used this method successfully many times. Put the IDs, one per line, in a text file called 'ids.txt' with the '@' before them. Assume your sequences are in 'seq.fq' (assume reads are on one line!)

% grep -f ids.txt -F --mmap -A 3 seq.fq | grep -v '^--' > 10000.fq

Done!

ADD REPLYlink written 3.7 years ago by Torst740

Dear Torst, when I use this syntax: grep -f tr1.txt -F --mmap -A 3 che.fastq | grep -v '^--' > 10000.fq, it dumps the entire file as-is without extracting the desired IDs' sequences. My IDs look like:

@SRR681003.7 SN603:5:1101:70.90:105.60 length=100

@SRR681003.1 SN603:5:1101:44.60:109.30 length=100

@SRR681003.8 SN603:5:1101:63.80:106.10 length=100

But when I do grep -e "@SRR681003.7 SN603:5:1101:70.90:105.60 length=100" -A 3 che.fastq > output.fastq it works perfectly. How do I make the -e "@SRR681003.7 SN603:5:1101:70.90:105.60 length=100" part work from id.txt? Please help me out.

ADD REPLYlink written 2.1 years ago by Raghav70
2
4.0 years ago by
Marvin620
Marvin620 wrote:

This doesn't really answer your question, but anyway: WTF would anyone store 70M reads in a text file and then try to access it as if it were a database?

I would convert the humongous FastQ to BAM. The conversion is lossless, saves space, and BAM/SAM are in fact easier to parse than FastQ. I don't know of a tool that would index BAM to retrieve reads by name, but if people start asking for that functionality, maybe someone will write such a tool.

ADD COMMENTlink written 4.0 years ago by Marvin620
2
3.1 years ago by
francesco.strozzi20 wrote:

Hi, I have written a fast parser for FastQ files. It's in Ruby but with a C extension underneath, so it's considerably faster than pure Ruby/Python/Perl code: https://github.com/fstrozzi/bioruby-faster .

Look at the Wiki for simple examples on how to use it https://github.com/fstrozzi/bioruby-faster/wiki . It's easy and you get back an array for each entry in the FastQ file, with the complete sequence header (ID + comment), the sequence and quality values.

You can then parse the file once and just filter the sequence ID of each record by searching it against a Set of IDs using for example Ruby methods like Set#include?

require "set"
my_ids = Set.new ["X","Y","Z"]
fastq = Bio::Faster.new("sequences.fastq")
fastq.each_record do |sequence_header, sequence, quality|
     seq_id = sequence_header.split("\s").first
     puts sequence_header, sequence, quality if my_ids.include? seq_id
end
ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by francesco.strozzi20
2
3.1 years ago by
lh324k
United States
lh324k wrote:

Sorry for a separate answer. It shows the table from my previous answer, which was malformatted during the stackexchange transition. I cannot edit my old answer either, as it exceeds the 5000-character limit. I will remove this answer once I can reformat the old one properly.

=========================================================================================
 Program        Lang       User  Sys   Mem/MB    Comments
-----------------------------------------------------------------------------------------
 cat            C           0.4  10.1     0.4    Unix cat
 wc -l          C           4.3  10.4     0.4    Unix wc -l
 fqextract.c    C          17.3  10.8     2.0    Hardcoded maximum line length: 4096
 Pierre-r2.cc   C++        19.5  12.2     3.8    Use tr1::unordered_set
 seqtk          C          19.9   8.9     4.1    Multi-line fastq; subregion; gzip input
 Pierre-r1.cc   C++        22.8  12.6     3.9    Replace cin with reading /dev/stdin
 wc             C          32.0  10.6     0.4    Unix wc (no command-line options)
 fqextract.py   Python     54.0   7.9     8.0    Same algorithm as fqextract.c
 fqextract.java Java       55.8   8.9   562.7    Same algorithm as fqextract.c
 fqextract.pl   Perl       71.5  10.5     5.5    Same algorithm as fqextract.c
 Leszek.py      Python     78.0  13.3     8.0    Infinite loop given improper input
 fqextract.lua  LuaJIT    109.6   8.6     6.1    Same algorithm as fqextract.c
 readfq.py      Python    122.9  11.3     8.1    Multi-line fastq
 bp-fqiter      Python    124.5   9.9    15.2    Multi-line fastq; biopython
 drio.rb        Ruby      130.1   6.9    10.0    Same algorithm as fqextract.c
 readfq.lua     Lua       189.5  11.5     6.0    Multi-line fastq
 readfq.pl      Perl      203.1  69.6     5.7    Multi-line fastq
 Pierre.cc      C++       323.6  13.3     3.9    Same algorithm as Leszek.py
 biopython.py   Python    928.8  34.2    15.6    Multi-line fastq
 bioperl.pl     Perl    >2400.0         >15.0    Multi-line fastq; killed
 grep -f -A     C       >2400.0        >414.0    Killed; possible false positives
=========================================================================================
ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by lh324k
0
4.0 years ago by
Sequencegeek660
UCLA
Sequencegeek660 wrote:

You could also use a binary search. I have a library in Python that will only require you to write a small function to adapt to your file type...

ADD COMMENTlink written 4.0 years ago by Sequencegeek660
0
3.5 years ago by
User 76910
User 76910 wrote:

Did anyone try it out with:

  * Go: http://golang.org/
  * D: http://www.d-programming-language.org/
  * Vala: http://live.gnome.org/Vala

ADD COMMENTlink written 3.5 years ago by User 76910
0
3.5 years ago by
Ric60
Australia
Ric60 wrote:

Why does the Java solution use so much more memory? Is it just because of the JVM?

ADD COMMENTlink written 3.5 years ago by Ric60

that's a good question! But that should be a comment of the related answer.

ADD REPLYlink written 2.7 years ago by Manu Prestat3.3k
0
3.4 years ago by
Kevin470
Kevin470 wrote:

Just a thought: conceptually, Map/Reduce in Hadoop might be super fast at doing this.
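The job does map cleanly onto the MapReduce model: each mapper streams a slice of the FASTQ with the wanted IDs held in memory and emits matching records, and no reduce step is needed. A pure-Python sketch of the mapper logic only (not actual Hadoop code; with Hadoop Streaming this loop would read sys.stdin and print matches):

```python
def fastq_mapper(lines, wanted):
    """Mapper: stream 4-line FASTQ records, emit those whose header is wanted."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if record[0] in wanted:
                yield record
            record = []
```

Whether the cluster overhead pays off for a single 19GB file is another question; the single-machine solutions in this thread already run in minutes.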

ADD COMMENTlink written 3.4 years ago by Kevin470
0
2.7 years ago by
Pals1.2k
Finland
Pals1.2k wrote:

Very simple solution using biopython:

def batch_iterator(iterator, batch_size):
    entry = True 
    while entry :
        batch = []
        while len(batch) < batch_size :
            try :
                entry = iterator.next()
            except StopIteration :
                entry = None
            if entry is None :
                break
            batch.append(entry)
        if batch :
            yield batch

from Bio import SeqIO
record_iter = SeqIO.parse(open("filename.fastq"), "fastq")
batch_size = 10000  # number of records per output file
for i, batch in enumerate(batch_iterator(record_iter, batch_size)) :
    filename = "group_%i.fastq" % (i+1)
    handle = open(filename, "w")
    count = SeqIO.write(batch, handle, "fastq")
    handle.close()
    print "Wrote %i records to %s" % (count, filename)

source: BiopythonWiki

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by Pals1.2k
0
14 months ago by
BioApps480
Spain
BioApps480 wrote:

> Other than this, there's also a BioPerl module (Bio::Index::fastq) which I didn't try because I read somewhere that you can run into the same problem when dealing with huge fastq files.

The idea is that you should never try to load that huge file into RAM (you could only do it if the file were no more than half the installed amount of RAM). Just enumerate each record and see if it meets your criteria (the name you want). When you have found a record, don't keep it in RAM; write it to disk in a new file. An app built like this (C: Efficiently process (view, analyze, clip ends, convert, demultiplex, dereplicate)) would require way under 10MB of RAM (with GUI included in those 10MB).

If you still need this, I can build it for you.

ADD COMMENTlink modified 10 months ago • written 14 months ago by BioApps480

RAM is not the issue with cdbfasta or BioPerl's Bio::Index* modules. cdbfasta is hardcoded to only construct indices up to 4 GB, so it won't work with a lot of sequences. Using the Bio::Index modules with tens of millions of sequences is theoretically possible, but it just doesn't scale well, so your script would seemingly take forever to finish. Both methods construct indices on-disk, and to be fair, they were written more than 10 years ago and were very relevant/useful at the time. These are not good approaches for modern Illumina data sets though.

ADD REPLYlink written 13 months ago by SES6.2k
0
10 months ago by
Finland
daniel.nicorici0 wrote:

Another really fast Python implementation, which can also deal with very large input lists of read IDs (e.g. 200 million read IDs) that do not fit in memory at once. It also uses frozenset for speed/memory reasons.

https://code.google.com/p/fusioncatcher/source/browse/extract_short_reads.py
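The trick of chunking the ID list (rather than the FASTQ) can be sketched as follows. This is only an illustration of the idea, not fusioncatcher's actual code, and `chunk_size` would be tuned to available memory:

```python
from itertools import islice

def extract_in_chunks(ids_path, fastq_path, out_path, chunk_size=1000000):
    """Keep memory bounded: load the ID list chunk by chunk and rescan
    the FASTQ once per chunk (4-line records assumed)."""
    with open(ids_path) as ids_fh, open(out_path, "w") as out:
        while True:
            # next chunk_size IDs; frozenset gives cheap, read-only lookups
            chunk = frozenset(l.strip() for l in islice(ids_fh, chunk_size))
            if not chunk:
                break
            with open(fastq_path) as fq:
                keep = False
                for lineno, line in enumerate(fq):
                    if lineno % 4 == 0:
                        keep = line.rstrip("\n") in chunk
                    if keep:
                        out.write(line)
```

The trade-off is one full FASTQ scan per ID chunk, and output records are grouped by chunk rather than strictly by file position.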

ADD COMMENTlink modified 10 months ago • written 10 months ago by daniel.nicorici0
Powered by Biostar version 2.3.0