Question: How To Efficiently Parse A Huge Fastq File?
20
2.8 years ago by
Panos1.2k
Geneva, Switzerland
Panos1.2k wrote:

I have a 19GB fastq file (~70 million reads) and I want to find ~10,000 reads (by looking for their name) and pull out their sequence and quality. Looping through the file 10,000 times is obviously not a solution as it would easily take quite a few days!

By searching the net, I found cdbfasta, which, however, couldn't index such a big file: the index grew really large and the program crashed. Other than this, there's also a BioPerl module (Bio::Index::fastq) which I didn't try, because I read somewhere that you can run into the same problem with huge fastq files.

How do you usually deal with such huge fastq files?

ADD COMMENTlink modified 2.6 years ago by Pals1.1k • written 2.8 years ago by Panos1.2k
1

You can also have a look at this post: NGS - Huge (FASTQ) file parsing - which language for good efficiency?

ADD REPLYlink modified 19 months ago by Istvan Albert ♦♦ 39k • written 2.8 years ago by Tony1.7k
1

same as before... Python :) my 2 cents ;) BTW: check how many times longer the Perl implementation takes than wc -l; the Python implementation takes ~2x longer, C++ ~8x longer, and Python using a list 140x longer. So for 70M reads the differences might be hours :P

ADD REPLYlink written 2.8 years ago by Leszek2.9k

Thanks all of you guys for the answers and for your time! I think I'll go for the Perl-based solution (since I don't know Python or C++). Generally speaking, if instead of having just 10K reads to look for, I had 100K or millions of them, what would be the best solution?

ADD REPLYlink written 2.8 years ago by Panos1.2k

I have a visual app that shows all reads in a FastQ file. It takes about 100 seconds to open a 0.5GB file (3.3 million reads) on an old Toshiba laptop (but when you open the file the next time, it takes below 1 sec). I would like to test it on a 19GB file, but I don't have such a file.

So, is this fast or slow? I couldn't find any real numbers on this forum to compare mine.

ADD REPLYlink modified 10 hours ago • written 10 hours ago by BioApps0

Waiting 100 seconds in a graphical user interface just to open a file is too long. A common mistake people make is thinking that one needs to process the entire file in order to characterize it.

Read in 100K lines and report on that; then continue the loading in the background and update the results.
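A minimal sketch of this incremental approach in Python (a hypothetical illustration, not code from this thread; the fastq_records helper and the sample data are my own):

```python
from io import StringIO

def fastq_records(handle):
    """Lazily yield (header, seq, plus, qual) tuples from a 4-line FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        yield header.rstrip(), seq.rstrip(), plus.rstrip(), qual.rstrip()

# Tiny in-memory stand-in for a huge file.
handle = StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n")
records = fastq_records(handle)

# Report on the first records right away...
preview = [next(records) for _ in range(1)]
# ...then keep consuming the same generator later (e.g. from a background
# thread) and update the displayed statistics as more records arrive.
rest = list(records)
```

Because the generator is lazy, the first report costs only as much I/O as the preview itself.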

ADD REPLYlink written 10 hours ago by Istvan Albert ♦♦ 39k

You need 100 seconds only when you open the file for the first time; after that it's less than 1 sec. Also, no optimizations have been applied yet. I hope I can cut the time in half.

Anyway, your idea is great. I will do that.
 

ADD REPLYlink modified 10 hours ago • written 10 hours ago by BioApps0
41
2.7 years ago by
lh320k
lh320k wrote:

As I routinely waste time on benchmarks... The take-home message is C/C++ still takes the crown of speed, four times as fast as Perl/Python. Among the following programs, only seqtk and brentp's bioperl.pl work with multi-line fastq. The rest assume 4-line fastq. [PS: Python-2.7.2, Ruby-1.9.2p290, LuaJIT-beta8, Java-1.6.0_25, Perl-5.8.8 and gcc-4.1.2. i.e. latest python/ruby/luajit/java but years old C/Perl]

=========================================================================================
 Program        Lang       User  Sys   Mem/MB    Comments
-----------------------------------------------------------------------------------------
 cat            C           0.4  10.1     0.4    Unix cat
 wc -l          C           4.3  10.4     0.4    Unix wc -l
 fqextract.c    C          17.3  10.8     2.0    Hardcoded maximum line length: 4096
 Pierre-r2.cc   C++        19.5  12.2     3.8    Use tr1::unordered_set
 seqtk          C          19.9   8.9     4.1    Multi-line fastq; subregion; gzip input
 Pierre-r1.cc   C++        22.8  12.6     3.9    Replace cin with reading /dev/stdin
 wc             C          32.0  10.6     0.4    Unix wc (no command-line options)
 fqextract.py   Python     54.0   7.9     8.0    Same algorithm as fqextract.c
 fqextract.java Java       55.8   8.9   562.7    Same algorithm as fqextract.c
 fqextract.pl   Perl       71.5  10.5     5.5    Same algorithm as fqextract.c
 Leszek.py      Python     78.0  13.3     8.0    Infinite loop given improper input
 fqextract.lua  LuaJIT    109.6   8.6     6.1    Same algorithm as fqextract.c
 readfq.py      Python    122.9  11.3     8.1    Multi-line fastq
 bp-fqiter      Python    124.5   9.9    15.2    Multi-line fastq; biopython
 drio.rb        Ruby      130.1   6.9    10.0    Same algorithm as fqextract.c
 readfq.lua     Lua       189.5  11.5     6.0    Multi-line fastq
 readfq.pl      Perl      203.1  69.6     5.7    Multi-line fastq
 Pierre.cc      C++       323.6  13.3     3.9    Same algorithm as Leszek.py
 biopython.py   Python    928.8  34.2    15.6    Multi-line fastq
 bioperl.pl     Perl    >2400.0         >15.0    Multi-line fastq; killed
 grep -f -A     C       >2400.0        >414.0    Killed; possible false positives
=========================================================================================


#Discussions#

  1. Readfq is my implementation of a unified multi-line fasta/fastq parser in Perl, Python and Lua.

  2. It is a little surprising that C/C++ implementations greatly outperform Perl and Python implementations. I would expect the difference to be less than 2-fold.

  3. Pierre.cc and Leszek.py may go into an infinite loop when the input list contains names absent from the fastq file. The strategy used in fqextract is recommended. It is also faster, at least for python.

  4. BioPerl's fastq parser is impractically slow. Readfq.pl also parses multi-line fastq but is far faster. BioPython's fastq parser is more efficient than BioPerl's, but still not efficient enough: readfq.py is 7 times faster.

  5. Unix cat is insanely fast. Unix "wc -l" (line counting only) is 8X faster than "wc" without options (which also counts words and bytes). As I remember, Unix fgrep (or grep -f) builds a finite-state automaton for the query strings. The algorithm is close to optimal for general use, but as fgrep ignores the structure of fastq, it is far slower than a fastq-aware program.


#Programs#
fqextract.c

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "khash.h" // khash.h from samtools/bwa/seqtk/klib
KHASH_SET_INIT_STR(s)

#define BUF_SIZE 4096

int main(int argc, char *argv[])
{
    char buf[BUF_SIZE];
    FILE *fp;
    int lineno = 0, flag = 0, ret;
    khash_t(s) *h;
    if (argc == 1) {
        fprintf(stderr, "Usage: cat in.fq | fqextract <name.lst>\n");
        return 1;
    }
    h = kh_init(s);
    fp = fopen(argv[1], "rb"); // FIXME: check fp
    while (fgets(buf, BUF_SIZE, fp))
        kh_put(s, h, strdup(buf), &ret); // FIXME: check ret
    fclose(fp);
    while (fgets(buf, BUF_SIZE, stdin)) {
        if (++lineno%4 == 1)
            flag = (kh_get(s, h, buf + 1) != kh_end(h));
        if (flag) fputs(buf, stdout);
    }
    kh_destroy(s, h); // FIXME: free keys before destroy
    return 0;
}

Pierre-r2.cc

#include <iostream>
#include <string>
#include <fstream>
#include <cstdlib>
#include <tr1/unordered_set> // use <unordered_set> for more recent g++
using namespace std;

int main(int argc,char **argv)
{
    string line;
    tr1::unordered_set<string> names;
    ifstream in(argv[1],ios::in);
    if(!in.is_open()) return EXIT_FAILURE;
    while(getline(in,line,'\n'))
        names.insert(line);
    in.close();
    string name, seq, name2, qual;
    tr1::unordered_set<string>::iterator r;
    ifstream in2("/dev/stdin", ios::in);
    while(!names.empty())
    {
        if(!getline(in2,name,'\n')) break;
        if(!getline(in2,seq,'\n')) break;
        if(!getline(in2,name2,'\n')) break;
        if(!getline(in2,qual,'\n')) break;
        r=names.find(name);
        if(r==names.end()) continue;
        names.erase(r);//names are unique we don't need this one anymore
        cout << name << "\n" << seq << "\n" << name2 << "\n" << qual << endl;
    }
    in2.close();
    return 0;
}

fqextract.java

import java.util.HashSet;
import java.io.*;

class fqextract {
    public static void main(String[] args) throws IOException {
        BufferedReader fp = new BufferedReader(new FileReader(args[0]));
        HashSet<String> h = new HashSet<String>();
        String l;
        while ((l = fp.readLine()) != null) h.add('@' + l);
        fp.close();
        int lineno = 0;
        boolean flag = false;
        fp = new BufferedReader(new InputStreamReader(System.in));
        while ((l = fp.readLine()) != null) {
            if (++lineno % 4 == 1) flag = h.contains(l);
            if (flag) System.out.println(l);
        }
    }
}

seqtk (no listing here; e.g. seqtk subseq in.fq name.lst > out.fq extracts the named reads)

fqextract.pl

die("Usage: cat in.fq | fqextract.pl <name.lst>\n") if (@ARGV == 0);
my $fn = shift(@ARGV);
my %hash = ();
open(FILE, $fn) || die;
$hash{'@'.$_} = 1 while (<FILE>);
close(FILE);
my $flag = 0;
while (<>) {
    $flag = defined($hash{$_})? 1 : 0 if ($.%4 == 1);
    print if ($flag);
}

fqextract.py

import sys 
ids, lineno, flag = set('@' + x for x in open(sys.argv[1])), 0, False
for l in sys.stdin:
    lineno += 1
    if lineno%4 == 1: flag = (l in ids)
    if flag: print l,

biopython.py

from Bio import SeqIO
import sys
ids = set(x[:-1] for x in open(sys.argv[1]))
for seq in SeqIO.parse(sys.stdin, "fastq"):
    if seq.id in ids: print seq.format("fastq"),

bp-fqiter.py

from Bio.SeqIO.QualityIO import FastqGeneralIterator
import sys 
ids = set(x[:-1] for x in open(sys.argv[1]))
for name, seq, qual in FastqGeneralIterator(sys.stdin):
    if (name in ids): print("@%s\n%s\n+\n%s" % (name, seq, qual))

EDIT1 by lh3: Optimized kseq.h for faster FASTQ parsing, though the parser is a little less robust than the previous version (fine for nearly all files). Added cat/wc/grep. Added more discussions.

EDIT2 by lh3: Added Ruby-1.9.2p290

EDIT3 by lh3: Added LuaJIT-beta8. Lua-5.1.4 is 20% slower. Added fqextract.py and fqextract.java

EDIT4 by lh3: Added biopython and readfq parsers.

EDIT5 by lh3: Added BioPython.FastqGeneralIterator (see also here). This parser is as fast as my readfq.py.

ADD COMMENTlink modified 20 months ago by Istvan Albert ♦♦ 39k • written 2.7 years ago by lh320k
3

The current Biopython example uses the SeqRecord format method, which is not recommended for speed. The documentation and examples try to be clear about this. Instead, use Bio.SeqIO.write() as in

https://gist.github.com/1201025

which should be much faster.

Also, in an example like this where you don't need all the objects created, a low-level solution as in

https://gist.github.com/1201033

would be about as good as I would expect using the Biopython parser. See also http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

If you assume 4 lines per sequence things are different.

ADD REPLYlink written 2.6 years ago by Peter3.8k
2

lh3, can you please add this ruby implementation to the benchmark list?: https://gist.github.com/1154539

ADD REPLYlink written 2.7 years ago by Drio860
2

Added. Thanks a lot.

ADD REPLYlink written 2.7 years ago by lh320k

I have implemented a go version. Could you please add it to your benchmarks and update when you have a chance? Thanks!

ADD REPLYlink modified 21 months ago • written 21 months ago by Drio860
2

Your Biopython example suffers from you using the sequence's format method. The docs do recommend you use SeqIO.write() instead. Something like this should be much faster:

from Bio import SeqIO
import sys
ids = set(x[:-1] for x in open(sys.argv[1]))
wanted = (rec for rec in SeqIO.parse(sys.stdin, "fastq") if rec.id in ids)
SeqIO.write(wanted, sys.stdout, "fastq")

Of course, if that's all you're doing there is no point making all those objects. See http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

ADD REPLYlink written 2.6 years ago by Peter3.8k

very informative post. thx!

ADD REPLYlink written 2.7 years ago by 2184687-1231-83-4.5k

Here is a lua version: https://gist.github.com/1159195 If you add it to the list could you please mention what version of lua you use? Ruby's performance is horrible. Perhaps you used MRI1.8 ?

ADD REPLYlink written 2.7 years ago by Drio860


I used the latest "official" ruby compiled from source code. I would not be surprised that ruby is slower. Perl is designed for text processing. My impression is also that for text processing, Perl is more efficient than other language implementations.

ADD REPLYlink written 2.7 years ago by lh320k

I tried to improve the speed of the C version by using mmap(2). My first attempt failed (https://gist.github.com/1168330). I think there are two reasons: extra overhead is introduced in the logic because you get a pointer to memory instead of a file descriptor, and reading from stdin turned out to be faster than reading directly from a file.

ADD REPLYlink written 2.7 years ago by Drio860

All hail the king of benchmarks!

ADD REPLYlink written 2.6 years ago by Aaronquinlan7.3k
13
2.8 years ago by
Leszek2.9k
Barcelona, Spain
Leszek2.9k wrote:

I would use python (no dependencies):
1. read your read names into a set (sets are hash-based, so checking for the presence of an element is much faster than in a list)
2. parse your fastq file and check whether each name is present in the set of read names of interest

#!/usr/bin/python
import sys

#get fname from parameter
idfile=sys.argv[1]

#load ids
ids = set( x.strip() for x in open(idfile) )

#read stdid 
handle=sys.stdin  

while ids:
  #parse fastq
  idline=handle.readline()
  seq   =handle.readline()
  spacer=handle.readline()
  quals =handle.readline()
  #check
  id=idline[:-1]
  if id in ids:
    #print fastq
    sys.stdout.write( '%s%s%s%s' % ( idline, seq, spacer, quals ) )
    #update ids
    ids.remove( id )

My Python implementation is over 3x faster than the C++ by Pierre. That's thanks to using a hash-based set instead of a plain list.

#c++
time cat PL429_q10_filtered_1.fastq | ./parse PL429_q10_filtered_1.fastq.lids > PL429_q10_filtered_1.fastq.lids.fqcpp

real    4m21.137s
user    4m15.950s
sys 0m10.510s

#python (3.4x faster than c++)
time cat PL429_q10_filtered_1.fastq | python fastq_id_parser.py PL429_q10_filtered_1.fastq.lids > PL429_q10_filtered_1.fastq.fqSet

real    1m11.155s
user    1m5.270s
sys 0m10.920s

#python list instead of set (56x slower than python using set!)
time cat PL429_q10_filtered_1.fastq | python fastq_id_parser.noBioPython.noSet.py PL429_q10_filtered_1.fastq.lids > PL429_q10_filtered_1.fastq.lids.fqNoSet

real    70m54.534s
user    70m18.650s
sys 0m16.740s

#parsing file
time cat PL429_q10_filtered_1.fastq | wc -l
115959528

real    0m30.373s
user    0m1.990s
sys 0m8.250s

My test set was 10k names: the first and last 5k headers of the fq. The fq was 4.9GB, 115,959,528 lines, with sequences between 31-75 bases. I used a 15K RPM SAS drive - read speed ~200MB/s, access time ~5ms.

ADD COMMENTlink modified 20 months ago by Istvan Albert ♦♦ 39k • written 2.8 years ago by Leszek2.9k
7

@lh3: I was bored today so I figured out what's causing the C++ program to be slow; it seems to be an issue with piping the input through cin. If I change the input to use an ifstream reading a file, Pierre's C++ program is about 2x faster than the Python program. If I also replace the std::set with a boost::unordered_set, the program ends up being about 4x faster than the Python one.

ADD REPLYlink written 2.7 years ago by Gww2.4k
3

3X ! I am terribly offended! :-)

ADD REPLYlink written 2.8 years ago by Pierre Lindenbaum58k

Sorry Pierre! I was really surprised, as C++ is considered the fastest language. Do you know of a hashed set implementation in C++? :)

ADD REPLYlink written 2.8 years ago by Leszek2.9k

Do you know the performance for a hashed list implementation in Perl? Best regards.

ADD REPLYlink written 2.8 years ago by David70

http://www.sgi.com/tech/stl/hash_set.html

ADD REPLYlink written 2.8 years ago by Aaronquinlan7.3k


Experienced C++ programmers would not use "iostream" for performance critical tasks. It is very inefficient. Its "getline" is far slower than C's "fgets" (though C's printf is a famous pitfall, too). Also, as others commented, "set" is much slower than "hash_set", which is now deprecated by "unordered_set" in boost or the latest C++ spec. Perl has a very fast "getline" equivalence. Its hash table should come very close to C++. I do not like C++. It has many pitfalls to average programmers.

ADD REPLYlink written 2.8 years ago by lh320k

David, provide me with a full perl script, so I can run it.

ADD REPLYlink written 2.8 years ago by Leszek2.9k

@lh3 I agree with you. But I thought the STL would be enough for Biostar :-)!

ADD REPLYlink written 2.8 years ago by Pierre Lindenbaum58k

@Leszek, python sets have a remove method so you don't need to convert to a list (and then back to a set). Also, you can create the ids like: ids = set(x.strip() for x in open(idfile))

ADD REPLYlink written 2.8 years ago by brentp17k


set.remove() is cool, brentp. I was not aware of that. it gains another 2-3 seconds;)

ADD REPLYlink written 2.8 years ago by Leszek2.9k

@Pierre. The problem is hash_set is sadly not part of STL and thus not portable even between gcc versions. IMHO, the major fault of STL is not to include hash_set/map. The SGI implementation is actually quite good. BTW, personally I do not quite buy boost and the new C++ spec.

ADD REPLYlink written 2.8 years ago by lh320k


@Pierre: I second that we should not make simple things too complicated. I just want to understand why C++ is much slower. BTW, I am not a fan of boost and the latest C++ spec.

ADD REPLYlink written 2.8 years ago by lh320k

@GWW: Thanks. The stdin issue is an interesting one. You may well be correct. When I tried C++'s getline() last time, I indeed fed input via stdin.

ADD REPLYlink written 2.7 years ago by lh320k


@lh3: nice work! If you are still bored... what about changing the loop condition so as to skip the "names.erase(r)" in every cycle of the loop and clear the set only before the "return 0"? It would be interesting to see whether we gain more performance from skipping an index-update operation or from faster retrieval as the index tree shrinks.

ADD REPLYlink written 2.7 years ago by ff.cc.cc1.1k

@Pierre and @lh3 the STL is enough. std::tr1::unordered_map std::tr1::unordered_set

ADD REPLYlink written 2.7 years ago by mylons120

@ffcccc: GWW did the hard work. I was just guessing. Credit to him.

ADD REPLYlink written 2.7 years ago by lh320k

@lh3 please give me the updated C++ code, so I can update the results. BTW: I'm a bit skeptical about C++ being 4x faster than Python, as a simple cat is only a bit over 2x faster ;)

ADD REPLYlink written 2.7 years ago by Leszek2.9k

@Leszek: please see my answer below. @desaila: thanks for suggesting tr1; I almost forgot. More recent g++ has unordered_set in its standard library.

ADD REPLYlink written 2.7 years ago by lh320k
11
2.8 years ago by
brentp17k
Denver, Colorado
brentp17k wrote:

if you have only 10,000 reads you want to pull out, you can just iterate over the file once. For each record you check if the current read header is in the list (or hash) of 10,000. Something like this:

use Bio::SeqIO;
use Bio::Seq::Quality;

# read in 10K headers into a hash with value of 1 or whatever.
# you'll have to add your own code to do this.
my %headers = ();

$seqio  = Bio::SeqIO->new('-format'=>'fastq' , '-file'=>'some.fasq');

my $out_fastq = Bio::SeqIO->new( -format => 'fastq', '-file'=> 'subset.fastq');

while((my $rec = $seqio->next_seq())) {
    # if the current record doesn't exist in the 10,000, skip it.
    if(! $headers{$rec->display_id} ) { next; }
    # otherwise, write it to the output file.
    $out_fastq->write_seq($rec);
}
ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by brentp17k
1

wouldn't it be better to say if (! exists $headers{})?

ADD REPLYlink written 2.8 years ago by Flies100
1

you mean (! exists $headers{$rec->display_id}) ? That is more explicit, but it won't make a difference.

ADD REPLYlink written 2.8 years ago by brentp17k

I took your advice, brentp, and started writing this Perl script. Weirdly, though, I discovered that BioPerl is VERY slow, so I decided to do the file parsing "manually" (assuming exactly 4 lines per sequence), and it turned out to be A LOT faster (more than 60 times faster)!

ADD REPLYlink written 2.8 years ago by Panos1.2k
9
2.8 years ago by
France
Pierre Lindenbaum58k wrote:

Here is my C++ solution:

#include <iostream>
#include <set>
#include <string>
#include <fstream>
#include <cstdlib>
using namespace std;


int main(int argc,char **argv)
    {
    string line;
    set<string> names;
    /* reads your names */
    ifstream in(argv[1],ios::in);
    if(!in.is_open()) return EXIT_FAILURE;
    while(getline(in,line,'\n'))
        {
        names.insert(line);
        }
    in.close();
    /* reads the fastq */
    string name;
    string seq;
    string name2;
    string qual;
    set<string>::iterator r;
    /* assuming 4 lines / read */
    while(!names.empty())
        {
        if(!getline(cin,name,'\n')) break;
        if(!getline(cin,seq,'\n')) break;
        if(!getline(cin,name2,'\n')) break;
        if(!getline(cin,qual,'\n')) break;
        r=names.find(name);
        if(r==names.end()) continue;
        names.erase(r);//names are unique we don't need this one anymore
        cout << name << "\n" << seq << "\n" << name2 << "\n" << qual << endl;
        }
    return 0;
    }

Compile:

  g++ -Wall -O3 -o parse parser.cpp

Example:

$ cat names.txt
@EAS54_6_R1_2_1_540_792
@EAS54_6_R1_2_1_443_348

$ gunzip -c my.fastq.gz 
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333

$ gunzip -c  my.fastq.gz | ./parse names.txt
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
ADD COMMENTlink modified 20 months ago by Istvan Albert ♦♦ 39k • written 2.8 years ago by Pierre Lindenbaum58k

This is terrific!!

ADD REPLYlink written 20 months ago by Manu Prestat2.8k
4
2.7 years ago by
Adrian490
Cambridge, MA
Adrian490 wrote:

If all you need to do is extract the sequences, you could also just use grep, which will likely be about as fast as writing your own parser.

You can tell it to print out the line after the read name matched (-A 1) and can specify the list of patterns to search for in a file.

ADD COMMENTlink written 2.7 years ago by Adrian490
2

I have used this method successfully many times. Put the IDs, one per line, in a text file called 'ids.txt' with the '@' before them. Assume your sequences are in 'seq.fq' (assume reads are on one line!)

% grep -f ids.txt -F --mmap -A 3 seq.fq | grep -v '^--' > 10000.fq

Done!

ADD REPLYlink written 2.5 years ago by Torst590

Dear Torst, when I use this syntax: grep -f tr1.txt -F --mmap -A 3 che.fastq | grep -v '^--' > 10000.fq, it dumps the entire file without extracting the desired IDs' sequences. My IDs look like:

@SRR681003.7 SN603:5:1101:70.90:105.60 length=100

@SRR681003.1 SN603:5:1101:44.60:109.30 length=100

@SRR681003.8 SN603:5:1101:63.80:106.10 length=100

But when I do grep -e "@SRR681003.7 SN603:5:1101:70.90:105.60 length=100" -A 3 che.fastq > output.fastq it works perfectly. How can I apply the -e "@SRR681003.7 SN603:5:1101:70.90:105.60 length=100" form to every ID in id.txt? Please help me out.

ADD REPLYlink written 10 months ago by raghvendra1986singh60
2
gravatar for francesco.strozzi
22 months ago by
francesco.strozzi20 wrote:

Hi, I have written a fast parser for FastQ files. It's in Ruby but with a C extension underneath, so it's considerably faster than pure Ruby/Python/Perl code: https://github.com/fstrozzi/bioruby-faster

Look at the Wiki for simple examples on how to use it https://github.com/fstrozzi/bioruby-faster/wiki . It's easy and you get back an array for each entry in the FastQ file, with the complete sequence header (ID + comment), the sequence and quality values.

You can then parse the file once and filter each record's sequence ID against a Set of IDs, using for example Ruby's Set#include? method:

require "set"  # Set lives in Ruby's stdlib and must be required

my_ids = Set.new ["X", "Y", "Z"]
fastq = Bio::Faster.new("sequences.fastq")
fastq.each_record do |sequence_header, sequence, quality|
     seq_id = sequence_header.split("\s").first
     puts sequence_header, sequence, quality if my_ids.include? seq_id
end
ADD COMMENTlink modified 22 months ago • written 22 months ago by francesco.strozzi20
1
2.7 years ago by
Marvin540
Marvin540 wrote:

This doesn't really answer your question, but anyway: WTF would anyone store 70M reads in a text file and then try to access it as if it was a database?

I would convert the humongous FastQ to BAM. The conversion is lossless, saves space, and BAM/SAM are in fact easier to parse than FastQ. I don't know of a tool that would index BAM to retrieve reads by name, but if people start asking for that functionality, maybe someone will write such a tool.

ADD COMMENTlink written 2.7 years ago by Marvin540
1
22 months ago by
lh320k
lh320k wrote:

Sorry for a separate answer; it shows the table from my previous answer, which got malformatted during the StackExchange transition. I cannot edit my old answer either, as it exceeds the 5000-character limit. I will remove this answer once I can reformat my old one properly.

=========================================================================================
 Program        Lang       User  Sys   Mem/MB    Comments
-----------------------------------------------------------------------------------------
 cat            C           0.4  10.1     0.4    Unix cat
 wc -l          C           4.3  10.4     0.4    Unix wc -l
 fqextract.c    C          17.3  10.8     2.0    Hardcoded maximum line length: 4096
 Pierre-r2.cc   C++        19.5  12.2     3.8    Use tr1::unordered_set
 seqtk          C          19.9   8.9     4.1    Multi-line fastq; subregion; gzip input
 Pierre-r1.cc   C++        22.8  12.6     3.9    Replace cin with reading /dev/stdin
 wc             C          32.0  10.6     0.4    Unix wc (no command-line options)
 fqextract.py   Python     54.0   7.9     8.0    Same algorithm as fqextract.c
 fqextract.java Java       55.8   8.9   562.7    Same algorithm as fqextract.c
 fqextract.pl   Perl       71.5  10.5     5.5    Same algorithm as fqextract.c
 Leszek.py      Python     78.0  13.3     8.0    Infinite loop given improper input
 fqextract.lua  LuaJIT    109.6   8.6     6.1    Same algorithm as fqextract.c
 readfq.py      Python    122.9  11.3     8.1    Multi-line fastq
 bp-fqiter      Python    124.5   9.9    15.2    Multi-line fastq; biopython
 drio.rb        Ruby      130.1   6.9    10.0    Same algorithm as fqextract.c
 readfq.lua     Lua       189.5  11.5     6.0    Multi-line fastq
 readfq.pl      Perl      203.1  69.6     5.7    Multi-line fastq
 Pierre.cc      C++       323.6  13.3     3.9    Same algorithm as Leszek.py
 biopython.py   Python    928.8  34.2    15.6    Multi-line fastq
 bioperl.pl     Perl    >2400.0         >15.0    Multi-line fastq; killed
 grep -f -A     C       >2400.0        >414.0    Killed; possible false positives
=========================================================================================
ADD COMMENTlink modified 22 months ago • written 22 months ago by lh320k
0
2.7 years ago by
Sequencegeek640
UCLA
Sequencegeek640 wrote:

You could also use a binary search. I have a Python library that only requires you to write a small function to adapt it to your file type...
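The library itself isn't shown here; a minimal sketch of the binary-search idea (my own illustration, assuming a one-pass index of (name, byte offset) pairs kept sorted by name) could look like:

```python
import bisect
from io import StringIO

def build_index(handle):
    """One pass over a 4-line FASTQ stream: collect (read name, byte offset), sorted by name."""
    index, offset = [], 0
    while True:
        name = handle.readline()
        if not name:
            break
        start = offset
        offset += len(name)
        for _ in range(3):  # sequence, '+', quality lines
            offset += len(handle.readline())
        index.append((name.rstrip(), start))
    index.sort()
    return index

def lookup(index, read_name):
    """Binary-search the sorted index; return the record's byte offset, or None."""
    i = bisect.bisect_left(index, (read_name, 0))
    if i < len(index) and index[i][0] == read_name:
        return index[i][1]
    return None

fq = "@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n"
idx = build_index(StringIO(fq))
# lookup(idx, "@r2") gives the offset to seek() to in the real file.
```

With the index saved to disk, each of the ~10,000 lookups costs O(log n) comparisons plus one seek, instead of a full scan.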

ADD COMMENTlink written 2.7 years ago by Sequencegeek640
0
2.3 years ago by
User 76910
User 76910 wrote:

Did anyone try it out with:

* Go = http://golang.org/
* D = http://www.d-programming-language.org/
* Vala = http://live.gnome.org/Vala

ADD COMMENTlink written 2.3 years ago by User 76910
0
2.3 years ago by
Ric40
Ric40 wrote:

Why does the Java solution use so much more memory? Is it just because of the JVM?

ADD COMMENTlink written 2.3 years ago by Ric40

That's a good question! But it should be a comment on the related answer.

ADD REPLYlink written 17 months ago by Manu Prestat2.8k
0
2.2 years ago by
Kevin440
Kevin440 wrote:

Just a thought: conceptually, I think MapReduce in Hadoop might be super fast at doing this.
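As a hypothetical sketch (plain Python, not actual Hadoop code; the function names are my own): the map step would filter each chunk of 4-line records against the wanted-ID set, and the reduce step would concatenate the survivors:

```python
def map_filter(chunk, wanted):
    """Map step: keep 4-line FASTQ records whose header line is in the wanted set."""
    kept = []
    for i in range(0, len(chunk), 4):
        record = chunk[i:i + 4]
        if record and record[0] in wanted:
            kept.extend(record)
    return kept

def reduce_concat(parts):
    """Reduce step: concatenate the filtered parts in order."""
    out = []
    for part in parts:
        out.extend(part)
    return out

lines = ["@r1", "ACGT", "+", "IIII", "@r2", "TTTT", "+", "IIII"]
wanted = {"@r2"}
# Two "mappers", each handling one chunk of the file.
parts = [map_filter(lines[:4], wanted), map_filter(lines[4:], wanted)]
result = reduce_concat(parts)
```

Whether the cluster overhead pays off for a single 19GB file is an open question; for a one-off extraction a single-pass streaming filter is probably simpler.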

ADD COMMENTlink written 2.2 years ago by Kevin440
0
17 months ago by
Pals1.1k
Pals1.1k wrote:

Very simple solution using biopython:

def batch_iterator(iterator, batch_size):
    entry = True
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = iterator.next()
            except StopIteration:
                entry = None
            if entry is None:
                break
            batch.append(entry)
        if batch:
            yield batch

from Bio import SeqIO
record_iter = SeqIO.parse(open("filename.fastq"), "fastq")
batch_size = 10000  # number of records per output file
for i, batch in enumerate(batch_iterator(record_iter, batch_size)):
    filename = "group_%i.fastq" % (i + 1)
    handle = open(filename, "w")
    count = SeqIO.write(batch, handle, "fastq")
    handle.close()
    print "Wrote %i records to %s" % (count, filename)

source: BiopythonWiki

ADD COMMENTlink modified 17 months ago • written 17 months ago by Pals1.1k

Powered by Biostar version 2.0.0