Tool: Seq_File: C Library For Reading Multiple Bioinformatics Sequence File Formats
gravatar for Jeremy Leipzig
2.9 years ago by
Philadelphia, PA
Jeremy Leipzig15k wrote:

The aim is to provide a C library that allows programs to transparently read sequence data from multiple file formats, without having to worry about the format. Pass a file to seqfileopen() and then read sequences using seqnextread() without having to worry about what the file format is.

Currently supports:

SAM & BAM FASTA (& gzipped fasta) FASTQ (& gzipped fastq) 'plain' format (one sequence per line [.txt]) (& gzipped plain [.txt.gz])

fasta fastq bam sam tool • 1.5k views
ADD COMMENTlink modified 2.9 years ago by Isaac Turner50 • written 2.9 years ago by Jeremy Leipzig15k

Do you have any sense of how it is better than the htslib upon which it is apprently built?

ADD REPLYlink written 2.9 years ago by Aaronquinlan8.9k
gravatar for Isaac Turner
2.9 years ago by
Isaac Turner50
United Kingdom
Isaac Turner50 wrote:

I'm the author of seq_file. It's quite a new piece of code written for use in the variant caller cortex - I didn't expect to see it pop up here. Thanks Heng for that comparison - it's very useful. seq_file was written in C to provide:

  • file reading with file format auto-detection
  • reading FASTA/FASTQ/SAM/BAM formats and others as they come along
  • unbuffered reading a line at a time for use with interactive programs

Making these trade offs has lowered the reading speed (especially the last one). On top of which optimising for speed hasn't been a priority so far. Heng's htslib is used for SAM/BAM reading. The API supports reading a base/qual pair at a time (keeps memory usage low if a whole genome FASTA file is passed) or reading an entire entry at once. There is also support for writing in different output formats.

In Heng's example I'd change:

f = seq_file_open_filetype(argv[1], SEQ_FASTA);


f = seq_file_open(argv[1]);

to support format auto-detecting and multi-format support. This was also the source of the bug you saw -- that function should have been removed from the API. I'm not sure I understand Heng's comment about zlib - I'd assumed it would link against whichever version of zlib was available so I assume he means he used zlib-1.2.3.

If this is something people would like to use, optimisations could certainly be made and I'd be happy to have input. If people are looking to simply read FASTA/Q quickly in C, I'd point them towards Heng's kseq library linked above.


ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by Isaac Turner50

Your library should work with zlib-1.1.4 through zlib-1.2.7, but you will get better performance when zlib-1.2.4+ is in use. zlib-1.2.3 implements inefficient gzgetc() and gzgets() due to the use of unbuffered I/O. THE major improvement in zlib-1.2.4 is to reimplement gz*() functions. When I implemented my parser in 2008, 1.5 years before the release of zlib-1.2.4, I first wrote a buffer to overcome the inefficient gzgetc(). This buffer guarantees good performance across all zlib versions.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by lh325k
gravatar for lh3
2.9 years ago by
United States
lh325k wrote:

Benchmark on parsing human chromosome fasta (C source code at the end):

Data             Library    RealTime    UserTime    SysTime
humanChr-plain   seq_file   127         88          38
humanChr-gzip    seq_file   246         237         10
humanChr-plain   kseq       11          8           3
humanChr-gzip    kseq       31          29          2
humanChr-plain   wc         21          9           12
humanChr-gzip    zcat|wc    38          39          18

Seq_file seems to be slower than my single-header kseq library. I have not tried on short-read FASTQ. The results may be different. Also, seq_file is linked to zlib-1.2.3. Probably using zlib-1.2.5 wil lead to better performance because in the most widely used zlib-1.2.3, I/O is not buffered.

Source code based on seq_file. Somehow the program is buggy: it always misses the first sequence in the file.

#include <stdio.h>
#include <stdlib.h>
#include "seq_file.h"

int main(int argc, char *argv[])
    SeqFile *f;
    StrBuf *buf;
    if (argc == 1) return 1;
    f = seq_file_open_filetype(argv[1], SEQ_FASTA);
    buf = strbuf_new();
    while (seq_next_read(f)) {
        seq_read_all_bases(f, buf);
        printf("%s\t%ld\n", seq_get_read_name(f), strbuf_len(buf));
    return 1;

Based on kseq:

#include <zlib.h>
#include <stdio.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)

int main(int argc, char *argv[])
    gzFile fp = gzopen(argv[1], "r");
    kseq_t *ks = kseq_init(fp);
    if (argc == 1) return 1;
    while (kseq_read(ks) >= 0)
        printf("%s\t%ld\n", ks->name.s, ks->seq.l);
    return 0;
ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by lh325k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 690 users visited in the last hour