Tool: Seq_File: C Library For Reading Multiple Bioinformatics Sequence File Formats
3
gravatar for Jeremy Leipzig
16 months ago by
Philadelphia, PA
Jeremy Leipzig12k wrote:

https://github.com/noporpoise/seq_file

The aim is to provide a C library that allows programs to transparently read sequence data from multiple file formats, without having to worry about the format. Pass a file to seqfileopen() and then read sequences using seqnextread() without having to worry about what the file format is.

Currently supports:

SAM & BAM FASTA (& gzipped fasta) FASTQ (& gzipped fastq) 'plain' format (one sequence per line [.txt]) (& gzipped plain [.txt.gz])

ADD COMMENTlink modified 15 months ago by Isaac Turner50 • written 16 months ago by Jeremy Leipzig12k

Do you have any sense of how it is better than the htslib upon which it is apprently built?

ADD REPLYlink written 16 months ago by Aaronquinlan7.3k
5
gravatar for Isaac Turner
15 months ago by
Isaac Turner50
United Kingdom
Isaac Turner50 wrote:

I'm the author of seq_file. It's quite a new piece of code written for use in the variant caller cortex - I didn't expect to see it pop up here. Thanks Heng for that comparison - it's very useful. seq_file was written in C to provide:

  • file reading with file format auto-detection
  • reading FASTA/FASTQ/SAM/BAM formats and others as they come along
  • unbuffered reading a line at a time for use with interactive programs

Making these trade offs has lowered the reading speed (especially the last one). On top of which optimising for speed hasn't been a priority so far. Heng's htslib is used for SAM/BAM reading. The API supports reading a base/qual pair at a time (keeps memory usage low if a whole genome FASTA file is passed) or reading an entire entry at once. There is also support for writing in different output formats.

In Heng's example I'd change:

f = seq_file_open_filetype(argv[1], SEQ_FASTA);

to:

f = seq_file_open(argv[1]);

to support format auto-detecting and multi-format support. This was also the source of the bug you saw -- that function should have been removed from the API. I'm not sure I understand Heng's comment about zlib - I'd assumed it would link against whichever version of zlib was available so I assume he means he used zlib-1.2.3.

If this is something people would like to use, optimisations could certainly be made and I'd be happy to have input. If people are looking to simply read FASTA/Q quickly in C, I'd point them towards Heng's kseq library linked above.

Isaac

ADD COMMENTlink modified 15 months ago • written 15 months ago by Isaac Turner50
2

Your library should work with zlib-1.1.4 through zlib-1.2.7, but you will get better performance when zlib-1.2.4+ is in use. zlib-1.2.3 implements inefficient gzgetc() and gzgets() due to the use of unbuffered I/O. THE major improvement in zlib-1.2.4 is to reimplement gz*() functions. When I implemented my parser in 2008, 1.5 years before the release of zlib-1.2.4, I first wrote a buffer to overcome the inefficient gzgetc(). This buffer guarantees good performance across all zlib versions.

ADD REPLYlink modified 15 months ago • written 15 months ago by lh320k
4
gravatar for lh3
16 months ago by
lh320k
lh320k wrote:

Benchmark on parsing human chromosome fasta (C source code at the end):

Data             Library    RealTime    UserTime    SysTime
-----------------------------------------------------------
humanChr-plain   seq_file   127         88          38
humanChr-gzip    seq_file   246         237         10
humanChr-plain   kseq       11          8           3
humanChr-gzip    kseq       31          29          2
humanChr-plain   wc         21          9           12
humanChr-gzip    zcat|wc    38          39          18

Seq_file seems to be slower than my single-header kseq library. I have not tried on short-read FASTQ. The results may be different. Also, seq_file is linked to zlib-1.2.3. Probably using zlib-1.2.5 wil lead to better performance because in the most widely used zlib-1.2.3, I/O is not buffered.

Source code based on seq_file. Somehow the program is buggy: it always misses the first sequence in the file.

#include <stdio.h>
#include <stdlib.h>
#include "seq_file.h"

int main(int argc, char *argv[])
{
    SeqFile *f;
    StrBuf *buf;
    if (argc == 1) return 1;
    f = seq_file_open_filetype(argv[1], SEQ_FASTA);
    buf = strbuf_new();
    while (seq_next_read(f)) {
        seq_read_all_bases(f, buf);
        printf("%s\t%ld\n", seq_get_read_name(f), strbuf_len(buf));
        strbuf_reset(buf);
    }
    seq_file_close(f);
    strbuf_free(buf);
    return 1;
}

Based on kseq:

#include <zlib.h>
#include <stdio.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)

int main(int argc, char *argv[])
{   
    gzFile fp = gzopen(argv[1], "r");
    kseq_t *ks = kseq_init(fp);
    if (argc == 1) return 1;
    while (kseq_read(ks) >= 0)
        printf("%s\t%ld\n", ks->name.s, ks->seq.l);
    kseq_destroy(ks);
    gzclose(fp);
    return 0;
}
ADD COMMENTlink modified 16 months ago • written 16 months ago by lh320k
Please log in to add an answer.

Help
Access
  • RSS
  • Stats
  • API

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.0.0
Traffic: 575 users visited in the last hour