Question

Tool:Seq_File: C Library For Reading Multiple Bioinformatics Sequence File Formats

3

Entering edit mode

11.3 years ago

Jeremy Leipzig 22k

The aim is to provide a C library that allows programs to transparently read sequence data from multiple file formats, without having to worry about the format. Pass a file to seqfileopen() and then read sequences using seqnextread() without having to worry about what the file format is.

Currently supports:

SAM & BAM
FASTA (& gzipped fasta)
FASTQ (& gzipped fastq)
'plain' format (one sequence per line [.txt]) (& gzipped plain [.txt.gz])

fasta fastq bam sam • 4.5k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 11.3 years ago by Jeremy Leipzig 22k

0

Entering edit mode

Do you have any sense of how it is better than the htslib upon which it is apprently built?

ADD REPLY • link 11.3 years ago by Aaronquinlan 12k

score 5 · Answer 1 · 2012-12-29

I'm the author of seq_file. It's quite a new piece of code written for use in the variant caller cortex - I didn't expect to see it pop up here. Thanks Heng for that comparison - it's very useful. seq_file was written in C to provide:

file reading with file format auto-detection
reading FASTA/FASTQ/SAM/BAM formats and others as they come along
unbuffered reading a line at a time for use with interactive programs

Making these trade offs has lowered the reading speed (especially the last one). On top of which optimising for speed hasn't been a priority so far. Heng's htslib is used for SAM/BAM reading. The API supports reading a base/qual pair at a time (keeps memory usage low if a whole genome FASTA file is passed) or reading an entire entry at once. There is also support for writing in different output formats.

In Heng's example I'd change:

f = seq_file_open_filetype(argv[1], SEQ_FASTA);

to:

f = seq_file_open(argv[1]);

to support format auto-detecting and multi-format support. This was also the source of the bug you saw -- that function should have been removed from the API. I'm not sure I understand Heng's comment about zlib - I'd assumed it would link against whichever version of zlib was available so I assume he means he used zlib-1.2.3.

If this is something people would like to use, optimisations could certainly be made and I'd be happy to have input. If people are looking to simply read FASTA/Q quickly in C, I'd point them towards Heng's kseq library linked above.

Isaac

score 4 · Answer 2 · 2012-12-28

Benchmark on parsing human chromosome fasta (C source code at the end):

Data             Library    RealTime    UserTime    SysTime
-----------------------------------------------------------
humanChr-plain   seq_file   127         88          38
humanChr-gzip    seq_file   246         237         10
humanChr-plain   kseq       11          8           3
humanChr-gzip    kseq       31          29          2
humanChr-plain   wc         21          9           12
humanChr-gzip    zcat|wc    38          39          18

Seq_file seems to be slower than my single-header kseq library. I have not tried on short-read FASTQ. The results may be different. Also, seq_file is linked to zlib-1.2.3. Probably using zlib-1.2.5 wil lead to better performance because in the most widely used zlib-1.2.3, I/O is not buffered.

Source code based on seq_file. Somehow the program is buggy: it always misses the first sequence in the file.

#include <stdio.h>
#include <stdlib.h>
#include "seq_file.h"

int main(int argc, char *argv[])
{
    SeqFile *f;
    StrBuf *buf;
    if (argc == 1) return 1;
    f = seq_file_open_filetype(argv[1], SEQ_FASTA);
    buf = strbuf_new();
    while (seq_next_read(f)) {
        seq_read_all_bases(f, buf);
        printf("%s\t%ld\n", seq_get_read_name(f), strbuf_len(buf));
        strbuf_reset(buf);
    }
    seq_file_close(f);
    strbuf_free(buf);
    return 1;
}

Based on kseq:

#include <zlib.h>
#include <stdio.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)

int main(int argc, char *argv[])
{   
    gzFile fp = gzopen(argv[1], "r");
    kseq_t *ks = kseq_init(fp);
    if (argc == 1) return 1;
    while (kseq_read(ks) >= 0)
        printf("%s\t%ld\n", ks->name.s, ks->seq.l);
    kseq_destroy(ks);
    gzclose(fp);
    return 0;
}