Fasta Parser In C?
13.1 years ago
Newvin ▴ 360

I'm looking for an existing C program which can parse FASTA files for an academic project. The requirements are:

  1. It must be capable of parsing very large DNA sequences (entire genomes)
  2. It must be open source (i.e. I must have the author's permission to use it)
  3. It must be C only (not C++)

I've done some Googling but haven't yet found an ideal candidate. I would appreciate any advice.

Thank you

c open-software fasta • 11k views

The top answer to "Which C++ Libraries Are Best For Dealing With Fastq Files?" is in pure C, works for FASTA, and is MIT licensed.


See also: Which C++ Libraries Are Best For Dealing With Fastq Files?

The library mentioned in the top answer is efficient, MIT licensed, written in C, and works for both FASTA and FASTQ.

13.1 years ago
brentp 24k

Heng Li has a FASTA/FASTQ parser that fits in a single C header file: http://lh3lh3.users.sourceforge.net/parsefastq.shtml


Thanks. This was exactly what I was looking for.


I was looking at the same problem, but the example program asks for a .seq file. How can I use it with a .fasta file?


If you look at the example, KSEQ_INIT takes two arguments. The first is a type; for your purpose use FILE* (defined in stdio.h). The second is a function; text files are read with the "read" function: KSEQ_INIT(FILE*, read); // STEP 1

I tried fscanf and fgets, but they did not work.

After that you need a handle for the file, a file pointer: FILE* fp;

The file is opened with fopen: fp = fopen(argv[1], "r"); // STEP 2

At the end you need to close the file:

fclose(fp); // STEP 6

That's all for the moment; I'm still looking for the function that does the reading.


You can use KSEQ_INIT(int, read) or KSEQ_INIT(gzFile, gzread)


If I use int, the command prompt prints "Program too big to fit in memory".


int is the type of a file descriptor.
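To spell that out: `read` here is the POSIX call `read(int fd, void *buf, size_t count)`, so the type given to KSEQ_INIT has to match what that function takes as its first argument, which is a descriptor returned by `open`, not a `FILE*` returned by `fopen`. Below is a minimal sketch of the same "type plus reader function" macro pattern, assuming a POSIX system. `COUNTER_INIT`, `count_bytes` and `count_file_bytes` are made-up names for illustration; this is not code from kseq.h.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical macro (not from kseq.h): generates count_bytes() for a given
   handle type and a reader function with a read(2)-style signature. */
#define COUNTER_INIT(type_t, read_fn)                               \
    static long count_bytes(type_t handle)                          \
    {                                                               \
        char buf[4096];                                             \
        long total = 0, n;                                          \
        while ((n = (long)read_fn(handle, buf, sizeof buf)) > 0)    \
            total += n;                                             \
        return n < 0 ? -1 : total; /* -1 if the reader failed */    \
    }

COUNTER_INIT(int, read) /* int = file descriptor, read = POSIX read() */

/* Opens path with open(), which returns an int descriptor, and counts its bytes.
   Returns -1 if the file cannot be opened or read. */
static long count_file_bytes(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }
    long total = count_bytes(fd);
    close(fd);
    return total;
}
```

Calling `count_file_bytes("sequence.fasta")` would return the file size in bytes. The point is only that the handle passed in is an `int` from `open`, matching the `int` given to the macro.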


Modifying your example:

#include <stdio.h>
#include "kseq.h"
KSEQ_INIT(int, read)

int main(int argc, char *argv[])
{
    FILE* fp;
    kseq_t *seq;
    int l;
    if (argc == 1) {
        fprintf(stderr, "Usage: %s <in.seq>\n", argv[0]);
        return 1;
    }
    fp = fopen(argv[1], "r");
    seq = kseq_init(fp);
    while ((l = kseq_read(seq)) >= 0) {
        printf("name: %s\n", seq->name.s);
        if (seq->comment.l) printf("comment: %s\n", seq->comment.s);
        printf("seq: %s\n", seq->seq.s);
        if (seq->qual.l) printf("qual: %s\n", seq->qual.s);
    }
    printf("return value: %d\n", l);
    kseq_destroy(seq);
    fclose(fp);
    return 0;
}

When I compile, I get:

kseq_test_mod.c: In function 'main':
kseq_test_mod.c:16:2: warning: passing argument 1 of 'kseq_init' makes integer from pointer without a cast
kseq_test_mod.c:4:1: note: expected 'int' but argument is of type 'struct FILE *'

If I run the executable:

C:\Documents and Settings\Proteomica\My Documents\Download\kseq>prova.exe sequence.fasta
return value: -1

The FASTA file was downloaded from NCBI.
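For what it's worth, the warning points at the likely cause: `fp` is a `FILE*`, but with `KSEQ_INIT(int, read)` the generated `kseq_init` expects an `int` descriptor. The pointer gets converted to some integer that is most probably not an open descriptor, so `read` fails on it and no record is returned. Two possible fixes are to open the file with `open()` or to pass `fileno(fp)`. Here is a minimal sketch of the difference, assuming a POSIX system; `first_byte` and `first_byte_of_file` are made-up helpers, not part of kseq.h.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: reads one byte through a descriptor with POSIX read().
   Returns the byte, or -1 if read() fails or hits end of file. */
static int first_byte(int fd)
{
    unsigned char c;
    ssize_t n = read(fd, &c, 1);
    return n == 1 ? (int)c : -1;
}

/* Opens path with fopen() and reads its first byte via the underlying descriptor. */
static int first_byte_of_file(const char *path)
{
    assert(path != NULL);
    FILE *fp = fopen(path, "r");
    if (fp == NULL) return -1;
    int c = first_byte(fileno(fp)); /* fileno() yields the int that read() needs */
    fclose(fp);
    return c;
}
```

On a FASTA file, `first_byte_of_file` should return `'>'`, whereas `first_byte` called with a value that is not an open descriptor returns -1.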


You really need to understand the example on the web page first before trying to modify it.


That was the first thing I did; there are probably some things I don't understand.

13.1 years ago

How about Bill Pearson's FASTA (written in C)? Given it is the origin of the format, I am inclined to believe it parses FASTA.

http://fasta.bioch.virginia.edu/fasta_www2/fasta_down.shtml

13.1 years ago
Andreas ★ 2.5k

You can try Sean Eddy's squid library. It's a general-purpose sequence analysis library, but it has been superseded by Easel (the library now used by HMMER3). See sreformat_main.c for an example of how to read sequence files of (almost) any format.

Another general-purpose library to consider might be SeqAn (it is C++, though, and I haven't used it myself).

I think there used to be a C version of readseq as well, but I can only find a newer Java version.

However, all of the above might be overkill if you just want to parse fasta format, which is pretty simple.

Andreas
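To illustrate how simple the format is, here is a minimal sketch of a streaming FASTA reader in plain C. It only counts records and residues, and it assumes header lines start with `>` and everything else is sequence (no `;` comment lines, no validation of the alphabet). `fasta_count` is a made-up name, not taken from any of the libraries above.

```c
#include <assert.h>
#include <stdio.h>

/* Streams a FASTA file character by character and counts records and residues.
   Header lines start with '>'; every other non-whitespace character is counted
   as sequence. Nothing is buffered, so memory use does not depend on file size. */
static void fasta_count(FILE *fp, unsigned long *n_records, unsigned long *n_residues)
{
    assert(fp != NULL && n_records != NULL && n_residues != NULL);
    int c;
    int at_line_start = 1; /* true when the next character begins a line */
    int in_header = 0;     /* true while inside a '>' header line */
    *n_records = 0;
    *n_residues = 0;
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') {
            at_line_start = 1;
            in_header = 0;
            continue;
        }
        if (at_line_start && c == '>') {
            in_header = 1;
            ++*n_records;
        } else if (!in_header && c != '\r' && c != ' ' && c != '\t') {
            ++*n_residues;
        }
        at_line_start = 0;
    }
}
```

You would call it with a `FILE*` from `fopen(path, "r")`. Because it never holds a whole sequence in memory, the same loop should cope with genome-sized files; storing names and sequences would need growable buffers on top of this.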


The readseq C version is still available as version1: http://iubio.bio.indiana.edu/soft/molbio/readseq/. It is quite old, though.

13.1 years ago

The UCSC source tree has a large number of FASTA utilities written in C, in the src/utils directory, that can parse and process large multi-FASTA files.
