Fasta Parser In C?
4
10.7 years ago
Newvin ▴ 350

I'm looking for an existing C program which can parse FASTA files for an academic project. The requirements are:

1. It must be capable of parsing very large DNA sequences (entire genomes)
2. It must be open source (i.e. I must have the author's permission to use it)
3. It must be C only (not C++)

I've done some Googling but haven't yet found an ideal candidate. I would appreciate any advice.

Thank you.

fasta c open software • 9.2k views
0

The library in the top answer to the question "Which C++ Libraries Are Best For Dealing With Fastq Files?" is pure C, works for FASTA, and is MIT licensed.

0

The library mentioned in the top answer is efficient, MIT licensed, written in C, and works for both FASTA and FASTQ.

8
10.7 years ago
brentp 23k

Heng Li has a FASTA/FASTQ parser that fits in a single C header file: http://lh3lh3.users.sourceforge.net/parsefastq.shtml

0

Thanks. This was exactly what I was looking for.

0

I was looking at the same problem, but the program requires a .seq file. How can I use it with a .fasta file?

0

If you look at the example, KSEQ_INIT takes two arguments. The first is a type; for your purpose use FILE* (it's defined in stdio.h). The second is a function; text files are read with the "read" function: KSEQ_INIT(FILE*, read); // STEP 1

I tried fscanf and fgets, but they do not work.

After that, you need a handle to the file, a file pointer: FILE* fp;

The file is opened with fopen: fp = fopen(argv[1], "r"); // STEP 2

At the end, you need to close the file:

fclose(fp); // STEP 6

That's all for the moment; I'm still looking for the function that does the reading.

0

You can use KSEQ_INIT(int, read) or KSEQ_INIT(gzFile, gzread)

0

If I use int, the command prompt prints "Program too big to fit in memory".

0

int is the type of a file descriptor.
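To make the distinction concrete: fopen() returns a FILE* stream, while open() returns the plain int descriptor that read() expects, so a parser instantiated with KSEQ_INIT(int, read) needs the latter. Below is a minimal sketch, assuming a POSIX system (on Windows these calls are typically declared in <io.h> rather than <unistd.h>). The names read_via_fd and read_via_stream are hypothetical helpers used only for illustration; they are not part of kseq.h.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: read up to `cap` bytes from `path` through a raw
 * file descriptor, the kind of handle that read() expects. */
ssize_t read_via_fd(const char *path, char *buf, size_t cap)
{
    int fd = open(path, O_RDONLY);   /* open() returns an int descriptor */
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, cap);  /* read() takes that int, not a FILE* */
    close(fd);
    return n;
}

/* Hypothetical helper: the same read, starting from a FILE* opened with
 * fopen(); fileno() exposes the descriptor underneath the stream. */
ssize_t read_via_stream(const char *path, char *buf, size_t cap)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    ssize_t n = read(fileno(fp), buf, cap);
    fclose(fp);
    return n;
}
```

With a stream that was already opened by fopen(), passing fileno(fp) instead of fp to kseq_init() should give the parser the descriptor it expects.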

0

#include <stdio.h>
#include "kseq.h"

KSEQ_INIT(int, read)

int main(int argc, char *argv[])
{
    FILE* fp;
    kseq_t *seq;
    int l;
    if (argc == 1) {
        fprintf(stderr, "Usage: %s <in.seq>\n", argv[0]);
        return 1;
    }
    fp = fopen(argv[1], "r");
    seq = kseq_init(fp);
    while ((l = kseq_read(seq)) >= 0) {
        printf("name: %s\n", seq->name.s);
        if (seq->comment.l) printf("comment: %s\n", seq->comment.s);
        printf("seq: %s\n", seq->seq.s);
        if (seq->qual.l) printf("qual: %s\n", seq->qual.s);
    }
    printf("return value: %d\n", l);
    kseq_destroy(seq);
    fclose(fp);
    return 0;
}


When I compile it, I get:

kseq_test_mod.c: In function 'main':
kseq_test_mod.c:16:2: warning: passing argument 1 of 'kseq_init' makes integer from pointer without a cast
kseq_test_mod.c:4:1: note: expected 'int' but argument is of type 'struct FILE *'


If I run the executable:

C:\Documents and Settings\Proteomica\My Documents\Download\kseq>prova.exe sequence.fasta
return value: -1


The FASTA file was downloaded from NCBI.
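A likely reading of the warning and the -1 result: the FILE* is converted to an int and handed to read() as if it were a file descriptor. read() fails on a value that is not an open descriptor, so the parser never sees any data. The sketch below reproduces only that failure, assuming a POSIX system. The helper name read_error_for is hypothetical, and -1 stands in for any value that was never returned by open().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: attempt a read() on `fd` and report the errno it
 * sets, or 0 if the read succeeded. */
int read_error_for(int fd)
{
    char buf[16];
    errno = 0;
    return read(fd, buf, sizeof buf) < 0 ? errno : 0;
}
```

Calling read_error_for(-1) yields EBADF ("bad file descriptor"), which is the error read() reports for a descriptor that is not open for reading.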

0

You really need to understand the example on the web page first before trying to modify it.

0

That was the first thing I did; there is probably something that I don't understand.

5
10.7 years ago

How about Bill Pearson's FASTA (written in C)? Given it is the origin of the format, I am inclined to believe it parses FASTA.

http://fasta.bioch.virginia.edu/fasta_www2/fasta_down.shtml

3
10.7 years ago
Andreas ★ 2.5k

You can try Sean Eddy's squid library. It's a general-purpose sequence analysis library, but it has been deprecated in favor of Easel (the library now used by HMMER3). See sreformat_main.c for an example of how to read sequence files of (almost) any format.

Another general-purpose library to consider might be SeqAn (it is C++, though, and I haven't used it myself).

I think there used to be a C version of readseq as well, but I can only find a newer Java version.

However, all of the above might be overkill if you just want to parse fasta format, which is pretty simple.
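To illustrate how little is needed for plain FASTA, here is a minimal sketch that streams a file one character at a time and counts records and residues. It never holds a sequence in memory, so the sequence length is not limited by RAM. The names fasta_stats and fasta_scan are hypothetical and do not come from any library mentioned in this thread. The sketch assumes that every line starting with '>' is a header and that everything else is sequence data.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    size_t records;   /* number of '>' header lines seen */
    size_t residues;  /* number of non-whitespace sequence characters */
} fasta_stats;

/* Hypothetical helper: scan a FASTA stream and fill in `out`.
 * Returns 0 on success, -1 on a read error or if sequence data
 * appears before the first header. */
int fasta_scan(FILE *fp, fasta_stats *out)
{
    int c;
    int at_line_start = 1;  /* true when the next char begins a line */
    int in_header = 0;      /* true while inside a '>' header line */

    out->records = 0;
    out->residues = 0;

    while ((c = getc(fp)) != EOF) {
        if (c == '\n') {
            at_line_start = 1;
            in_header = 0;
            continue;
        }
        if (at_line_start && c == '>') {
            in_header = 1;
            out->records++;
        } else if (!in_header && !isspace(c)) {
            if (out->records == 0)
                return -1;  /* sequence data before any header */
            out->residues++;
        }
        at_line_start = 0;
    }
    return ferror(fp) ? -1 : 0;
}
```

A caller would open the file with fopen(), pass the stream to fasta_scan(), and read the counts from the struct. Collecting the names or sequences themselves would need a growable buffer, which is the part that libraries such as kseq.h handle for you.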

Andreas

0

The readseq C version is still available as version1: http://iubio.bio.indiana.edu/soft/molbio/readseq/. It is quite old, though.

2
10.7 years ago

The src/utils directory of the UCSC source tree has a large number of FASTA utilities written in C that can parse and process large multi-FASTA files.