Question: read multiple fasta sequence using kseq.h
1
gravatar for madhurima.das
6.3 years ago by
United States
madhurima.das10 wrote:

I am trying to find fasta sequences of 5 ids/name as provided by user from a big fasta file (containing 80000 fasta sequences) using an external header file kseq.h as in:http://lh3lh3.users.sourceforge.net/kseq.shtml. When I run the program in a for loop, I have to open/close the big fasta file again and again (commented in the code) which makes the computation time slow. On the contrary, if I open/close only once outside the loop, the program stops if it encounters an entry which is not present in the big fasta file i.e. it reaches end of the file. Can anyone suggest how to get all the sequences without losing computational time. The code is:

 

    #include <zlib.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include "ext_libraries/kseq.h"
    
    KSEQ_INIT(gzFile, gzread)
    
    int main(int argc,char *argv[])
    {
        char gwidd_ids[100];
        kseq_t *seq;
        int i=0,nFields=0,row=0,col=0;
        int size=1000,flag1=0,l=0,index0=0;
    
        printf("Opening file %s\n",argv[1]);
        
        char **gi_ids=(char **)malloc(sizeof(char *)*size);
        for(i=0;i<size;i++)
        {
            gi_ids[i]=(char *)malloc(sizeof(char)*50);
        }
        FILE *fp_inp = fopen(argv[1],"r");
        while(fscanf(fp_inp,"%s",gwidd_ids) == 1)
        {
            printf("%s\n",gwidd_ids);
            strcpy(gi_ids[index0],gwidd_ids);
            index0++;
        }    
        fclose(fp_inp);
    
        FILE *f0 = fopen("xxx.txt", "w");
        FILE *f1 = fopen("yyy.txt", "w");
        FILE *f2 = fopen("zzz", "w");
    
        FILE *instream = NULL;
        instream = fopen("fasta_seq_nr_uniprot.txt", "r"); 
        gzFile fpf = gzdopen(fileno(instream), "r");
    
        for(col=0;col<index0;col++)
        {
            flag1=0;
    //        FILE *instream = NULL;
    //        instream = fopen("fasta_seq_nr_uniprot.txt", "r"); 
    //        gzFile fpf = gzdopen(fileno(instream), "r");
            kseq_t *seq = kseq_init(fpf); 
            while((kseq_read(seq)) >= 0 && flag1 == 0)
            {
                if(strcasecmp(gi_ids[col],seq->name.s) == 0)
                {
                    fprintf(f1,">%s\n",gi_ids[col]);
                    fprintf(f2,">%s\n%s\n",seq->name.s,seq->seq.s);
                    flag1 = 1;
                }
            }
            if(flag1 == 0)
            {
                fprintf(f0,"%s\n",gi_ids[col]);
            }
            kseq_destroy(seq);
    //        gzclose(fpf); 
        }
        gzclose(fpf); 
    
        fclose(f0);
        fclose(f1);
        fclose(f2);
            
        for(i=0;i<size;i++)
        {
            free(gi_ids[i]);
        }
        free(gi_ids);
    
        return 0;
    }    
    
        
C sequence kseq.h • 2.0k views
ADD COMMENTlink modified 6.3 years ago by Alex Reynolds31k • written 6.3 years ago by madhurima.das10
0
gravatar for Devon Ryan
6.3 years ago by
Devon Ryan97k
Freiburg, Germany
Devon Ryan97k wrote:

This is borderline off-topic.

You should just iterate checking gi_ids from within the while((kseq_read(seq))... while loop, setting a found variable and breaking if there's a match.

ADD COMMENTlink written 6.3 years ago by Devon Ryan97k

I think I already have put in a flag condition inside the while loop to break when it reaches flag=1. I hope this is what you meant..

ADD REPLYlink written 6.3 years ago by madhurima.das10

Yeah flag1 will work. Just embed the if-block in a for-loop.

ADD REPLYlink written 6.3 years ago by Devon Ryan97k
0
gravatar for Alex Reynolds
6.3 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

As an alternate approach, consider storing your sequence header IDs of interest into a hash table or associative array. Then stream through the large FASTA file and lookup the file's current sequence header ID against your hash table. If there's a match, print out the header and its sequence.

Lookups of hash tables only take O(1) time and you'll only have to stream through the file once. With five entries in a hash table, and only reading one line into memory at a time for evaluation, the file I/O time and memory usage will be minimal. 

One approach in awk, for example, where you define your five patterns of interest in the variable ht:

$ awk ' \
BEGIN { \
    ht[">foo"] = 1; \
    ht[">bar"] = 1; \
    ht[">baz"] = 1; \
    flag = 0; \
} \
{ \
    if (($0 ~ /^>/) && (ht[$0] == 1)) { \
        print $0; \
        flag = 1; \
    } \
    else if (($0 !~ /^>/) && (flag == 1)) { \
        print $0; \
    } \
    else { \
        flag = 0; \
    } \
}' fasta_seq_nr_uniprot.txt

If you really need to write this in C, you could look into khash.

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Alex Reynolds31k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1373 users visited in the last hour