Java code for indexing fasta file
2
2
Entering edit mode
7.4 years ago

Hello- Before I reinvent the flat tyre I ask here...

Does anybody know of or have some Java code to make an index file for a fasta file? For "index fasta file" I mean the file produced by e.g. samtools faidx and containing sequence length, byte positions of each sequence etc.

Ideally I'm looking for something working like this:

public static void Main(String[] args){

    //Writes to disk index just like `samtools faidx sequences.fasta`:
    new MakeFastaIndex("sequences.fasta") 

}

I had a look at picard, biojava but couldn't find anything...

Java fasta index • 2.7k views
ADD COMMENT
0
Entering edit mode

I would assume that there is something in htsjdk for this.

ADD REPLY
0
Entering edit mode

@devon I didn't find it.

$ find . -name "*.java" -exec grep -H -F 'fai"' '{}' ';'
./src/test/java/htsjdk/samtools/reference/FastaSequenceIndexTest.java:        final File sequenceIndexFile = new File(TEST_DATA_DIR,"Homo_sapiens_assembly18.fasta.fai");
./src/test/java/htsjdk/samtools/reference/FastaSequenceIndexTest.java:        final File sequenceIndexFile = new File(TEST_DATA_DIR,"testing.fai");
./src/main/java/htsjdk/samtools/reference/IndexedFastaSequenceFile.java:        return new File(fastaFile.getAbsolutePath() + ".fai");
./src/main/java/htsjdk/samtools/reference/IndexedFastaSequenceFile.java:        return fastaFile.resolveSibling(fastaFile.getFileName() + ".fai");
ADD REPLY
0
Entering edit mode

What about "dict"?

ADD REPLY
1
Entering edit mode

'Dict' is for sequence dictionary. The structure of a dict is different from a faidx (it contains a SAM sequence header dictionary)

ADD REPLY
0
Entering edit mode

Ah, right. I'm a little surprised that this isn't in htsjdk, seems like a normal function for it to provide.

ADD REPLY
2
Entering edit mode
7.4 years ago

Without checking anything (lines length...) for text files only:

import java.io.*;
import java.util.*;

public class Faidx
    {
    public static class SamSequenceRecord
        {
        String name=null;
        int length=0;
        long begin;
        int linelen=-1;
        }
    public final List<SamSequenceRecord> dict = new ArrayList<>();
    public Faidx(final File fn) throws IOException
        {
        long offset=0;
        FileReader rz = new FileReader(fn);
        int c;
        SamSequenceRecord ssr = null;
        while((c=rz.read())!=-1) {
            if (c == '>') {
                offset++;
                StringBuilder sb=new StringBuilder();
                ssr = new SamSequenceRecord();
                boolean ws=false;
                while((c=rz.read())!=-1 )
                    {
                    offset++;
                    if( c=='\n') break;
                    if(!ws && Character.isWhitespace(c)) ws=true;
                    if(!ws) sb.append((char)c);
                    }
                ssr.name=sb.toString();
                ssr.begin=offset;
                dict.add(ssr);
                }
            else
                {
                int n=0;
                do  {
                    offset++;
                    if( c == '\n') break;
                    n++;
                    c = rz.read();
                    } while(c!=-1);
                ssr.length+=n;
                if(ssr.linelen==-1)
                    {
                    ssr.linelen = n;
                    }
                }
            }
        }
    public void print(PrintStream out)
        {
        for(SamSequenceRecord ssr: dict)
            {
            out.printlnssr.name+"\t"+ssr.length+"\t"+ssr.begin+"\t"+ssr.linelen+"\t"+(ssr.linelen+1));
            }
        }
    public static void main(String args[]) throws IOException {
        new Faidx(new File(args[0])).print(System.out);
        }
    }
ADD COMMENT
0
Entering edit mode

Thank you Pierre. I'll try this out. I was also quite surprised to see that htsjdk doesn't have anything...

ADD REPLY
2
Entering edit mode
7.3 years ago

If anyone is interested I put here https://github.com/dariober/ASCIIGenome/tree/master/src/faidx a package to index a fasta file. Some tests are https://github.com/dariober/ASCIIGenome/blob/master/test/faidx.

Example:

new Faidx(new File("seq.fa"));

will create seq.fa.fai

ADD COMMENT
0
Entering edit mode

Great project Dario (ASCIIGenome)!

ADD REPLY

Login before adding your answer.

Traffic: 2616 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6