Corrupted tabix index?
1
0
Entering edit mode
8.3 years ago

Hello,

I reported this problem on the samtools/htsjdk list but I got no reply yet so I'll cross post here.

In a Java program I need to create a tabix index for a bgzip compressed file. I'm using the method writeBasedOnFeatureFile. No exception is thrown but the resulting index appear to be corrupted as only the first records of the first chrom can be queried.

Could anyone reproduce the issue or am I missing something?

The high level question is: How can I create a tabix index for a bgzip file in Java? I don't need to stick to htsjdk.

Example:

Here's a bed test file, it has 1M rows on 10 chroms. I bgzipped with tabix 1.2.1: https://dl.dropboxusercontent.com/u/53487723/tmp.bedGraph.gz

I created an index for it with the following program (java 8, htsjdk-1.141):

import java.io.File;
import java.io.IOException;
import htsjdk.tribble.bed.BEDCodec;
import htsjdk.tribble.index.IndexFactory;
import htsjdk.tribble.index.tabix.TabixFormat;
import htsjdk.tribble.index.tabix.TabixIndex;

public class TestTabix {

    public static void main(String[] args) throws IOException {

        String bgzfOut= "tmp.bedGraph.gz";
        TabixIndex tabixIndexGz =
                IndexFactory.createTabixIndex(new File(bgzfOut), new BEDCodec(), TabixFormat.BED,
                        null);
        tabixIndexGz.writeBasedOnFeatureFile(new File(bgzfOut));
    }
}

Querying the resulting index returns only the first records of chr1:

tabix tmp.bedGraph.gz chr1 | wc -l  # Should be 100000
    4500
tabix tmp.bedGraph.gz chr2 | wc -l  # Should be 100000
       0

EDIT: The test file looks sane. I can index it with standalone tabix and query it:

tabix -p bed tmp.bedGraph.gz
tabix tmp.bedGraph.gz chr2 | wc -l   #` As expected: 100000`
tabix tmp.bedGraph.gz chr2 | head
chr2    100000    100001
chr2    100001    100002
chr2    100002    100003
chr2    100003    100004
chr2    100004    100005
chr2    100005    100006
tabix htsjdk java • 2.4k views
ADD COMMENT
0
Entering edit mode

are you sure your VCF was sorted ?

ADD REPLY
0
Entering edit mode

Hi Pierre, thanks for replying. Input file is bed not vcf, but anyway it looks ok, see edit. I'll have a look at your code, but still... Either there is a big bug in htsjdk.tribble createTabixIndex or I'm doing something silly...!

PS Not sure the link to the test file was working, before, I've updated it.

ADD REPLY
2
Entering edit mode
8.3 years ago

I wrote one https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/misc/VcfIndexTabix.java

It's an old source, may be a recent version of htsjdk has some new methods to do this

ADD COMMENT
0
Entering edit mode

Eventually I got around this by adapting the code from https://github.com/lindenb/jvarkit/wiki/BedIndexTabix. Thanks a ton Pierre! Here's a method to block compress and index a bed file:

/**
 * Block compress input bed file and create associated tabix index. Newly created file and index are
 * deleted on exit if deleteOnExit true.
 * @throws IOException 
 * */
private void blockCompressAndIndex(String in, String bgzfOut, boolean deleteOnExit) throws IOException {

    System.err.print("Compressing: " + in + " to file: " + bgzfOut + "... ");

    File inFile= new File(in);
    File outFile= new File(bgzfOut);

    LineIterator lin= IOUtils.openURIForLineIterator(inFile.getAbsolutePath());

    BlockCompressedOutputStream writer = new BlockCompressedOutputStream(outFile);
    long filePosition= writer.getFilePointer();

    TabixIndexCreator indexCreator=new TabixIndexCreator(TabixFormat.BED);
    BedLineCodec bedCodec= new BedLineCodec();
    while(lin.hasNext()){
        String line = lin.next();
        BedLine bed = bedCodec.decode(line);
        if(bed==null) continue;
        writer.write(line.getBytes());
        writer.write('\n');
        indexCreator.addFeature(bed, filePosition);
        filePosition = writer.getFilePointer();
    }
    writer.flush();

    System.err.print("Indexing... ");

    File tbi= new File(bgzfOut + TabixUtils.STANDARD_INDEX_EXTENSION);
    if(tbi.exists() && tbi.isFile()){
        System.err.println("Index file exists: " + tbi);
        System.exit(1);
    }
    Index index = indexCreator.finalizeIndex(writer.getFilePointer());
    index.writeBasedOnFeatureFile(outFile);
    writer.close();

    System.err.println("Done");

    if(deleteOnExit){
        outFile.deleteOnExit();
        File idx= new File(outFile.getAbsolutePath() + TabixUtils.STANDARD_INDEX_EXTENSION);
        idx.deleteOnExit();
    }
}
ADD REPLY

Login before adding your answer.

Traffic: 1826 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6