Question: Corrupted tabix index?
0
gravatar for dariober
3.3 years ago by
dariober9.9k
WCIP | Glasgow | UK
dariober9.9k wrote:

Hello,

I reported this problem on the samtools/htsjdk list but I got no reply yet so I'll cross post here.

In a Java program I need to create a tabix index for a bgzip compressed file. I'm using the method writeBasedOnFeatureFile. No exception is thrown but the resulting index appear to be corrupted as only the first records of the first chrom can be queried.

Could anyone reproduce the issue or am I missing something?

The high level question is: How can I create a tabix index for a bgzip file in Java? I don't need to stick to htsjdk.

Example:

Here's a bed test file, it has 1M rows on 10 chroms. I bgzipped with tabix 1.2.1:

https://dl.dropboxusercontent.com/u/53487723/tmp.bedGraph.gz

I created an index for it with the following program (java 8, htsjdk-1.141):

import java.io.File;
import java.io.IOException;
import htsjdk.tribble.bed.BEDCodec;
import htsjdk.tribble.index.IndexFactory;
import htsjdk.tribble.index.tabix.TabixFormat;
import htsjdk.tribble.index.tabix.TabixIndex;

public class TestTabix {

    public static void main(String[] args) throws IOException {

        String bgzfOut= "tmp.bedGraph.gz";
        TabixIndex tabixIndexGz =
                IndexFactory.createTabixIndex(new File(bgzfOut), new BEDCodec(), TabixFormat.BED,
                        null);
        tabixIndexGz.writeBasedOnFeatureFile(new File(bgzfOut));
    }

}

Querying the resulting index returns only the first records of chr1:

tabix tmp.bedGraph.gz chr1 | wc -l  # Should be 100000
    4500
tabix tmp.bedGraph.gz chr2 | wc -l  # Should be 100000
       0

 

*EDIT*: The test file looks sane. I can index it with standalone tabix and query it:

tabix -p bed tmp.bedGraph.gz
tabix tmp.bedGraph.gz chr2 | wc -l   # As expected: 100000
tabix tmp.bedGraph.gz chr2 | head
chr2    100000    100001
chr2    100001    100002
chr2    100002    100003
chr2    100003    100004
chr2    100004    100005
chr2    100005    100006
tabix java htsjdk • 1.1k views
ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by dariober9.9k

are you sure your VCF was sorted ?

ADD REPLYlink written 3.3 years ago by Pierre Lindenbaum118k

Hi Pierre, thanks for replying. Input file is bed not vcf, but anyway it looks ok, see edit. I'll have a look at your code, but still... Either there is a big bug in htsjdk.tribble createTabixIndex or I'm doing something silly...!

PS Not sure the link to the test file was working, before, I've updated it.

ADD REPLYlink written 3.3 years ago by dariober9.9k
2
gravatar for Pierre Lindenbaum
3.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

I wrote one https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/misc/VcfIndexTabix.java

It's an old source, may be a recent version of htsjdk has some new methods to do this

ADD COMMENTlink written 3.3 years ago by Pierre Lindenbaum118k

Eventually I got around this by adapting the code from https://github.com/lindenb/jvarkit/wiki/BedIndexTabix. Thanks a ton Pierre! Here's a method to block compress and index a bed file:

/**
 * Block compress input bed file and create associated tabix index. Newly created file and index are
 * deleted on exit if deleteOnExit true.
 * @throws IOException 
 * */
private void blockCompressAndIndex(String in, String bgzfOut, boolean deleteOnExit) throws IOException {

    System.err.print("Compressing: " + in + " to file: " + bgzfOut + "... ");

    File inFile= new File(in);
    File outFile= new File(bgzfOut);

    LineIterator lin= IOUtils.openURIForLineIterator(inFile.getAbsolutePath());

    BlockCompressedOutputStream writer = new BlockCompressedOutputStream(outFile);
    long filePosition= writer.getFilePointer();

    TabixIndexCreator indexCreator=new TabixIndexCreator(TabixFormat.BED);
    BedLineCodec bedCodec= new BedLineCodec();
    while(lin.hasNext()){
        String line = lin.next();
        BedLine bed = bedCodec.decode(line);
        if(bed==null) continue;
        writer.write(line.getBytes());
        writer.write('\n');
        indexCreator.addFeature(bed, filePosition);
        filePosition = writer.getFilePointer();
    }
    writer.flush();

    System.err.print("Indexing... ");

    File tbi= new File(bgzfOut + TabixUtils.STANDARD_INDEX_EXTENSION);
    if(tbi.exists() && tbi.isFile()){
        System.err.println("Index file exists: " + tbi);
        System.exit(1);
    }
    Index index = indexCreator.finalizeIndex(writer.getFilePointer());
    index.writeBasedOnFeatureFile(outFile);
    writer.close();

    System.err.println("Done");

    if(deleteOnExit){
        outFile.deleteOnExit();
        File idx= new File(outFile.getAbsolutePath() + TabixUtils.STANDARD_INDEX_EXTENSION);
        idx.deleteOnExit();
    }
}
ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by dariober9.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 958 users visited in the last hour