Question: (Closed) How To Decompress 1000Genomes Bgzip-Compressed Files Using Java
1
8.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum116k wrote:

(cross-posted on Stack Overflow)

Ok. I'm puzzled.

I want to use Java to download ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz in order to annotate some VCFs on the fly. (I don't want to download this file to my desktop; I really want to stream its bytes.)

But the program stops after reading a few lines.

Here is the minimal program for this problem:

import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
public class Test
    {
    public static void main(String args[]) throws Exception
        {
        int count=0;
        URL url=new URL("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz");
        String line;
        BufferedReader in= new BufferedReader(new InputStreamReader(new GZIPInputStream(url.openStream())));
        while((line=in.readLine())!=null)
            {
            ++count;
            System.err.println("["+count+"] "+line);
            }
        in.close();
        System.out.println("Done. nLines="+count);
        }
    }

Compile and run:

javac Test.java
java -Dftp.proxyHost=${MYPROXYHOST} -Dftp.proxyPort=${MYPROXYPORT} Test

And the output stops prematurely after the 1012th line (from my home and from my work place):

(...)
[999] 1    750138    rs61770171    G    A    .    PASS    DP=2189;AF=0.083;CB=UM,BI;EUR_R2=0.129;AFR_R2=0.164
[1000] 1    750153    .    T    C    .    PASS    DP=2555;AF=0.016;CB=UM,BI,BC;EUR_R2=0.167;AFR_R2=0.281
[1001] 1    750190    .    C    T    .    PASS    DP=3515;AF=0.003;CB=UM,BI;EUR_R2=0.581;AFR_R2=0.575
[1002] 1    750235    .    G    A    .    PASS    DP=3914;AF=0.019;CB=UM,BI,BC;EUR_R2=0.719;AFR_R2=0.733
[1003] 1    750436    .    C    T    .    PASS    DP=598;AF=0.020;CB=BI,BC;EUR_R2=0.144;AFR_R2=0.355
[1004] 1    750511    .    G    A    .    PASS    DP=806;AF=0.010;CB=BI,BC;AFR_R2=0.352
[1005] 1    750718    .    G    A    .    PASS    DP=2751;AF=0.003;CB=UM,BI,BC;EUR_R2=0.54;AFR_R2=0.545
[1006] 1    750897    .    G    A    .    PASS    DP=744;AF=0.010;CB=BI,BC;AFR_R2=0.479
[1007] 1    750946    .    A    G    .    PASS    DP=873;AF=0.010;CB=BI,BC;AFR_R2=0.414
[1008] 1    751043    .    G    A    .    PASS    DP=1522;AF=0.000;CB=BI,BC;EUR_R2=0.273
[1009] 1    751281    .    T    C    .    PASS    DP=403;AF=0.010;CB=BI,BC;AFR_R2=0.178
[1010] 1    751343    .    T    A    .    PASS    DP=1912;AF=0.117;CB=UM,BI;EUR_R2=0.683;AFR_R2=0.582
[1011] 1    751456    .    T    C    .    PASS    DP=1775;AF=0.008;CB=UM,BI;EUR_R2=0.515;AFR_R2=0.332
[1012] 1    
Done. nLines=1012

I was not the only one to have this problem: http://twitter.com/#!/neilswainston/status/43301088757157888

Re: 1000 genome. Don't think it's your problem. Try downloading it and uncompressing it manually - same result... (66kb file).

Using Internet Explorer and WinRAR, the file was reported as corrupted.

Using firefox for downloading the file, the browser said:

"Content Encoding Error :The page you are trying to view cannot be shown because it uses an invalid or unsupported form of compression. Please contact the website owners to inform them of this problem."

Using curl: it worked !!!

>curl "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz" -o ALL.2of4intersection.20100804.sites.vcf.gz
  % Total    % Received % Xferd  Average Speed  Time    Time    Time  Current
                                Dload  Upload  Total  Spent    Left  Speed
100  388M  100  388M    0    0  414k      0  0:16:00  0:16:00 --:--:--  566k
> md5sum ALL.2of4intersection.20100804.sites.vcf.gz
da386f5e2e0fa7e92c64e79691d0a8b8  ALL.2of4intersection.20100804.sites.vcf.gz ##CORRECT
> gunzip -t ALL.2of4intersection.20100804.sites.vcf.gz
> ls -la ALL.2of4intersection.20100804.sites.vcf
-rw-r--r-- 1 lindenb lindenb 1947373891 2011-03-03 17:39
ALL.2of4intersection.20100804.sites.vcf

Why? What's happening? How can it be fixed? Is there a problem with bgzip compression?

Thanks for your help.

Pierre

UPDATE:

I solved my problem by using net.sf.samtools.util.BlockCompressedInputStream instead of GZIPInputStream. The following code works:

import java.net.*;
import java.io.*;
import net.sf.samtools.util.BlockCompressedInputStream;
public class Test
    {
    public static void main(String args[]) throws Exception
        {
        URL url=new URL("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz");
        String line;
        // BlockCompressedInputStream keeps reading past the first BGZF block, where GZIPInputStream stopped
        BufferedReader in= new BufferedReader(new InputStreamReader(new BlockCompressedInputStream(url.openStream())));
        while((line=in.readLine())!=null)
            {
            System.out.println(line);
            }
        in.close();
        System.out.println("Done.");
        }
    }

Strange ... when I download the file manually and run a variant of your code that reads the file directly rather than going through the URL, the problem also exists - it stops on line 1012. So the problem doesn't seem to be internet related.

ADD REPLYlink written 8.0 years ago by Bio_X2Y3.6k

on http://stackoverflow.com/questions/5180001 a user found that the size of the downloaded part was 65536 bytes long...

ADD REPLYlink written 8.0 years ago by Pierre Lindenbaum116k

Ok, when I work directly with GZIPInputStream (abandoning the reader altogether), it gets the 65536 bytes you mentioned, and then stops. So we can now rule out (1) internet issues, and (2) encoding at the level of the reader.

ADD REPLYlink written 8.0 years ago by Bio_X2Y3.6k

I'd also change the title to "How to decompress 1000genomes bgzip-compressed files using Java", but I lack the permissions to do that.

ADD REPLYlink written 7.8 years ago by Chronos550

@chronos I've changed the title

ADD REPLYlink written 7.8 years ago by Pierre Lindenbaum116k

closed as answered. BTW, the Java lib for gzip has been fixed.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Pierre Lindenbaum116k
7
8.0 years ago by
iw9oel_ad6.0k
iw9oel_ad6.0k wrote:

The SAM spec. carries this warning:

"There is a known bug in the Java GZIPInputStream class that concatenated gzip archives cannot be successfully decompressed by this class. BGZF files can be created and manipulated using the built-in Java util.zip package, but naive use of GZIPInputStream on a BGZF file will not work due to this bug."

I'm not sure whether this is the bug, but it looks like it might be. Additionally, bug 4763158 states "The current GZIPInputStream implementation assumes that the compressed input stream provided consists of a single header followed by the entire compressed body of text/data followed by a standard gzip trailer." It also says that this is fixed in Java 7 (a mere 8 years after it was reported!)
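To see what "concatenated gzip archives" means in practice, here is a minimal sketch that builds two complete gzip members in memory, joins them, and reads the result back through a single GZIPInputStream. The class and method names (ConcatGzipDemo, gzipMember, readConcatenated) are made up for this illustration. On a Java 7 or later runtime I would expect both members to come back; on a runtime affected by the bug described above, the read would presumably end after the first member.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ConcatGzipDemo {
    // Compress one string into a complete gzip member (header + body + trailer).
    public static byte[] gzipMember(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    // Join two gzip members back to back and decompress them with one GZIPInputStream.
    public static String readConcatenated(String first, String second) throws IOException {
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        joined.write(gzipMember(first));
        joined.write(gzipMember(second));

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(joined.toByteArray()))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Prints whatever the runtime's GZIPInputStream manages to decompress.
        System.out.print(readConcatenated("first member\n", "second member\n"));
    }
}
```

A BGZF file is the same idea at scale: many small gzip members one after another, which is why a reader that stops after the first trailer only returns the first block.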

ADD COMMENTlink modified 8.0 years ago • written 8.0 years ago by iw9oel_ad6.0k
2
8.0 years ago by
lh331k
United States
lh331k wrote:

Most *.vcf.gz files on the 1000 Genomes FTP site are BGZF-compressed, so Keith James' answer applies.

Using BGZF allows you to do something like this:

tabix ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz 1:2,000,000-2,100,000

A Java implementation is also available; it requires Picard. IGV uses it.
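If you want to check from Java whether a .gz file is BGZF before choosing a reader, a rough sketch is to look at the first gzip header: per the SAM spec, a BGZF block sets the FEXTRA flag and carries a "BC" extra subfield whose 16-bit value is the block size minus one. The class name BgzfSniffer and the method firstBlockSize are invented for this example, and the sample bytes are the 28-byte BGZF end-of-file marker from the spec.

```java
public class BgzfSniffer {
    // The empty BGZF block that marks end-of-file, as given in the SAM spec.
    public static final byte[] EOF_BLOCK = {
        0x1f, (byte) 0x8b, 8, 4, 0, 0, 0, 0, 0, (byte) 0xff, 6, 0,
        0x42, 0x43, 2, 0, 0x1b, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0
    };

    // Read an unsigned little-endian 16-bit value at the given offset.
    private static int u16(byte[] b, int pos) {
        return (b[pos] & 0xff) | ((b[pos + 1] & 0xff) << 8);
    }

    // Returns the total size of the first BGZF block in bytes,
    // or -1 if the data does not start with a BGZF header.
    public static int firstBlockSize(byte[] b) {
        if (b.length < 18) return -1;
        // gzip magic (1f 8b), deflate method (8), FEXTRA bit set in FLG
        if ((b[0] & 0xff) != 0x1f || (b[1] & 0xff) != 0x8b) return -1;
        if (b[2] != 8 || (b[3] & 4) == 0) return -1;

        int xlen = u16(b, 10);
        int pos = 12;
        int end = Math.min(b.length, 12 + xlen);
        // Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
        while (pos + 4 <= end) {
            int slen = u16(b, pos + 2);
            if (b[pos] == 'B' && b[pos + 1] == 'C' && slen == 2 && pos + 6 <= end) {
                return u16(b, pos + 4) + 1; // BSIZE is stored as block size minus 1
            }
            pos += 4 + slen;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("EOF marker block size: " + firstBlockSize(EOF_BLOCK));
    }
}
```

In real use you would pass the first 18 or more bytes read from the file or URL stream; a positive result suggests using a block-aware reader rather than a plain gzip one.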

ADD COMMENTlink written 8.0 years ago by lh331k
2

Your comment is misleading. You do not get it through because snp130 has 49 chromosomes, but knowngene has 46 chromosomes - the input is wrong. I agree the error message should be improved, but you cannot put all the blame on the library when you give it wrong inputs. IGV uses tabix exclusively for all VCFs and optionally for BED/GFF/etc. The Java implementation can be improved, but it is more mature than what you have described. At the very least, having "one file" is not any sign of immaturity.

ADD REPLYlink written 8.0 years ago by lh331k
1

Thanks, I played with this Java implementation. But it's still very immature: http://plindenbaum.blogspot.com/2011/02/tabix-fast-retrieval-of-sequence.html

ADD REPLYlink written 8.0 years ago by Pierre Lindenbaum116k

You do not get it through because snp130 has 49 chromosomes, but knowngene has 46 chromosomes - some chromosomes cannot be found in knowngene. I agree the error message should be improved, but I do not see how "one file" is a sign of immaturity. IGV uses tabix exclusively for all VCFs and optionally for BED/GFF/etc. The Java implementation can be improved, but it is working correctly and is not as immature as you said.

ADD REPLYlink written 8.0 years ago by lh331k

Your comment is misleading. You do not get it through because snp130 has 49 chromosomes, but knowngene has 46 chromosomes - the input is wrong. I agree the error message should be improved, but I do not see how "one file" is any sign of immaturity. IGV uses tabix exclusively for all VCFs and optionally for BED/GFF/etc. The Java implementation can be improved, but it is more mature than what you have described.

ADD REPLYlink written 8.0 years ago by lh331k

If there were a public TabixException class specific to the type of errors tabix is throwing, there would need to be another file for it, like any other class. Such an exception would be a good idea for this occasion because Pierre's data were simply mismatched, something that is not well described by the core Java exceptions. In any event, an ArrayIndexOutOfBoundsException should never get propagated up into user code.

ADD REPLYlink written 8.0 years ago by iw9oel_ad6.0k
The thread is closed. No new answers may be added.
