Question: 1000G Query / Using Tabix With A Proxy
5.8 years ago by
secretjess170
Cambridge
secretjess170 wrote:

When I run the following command the connection times out:

./tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 1:57000000-57001000 > test.vcf
connect: Connection timed out
[main] fail to open the data file

But if I run this it works (it might take an estimated 4 hours to download but it does connect!):

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz

So how can I query the 1000 genomes data from behind a proxy? I'm assuming that's the problem.

(P.S. What I want to know is whether there are any recorded SNPs, SVs, etc. in a specified region.)

tabix vcftools • 2.7k views
ADD COMMENTlink modified 5.2 years ago by Biostar ♦♦ 20 • written 5.8 years ago by secretjess170

As of now, tabix does not support ftp proxy.

ADD REPLYlink written 5.2 years ago by lh331k
5.8 years ago by
Ying W3.9k
South San Francisco, CA
Ying W3.9k wrote:

First off, I really don't think this is the right way of doing things. You should run tabix after downloading the complete file, especially since the download might break halfway through, and then you would have to rerun everything. That said, to achieve what you initially set out to do, you can try:

wget -O - ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz | ./tabix -h - 1:57000000-57001000 > test.vcf
ADD COMMENTlink written 5.8 years ago by Ying W3.9k

Downloading the entire VCF file is not necessary in most cases. If you run tabix on an FTP location directly, only the index file is downloaded, and tabix accesses the relevant part of the VCF file directly on the FTP server. Of course, if you are regularly querying the VCF file, then I'd recommend downloading it to local storage.

In the case here, the query that the OP posted took less than a second to run on my machine (a little MacBook Air over a wireless internet connection). Downloading the entire 62GB VCF file takes considerably longer.

ADD REPLYlink written 5.8 years ago by BruceB330

Most importantly, if the download breaks halfway, no error message will be shown.

ADD REPLYlink written 5.8 years ago by Giovanni M Dall'Olio26k

Thanks both! That makes sense. I should probably update the release I'm looking at too. I intended this question to resolve my issues with using tabix behind a proxy, but I hadn't considered that the download might break.

ADD REPLYlink written 5.8 years ago by secretjess170
5.8 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

Tabix is very useful for downloading data from 1000 Genomes because, thanks to its indexing method, it lets you retrieve only portions of a file.

To use it behind a proxy, make sure that your HTTP_PROXY variables are correctly set. For example, you can add these lines to your .bashrc file:

export PROXY=http://your.proxy.edu
export PROXYPORT=8080
export http_proxy=$PROXY:$PROXYPORT
export HTTP_PROXY=$PROXY:$PROXYPORT
export https_proxy=$PROXY:$PROXYPORT
export HTTPS_PROXY=$PROXY:$PROXYPORT

Then run source ~/.bashrc, and tabix should work correctly. If it doesn't, try the latest version of tabix; I think some previous versions did not work correctly behind a proxy.
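
Note that the variables above only configure an HTTP proxy, while the URL in the question uses ftp://. One thing that may be worth trying is requesting the same path over http://. This is only a sketch: it assumes that the 1000 Genomes server exposes the same directory tree over HTTP and that your tabix build reads http_proxy for http:// URLs, neither of which is confirmed in this thread. The ftp_to_http helper is a hypothetical name introduced for illustration.

```shell
# Hypothetical helper: rewrite an ftp:// URL to http:// so the request can
# go through an HTTP proxy. Assumes the server serves the same path over HTTP.
ftp_to_http() {
    case "$1" in
        ftp://*) printf 'http://%s\n' "${1#ftp://}" ;;  # swap the scheme only
        *)       printf '%s\n' "$1" ;;                  # leave other URLs untouched
    esac
}

# Example use (not run here, needs network access and a working proxy):
#   URL=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz
#   ./tabix -h "$(ftp_to_http "$URL")" 1:57000000-57001000 > test.vcf
```

If the HTTP URL also times out, the proxy variables are probably not being picked up by this tabix build.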

You should be aware that in case of connection errors, tabix doesn't return any error or warning message. Thus, nothing will alert you if the file has not been downloaded correctly; you will have to check it yourself.
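
A minimal sketch of such a check, assuming the output was written with tabix -h (so the #CHROM header line should be present) and that you queried a single chromosome. The check_vcf function name is hypothetical. The check is only a sanity test: it cannot prove the output is complete, and a region with no variants legitimately yields zero records.

```shell
# check_vcf FILE CHROM
# Prints the number of records and returns non-zero if FILE is empty or
# missing, lacks the #CHROM header line, or contains records from another
# chromosome.
check_vcf() {
    [ -s "$1" ] || { echo "empty or missing: $1" >&2; return 1; }
    awk -F '\t' -v chrom="$2" '
        /^#CHROM/   { header = 1; next }   # column header written by tabix -h
        /^#/        { next }               # other meta-information lines
        $1 != chrom { bad++ }              # record from an unexpected chromosome
                    { n++ }
        END {
            printf "%d records\n", n
            ok = (header && bad == 0)
            exit !ok
        }
    ' "$1"
}

# Example use with the file from the question:
#   check_vcf test.vcf 1 || echo "test.vcf looks incomplete" >&2
```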

If you want to download files from 1000 Genomes, a valid alternative is the Aspera client. This allows you to download the whole 1000 Genomes dataset in less than one hour. To use the Aspera client, you can follow these instructions, possibly using the EBI server instead of the NCBI one given in the example if you are in Europe.

ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by Giovanni M Dall'Olio26k

I downloaded the newest tabix (0.2.6) and installed it by following http://genometoolbox.blogspot.co.uk/2013/11/installing-tabix-on-unix.html. The folder is called tabix-0.2.6, but when I run tabix it claims to be "Version: 0.2.5". Bit strange. Either way, thanks for the help, but I still can't get tabix to work with my proxy. I've done what I wanted in the browser, but it would still be useful for the future if I could get this working.

ADD REPLYlink written 5.8 years ago by secretjess170

Check that version 0.2.5 is not in your PATH variable: echo "$PATH"

ADD REPLYlink written 5.8 years ago by BruceB330

or try: which tabix
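
Since which only reports the first match, it may help to list every copy of tabix that PATH can reach. The sketch below does that in plain sh; all_on_path is a hypothetical helper name, and in bash the built-in type -a tabix gives similar information.

```shell
# all_on_path NAME
# Prints every executable called NAME found in the directories of $PATH,
# in the order the shell would search them.
all_on_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        candidate="${dir:-.}/$name"   # an empty PATH entry means the current directory
        [ -x "$candidate" ] && [ ! -d "$candidate" ] && printf '%s\n' "$candidate"
    done
    IFS=$old_ifs
}

# Example use: show each copy, then run the one you expect by its full path
# to see which version string it reports.
#   all_on_path tabix
```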

ADD REPLYlink modified 5.8 years ago • written 5.8 years ago by Giovanni M Dall'Olio26k

Using either of those commands shows 0.2.6, but if I just run tabix to start up the program (it displays the help), it says Version: 0.2.5 (r1005).

ADD REPLYlink written 5.7 years ago by secretjess170