Question: Tool for random access to indexed BAM files in S3?
2
2.3 years ago by
donfreed1.1k
Mountain View, CA
donfreed1.1k wrote:

I have access to some indexed BAM files in S3. Using the AWS CLI, it is fairly easy to download an entire BAM file. That was fine for our initial analysis, but for validation we are interested in looking at a small region of ~10kb in thousands of individuals.

The BAM files are indexed and S3 supports GET requests with range headers so this should be possible. Does anyone know of a tool that does this? 

EDIT: htslib 1.3 was recently released and supports random access to BAM files in s3.
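For the use case in the question (one small region across many BAMs), a loop along these lines may be enough. This is only a sketch: it assumes a samtools build with S3 support, a hypothetical manifest file with one s3:// URL per line, and the helper names region_output_name and fetch_region are made up for illustration.

```shell
# Derive a local output name (e.g. sample1.20_1000-100000.bam) from a URL and a region.
# The ( ) body runs in a subshell so the helper variables stay local.
region_output_name() (
    url=$1
    region=$2
    base=${url##*/}      # file name without bucket and path
    base=${base%.bam}    # drop the .bam suffix
    tag=$(printf '%s' "$region" | tr ':' '_')
    printf '%s.%s.bam\n' "$base" "$tag"
)

# Read one BAM URL per line from a manifest (hypothetical file) and
# extract the same region from each into its own small BAM.
fetch_region() {
    manifest=$1
    region=$2
    while IFS= read -r url || [ -n "$url" ]; do
        [ -n "$url" ] || continue    # skip blank lines
        samtools view -b -o "$(region_output_name "$url" "$region")" \
            "$url" "$region" < /dev/null
    done < "$manifest"
}
```

Usage would be something like `fetch_region bam_urls.txt 20:1000-100000`, where bam_urls.txt is the assumed manifest.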

cloud bam • 3.0k views
ADD COMMENTlink modified 13 months ago by xing0 • written 2.3 years ago by donfreed1.1k
7
2.3 years ago by
John Marshall1.1k
Glasgow, Scotland
John Marshall1.1k wrote:

The upcoming samtools 1.3 release will support this.  As well as the current way of accessing public buckets in donfreed's comment on another answer, samtools 1.3 will understand s3: pseudo-URLs like

    s3://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.blah.bam

For accessing private buckets, samtools will look for your AWS credentials in the usual configuration files and environment variables, or you can specify them on the command line as s3://id:secret@bucket/... though that's not particularly recommended.

This release will be fairly soon.  In the meantime, you can try this out by building samtools with GitHub htslib's libcurl branch.  The code in that branch only looks for credentials in $AWS_ACCESS_KEY_ID / $AWS_SECRET_ACCESS_KEY and as id:secret in the URL.
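Since the libcurl branch reads credentials only from those two environment variables (or from the URL), a small guard can catch a missing variable before samtools is called. This is a sketch under that assumption; have_aws_env and s3_view are hypothetical helper names, and a release that also reads the configuration files would not need the check.

```shell
# Succeeds only when both credential variables are set and non-empty.
have_aws_env() {
    [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]
}

# Hypothetical wrapper: refuse to run samtools view when the variables are missing.
s3_view() {
    if ! have_aws_env; then
        echo "AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are not set" >&2
        return 1
    fi
    samtools view "$@"
}
```

For example, `s3_view s3://bucket/sample.bam 20:1000-100000` would pass its arguments through unchanged once both variables are exported. Keeping the keys in the environment also avoids the id:secret@bucket URL form, which can end up in shell history.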

ADD COMMENTlink written 2.3 years ago by John Marshall1.1k

Thank you. This worked with only slight modification.

ADD REPLYlink written 2.3 years ago by donfreed1.1k

Can you please describe what you did? It would be very helpful.

ADD REPLYlink written 2.1 years ago by the_smiao0
2

No problem.

I am not sure how the official htslib implementation is coming along, but these steps work for me.

1. Start and ssh into an EC2 instance.

2. Install some software and libraries:

$ sudo yum install gcc autoconf git zlib-devel libcurl-devel openssl-devel ncurses-devel

3. Clone my fork of htslib's libcurl branch.

$ git clone https://github.com/DonFreed/htslib.git -b libcurl

4. Build and install htslib.

$ cd htslib/
$ autoconf
$ ./configure --enable-libcurl
$ sudo make install

5. Clone and install samtools. 

$ cd ..
$ git clone https://github.com/samtools/samtools.git -b 1.2
$ cd samtools
$ sudo make install LDLIBS+=-lcurl LDLIBS+=-lcrypto

6. Configure AWS.

$ aws configure
AWS Access Key ID [None]: ******
AWS Secret Access Key [None]:  ********
Default region name [None]: us-east-1
Default output format [None]:

7. Set environment.

$ export AWS_ACCESS_KEY_ID=*****
$ export AWS_SECRET_ACCESS_KEY=*****

8. Test.

$ samtools                    # works
$ aws s3 ls s3://1000genomes/ # works
$ samtools view s3://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam 20:1000-100000 # works
ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by donfreed1.1k
1

Just tried this and it worked!  Thanks!

Only problem I had was:
git clone https://github.com/samtools/samtools.git -b 1.2

For some unknown reason, I had to check out the 1.2 branch manually inside the directory.

ADD REPLYlink written 2.1 years ago by egafni30
1

Just noting here that if you have a bucket name with a "." (period) in it, samtools view will not work.  Hope that saves you the hours it took me to figure it out!

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by egafni30

This is true, but %2E is the URL (percent-encoded) equivalent of a period, and substituting %2E for '.' (period) in your URL does seem to work.
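The substitution can be scripted so that only the bucket name is touched and the object key keeps its dots. This is a minimal sketch of that workaround; encode_bucket_dots is a hypothetical helper name, it assumes a plain s3://bucket/key URL without embedded credentials, and newer builds may not need it at all.

```shell
# Replace each '.' in the bucket part of an s3:// URL with %2E.
# The ( ) body runs in a subshell so the helper variables stay local.
encode_bucket_dots() (
    rest=${1#s3://}           # strip the scheme
    bucket=${rest%%/*}        # everything before the first slash
    path=${rest#"$bucket"}    # remainder, including the leading slash
    bucket=$(printf '%s' "$bucket" | sed 's/\./%2E/g')
    printf 's3://%s%s\n' "$bucket" "$path"
)
```

It could then be used inline, e.g. `samtools view "$(encode_bucket_dots s3://my.bucket.name/dir/file.bam)" 20:1000-100000`, where the bucket name is made up.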

ADD REPLYlink written 21 months ago by mschaff0

@DonFreed, I saw you had support for using session tokens, is this only available in htsfile or is there a way to build samtools to include using AWS_SESSION_TOKEN?  I work on the National Database for Autism Research (NDAR) project and I saw you referenced our use of temporary federated tokens to control access to s3.

I would be keen to show users this functionality baked into samtools; currently this can be done through the use of a proxy (https://github.com/obenshaindw/s3proxy) and writing the s3 urls in an http scheme that makes requests against the proxy.

ADD REPLYlink written 23 months ago by david.obenshain10
1

Hi @david.obenshain, steps 1-5 above will build samtools with support for temporary session tokens. In addition to those steps, you need to set the AWS_SESSION_TOKEN environment variable.

Here’s a detailed example of accessing NDAR on AWS EC2.

1. Perform steps 1-5 above.

$ sudo yum install gcc autoconf git zlib-devel libcurl-devel openssl-devel ncurses-devel
$ git clone https://github.com/DonFreed/htslib.git -b libcurl
$ cd htslib
$ autoconf
$ ./configure --enable-libcurl
$ sudo make install
$ cd ..
$ git clone https://github.com/samtools/samtools.git -b 1.2
$ cd samtools
$ sudo make install LDLIBS+=-lcurl LDLIBS+=-lcrypto

2. Get access keys from NDAR. See the NDAR cloud_page to download "downloadmanager.jar". More information can be found from the ndar_tutorials.

$ cd ~/
$ unzip downloadmanager.zip
$ java -jar downloadmanager.jar -u $ndar_user_id -p $ndar_pswd -g aws_keys.txt
$ cat aws_keys.txt
accessKey=*****
secretKey=*****
sessionToken=*****…*****

3. Configure AWS. Add the token to the credential file.

$ aws configure
AWS Access Key ID [None]: ******
AWS Secret Access Key [None]:  ********
Default region name [None]: us-east-1
Default output format [None]:
$ echo "aws_session_token = *****…*****" >> .aws/credentials

4. Set environment variables.

$ export AWS_ACCESS_KEY_ID=*****
$ export AWS_SECRET_ACCESS_KEY=*****
$ export AWS_SESSION_TOKEN=******

5. Test.

$ samtools   # works
$ aws s3 ls s3://NDAR_Central_4/submission_10215/complete/11000/complete_bams/11000.fa.realigned.recal.bam     # works
$ samtools view s3://NDAR_Central_4/submission_10215/complete/11000/complete_bams/11000.fa.realigned.recal.bam 10:1000000-1010000 # works
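
The exports in step 4 can also be filled in from the key file rather than by hand. This is a sketch that assumes the file has exactly the accessKey=/secretKey=/sessionToken= layout shown in step 2; load_aws_keys is a hypothetical helper name, and it has to be defined in (or sourced into) the current shell for the exports to stick.

```shell
# Export AWS credentials from a key file with accessKey=, secretKey= and
# sessionToken= lines. Only the text up to the first '=' is stripped, so
# tokens that themselves contain '=' are kept intact.
load_aws_keys() {
    while IFS= read -r _line || [ -n "$_line" ]; do
        case $_line in
            accessKey=*)    export AWS_ACCESS_KEY_ID="${_line#accessKey=}" ;;
            secretKey=*)    export AWS_SECRET_ACCESS_KEY="${_line#secretKey=}" ;;
            sessionToken=*) export AWS_SESSION_TOKEN="${_line#sessionToken=}" ;;
        esac
    done < "$1"
}
```

After `load_aws_keys aws_keys.txt`, the samtools command in step 5 should see the same three variables as with the manual exports.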
ADD REPLYlink modified 23 months ago • written 23 months ago by donfreed1.1k
1

Hi Don,

Thank you, this does work. I was also able to do the following to have bcftools built to work with your libcurl branch.  Completing the steps above first...

$ cd ~
$ git clone https://github.com/samtools/bcftools -b 1.2
$ cd bcftools
$ git checkout -b 1.2
$ sudo make install LDLIBS+=-lcurl LDLIBS+=-lcrypto
$ bcftools view s3://NDAR_LOCATION_FOR_VCF.vcf # Works

I would like to put a few gists on our GitHub account related to this, if you don't mind.  You might also want to check out https://github.com/NDAR/nda_aws_token_generator to skip step 2.  When support for the security token is added to the official samtools/htslib we can update the gists accordingly.

I was really excited to stumble on this, and had the coincidence of giving a small demo of this functionality to Dr. Pevsner yesterday.

David

 

ADD REPLYlink modified 23 months ago • written 23 months ago by david.obenshain10
1

See also PR #303 which expands on Don's branch to also read credentials (including session tokens) from the usual configuration files (mostly ~/.aws/credentials).  This will be landing in htslib shortly, but it would be good if some of you S3 users could do some testing first.

ADD REPLYlink written 23 months ago by John Marshall1.1k
0
2.3 years ago by
h.mon9.0k
Brazil
h.mon9.0k wrote:

samtools view can fetch just the requested chunks of a BAM file over the internet, using FTP or HTTP. I am not sure whether it can do so over encrypted connections.

ADD COMMENTlink written 2.3 years ago by h.mon9.0k
1

Perhaps one could modify the source code to SAMtools to extend support for https and to also submit the required XML payload to S3, to retrieve chunks of data.

ADD REPLYlink written 2.3 years ago by Alex Reynolds21k
1

Samtools view will work for 'public' buckets, but not for private buckets. For example, the command below will work on both my local machine and EC2, but a similar command will not work for the private buckets I am accessing.

samtools view http://s3.amazonaws.com/1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam 1:1000000-1001000
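
For public buckets, the HTTP form above can be derived mechanically from an s3:// URL. This is a small sketch that assumes the path-style http://s3.amazonaws.com/bucket/key layout used in the command above; s3_to_http is a hypothetical helper name.

```shell
# Rewrite s3://bucket/key as http://s3.amazonaws.com/bucket/key.
# The ( ) body runs in a subshell so the helper variable stays local.
s3_to_http() (
    rest=${1#s3://}    # bucket/key without the scheme
    printf 'http://s3.amazonaws.com/%s\n' "$rest"
)
```

A call such as `samtools view "$(s3_to_http s3://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam)" 1:1000000-1001000` would then expand to the command above.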
ADD REPLYlink written 2.3 years ago by donfreed1.1k
0
20 months ago by
tychele0
United States
tychele0 wrote:

Unfortunately, this does not seem to work for Requester Pays buckets. It works for the other examples listed above.

ADD COMMENTlink written 20 months ago by tychele0
1

Requester Pays buckets need an extra x-amz-request-payer: requester header that at present samtools doesn't set, so this will indeed not work at present.

Clearly it would not be appropriate for htslib/samtools to set it all the time (as it represents explicit acknowledgement from the user that they will be charged). So we could set it if some flag was present in the URL or perhaps via an extra config file key on the profile used. @donfreed or anyone else: are you aware of any existing practice in this area?

ADD REPLYlink modified 19 months ago • written 19 months ago by John Marshall1.1k
1

This is now HTSlib issue #346; hopefully we'll come up with a way to say "yes, charge me!" in time for the 1.4 release.

ADD REPLYlink written 19 months ago by John Marshall1.1k

A workaround is available: you can use pre-signed URLs, as described in the GitHub issue. However, we recognise the limitations of this workaround and will look into alternatives.

ADD REPLYlink written 19 months ago by mp150
0
13 months ago by
xing0
xing0 wrote:

I have a private AWS bucket with aws_access_key_id, aws_secret_access_key, and region.

How do I set up samtools 1.3 to make it work on BAM files in AWS S3 storage?

Following the instructions above, I can make this command work:

$ samtools view s3://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam 20:1000-100000 # works

but I cannot make samtools work on my private bucket.

Any suggestions?

ADD COMMENTlink modified 12 months ago • written 13 months ago by xing0