Fetch a single chromosome from a WGS bam located on an object store
8.9 years ago

I am running GATK analyses on hundreds of WGS samples in indexed BAM format. To speed up processing and reduce wall-clock time, I process the genomes per chromosome. I extract each chromosome with samtools, and this works fine on an NFS share. However, I am now running the analysis against a file store that is not POSIX compatible. I have to download the whole file before splitting it into chromosomes, which is time-consuming because the per-chromosome files are written back to the file store and then downloaded again for processing.

I can download part of a file by giving an offset and a size. Is it possible to get the location of a chromosome from the index, download only that chromosome, add some magic sauce to create a valid BAM file, and process the chromosome in one go instead of uploading and downloading chunks?
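As a rough illustration of the index lookup asked about here, the sketch below reads a standard `.bai` index and returns the compressed byte range that covers one reference sequence. The function name `reference_byte_range` is made up for this example, and the result is only a download hint. The first and last BGZF blocks in the range may also hold reads from neighbouring references, the first block may need trimming at the in-block offset, and a valid BAM still needs the header blocks from the start of the file plus the 28-byte BGZF EOF marker at the end.

```python
import struct

BGZF_MAX_BLOCK = 65536  # a BGZF block is at most 64 KiB when compressed
PSEUDO_BIN = 37450      # metadata bin written by samtools, not real chunks


def reference_byte_range(bai_bytes, ref_index):
    """Return (start, end) compressed byte offsets covering one reference.

    `end` is exclusive and padded by one maximum block size, so the caller
    should clamp it to the real file size. Returns None if the reference
    has no indexed reads.
    """
    if bai_bytes[:4] != b"BAI\x01":
        raise ValueError("not a BAI index")
    (n_ref,) = struct.unpack_from("<i", bai_bytes, 4)
    if not 0 <= ref_index < n_ref:
        raise IndexError("reference index out of range")

    pos = 8
    for ref in range(n_ref):
        lo = hi = None
        (n_bin,) = struct.unpack_from("<i", bai_bytes, pos)
        pos += 4
        for _ in range(n_bin):
            bin_id, n_chunk = struct.unpack_from("<Ii", bai_bytes, pos)
            pos += 8
            for _ in range(n_chunk):
                beg, end = struct.unpack_from("<QQ", bai_bytes, pos)
                pos += 16
                if bin_id == PSEUDO_BIN:
                    continue  # holds counts and summary offsets, skip it
                lo = beg if lo is None else min(lo, beg)
                hi = end if hi is None else max(hi, end)
        # Skip the linear index of this reference.
        (n_intv,) = struct.unpack_from("<i", bai_bytes, pos)
        pos += 4 + 8 * n_intv

        if ref == ref_index:
            if lo is None:
                return None
            # A virtual offset is (compressed offset << 16) | in-block offset.
            return lo >> 16, (hi >> 16) + BGZF_MAX_BLOCK
    return None
```

The returned pair could then be passed as offset and size (end minus start) to whatever partial-download call the storage system offers.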

sequencing • 2.3k views

Mounting gridftp is an idea for a cloud setup, but I am using the grid and do not have superuser rights to mount a device. The last visible work on grifi was almost 10 years ago, and I am not sure it works with the current Globus software stack on which it depends.

8.9 years ago

I am running GATK analyses on hundreds of WGS samples in indexed BAM format. To speed up processing and reduce wall-clock time, I process the genomes per chromosome.

You don't need to do this. With GATK, using the -L parameter should be enough:

https://www.broadinstitute.org/gatk/guide/tagged?tag=intervals

The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome.
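One way to script a run per chromosome with -L is sketched below. It assumes the GATK4 command line (`gatk HaplotypeCaller` with `-R`, `-I`, `-L` and `-O`); older GATK releases are invoked differently. The function name, file names and contig list are placeholders, and the code only builds the command lines without running them.

```python
def gatk_commands(bam, reference, contigs, out_dir="calls"):
    """Yield one GATK4 HaplotypeCaller argument list per contig."""
    for contig in contigs:
        yield [
            "gatk", "HaplotypeCaller",
            "-R", reference,                       # reference FASTA
            "-I", bam,                             # indexed input BAM
            "-L", contig,                          # restrict to this interval
            "-O", f"{out_dir}/{contig}.vcf.gz",    # per-contig output
        ]


# Hypothetical inputs, for illustration only.
for cmd in gatk_commands("sample.bam", "ref.fasta", ["chr1", "chr2"]):
    print(" ".join(cmd))
```

Each list can be handed to `subprocess.run` or written into a job script, so every contig becomes an independent job.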


The -L option works perfectly on a POSIX-compatible service, but the gridftp implementation of dCache that I am using is not POSIX compatible. I also have a limit on my wall-clock time, so performing all calculations in one place is not an option.


Ask the samtools developers to write a gridftp backend for htslib, as they have done for https/ssh via curl and for iRODS. I see that the gridftp C API supports partial downloading from an offset, so it is technically possible to make samtools/bcftools work with gridftp. Once samtools supports gridftp, I guess you could pipe the BAM stream to GATK for calling? On the other hand, the gridftp API uses function callbacks to retrieve data, so it might not be easy to implement the streaming interface required by htslib.
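To illustrate the mismatch between a callback-driven transfer and a pull-style `read()` interface, here is a minimal generic sketch in Python. It is not the gridftp or htslib API: `CallbackStream` and the `start_transfer` argument are invented for this example, and `start_transfer` stands for any function that pushes data chunks into a callback. A background thread runs the transfer and a bounded queue hands chunks over to the reader.

```python
import queue
import threading


class CallbackStream:
    """File-like reader fed by a callback-style producer (illustrative)."""

    _EOF = object()  # sentinel marking the end of the transfer

    def __init__(self, start_transfer, max_chunks=16):
        self._chunks = queue.Queue(maxsize=max_chunks)
        self._buffer = b""
        self._done = False

        def run():
            try:
                # The producer calls our callback once per received chunk.
                start_transfer(self._chunks.put)
            finally:
                self._chunks.put(self._EOF)

        threading.Thread(target=run, daemon=True).start()

    def read(self, size=-1):
        """Return up to `size` bytes, or everything left if size < 0."""
        while not self._done and (size < 0 or len(self._buffer) < size):
            chunk = self._chunks.get()
            if chunk is self._EOF:
                self._done = True
            else:
                self._buffer += chunk
        if size < 0:
            size = len(self._buffer)
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data


def fake_transfer(on_data):
    # Stand-in for a real callback-based download.
    for chunk in (b"ab", b"cde", b"f"):
        on_data(chunk)


stream = CallbackStream(fake_transfer)
print(stream.read(4))
print(stream.read())
```

The bounded queue makes the producer block when the reader falls behind, which keeps memory use limited. A real backend would also need seeking and error propagation, which this sketch leaves out.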


This is not an easy solution to implement, but it would solve the problem. I was thinking of a more quick-and-dirty solution.
