the task is as follows:
Given a local BAM index file (BAI) plus fasta reference and a remote BAM, produce selected region only samtools-readable BAM using curl with SSL & certificates. Minimize compression/decompression, just get the bgzf-compressed chunks containing region of interest (plus some other reads outside the region), perform the exact regions selection on the local drive.
The curl-less variant of it:
Using BAI + fasta reference, translate region (i.e. chr1:2000-4000) into binary positions of the corresponding BAM, save as text. Dump the BAM header + bgzf-compressed chunks creating samtools-readable micro-BAM, i.e. using dd.
There are multiple libraries/programs implementing necessary functions (samtools[C], htsjdk[java], sambamba/bioD[D], even BioGO[go]). My problem is that because I do understand enough of the BAM/BAI file specifications, I have problem understanding what exactly a given chunk of code does, especially that I am coming from the Python camp.
One possible shortcut is to use htsjdk and access the necessary classes/methods using either jython or groovy. In jython 2.7.0 after putting in Jython registry:
python.security.respectJavaAccessibility = false #in the jython comandline: import sys sys.path.append("/usr/local/soft/picard_1.138/htsjdk-1.138.jar") from htsjdk.samtools import * from java.io import File bai_fh = File("./A.bam.bai") mydict = SAMSequenceDictionary() bai_foo = DiskBasedBAMFileIndex(bai_fh, mydict)
At this stage I am able to use bai_foo.getNumberOfReferences() etc., but the desired method getBinsOverlapping(int referenceIndex, int startPos, int endPos) is in BrowseableBAMIndex interface. And in this maze of classes, methods, interfaces I am getting lost.
So my questions are:
0) not explored but important: given that BAM file starts with compressed, mixed text and bin values header, I hope it is enough to simply dump everything from the BAM start until the start of the first alignment/unaligned sequences block. What is the correct method to get this number?
1) given that htsjdk is accessible from Jython/groovy, what is the path to access the getBinsOverlapping?
2) since I am not restricted to above libs/languages, any hack based on bioD or BioGO, Python or C etc. will be great
Many thanks for your help
Edit: The missing groovy part
groovy -cp libs/htsjdk-1.138.jar test_htsjdk.groovy ###### test_htsjdk.groovy ###### #!/usr/bin/env groovy import htsjdk.samtools.* File bam_fh = new File("./A.bam") File bai_fh = new File("./A.bam.bai") def mydict = new SAMSequenceDictionary() def bai_foo = new DiskBasedBAMFileIndex(bai_fh, mydict) println bai_foo.getNumberOfReferences()