Split A Bam File Into Smaller Files By Tile Number
11.0 years ago
gaelgarcia05 ▴ 280

Hi all,

I would like to split a very big BAM file into smaller files so that I can annotate it in parallel. Someone suggested splitting it by tile number, which is a good idea since it guarantees that all the alignments for a given read end up in the same file.

However, I am stuck on how to phrase the awk command for this, since the tile number sits inside the read ID string in the first field of the alignment. Within that string it is separated from the other information by ":", while the read ID field itself is separated from the other fields by "\t".

HWI-ST975:104:C0W47ACXX:8:1101:8269:91631

The tile number (encoded) is 1101, the 5th ":"-separated field of the read ID. How could I use awk to write each line to the new file that corresponds to its tile number?

Thanks, Carmen
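
For the awk phrasing asked about above, here is a minimal sketch rather than a tested pipeline. It assumes the alignments arrive as SAM text on stdin (for example from samtools view), that the read ID is the first tab-separated column, and that the tile is its 5th ":"-separated part, as in the example read ID. The function name split_by_tile and the output prefix are made up for illustration.

```shell
# split_by_tile PREFIX  (hypothetical helper, not part of samtools)
# Reads SAM alignment lines on stdin and writes each one to PREFIX<tile>.sam.
split_by_tile() {
    awk -F'\t' -v prefix="$1" '
        /^@/ { next }                      # ignore SAM header lines if present
        {
            split($1, id, ":")             # break the read ID on ":"
            print > (prefix id[5] ".sam")  # id[5] is the tile, e.g. 1101
        }'
}
```

Something like "samtools view my.bam | split_by_tile tile_" would then write tile_1101.sam, tile_1102.sam and so on. These files carry no SAM header, and some awk implementations limit how many output files can be open at once, so treat this as a starting point.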

samtools tophat • 3.3k views

I think I may have a Perl solution to this, but I don't know exactly how to phrase the output. Can anybody help me out? :)

I have made a hash of hashes, where all the lines of a file are sorted into a key of the "master" hash depending on the value of their 5th field.

%Tiles has n keys, where each key is a different $Tile_Number.

Each $Tile_Number key points to an inner hash that contains all the lines whose tile number matches that key. The value of each of these inner keys (the lines) is just 1.

$Tiles{$Tile_Number}{$Line} = 1, where each $Tiles{$Tile_Number} holds many $Line => 1 entries.

I want to print each $Tiles{$Tile_Number} hash to a separate file, preferably creating the file when the $Tile_Number key is created and printing each line as its $Tiles{$Tile_Number}{$Line} = 1 entry is added, to save memory. Ideally the final value (1) would not be printed, but I can deal with that, I guess.

How can I tell Perl to open a new file for each key in the "master" hash and print all of the keys of its inner hash?

Thank you, Carmen
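
On the memory concern with the hash of hashes: one way around it is to not store the lines at all, and instead write each one out as it is read, creating the per-tile file the first time a tile shows up. The sketch below shows that pattern in awk instead of Perl (a deliberate swap, to stay with the awk approach from the original question). It assumes SAM text including its header on stdin, e.g. from samtools view -h. The name split_with_header and the output prefix are hypothetical.

```shell
# split_with_header PREFIX  (hypothetical helper)
# Reads SAM text (header + alignments) on stdin; writes PREFIX<tile>.sam files,
# each starting with a copy of the original header.
split_with_header() {
    awk -F'\t' -v prefix="$1" '
        /^@/ { header = header $0 "\n"; next }  # collect header lines
        {
            split($1, id, ":")                  # id[5] is the tile number
            out = prefix id[5] ".sam"
            if (!(id[5] in seen)) {             # first line for this tile
                seen[id[5]] = 1
                printf "%s", header > out       # creates the file, header first
            }
            print > out                         # write the line immediately
        }'
}
```

Only the header and the set of tile numbers seen so far stay in memory. Because each output starts with the original header, converting the pieces back with samtools view -b should be possible.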

11.0 years ago

I just wrote a Java program to split a BAM by tile:

https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/splitbytitle/SplitByTile.java

It uses the Picard library to parse the BAM.

Compilation:

cd src/main/java
javac -cp path/to/picard.jar:path/to/sam.jar com/github/lindenb/jvarkit/tools/splitbytitle/SplitByTile.java

Execution:

java -cp path/to/picard.jar:path/to/sam.jar \
com.github.lindenb.jvarkit.tools.splitbytitle.SplitByTile \
I=my.bam O=tmp/TILE__TILE__/jeter.__TILE__.bam CREATE_INDEX=true

WOW, cool! Let me check it out, Pierre!
