How can I filter a bam file to extract only reads that aligned on a specific chromosome and are part of a fusion. I used tophat 2.0.3 with bowtie1 to align the reads. FYI : Illumina paired-end 76bp reads. In tophat manual :
accepted_hits.bam. A list of read alignments in SAM format. SAM is a compact short read alignment format that is increasingly being adopted. The formal specification is here.
We assume that a fusion alignment involves two chromosomes or different places on the same chromosome (distant or inversion). So, a fusion alignment is reported in two partial alignments (one before the fusion point and the other after) as follows:
(1) SRR064286.11667197 chr12 2910488255 16M chr20 625050470 ATATGTTTGAGAGGCT BCBCCBCCCBBBBBBB XF:Z:1 chr12-chr20 2910488 16M62500758F34M (2) SRR064286.11667197 chr20 62500758255 34M = 625050470 GGCTGAGGAGGAGGAGCTCAGGGCTGAGCTTACC BB?B@;@B?AA?A?;?:@>>@@B>>;;B>?=9## XF:Z:2 chr12-chr20 2910488 16M62500758F34M
Some of fields are skipped for the sake of illustration, as shown above, 16bp of 50bp read is mapped on chr12 and 34bp is mapped on chr20. The whole fusion alignment is available in a custom field XF in both partial alignments above. XF:Z:1 and XF:Z:2 indicates the first and the second partial alignments, respectively.
A new CIGAR operator 'F' (Fusion) is introduced to indicate the presence of a fusion in read alignments. The following example shows a read alignment across a fusion between chromosome 20 and chromosome 12.
XF:Z:1 chr12-chr20 2910488 16M62500758F34M
The left most base of a read (SRR064286.11667197) is mapped to 2910488th base of chromosome 12 from which the first 16 bases of a read are mapped from 2910488 to 2910503 on chromosome 12, the 17th base of the read is mapped to 62500758th base of chromosome 20, and the remaining 34 bases are mapped from 62500758 to 62500791 on chromosome 20. The precise fusion point is between 2910503 on chromsome 12 and 62500758 on chromosome 20. Note that small letters such as m, n, i, d have opposite interpretations of big letters, meaning a coordinate is decreasing instead of increasing.
In case the other end spans a fusion point, a custom field XP used as follows:
SRR064286.73586371 chr11 44890 50M XP:Z:chr1-chr10 14482
The read (SRR064286.73586371) is mapped on chr11 at 44890th position while its mate partner spans a fusion point between chr1 and chr10.
So is it possible to filter a bam file to filter cigar operator F and chromosome position ?