Question: Tophat-Fusion : Filter Bam On Cigar
gravatar for Nicolas Rosewick
6.8 years ago by
Belgium, Brussels
Nicolas Rosewick7.7k wrote:


How can I filter a bam file to extract only reads that aligned on a specific chromosome and are part of a fusion. I used tophat 2.0.3 with bowtie1 to align the reads. FYI : Illumina paired-end 76bp reads. In tophat manual :

accepted_hits.bam. A list of read alignments in SAM format. SAM is a compact short read alignment format that is increasingly being adopted. The formal specification is here.

We assume that a fusion alignment involves two chromosomes or different places on the same chromosome (distant or inversion). So, a fusion alignment is reported in two partial alignments (one before the fusion point and the other after) as follows:

(1) SRR064286.11667197 chr12 2910488255 16M chr20 625050470 ATATGTTTGAGAGGCT BCBCCBCCCBBBBBBB XF:Z:1 chr12-chr20 2910488 16M62500758F34M (2) SRR064286.11667197 chr20 62500758255 34M = 625050470 GGCTGAGGAGGAGGAGCTCAGGGCTGAGCTTACC BB?B@;@B?AA?A?;?:@>>@@B>>;;B>?=9## XF:Z:2 chr12-chr20 2910488 16M62500758F34M

Some of fields are skipped for the sake of illustration, as shown above, 16bp of 50bp read is mapped on chr12 and 34bp is mapped on chr20. The whole fusion alignment is available in a custom field XF in both partial alignments above. XF:Z:1 and XF:Z:2 indicates the first and the second partial alignments, respectively.

A new CIGAR operator 'F' (Fusion) is introduced to indicate the presence of a fusion in read alignments. The following example shows a read alignment across a fusion between chromosome 20 and chromosome 12.

XF:Z:1 chr12-chr20 2910488 16M62500758F34M

The left most base of a read (SRR064286.11667197) is mapped to 2910488th base of chromosome 12 from which the first 16 bases of a read are mapped from 2910488 to 2910503 on chromosome 12, the 17th base of the read is mapped to 62500758th base of chromosome 20, and the remaining 34 bases are mapped from 62500758 to 62500791 on chromosome 20. The precise fusion point is between 2910503 on chromsome 12 and 62500758 on chromosome 20. Note that small letters such as m, n, i, d have opposite interpretations of big letters, meaning a coordinate is decreasing instead of increasing.

In case the other end spans a fusion point, a custom field XP used as follows:

SRR064286.73586371 chr11 44890 50M XP:Z:chr1-chr10 14482

The read (SRR064286.73586371) is mapped on chr11 at 44890th position while its mate partner spans a fusion point between chr1 and chr10.

So is it possible to filter a bam file to filter cigar operator F and chromosome position ?



filter bam • 2.7k views
ADD COMMENTlink written 6.8 years ago by Nicolas Rosewick7.7k
gravatar for Sean Davis
6.8 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

I'd suggest using perl, python, or java and the appropriate SAM/BAM libraries to write a small piece of code to do this. For python, this is pysam. Perl is Bio::DB::Sam.

Another option is to use samtools view (to get a text version) and awk to grab the reads that have an F in the CIGAR field.

ADD COMMENTlink modified 6.8 years ago • written 6.8 years ago by Sean Davis25k

thanks I'll try that !

ADD REPLYlink written 6.8 years ago by Nicolas Rosewick7.7k

Recent answer to a similar question on filtering CIGAR strings

ADD REPLYlink written 6.8 years ago by Istvan Albert ♦♦ 80k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1497 users visited in the last hour