Hi!!
I want to read and process BAM files using my own python scripts. Pysam is a python library designed to do this, but I don't understand how to use it.
For example, I would like get a human-readable print of a BAM file using pysam. The same output I get when I use the following command:
samtools view file.bam
I tried to do this using this code:
import sys
import pysam
def main(BAM):
    samfile = pysam.Samfile( BAM, "rb" )
    for row in samfile:
        print row
if __name__ == '__main__':
    main(sys.argv[1])
But I get an output that is similar than a SAM file, but doesn't have the CIGAR line and have strange fields that I don't understand:
ERR030860.11710_HWI-BRUNOP16X_0001:5:1:10517:2607#0    0    23    43534648    0    [(0, 89), (3, 285), (0, 11)]    -1    -1    100    ATGAAGCAATTGCTGAATTGGATACGCTGAATGAAGAGTCTTATAAAGACAGCACTCTGATCCTGCAGTTACTTAGGGACAATCTCCCTCTGTGGACATC    5565555555@GGAGGGGGC5=><554455DDD.D>?A@?A<?>>444444452*4344555*55DFDDDFDDD<55445C??=8,+14444444<A?@@    [('NM', 2), ('IH', 1), ('HI', 1), ('XS', '+')]
ERR030860.11718_HWI-BRUNOP16X_0001:5:1:11798:2613#0    0    8    30267395    102    [(0, 91)]    -1    -1    91    CAAAGGAAGCTCTTGGTATACGGAAAAAGTATTCAACAATAAATTAGGCATGGTTGCTTCCATTTTCTGCCTCACATACTTTTTTTTTTCG    GHFGHHHGHHHHHHEHHHHDDFBDD44445HHHHHHHHHHGHHHHDC=GGHHHHHHHHHHHEH=HHHHHHHHG>HHHHEHHHHHHCCCBC@    [('NM', 3), ('IH', 1), ('HI', 1)]
ERR030860.11720_HWI-BRUNOP16X_0001:5:1:11974:2612#0    0    27    133472526    0    [(0, 21), (3, 791), (0, 38)]    -1    -1    59    AGTTCTATGGGTCAAAAGAGGATCCACAGACTTTCTATTATGCTGTTGCTGTGGTGACG    HHHHHHHGHH<AD@D444(4445553444/45544EEGGGHHGHHHHHHDGGGGGD:,D    [('NM', 1), ('IH', 1), ('HI', 1), ('XS', '+')]
ERR030860.11722_HWI-BRUNOP16X_0001:5:1:12095:2607#0    0    9    10608405    0    [(0, 95)]    -1    -1    95    GCCATTGAAGACTGGAATAATGAAAGCAGCATGCCCTGTTGTGTCCTTCAGCTTGGCGATATCATCGATGGATATAATGCACAGTATAATGCATC    HHHHHHHHHHHHHHHBD?DD45556@@A?AEEEHHA?DDA4446/BDDADH=HHH>+=<:/4655F?BAFC/5A@4345*1424531*+2@D<5?    [('NM', 1), ('IH', 1), ('HI', 1)]
ERR030860.11724_HWI-BRUNOP16X_0001:5:1:12418:2610#0    0    2    4114112    0    [(0, 64)]    -1    -1    64    TTTGGGGAAGGGGTGGCTAAGGAAGATGGTATAGGTGAAGGCGGCTGTGTGACCACTTTACCCC    GGHHHHGHHHHHHHHBCGHH55555425515444514455C4CCCCE?E244553A=@>09?=A    [('NM', 1), ('IH', 1), ('HI', 1)]
ERR030860.11725_HWI-BRUNOP16X_0001:5:1:13271:2613#0    0    27    9482322    0    [(0, 60), (3, 894), (0, 27)]    -1    -1    87    GACACTATTAATAAGACTGAATTGGCCTGTAATAACACAGTTATTGGTTCCCAAATGCAGTTACAGCTGGGACGAGTCACTCGTGTT    HBHHHHHHHFHHHHHHHGHHEHHHHEHHHEF@BFFFHHEHHGHHHHH:HHHHHHEHHHHHHHHHHHEHHHB@*D743150414536-    [('NM', 1), ('IH', 1), ('HI', 1), ('XS', '+')]
Can anybody tell me how to get a SAM output using pysam??
I'm also interested in other ways to read BAM files using python scripts.
But when I print row.cigar I get the tuples, not the CIGAR. Well... I at least need to understand the meaning of this tuples... for example I have tuples like this: [(0, 25), (3, 1642), (0, 44)] .... It's means that I have 25 matches, one intron of 1642 and 44 matches ??
Yes, you don't get the original string, but a processed version, which is what people usually want. If you need a Python script to simply access the original string representation of the aligned reads in the samefile, then it's probably easier to use Python's subprocess module and read the output from samtools view.
I'm used to directly analyse the SAM lines, but that I understand this processed version, it's even easier manipulate BAM/SAM files. THANKS!!