Question: Pysam (python) problem parsing BAM reference id
1
gravatar for liorglic
3.0 years ago by
liorglic50
liorglic50 wrote:

Hello,

I am trying to use the pysam package in python to do some processing of a large BAM file, which is a BWA output. The header of my file looks like that:

$ samtools view -H myFile.bam
@HD     VN:1.3  SO:coordinate
@SQ     SN:1    LN:...
@SQ     SN:2    LN:...
@SQ     SN:3    LN:...
@SQ     SN:4    LN:...
@SQ     SN:5    LN:...
@SQ     SN:6    LN:...
@SQ     SN:7    LN:...
@SQ     SN:8    LN:...
@SQ     SN:9    LN:...
@SQ     SN:10   LN:...
@SQ     SN:10000001     LN:...
@PG     ...

and I have the following alignment record within this file (just an example):

REC-XYZ-1      147     10000001        1       36      9S91M   =       3601    3511    NAAAAACTCTAGTTGCTCGTGGGTGTCAACTTTTGATTATATGCTGTTATGTTTGTGCAAACTCAACTGAAATGTGAACTCAAGTGGTGATGCTATGAAC    DBCC@:CDCCDDDDDDFFHGHIIIIEGHEHIIGIJIJIEHCGGFIDHHCHJJJJIIIIHHDIFGIFJIIIGIFJJJGHGEIIGIGGIHDFHHFFFDF@@@    NM:i:0  MD:Z:91 AS:i:91 XS:i:99

(notice the '10000001' in the 3rd field).

Using pysam, I do the following:

bam_file = pysam.AlignmentFile("myFile.bam", "rb")
it = bam_file.fetch("10000001")
for x in range(20):
    aln = next(it)
    print(aln.reference_id)

10
10
10
.
.
.

surprisingly, although the real reference id is '10000001', pysam sees it as '10'. For example, the same record I showed above looks like this:

REC-XYZ-1   147 10  0   36  9S91M   10  3600    91  NAAAAACTCTAGTTGCTCGTGGGTGTCAACTTTTGATTATATGCTGTTATGTTTGTGCAAACTCAACTGAAATGTGAACTCAAGTGGTGATGCTATGAAC    array('B', [35, 33, 34, 34, 31, 25, 34, 35, 34, 34, 35, 35, 35, 35, 35, 35, 37, 37, 39, 38, 39, 40, 40, 40, 40, 36, 38, 39, 36, 39, 40, 40, 38, 40, 41, 40, 41, 40, 36, 39, 34, 38, 38, 37, 40, 35, 39, 39, 34, 39, 41, 41, 41, 41, 40, 40, 40, 40, 39, 39, 35, 40, 37, 38, 40, 37, 41, 40, 40, 40, 38, 40, 37, 41, 41, 41, 38, 39, 38, 36, 40, 40, 38, 40, 38, 38, 40, 39, 35, 37, 39, 39, 37, 37, 37, 35, 37, 31, 31, 31])    [('NM', 0), ('MD', '91'), ('AS', 91), ('XS', 99)]

(notice the '10' in the 3rd field)

The same thing happens for the RNEXT field (which I get through aln.next_reference_id), and this is really bad because I cannot differentiate between chromosome 10 and chromosome 10000001. Does anyone know why this is happening? Is this a bug in pysam? I would be quite amazed to find out that it can only take two-digit values as reference id (nothing about that in the documentation). Any ideas?

* I am running on linux, with python 2.7 and the latest version of pysam, using IPython notebook.

pysam sam bam python • 2.4k views
ADD COMMENTlink modified 3.0 years ago by Matt Shirley8.9k • written 3.0 years ago by liorglic50

Devon already explained how to get the actual chromosome name from the refID, however I just thought i'd add where that number comes from.

The refID (number in each row of the BAM file that corresponds to a chromosome in the BAM header) is based on the order of the chromosomes in the header. So the first chromosome has a refID of 0, and the 11th has a refID of 10 -- which is what SN:10000001 is here for you :)

To make matters a little more complicated, the BAM file actually has two headers - an ASCII header that gets printed when you do samtools view -H, and a second redundant binary header that is used to do chrID -> rname conversion. Of course, these two headers should always agree - but they don't have to. I played an April's fools joke on reddit a few years back, where I uploaded a BAM file where the headers didn't match up and asked them to figure out why. Ho ho ho.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by John12k
2
gravatar for Devon Ryan
3.0 years ago by
Devon Ryan88k
Freiburg, Germany
Devon Ryan88k wrote:

If you look at the read the docs page for pysam, you'll see the following warning regarding reference_id:

This field contains the index of the reference sequence in the sequence dictionary. To obtain the name of the reference sequence, use pysam.AlignmentFile.getrname()

In other words, pysam isn't confusing anything, it just happens that many of the IDs happen to correspond to chromosome names in your case (though they'll be the wrong chromosome names).

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Devon Ryan88k

Thanks so much! Totally missed it.

print(bam_file.get_reference_name(aln.next_reference_id))

did the trick.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by liorglic50
0
gravatar for Matt Shirley
3.0 years ago by
Matt Shirley8.9k
Cambridge, MA
Matt Shirley8.9k wrote:

You're encountered one of the quirks of pysam, and hopefully now that you're pointed in the right direction all is good. I just waned to point out that this type of complexity is the exact reason I created my own parser simplesam:

import simplesam 
with open('myFile.bam', 'rb') as bamfile: 
  with simplesam.Reader(bamfile, regions=("10000001",)) as bam: 
    for read in bam:
      print(read.rname)

Each attribute matches the SAM specification, and there are some other class methods that are useful - I should document the whole thing better.

ADD COMMENTlink written 3.0 years ago by Matt Shirley8.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2298 users visited in the last hour