Pysam is giving - ValueError: reference_id -1 out of range 0<=tid<65334. Any suggestions?
3
0
Entering edit mode
7.9 years ago
kirannbishwa01 ★ 1.6k

The Following code:

locus, hap = fh.getrname(aln.tid).split(delim)

throws the following exception:

Traceback (most recent call last):

File "/home/everestial007/anaconda2/envs/emase/bin/bam-to-emase", line 4, in <module> __import__('pkg_resources').run_script('emase==0.10.16', 'bam-to-emase')

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/setuptools-21.2.1-py2.7.egg/pkg_resources/__init__.py", line 719, in run_script

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/setuptools-21.2.1-py2.7.egg/pkg_resources/__init__.py", line 1504, in run_script

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase-0.10.16-py2.7.egg-info/scripts/bam-to-emase", line 109, in <module> sys.exit(main())

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase-0.10.16-py2.7.egg-info/scripts/bam-to-emase", line 95, in main alignmat_factory.prepare(haplotypes, loci, outdir=os.path.dirname(outfile))

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase/AlignmentMatrixFactory.py", line 48, in prepare print(aln.reference_id,fh.getrname(aln.reference_id))

File "pysam/calignmentfile.pyx", line 1638, in pysam.calignmentfile.AlignmentFile.getrname (pysam/calignmentfile.c:18168)

File "pysam/calignmentfile.pyx", line 663, in pysam.calignmentfile.AlignmentFile.get_reference_name (pysam/calignmentfile.c:8913)

ValueError: reference_id -1 out of range 0<=tid<65334

As this program is reading the BAM file using pysam as the follows:

fh = pysam.Samfile(self.alnfile, 'rb')

And the program fetch each line of file using following code:

for aln in fh.fetch(until_eof=True):

I have tried to run the program by changing the tid with reference_id as tid has been deprecated. But I am still getting the same error

ValueError: reference_id -1 out of range 0<=reference_id<65334

There are lots of header (reference_id) in the SAM/BAM file with values more than 65334. Is there a way to increase the range of reference_id and/or tid. Or other way to change the code to make the bam file acceptable.

Thank you in advance !

pysam sam bam reference_id • 5.9k views
ADD COMMENT
0
Entering edit mode

How many contigs/chromosomes are in the file? What version of pysam is this?

ADD REPLY
0
Entering edit mode
7.9 years ago
John 13k

If a read is unaligned, pysam returns a tid/reference_id of -1, which wont work in getrname()

Perhaps wrap it in a try/except block?

The max number there (65334) is the number of chromosomes in your BAM header, which seems like an awful lot. It's also very close to the uint16 max, which is suspicious, since the theoretical max of the BAM format is 2147483648.

ADD COMMENT
0
Entering edit mode
7.9 years ago
kirannbishwa01 ★ 1.6k

@Devon Ryan @ John,

I am using a tool for ASE (allele specific expression), which actually extracts a transcript sequences using information from genome and gtf file. The RNAseq data is aligned against this transcript sequences; so since there are some many transcripts which is treated as chromosomes, pysam is probably hitting the limit. So, is there a way to get rid of this problem by 1) either to increase the index range or 2) make the program accept the number of transcripts (which is treated as scaffold/chromosome)?

Thanks,

ADD COMMENT
2
Entering edit mode

Can you post the version of pysam you have installed (pip freeze | grep pysam)? This appears to be a pysam bug that's being introduced by cython, or at least that's my guess. If you have the most recent version of pysam installed then we should report this bug so it gets fixed (you can then update and things will then work).

ADD REPLY
0
Entering edit mode

@Devon Ryan,

The installed pysam version is pysam==0.9.0 If this is a bug in pysam I hope it gets fixed soon.

@ John I replaced the code with what you have provided. Now, I have other kind of error message:

$ bam-to-emase -a 2ms01e_sub.transcriptome.bam -i emase.pooled.transcripts.info -s M,S -o 2ms01e_sub.transcriptome.h5

Traceback (most recent call last): File "/home/everestial007/anaconda2/envs/emase/bin/bam-to-emase", line 4, in <module> __import__('pkg_resources').run_script('emase==0.10.16', 'bam-to-emase')

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/setuptools-21.2.1-py2.7.egg/pkg_resources/__init__.py", line 719, in run_script

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/setuptools-21.2.1-py2.7.egg/pkg_resources/__init__.py", line 1504, in run_script

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase-0.10.16-py2.7.egg-info/scripts/bam-to-emase", line 109, in <module> sys.exit(main())

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase-0.10.16-py2.7.egg-info/scripts/bam-to-emase", line 95, in main alignmat_factory.prepare(haplotypes, loci, outdir=os.path.dirname(outfile))

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase/AlignmentMatrixFactory.py", line 49, in prepare fhout[hap].write(struct.pack('>I', rid[aln.qname]))

TypeError: unhashable type: 'list'

@ Devon @ John FYI: I am using EMASE for doing ASE analyses. The source of the error for now is coming from https://github.com/churchill-lab/emase/blob/master/emase/AlignmentMatrixFactory.py which is designed to use pysam for calculating the alignment values from the aligned bam file.

Thanks !

Thanks,

ADD REPLY
0
Entering edit mode

Now i'm confused - why isn't this a simple user-error issue?

ValueError: reference_id -1 out of range 0<=reference_id<65334

Is due to the code locus, hap = fh.getrname(aln.tid).split(delim)

Because you can't getrname() of an unmapped read. Instead wrap it in a try/except, or use something like:

locus, hap = None,None if aln.tid == -1 else fh.getrname(aln.tid).split(delim)

ADD REPLY
3
Entering edit mode

An updated update to my last reply, the following code works as expected with pysam-0.9.0 on my system:

#!/usr/bin/env python
import subprocess
import os
import pysam

f = open("foo.sam", "w")
f.write("@HD\tVN:1.5\tSO:coordinate\n")
for i in xrange(80000):
    f.write("@SQ\tSN:chr{}\tLN:1000\n".format(i))
f.close()

subprocess.check_call(["samtools", "view", "-Sbo", "foo.bam", "foo.sam"])
os.unlink("foo.sam")

bam = pysam.AlignmentFile("foo.bam", "r")
print(bam.getrname(79999))
os.unlink("foo.bam")

Since the EMAse code is checking the unmapped bit, the only cause I can think of for this is that somehow alignments on * aren't being marked as unmapped. Perhaps samtools idxstats would be informative, since it'll print the number of alignments per-contig (and those to *).

ADD REPLY
0
Entering edit mode

Excellent test! :D Thats very relieving :) Also awesome to see a BAM file written in just a handful of lines like that :D

ADD REPLY
0
Entering edit mode

HI Devon,

I will try to run the analyses by removing the unmapped files. I hope this solves the issue.

Also, excuse me but I don't have a broad knowledge of bioinformatics language. So, what does alignment on * mean?

On the pessimistic side if the run fails even after removing the unmapped fails, what would be the best approach to solve the issue 1) Wait for pysam update 2) if not obviously I will have to see where the error is coming from from the whole set of code.

Thanks for the detailed analyses. I will update you about the results.

ADD REPLY
0
Entering edit mode

In that case, please ask pysam to print reads where the aln.tid is -1 in the BAM file with no unmapped reads.

That should solve the mystery :)

Edit: and "*" means the data in unavailable

ADD REPLY
0
Entering edit mode

Do you happen to know what was used to perform the alignments? * is a made up chromosome to which unmapped reads are assigned, since all entries in a BAM file have to have a chromosome. It has a tid of -1.

ADD REPLY
0
Entering edit mode

But your pysam doesn't return * for a tid of -1... right? or? Mine (0.9.0) gives the same ValueError as the OP.

ADD REPLY
1
Entering edit mode

No, you'll currently get an error if you ask for the rname of -1. There's a bug report on github for this (from a couple months ago), I think the preferred solution is to raise a more appropriate exception. Having said that, since the code appears to filter out unmapped reads, this should really never happen unless an aligner is doing something silly.

ADD REPLY
1
Entering edit mode

Ah, right of course, it happens even with unmapped read filtering -- now I'm on the same page as everyone :D

Well I have no idea. Would be good if the OP could post the read that causes the ValueError, both from Pysam's point of view and samtools.

ADD REPLY
1
Entering edit mode

HI @Devon @John,

I figured last night that if I use the bam prepared from single end reads or only one mate from the paired end read the job run through. So, the error I posted ValueError: reference_id -1 out of range 0<=tid<65334 only comes up if I supply the bam prepared from paired end reads.

@ Devon, The RNAseq reads were aligned against the diploid transcriptome. I am not sure how much details is need but here is the structure of how it was done: I take population level (custom) genome and gtf file to extract and prepare transcripts sequences for each population, which is then pooled to create diploid transcriptome. So, lots of transcripts are in the fasta file with corresponding haplotype tag. But, it's like a genome with lots of chromosome. The paired end RNAseq data is then aligned against this genome (actually transcriptome) using bowtie1.

@ John @ Devon, So, the error is only showing up when bam is from paired end reads.

To identify the error, I changed the AlignmentMatrixFactory.py file as below, with the following code that throws an exception:

locus, hap = fh.getrname(aln.reference_id).split(delim)

  • I added the following line before the above code: print(aln.reference_id,fh.getrname(aln.reference_id)) to identify the reference_id before splitting it using delimiter.

But, it still is throwing the same error message but prints all the tid's that passed the tid range before the error begins.

The error is as follows:

contd.......

(65329, 'scaffold97500002.1_S')

(65329, 'scaffold97500002.1_S')

(65329, 'scaffold97500002.1_S')

Traceback (most recent call last):

File "/home/everestial007/anaconda2/envs/emase/bin/bam-to-emase", line 4, in <module> __import__('pkg_resources').run_script('emase==0.10.16', 'bam-to-emase')

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/setuptools-21.2.1-py2.7.egg/pkg_resources/__init__.py", line 719, in run_script

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/setuptools-21.2.1-py2.7.egg/pkg_resources/__init__.py", line 1504, in run_script

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase-0.10.16-py2.7.egg-info/scripts/bam-to-emase", line 109, in <module> sys.exit(main())

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase-0.10.16-py2.7.egg-info/scripts/bam-to-emase", line 95, in main alignmat_factory.prepare(haplotypes, loci, outdir=os.path.dirname(outfile))

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase/AlignmentMatrixFactory.py", line 48, in prepare print(aln.reference_id,fh.getrname(aln.reference_id))

File "pysam/calignmentfile.pyx", line 1638, in pysam.calignmentfile.AlignmentFile.getrname (pysam/calignmentfile.c:18168)

File "pysam/calignmentfile.pyx", line 663, in pysam.calignmentfile.AlignmentFile.get_reference_name (pysam/calignmentfile.c:8913) ValueError: reference_id -1 out of range 0<=tid<65334

So, it is able to accept all the tids with values less than 65334 (only when using paired end reads).

I am not able to figure out why should there be an error only when the bam is from paired end alignment?: ValueError: reference_id -1 out of range 0<=tid<65334"

Would there be a way to raise the exception, if not make the bam acceptable.?

Thanks much !

ADD REPLY
2
Entering edit mode

Sorry kirannbishwa, I should have been more clear. Please could you make your code look like this:

try:
    locus, hap = fh.getrname(aln.reference_id).split(delim)
except ValueError:
    print aln
    exit()

Your current code doesn't print the line that causes the error, because it generates the error when trying to print it :) (it also prints all the stuff that does not cause the error!)

So the above code will probably print something like this:

SOLEXA1_0001:4:49:11382:21230#0 0   0   3000742 0   36M -1  -1  36  TTTTTTTTGTTTGTTTGTTTTTTTTTTCTGTTTCTT    array('B', [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])    [('NM', 2), ('NH', 13), ('CC', 'chr15'), ('CP', 75293507)]

Then please run the following command on your BAM file:

samtools view /Users/John/Desktop/ENCFF001LCU.bam | grep SOLEXA1_0001:4:49:11382:21230#0

Where the thing to grep is the alignment ID that pysam printed earlier.

This should be enough information to figure out what went wrong :)

ADD REPLY
1
Entering edit mode

Hi @Devon @John,

thanks much for you help. For now the issue has been solved. There were two problems.

1) One was with haplotype flagging. The command

locus, hap = fh.getrname(aln.tid).split(delim)

would take anything after the first delimiter (i.e _) as a haplotype flag. Any more delimiter with in the scaffold and/or transcript id was problem. I just changed the command to

locus, hap = fh.getrname(aln.tid).rsplit(delim,1)

to allow the first delimiter from the last of the scaffold and/or gene id, transcript id name as a haplotype flag.

2) the other issue regarding the use of PE reads was that I should have aligned each mate separately and then convert into the EMASE format and only merge before doing statistics. I am not sure why its done that way, also it was not indicated very clear in the documentation. I got this information when I reached the author of the software and the program is running smoothly.

Thanks much for the help you have been ! - Bishwa K

ADD REPLY
1
Entering edit mode

Thanks for the update, glad things are working now!

ADD REPLY
1
Entering edit mode

I still don't understand how you got a tid of -1 after removing unmapped reads, but i'm glad you figured it out and are now moving along :)

ADD REPLY
2
Entering edit mode

This could either be a user/programmer error or a bug in pysam, it's ambiguous at the moment. I have a sneaking suspicion that cython is converting the tid to a uint8_t, which is then being overflowed when a larger number is assigned to from htslib. Yes, an unmapped read could presumably also cause this to happen if it's not being checked for, but I recall seeing something similar to this happen previously, so I won't be surprised if it pops up again. In an ideal world one would just exclude unmapped reads before getting the rname...

Edit: In fact, it is checking that the read is mapped. Granted, perhaps it's somehow mapped to '*', but that's rather unlikely.

ADD REPLY
0
Entering edit mode
6.3 years ago

Depending on your use case you might try simplesam. It's a much simpler API than pysam. Your code would become:

fh = simplesam.Reader(open(self.alnfile, 'rb'))
for aln in fh:
    print(aln.rname)

You could also filter out unmapped reads by checking aln.mapped.

ADD COMMENT

Login before adding your answer.

Traffic: 1909 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6