Question: Pysam is giving - ValueError: reference_id -1 out of range 0<=tid<65334. Any suggestions?
0
gravatar for kirannbishwa01
23 months ago by
United States
kirannbishwa01780 wrote:

The Following code:

locus, hap = fh.getrname(aln.tid).split(delim)

throws the following exception:

Traceback (most recent call last):

File "/home/everestial007/anaconda2/envs/emase/bin/bam-to-emase", line 4, in <module> __import__('pkg_resources').run_script('emase==0.10.16', 'bam-to-emase')

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/setuptools-21.2.1-py2.7.egg/pkg_resources/__init__.py", line 719, in run_script

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/setuptools-21.2.1-py2.7.egg/pkg_resources/__init__.py", line 1504, in run_script

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase-0.10.16-py2.7.egg-info/scripts/bam-to-emase", line 109, in <module> sys.exit(main())

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase-0.10.16-py2.7.egg-info/scripts/bam-to-emase", line 95, in main alignmat_factory.prepare(haplotypes, loci, outdir=os.path.dirname(outfile))

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase/AlignmentMatrixFactory.py", line 48, in prepare print(aln.reference_id,fh.getrname(aln.reference_id))

File "pysam/calignmentfile.pyx", line 1638, in pysam.calignmentfile.AlignmentFile.getrname (pysam/calignmentfile.c:18168)

File "pysam/calignmentfile.pyx", line 663, in pysam.calignmentfile.AlignmentFile.get_reference_name (pysam/calignmentfile.c:8913)

ValueError: reference_id -1 out of range 0<=tid<65334

As this program is reading the BAM file using pysam as the follows:

fh = pysam.Samfile(self.alnfile, 'rb')

And the program fetch each line of file using following code:

for aln in fh.fetch(until_eof=True):

I have tried to run the program by changing the tid with reference_id as tid has been deprecated. But I am still getting the same error

ValueError: reference_id -1 out of range 0<=reference_id<65334

There are lots of header (reference_id) in the SAM/BAM file with values more than 65334. Is there a way to increase the range of reference_id and/or tid. Or other way to change the code to make the bam file acceptable.

Thank you in advance !

reference_id pysam bam sam • 1.6k views
ADD COMMENTlink modified 3 months ago by Matt Shirley8.3k • written 23 months ago by kirannbishwa01780

How many contigs/chromosomes are in the file? What version of pysam is this?

ADD REPLYlink written 23 months ago by Devon Ryan78k
0
gravatar for John
23 months ago by
John12k
Germany
John12k wrote:

If a read is unaligned, pysam returns a tid/reference_id of -1, which wont work in getrname()

Perhaps wrap it in a try/except block?

The max number there (65334) is the number of chromosomes in your BAM header, which seems like an awful lot. It's also very close to the uint16 max, which is suspicious, since the theoretical max of the BAM format is 2147483648.

ADD COMMENTlink modified 23 months ago • written 23 months ago by John12k
0
gravatar for kirannbishwa01
23 months ago by
United States
kirannbishwa01780 wrote:

@Devon Ryan @ John,

I am using a tool for ASE (allele specific expression), which actually extracts a transcript sequences using information from genome and gtf file. The RNAseq data is aligned against this transcript sequences; so since there are some many transcripts which is treated as chromosomes, pysam is probably hitting the limit. So, is there a way to get rid of this problem by 1) either to increase the index range or 2) make the program accept the number of transcripts (which is treated as scaffold/chromosome)?

Thanks,

ADD COMMENTlink written 23 months ago by kirannbishwa01780
2

Can you post the version of pysam you have installed (pip freeze | grep pysam)? This appears to be a pysam bug that's being introduced by cython, or at least that's my guess. If you have the most recent version of pysam installed then we should report this bug so it gets fixed (you can then update and things will then work).

ADD REPLYlink written 23 months ago by Devon Ryan78k

@Devon Ryan,

The installed pysam version is pysam==0.9.0 If this is a bug in pysam I hope it gets fixed soon.

@ John I replaced the code with what you have provided. Now, I have other kind of error message:

$ bam-to-emase -a 2ms01e_sub.transcriptome.bam -i emase.pooled.transcripts.info -s M,S -o 2ms01e_sub.transcriptome.h5

Traceback (most recent call last): File "/home/everestial007/anaconda2/envs/emase/bin/bam-to-emase", line 4, in <module> __import__('pkg_resources').run_script('emase==0.10.16', 'bam-to-emase')

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/setuptools-21.2.1-py2.7.egg/pkg_resources/__init__.py", line 719, in run_script

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/setuptools-21.2.1-py2.7.egg/pkg_resources/__init__.py", line 1504, in run_script

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase-0.10.16-py2.7.egg-info/scripts/bam-to-emase", line 109, in <module> sys.exit(main())

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase-0.10.16-py2.7.egg-info/scripts/bam-to-emase", line 95, in main alignmat_factory.prepare(haplotypes, loci, outdir=os.path.dirname(outfile))

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase/AlignmentMatrixFactory.py", line 49, in prepare fhout[hap].write(struct.pack('>I', rid[aln.qname]))

TypeError: unhashable type: 'list'

@ Devon @ John FYI: I am using EMASE for doing ASE analyses. The source of the error for now is coming from https://github.com/churchill-lab/emase/blob/master/emase/AlignmentMatrixFactory.py which is designed to use pysam for calculating the alignment values from the aligned bam file.

Thanks !

Thanks,

ADD REPLYlink modified 23 months ago • written 23 months ago by kirannbishwa01780

Now i'm confused - why isn't this a simple user-error issue?

ValueError: reference_id -1 out of range 0<=reference_id<65334

Is due to the code locus, hap = fh.getrname(aln.tid).split(delim)

Because you can't getrname() of an unmapped read. Instead wrap it in a try/except, or use something like:

locus, hap = None,None if aln.tid == -1 else fh.getrname(aln.tid).split(delim)

ADD REPLYlink written 23 months ago by John12k
3

An updated update to my last reply, the following code works as expected with pysam-0.9.0 on my system:

#!/usr/bin/env python
import subprocess
import os
import pysam

f = open("foo.sam", "w")
f.write("@HD\tVN:1.5\tSO:coordinate\n")
for i in xrange(80000):
    f.write("@SQ\tSN:chr{}\tLN:1000\n".format(i))
f.close()

subprocess.check_call(["samtools", "view", "-Sbo", "foo.bam", "foo.sam"])
os.unlink("foo.sam")

bam = pysam.AlignmentFile("foo.bam", "r")
print(bam.getrname(79999))
os.unlink("foo.bam")

Since the EMAse code is checking the unmapped bit, the only cause I can think of for this is that somehow alignments on * aren't being marked as unmapped. Perhaps samtools idxstats would be informative, since it'll print the number of alignments per-contig (and those to *).

ADD REPLYlink modified 23 months ago • written 23 months ago by Devon Ryan78k

Excellent test! :D Thats very relieving :) Also awesome to see a BAM file written in just a handful of lines like that :D

ADD REPLYlink written 23 months ago by John12k

HI Devon,

I will try to run the analyses by removing the unmapped files. I hope this solves the issue.

Also, excuse me but I don't have a broad knowledge of bioinformatics language. So, what does alignment on * mean?

On the pessimistic side if the run fails even after removing the unmapped fails, what would be the best approach to solve the issue 1) Wait for pysam update 2) if not obviously I will have to see where the error is coming from from the whole set of code.

Thanks for the detailed analyses. I will update you about the results.

ADD REPLYlink written 23 months ago by kirannbishwa01780

In that case, please ask pysam to print reads where the aln.tid is -1 in the BAM file with no unmapped reads.

That should solve the mystery :)

Edit: and "*" means the data in unavailable

ADD REPLYlink modified 23 months ago • written 23 months ago by John12k

Do you happen to know what was used to perform the alignments? * is a made up chromosome to which unmapped reads are assigned, since all entries in a BAM file have to have a chromosome. It has a tid of -1.

ADD REPLYlink written 23 months ago by Devon Ryan78k

But your pysam doesn't return * for a tid of -1... right? or? Mine (0.9.0) gives the same ValueError as the OP.

ADD REPLYlink modified 23 months ago • written 23 months ago by John12k
1

No, you'll currently get an error if you ask for the rname of -1. There's a bug report on github for this (from a couple months ago), I think the preferred solution is to raise a more appropriate exception. Having said that, since the code appears to filter out unmapped reads, this should really never happen unless an aligner is doing something silly.

ADD REPLYlink written 22 months ago by Devon Ryan78k
1

Ah, right of course, it happens even with unmapped read filtering -- now I'm on the same page as everyone :D

Well I have no idea. Would be good if the OP could post the read that causes the ValueError, both from Pysam's point of view and samtools.

ADD REPLYlink written 22 months ago by John12k
1

HI @Devon @John,

I figured last night that if I use the bam prepared from single end reads or only one mate from the paired end read the job run through. So, the error I posted ValueError: reference_id -1 out of range 0<=tid<65334 only comes up if I supply the bam prepared from paired end reads.

@ Devon, The RNAseq reads were aligned against the diploid transcriptome. I am not sure how much details is need but here is the structure of how it was done: I take population level (custom) genome and gtf file to extract and prepare transcripts sequences for each population, which is then pooled to create diploid transcriptome. So, lots of transcripts are in the fasta file with corresponding haplotype tag. But, it's like a genome with lots of chromosome. The paired end RNAseq data is then aligned against this genome (actually transcriptome) using bowtie1.

@ John @ Devon, So, the error is only showing up when bam is from paired end reads.

To identify the error, I changed the AlignmentMatrixFactory.py file as below, with the following code that throws an exception:

locus, hap = fh.getrname(aln.reference_id).split(delim)

  • I added the following line before the above code: print(aln.reference_id,fh.getrname(aln.reference_id)) to identify the reference_id before splitting it using delimiter.

But, it still is throwing the same error message but prints all the tid's that passed the tid range before the error begins.

The error is as follows:

contd.......

(65329, 'scaffold97500002.1_S')

(65329, 'scaffold97500002.1_S')

(65329, 'scaffold97500002.1_S')

Traceback (most recent call last):

File "/home/everestial007/anaconda2/envs/emase/bin/bam-to-emase", line 4, in <module> __import__('pkg_resources').run_script('emase==0.10.16', 'bam-to-emase')

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/setuptools-21.2.1-py2.7.egg/pkg_resources/__init__.py", line 719, in run_script

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/setuptools-21.2.1-py2.7.egg/pkg_resources/__init__.py", line 1504, in run_script

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase-0.10.16-py2.7.egg-info/scripts/bam-to-emase", line 109, in <module> sys.exit(main())

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase-0.10.16-py2.7.egg-info/scripts/bam-to-emase", line 95, in main alignmat_factory.prepare(haplotypes, loci, outdir=os.path.dirname(outfile))

File "/home/everestial007/anaconda2/envs/emase/lib/python2.7/site-packages/emase/AlignmentMatrixFactory.py", line 48, in prepare print(aln.reference_id,fh.getrname(aln.reference_id))

File "pysam/calignmentfile.pyx", line 1638, in pysam.calignmentfile.AlignmentFile.getrname (pysam/calignmentfile.c:18168)

File "pysam/calignmentfile.pyx", line 663, in pysam.calignmentfile.AlignmentFile.get_reference_name (pysam/calignmentfile.c:8913) ValueError: reference_id -1 out of range 0<=tid<65334

So, it is able to accept all the tids with values less than 65334 (only when using paired end reads).

I am not able to figure out why should there be an error only when the bam is from paired end alignment?: ValueError: reference_id -1 out of range 0<=tid<65334"

Would there be a way to raise the exception, if not make the bam acceptable.?

Thanks much !

ADD REPLYlink modified 22 months ago • written 22 months ago by kirannbishwa01780
2

Sorry kirannbishwa, I should have been more clear. Please could you make your code look like this:

try:
    locus, hap = fh.getrname(aln.reference_id).split(delim)
except ValueError:
    print aln
    exit()

Your current code doesn't print the line that causes the error, because it generates the error when trying to print it :) (it also prints all the stuff that does not cause the error!)

So the above code will probably print something like this:

SOLEXA1_0001:4:49:11382:21230#0 0   0   3000742 0   36M -1  -1  36  TTTTTTTTGTTTGTTTGTTTTTTTTTTCTGTTTCTT    array('B', [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])    [('NM', 2), ('NH', 13), ('CC', 'chr15'), ('CP', 75293507)]

Then please run the following command on your BAM file:

samtools view /Users/John/Desktop/ENCFF001LCU.bam | grep SOLEXA1_0001:4:49:11382:21230#0

Where the thing to grep is the alignment ID that pysam printed earlier.

This should be enough information to figure out what went wrong :)

ADD REPLYlink modified 22 months ago • written 22 months ago by John12k
1

Hi @Devon @John,

thanks much for you help. For now the issue has been solved. There were two problems.

1) One was with haplotype flagging. The command

locus, hap = fh.getrname(aln.tid).split(delim)

would take anything after the first delimiter (i.e _) as a haplotype flag. Any more delimiter with in the scaffold and/or transcript id was problem. I just changed the command to

locus, hap = fh.getrname(aln.tid).rsplit(delim,1)

to allow the first delimiter from the last of the scaffold and/or gene id, transcript id name as a haplotype flag.

2) the other issue regarding the use of PE reads was that I should have aligned each mate separately and then convert into the EMASE format and only merge before doing statistics. I am not sure why its done that way, also it was not indicated very clear in the documentation. I got this information when I reached the author of the software and the program is running smoothly.

Thanks much for the help you have been ! - Bishwa K

ADD REPLYlink modified 22 months ago • written 22 months ago by kirannbishwa01780
1

Thanks for the update, glad things are working now!

ADD REPLYlink written 22 months ago by Devon Ryan78k
1

I still don't understand how you got a tid of -1 after removing unmapped reads, but i'm glad you figured it out and are now moving along :)

ADD REPLYlink written 22 months ago by John12k
2

This could either be a user/programmer error or a bug in pysam, it's ambiguous at the moment. I have a sneaking suspicion that cython is converting the tid to a uint8_t, which is then being overflowed when a larger number is assigned to from htslib. Yes, an unmapped read could presumably also cause this to happen if it's not being checked for, but I recall seeing something similar to this happen previously, so I won't be surprised if it pops up again. In an ideal world one would just exclude unmapped reads before getting the rname...

Edit: In fact, it is checking that the read is mapped. Granted, perhaps it's somehow mapped to '*', but that's rather unlikely.

ADD REPLYlink modified 23 months ago • written 23 months ago by Devon Ryan78k
0
gravatar for Matt Shirley
3 months ago by
Matt Shirley8.3k
Cambridge, MA
Matt Shirley8.3k wrote:

Depending on your use case you might try simplesam. It's a much simpler API than pysam. Your code would become:

fh = simplesam.Reader(open(self.alnfile, 'rb'))
for aln in fh:
    print(aln.rname)

You could also filter out unmapped reads by checking aln.mapped.

ADD COMMENTlink written 3 months ago by Matt Shirley8.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1096 users visited in the last hour