Question: Error Using Bio.Unigene Parser
0
gravatar for lukas1
5.3 years ago by
lukas10
lukas10 wrote:

I am re-analyzing some microarray datasets where GenBank and RefSeq accession IDs are the best identifiers for mapping probes to genes. My primary source for doing the mappings has been the Org.Hs.eg.db package in R, but this does not contain all GenBank accession numbers, and I have been using S.O.U.R.C.E. as a backup. However, this requires some manual copying-and-pasting, and I also worry that S.O.U.R.C.E. may be outdated. Based on the suggestion from this post, I decided to use the Bio.Unigene parser from Biopython to extract a complete list of GenBank and Refseq identifiers from the Hs.data file. I downloaded Hs.data from the UniGene website: ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/Hs.data.gz

This is my script:

from Bio import UniGene

input = open("Hs.data")

records = UniGene.parse(input)

i = 0
mystring = ""

for record in records:
    mystring += "Ug=" + record.ID + ":"
    mystring += "Ez=" + record.gene_id + ":"
    for j in range(len(record.sequence)):
        mystring += record.sequence[j].acc + ";"
    mystring += "|"
    i += 1

When I run this script, I encounter the following error:

>>> 
Traceback (most recent call last):
  File "C:\Users\Lukas\Documents\Python projects\Parse Hs.data from UniGene\Parse - error.py", line 10, in <module>
    for record in records:
  File "C:\Python27\lib\site-packages\Bio\UniGene\__init__.py", line 252, in parse
    record = _read(handle)
  File "C:\Python27\lib\site-packages\Bio\UniGene\__init__.py", line 314, in _read
    sts = STSLine(value)
  File "C:\Python27\lib\site-packages\Bio\UniGene\__init__.py", line 187, in __init__
    self._init_from_text(text)
  File "C:\Python27\lib\site-packages\Bio\UniGene\__init__.py", line 193, in _init_from_text
    key, val = part.split("=")
ValueError: too many values to unpack
>>> i
114161
>>>

I get the exact same error when I run this minimal script:

from Bio import UniGene

input = open("Hs.data")

records = UniGene.parse(input)

for record in records:
    pass

At this point, there are still about 20,000 UniGene records that I haven’t yet processed. I would like to skip the record that causes an error and move on to the next one. How would I do this? The “records” variable refers to a generator object, and the error seems to be caused by the __init__ function for iteration number 114161.

Thanks for your help!

UPDATE: I used Vim for Windows to correct a duplicated equals sign in the original Hs.data file that was causing the problem:

// ID Hs.716839 TITLE Transcribed locus EXPRESS kidney| normal CHROMOSOME 22 STS ACC=SHGC-3369 UNISTS=33527 STS ACC==D22S272 UNISTS=45191 STS ACC=D22S923 UNISTS=92346 SCOUNT 3 SEQUENCE ACC=AI272020.1; NID=g3891187; CLONE=IMAGE:1866118; END=3'; LID=1045; SEQTYPE=EST SEQUENCE ACC=GD144965.1; NID=g215345870; END=3'; LID=23569; SEQTYPE=EST SEQUENCE ACC=GD144918.1; NID=g215328149; END=3'; LID=23569; SEQTYPE=EST //

After correcting that, the code ran without incident. Thank you everyone!

annotation biopython parser • 2.4k views
ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by lukas10
1

Sounds like a bug, have you reported it to the Biopython team? e.g. via https://github.com/biopython/biopython/issues or the mailing list

Also which version of Biopython are you using?

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Peter5.8k

Thanks for your suggestion. I have submitted an issue to GitHub.

I'm using Biopython version 1.63 and Python version 2.7.6.

ADD REPLYlink written 5.3 years ago by lukas10
1
gravatar for AndreiR
5.3 years ago by
AndreiR240
São Paulo
AndreiR240 wrote:

Hello Lukas1,

One idea, not fancy at all, but works is:

Split your file into individual records and parse them individually, so you may be able to isolate some problem easily

Example of individual record:

*// ID Hs.739673

TITLE Transcribed locus

EXPRESS brain| normal

CHROMOSOME 11

SCOUNT 1

SEQUENCE ACC=DB308419.1; NID=g83222069; CLONE=BRHIP2027762; END=3'; LID=18323; SEQTYPE=EST

//*

Here it go:

1- split Hs.data file based on "//" pattern - csplit Hs.data /\/\// {*}

2- add a "//" to the end of files in order to make your parser/script to work - for i in ls xx*;do echo "//" >> $i;done

3- run your script using a for loop - for i in ls xx*;do python script.py $i >> $i"_py";done (just remember to add (file=sys.argv[1] and input = open(file) to your script)

4- then you can merge it back and or isolate the problem.

Andrei R

ADD COMMENTlink written 5.3 years ago by AndreiR240

Thanks for your suggestion! Unfortunately, I don't know Unix. However, your answer prompted me to use LTF (Large Text File) Viewer and manually scroll to the offending entry. I found the following:

// ID Hs.716839 TITLE Transcribed locus EXPRESS kidney| normal CHROMOSOME 22 STS ACC=SHGC-3369 UNISTS=33527 STS ACC==D22S272 UNISTS=45191 STS ACC=D22S923 UNISTS=92346 SCOUNT 3 SEQUENCE ACC=AI272020.1; NID=g3891187; CLONE=IMAGE:1866118; END=3'; LID=1045; SEQTYPE=EST SEQUENCE ACC=GD144965.1; NID=g215345870; END=3'; LID=23569; SEQTYPE=EST SEQUENCE ACC=GD144918.1; NID=g215328149; END=3'; LID=23569; SEQTYPE=EST //

It seems likely that there is a mistake in Hs.data: "==" instead of "=" (in "ACC==D22S272"). Thus, splitting for "=" might result in too many values... I need to find a text editor for Windows that can handle editing large files and then I'll see if that was the problem.

ADD REPLYlink written 5.3 years ago by lukas10

UPDATE: I edited Hs.data using Vim for Windows and the code ran without problems. Thank you everyone! And sorry for not doing this sooner. Maybe I could alert the creators of Hs.data to this problem?

ADD REPLYlink written 5.3 years ago by lukas10

Im glad you have passed by the problem.

ADD REPLYlink written 5.3 years ago by AndreiR240
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 767 users visited in the last hour