How to index SeqRecord objects in the list?
1
0
Entering edit mode
18 months ago
adrian18_07 ▴ 10

I would like to get a list of indexes for SeqRecords that are in list f. I tried this:

for x in f:
    ind = f.index(x)
    print(ind)

But I get the error:

0
Traceback (most recent call last):

File "C:\Users\...", line 43, in <module>
ind = f.index(x)

File "C:\Users\...\anaconda3\lib\site-packages\Bio\SeqRecord.py", line 803, in __eq__
raise NotImplementedError(_NO_SEQRECORD_COMPARISON)

NotImplementedError: SeqRecord comparison is deliberately not implemented. Explicitly compare the 
attributes of interest.

Thanks for any answer.

biopython index • 580 views
ADD COMMENT
0
Entering edit mode

As the error suggests, Biopython doesn't allow you to directly compare two SeqRecords as there's not a meaningful way to compare them. You need to choose an attribute (e.g. its sequence or its ID) to compare between entries.

You could get a bit hacky with it and subclass the SeqRecord class, and then define .__eq__ directly yourself, but otherwise unfortunately I think you're going to need to approach this in a different way.

ADD REPLY
1
Entering edit mode
18 months ago
Joe 19k

So this interested me and I spent some time hacking at it today (emphasis on hack).

NB - As Biopython doesn't implement this behaviour, there is probably a very good reason for it. That said, I think hashing the sequences is legit, and the odds of a hash collision are very low (as I've used the whole data structure representation).

Basically, I elected to subclass the SeqRecord, and extend it with the hash as it's built, and make that hash the basis of the equality check. In essence this means 2 things: firstly, you need to define the hash, and the more unique info it can be built from, the less likely it is you'll get a collision; secondly, you need to (re)implement what it means for two objects to be the same/equal (in python this is the __eq__ method).

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import pprint
import sys
import hashlib
import random

class MD5Record(SeqRecord):
    def __init__(self, seqrecord):
        self._seq = seqrecord._seq
        self.id = seqrecord.id
        self.name = seqrecord.name
        self.description = seqrecord.description
        self.dbxrefs = seqrecord.dbxrefs
        self.md5 = hashlib.md5(pprint.pformat(seqrecord).encode('utf-8')).hexdigest()

    def __repr__(self):
        return "{0}(seq={1!r}, id={2!r}, name={3!r}, description={4!r}, dbxrefs={5!r}, md5={6!r})".format(
                 self.__class__.__name__,
                 self.seq, self.id, self.name,
                 self.description, self.dbxrefs,
                 self.md5)

    def __eq__(self, other):
        return self.md5 == other.md5

    def __ne__(self, other):
        return self.md5 != other.md5


recs = list(SeqIO.parse(sys.argv[1], 'fasta'))

hashed_recs = []
for rec in recs:
    pprint.pprint(rec)
    hashed = MD5Record(rec)
    pprint.pprint(hashed)
    hashed_recs.append(hashed)

try:
    recs[0] == recs[0]  # True in theory
    recs[0] == recs[1]  # False, in theory
except NotImplementedError:
    print("BioPython won't let you compare these")

print(hashed_recs[0] == hashed_recs[0])   # True
print(hashed_recs[0] == hashed_recs[1])   # False
print(hashed_recs[0] != hashed_recs[1])   # True

# Pick an entry to index at random
random = random.choice(hashed_recs)    
index = hashed_recs.index(random)
print(index)

Notice I've also redefined the __repr__ to include the md5 for a given record but this is optional. In order for inequality checks to work, you also have to define __ne__ as above.

The key to being able to index the objects in a list, is the __eq__ method, which now works, as equality for the purposes of indexing is defined as the hashes being the same.


Given this input,

$ cat Genes1.fasta
>RandomSequence_ruL79349EujgaR6sv3fehXYxjIUtHQ5C
TGGGAATTCTGCGTAGTCTCGTGACCGAACGCCGCCTTGAAGTGCGCGCCGCTGTACGATGACCGTCAATATTGATCCGT
>RandomSequence_iEjYsbeuYSY84hj8hSt7C0ndLJfPzwBh
AAGTGCGGTGATACCGACACCCAGCGCGCTTTGTACTCATGTTCATGACAAGGTTGTTTAGATGAAATCCCATGGACTAC
>RandomSequence_ewuch2NIsLWfmo5ir2Tz5if26iKSLHU2
CAAACTGTGTCACTTGGTTTTGGGGAAAGCCGTCAAATCCCTGTATTCGCTATCCATCAAATGAGACTGGACTACCCCAC

This output is obtained (truncated the Record displays for character count):

SeqRecord(seq=Seq('TGGGAATTCTGCGTAG...CGT', SingleLetterAlphabet()), id='RandomSequence_ruL79349EujgaR6sv3fehXYxjIUtHQ5C', ... , dbxrefs=[])
MD5Record(seq=Seq('TGGGAATTCTGCGTAG...CGT', SingleLetterAlphabet()), id='RandomSequence_ruL79349EujgaR6sv3fehXYxjIUtHQ5C', ... , dbxrefs=[], md5='e6542eb6cd69eae9993e68c4bd5696ae')
<<<truncated>>>
BioPython won't let you compare these
True
False
True
1

(Note the 1 will change from run to run, as its a random choice.)

ADD COMMENT

Login before adding your answer.

Traffic: 1502 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6