Question

How to index SeqRecord objects in the list?

0

Entering edit mode

4.0 years ago

adrian18_07 ▴ 10

I would like to get a list of indexes for SeqRecords that are in list f. I tried this:

for x in f:
    ind = f.index(x)
    print(ind)

But I get the error:

0
Traceback (most recent call last):

File "C:\Users\...", line 43, in <module>
ind = f.index(x)

File "C:\Users\...\anaconda3\lib\site-packages\Bio\SeqRecord.py", line 803, in __eq__
raise NotImplementedError(_NO_SEQRECORD_COMPARISON)

NotImplementedError: SeqRecord comparison is deliberately not implemented. Explicitly compare the 
attributes of interest.

Thanks for any answer.

biopython index • 1.4k views

ADD COMMENT • link updated 4.0 years ago by Joe 21k • written 4.0 years ago by adrian18_07 ▴ 10

0

Entering edit mode

As the error suggests, Biopython doesn't allow you to directly compare two SeqRecords as there's not a meaningful way to compare them. You need to choose an attribute (e.g. its sequence or its ID) to compare between entries.

You could get a bit hacky with it and subclass the SeqRecord class, and then define .__eq__ directly yourself, but otherwise unfortunately I think you're going to need to approach this in a different way.

ADD REPLY • link 4.0 years ago by Joe 21k

score 1 · Answer 1 · 2020-06-04

So this interested me and I spent some time hacking at it today (emphasis on hack).

NB - As Biopython doesn't implement this behaviour, there is probably a very good reason for it. That said, I think hashing the sequences is legit, and the odds of a hash collision are very low (as I've used the whole data structure representation).

Basically, I elected to subclass the SeqRecord, and extend it with the hash as it's built, and make that hash the basis of the equality check. In essence this means 2 things: firstly, you need to define the hash, and the more unique info it can be built from, the less likely it is you'll get a collision; secondly, you need to (re)implement what it means for two objects to be the same/equal (in python this is the __eq__ method).

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import pprint
import sys
import hashlib
import random

class MD5Record(SeqRecord):
    def __init__(self, seqrecord):
        self._seq = seqrecord._seq
        self.id = seqrecord.id
        self.name = seqrecord.name
        self.description = seqrecord.description
        self.dbxrefs = seqrecord.dbxrefs
        self.md5 = hashlib.md5(pprint.pformat(seqrecord).encode('utf-8')).hexdigest()

    def __repr__(self):
        return "{0}(seq={1!r}, id={2!r}, name={3!r}, description={4!r}, dbxrefs={5!r}, md5={6!r})".format(
                 self.__class__.__name__,
                 self.seq, self.id, self.name,
                 self.description, self.dbxrefs,
                 self.md5)

    def __eq__(self, other):
        return self.md5 == other.md5

    def __ne__(self, other):
        return self.md5 != other.md5


recs = list(SeqIO.parse(sys.argv[1], 'fasta'))

hashed_recs = []
for rec in recs:
    pprint.pprint(rec)
    hashed = MD5Record(rec)
    pprint.pprint(hashed)
    hashed_recs.append(hashed)

try:
    recs[0] == recs[0]  # True in theory
    recs[0] == recs[1]  # False, in theory
except NotImplementedError:
    print("BioPython won't let you compare these")

print(hashed_recs[0] == hashed_recs[0])   # True
print(hashed_recs[0] == hashed_recs[1])   # False
print(hashed_recs[0] != hashed_recs[1])   # True

# Pick an entry to index at random
random = random.choice(hashed_recs)    
index = hashed_recs.index(random)
print(index)

Notice I've also redefined the __repr__ to include the md5 for a given record but this is optional. In order for inequality checks to work, you also have to define __ne__ as above.

The key to being able to index the objects in a list, is the __eq__ method, which now works, as equality for the purposes of indexing is defined as the hashes being the same.

Given this input,

$ cat Genes1.fasta
>RandomSequence_ruL79349EujgaR6sv3fehXYxjIUtHQ5C
TGGGAATTCTGCGTAGTCTCGTGACCGAACGCCGCCTTGAAGTGCGCGCCGCTGTACGATGACCGTCAATATTGATCCGT
>RandomSequence_iEjYsbeuYSY84hj8hSt7C0ndLJfPzwBh
AAGTGCGGTGATACCGACACCCAGCGCGCTTTGTACTCATGTTCATGACAAGGTTGTTTAGATGAAATCCCATGGACTAC
>RandomSequence_ewuch2NIsLWfmo5ir2Tz5if26iKSLHU2
CAAACTGTGTCACTTGGTTTTGGGGAAAGCCGTCAAATCCCTGTATTCGCTATCCATCAAATGAGACTGGACTACCCCAC

This output is obtained (truncated the Record displays for character count):

SeqRecord(seq=Seq('TGGGAATTCTGCGTAG...CGT', SingleLetterAlphabet()), id='RandomSequence_ruL79349EujgaR6sv3fehXYxjIUtHQ5C', ... , dbxrefs=[])
MD5Record(seq=Seq('TGGGAATTCTGCGTAG...CGT', SingleLetterAlphabet()), id='RandomSequence_ruL79349EujgaR6sv3fehXYxjIUtHQ5C', ... , dbxrefs=[], md5='e6542eb6cd69eae9993e68c4bd5696ae')
<<<truncated>>>
BioPython won't let you compare these
True
False
True
1

(Note the 1 will change from run to run, as its a random choice.)