How to index SeqRecord objects in the list?
1
0
Entering edit mode
18 months ago

I would like to get a list of indexes for SeqRecords that are in list f. I tried this:

for x in f:
ind = f.index(x)
print(ind)


But I get the error:

0
Traceback (most recent call last):

File "C:\Users\...", line 43, in <module>
ind = f.index(x)

File "C:\Users\...\anaconda3\lib\site-packages\Bio\SeqRecord.py", line 803, in __eq__
raise NotImplementedError(_NO_SEQRECORD_COMPARISON)

NotImplementedError: SeqRecord comparison is deliberately not implemented. Explicitly compare the
attributes of interest.


biopython index • 580 views
0
Entering edit mode

As the error suggests, Biopython doesn't allow you to directly compare two SeqRecords as there's not a meaningful way to compare them. You need to choose an attribute (e.g. its sequence or its ID) to compare between entries.

You could get a bit hacky with it and subclass the SeqRecord class, and then define .__eq__ directly yourself, but otherwise unfortunately I think you're going to need to approach this in a different way.

1
Entering edit mode
18 months ago
Joe 19k

So this interested me and I spent some time hacking at it today (emphasis on hack).

NB - As Biopython doesn't implement this behaviour, there is probably a very good reason for it. That said, I think hashing the sequences is legit, and the odds of a hash collision are very low (as I've used the whole data structure representation).

Basically, I elected to subclass the SeqRecord, and extend it with the hash as it's built, and make that hash the basis of the equality check. In essence this means 2 things: firstly, you need to define the hash, and the more unique info it can be built from, the less likely it is you'll get a collision; secondly, you need to (re)implement what it means for two objects to be the same/equal (in python this is the __eq__ method).

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import pprint
import sys
import hashlib
import random

class MD5Record(SeqRecord):
def __init__(self, seqrecord):
self._seq = seqrecord._seq
self.id = seqrecord.id
self.name = seqrecord.name
self.description = seqrecord.description
self.dbxrefs = seqrecord.dbxrefs
self.md5 = hashlib.md5(pprint.pformat(seqrecord).encode('utf-8')).hexdigest()

def __repr__(self):
return "{0}(seq={1!r}, id={2!r}, name={3!r}, description={4!r}, dbxrefs={5!r}, md5={6!r})".format(
self.__class__.__name__,
self.seq, self.id, self.name,
self.description, self.dbxrefs,
self.md5)

def __eq__(self, other):
return self.md5 == other.md5

def __ne__(self, other):
return self.md5 != other.md5

recs = list(SeqIO.parse(sys.argv[1], 'fasta'))

hashed_recs = []
for rec in recs:
pprint.pprint(rec)
hashed = MD5Record(rec)
pprint.pprint(hashed)
hashed_recs.append(hashed)

try:
recs[0] == recs[0]  # True in theory
recs[0] == recs[1]  # False, in theory
except NotImplementedError:
print("BioPython won't let you compare these")

print(hashed_recs[0] == hashed_recs[0])   # True
print(hashed_recs[0] == hashed_recs[1])   # False
print(hashed_recs[0] != hashed_recs[1])   # True

# Pick an entry to index at random
random = random.choice(hashed_recs)
index = hashed_recs.index(random)
print(index)


Notice I've also redefined the __repr__ to include the md5 for a given record but this is optional. In order for inequality checks to work, you also have to define __ne__ as above.

The key to being able to index the objects in a list, is the __eq__ method, which now works, as equality for the purposes of indexing is defined as the hashes being the same.

Given this input,

\$ cat Genes1.fasta
>RandomSequence_ruL79349EujgaR6sv3fehXYxjIUtHQ5C
TGGGAATTCTGCGTAGTCTCGTGACCGAACGCCGCCTTGAAGTGCGCGCCGCTGTACGATGACCGTCAATATTGATCCGT
>RandomSequence_iEjYsbeuYSY84hj8hSt7C0ndLJfPzwBh
AAGTGCGGTGATACCGACACCCAGCGCGCTTTGTACTCATGTTCATGACAAGGTTGTTTAGATGAAATCCCATGGACTAC
>RandomSequence_ewuch2NIsLWfmo5ir2Tz5if26iKSLHU2
CAAACTGTGTCACTTGGTTTTGGGGAAAGCCGTCAAATCCCTGTATTCGCTATCCATCAAATGAGACTGGACTACCCCAC


This output is obtained (truncated the Record displays for character count):

SeqRecord(seq=Seq('TGGGAATTCTGCGTAG...CGT', SingleLetterAlphabet()), id='RandomSequence_ruL79349EujgaR6sv3fehXYxjIUtHQ5C', ... , dbxrefs=[])
MD5Record(seq=Seq('TGGGAATTCTGCGTAG...CGT', SingleLetterAlphabet()), id='RandomSequence_ruL79349EujgaR6sv3fehXYxjIUtHQ5C', ... , dbxrefs=[], md5='e6542eb6cd69eae9993e68c4bd5696ae')
<<<truncated>>>
BioPython won't let you compare these
True
False
True
1


(Note the 1 will change from run to run, as its a random choice.)