Hi, I ran into a problem that I can't fix. I wrote a short script to extract protein sequences from a genome based on the output of BUSCO.
from Bio import SeqIO

# Index the genome assembly so contigs can be fetched by name without
# loading the whole file into memory
fasta_SCR0023 = SeqIO.index('Galaxy37-[FASTA_from_pilon_on_data_35_and_data_32].fasta', 'fasta')

# Create (or truncate) the output file so we can append to it later
fasta_SCR0023_prot = open('SCR0023_protein.fasta', 'w')
fasta_SCR0023_prot.close()

busco = open('Galaxy38-[Busco_on_data_37__full_table].tabular', 'r')
for line in busco:
    if not line.startswith('#'):
        parts = line.split('\t')
        if parts[1] == 'Complete':
            # Fetch the contig and cut out the BUSCO region (1-based coordinates)
            record = fasta_SCR0023[parts[2].strip()]
            record.seq = record.seq[int(parts[3])-1:int(parts[4])]
            if parts[5] == '-':
                record.seq = record.seq.reverse_complement()
            record.seq = record.seq.translate()
            rec_id = 'SCR0023_' + parts[9].replace(' ', '_')
            rec_id = rec_id.strip()
            record.id = rec_id + '_' + parts[0].strip()
            record.name = rec_id
            record.description = parts[9].strip()
            print(record)
            with open('SCR0023_protein.fasta', 'a') as new_fasta_file:
                SeqIO.write(record, new_fasta_file, 'fasta')
            print('-------------------------------------------------')
The script runs for some time, but then it stops with the following error:
Traceback (most recent call last):
  File "C:/Temp/SCR0023_genome/extract_BUSCOs.py", line 13, in <module>
    record = fasta_SCR0023[parts[2].strip()]
  File "C:\Programme_TT\Python_380\lib\site-packages\Bio\File.py", line 424, in __getitem__
    record = self._proxy.get(self._offsets[key])
  File "C:\Programme_TT\Python_380\lib\site-packages\Bio\SeqIO\_index.py", line 74, in get
    return self._parse(StringIO(_bytes_to_string(self.get_raw(offset))))
MemoryError
The RAM of my PC is sufficient. That's not the problem.
Has anyone any ideas how to fix this?
Cheers, Torsten
Thanks a lot, Michael, for your reply. I watched the RAM usage while the script was running; it stopped with only 70% of the RAM in use. The fasta file of the genome is about 8 MB.
I also tried your approach, but got the same error message.
Cheers, Torsten
Could you try again please, with the last update that makes a new SeqRecord? How large are your input files?
If you are seeing 70% memory usage from your program (measured how? e.g. using top?), you may very well have run out of memory without noticing it.
Try some memory profiling (see https://stackoverflow.com/questions/552744/how-do-i-profile-memory-usage-in-python). The biggest memory hog is most likely the index. You have several options then.
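One concrete option along those lines (my sketch, not necessarily one of the options Michael had in mind) is to replace SeqIO.index() with SeqIO.index_db(), which keeps the lookup table in an SQLite file on disk rather than in RAM; the index file name here is a placeholder:

from Bio import SeqIO

# index_db() stores the record offsets in an SQLite file on disk, so only
# the record you actually request is read into memory
fasta_SCR0023 = SeqIO.index_db(
    'SCR0023_index.sqlite',  # hypothetical name for the on-disk index
    'Galaxy37-[FASTA_from_pilon_on_data_35_and_data_32].fasta',
    'fasta')

record = fasta_SCR0023['contig_1']  # placeholder contig name

The rest of the script can stay unchanged, since the object returned by index_db() is used the same way as the one from index().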
Unfortunately, I got the same error even with the subrecord. I watched the Task Manager on my PC; the script stopped at the second peak.
Ok, this is Windows, right? It is very likely that you need more RAM. Try one of the options I mentioned above. I have added a few checks, so your script is safe to run in chunks, too. Break the fasta file down into two or four chunks, using e.g. seqkit (or the sketch below), and run the script on each input fasta file. It will then just add the entries that are found in the chunk and otherwise do nothing. Make sure you delete the output file once before the first attempt, to start fresh.
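If seqkit is not at hand, the splitting itself can also be done with Biopython; a minimal sketch (input and output file names are placeholders):

from Bio import SeqIO

N_CHUNKS = 4  # number of pieces to split the genome into

# One output handle per chunk; records are distributed round-robin, so the
# genome is never held in memory as a whole
handles = [open('genome_chunk_%d.fasta' % (i + 1), 'w') for i in range(N_CHUNKS)]
for i, record in enumerate(SeqIO.parse('genome.fasta', 'fasta')):
    SeqIO.write(record, handles[i % N_CHUNKS], 'fasta')
for handle in handles:
    handle.close()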
This resulted in the following error:
This is either a RAM or a Windows problem. I ran my initial script on a Linux machine, and it worked just fine. The Linux machine has much more RAM, so this might be the reason. Before you ask: At work, I have to use a Windows machine. I used my private Linux machine to test it. ;-)
Ok, I see. Honestly, the above notice is a hilarious explanation (so someone 'deliberately did it wrong'; does that make it any better?), but try either removing the line or replacing it with
if not record: continue
. So, to deal with missing records: the index essentially works like a dictionary, so there are two ways, like so:
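A minimal sketch of both ways, assuming the index object behaves like a read-only dict (these lines would sit inside the for loop of the original script):

# Way 1: test for the key first; the index supports 'in' like a dict
contig = parts[2].strip()
if contig not in fasta_SCR0023:
    continue
record = fasta_SCR0023[contig]

# Way 2: use get(), which returns None for a missing key
record = fasta_SCR0023.get(parts[2].strip())
if not record:
    continue

Either way, records whose contig is not present in the current input file are simply skipped, which is what makes running the script chunk by chunk safe.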