Question

MemoryError using BioPython to extract BUSCO's from genome

1

Entering edit mode

3.5 years ago

Torsten ▴ 10

Hi, I ran into a problem, which I can't fix. I wrote a short script to extract protein sequences from a genome based on the output of BUSCO.

from Bio import SeqIO

fasta_SCR0023 = SeqIO.index('Galaxy37-[FASTA_from_pilon_on_data_35_and_data_32].fasta', 'fasta')
fasta_SCR0023_prot = open('SCR0023_protein.fasta', 'w')
fasta_SCR0023_prot.close()

busco = open('Galaxy38-[Busco_on_data_37__full_table].tabular', 'r')

for line in busco:
    if not line.startswith('#'):
        parts = line.split('\t')
        if parts[1] == 'Complete':
            record = fasta_SCR0023[parts[2].strip()]
            record.seq = record.seq[int(parts[3])-1:int(parts[4])]
            if parts[5] == '-':
                record.seq = record.seq.reverse_complement()
            record.seq = record.seq.translate()
            rec_id = 'SCR0023_' + parts[9].replace(' ', '_')
            rec_id = rec_id.strip()
            record.id = rec_id + '_' + parts[0].strip()
            record.name = rec_id
            record.description = parts[9].strip()
            print(record)
            with open('SCR0023_protein.fasta', 'a') as new_fasta_file:
                SeqIO.write(record, new_fasta_file, 'fasta')
        print('-------------------------------------------------')

The script runs for some time but the it stops with the following error:

Traceback (most recent call last):
  File "C:/Temp/SCR0023_genome/extract_BUSCOs.py", line 13, in <module>
    record = fasta_SCR0023[parts[2].strip()]
  File "C:\Programme_TT\Python_380\lib\site-packages\Bio\File.py", line 424, in __getitem__
    record = self._proxy.get(self._offsets[key])
  File "C:\Programme_TT\Python_380\lib\site-packages\Bio\SeqIO\_index.py", line 74, in get
    return self._parse(StringIO(_bytes_to_string(self.get_raw(offset))))
MemoryError

The RAM of my PC is sufficient. That's not the problem.

Has anyone any ideas how to fix this?

Cheers, Torsten

index FASTA Biopython BUSCO • 2.1k views

ADD COMMENT • link updated 3.5 years ago by Michael 56k • written 3.5 years ago by Torsten ▴ 10

score 1 · Answer 1 · 2022-04-07

1

Entering edit mode

3.5 years ago

Michael 56k

Difficult to say, also you cannot claim your RAM is sufficient for the task with evidence suggesting otherwise and lacking information about your data size ... but one wild guess:

Try moving this line outside of your outer loop:

with open('SCR0023_protein.fasta', 'a') as new_fasta_file:

Currently, the open call is made for each line in your input file, which is unnecessary. Depending on how python manages system file handles internally, you may simply end up with too many temporarily open file handles. It may not solve it but at least will reduce the number of system calls significantly.

In fact you should not need this line at all because the file was already created and opened, but then closed, in the beginning.

Another point would be to not re-use the seqrecord obtained from the indexed object, because that is a reference not a copy, changing the id's may not go down well. So, try to make a new record instead.

Try this:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

fasta_SCR0023 = SeqIO.index('Galaxy37-[FASTA_from_pilon_on_data_35_and_data_32].fasta', 'fasta')
fasta_SCR0023_prot = open('SCR0023_protein.fasta', 'a') # you can open in append mode directly


busco = open('Galaxy38-[Busco_on_data_37__full_table].tabular', 'r')

for line in busco:
    if not line.startswith('#'):
        parts = line.split('\t')
        # adding some safety checks, lines could be empty
        if len(parts) > 0 and parts[1].strip() == 'Complete': # you have used strip for some fields, might be safe to use it for all

            record = fasta_SCR0023.get(parts[2].strip()) # returns None in case the record is not there              
            if not record: continue
            myseq = record.seq[int(parts[3])-1:int(parts[4])]
            if parts[5].strip() == '-':
                myseq = myseq.reverse_complement()
            rec_id = 'SCR0023_' + parts[9].strip().replace(' ', '_') + '_' + parts[0].strip()
            aarecord = SeqRecord(
                               myseq,                   # myseq.translate(), ## removed the translation because the result may be invalid
                               id=rec_id,
                               name=rec_id,
                               description=parts[9].strip()                
                               )               
            print(aarecord)
            SeqIO.write(aarecord, fasta_SCR0023_prot, 'fasta')
        print('-------------------------------------------------')
fasta_SCR0023_prot.close()

ADD COMMENT • link 3.5 years ago by Michael 56k

0

Entering edit mode

Thanks a lot Michael for your reply. I looked at the RAM usage while the script was running. The script stopped with only 70% of RAM used. The fasta file of the genome is about 8 MB,

I also tried your approach, but got the same error message.

Cheers, Torsten

ADD REPLY • link 3.5 years ago by Torsten ▴ 10

0

Entering edit mode

Could you try again please, with the last update making a new seqrecord? How large are your input files?

If you are seeing 70% mem usage of your program (how? e.g. using top?) you might very likely been run out of memory without noticing it.

Try some memory profiling https://stackoverflow.com/questions/552744/how-do-i-profile-memory-usage-in-python The most likely biggest memhog is the index. You have several options then:

drop the index and use something else: https://biopython.org/wiki/SeqIO
search through the non-indexed SeqIO each time (slow)
break down the Fasta file into chunks and run the script multiple times, adding some safety checks
get more RAM
use Perl ;) (not kidding, this is most likely going to work just fine with pure Perl or BioPerl )

ADD REPLY • link 3.5 years ago by Michael 56k

0

Entering edit mode

Unfortunately I got the same error even with the subrecord. I looked at the Task-Manager of my PC. Script stopped at the second peek. Memory use

ADD REPLY • link 3.5 years ago by Torsten ▴ 10

0

Entering edit mode

Ok, this is windows right? It is very likely that you need more RAM. Try one of the options I have mentioned above. I have added a few checks, so your script would be safe to run in chunks too. Break down the fasta file into two or four chunks using e.g. seqkit and run the script using each input fasta file. It will then just add the entries that are found in the chunk and otherwise do nothing. Make sure you delete the output file once before the first attempt to start fresh.

ADD REPLY • link 3.5 years ago by Michael 56k

0

Entering edit mode

This resulted in the following error:

Traceback (most recent call last):
  File "C:/Temp/SCR0023_genome/extract_BUSCOs.py", line 19, in <module>
    if record == None: continue
  File "C:\Programme_TT\Python_380\lib\site-packages\Bio\SeqRecord.py", line 803, in __eq__
    raise NotImplementedError(_NO_SEQRECORD_COMPARISON)
NotImplementedError: SeqRecord comparison is deliberately not implemented. Explicitly compare the attributes of interest.

This is either a RAM or a Windows problem. I ran my initial script on a Linux machine, and it worked just fine. The Linux machine has much more RAM, so this might be the reason. Before you ask: At work, I have to use a Windows machine. I used my private Linux machine to test it. ;-)

ADD REPLY • link 3.5 years ago by Torsten ▴ 10

0

Entering edit mode

SeqRecord comparison is deliberately not implemented.

Ok, I see I see, honestly, the above notice is a hilarious explanation (so someone 'deliberately did it wrong' does that make it any better?), but try either removing the line or replace with if not record: continue.

So to deal with missing records, the index works like a dictionary essentially, there are two ways like so:

    try: 
        bla = my_fasta['blastfemi']  # throws key error if not exists
    except: 
        pass
    bla = my_fasta.get('bloody blastfemi') # returns None if not exists 
    if bla:
        print(bla)

ADD REPLY • link 3.5 years ago by Michael 56k

score 1 · Answer 2 · 2022-04-08

Hi Torsten, now, when looking more deeply into your approach, unfortunately, I noticed that your method is invalid, for Eukaryotes at least. You wish to retrieve the BUSCO amino acid sequences using the genomic coordinates in the output file full_table.tsv. However, these are gene start and end coordinates, and they are unlikely to yield a correctly spliced CDS. For bacteria, this might work, but even there frameshifts could occur. Therefore doing the translation will likely give incorrect results. Better output the complete gene nucleotide sequence and run transeq to get all 6 reading frames, or blast the sequences using blastx.

Also, there is a much better way to get all the sequences from a BUSCO run: look into the folder <output_folder>/run_<lineage>_odb10/busco_sequences there are three sub directories: fragmented_busco_sequences/ multi_copy_busco_sequences/ single_copy_busco_sequences/ which contain the respective detected AA-sequences. The extraction method could still be useful for other things, so I am leaving it here.

score 0 · Answer 3 · 2022-04-07

Here is a perl/bioperl script that does the same as the python code. It requires to install BioPerl.

usage: perl buscoExtract.pl genome.fasta busco.txt

#!/usr/bin/env perl
use strict;
use warnings;
use Bio::DB::Fasta;
use Bio::SeqIO;
## adding simple whitespace trimming just in case
sub trim {my $s = shift; $s =~ s/^\s+|\s+$//g; return $s};

my $out = Bio::SeqIO->new( -format => "Fasta" ); 
# Create database from a single Fasta file or even a whole directory of files
# Creates a persistent index file on disk the first time it is run by parsing the sequence identifiers
# This index makes it possible to randomly access the sequences without being loaded into memory
# BioPython doesn't have anything comparable
my $db = Bio::DB::Fasta->new($ARGV[0]);
open my $buscos, $ARGV[1] || die "could not open busco file $!";
while (<$buscos>) {
  chomp;
  next if /^\s*\#|^\s*$/; # skip comments and blank lines
  my @p = split "\t";
  @p = map {trim($_)} @p; # trim trailing whitespace from the array
  #print join ',', @p;
  next unless $p[1] eq 'Complete';
  my $seq =  $db->seq($p[2],$p[3],$p[4],$p[5]); # check if the index is 0 or 1 based in the busco output!
  $p[9] =~ s/\s/_/g;
  my $id = 'SCR0023_' . $p[9] . '_' . $p[0];
  $seq = Bio::Seq->new(-seq => $seq, -display_id => $id, -description => $p[9]);
  $out->write_seq($seq); ## translation of unspliced genomic sequence is moot
}

__END__

#BUSCOID    STATUS  SEQID   START   STOP    STRAND  6   7   8   DESCRIPTION
616 Complete        LSalAtl2s1   100    200 -   bla blubb   foo GeneOfDeath (you got me)
6112    Complete    LSalAtl2s1  300 333 +   bla blubb   foo GeneOfDeath-like (you got me too)
6113    Complete    LSalAtl2s2  1000    1400    +   bla blubb   foo GeneOfDeath-like (you got me too)

Output:

>SCR0023_GeneOfDeath_(you_got_me)_616 GeneOfDeath_(you_got_me)
FSFPSF*IFSKI*FY*EKYTIENLMEYLHFNEI
>SCR0023_GeneOfDeath-like_(you_got_me_too)_6112 GeneOfDeath-like_(you_got_me_too)
KKISSLTKFTH
>SCR0023_GeneOfDeath-like_(you_got_me_too)_6113 GeneOfDeath-like_(you_got_me_too)
VPTTFVVRYEYIHFPFWLIDNIAMIKHNLINNGTLHFRIMNWFL*MMVQNSYTLHRAALW
PMLDLKEI*LSIRYIILIR*MQQICFWRHLV*KLKSITILM*KVQVQ*LLNIMNAQRYIQ
TLKQKNAYFPSNL

When running this process in an infinite loop on a fasta file of size 673MB, the process uses no more than 11.7 MB of memory.