Memory error in Python
20 months ago

Hello, I want to extract the human-HCV (Hepatitis C virus) protein-protein interactions (PPIs). To do this, I downloaded the entire content of the IntAct database as a .txt file. This file is huge (4 GB). I tried to convert it to a CSV file with Python and then extract just the human-HCV PPIs. The problem is the size of the file: I run into a memory error.

input:

import pandas as pd

read_file = pd.read_csv('intact.txt', delimiter='\t')
read_file.to_csv('intact.csv', index=None)

output: MemoryError: Unable to allocate 162. MiB for an array with shape (41, 1035669) and data type object


How should I solve this issue? I would sincerely appreciate your help.

Protein-Protein Interaction python memory error • 7.1k views

Did you try zero initialization?

import numpy as np
read_file = np.zeros((41, 1035669))  # shape must be a tuple; might require a dtype argument...


Also, did you check the usual suspects on Stack Overflow? For example: https://stackoverflow.com/questions/57507832/unable-to-allocate-array-with-shape-and-data-type


No, I didn't try that. Sorry, I'm not an expert in Python. Should I put the code you mentioned before read_file = pd.read_csv('intact.txt', delimiter='\t')?

6 months ago
linehammer ▴ 10

Memory errors happen a lot with Python when using the 32-bit Windows version. This is because a 32-bit process only gets 2 GB of memory to play with by default.
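
As a quick check on the 32-bit explanation, here is a minimal sketch (not part of the original answer) that reports the pointer size of the running interpreter and compares the array shape from the error message in the question against 4-byte and 8-byte pointers. The helper name object_array_mib is made up for illustration. An object array stores one pointer per cell, so the 162 MiB in the error matches 4-byte pointers, which suggests the asker was running a 32-bit Python.

```python
import struct

# Size of a C pointer in this interpreter: 4 bytes on 32-bit builds, 8 on 64-bit.
POINTER_BYTES = struct.calcsize("P")
INTERPRETER_BITS = POINTER_BYTES * 8


def object_array_mib(rows, cols, pointer_bytes):
    """Size in MiB of a rows x cols array holding one object pointer per cell."""
    return rows * cols * pointer_bytes / 2**20


print(f"This interpreter is {INTERPRETER_BITS}-bit")
# Shape (41, 1035669) is taken from the error message in the question.
print(f"4-byte pointers: {object_array_mib(41, 1035669, 4):.0f} MiB")
print(f"8-byte pointers: {object_array_mib(41, 1035669, 8):.0f} MiB")
```

If the first line reports 32-bit, installing a 64-bit Python is probably the simplest fix before trying anything else.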

One way to reduce memory use is the dtype option of pandas.read_csv(), which tells pandas what types exist inside your CSV data.

For example, passing dtype={'age': int} to read_csv() lets pandas know that age should be interpreted as a number. This can save a lot of memory compared with storing the column as generic Python objects.

pd.read_csv('data.csv',dtype={'age':int})


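Or try low_memory=False, which makes pandas parse the file in one pass instead of in internal chunks. Note that this mainly avoids mixed-type inference problems and does not by itself reduce memory use, since the whole file still ends up in a single DataFrame: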

pd.read_csv('data.csv',sep='\t',low_memory=False)
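
If the file still does not fit in memory, read_csv() also accepts a chunksize argument, which returns the data in pieces instead of one big DataFrame. Below is a rough sketch of filtering a large tab-separated file that way. The function names, the column names 'Taxid interactor A' and 'Taxid interactor B', the value format taxid:9606(human), and the HCV taxonomy ID 11103 are all assumptions to check against your own file (HCV isolates may carry other taxonomy IDs).

```python
import pandas as pd


def is_human_hcv(chunk, col_a="Taxid interactor A", col_b="Taxid interactor B",
                 human="taxid:9606(", hcv="taxid:11103("):
    """Boolean mask: one interactor is human and the other is HCV.

    Column names and taxonomy tokens are assumptions, not verified IntAct values.
    """
    def has(series, token):
        # Plain substring match; missing values count as "no match".
        return series.str.contains(token, regex=False, na=False)

    a, b = chunk[col_a], chunk[col_b]
    return (has(a, human) & has(b, hcv)) | (has(a, hcv) & has(b, human))


def filter_in_chunks(in_path, out_path, keep_rows, chunksize=100_000):
    """Read a TSV in chunks and write only the rows selected by keep_rows to CSV."""
    first = True
    # dtype=str skips type inference, so every column is read as text.
    reader = pd.read_csv(in_path, sep="\t", dtype=str, chunksize=chunksize)
    for chunk in reader:
        kept = chunk[keep_rows(chunk)]
        # Write the header once, then append the following chunks.
        kept.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
        first = False


# Example call (file names are placeholders):
# filter_in_chunks("intact.txt", "human_hcv.csv", is_human_hcv)
```

Only one chunk is held in memory at a time, so peak memory depends on chunksize rather than on the size of the file.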

20 months ago

You don't need the entire file in memory, and you don't need pandas.

Just loop over the lines in the file, replacing tabs by commas. The following code is untested but should give you the general idea.

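# Open the output for writing; both files are closed automatically at the end.
with open("myinput.tsv") as infile, open("myoutput.csv", "w") as output:
    for line in infile:
        # Note: fields that already contain commas would need quoting.
        output.write(line.replace('\t', ','))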
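
A slightly more careful sketch of the same line-by-line idea uses the standard csv module, so that fields which already contain commas or quotes are quoted properly in the output. The function name tsv_to_csv and its keep and has_header parameters are hypothetical, and the column positions and taxonomy tokens in the usage comment are assumptions about the IntAct layout that should be checked against the actual file.

```python
import csv


def tsv_to_csv(in_path, out_path, keep=lambda row: True, has_header=True):
    """Convert a TSV file to CSV one row at a time; return the number of data rows written."""
    kept = 0
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        # QUOTE_NONE: treat double quotes in the input as ordinary characters.
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst)  # quotes fields containing commas or quotes
        if has_header:
            header = next(reader, None)
            if header is not None:
                writer.writerow(header)
        for row in reader:
            if keep(row):
                writer.writerow(row)
                kept += 1
    return kept


# Example call (assumes taxonomy IDs sit in the 10th and 11th columns):
# tsv_to_csv("intact.txt", "human_hcv.csv",
#            keep=lambda r: {"taxid:9606(" in r[9], "taxid:11103(" in r[10]} == {True}
#                        or {"taxid:11103(" in r[9], "taxid:9606(" in r[10]} == {True})
```

Because only one row is in memory at a time, this should work regardless of how large the input file is.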

20 months ago
Mensur Dlakic ★ 15k

You don't need Pandas for this. Or Python. Or Perl, even though one of my suggestions below uses it.

Copy the file:

cp intact.txt intact.csv


Replace tabs with commas:

perl -pi -e 's/\t/\,/g' intact.csv


or

sed -i 's/\t/\,/g' intact.csv