How to open huge Excel files with genomic info
3.3 years ago

What is the best way to open an Excel file containing info on over 6 million SNPs? Each row is a SNP, with its p-value and other info.

I'd like to split the file into several different files.

Is there a way to do it?

Thank you

R SNP sequencing GWAS

Convert it (i.e., save it as a tab-delimited txt file) and process it in Linux using awk or such?
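If you prefer Python to awk, a minimal sketch of the same idea, assuming the exported TSV has a header row with a p-value column (the file and column names here are made up):

import csv

# Stream the exported TSV row by row and keep SNPs below a p-value cutoff;
# streaming avoids loading all 6 million rows into memory at once.
with open('snps.tsv', newline='') as fin, open('snps_filtered.tsv', 'w', newline='') as fout:
    reader = csv.DictReader(fin, delimiter='\t')
    writer = csv.DictWriter(fout, fieldnames=reader.fieldnames, delimiter='\t')
    writer.writeheader()
    for row in reader:
        if float(row['P']) < 5e-8:  # conventional genome-wide significance threshold
            writer.writerow(row)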


Is it an xls/xlsx file, or simply a csv file that your computer opens in Excel?


Are you on a Mac or Windows computer? On a Mac, you have access to a wide range of Unix tools, including split, which will split up a large file into several small ones (see here).

Your process might be:

  1. Export a tab-delimited file from Excel
  2. Replace Microsoft line-endings with normal ones (via dos2unix)
  3. Count the true number of lines in your tab-delimited text file (via wc -l)
  4. Pick a desired number and split your file (via split -l N, replacing N with the desired number of lines per chunk)

If you're on Windows, you might install a Unix-like environment (e.g., Cygwin) to get access to standard line-processing tools. Windows has PowerShell, but it isn't standard outside of Windows and these steps are somewhat more involved there.
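If you end up with no Unix tools at all, steps 2-4 can also be approximated in plain Python (a sketch; the file name and chunk size are made up, and Python's text mode already normalizes Windows line endings on read):

# Split a big tab-delimited export into chunks of N data lines each,
# repeating the header line in every chunk.
N = 1_000_000
with open('snps.tsv') as fin:  # text mode converts \r\n to \n on read
    header = fin.readline()
    chunk, count = 0, 0
    out = open(f'snps_part{chunk:03d}.tsv', 'w')
    out.write(header)
    for line in fin:
        if count == N:
            out.close()
            chunk += 1
            count = 0
            out = open(f'snps_part{chunk:03d}.tsv', 'w')
            out.write(header)
        out.write(line)
        count += 1
    out.close()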

3.3 years ago

It seems Excel has a limit of 1,048,576 rows, so I suppose your SNPs are split across different sheets.

In case it's useful: I wrote an excelToCsv converter. It's Java, so it should work on any system. To convert all the sheets you could do:

excelToCsv -f -i snps.xlsx > snps.csv
3.3 years ago
Mensur Dlakic

Using pandas in Python to convert to both comma- and tab-separated text files:

import pandas as pd
df = pd.read_excel('my_file.xlsx')  # reads the first sheet by default
df.to_csv('my_file.csv', index=False)
df.to_csv('my_file.tsv', index=False, sep='\t')

For Excel files with multiple sheets:

import pandas as pd
# sheet_name=None reads every sheet into a dict of DataFrames; concat stacks them into one table
df = pd.concat(pd.read_excel('my_file.xlsx', sheet_name=None), ignore_index=True)
df.to_csv('my_file.csv', index=False)
df.to_csv('my_file.tsv', index=False, sep='\t')

I kind of advise against pandas for this. For example, see this question of mine on Stack Overflow. In general, I think pandas- and R-based solutions don't handle input well that isn't already nicely tabular and well behaved.

While writing excelToCsv I realised there are lots of funny things happening with Excel conversion. For example, 0.0123 may be exported as 1E-02 even when exporting with Excel itself (!!). If a number is formatted with commas as thousands separators (e.g. 1,000,000), Excel exports it as the string "1,000,000"; good luck reading it back as a number. Dates are a mess. My program should handle these cases quite well (as far as I can tell...)
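As a small illustration of the thousands-separator quirk: if you do end up with such a csv, pandas can be told what the commas mean when reading it back (a sketch; the file name is made up, and read_csv's thousands parameter is the relevant knob):

import pandas as pd

# A value exported by Excel as the string "1,000,000" comes back as an
# object (string) column unless pandas is told what the separator means.
df = pd.read_csv('exported.csv', thousands=',')
print(df.dtypes)  # the column now parses as a numeric dtype, not object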


Can you comment on how your tool handles these "funny" cases, just out of interest?


The geeky detail is that I use the Apache POI package for Java which, it seems to me, is way more sophisticated than the various Python/R counterparts and gives a lot of low-level options for manipulating cells. In general, it seems to do "the right thing". By default, my program should give output very similar to a manual export from Excel and could therefore result in the same data loss. Better, I think, is to use the -f option, which discards all formatting from numeric cells: you avoid 0.0123 becoming 1E-02, "1,000" staying a string, FALSE/TRUE becoming sometimes strings and sometimes numbers, and possibly other quirks I don't know about. Optionally, dates are converted to ISO format, again using the POI engine. The relevant bit of code is around here: https://github.com/dariober/excelToCsv/blob/master/src/main/java/excelToCsv/Main.java#L170


Thanks, good to know. Now that you mention POI, I am almost certain that the R package xlsx uses POI as well, which at least gives me confidence that my own scripts for handling Excel in R are not completely useless :)


I like this answer as a way to import an Excel file into a Python script or notebook for analysis. @Mensur Could you comment on how well pandas' read_excel() works? Does it have strange formatting issues with certain types of cells, fonts, or characters from Excel?


It has worked in my hands, but keep in mind that many people use space characters, commas instead of decimal points, and other types of notation that can throw off any conversion effort, no matter how carefully executed. Fortunately there are many options to customize the import - see here - and pandas will in most cases make a good guess about the type of data in each column. In the case of multiple sheets with different data structures, I think it is safest to import them one at a time.
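For example, a sketch of a few of those options (the column name and missing-value markers here are invented; check your own file for what actually applies):

import pandas as pd

df = pd.read_excel(
    'my_file.xlsx',
    dtype={'rsid': str},          # stop SNP IDs being coerced to numbers
    na_values=['NA', 'n/a', ''],  # declare your missing-value markers
    thousands=',',                # parse "1,000,000" as a number
)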


Many people I know think that dask is even better for these conversions than pandas. I don't have much experience with it, but there are solutions out there for both options.

3.3 years ago
thyleal

If you cannot open it to export to another format (csv, tab-delimited txt, etc.) for command-line processing, try an online conversion server able to handle your file size. What are the criteria for splitting? Just blocks of rows, e.g., 1-1000, 1001-2000 [...]? If so, you can try this tool: https://splitmyexcelfile.com/
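If you can export to csv first, splitting into fixed blocks of rows is also a few lines of pandas (a sketch; the file name and block size are assumptions):

import pandas as pd

# Read the csv in blocks of 1000 rows and write each block to its own file,
# so rows 1-1000 go to part 0, rows 1001-2000 to part 1, and so on.
for i, block in enumerate(pd.read_csv('snps.csv', chunksize=1000)):
    block.to_csv(f'snps_part{i:04d}.csv', index=False)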

3.3 years ago
ATpoint

I found this tidyverse tutorial helpful: https://readxl.tidyverse.org/

You can first list the sheets of the Excel file (if there are multiple), and then decide which one you want to load. There are parameters to limit the number of rows the function reads (n_max), so you can implement chunk-wise loading if memory becomes a concern.
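For anyone on the Python side, the same two steps (list the sheets, then load in chunks) have a rough pandas analogue, with nrows/skiprows playing the role of readxl's n_max (a sketch; the file and sheet names are made up):

import pandas as pd

print(pd.ExcelFile('snps.xlsx').sheet_names)  # list the sheets first

# Load one sheet in blocks of 100,000 rows to keep memory in check;
# skiprows starts at 1 so the header row is kept in every block.
chunk, start = 100_000, 0
while True:
    df = pd.read_excel('snps.xlsx', sheet_name='Sheet1',
                       skiprows=range(1, start + 1), nrows=chunk)
    if df.empty:
        break
    # ... process df ...
    start += chunk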

Unfortunately there is no function to automatically send an email with a proper rant to the authors of this Excel file of yours, telling them what a terrible idea it is to store large data in such a format rather than in a simple compressed text-based file. It's a pity :-P

