LDSC - IOError: Not a gzipped file

3.2 years ago

d.s.zimmerman • 0

Hi All,

I am trying to estimate genetic correlations between a few sets of summary statistics using ldsc (https://github.com/bulik/ldsc), but I keep getting the same error. Munging the data worked fine and produced the .gz files; however, when I run the genetic correlation analysis (--rg), I get the error in the title. Below are the command and the full error:

(ldsc) dominicz@login3:/home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/FORMAT$ /home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/ldsc/ldsc.py \
--rg /home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/FORMAT/SCAControlsHRCImpute.all.sumstats.gz,/home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/FORMAT/Barc_BrS_2020.all.sumstats.gz \
--ref-ld-chr /home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/eur_w_ld_chr/ \
--no-intercept \
--w-ld-chr /home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/eur_w_ld_chr/ \
--out /home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/CORRELATION/Barc.SCA

LD Score Regression (LDSC)
Version 1.0.1
(C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
Broad Institute of MIT and Harvard / MIT Department of Mathematics
GNU General Public License v3

Call:
./ldsc.py \
--ref-ld-chr /home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/eur_w_ld_chr/ \
--out /home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/CORRELATION/Barc.SCA \
--rg /home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/FORMAT/SCAControlsHRCImpute.all.sumstats.gz,/home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/FORMAT/Barc_BrS_2020.all.sumstats.gz \
--no-intercept \
--w-ld-chr /home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/eur_w_ld_chr/

Beginning analysis at Thu Feb 4 09:29:52 2021
Reading summary statistics from /home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/FORMAT/SCAControlsHRCImpute.all.sumstats.gz ...
Traceback (most recent call last):
  File "/home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/ldsc/ldsc.py", line 642, in <module>
    sumstats.estimate_rg(args, log)
  File "/home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/ldsc/ldscore/sumstats.py", line 397, in estimate_rg
    alleles=True, dropna=True)
  File "/home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/ldsc/ldscore/sumstats.py", line 242, in _read_ld_sumstats
    sumstats = _read_sumstats(args, log, fh, alleles=alleles, dropna=dropna)
  File "/home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/ldsc/ldscore/sumstats.py", line 163, in _read_sumstats
    sumstats = ps.sumstats(fh, alleles=alleles, dropna=dropna)
  File "/home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/ldsc/ldscore/parse.py", line 89, in sumstats
    x = read_csv(fh, usecols=usecols, dtype=dtype_dict, compression=compression)
  File "/home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/ldsc/ldscore/parse.py", line 21, in read_csv
    return pd.read_csv(fh, delim_whitespace=True, na_values='.', **kwargs)
  File "/home/dominicz/.conda/envs/ldsc/lib/python2.7/site-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/dominicz/.conda/envs/ldsc/lib/python2.7/site-packages/pandas/io/parsers.py", line 405, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/dominicz/.conda/envs/ldsc/lib/python2.7/site-packages/pandas/io/parsers.py", line 762, in __init__
    self._make_engine(self.engine)
  File "/home/dominicz/.conda/envs/ldsc/lib/python2.7/site-packages/pandas/io/parsers.py", line 966, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/dominicz/.conda/envs/ldsc/lib/python2.7/site-packages/pandas/io/parsers.py", line 1582, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 562, in pandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:6175)
  File "pandas/_libs/parsers.pyx", line 751, in pandas._libs.parsers.TextReader._get_header (pandas/_libs/parsers.c:9268)
  File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)
  File "pandas/_libs/parsers.pyx", line 2173, in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28589)
IOError: Not a gzipped file

Analysis finished at Thu Feb 4 09:29:52 2021
Total time elapsed: 0.01s

Does anyone know what I may be doing wrong? The files I am using are definitely gzipped.

Thanks in advance!

Dominic

ldsc genetic correlation software error • 1.5k views

updated 3.2 years ago by Ram 43k • written 3.2 years ago by d.s.zimmerman • 0

What's the output of the gzip -t file command?

3.2 years ago by brunobsouzaa ▴ 830

Hi

dominicz@login3:/home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/FORMAT$ gzip -t Barc_BrS_2020.all.sumstats.gz
gzip: Barc_BrS_2020.all.sumstats.gz: not in gzip format

Interesting, so it is not in gzip format then? I was under the impression a .gz file was a gzipped file. Would I remedy this by unzipping then gzipping the files?
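
A .gz extension is only a name; what makes a file gzipped is its content, which always begins with the two bytes 1f 8b. As a rough sketch (the function name `looks_gzipped` is made up for illustration and is not part of ldsc), the same question that gzip -t answers can be asked from Python. Note that this only inspects the header, whereas gzip -t also checks the whole stream for corruption:

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number.

    Hypothetical helper: it ignores the file extension and only looks at
    the first two bytes, so it cannot detect truncation or a bad checksum.
    """
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

If a call such as `looks_gzipped("Barc_BrS_2020.all.sumstats.gz")` returns False, the file is most likely plain text (or something else) that merely carries a .gz name.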

updated 3.2 years ago by Ram 43k • written 3.2 years ago by d.s.zimmerman • 0

Did you download this file from the web, or did you create it yourself? I would suggest getting the original file and gzipping it again!

Your file seems to be corrupted, which is why it's not being recognized as a gzipped file!

3.2 years ago by brunobsouzaa ▴ 830

The file is the output of ldsc's own munge_sumstats.py script, which puts summary statistics files into the format required to run an ldsc analysis. They refer to this as 'munging' the data.

Here is the log of that process:

LD Score Regression (LDSC)
Version 1.0.1
(C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
Broad Institute of MIT and Harvard / MIT Department of Mathematics
GNU General Public License v3

Call:
./munge_sumstats.py \
--out /home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/FORMAT/Barc_BrS_2020.all \
--merge-alleles /home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/w_hm3.snplist \
--sumstats /home/expcard/Projects/GWAS_SCA/GWAS_NTR/SUMSTATS/Barc_BrS_2020

Interpreting column names as follows:
snpid: Variant ID (e.g., rs number)
n: Sample size
a1: Allele 1, interpreted as ref allele for signed sumstat.
pval: p-Value
beta: [linear/logistic] regression coefficient (0 --> no effect; above 0 --> A1 is trait/risk increasing)
a2: Allele 2, interpreted as non-ref allele for signed sumstat.

Reading list of SNPs for allele merge from /home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from /home/expcard/Projects/GWAS_SCA/GWAS_NTR/SUMSTATS/Barc_BrS_2020 into memory 5000000 SNPs at a time.
Read 7065288 SNPs from --sumstats file.
Removed 5911803 SNPs not in --merge-alleles.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 287 variants that were not SNPs or were strand-ambiguous.
1153198 SNPs remain.
Removed 4747 SNPs with duplicated rs numbers (1148451 SNPs remain).
Removed 0 SNPs with N < 8547.33333333 (1148451 SNPs remain).
Median value of beta was 0.0, which seems sensible.
Removed 179 SNPs whose alleles did not match --merge-alleles (1148272 SNPs remain).
Writing summary statistics for 1217311 SNPs (1148272 with nonmissing beta) to /home/expcard/Projects/GWAS_SCA/GWAS_NTR/LDSC/FORMAT/Barc_BrS_2020.all.sumstats.gz.

Metadata:
Mean chi^2 = 1.102
Lambda GC = 1.048
Max chi^2 = 693.076
258 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Tue Feb 2 23:14:10 2021
Total time elapsed: 24.13s

I gzipped the .gz files (which produced .gz.gz files), and these actually worked with the software!
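
That gzipping the files again helped, together with the gzip -t result above, suggests the munged .sumstats.gz files were uncompressed text under a .gz name; this is an inference from the thread, not something the logs state outright. A minimal sketch of a tidier repair, assuming Python 3 and a made-up helper name `ensure_gzipped`, compresses such a file in place so that it keeps its single .gz extension:

```python
import gzip
import os
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def ensure_gzipped(path):
    """Compress `path` in place unless it already starts with the gzip magic.

    Hypothetical helper, not part of ldsc. Returns True if the file was
    rewritten and False if it was left untouched.
    """
    with open(path, "rb") as handle:
        if handle.read(2) == GZIP_MAGIC:
            return False  # already a gzip stream; do not double-compress

    # Write the compressed copy next to the original, then swap it in,
    # so a failure midway never leaves a half-written .gz behind.
    tmp_path = path + ".tmp"
    with open(path, "rb") as src, gzip.open(tmp_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.replace(tmp_path, path)
    return True
```

Running `ensure_gzipped` on each munged file before calling ldsc.py with --rg should have the same effect as the .gz.gz workaround while leaving the file names as the munge log reports them. Calling it twice is harmless, because the second call sees the magic bytes and returns False.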

3.2 years ago by d.s.zimmerman • 0