Crazy high file size for human genome from Ensembl
4.6 years ago
Adrian Pelin ★ 2.6k

Hello,

When I download the release 95 soft-masked file "ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz" it is ~1 GB compressed, but once decompressed it grows to 54 GB.

This is curious, since the equivalent soft-masked mouse genome is only 2.7 GB decompressed. Any idea why the human genome file is so large, and is there a tool to reformat the FASTA file into a smaller one?

Thanks, A

Ensembl • 2.0k views
4.6 years ago
Emily 23k

The human GRCh38 assembly contains a huge number of alternative haplotypes. In the toplevel DNA sequence files, each of these is represented as a whole chromosome in which only the haplotype region is actual sequence and the rest is padded with Ns. This is why the compressed files aren't huge, since a long run of Ns compresses to almost nothing, while the decompressed files are massive, with every N written out.
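To illustrate why N padding barely affects the compressed size, here is a minimal sketch that gzips one million Ns and one million random bases and compares the results. The random sequence is only a stand-in for real genomic sequence (an assumption for illustration), and the names `compressed_size`, `padding` and `real_like` are made up for this example.

```python
import gzip
import random


def compressed_size(sequence: str) -> int:
    """Return the size in bytes of the gzip-compressed sequence."""
    return len(gzip.compress(sequence.encode("ascii")))


length = 1_000_000
random.seed(0)  # fixed seed so repeated runs give the same numbers

# A run of Ns, like the padding around a haplotype in the toplevel file.
padding = "N" * length

# Random bases as a rough stand-in for real sequence (illustrative only).
real_like = "".join(random.choice("ACGT") for _ in range(length))

n_size = compressed_size(padding)
seq_size = compressed_size(real_like)

print(f"{length} Ns compress to {n_size} bytes")
print(f"{length} random bases compress to {seq_size} bytes")
```

The run of Ns should shrink to a tiny fraction of the size of the random bases, which matches the small download and very large decompressed file described above.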


That makes a lot of sense! So if I download the "primary_assembly" file instead, I would get a smaller decompressed file, at the expense of losing the haplotypes. Is that a fair prediction? Thanks


Yes, that is absolutely right.
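For anyone who wants to check this on their own download, here is a rough sketch that streams a gzipped FASTA file and reports, per record, the length and how much of it is N. The helper names `n_content` and `report` are hypothetical, and both upper- and lower-case N are counted as an assumption, since the file is soft-masked.

```python
import gzip
from typing import Iterable, Iterator, Tuple


def n_content(lines: Iterable[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (record_name, length, n_count) for each FASTA record."""
    name = None
    length = 0
    n_count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, length, n_count
            parts = line[1:].split()
            name = parts[0] if parts else ""
            length = 0
            n_count = 0
        elif name is not None:
            length += len(line)
            # Count both cases in case padding appears in lower case.
            n_count += line.count("N") + line.count("n")
    if name is not None:
        yield name, length, n_count


def report(path: str) -> None:
    """Print name, length, N count and N fraction for a gzipped FASTA."""
    with gzip.open(path, "rt") as handle:
        for name, length, n_count in n_content(handle):
            fraction = n_count / length if length else 0.0
            print(f"{name}\t{length}\t{n_count}\t{fraction:.3f}")
```

Calling `report("Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz")` reads the file line by line without decompressing it to disk. Going by the explanation above, the haplotype records should show up as chromosome-length entries that are almost entirely N.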
