Question

Why Does Macs Use A Genome Size Of 2.7 Billion Instead Of 3 Billion For Human?

9

Entering edit mode

12.3 years ago

Ryan Thompson ★ 3.6k

MACS has the following as part of its help text:

  -g GSIZE, --gsize=GSIZE
                        Effective genome size. It can be 1.0e+9 or 1000000000,
                        or shortcuts:'hs' for human (2.7e9), 'mm' for mouse
                        (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for
                        fruitfly (1.2e8), Default:hs

Specifically, note that the human genome size is given as 2.7 billion bases. However, my understanding is that the human genome size is more like 3 billion bases. When I add up all the chromosome lengths from the UCSC fetchChromSizes utility, I get about 3.1 billion:

$ fetchChromSizes hg19 2>/dev/null | perl -lanE 'BEGIN { say my $sum = 0; } $sum += int $F[1]; END { say "Total: $sum" }'
Total: 3137161264

Even just considering the "main" chromosomes and skipping the "chrUn" and the "_random" ones, the total is still over 3 billion.

So where does MACS's 2.7 billion figure come from? Is this genome size adjusted for repetitiveness or mappability or something?

macs genome • 9.3k views

ADD COMMENT • link updated 12.3 years ago by Fidel ★ 2.0k • written 12.3 years ago by Ryan Thompson ★ 3.6k

0

Entering edit mode

I am not sure how the table is derived, but that is quite different from the number in my head. Excluding gaps, hg19 (the primary assembly only) has only 2.86Gbp sequences, so 2.7e9 amounts to 94% of human genome, instead of 88.3% in the table legend. Furthermore, with 60bp single-end reads, we can hardly access 2.7Gb sequences. With 100bp paired-end, we can only confidently map reads to 94-95% of the genome.

ADD REPLY • link 12.3 years ago by lh3 33k

0

Entering edit mode

I am not sure how the table is derived, but that is quite different from the number in my head. Excluding gaps, hg19 (chromosomes plus unplaced/unlocalized contigs) has only 2.86Gbp sequences, so 2.7e9 amounts to 94% of human genome, instead of 88.3% in the table legend. Furthermore, with 60bp single-end reads, we can hardly access 2.7Gb sequences. With 100bp paired-end, we can only confidently map reads to 94-95% of the genome.

ADD REPLY • link 12.3 years ago by lh3 33k

score 11 · Answer 1 · 2012-04-02

11

Entering edit mode

12.3 years ago

Istvan Albert 101k

Is this genome size adjusted for repetitiveness or mappability or something?

Yes.

See the MACS manual:

It's the mappable genome size or effective genome size which is defined as the genome size which can be sequenced. Because of the repetitive features on the chromsomes, the actual mappable genome size will be smaller than the original size, about 90% or 70% of the genome size. The default hs — 2.7e9 is recommended for UCSC human hg18 assembly. Here are all precompiled parameters for effective genome size

http://liulab.dfci.harvard.edu/MACS/00README.html

ADD COMMENT • link 12.3 years ago by Istvan Albert 101k

0

Entering edit mode

If I took all the long runs of N's out of the human genome sequences and counted the number of remaining nucleotides, would I get about 2.7e9? Or is it more complicated than that? Does repetitiveness or redundancy come into play (for example, do you only the X/Y pseudoautosomal region once instead of twice)?

ADD REPLY • link 12.3 years ago by Ryan Thompson ★ 3.6k

0

Entering edit mode

I have not looked at that myself - just took their word for it.

ADD REPLY • link 12.3 years ago by Istvan Albert 101k

0

Entering edit mode

The PeakSeq publication has a table with different 'effective' genome sizes. See http://www.nature.com/nbt/journal/v27/n1/fig_tab/nbt.1518_T1.html

ADD REPLY • link 12.3 years ago by Fidel ★ 2.0k

score 5 · Answer 2 · 2012-04-03

5

Entering edit mode

12.3 years ago

Fidel ★ 2.0k

MACS uses the 'effective' genome size to compute the probability of detecting a read, considering that only mappable reads were taken into account. Thus, it assumes that you left out from your mapping all reads that map to more than one position.

This parameter is also specific for the read length as the fraction of the genome that is unmappable, decreases for longer reads.

The PeakSeq publication has a table where they present different genome sizes together with the 'effective' or mappable fraction. See http://www.nature.com/nbt/journal/v27/n1/fig_tab/nbt.1518_T1.html This table contains the genome mappability of different common species considering reads of 30 nt. The table caption includes the mappable fraction of the human genome for different read lengths. Notably, the effective genome size of 2.7e9 corresponds to a read length of 60nt.

ADD COMMENT • link 12.3 years ago by Fidel ★ 2.0k

2

Entering edit mode

I am not sure how the table is derived, but that is quite different from the number in my head. Excluding gaps, hg19 (chromosomes plus unplaced/unlocalized contigs) has only 2.86Gbp sequences, so 2.7e9 amounts to 94% of human genome, instead of 88.3% in the table legend. Furthermore, with 60bp single-end reads, we can hardly access 2.7Gb sequences. With 100bp paired-end, we can only confidently map reads to 94-95% of the genome.

ADD REPLY • link 12.3 years ago by lh3 33k

0

Entering edit mode

Thanks for that. This is quite informative, especially because it shows that values to use for a given read size.

ADD REPLY • link 12.3 years ago by Ryan Thompson ★ 3.6k