Question: Why does MACS use a genome size of 2.7 billion instead of 3 billion for human?
Asked 7.3 years ago by Ryan Thompson (TSRI, La Jolla, CA):

MACS has the following as part of its help text:

  -g GSIZE, --gsize=GSIZE
                        Effective genome size. It can be 1.0e+9 or 1000000000,
                        or shortcuts:'hs' for human (2.7e9), 'mm' for mouse
                        (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for
                        fruitfly (1.2e8), Default:hs

Specifically, note that the human genome size is given as 2.7 billion bases. However, my understanding is that the human genome size is more like 3 billion bases. When I add up all the chromosome lengths from the UCSC fetchChromSizes utility, I get about 3.1 billion:

$ fetchChromSizes hg19 2>/dev/null | perl -lanE '$sum += $F[1]; END { say "Total: $sum" }'
Total: 3137161264

Even counting only the "main" chromosomes and skipping the "chrUn" and "_random" sequences, the total is still over 3 billion.

So where does MACS's 2.7 billion figure come from? Is this genome size adjusted for repetitiveness or mappability or something?


I am not sure how the table is derived, but that is quite different from the number in my head. Excluding gaps, hg19 (chromosomes plus unplaced/unlocalized contigs) has only about 2.86 Gbp of sequence, so 2.7e9 amounts to 94% of the human genome, rather than the 88.3% in the table legend. Furthermore, with 60 bp single-end reads we can hardly access 2.7 Gb of sequence; even with 100 bp paired-end reads we can only confidently map to 94-95% of the genome.

Reply written 7.3 years ago by lh3
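The percentages in lh3's comment are easy to reproduce from the figures quoted in this thread (a quick arithmetic check, not authoritative genome sizes):

```python
# Genome sizes quoted in this thread (hg19)
gapped_total = 3_137_161_264  # sum of fetchChromSizes lengths, including gaps
ungapped = 2.86e9             # lh3's gap-excluded figure
effective = 2.7e9             # MACS default for 'hs'

print(f"effective / ungapped: {effective / ungapped:.1%}")     # ~94.4%
print(f"effective / total:    {effective / gapped_total:.1%}")  # ~86.1%
```

So 2.7e9 is about 94% of the non-gap sequence, but only about 86% of the full assembly length, which is part of why the 88.3% table figure is confusing.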

Answer written 7.3 years ago by Istvan Albert (University Park, USA):

Is this genome size adjusted for repetitiveness or mappability or something?

Yes.

See the MACS manual:

It's the mappable genome size or effective genome size, which is defined as the genome size which can be sequenced. Because of the repetitive features on the chromosomes, the actual mappable genome size will be smaller than the original size, about 90% or 70% of the genome size. The default hs — 2.7e9 is recommended for the UCSC human hg18 assembly. Here are all precompiled parameters for effective genome size:

http://liulab.dfci.harvard.edu/MACS/00README.html

Written 7.3 years ago by Istvan Albert

If I took all the long runs of N's out of the human genome sequence and counted the number of remaining nucleotides, would I get about 2.7e9? Or is it more complicated than that? Does repetitiveness or redundancy come into play (for example, do you count the X/Y pseudoautosomal region only once instead of twice)?

Reply written 7.3 years ago by Ryan Thompson
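The N-removal idea in the comment above can be tested directly: assembly gaps in a reference FASTA are represented as runs of N, so counting non-N bases approximates the ungapped size. A minimal sketch in plain Python (no validation, single file assumed):

```python
def non_n_bases(fasta_path):
    """Count bases other than N/n in a FASTA file. Assembly gaps are
    runs of N, so this approximates the 'ungapped' genome size."""
    total = 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                continue  # skip sequence headers
            seq = line.strip()
            total += len(seq) - seq.upper().count("N")
    return total
```

Run against the hg19 FASTA, this should land near lh3's ~2.86 Gbp figure, which is still above 2.7e9; the remaining difference is mappability, not N's.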

I have not looked at that myself - just took their word for it.

Reply written 7.3 years ago by Istvan Albert

The PeakSeq publication has a table with different 'effective' genome sizes. See http://www.nature.com/nbt/journal/v27/n1/fig_tab/nbt.1518_T1.html

Reply written 7.3 years ago by Fidel
Answer written 7.3 years ago by Fidel (Germany):

MACS uses the 'effective' genome size to compute the probability of observing a read at a given position, given that only mappable reads are taken into account. That is, it assumes you excluded from your alignment all reads that map to more than one position.
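To illustrate how the effective genome size enters such a calculation: under a uniform background, the expected read count in a window is total_reads × window / gsize, and a Poisson tail probability scores enrichment above that. This is a simplified sketch of the idea, not MACS's actual model (MACS uses a dynamic local lambda):

```python
import math

def background_lambda(total_reads, window, effective_gsize=2.7e9):
    """Expected reads in a window under a uniform background. Using the
    effective size (2.7e9) instead of the full ~3.1e9 raises the
    expectation, since unmappable regions cannot contribute reads."""
    return total_reads * window / effective_gsize

def poisson_sf(k, lam):
    """P(X > k) for X ~ Poisson(lam): the tail probability used to score
    an observed read count against the background expectation."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k + 1))
    return 1.0 - cdf

lam = background_lambda(20_000_000, 300)  # 20M reads, 300 bp window -> ~2.2
```

Using 3.1e9 here instead of 2.7e9 would shrink lambda and make every window look slightly more enriched than it really is, which is why the mappable size matters.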

This parameter is also specific to the read length, since the fraction of the genome that is unmappable decreases as reads get longer.

The PeakSeq publication has a table presenting different genome sizes together with the 'effective' or mappable fraction: http://www.nature.com/nbt/journal/v27/n1/fig_tab/nbt.1518_T1.html This table gives the genome mappability of several common species for 30 nt reads, and its caption lists the mappable fraction of the human genome for several read lengths. Notably, the effective genome size of 2.7e9 corresponds to a read length of 60 nt.
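The read-length dependence can be seen with a toy model: call a position mappable if the length-k read starting there occurs exactly once in the genome. A single-strand sketch (real mappability tracks also account for the reverse strand and mismatches):

```python
from collections import Counter

def mappable_fraction(genome, k):
    """Fraction of positions whose length-k substring is unique in the
    genome: a toy, single-strand stand-in for mappability."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return sum(1 for km in kmers if counts[km] == 1) / len(kmers)

toy = "ACGTACGT"  # a tandem repeat: short reads cannot be placed uniquely
print(mappable_fraction(toy, 4))  # 0.6 -- 'ACGT' occurs twice
print(mappable_fraction(toy, 5))  # 1.0 -- longer reads resolve the repeat
```

The same effect at genome scale is why the PeakSeq table reports a larger mappable fraction for longer reads.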

Written 7.3 years ago by Fidel

Thanks for that. This is quite informative, especially because it shows the values to use for a given read length.

Reply written 7.3 years ago by Ryan Thompson