Question

Could you explain the percentages of A, C, G and T in hg19?

0

8.2 years ago

sacha ★ 2.4k

This may be a stupid question, but I wonder why the amount of A is close to the amount of T, and the same for C and G.

I downloaded the hg19 genome, which contains the sequence of one strand of each chromosome. So this has nothing to do with complementarity, because I only have one strand!

I wrote a script which counts each base in hg19, and I get the following results:

a       854963149
c       592966724
g       593325228
t       856055361
n       239850802

total   3137161264

And as you can see, A is close to T (approx. 27% each) and C is close to G (approx. 19% each).

I expected different values for each base, or 25% for each one! So, could you explain why?
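For anyone who wants to reproduce these numbers: the original script was not posted, so the following is only a minimal Python sketch of how such a count could be done. It assumes a plain FASTA file; the file name `hg19.fa` and the helper names `count_bases` and `percentages` are hypothetical. The second half simply turns the counts posted above into percentages of the total.

```python
from collections import Counter

def count_bases(fasta_lines):
    """Count bases (case-insensitive) in an iterable of FASTA lines, skipping headers."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith('>'):   # header line, e.g. ">chr1"
            continue
        counts.update(line.strip().lower())
    return counts

def percentages(counts):
    """Convert raw counts into percentages of the total."""
    total = sum(counts.values())
    return {base: 100.0 * n / total for base, n in counts.items()}

# Counts copied from the post above (N is included in the total).
posted = {'a': 854963149, 'c': 592966724, 'g': 593325228,
          't': 856055361, 'n': 239850802}
for base, pct in sorted(percentages(posted).items()):
    print(base, round(pct, 2))

# Hypothetical usage on a real file:
# with open('hg19.fa') as handle:
#     print(count_bases(handle))
```

Note that these percentages are relative to the full length including N. Excluding N from the denominator raises all four values but leaves the A ≈ T and C ≈ G pattern unchanged.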

GC percent hg19 • 3.3k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.2 years ago by sacha ★ 2.4k

1

This is related to Chargaff's second rule. Have a look at the Wikipedia article.

ADD REPLY • link 8.2 years ago by russhh 5.7k

0

You are right! Sorry! I can explain the first rule, but cannot explain the second one. I'm reading it:

Second parity rule

The second rule holds that both %A ~ %T and %G ~ %C are valid for each of the two DNA strands.[3] This describes only a global feature of the base composition in a single DNA strand.[4]

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by sacha ★ 2.4k

0

If you read further, the page says there is evidence from 2006 that the second rule was proven for major kingdoms. I guess you could dig a bit deeper to examine the actual evidence, but the statement lends itself to easy understanding, so there should be no problem comprehending it. "%A~%T and %G~%C" is the logical representation of "In most ~~eukaryotic~~ single DNA strands, the percentage of A is approximately equal to the percentage of T, and the percentage of G is approximately equal to the percentage of C"

ADD REPLY • link 4.3 years ago by Ram 43k

0

hi,

look here - https://en.wikipedia.org/wiki/Chargaff%27s_rules

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by Amitm ★ 2.2k

0

Chargaff's experiment was done on the whole genome, which contains both the + and - strands. In that case, complementarity explains why the A/T and C/G ratios are constant.

This is not the same as my example. I am working on a single strand!

ADD REPLY • link 8.2 years ago by sacha ★ 2.4k

1

Sacha, you're referring to his first rule

ADD REPLY • link 8.2 years ago by russhh 5.7k

Ram · Answer 1 · 2016-01-22

2

8.2 years ago

matted 7.8k

It has to do with the fact that there is no "true" underlying correct strand to choose when you consider each of the human chromosomes. If you consider a double-stranded DNA molecule in nature, you have two choices of single-stranded sequence to report. If you take one chromosome and count the bases on the forward strand and then do the same for the reverse strand, the A and T counts swap, and so do the G and C counts. Since chromosome "directions" (strands) are chosen randomly or for historical reasons, after a few chromosomes the numbers even out: the fraction of A's matches the fraction of T's and the fraction of G's matches the fraction of C's.

And I guess that's not the underlying reason, though hopefully the argument is clear enough. I think the underlying idea is that, at least at a larger scale, information content is just as likely to be encoded on one strand as the other (which makes sense, since if you start from a double-stranded genome there's no consistent molecular way to prefer one over the other). If the information content or processing logic is shared between strands, as we assume it is, it must lead to approximately symmetric A-T and G-C fractions.

A concrete example of this general claim is that genes are equally likely to be on the forward strand as the reverse strand, and the amino acid code is the same no matter what strand a gene is on. Therefore, if we believe those assumptions, we see how Chargaff's second law would hold (at least for coding DNA).
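The strand-swapping step in this argument can be checked with a tiny Python sketch. The sequence and the helper names are made up for illustration. It only shows that reverse-complementing a strand swaps the A and T counts and the G and C counts.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def base_counts(seq):
    """Count each of the four bases in a sequence."""
    return {base: seq.count(base) for base in 'ACGT'}

forward = 'AAAAGCCCT'                  # made-up, deliberately skewed sequence
reverse = reverse_complement(forward)

print(forward, base_counts(forward))
print(reverse, base_counts(reverse))
```

Whatever skew one strand has, the other strand has the mirror-image skew, so reporting a mixture of the two orientations pulls the totals toward A ≈ T and C ≈ G.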

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by matted 7.8k

1

I would have thought DNA composition would have played some role (for example, if 10% of the genome came from different organisms, with a different codon bias, that might result in a non-50:50 ratio even if that 10% had a random insertion direction).

I like this idea, because it would seem to follow that if codon bias affected the 50:50 ratio, then it would explain why organisms that do horizontal gene transfer break Chargaff's second parity rule, and why organelles (genomes that are highly dependent on, and interact with, an "external" genome) are the worst offenders. In short, my theory would be that the more your genome interacts with others, the less optimised your codon/nucleotide ratios are, and the further you deviate from 50:50 even in the presence of random insertion direction. So is matted's random-insertion-direction argument sufficient to give a 50:50 ratio?

The code below tests random 50:50 reverse-complementing of DNA with user-editable codon usage:

import random

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def rc(dna):
    # Reverse complement: complement every base, then reverse the string.
    return dna.translate(COMPLEMENT)[::-1]

# A deliberately skewed "genome" of codons (edit the counts to change codon usage).
genome = (
    ['AAA'] * 1000 +
    ['AAT'] * 10000 +
    ['ATT'] * 100000 +
    ['TTT'] * 1000000 +
    ['CCC'] * 1000000 +
    ['CCG'] * 100000 +
    ['CGG'] * 10000 +
    ['GGG'] * 1000
)

before = ''.join(genome)
A, C, G, T = (before.count(base) for base in 'ACGT')
print('Before random insertion:')
print(A / (A + T), C / (C + G))

# Reverse-complement each codon with probability 0.5.
for position, codon in enumerate(genome):
    if random.random() < 0.5:
        genome[position] = rc(codon)

after = ''.join(genome)
A, C, G, T = (after.count(base) for base in 'ACGT')
print('After random insertion:')
print(A / (A + T), C / (C + G))

Sample output:

Before random insertion:
0.036903690369 0.963096309631

After random insertion:
0.500247890819 0.499687968797

So as you can see, even if you start with a highly 'deviant' genome, you still end up with near 50:50 ratios if every codon has a 50% chance of being reverse-complemented. Therefore, I would say my idea of codon usage is false, and matted's answer is correct and sufficient to explain Chargaff's second rule :)

It also stops working when the number of codons gets small, obviously, because if you only had a single, highly biased gene, it doesn't matter how you orient it: it's going to result in a highly biased nucleotide composition. In this test I'm flipping every codon, but in nature you can only flip whole genes. So this suggests that the genomes that break the second rule probably do so because of the number and size of their genes, non-random insertion direction, or bad luck.
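To illustrate the point about small numbers of units, here is a rough Python sketch that flips whole "genes" instead of single codons. Everything in it is an assumption made for illustration: each hypothetical gene has the same length and the same A fraction, contains only A and T, and is flipped with probability 0.5 using a fixed seed so the run is repeatable.

```python
import random

def at_ratio_after_flipping(n_genes, gene_len=300, a_fraction=0.1, seed=0):
    """Return A/(A+T) after reverse-complementing each whole gene with probability 0.5.

    Every hypothetical gene holds only A and T, with a fixed fraction of A.
    """
    rng = random.Random(seed)
    a_per_gene = int(gene_len * a_fraction)
    t_per_gene = gene_len - a_per_gene
    a_total = t_total = 0
    for _ in range(n_genes):
        if rng.random() < 0.5:
            # Reverse-complementing a gene swaps its A and T counts.
            a_total += t_per_gene
            t_total += a_per_gene
        else:
            a_total += a_per_gene
            t_total += t_per_gene
    return a_total / (a_total + t_total)

for n in (1, 10, 100, 10000):
    print(n, round(at_ratio_after_flipping(n), 3))
```

With a single gene the ratio can only be the original bias or its mirror image. The ratio approaches 0.5 only as the number of independently oriented units grows, which is consistent with the idea that gene number and size matter.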

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by John 13k

Ram · Answer 2 · 2016-01-22

0

8.2 years ago

Joseph Hughes ★ 3.0k

A lot of eukaryotic genomes are AT-rich (i.e. GC-poor), whereas bacteria tend to be GC-rich. The GC content also varies from chromosome to chromosome, as shown in this post.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by Joseph Hughes ★ 3.0k

Ram · Answer 3 · 2016-01-25

0

8.2 years ago

sacha ★ 2.4k

The author of this page has developed an idea to explain Chargaff's second rule using bioinformatics: http://www.basic.northwestern.edu/g-buehler/genomes/g_chargaff.htm

Briefly, the proposed reason is an exchange of material between the two strands.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by sacha ★ 2.4k