Could you explain the percentages of A, C, G and T in hg19?
3
0
7.2 years ago
sacha ★ 2.4k

This may be a stupid question, but I wonder why the amount of A is so close to the amount of T, and likewise for C and G.

I downloaded the hg19 genome, which contains the sequence of only one strand of each chromosome. So this has nothing to do with complementarity, because I only have one strand!

I wrote a script that counts each base in hg19, and I get the following results:

a       854963149
c       592966724
g       593325228
t       856055361
n       239850802

total   3137161264


As you can see, A is close to T (approx. 27% each) and C is close to G (approx. 19% each).

I expected either different values for each base, or 25% for each one! So, could you explain why?
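
For anyone wanting to reproduce counts like these, here is a minimal sketch of how such a script might look. It is not the script used for the table above: the names `count_bases` and `percentages` are made up for illustration, and it assumes a plain FASTA input in which lower-case (soft-masked) bases should be counted together with upper-case ones. The last lines turn the counts reported above into percentages of the non-N bases.

```python
from collections import Counter

def count_bases(fasta_lines):
    """Count bases in FASTA-formatted lines, skipping '>' header lines."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith('>'):
            continue
        # Lower-case everything so soft-masked bases are not counted separately
        counts.update(line.strip().lower())
    return counts

def percentages(counts):
    """Convert a base -> count mapping into base -> percent of the total."""
    total = sum(counts.values())
    return {base: 100.0 * n / total for base, n in counts.items()}

# Tiny made-up input; for hg19 one would pass an open file handle instead
example = ['>chrTest', 'ACGTNNacgt', 'AATT']
counts = count_bases(example)
print(counts)

# Counts reported in the question, with the N positions left out
reported = {'a': 854963149, 'c': 592966724, 'g': 593325228, 't': 856055361}
pct = percentages(reported)
for base in 'acgt':
    print(base, round(pct[base], 2))
```

Leaving out the N positions changes the absolute percentages but not the pattern: A still tracks T and C still tracks G.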

GC percent hg19 • 2.4k views
1

This is related to Chargaff's second rule. Have a look at the Wikipedia article.

0

You are right! Sorry! I can explain the first rule, but I cannot explain the second one. I'm reading it:

Second parity rule

The second rule holds that both %A ~ %T and %G ~ %C are valid for each of the two DNA strands.[3] This describes only a global feature of the base composition in a single DNA strand.[4]

0

If you read further, the page says there is evidence from 2006 that the second rule was proven for major kingdoms. I guess you could dig a bit deeper to examine the actual evidence, but the statement lends itself to easy understanding, so there should be no problem comprehending it. "%A~%T and %G~%C" is the logical representation of "In most eukaryotic single DNA strands, the percentage of A is approximately equal to the percentage of T, and the percentage of G is approximately equal to the percentage of C"
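
To put a number on "approximately equal", one common measure is the skew, (x - y) / (x + y), which is 0 at perfect parity. The sketch below applies it to the single-strand counts from the question; the helper name `skew` is just an illustrative choice.

```python
def skew(x, y):
    """Relative imbalance (x - y) / (x + y); 0 means perfect parity."""
    return (x - y) / (x + y)

# Single-strand hg19 counts reported in the question
counts = {'a': 854963149, 'c': 592966724, 'g': 593325228, 't': 856055361}

at_skew = skew(counts['a'], counts['t'])
gc_skew = skew(counts['g'], counts['c'])
print(f'AT skew: {at_skew:+.5f}')
print(f'GC skew: {gc_skew:+.5f}')
```

Both values come out at well under 0.1% in magnitude, which is what the second rule predicts for a single strand.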

0

hi,

0

Chargaff's experiment was done on the whole genome, which contains both the + and - strands. In that case, complementarity explains why the A/T and C/G ratios are constant.

That is not the same as my example. I am working on a single strand!

1

Sacha, you're referring to his first rule

2
7.2 years ago
matted 7.7k

It has to do with the fact that there is no "true" underlying correct strand to choose when you consider each of the human chromosomes. If you consider a double-stranded DNA molecule in nature, to get a single-stranded sequence you have two choices to consider. If you take one chromosome and count the bases on the forward strand and then also do that for the reverse strand, the A and T counts would swap and the G and C counts would swap as well. Since chromosome "directions" (strands) are chosen randomly or by historical choices, after a few chromosomes the numbers even out and the fraction of A's matches the fraction of T's and the fraction of G's matches the fraction of C's.
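
The swap described above is easy to check on a toy example. This is a minimal sketch with a made-up sequence; `reverse_complement` is an illustrative helper, not part of any particular library.

```python
from collections import Counter

# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

forward = 'AAAAACCGT'            # made-up, A-rich sequence
reverse = reverse_complement(forward)

fwd_counts = Counter(forward)
rev_counts = Counter(reverse)
print(forward, dict(fwd_counts))
print(reverse, dict(rev_counts))
```

The A count on one strand is the T count on the other, and likewise for C and G, so which strand ends up in the reference file decides which of each pair looks more abundant.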

And I guess that's not the underlying reason, though hopefully the argument is clear enough. I think the underlying idea is that, at least at a larger scale, information content is just as likely to be encoded on one strand as the other (which makes sense, since if you start from a double-stranded genome there's no consistent molecular way to prefer one over the other). If the information content or processing logic is shared between strands, as we assume it is, it must lead to approximately symmetric A-T and G-C fractions.

A concrete example of this general claim is that genes are equally likely to be on the forward strand as the reverse strand, and the amino acid code is the same no matter what strand a gene is on. Therefore, if we believe those assumptions, we see how Chargaff's second law would hold (at least for coding DNA).

1

I would have thought DNA composition would have played some role (for example, if 10% of the genome came from different organisms, with a different codon bias, that might result in a non-50:50 ratio even if that 10% had a random insertion direction).

I like this idea, because it would seem to follow that if codon bias affected the 50:50 ratio, then it would explain why organisms that do horizontal gene transfer break Chargaff's second parity rule, and why organelles (genomes that are highly dependent on, and interact with, an "external" genome) are the worst offenders. In short, my theory would be that the more your genome interacts with others, the less optimised your codon/nucleotide ratios are, and the further you deviate from 50:50 even in the presence of random insertion direction. So is matted's random-insertion-direction sufficient to give a 50:50 ratio?

The code below tests random 50:50 reverse-complementing of DNA with user-editable codon usage:

import random

# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def rc(DNA):
    # Reverse complement: complement every base, then reverse the string
    return DNA.translate(COMPLEMENT)[::-1]

# Deliberately skewed codon usage; edit the counts to try other compositions
genome = (
    ['AAA'] * 1000 +
    ['AAT'] * 10000 +
    ['ATT'] * 100000 +
    ['TTT'] * 1000000 +
    ['CCC'] * 1000000 +
    ['CCG'] * 100000 +
    ['CGG'] * 10000 +
    ['GGG'] * 1000
)

before = ''.join(genome)
A, C, G, T = before.count('A'), before.count('C'), before.count('G'), before.count('T')
print('Before random insertion:')
print(A / (A + T), C / (C + G))

for position, codon in enumerate(genome):
    if random.randint(1, 100) > 50:      # Flip codon 50%
        genome[position] = rc(codon)     #  of the time

after = ''.join(genome)
A, C, G, T = after.count('A'), after.count('C'), after.count('G'), after.count('T')
print('After random insertion:')
print(A / (A + T), C / (C + G))


Sample output:

Before random insertion:
0.036903690369 0.963096309631

After random insertion:
0.500247890819 0.499687968797


So as you can see, even if you start with a highly 'deviant' genome, you still end up with near 50:50 ratios if every codon has a 50% chance of being reverse-complemented. Therefore, I would say my idea about codon usage is false, and matted's answer is correct and sufficient to explain Chargaff's second rule :)

It also stops working when the number of codons gets small. Obviously, if you only had a single, highly biased gene, it doesn't matter how you orient it: it's going to result in a highly biased nucleotide composition. In this test I'm flipping every codon, but in nature you can only flip whole genes. So this suggests that genomes that break the second rule probably do so because of the number and size of their genes, or non-random insertion direction, and bad luck.
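
A rough sketch of that last point, flipping whole "genes" instead of single codons. Everything here is an assumption made for illustration: the gene is a made-up 100-base A-rich repeat, every gene is identical, and `a_fraction` is a hypothetical helper name.

```python
import random

COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def rc(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def a_fraction(n_genes, gene='AAAT' * 25, rng=random):
    """A/(A+T) after flipping each whole gene with probability 0.5."""
    genome = ''.join(rc(gene) if rng.random() < 0.5 else gene
                     for _ in range(n_genes))
    a, t = genome.count('A'), genome.count('T')
    return a / (a + t)

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
for n in (1, 10, 100, 10000):
    print(n, a_fraction(n, rng=rng))
```

With a single gene the ratio can only be 0.75 or 0.25, depending on which way the gene happens to point. As the number of independently oriented genes grows, the ratio settles toward 0.5.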

0
7.2 years ago
Joseph Hughes ★ 3.0k

Many eukaryotic genomes are AT-rich (i.e. GC-poor), whereas bacteria tend to be GC-rich. The GC content also varies from chromosome to chromosome, as shown in this post.

0
7.2 years ago
sacha ★ 2.4k

The author of this page has developed an idea to explain Chargaff's second rule using bioinformatics: http://www.basic.northwestern.edu/g-buehler/genomes/g_chargaff.htm

Briefly, the reason is an exchange of material between the two strands...
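
One way to picture such an exchange is an inversion: a segment is cut out, turned around and reinserted, so its reverse complement ends up on the strand being read. The toy simulation below is only an illustration of that general idea, not the model from the linked page; the starting sequence, segment lengths and number of inversions are arbitrary assumptions.

```python
import random

COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def invert_segment(genome, start, end):
    """Replace genome[start:end] with its reverse complement (an inversion)."""
    segment = genome[start:end].translate(COMPLEMENT)[::-1]
    return genome[:start] + segment + genome[end:]

def a_fraction(genome):
    """A/(A+T) on the strand being read."""
    a, t = genome.count('A'), genome.count('T')
    return a / (a + t)

rng = random.Random(0)          # fixed seed so the run is repeatable
genome = 'AAAC' * 2500          # 10,000 bases with no T or G on this strand
print('start:', a_fraction(genome))

for _ in range(5000):
    start = rng.randrange(len(genome))
    end = min(len(genome), start + rng.randint(1, 500))
    genome = invert_segment(genome, start, end)

print('after inversions:', a_fraction(genome))
```

The strand starts with A/(A+T) equal to 1, and repeated random inversions push it toward 0.5 without any change to the double-stranded molecule's overall composition.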