Question: How To Calculate A Region‘ Gc Content In Reference Genome？？
1
5.9 years ago by
siyu130
China
siyu130 wrote:

how to calculate a region‘ GC content in reference genome？？thanks !! So I only have the region.

gc • 6.0k views
modified 4.7 years ago by bioee0 • written 5.9 years ago by siyu130
1

Do you have the sequence data already, or do you need that as well? What have you tried so far?

6
5.9 years ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

If you have your sequences in a FASTA file, you can use `awk` to calculate GC content. The following will calculate GC content over all input sequences, and ignore ambiguous bases (like `N`):

``````\$ awk ' \
BEGIN { \
FS=""; \
cg=0; \
t=0; \
} \
{ \
if (\$1 != ">") { \
for (i = 1; i <= NF; i++) { \
if (\$i ~ /[ACTGactg]/) { \
t++;
} \
if (\$i ~ /[CGcg]/) { \
cg++;
} \
} \
} \
} \
END { \
print cg"\t"t"\t"(cg/t); \
}' foo.fa
``````

For example:

``````\$ more foo.fa
>ctcf
CCGCGNGGNGGCAG

\$ awk '...' foo.fa
11  12  0.916667
``````

If you need GC content per-FASTA element, the `END` block's code can be copied into the main loop:

``````\$ awk ' \
BEGIN { \
FS=""; \
cg=0; \
t=0; \
} \
{ \
if (\$1 != ">") { \
for (i = 1; i <= NF; i++) { \
if (\$i ~ /[ACTGactg]/) { \
t++;
} \
if (\$i ~ /[CGcg]/) { \
cg++;
} \
} \
} \
else { \
if (t > 0) { \
print h"\t"cg"\t"t"\t"(cg/t); \
cg = 0; \
t = 0; \
} \
h = substr(\$0,2); \
} \
} \
END { \
print h"\t"cg"\t"t"\t"(cg/t); \
}' foo.fa
``````

The output from this includes each element's header:

``````\$ awk '...' foo.fa
ctcf    11  12  0.916667
``````

To get a FASTA file, if your organism-of-interest has reference data on the UCSC Genome Browser, you can obtain sequence data for your regions-of-interest through the use of a custom track.

First, create a BED file containing your regions-of-interest and upload that file to a custom track via `My Data > Custom Tracks`. Then ask the browser to return the sequence data associated with these custom regions:

1. Select `Tools > Table Browser`
2. Select `Custom Tracks` from the `group` pull-down menu
3. Select your custom track from the `table` menu
4. Set the `region` radio button to `genome`
5. Set the `output format` to `sequence`
6. Click `get output`
7. Set any post-processing options and click `get sequence`

Save the resulting FASTA file to a place where you can find it. Then run it through one of the two `awk` scripts above.

0
4.7 years ago by
bioee0
United States
bioee0 wrote:

1. determine the slide window size

2. calculate the gc content for each window size

Please refer to http://www.biologicscorp.com/tools/GCContent  which plots using google chart and it is easy to know how to calculate.