how to calculate a region‘ GC content in reference genome??thanks !! So I only have the region.
how to calculate a region‘ GC content in reference genome??thanks !! So I only have the region.
If you have your sequences in a FASTA file, you can use awk to calculate GC content. The following will calculate GC content over all input sequences, and ignore ambiguous bases (like N):
$ awk ' \
BEGIN { \
FS=""; \
cg=0; \
t=0; \
} \
{ \
if ($1 != ">") { \
for (i = 1; i <= NF; i++) { \
if ($i ~ /[ACTGactg]/) { \
t++;
} \
if ($i ~ /[CGcg]/) { \
cg++;
} \
} \
} \
} \
END { \
print cg"\t"t"\t"(cg/t); \
}' foo.fa
For example:
$ more foo.fa
>ctcf
CCGCGNGGNGGCAG
$ awk '...' foo.fa
11 12 0.916667
If you need GC content per-FASTA element, the END block's code can be copied into the main loop:
$ awk ' \
BEGIN { \
FS=""; \
cg=0; \
t=0; \
} \
{ \
if ($1 != ">") { \
for (i = 1; i <= NF; i++) { \
if ($i ~ /[ACTGactg]/) { \
t++;
} \
if ($i ~ /[CGcg]/) { \
cg++;
} \
} \
} \
else { \
if (t > 0) { \
print h"\t"cg"\t"t"\t"(cg/t); \
cg = 0; \
t = 0; \
} \
h = substr($0,2); \
} \
} \
END { \
print h"\t"cg"\t"t"\t"(cg/t); \
}' foo.fa
The output from this includes each element's header:
$ awk '...' foo.fa
ctcf 11 12 0.916667
To get a FASTA file, if your organism-of-interest has reference data on the UCSC Genome Browser, you can obtain sequence data for your regions-of-interest through the use of a custom track.
First, create a BED file containing your regions-of-interest and upload that file to a custom track via My Data > Custom Tracks. Then ask the browser to return the sequence data associated with these custom regions:
Tools > Table BrowserCustom Tracks from the group pull-down menutable menuregion radio button to genomeoutput format to sequenceget outputget sequenceSave the resulting FASTA file to a place where you can find it. Then run it through one of the two awk scripts above.
Please refer to http://www.biologicscorp.com/tools/GCContent which plots using google chart and it is easy to know how to calculate.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Do you have the sequence data already, or do you need that as well? What have you tried so far?