Question: How To Calculate A Region‘ Gc Content In Reference Genome??
1
gravatar for siyu
5.9 years ago by
siyu130
China
siyu130 wrote:

how to calculate a region‘ GC content in reference genome??thanks !! So I only have the region.

gc • 6.0k views
ADD COMMENTlink modified 4.7 years ago by bioee0 • written 5.9 years ago by siyu130
1

Do you have the sequence data already, or do you need that as well? What have you tried so far?

ADD REPLYlink written 5.9 years ago by Alex Reynolds27k
6
gravatar for Alex Reynolds
5.9 years ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

If you have your sequences in a FASTA file, you can use awk to calculate GC content. The following will calculate GC content over all input sequences, and ignore ambiguous bases (like N):

$ awk ' \
BEGIN { \
    FS=""; \
    cg=0; \
    t=0; \
} \
{ \
    if ($1 != ">") { \
        for (i = 1; i <= NF; i++) { \
            if ($i ~ /[ACTGactg]/) { \
                t++;
            } \
            if ($i ~ /[CGcg]/) { \
                cg++;
            } \
        } \
    } \
} \
END { \
    print cg"\t"t"\t"(cg/t); \
}' foo.fa

For example:

$ more foo.fa
>ctcf
CCGCGNGGNGGCAG

$ awk '...' foo.fa
11  12  0.916667

If you need GC content per-FASTA element, the END block's code can be copied into the main loop:

$ awk ' \
BEGIN { \
    FS=""; \
    cg=0; \
    t=0; \
} \
{ \
    if ($1 != ">") { \
        for (i = 1; i <= NF; i++) { \
            if ($i ~ /[ACTGactg]/) { \
                t++;
            } \
            if ($i ~ /[CGcg]/) { \
                cg++;
            } \
        } \
    } \
    else { \
        if (t > 0) { \
            print h"\t"cg"\t"t"\t"(cg/t); \
            cg = 0; \
            t = 0; \
        } \
        h = substr($0,2); \
    } \
} \
END { \
    print h"\t"cg"\t"t"\t"(cg/t); \
}' foo.fa

The output from this includes each element's header:

$ awk '...' foo.fa
ctcf    11  12  0.916667

To get a FASTA file, if your organism-of-interest has reference data on the UCSC Genome Browser, you can obtain sequence data for your regions-of-interest through the use of a custom track.

First, create a BED file containing your regions-of-interest and upload that file to a custom track via My Data > Custom Tracks. Then ask the browser to return the sequence data associated with these custom regions:

  1. Select Tools > Table Browser
  2. Select Custom Tracks from the group pull-down menu
  3. Select your custom track from the table menu
  4. Set the region radio button to genome
  5. Set the output format to sequence
  6. Click get output
  7. Set any post-processing options and click get sequence

Save the resulting FASTA file to a place where you can find it. Then run it through one of the two awk scripts above.

ADD COMMENTlink modified 23 months ago • written 5.9 years ago by Alex Reynolds27k
0
gravatar for bioee
4.7 years ago by
bioee0
United States
bioee0 wrote:

1. determine the slide window size

2. calculate the gc content for each window size

Please refer to http://www.biologicscorp.com/tools/GCContent  which plots using google chart and it is easy to know how to calculate. 

ADD COMMENTlink written 4.7 years ago by bioee0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 809 users visited in the last hour