How To Calculate A Region‘ Gc Content In Reference Genome??
2
1
Entering edit mode
8.1 years ago
siyu ▴ 140

how to calculate a region‘ GC content in reference genome??thanks !! So I only have the region.

gc • 8.3k views
ADD COMMENT
1
Entering edit mode

Do you have the sequence data already, or do you need that as well? What have you tried so far?

ADD REPLY
8
Entering edit mode
8.1 years ago

If you have your sequences in a FASTA file, you can use awk to calculate GC content. The following will calculate GC content over all input sequences, and ignore ambiguous bases (like N):

$ awk ' \
BEGIN { \
    FS=""; \
    cg=0; \
    t=0; \
} \
{ \
    if ($1 != ">") { \
        for (i = 1; i <= NF; i++) { \
            if ($i ~ /[ACTGactg]/) { \
                t++;
            } \
            if ($i ~ /[CGcg]/) { \
                cg++;
            } \
        } \
    } \
} \
END { \
    print cg"\t"t"\t"(cg/t); \
}' foo.fa

For example:

$ more foo.fa
>ctcf
CCGCGNGGNGGCAG

$ awk '...' foo.fa
11  12  0.916667

If you need GC content per-FASTA element, the END block's code can be copied into the main loop:

$ awk ' \
BEGIN { \
    FS=""; \
    cg=0; \
    t=0; \
} \
{ \
    if ($1 != ">") { \
        for (i = 1; i <= NF; i++) { \
            if ($i ~ /[ACTGactg]/) { \
                t++;
            } \
            if ($i ~ /[CGcg]/) { \
                cg++;
            } \
        } \
    } \
    else { \
        if (t > 0) { \
            print h"\t"cg"\t"t"\t"(cg/t); \
            cg = 0; \
            t = 0; \
        } \
        h = substr($0,2); \
    } \
} \
END { \
    print h"\t"cg"\t"t"\t"(cg/t); \
}' foo.fa

The output from this includes each element's header:

$ awk '...' foo.fa
ctcf    11  12  0.916667

To get a FASTA file, if your organism-of-interest has reference data on the UCSC Genome Browser, you can obtain sequence data for your regions-of-interest through the use of a custom track.

First, create a BED file containing your regions-of-interest and upload that file to a custom track via My Data > Custom Tracks. Then ask the browser to return the sequence data associated with these custom regions:

  1. Select Tools > Table Browser
  2. Select Custom Tracks from the group pull-down menu
  3. Select your custom track from the table menu
  4. Set the region radio button to genome
  5. Set the output format to sequence
  6. Click get output
  7. Set any post-processing options and click get sequence

Save the resulting FASTA file to a place where you can find it. Then run it through one of the two awk scripts above.

ADD COMMENT
0
Entering edit mode
6.8 years ago
bioee • 0

1. determine the slide window size

2. calculate the gc content for each window size

Please refer to http://www.biologicscorp.com/tools/GCContent  which plots using google chart and it is easy to know how to calculate. 

ADD COMMENT

Login before adding your answer.

Traffic: 2907 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6