Calculate GC content for entire chromosome
Entering edit mode
3 months ago

Hi, I have an assembly I created using bwa-mem. I want to determine the GC percentage for each chromosome itself, not the individual reads. For example, Chr1 has a GC percentage of ___. What is the best way to go about doing this?

Thank you!

bam GC assembly bwa-mem • 623 views
Entering edit mode
12 weeks ago

The main one I use is from the bbmap package (conda installable).

GC is 0.428 = 43% in this example. It's very quick. MA_organelles1.fasta

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2854  0.2148  0.2137  0.2861  0.0000  0.0000  0.0000  0.4285  0.0273

Main genome scaffold total:             2
Main genome contig total:               2
Main genome scaffold sequence total:    0.710 MB
Main genome contig sequence total:      0.710 MB        0.000% gap
Main genome scaffold N/L50:             1/569.63 KB
Main genome contig N/L50:               1/569.63 KB
Main genome scaffold N/L90:             2/140.384 KB
Main genome contig N/L90:               2/140.384 KB
Max scaffold length:                    569.63 KB
Max contig length:                      569.63 KB
Number of scaffolds > 50 KB:            2
% main genome in scaffolds > 50 KB:     100.00%

Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig  
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                      2               2         710,014         710,014   100.00%
 100 KB                      2               2         710,014         710,014   100.00%
 250 KB                      1               1         569,630         569,630   100.00%
 500 KB                      1               1         569,630         569,630   100.00%

There is also for finding divergent GC regions seqtk gc

    seqtk gc

Usage: seqtk gc [options] <in.fa>
  -w         identify high-AT regions
  -f FLOAT   min GC fraction (or AT fraction for -w) [0.60]
  -l INT     min region length to output [20]
  -x FLOAT   X-dropoff [10.0]
Entering edit mode

What does this do? It's not documented well and does not work for me in terms of, it produces no output:

cat foo.fa

seqtk gc foo.fa
Entering edit mode

Try tripling your example seq length, see 20bp min region

  -l INT     min region length to output [20]

To answer your question, it picks out divergent GC regions over default 60% or fraction 0.60 with a mi length of 20bp, and produces a bed file of those regions.

So actually not what the question was I'm afraid.

Entering edit mode
3 months ago
ben@f ▴ 10

If you're comfortable using Python, I've created a script that calculates the GC content and GC-skew for each contig, scaffold, or chromosome in a fasta file. This is specifically designed for generating data for a circos plot.

To use the script, make sure you have Biopython installed in your conda environment. Once that's set up, here's how the script functions:

Run the script in your terminal using the command:

python3 -f file.fasta

You can adjust the window size (-w) and step size (-s) according to your preferences. the default was set to 1000bp


## this script is to generate GC content
import sys
from Bio import SeqIO
from argparse import ArgumentParser

def gc_content_window(s):
    gc = sum(s.count(x) for x in ['G', 'C', 'g', 'c', 'S', 's'])
    return round(gc / len(s), 4)

def gc_skew_window(s):
    g, c = s.lower().count('g'), s.lower().count('c')
    return round((g - c) / (g + c), 4) if g + c > 0 else 0

def main():
    parser = ArgumentParser(description="Calculate GC content and GC skew of input Fasta sequence")
    parser.add_argument("-f", "--file", dest="filename", help="Input Fasta format file", metavar="FASTA", required=True)
    parser.add_argument("-w", "--window", dest="WindowSize", help="WindowSize to calculate (default: 1000)", default=1000, type=int)
    parser.add_argument("-s", "--step", dest="StepSize", help="StepSize for slide widows (default: 1000)", default=1000, type=int)
    args = parser.parse_args()

    with open("gc_content.tsv", "w") as gc_content_file, open("gc_skew.tsv", "w") as gc_skew_file:
        for record in SeqIO.parse(args.filename, 'fasta'):
            seq = record.seq
            for i in range(0, len(seq), args.StepSize):
                subseq = seq[i:i + args.WindowSize]
                start, end = i + 1, min(i + args.StepSize, len(seq))

if __name__ == '__main__':

Hopefully, this helps


Login before adding your answer.

Traffic: 1353 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6