Question

Calculate GC content for entire chromosome

0

Entering edit mode

7 months ago

skilleta0527 • 0

Hi, I have an assembly I created using bwa-mem. I want to determine the GC percentage for each chromosome itself, not the individual reads. For example, Chr1 has a GC percentage of ___. What is the best way to go about doing this?

Thank you!

bam GC assembly bwa-mem • 1.0k views

ADD COMMENT • link updated 7 months ago by colindaven 6.5k • written 7 months ago by skilleta0527 • 0

score 3 · Answer 1 · 2023-12-06

The main one I use is stats.sh from the bbmap package (conda installable).

GC is 0.428 = 43% in this example. It's very quick.

stats.sh MA_organelles1.fasta

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2854  0.2148  0.2137  0.2861  0.0000  0.0000  0.0000  0.4285  0.0273

Main genome scaffold total:             2
Main genome contig total:               2
Main genome scaffold sequence total:    0.710 MB
Main genome contig sequence total:      0.710 MB        0.000% gap
Main genome scaffold N/L50:             1/569.63 KB
Main genome contig N/L50:               1/569.63 KB
Main genome scaffold N/L90:             2/140.384 KB
Main genome contig N/L90:               2/140.384 KB
Max scaffold length:                    569.63 KB
Max contig length:                      569.63 KB
Number of scaffolds > 50 KB:            2
% main genome in scaffolds > 50 KB:     100.00%


Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig  
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                      2               2         710,014         710,014   100.00%
 100 KB                      2               2         710,014         710,014   100.00%
 250 KB                      1               1         569,630         569,630   100.00%
 500 KB                      1               1         569,630         569,630   100.00%

There is also for finding divergent GC regions seqtk gc

https://github.com/lh3/seqtk

    seqtk gc

Usage: seqtk gc [options] <in.fa>
Options:
  -w         identify high-AT regions
  -f FLOAT   min GC fraction (or AT fraction for -w) [0.60]
  -l INT     min region length to output [20]
  -x FLOAT   X-dropoff [10.0]

Ram · Answer 2 · 2023-12-05

If you're comfortable using Python, I've created a script that calculates the GC content and GC-skew for each contig, scaffold, or chromosome in a fasta file. This is specifically designed for generating data for a circos plot.

To use the script, make sure you have Biopython installed in your conda environment. Once that's set up, here's how the script functions:

Run the script in your terminal using the command:

python3 gc_content_skew-optimized.py -f file.fasta

You can adjust the window size (-w) and step size (-s) according to your preferences. the default was set to 1000bp

#!/usr/bin/python

## this script is to generate GC content
import sys
from Bio import SeqIO
from argparse import ArgumentParser

def gc_content_window(s):
    gc = sum(s.count(x) for x in ['G', 'C', 'g', 'c', 'S', 's'])
    return round(gc / len(s), 4)

def gc_skew_window(s):
    g, c = s.lower().count('g'), s.lower().count('c')
    return round((g - c) / (g + c), 4) if g + c > 0 else 0

def main():
    parser = ArgumentParser(description="Calculate GC content and GC skew of input Fasta sequence")
    parser.add_argument("-f", "--file", dest="filename", help="Input Fasta format file", metavar="FASTA", required=True)
    parser.add_argument("-w", "--window", dest="WindowSize", help="WindowSize to calculate (default: 1000)", default=1000, type=int)
    parser.add_argument("-s", "--step", dest="StepSize", help="StepSize for slide widows (default: 1000)", default=1000, type=int)
    args = parser.parse_args()

    with open("gc_content.tsv", "w") as gc_content_file, open("gc_skew.tsv", "w") as gc_skew_file:
        for record in SeqIO.parse(args.filename, 'fasta'):
            seq = record.seq
            for i in range(0, len(seq), args.StepSize):
                subseq = seq[i:i + args.WindowSize]
                start, end = i + 1, min(i + args.StepSize, len(seq))
                gc_content_file.write(f"{record.id}\t{start}\t{end}\t{gc_content_window(subseq)}\n")
                gc_skew_file.write(f"{record.id}\t{start}\t{end}\t{gc_skew_window(subseq)}\n")

if __name__ == '__main__':
    main()

Hopefully, this helps

score 1 · Answer 3 · 2023-12-05

1

Entering edit mode

7 months ago

ATpoint 84k

How calculate CG content for each contig in an transcript assembly?

ADD COMMENT • link 7 months ago by ATpoint 84k