Question: Calculating Gc Content For All Ccds
gravatar for Davy
5.0 years ago by
United States
Davy360 wrote:

Does anyone know of a tool or package (perl/python/R) that will calculate the GC content of every coding sequence in the genome?

I need to do this on a per exon basis, as well as on a whole transcript (excluding the UTRs obviously) basis.

The gc content of every 5 bases is given here so I thought that if I have to do this myself then perhaps I could simply map the coordinates in the above file to a GTF file containing the genomic chromosomal coordinates of all CCDS of every gene. The only problem with this is the gc5Base file is 1.5 Gb is size so can't be loaded into memory.

Anyone got any suggestions???

Cheers, Davy

EDIT: Solution suggested by Aaronquinlan was to use bedtools with a fasta file of the reference genome and the gtf file with exons in it. I tried:

bedtools nuc -fi Homosapiensassembly19.fasta -bed test.txt 1usercol 2usercol 3usercol 4usercol 5usercol 6usercol 7usercol 8usercol 9usercol 10pctat 11pctgc 12numA 13numC 14numG 15numT 16numN 17numoth 18seq_len Feature (chr1:67000041-67000044) beyond the length of chr1 size (0 bp). Skipping.

and get this feature beyond length of chr1 error.

the first identifier in my fasta file is

1 dna:chromosome chromosome:GRCh37:1:1:249250621:1

and my gtf file looks like:

chr1 hg19ccdsGene CDS 67000042 67000051 0.000000 + 0 geneid "CCDS30744.1"; transcript_id "CCDS30744.1";

gc next-gen • 13k views
ADD COMMENTlink modified 5.0 years ago by Aaronquinlan9.7k • written 5.0 years ago by Davy360
gravatar for Aaronquinlan
5.0 years ago by
United States
Aaronquinlan9.7k wrote:

The nuc option in bedtools can do this for you. All you'll need is a BED or GTF file of just the exons, as well as a FASTA file of the reference genome you are dealing with. Below is an example assuming you are dealing with the latest build of the human genome.

bedtools nuc -fi hg19.fa -bed exons.bed

Have a look at the help to understand the output. In brief, the second column reported after your original BED or GTF entry will be the GC%.

bedtools nuc

Output format: 
The following information will be reported after each BED entry:
    1) %AT content
    2) %GC content
    3) Number of As observed
    4) Number of Cs observed
    5) Number of Gs observed
    6) Number of Ts observed
    7) Number of Ns observed
    8) Number of other bases observed
    9) The length of the explored sequence/interval.
    10) The seq. extracted from the FASTA file. (opt., if -seq is used)
    11) The number of times a user's pattern was observed.
        (opt., if -pattern is used.)
ADD COMMENTlink written 5.0 years ago by Aaronquinlan9.7k

This appears to be exactly what I want, except when I run bedtools as you have suggested above, I get Feature (chr1:67000041-67000044) beyond the length of chr1 size (0 bp). Skipping.

for every feature in my GTF file. Is there some thing wrong with either my fasta or gtf file. I will edit original post to show examples of each. Cheers!

ADD REPLYlink written 5.0 years ago by Davy360

One possibility is that the chromosome names in your FASTA headers do not match the chrom names in your GTF file.

ADD REPLYlink written 5.0 years ago by Aaronquinlan9.7k

Indeed that was it. Thanks again!!!

ADD REPLYlink written 5.0 years ago by Davy360

I want to calculate the GC content for transcripts by providing BED12 file. It seems bedtools will ignore the last six columns and calculate the GC content for exon+intron. Is it possible to use bedtools to calculate GC only for transcripts? Thanks.

ADD REPLYlink written 4.9 years ago by me.xhuang0

Sorry to revive an old thread, but do you know how N's are treated in the GC-calculations? I guess they should not be taken into account. Jon

ADD REPLYlink written 20 months ago by jon.brate130
gravatar for Istvan Albert
5.0 years ago by
Istvan Albert ♦♦ 71k
University Park, USA
Istvan Albert ♦♦ 71k wrote:

A possible way to do it is to extract the sequences for your exons with a tool like fastaFromBed from bedtools then use a GC counting script, you can find many options for doing that, from Emboss geecee to writing your own

ADD COMMENTlink written 5.0 years ago by Istvan Albert ♦♦ 71k

ooops! I was not specific enough in my question I'm afraid. I want to do this with the reference sequence. Not with my sample sequences.

ADD REPLYlink written 5.0 years ago by Davy360

fastaFrombed is not working anymore. Use bedtools getfasta

Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.24.0
Summary: Extract DNA sequences into a fasta file based on feature coordinates.

Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf> -fo <fasta> 

    -fi Input FASTA file
    -bed    BED/GFF/VCF file of ranges to extract from -fi
    -fo Output file (can be FASTA or TAB-delimited)
    -name   Use the name field for the FASTA header
    -split  given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
    -tab    Write output in TAB delimited format.
        - Default is FASTA format.

    -s  Force strandedness. If the feature occupies the antisense,
        strand, the sequence will be reverse complemented.
        - By default, strand information is ignored.

    -fullHeader Use full fasta header.
        - By default, only the word before the first space or tab is used.
ADD REPLYlink written 10 months ago by anastazie.d30
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 447 users visited in the last hour