Calculating Gc Content For All Ccds
2
10
Entering edit mode
10.5 years ago
Davy ▴ 390

Does anyone know of a tool or package (perl/python/R) that will calculate the GC content of every coding sequence in the genome?

I need to do this on a per exon basis, as well as on a whole transcript (excluding the UTRs obviously) basis.

The gc content of every 5 bases is given here http://hgdownload.cse.ucsc.edu/goldenPath/hg19/gc5Base/ so I thought that if I have to do this myself then perhaps I could simply map the coordinates in the above file to a GTF file containing the genomic chromosomal coordinates of all CCDS of every gene. The only problem with this is the gc5Base file is 1.5 Gb is size so can't be loaded into memory.

Anyone got any suggestions???

Cheers, Davy

EDIT: Solution suggested by Aaronquinlan was to use bedtools with a fasta file of the reference genome and the gtf file with exons in it. I tried:

bedtools nuc -fi Homosapiensassembly19.fasta -bed test.txt 1usercol 2usercol 3usercol 4usercol 5usercol 6usercol 7usercol 8usercol 9usercol 10pctat 11pctgc 12numA 13numC 14numG 15numT 16numN 17numoth 18seq_len Feature (chr1:67000041-67000044) beyond the length of chr1 size (0 bp). Skipping.

and get this feature beyond length of chr1 error.

the first identifier in my fasta file is

1 dna:chromosome chromosome:GRCh37:1:1:249250621:1

and my gtf file looks like:

chr1 hg19ccdsGene CDS 67000042 67000051 0.000000 + 0 geneid "CCDS30744.1"; transcript_id "CCDS30744.1";

next-gen gc gc • 30k views
27
Entering edit mode
10.5 years ago

The nuc option in bedtools can do this for you. All you'll need is a BED or GTF file of just the exons, as well as a FASTA file of the reference genome you are dealing with. Below is an example assuming you are dealing with the latest build of the human genome.

bedtools nuc -fi hg19.fa -bed exons.bed


Have a look at the help to understand the output. In brief, the second column reported after your original BED or GTF entry will be the GC%.

bedtools nuc

Output format:
The following information will be reported after each BED entry:
1) %AT content
2) %GC content
3) Number of As observed
4) Number of Cs observed
5) Number of Gs observed
6) Number of Ts observed
7) Number of Ns observed
8) Number of other bases observed
9) The length of the explored sequence/interval.
10) The seq. extracted from the FASTA file. (opt., if -seq is used)
11) The number of times a user's pattern was observed.
(opt., if -pattern is used.)

0
Entering edit mode

This appears to be exactly what I want, except when I run bedtools as you have suggested above, I get Feature (chr1:67000041-67000044) beyond the length of chr1 size (0 bp). Skipping.

for every feature in my GTF file. Is there some thing wrong with either my fasta or gtf file. I will edit original post to show examples of each. Cheers!

2
Entering edit mode

One possibility is that the chromosome names in your FASTA headers do not match the chrom names in your GTF file.

0
Entering edit mode

Indeed that was it. Thanks again!!!

0
Entering edit mode

I want to calculate the GC content for transcripts by providing BED12 file. It seems bedtools will ignore the last six columns and calculate the GC content for exon+intron. Is it possible to use bedtools to calculate GC only for transcripts? Thanks.

0
Entering edit mode

Sorry to revive an old thread, but do you know how N's are treated in the GC-calculations? I guess they should not be taken into account. Jon

3
Entering edit mode
10.5 years ago

A possible way to do it is to extract the sequences for your exons with a tool like fastaFromBed from bedtools then use a GC counting script, you can find many options for doing that, from Emboss geecee to writing your own

1
Entering edit mode

fastaFrombed is not working anymore. Use bedtools getfasta

Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.24.0
Summary: Extract DNA sequences into a fasta file based on feature coordinates.

Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf> -fo <fasta>

Options:
-fi Input FASTA file
-bed    BED/GFF/VCF file of ranges to extract from -fi
-fo Output file (can be FASTA or TAB-delimited)
-name   Use the name field for the FASTA header
-split  given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
-tab    Write output in TAB delimited format.
- Default is FASTA format.

-s  Force strandedness. If the feature occupies the antisense,
strand, the sequence will be reverse complemented.
- By default, strand information is ignored.

- By default, only the word before the first space or tab is used.

0
Entering edit mode

ooops! I was not specific enough in my question I'm afraid. I want to do this with the reference sequence. Not with my sample sequences.