Does anyone know of a tool or package (perl/python/R) that will calculate the GC content of every coding sequence in the genome?
I need to do this on a per exon basis, as well as on a whole transcript (excluding the UTRs obviously) basis.
The gc content of every 5 bases is given here http://hgdownload.cse.ucsc.edu/goldenPath/hg19/gc5Base/ so I thought that if I have to do this myself then perhaps I could simply map the coordinates in the above file to a GTF file containing the genomic chromosomal coordinates of all CCDS of every gene. The only problem with this is the gc5Base file is 1.5 Gb is size so can't be loaded into memory.
Anyone got any suggestions???
EDIT: Solution suggested by Aaronquinlan was to use bedtools with a fasta file of the reference genome and the gtf file with exons in it. I tried:
bedtools nuc -fi Homosapiensassembly19.fasta -bed test.txt 1usercol 2usercol 3usercol 4usercol 5usercol 6usercol 7usercol 8usercol 9usercol 10pctat 11pctgc 12numA 13numC 14numG 15numT 16numN 17numoth 18seq_len Feature (chr1:67000041-67000044) beyond the length of chr1 size (0 bp). Skipping.
and get this feature beyond length of chr1 error.
the first identifier in my fasta file is
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
and my gtf file looks like:
chr1 hg19ccdsGene CDS 67000042 67000051 0.000000 + 0 geneid "CCDS30744.1"; transcript_id "CCDS30744.1";