Does anyone know of a tool or package (perl/python/R) that will calculate the GC content of every coding sequence in the genome?
I need to do this on a per exon basis, as well as on a whole transcript (excluding the UTRs obviously) basis.
The gc content of every 5 bases is given here http://hgdownload.cse.ucsc.edu/goldenPath/hg19/gc5Base/ so I thought that if I have to do this myself then perhaps I could simply map the coordinates in the above file to a GTF file containing the genomic chromosomal coordinates of all CCDS of every gene. The only problem with this is the gc5Base file is 1.5 Gb is size so can't be loaded into memory.
Anyone got any suggestions???
EDIT: Solution suggested by Aaronquinlan was to use bedtools with a fasta file of the reference genome and the gtf file with exons in it. I tried:
bedtools nuc -fi Homosapiensassembly19.fasta -bed test.txt 1usercol 2usercol 3usercol 4usercol 5usercol 6usercol 7usercol 8usercol 9usercol 10pctat 11pctgc 12numA 13numC 14numG 15numT 16numN 17numoth 18seq_len Feature (chr1:67000041-67000044) beyond the length of chr1 size (0 bp). Skipping.
and get this feature beyond length of chr1 error.
the first identifier in my fasta file is
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
and my gtf file looks like:
chr1 hg19ccdsGene CDS 67000042 67000051 0.000000 + 0 geneid "CCDS30744.1"; transcript_id "CCDS30744.1";
This appears to be exactly what I want, except when I run bedtools as you have suggested above, I get Feature (chr1:67000041-67000044) beyond the length of chr1 size (0 bp). Skipping.
for every feature in my GTF file. Is there some thing wrong with either my fasta or gtf file. I will edit original post to show examples of each. Cheers!
One possibility is that the chromosome names in your FASTA headers do not match the chrom names in your GTF file.
Indeed that was it. Thanks again!!!
I want to calculate the GC content for transcripts by providing BED12 file. It seems bedtools will ignore the last six columns and calculate the GC content for exon+intron. Is it possible to use bedtools to calculate GC only for transcripts? Thanks.
Sorry to revive an old thread, but do you know how N's are treated in the GC-calculations? I guess they should not be taken into account. Jon