I have a really easy one for you today, and it's annoying me that I haven't found the answer myself yet.
My PI would like me to create a pie chart of the types of genomic locations that occur in the whole genome. For example, what percentage of the whole genome is intronic, exonic, intergenic, 5' UTR, etc.? I'm wondering which file I would use to create this, and which tool. I'm thinking of some sort of BED file of the whole genome that I could then annotate with HOMER, but I'm not sure exactly which file and format to go with. I have to do this for the UCSC hg19 genome as well as the newest rat genome, Ensembl Rnor_6.0.
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> library(magrittr) # provides the %>% pipe
# Get the human TxDb object, and restrict it to standard chromosomes (no random or Un chromosomes)
> Tx.human = TxDb.Hsapiens.UCSC.hg19.knownGene
> Tx.human = keepStandardChromosomes(Tx.human)
# Total number of bases in the human genome.
> tot.wholegenome = sum(as.numeric(seqlengths(Tx.human)))
# Total bases covered by exons
> tot.exons = exons(Tx.human) %>%
reduce(ignore.strand = TRUE) %>% # merge overlapping exons (on either strand) to avoid double-counting
width %>% # get width of each merged exon
sum # add up the widths
Now you have both the total number of bases in the genome and the number of bases covered by exons. You can plot them with your library of preference (e.g. ggplot2).
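As a rough illustration of the plotting step, here is a minimal sketch using base R graphics rather than ggplot2, only so that it needs no extra packages. The numbers in `bases` are made-up toy values, not real genome statistics; you would substitute your own counts (e.g. `tot.exons`).

```r
# Toy base counts per category. These are placeholders, NOT real genome numbers.
bases <- c(exonic = 30, intronic = 370, intergenic = 600)

# Convert base counts to percentages of the total
pct <- round(100 * bases / sum(bases), 1)

# Label each slice with its category name and percentage
labels <- sprintf("%s (%.1f%%)", names(pct), pct)

pie(bases, labels = labels, main = "Genome composition (toy data)")
```

The categories have to be mutually exclusive and add up to the whole genome, otherwise the slices will not represent true fractions.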
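To get introns, intergenic regions, etc., just use genes(), cds(), and the other TxDb accessor functions (e.g. intronsByTranscript(), fiveUTRsByTranscript(), threeUTRsByTranscript()), and intersect or subtract the resulting ranges.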
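One way this could look in practice is sketched below. It is only one possible partition, and the choices are my assumptions: transcript spans (rather than genes()) define "genic", strand is ignored, and exons take priority over introns wherever transcripts overlap. The helper names `flatten` and `n_bases` are made up for this example.

```r
library(TxDb.Hsapiens.UCSC.hg19.knownGene)  # also attaches GenomicFeatures / GenomicRanges

txdb <- keepStandardChromosomes(TxDb.Hsapiens.UCSC.hg19.knownGene)

# Collapse ranges into a non-overlapping, strand-less set so no base is counted twice
flatten <- function(gr) reduce(gr, ignore.strand = TRUE)
# Total bases in a set of ranges (as.numeric avoids integer overflow)
n_bases <- function(gr) sum(as.numeric(width(gr)))

genome     <- GRanges(seqinfo(txdb))      # one range per chromosome
exonic     <- flatten(exons(txdb))
genic      <- flatten(transcripts(txdb))  # transcript spans = exons + introns
intronic   <- setdiff(genic, exonic)      # inside a transcript, but in no exon
intergenic <- setdiff(genome, genic)      # outside every transcript

bases <- c(exonic     = n_bases(exonic),
           intronic   = n_bases(intronic),
           intergenic = n_bases(intergenic))
pct <- 100 * bases / n_bases(genome)
```

The three classes are disjoint by construction, so `pct` sums to 100. Splitting the exonic class further into CDS and UTRs works the same way, but you need to decide on a priority order first, because a base can be CDS in one transcript and UTR in another.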
I'm just going to throw out an easy way to do this using the ChIPseeker R package from Bioconductor. You would first annotate your peaks with annotatePeak(), and then use the plotAnnoPie() function to get the pie chart automatically.
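For a whole-genome question there are no real peaks, so the sketch below tiles the genome into fixed-size bins and treats those as "peaks". That workaround, the 10 kb bin width, and the restriction to chr21 (only to keep the demo quick) are all my assumptions, not part of ChIPseeker's documented workflow. ChIPseeker assigns each bin a single category by priority, so the pie shows the fraction of bins, not the exact fraction of bases.

```r
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
seqlevels(txdb) <- "chr21"  # demo only; drop this line to use every chromosome

# Cut the chromosome(s) into 10 kb bins that stand in for peaks
bins <- tileGenome(seqlengths(txdb), tilewidth = 10000,
                   cut.last.tile.in.chrom = TRUE)

# Annotate each bin with its genomic feature (promoter, exon, intron, ...)
anno <- annotatePeak(bins, TxDb = txdb, tssRegion = c(-3000, 3000))

plotAnnoPie(anno)

seqlevels(txdb) <- seqlevels0(txdb)  # restore the original chromosome set
```

Smaller bins approximate base-level fractions more closely at the cost of run time. For the rat genome you would swap in a TxDb built from the Ensembl Rnor_6.0 annotation.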