I have a data set of mutation calls that I would like to map onto a reference human genome in order to infer the ratio of non-synonymous to synonymous substitutions (dN/dS).
To proceed, I took a list of HGNC gene symbols from the human genome and attempted to obtain their coordinates using biomaRt, i.e.
library("BSgenome.Hsapiens.UCSC.hg19")
library("biomaRt")
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
# genes: character vector of HGNC gene symbols
f = getBM(attributes = c("genomic_coding_start", "genomic_coding_end"), filters = "hgnc_symbol", values = genes, mart = mart)
I selected "coding" because for dN/dS I'm only concerned with matching identified mutations to coding regions. What I found was that the spans between genomic_coding_start and genomic_coding_end did NOT correspond to the lengths of the CDS sequences returned by the attribute "coding".
Is there a simple way to obtain both the CDS sequence and its chromosome location using biomaRt? If not, I would greatly appreciate an alternative way to obtain this information.
Thank you in advance for your feedback and assistance.
Would exon_chrom_start and exon_chrom_end give me what I need then?
I don't know the state of the annotations you are looking at, but I can list some things to watch out for based on my experience. You will need to pull all the exons.
It's possible you will have multiple copies of essentially the same exons, because the exons will be for each transcript instead of each gene.
It's possible you will have overlapping but not identical exons, because exons in different transcripts occasionally have different sizes.
The first and last exons may or may not include the 5' and 3' UTRs, respectively.
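To illustrate the duplicate and overlapping exon issue, here is a minimal base R sketch that collapses per-transcript exons into a non-overlapping set. The coordinates are invented, and `merge_intervals` is a hypothetical helper of mine, not a biomaRt function. It assumes 1-based inclusive coordinates and that all exons passed in belong to one gene on one chromosome and strand.

```r
# Toy exon table; coordinates are invented for illustration only.
exons <- data.frame(
  transcript = c("tx1", "tx1", "tx2", "tx2", "tx2"),
  start      = c(100, 300, 100, 280, 500),
  end        = c(200, 400, 200, 400, 600)
)

# Hypothetical helper: collapse 1-based, inclusive intervals into a
# non-overlapping set. Intervals that merely touch are kept separate.
merge_intervals <- function(start, end) {
  ord   <- order(start, end)
  start <- start[ord]
  end   <- end[ord]
  # Furthest end position seen so far, walking left to right.
  reach <- cummax(end)
  # A new block begins wherever an interval starts beyond that reach.
  new_block <- c(TRUE, start[-1] > reach[-length(reach)])
  block <- cumsum(new_block)
  data.frame(
    start = as.vector(tapply(start, block, min)),
    end   = as.vector(tapply(end, block, max))
  )
}

merged <- merge_intervals(exons$start, exons$end)
print(merged)
```

Note that merging across transcripts gives you the union of exonic positions for a gene, which is useful for asking whether a mutation falls in any exon. For codon-level work you would still want to pick one transcript, since the reading frame is defined per transcript.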
After you get the exons, you can compute the CDS. Note that the CDS runs from the ATG start codon to the stop codon, excluding the introns, which is why exons that include UTRs might be a problem for you.
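As a rough sketch of that last step, the following base R snippet trims the exons of a single transcript to the coding range and sums the remaining lengths. Everything here is made up for illustration: the coordinates, the `cds_start` and `cds_end` values, and the `clip_to_cds` helper are all hypothetical. It assumes 1-based inclusive genomic coordinates, with `cds_start` being the lowest and `cds_end` the highest coding position on the chromosome regardless of strand.

```r
# Toy exons for one transcript; coordinates are invented for illustration.
exons <- data.frame(start = c(100, 300, 500), end = c(200, 400, 600))

# Hypothetical lowest and highest genomic positions of the coding region.
cds_start <- 150
cds_end   <- 551

# Hypothetical helper: trim each exon to the coding range and drop exons
# that fall entirely inside a UTR.
clip_to_cds <- function(exons, cds_start, cds_end) {
  s <- pmax(exons$start, cds_start)
  e <- pmin(exons$end, cds_end)
  keep <- s <= e
  data.frame(start = s[keep], end = e[keep])
}

cds <- clip_to_cds(exons, cds_start, cds_end)

# Inclusive coordinates, hence the + 1 per segment.
cds_length <- sum(cds$end - cds$start + 1)

# A complete CDS should be a whole number of codons.
stopifnot(cds_length %% 3 == 0)
print(cds)
print(cds_length)
```

Comparing this summed length against the length of the sequence you get back is a quick sanity check. If the two disagree, it is worth checking whether the coordinates and the sequence come from the same genome build and the same transcript, and whether the annotation counts the stop codon as part of the CDS.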
Also, is there an equivalent attribute to "coding" that will only return the exon sequences?