Question

Obtaining CDS Reference Genome Sequences and Coordinates Using BioMart

10.8 years ago

Max ▴ 150

I have a data set of mutation calls that I would like to map onto a reference human genome in order to infer the ratio of synonymous to non-synonymous substitutions.

To proceed, I took a list of HGNC gene symbols for the human genome and attempted to obtain their coordinates using biomaRt:

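library("BSgenome.Hsapiens.UCSC.hg19")  # hg19 reference sequence
library("biomaRt")
genes <- c("TP53", "BRCA1")  # placeholder symbols for illustration; substitute your own list
# Connect to the Ensembl human gene dataset
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
# Request the genomic start and end of the coding regions for each HGNC symbol
f <- getBM(attributes = c("genomic_coding_start", "genomic_coding_end"), filters = "hgnc_symbol", values = genes, mart = mart)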

I selected "coding" because for dN/dS I'm only concerned with matching identified mutations to coding regions. What I found was that the distance between genomic_coding_start and genomic_coding_end did NOT correspond to the length of the CDS sequence returned by the "coding" attribute.

Is there a simple way to obtain both the CDS sequence and its chromosomal location using biomaRt? If not, I would greatly appreciate an alternative way to obtain this information.

Thank you in advance for your feedback and assistance.

biomart ensembl • 3.9k views

updated 10.8 years ago by KCC ★ 4.1k • written 10.8 years ago by Max ▴ 150

score 0 · Answer 1 · 2013-07-15

10.8 years ago

KCC ★ 4.1k

I do not think you can expect the length of the CDS to agree with genomic_coding_start and genomic_coding_end. The reason is that the CDS does not include the introns, whereas the distance (genomic_coding_end - genomic_coding_start) from the first coding base to the last one does include them.
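To make the difference concrete, here is a minimal sketch in base R. The coordinates are made up for illustration and do not come from any real gene; they stand for the coding segments of a single hypothetical transcript, using 1-based inclusive genomic positions.

```r
# Hypothetical coding segments of one transcript (1-based, inclusive genomic
# coordinates). The numbers are invented for illustration only.
coding_start <- c(1000, 2000, 3500)
coding_end   <- c(1098, 2149, 3601)

# Distance from the first coding base to the last one: this includes the introns.
genomic_span <- max(coding_end) - min(coding_start) + 1

# CDS length: the sum of the individual coding segments, introns excluded.
cds_length <- sum(coding_end - coding_start + 1)

print(genomic_span)
print(cds_length)
```

With these numbers the genomic span is far larger than the CDS length, which is the kind of mismatch described in the question.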

written 10.8 years ago by KCC ★ 4.1k

Would exon_chrom_start and exon_chrom_end give me what I need then?

written 10.8 years ago by Max ▴ 150

I don't know about the state of annotations you are looking at. I can list some things you might want to look out for based on my experience. You will need to pull all the exons.

It's possible you will have multiple copies of essentially the same exons, because the exons will be for each transcript instead of each gene.
It's possible you will have overlapping, but not identical exons because exons in different transcripts might occasionally have different sizes.
The first exon and last exon may or may not include the 5 prime and 3 prime UTRs, respectively.

After you get the exons, you will be able to compute the CDS. Note that the CDS runs from the ATG to the stop codon, excluding the introns. This is why the inclusion of the UTRs might be a problem for you.
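The steps above could be sketched roughly as follows in base R. Everything here is an assumption for illustration: the exon table, the column names (transcript, exon_start, exon_end, cds_first, cds_last) and the coordinates are hypothetical, and cds_first / cds_last stand for the lowest and highest coding genomic positions of each transcript, so the length calculation does not depend on strand.

```r
# Hypothetical exon table: one row per exon per transcript, with one exact
# duplicate row to mimic repeated records. All values are invented.
exons <- data.frame(
  transcript = c("tx1", "tx1", "tx1", "tx1", "tx2", "tx2"),
  exon_start = c(100, 100, 400, 700, 100, 700),
  exon_end   = c(199, 199, 549, 899, 199, 899)
)

# Hypothetical lowest and highest coding positions per transcript
# (ATG through stop codon, in genomic coordinates).
coding <- data.frame(
  transcript = c("tx1", "tx2"),
  cds_first  = c(150, 150),
  cds_last   = c(799, 799)
)

exons <- unique(exons)                            # drop exact duplicate rows
exons <- merge(exons, coding, by = "transcript")  # attach coding bounds

# Clip each exon to the coding region, which trims away the UTR portions.
exons$seg_start <- pmax(exons$exon_start, exons$cds_first)
exons$seg_end   <- pmin(exons$exon_end, exons$cds_last)

# Drop exons that lie entirely inside a UTR.
exons <- exons[exons$seg_start <= exons$seg_end, ]

# Sum the clipped segment lengths per transcript to get the CDS length.
exons$seg_len <- exons$seg_end - exons$seg_start + 1
cds_len <- tapply(exons$seg_len, exons$transcript, sum)
print(cds_len)
```

Keeping the calculation per transcript rather than per gene sidesteps the overlapping-exon problem, since exons within a single transcript do not overlap each other.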

written 10.8 years ago by KCC ★ 4.1k

Also, is there an equivalent attribute to "coding" that will only return the exon sequences?

written 10.8 years ago by Max ▴ 150