Question: Obtaining CDS Reference Genome Sequences And Coordinates Using BioMart
Max wrote (6.2 years ago):

I have a data set of mutation calls that I would like to map onto a reference human genome in order to infer the ratio of synonymous to non-synonymous substitutions.

To proceed, I took a list of HGNC gene symbols from the human genome and attempted to obtain their coordinates using BioMart:

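library("BSgenome.Hsapiens.UCSC.hg19")
library("biomaRt")
# Example input (illustrative only): a character vector of HGNC symbols
genes <- c("TP53", "BRCA1")
# Connect to the Ensembl human gene dataset
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
# Request the coding start/end coordinates for the listed genes
f <- getBM(attributes = c("genomic_coding_start", "genomic_coding_end"), filters = "hgnc_symbol", values = genes, mart = mart)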

I selected "coding" because for dN/dS I'm only concerned with matching identified mutations to coding regions. What I found was that the distance genomic_coding_end - genomic_coding_start did NOT match the length of the CDS returned by the "coding" attribute.

Is there a simple way to obtain both the CDS sequence and its chromosomal location using biomaRt? If not, I would greatly appreciate an alternative way to obtain this information.

Thank you in advance for your feedback and assistance.

Tags: ensembl, biomart
KCC wrote (6.2 years ago):

I do not think you can expect the length of the CDS to agree with genomic_coding_start and genomic_coding_end. The CDS does not include the introns, whereas the distance (genomic_coding_end - genomic_coding_start) does.
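To make the difference concrete, here is a minimal sketch in base R. The coordinates are made-up numbers for one hypothetical transcript, and the layout (one row per coding exon) is an assumption about how the table is organised, not actual BioMart output.

```r
# Toy coding-exon coordinates for one hypothetical transcript (made-up numbers).
exons <- data.frame(
  genomic_coding_start = c(1000, 2000, 3500),
  genomic_coding_end   = c(1099, 2149, 3600)
)

# Genomic span from the first coding base to the last one (introns included).
genomic_span <- max(exons$genomic_coding_end) - min(exons$genomic_coding_start) + 1

# CDS length: sum of the coding part of each exon (introns excluded).
cds_length <- sum(exons$genomic_coding_end - exons$genomic_coding_start + 1)

print(c(genomic_span = genomic_span, cds_length = cds_length))
```

In this toy case the span is several times longer than the summed exon lengths, which is the kind of mismatch described in the question.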


Would exon_chrom_start and exon_chrom_end give me what I need then?

Reply written 6.2 years ago by Max

I don't know about the state of annotations you are looking at. I can list some things you might want to look out for based on my experience. You will need to pull all the exons.

  1. It's possible you will have multiple copies of essentially the same exons, because exons are reported per transcript rather than per gene.

  2. It's possible you will have overlapping but not identical exons, because exons in different transcripts occasionally have different sizes.

  3. The first exon and last exon may or may not include the 5 prime and 3 prime UTRs, respectively.
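Points 1 and 2 can be handled by dropping exact duplicates and then merging overlapping intervals. The sketch below is one possible approach, not the only one. `fetch_exons()` and `merge_intervals()` are hypothetical helper names; the attribute names passed to `getBM()` are assumed to be available together on the dataset's exon "structure" attribute page, which is worth checking with `listAttributes(mart)`. The query wrapper is defined but not called here, and the merge is shown on made-up coordinates.

```r
# Hypothetical wrapper around biomaRt::getBM() that pulls exon-level rows,
# one row per exon per transcript. Attribute names are assumptions; verify
# them with listAttributes(mart) before relying on this.
fetch_exons <- function(genes, mart) {
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "ensembl_transcript_id", "chromosome_name",
                   "strand", "rank", "exon_chrom_start", "exon_chrom_end",
                   "genomic_coding_start", "genomic_coding_end"),
    filters = "hgnc_symbol",
    values = genes,
    mart = mart
  )
}

# Merge overlapping intervals into non-overlapping blocks (base R only).
# Intervals that merely touch (end + 1 == next start) are kept separate.
merge_intervals <- function(start, end) {
  stopifnot(length(start) == length(end))
  if (length(start) == 0) {
    return(data.frame(start = numeric(0), end = numeric(0)))
  }
  ord <- order(start, end)
  start <- start[ord]
  end <- end[ord]
  # Largest end coordinate seen so far, in sorted order.
  run_end <- cummax(end)
  # A new block begins when an interval starts after all earlier ones ended.
  new_block <- c(TRUE, start[-1] > run_end[-length(run_end)])
  block <- cumsum(new_block)
  data.frame(
    start = as.vector(tapply(start, block, min)),
    end   = as.vector(tapply(end, block, max))
  )
}

# Toy exon table for two hypothetical transcripts of one gene (made-up numbers).
exons <- data.frame(
  ensembl_transcript_id = c("tx1", "tx1", "tx2", "tx2"),
  exon_chrom_start = c(100, 400, 100, 150),
  exon_chrom_end   = c(200, 500, 200, 250)
)

# Point 1: drop exons that are identical across transcripts.
unique_exons <- unique(exons[, c("exon_chrom_start", "exon_chrom_end")])

# Point 2: collapse overlapping exons into one interval each.
merged <- merge_intervals(unique_exons$exon_chrom_start, unique_exons$exon_chrom_end)
print(merged)
```

Whether to merge across transcripts at all depends on the analysis. For dN/dS it may be preferable to pick a single transcript per gene so that the reading frame stays well defined.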

After you get the exons, you can compute the CDS. Note that the CDS runs from the ATG to the stop codon, excluding the introns. This is why the inclusion of the UTRs might be a problem for you.
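As a rough illustration of removing the UTRs, the sketch below clips each exon to the coding range and sums what remains. `clip_to_cds()` is a hypothetical helper, the coordinates are made up, and `cds_start` / `cds_end` are assumed to be the lowest and highest genomic coordinates of the coding region, so the length calculation does not depend on strand.

```r
# Clip exon intervals to the coding range [cds_start, cds_end].
# Exons that lie entirely in a UTR are dropped.
clip_to_cds <- function(exon_start, exon_end, cds_start, cds_end) {
  s <- pmax(exon_start, cds_start)
  e <- pmin(exon_end, cds_end)
  keep <- s <= e
  data.frame(start = s[keep], end = e[keep])
}

# Toy transcript with four exons (made-up numbers). The first exon starts
# in a UTR, the third ends in a UTR, and the fourth is entirely UTR.
exon_start <- c(100, 300, 500, 700)
exon_end   <- c(200, 400, 600, 800)

clipped <- clip_to_cds(exon_start, exon_end, cds_start = 150, cds_end = 551)

# CDS length is the sum of the clipped exon lengths.
cds_length <- sum(clipped$end - clipped$start + 1)
print(clipped)
print(cds_length)
```

A quick sanity check on real data is that the resulting length is a multiple of 3 and agrees with the length of the sequence returned for the same transcript.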

Reply written 6.2 years ago by KCC (modified 6.2 years ago)

Also, is there an equivalent attribute to "coding" that will only return the exon sequences?

Reply written 6.2 years ago by Max