Question: Biomart Yields Incomplete Results When Converting Refseq_Dna To Refseq_Peptide
3
Ryan Thompson (3.4k), TSRI, La Jolla, CA, wrote 9.8 years ago:

I am trying to use biomaRt in R to retrieve the corresponding refseq_peptide IDs for a list of refseq_dna mRNA transcript IDs. However, for some transcripts no peptide ID is returned, even though other sources clearly indicate an associated peptide for that transcript. For example, "NM_000092" has this problem. I can reproduce the same result with the martview web interface. Here is a link.

[EDIT] - converted URL to tinyurl

http://tinyurl.com/37e6s6e

You can see that I have queried for refseq_dna equal to NM_000092 and retrieved DNA and protein identifiers in both RefSeq and Ensembl. Only the RefSeq protein ID is empty. If you look at the NCBI record for NM_000092, you'll see that the answer should be NP_000083:

/product="collagen alpha-4(IV) chain precursor"
/protein_id="NP_000083.3"

Also, if I search on bioDBnet's db2db tool, it does find the associated peptide ID.

Furthermore, searching with IDConverter also yields the correct results, and IDConverter explicitly states that its refseq_peptide info comes from Ensembl, which is presumably the same source as biomart.

So why isn't biomart finding some mRNA-peptide associations that other tools do find?

biomart conversion • 2.8k views
Modified 9.8 years ago by Uma (0) • written 9.8 years ago by Ryan Thompson (3.4k)
5
Neilfws (48k), Sydney, Australia, wrote 9.8 years ago:

I think that this occurs due to some subtleties in both the way that BioMart works and the way RefSeq defines reference mRNAs and their products.

Try this query instead: http://tinyurl.com/27d8nk5

It uses the HGNC symbol for the gene (COL4A4), in place of the RefSeq mRNA. You should see a result like this:

[Image: BioMart results for COL4A4]

This shows that there are 2 Ensembl transcripts. One of them maps to the RefSeq mRNA, the other maps to the RefSeq protein.

It is a little difficult to determine what the RefSeq curators had in mind here! The protein product of each Ensembl transcript is the same length. Presumably, someone has decided that the reference mRNA in RefSeq should map to one of the transcripts, but the reference protein should map to the other.

I guess the conclusion is: try different search terms if you don't see what you expected.

Modified 9.8 years ago • written 9.8 years ago by Neilfws (48k)

Hmm. According to Ensembl, the two transcripts listed differ by one exon, and the protein products are not identical. So it looks like the problem is inconsistencies between RefSeq and Ensembl. These cause trouble because biomart is presumably using Ensembl IDs as an intermediary to do the conversion from RefSeq RNA to RefSeq peptide. I need to handle this programmatically for several hundred problematic transcripts. Is there any package in R/Bioconductor that converts RefSeq RNA to RefSeq peptide without going through Ensembl IDs?

Reply written 9.8 years ago by Ryan Thompson (3.4k)
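The hypothesis in the comment above (that the conversion goes through Ensembl transcripts) can be illustrated with a toy table. This is only a sketch in Python, not how BioMart is actually implemented: the Ensembl transcript IDs `ENST_A` and `ENST_B` are placeholders, and the two function names are hypothetical. The one thing taken from the thread is that one Ensembl transcript carries the RefSeq mRNA and the other carries the RefSeq protein.

```python
# Toy illustration only. ENST_A / ENST_B are placeholder IDs, not the real
# Ensembl transcripts for COL4A4.
# Each row: (gene_symbol, ensembl_transcript, refseq_mrna, refseq_peptide)
ROWS = [
    ("COL4A4", "ENST_A", "NM_000092", None),  # carries the RefSeq mRNA only
    ("COL4A4", "ENST_B", None, "NP_000083"),  # carries the RefSeq peptide only
]


def peptides_via_transcript(rows, mrna_id):
    """Peptide IDs annotated on the same transcript row as the given mRNA."""
    return [pep for _, _, mrna, pep in rows if mrna == mrna_id and pep is not None]


def peptides_via_gene(rows, gene_symbol):
    """Peptide IDs annotated on any transcript of the given gene."""
    return [pep for gene, _, _, pep in rows if gene == gene_symbol and pep is not None]


# Transcript-level lookup finds nothing; gene-level lookup finds the peptide.
print(peptides_via_transcript(ROWS, "NM_000092"))
print(peptides_via_gene(ROWS, "COL4A4"))
```

Under these assumptions the transcript-level lookup comes back empty because the mRNA and the peptide never share a row, while the gene-level lookup succeeds. That matches the difference between the original query and the HGNC-symbol query.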

Also, querying for the protein products of a gene is not the same as querying for the protein products of a transcript, and for my application the difference matters. So querying based on gene name is not really a way for me to fix this.

Reply written 9.8 years ago by Ryan Thompson (3.4k)

I tried using DAVIDQuery in place of biomart, but DAVID does gene-centric queries, so it has the same problem as my previous comment.

Reply written 9.8 years ago by Ryan Thompson (3.4k)

The problem is in the mappings, not in the query. This is just a strange case, where RefSeq RNA and RefSeq protein do not map as you would expect - i.e. to the same transcript. If you want a direct mapping, it's probably best to use NCBI resources, as Pierre suggested, rather than BioMart.

Reply written 9.8 years ago by Neilfws (48k)
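One way to get a direct, transcript-level mapping from NCBI data is to read the `/protein_id` qualifier out of each mRNA's GenBank flat-file record, as quoted in the question. The following is a minimal Python sketch that assumes the record text has already been downloaded by some other means. The function name `protein_ids` and the `strip_version` parameter are hypothetical, and the excerpt contains only the two qualifier lines shown earlier in this thread, not a full record.

```python
import re

# Matches a GenBank feature qualifier such as /protein_id="NP_000083.3"
PROTEIN_ID = re.compile(r'/protein_id="([^"]+)"')


def protein_ids(flatfile_text, strip_version=True):
    """Return every protein_id found in a GenBank flat-file text."""
    ids = PROTEIN_ID.findall(flatfile_text)
    if strip_version:
        # Drop the ".3"-style version suffix so IDs match unversioned lists.
        ids = [accession.split(".")[0] for accession in ids]
    return ids


# Abbreviated excerpt: just the qualifiers quoted in the question.
EXCERPT = '''
                     /product="collagen alpha-4(IV) chain precursor"
                     /protein_id="NP_000083.3"
'''

# Map each mRNA accession to the protein IDs found in its own record.
mapping = {"NM_000092": protein_ids(EXCERPT)}
print(mapping)
```

Because each mRNA record is parsed on its own, the resulting mapping is per transcript rather than per gene, and it never passes through Ensembl IDs.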

0
Uma (0) wrote 9.7 years ago:

bioDBnet's dbWalk tool can be used to define the path to be used for conversions. So in this case the bioDBnet path would be 'RefSeq mRNA Accession->RefSeq Protein Accession'.

Written 9.7 years ago by Uma (0)