Question: No corresponding Gene Symbol for Affymetrix Probe Set ID
3.5 years ago by Amir Hajian, Tehran

I've downloaded a breast cancer expression profile dataset from NCBI (platform GPL570), which has 54,675 rows. The rows are probes, and I want to convert them to gene symbols so I can pass the data to GENIE3. But I've run into these problems:

1. 12,227 rows of this data have no corresponding gene symbol. How can I deal with them?

2. As far as I know, the human genome has 20,000-25,000 genes, and this dataset, leaving out the rows without gene symbols, has 21,025 rows with unique gene symbols. Doesn't that exceed the acceptable range?

I also had the problem of a many-to-one relation between probe IDs and gene symbols, but I think it would be OK to take the average expression value over all probes that share one gene symbol.
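The averaging idea described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a recommended pipeline: the function name `collapse_probes_to_genes`, the probe IDs, the gene symbols and the expression values are all made up, and it assumes that probes with a missing or empty symbol are simply dropped.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes_to_genes(expression, probe_to_symbol):
    """Average probe-level rows that share a gene symbol.

    expression: dict of probe ID -> list of values, one per sample
    probe_to_symbol: dict of probe ID -> gene symbol ("" or missing = unannotated)
    """
    grouped = defaultdict(list)
    for probe_id, values in expression.items():
        symbol = probe_to_symbol.get(probe_id)
        if not symbol:
            continue  # assumption: unannotated probes are dropped
        grouped[symbol].append(values)
    # zip(*rows) walks sample by sample across all probes of one gene
    return {
        symbol: [mean(column) for column in zip(*rows)]
        for symbol, rows in grouped.items()
    }


# Toy data (hypothetical probe IDs, symbols and values)
expression = {
    "probe_a": [2.0, 4.0],
    "probe_b": [4.0, 8.0],
    "probe_c": [1.0, 1.0],
    "probe_d": [5.0, 5.0],
}
probe_to_symbol = {
    "probe_a": "GENE1",
    "probe_b": "GENE1",
    "probe_c": "GENE2",
    "probe_d": "",  # no gene symbol
}

gene_expression = collapse_probes_to_genes(expression, probe_to_symbol)
print(gene_expression)
```

Averaging is only one possible choice. Keeping the probe with the highest mean or the highest variance per gene is another common option, and which one fits depends on the downstream analysis.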

Can anyone help me?

written 3.5 years ago by Amir Hajian

You should get the probe sequences and map them to a reference genome of your choice. Any probe annotation provided by vendors or study authors several years ago is very likely obsolete, don't rely on it for serious work.

written 3.5 years ago by Jean-Karim Heriche

So, are you suggesting that I replace the gene symbols in my dataset with the ones from the original "Human Genome U133 Plus 2.0 Array" annotation?
If yes, there are some probe IDs with no gene symbol in the original annotation, too!
And there are some probe IDs whose gene symbols don't match between the two datasets. Does that support what you are saying?

written 3.5 years ago by Amir Hajian

What I meant is that you should get the sequence of each probe and map it (e.g. with BLAST) to a current annotated genome reference to find out which current gene(s) each probe represents. Usually you have no idea how up-to-date and accurate the vendor-provided association between probe ID and gene symbol is. For probes designed some time ago, you'll always have discrepancies when mapping them to a more recent genome. For example, some probes that targeted a unique gene back when they were designed will now target nothing or several genes. Also, the notion of a gene depends on which reference annotation you're using, e.g. a gene in Ensembl is not the same as a gene in Entrez. What you do with probes that map to multiple genes is usually problem-dependent and up to you to decide.
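Once the probes have been re-mapped, a first step is to see how many of them hit no gene, exactly one gene, or several genes. The sketch below assumes the alignment results have already been reduced to (probe ID, gene ID) pairs, for instance by intersecting hits with a gene annotation; that input format, the function name `classify_probes` and all IDs are hypothetical.

```python
from collections import defaultdict


def classify_probes(all_probe_ids, probe_gene_hits):
    """Sort probes into unmapped / unique / multi by number of distinct genes hit."""
    genes_per_probe = defaultdict(set)
    for probe_id, gene_id in probe_gene_hits:
        genes_per_probe[probe_id].add(gene_id)

    classes = {"unmapped": [], "unique": [], "multi": []}
    for probe_id in all_probe_ids:
        n_genes = len(genes_per_probe.get(probe_id, ()))
        if n_genes == 0:
            classes["unmapped"].append(probe_id)
        elif n_genes == 1:
            classes["unique"].append(probe_id)
        else:
            classes["multi"].append(probe_id)
    return classes


# Toy input (hypothetical probe and gene IDs)
all_probe_ids = ["probe_a", "probe_b", "probe_c", "probe_d"]
probe_gene_hits = [
    ("probe_a", "GENE1"),
    ("probe_a", "GENE1"),  # repeated hit to the same gene counts once
    ("probe_b", "GENE1"),
    ("probe_b", "GENE2"),  # probe_b hits two different genes
    ("probe_d", "GENE3"),
]

probe_classes = classify_probes(all_probe_ids, probe_gene_hits)
print(probe_classes)
```

Whether the "multi" probes are discarded, assigned to one gene, or kept for every gene they hit remains a decision for the specific analysis.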

Given that you mention the Human Genome U133 Plus 2.0 Array, you might want to have a look at the hgu133plus2.db Bioconductor package.

modified 3.5 years ago • written 3.5 years ago by Jean-Karim Heriche

Actually I am working with plants, but did you try the NetAffx™ Analysis Center? It has many options for human, from ID conversion onwards. For example, by entering a probe sequence you can retrieve the gene symbol.

modified 3.5 years ago • written 3.5 years ago by F

Thank you so much! It helped reduce the unknown probes from 12,000 to 8,000. But there are still some probe IDs with no gene symbol in the NetAffx database, too!

written 3.5 years ago by Amir Hajian