Question: Best way to annotate Affy's U219 in R
2.1 years ago by
DaniCee10 wrote:

Hello everyone,

I want to annotate probe IDs from Affymetrix GeneTitan U219 Array in R.

For that purpose I use the Bioconductor package hgu219.db to retrieve just the EntrezGene IDs.

However, quite a lot of probe IDs do not seem to be present in hgu219.db: more than 3,000 in my list of more than 49,000.

This is a MWE:

# keep only the rows for five of the probes that get no annotation
xx.df <- subset(xx.df, probe_id %in% c("11715106_x_at", "11715107_s_at", "11715111_s_at", "11715138_s_at", "11715140_s_at"))

I just included 5 of the probes I cannot annotate, which seem pretty standard; in fact, I can find them in the table here, and they all seem to have EntrezGene ID annotation.

So my question is: Is there a reason why these probes are not listed in hgu219.db? Is there any other preferred way to annotate them in R?

Many thanks!

written 2.1 years ago by DaniCee10

I don't know this R package but my strategy with sequence-based reagents is to always remap them myself to a version of Ensembl and keep working with this version of Ensembl for the rest of the project so that annotations are consistent throughout the project.

written 2.0 years ago by Jean-Karim Heriche21k

Could you provide a link?

written 2.0 years ago by DaniCee10

For the time being, I have downloaded the table from

written 2.0 years ago by DaniCee10

What I usually do is download the annotation file (.txt) from the manufacturer's website, which gives you the Entrez ID for each probe. It should be somewhere here:

written 2.0 years ago by Selenocysteine550

This is always my approach too. I then do the annotation manually within R by mapping the probe IDs to the CSV / TSV annotation file.
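As a rough illustration of that kind of lookup, here is a minimal base R sketch. The probe IDs, Entrez IDs, and column names are made-up placeholders, and the small `annot` data frame stands in for the table you would normally read from the downloaded file with `read.csv()` or `read.delim()`.

```r
# Placeholder stand-in for the downloaded annotation table (all values made up).
annot <- data.frame(
  probe_id  = c("probe_1_at", "probe_2_s_at", "probe_3_x_at"),
  entrez_id = c("1001", "1002", NA),
  stringsAsFactors = FALSE
)

# Probe IDs from the expression data; the last one is absent from the table.
probes <- c("probe_3_x_at", "probe_1_at", "probe_9_at")

# match() keeps the order of `probes` and returns NA for IDs not in the table.
idx <- match(probes, annot$probe_id)
annotated <- data.frame(
  probe_id  = probes,
  entrez_id = annot$entrez_id[idx],
  stringsAsFactors = FALSE
)

# Probes with no gene assignment: either missing from the table
# or listed there without an Entrez ID.
unannotated <- annotated$probe_id[is.na(annotated$entrez_id)]
```

Using `match()` rather than `merge()` keeps the rows in the same order as the expression matrix, which makes it harder to misalign annotation and data by accident.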

written 2.0 years ago by Kevin Blighe51k

Vendor information can often be unreliable. It is frequently undocumented and either inaccurate, out of date, or mapped to the wrong reference. Just download the probe sequences and map them yourself to the reference genome you're using in your project. Ensembl also maps some microarray probes, and these mappings can be retrieved from BioMart.

written 2.0 years ago by Jean-Karim Heriche21k

This seems like the place to start from: I will download the annotation table and go with it. Doing the mapping myself sounds unnecessary, or is it? Anyway, it is weird that the Bioconductor package hgu219.db does not provide full annotation when it was updated as recently as 20 October 2017...

written 2.0 years ago by DaniCee10

My view is that doing the mapping yourself is often necessary, for the reasons I listed above. If you know how the vendor did the annotations (e.g. what software was used for the alignment, which parameters were used, what the criteria were for multiple matches and for mismatches, which reference genome was used ...), then go ahead and use these annotations. Think about what you would do if, towards the end of the project, you realized that your best candidate gene derives from probes that actually map to two different places in the reference genome you use for your project.

written 2.0 years ago by Jean-Karim Heriche21k

I think Jean is right: given that the amount of work is not so big, you can annotate the probes with the file, but also check all of the annotations yourself in parallel and see how many are inconsistent. It is generally good bioinformatics practice to have complete control of your data and its sources, and to be sure that you are looking at good data. Of course, in a large-scale study the difference could be minimal, but it still gives you better control and understanding of your data.
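One way to count the inconsistencies could look like the sketch below. Both tables, their column names, and all IDs are invented for illustration; `vendor` stands for the downloaded annotation file and `own` for the result of your own mapping.

```r
# Two placeholder probe-to-gene tables (all values made up).
vendor <- data.frame(probe_id = c("p1", "p2", "p3", "p4"),
                     gene     = c("G1", "G2", "G3", NA),
                     stringsAsFactors = FALSE)
own    <- data.frame(probe_id = c("p1", "p2", "p3", "p5"),
                     gene     = c("G1", "G9", NA, "G5"),
                     stringsAsFactors = FALSE)

# Full outer join, so probes present in only one source are kept.
both <- merge(vendor, own, by = "probe_id", all = TRUE,
              suffixes = c(".vendor", ".own"))

# Classify each probe by how the two sources compare.
no_vendor <- is.na(both$gene.vendor)
no_own    <- is.na(both$gene.own)
both$status <- ifelse(no_vendor & no_own, "unannotated",
               ifelse(no_vendor | no_own, "one_source_only",
               ifelse(both$gene.vendor == both$gene.own, "agree", "disagree")))

# Summary of how many probes fall into each class.
status_counts <- table(both$status)
```

The probes in the "disagree" and "one_source_only" classes are the ones worth inspecting by hand before trusting either source.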

modified 2.0 years ago • written 2.0 years ago by Selenocysteine550

What would you use for that? bowtie2? Could you develop it into an answer? Like where to get the probe sequences and the reference genome from, and which are the relevant commands to use for mapping...

modified 2.0 years ago • written 2.0 years ago by DaniCee10

Align each probe to your genome of choice, e.g. the genome of the cell line you're doing experiments with. For this you can use any suitable software, e.g. blastn, exonerate, etc. Then you associate probes with transcripts, dealing along the way with mismatches (i.e. how many mismatches do you tolerate before you no longer consider a probe to target a particular transcript?). After this you need to deal with probes that have multiple targets.
For more details, have a look at this paper describing the Ensembl pipeline for annotating microarray probes.
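The post-alignment steps could be sketched in base R roughly as follows. This is not the Ensembl pipeline: the `hits` table, its column names, and the tolerance of one mismatch are assumptions made for the example, standing in for alignment output you would parse yourself.

```r
# Placeholder alignment hits, one row per probe/transcript match (values made up).
hits <- data.frame(
  probe_id   = c("p1", "p1", "p2", "p3", "p3", "p4"),
  transcript = c("T1", "T2", "T3", "T4", "T5", "T6"),
  gene       = c("G1", "G1", "G2", "G3", "G4", "G5"),
  mismatches = c(0, 1, 0, 0, 0, 3),
  stringsAsFactors = FALSE
)

max_mismatches <- 1  # assumed tolerance; choose it for your own experiment

# Step 1: drop hits with more mismatches than tolerated.
kept <- hits[hits$mismatches <= max_mismatches, ]

# Step 2: count the distinct genes each probe still hits.
genes_per_probe <- tapply(kept$gene, kept$probe_id,
                          function(g) length(unique(g)))

# Step 3: separate unambiguous probes from multi-target and unmapped ones.
unique_probes    <- names(genes_per_probe)[genes_per_probe == 1]
ambiguous_probes <- names(genes_per_probe)[genes_per_probe > 1]
unmapped_probes  <- setdiff(unique(hits$probe_id), kept$probe_id)
```

Here a probe matching two transcripts of the same gene still counts as unambiguous at the gene level; whether that is acceptable depends on whether you analyse genes or transcripts.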

written 2.0 years ago by Jean-Karim Heriche21k

Hi @Jean-Karim (or anyone), could you develop this comment into an answer with the proper software to use, proper mismatch thresholds, where to obtain the proper human genome to map the probes to, where to obtain each probe sequence, etc... Many thanks!

modified 22 months ago • written 22 months ago by DaniCee10

I can't be much more precise, as there are choices that you have to make yourself. For software, you can use any tool that can deal with the probes you have. Parameters depend on the software you choose. The mismatch threshold depends on what you think the hybridization conditions can tolerate. For example, if experimental conditions were such that only perfectly matching probes would anneal, then I would reject any target that has one or more mismatches to the probe under consideration. Check the paper I linked to: they used exonerate, and the parameters are given. Finally, as to the choice of reference, I normally work with Ensembl because the data is easily available and well organized, and I find the Perl API very convenient.
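To make the perfect-match rule concrete, here is a small base R sketch that counts mismatches between a probe and candidate target windows of the same length. The sequences are invented, `count_mismatches` is a hypothetical helper, and it assumes the target has already been extracted on the same strand and at the aligned position; real aligners report this number for you.

```r
# Count position-wise differences between two equal-length sequences.
count_mismatches <- function(probe, target) {
  stopifnot(nchar(probe) == nchar(target))
  p <- strsplit(toupper(probe), "")[[1]]
  t <- strsplit(toupper(target), "")[[1]]
  sum(p != t)
}

probe <- "ACGTACGTACGTACGTACGTACGTA"  # made-up 25-mer

# Made-up target windows differing from the probe at 0, 1 and 3 positions.
targets <- c(
  perfect   = "ACGTACGTACGTACGTACGTACGTA",
  one_off   = "ACGTACGTACGTTCGTACGTACGTA",
  three_off = "TCGTACGTACGTTCGTACGTACGTT"
)

n_mm <- vapply(targets, count_mismatches, integer(1), probe = probe)

# Strict rule from the comment above: reject any target with one or more mismatches.
accepted <- names(n_mm)[n_mm == 0]
```

Relaxing the rule is then a matter of changing the comparison, e.g. `n_mm <= 1`, once you have decided what the hybridization conditions can tolerate.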

written 22 months ago by Jean-Karim Heriche21k