Question

Best way to annotate Affy's U219 in R

0

6.5 years ago

DaniCee ▴ 10

Hello everyone,

I want to annotate probe IDs from the Affymetrix GeneTitan U219 array in R.

For that purpose I use the Bioconductor package hgu219.db to retrieve just the EntrezGene IDs.

However, quite a lot of probe IDs do not seem to be present in hgu219.db: more than 3,000 of the 49,000+ in my list.

This is a MWE:

#source("https://bioconductor.org/biocLite.R")
#biocLite("hgu219.db")
library(hgu219.db)
xx.df <- as.data.frame(hgu219ENTREZID)
subset(xx.df, probe_id %in% c("11715106_x_at","11715107_s_at","11715111_s_at","11715138_s_at","11715140_s_at"))

I just included 5 of the probes I cannot annotate, which seem pretty standard; in fact, I can find them in the table here, and they all seem to have EntrezGene ID annotation.

So my question is: Is there a reason why these probes are not listed in hgu219.db? Is there any other preferred way to annotate them in R?

Many thanks!

R microarray annotation affymetrix u219 • 4.1k views

ADD COMMENT • link 6.5 years ago by DaniCee ▴ 10

1

I don't know this R package but my strategy with sequence-based reagents is to always remap them myself to a version of Ensembl and keep working with this version of Ensembl for the rest of the project so that annotations are consistent throughout the project.

ADD REPLY • link 6.5 years ago by Jean-Karim Heriche 27k

0

Could you provide a link?

ADD REPLY • link 6.5 years ago by DaniCee ▴ 10

0

For the time being, I have downloaded the table from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL13667

ADD REPLY • link 6.5 years ago by DaniCee ▴ 10

1

What I usually do is download the annotation file (a .txt file) from the manufacturer's website, which gives you the Entrez ID for each probe. It should be somewhere here: http://www.affymetrix.com/support/technical/byproduct.affx?product=HG-U219

ADD REPLY • link 6.5 years ago by Selenocysteine ▴ 620

0

This is always my approach too. I then do the annotation manually within R by mapping the probe IDs to the CSV / TSV annotation file.
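
A minimal base-R sketch of that kind of lookup, using a made-up table in place of the real annotation file. The column names, the ` /// ` separator for probes listing several genes, and the `---` placeholder for missing values are assumptions about the vendor file's layout, and the Entrez IDs are invented for illustration, so check everything against the file you actually download.

```r
# Toy stand-in for the downloaded annotation file (Entrez IDs are invented).
annot <- data.frame(
  probe_id = c("11715106_x_at", "11715107_s_at", "11715111_s_at"),
  entrez   = c("1001 /// 1002", "1003", "---"),
  stringsAsFactors = FALSE
)

# Probe IDs from the expression data; the last one is not in the table.
my_probes <- c("11715107_s_at", "11715106_x_at", "11715111_s_at", "99999999_at")

# match() keeps the order of my_probes and gives NA for unknown probes.
idx <- match(my_probes, annot$probe_id)
result <- data.frame(probe_id = my_probes,
                     entrez   = annot$entrez[idx],
                     stringsAsFactors = FALSE)

# Treat the assumed "---" placeholder as missing.
result$entrez[result$entrez %in% "---"] <- NA

# Split multi-gene entries: one character vector of IDs per probe.
entrez_list <- strsplit(result$entrez, " /// ", fixed = TRUE)
names(entrez_list) <- result$probe_id

# Probes that end up without any annotation.
unannotated <- result$probe_id[is.na(result$entrez)]
```

Using `match()` rather than `merge()` keeps the rows in the same order as the expression data, which makes it harder to misalign annotations and measurements by accident.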

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

2

Vendor information can often be unreliable. It is frequently undocumented and may be inaccurate, out of date, or mapped to the wrong reference. Just download the probe sequences and map them yourself to the reference genome you're using in your project. Ensembl also maps some microarray probes, and these mappings can be retrieved from BioMart.

ADD REPLY • link 6.5 years ago by Jean-Karim Heriche 27k

0

This seems like the place to start from: http://www.affymetrix.com/support/technical/byproduct.affx?product=HG-U219 I will download the annotation table and go with it. Doing the mapping myself sounds unnecessary, or is it? Anyway, it is weird that Bioconductor's hgu219.db does not provide full annotation when it was updated as recently as 20 October 2017...

ADD REPLY • link 6.5 years ago by DaniCee ▴ 10

0

My view is that doing the mapping yourself is often necessary for the reasons I listed above. If you know how the vendor did the annotations (e.g. what software was used for the alignment, which parameters were used, what the criteria were in case of multiple matches or mismatches, which reference genome was used, ...) then go ahead and use these annotations. Think about what you would do if, towards the end of the project, you realized that your best candidate gene derives from probes that actually map to two different places in the reference genome you use for your project.

ADD REPLY • link 6.5 years ago by Jean-Karim Heriche 27k

0

I think Jean is right: given that the amount of work is not so big, you can annotate the probes with the file, but also check all of the annotations yourself in parallel and see how many are inconsistent. It is generally good bioinformatics practice to have complete control of your data and its sources, and to be sure that what you are looking at is sound. Of course, in a large-scale study the difference could be minimal, but it still gives you better control and understanding of your data.
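
One way to run that consistency check, sketched in base R on two invented probe-to-gene tables. The table and column names (`vendor`, `remapped`, `gene`) and all values are hypothetical; in practice one table would come from the vendor file and the other from your own remapping.

```r
# Two toy probe-to-gene tables (all values invented).
vendor <- data.frame(probe_id = c("p1", "p2", "p3", "p4"),
                     gene     = c("G1", "G2", "G3", NA),
                     stringsAsFactors = FALSE)
remapped <- data.frame(probe_id = c("p1", "p2", "p3", "p5"),
                       gene     = c("G1", "G9", NA, "G5"),
                       stringsAsFactors = FALSE)

# Full outer join, so probes present in only one source are kept.
both <- merge(vendor, remapped, by = "probe_id", all = TRUE,
              suffixes = c(".vendor", ".remapped"))

# Classify each probe: no gene in at least one source, same gene, or conflict.
both$status <- ifelse(is.na(both$gene.vendor) | is.na(both$gene.remapped),
                      "missing",
                      ifelse(both$gene.vendor == both$gene.remapped,
                             "agree", "disagree"))

status_counts <- table(both$status)
conflicts <- both[both$status == "disagree", ]
```

The rows in `conflicts` are the ones worth inspecting by hand before trusting either source.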

ADD REPLY • link 6.5 years ago by Selenocysteine ▴ 620

0

What would you use for that? bowtie2? Could you develop it into an answer? Like where to get the probe sequences and the reference genome from, and which are the relevant commands to use for mapping...

ADD REPLY • link 6.5 years ago by DaniCee ▴ 10

0

Align each probe to your genome of choice, e.g. the genome of the cell line you're doing experiments with. For this you can use any suitable software, e.g. blastn or exonerate. Then associate probes with transcripts, deciding along the way how to handle mismatches (i.e. how many mismatches do you tolerate before you stop considering a probe to target a particular transcript?). After this you need to deal with probes that have multiple targets.
For more details, have a look at this paper describing the Ensembl pipeline for annotating microarray probes.
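
The post-alignment steps described above (apply a mismatch threshold, then separate probes with one target from probes with several) could look roughly like this in base R. This is only a sketch on an invented hits table: the column names and the `max_mismatches` value are assumptions, and real aligner output would first have to be parsed into this shape.

```r
# Toy alignment results (invented): one row per probe-to-transcript hit.
hits <- data.frame(
  probe_id   = c("p1", "p1", "p2", "p3", "p3", "p4"),
  transcript = c("T1", "T2", "T3", "T4", "T5", "T6"),
  gene       = c("G1", "G1", "G2", "G3", "G4", "G5"),
  mismatches = c(0, 1, 0, 0, 0, 2),
  stringsAsFactors = FALSE
)

# Assumption: hybridization conditions tolerate perfect matches only.
max_mismatches <- 0
kept <- hits[hits$mismatches <= max_mismatches, ]

# Count the distinct genes targeted by each probe that survived filtering.
genes_per_probe <- tapply(kept$gene, kept$probe_id,
                          function(g) length(unique(g)))

unique_probes    <- names(genes_per_probe)[genes_per_probe == 1]
ambiguous_probes <- names(genes_per_probe)[genes_per_probe > 1]

# Probes with no hit left after applying the threshold.
dropped_probes <- setdiff(unique(hits$probe_id), kept$probe_id)
```

What to do with `ambiguous_probes` (discard them, or keep them flagged) is a project-level decision; the sketch only separates them out.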

ADD REPLY • link 6.5 years ago by Jean-Karim Heriche 27k

0

Hi @Jean-Karim (or anyone), could you develop this comment into an answer covering which software to use, sensible mismatch thresholds, where to obtain the human genome to map the probes to, where to obtain each probe sequence, etc.? Many thanks!

ADD REPLY • link 6.3 years ago by DaniCee ▴ 10

0

I can't be much more precise, as there are choices that you have to make yourself. For software, you can use anything that can deal with the probes you have. Parameters depend on the software you choose. The mismatch threshold depends on what you think the hybridization conditions can tolerate. For example, if experimental conditions were such that only perfectly matching probes would anneal, then I would reject any target that has one or more mismatches to the probe under consideration. Check the paper I linked to: they used exonerate and the parameters are given. Finally, as to the choice of reference, I normally work with Ensembl because the data is easily available and well organized, and I find the Perl API very convenient.

ADD REPLY • link 6.3 years ago by Jean-Karim Heriche 27k