Question

Common Probes - Ensemble Problem

1

Entering edit mode

13.7 years ago

Julien Textoris ▴ 430

Hi all,

I want to do a meta-analysis on various datasets. First, I want to reduce each dataset to the probes common to all datasets (from various platforms for example).

I choose to check for common ensembl IDs as a reference (but i could have take something else)

the code below is how I started for 2 datasets, but i don't know how to finish, and how to do for more than 2 ??

x = hgu133aENSEMBL
length(x)
# [1] 22283
x.mapped = mappedkeys(x)
length(x.mapped)
# [1] 19742
#I also exclude probes that were filtered by pre-processing
length(which(x.mapped%in%row.names(martinez.data) == TRUE))
# [1] 19742 so none
xx = as.list(x[x.mapped])
xx[1:5]
#Some probes points to more than one ENSEMBLID, I only keep the ONE to ONE match
x.ind = sapply(xx,length)<2
x.names = unlist(as.list(x[x.mapped])[which(x.ind == TRUE)])
length(x.names)
# [1] 19127

#x.names contains the ensembl gene IDs

y = hgug4112aENSEMBL
length(y)
# [1] 41000
y.mapped=mappedkeys(y)
length(y.mapped)
# [1] 27382
#I also exclude probes that were filtered by pre-processing
length(which(y.mapped%in%row.names(vikram.data) == TRUE))
# [1] 20922  smaller number so i keep only the match
y.mapped = y.mapped[which(y.mapped%in%row.names(vikram.data) == TRUE)]
yy = as.list(y[y.mapped])
yy[1:5]
y.ind = sapply(yy,length)<2
y.names = unlist(as.list(y[y.mapped])[which(y.ind == TRUE)])
length(y.names)
# [1] 20214

Here, I don't know how to do, because, with match or %in%, i don't get the same set of ensembl IDs for x and y ...

I'm sure there is something far more simple, but i can't see it know.

Thanks for your help,

Julien

microarray meta • 2.6k views

ADD COMMENT • link 13.7 years ago by Julien Textoris ▴ 430

score 1 · Answer 1 · 2011-08-29

Did you look at the set stuff in R?

Also, I'm not sure exactly what you're doing, but it looks like in this section:

yy = as.list(y[y.mapped])
yy[1:5]
y.ind = sapply(yy,length)<2
y.names = unlist(as.list(y[y.mapped])[which(y.ind == TRUE)])

You are excluding probes that map to the same ENSEMBL ID, not just taking 1 of them. I think you can simplify this entire process by using the sets stuff above, wrapping that in a function that you call for each set, and then doing an intersection on the names.

score 0 · Answer 2 · 2011-08-29

0

Entering edit mode

13.7 years ago

Jeremy Leipzig 23k

merge(toTable(x[x.mapped]),toTable(y[y.mapped]),by="ensembl_id")

??

ADD COMMENT • link 13.7 years ago by Jeremy Leipzig 23k

score 0 · Answer 3 · 2011-08-29

Thanks to both, i can propose something here for other people that might have the same problem :

Suppose you want to analyse 4 datasets A, B, C, and D, which come from 3 platforms hgug4112a, hgu133a and hgu95av2.

Due to preprocessing, the sets of probes on each dataset may be smaller than the total set of probes. this is taken into account.

The script is :

x = hgu133aENSEMBL
x.mapped = mappedkeys(x)
# The line below is to only take into account probes that were not 
# filtered during preprocessing. There are 2 datasets from the same platform here.
x.mapped = x.mapped[which(x.mapped%in%row.names(A.data) == TRUE &
      x.mapped%in%row.names(B.data) == TRUE )]

y = hgug4112aENSEMBL
y.mapped=mappedkeys(y)
# The line below is to only take into account probes that were not 
# filtered during preprocessing. There is only 1 dataset from this platform.
y.mapped = y.mapped[which(y.mapped%in%row.names(C.data) == TRUE)]

# m1 is the the first intersection of ENSEMBL_IDs
m1 = intersect(toTable(x[x.mapped])$ensembl_id,toTable(y[y.mapped])$ensembl_id)

length(m1)
#[1] 10461

#Thihs is for the last (4th) dataset
z = hgu95av2ENSEMBL
z.mapped = mappedkeys(z)
z.mapped = z.mapped[which(z.mapped%in%row.names(D.data) == TRUE)]

# If i'm not wrong, the order in which you cascade the intersections does not matter.
# If we had a 5th dataset, we would do m3 = intersect(m2, toTable( ...
m2 = intersect(m1,toTable(z[z.mapped])$ensembl_id)
length(m2)
# [1] 7476

x.names = unlist(as.list(x[x.mapped]))
x.probes = names(x.names[match(m2,x.names)])
# Match will take only the "first" match, which is a solution for the problem discussed below
length(x.probes)
# [1] 7476

y.names = unlist(as.list(y[y.mapped]))
y.probes = names(y.names[match(m2,y.names)])
length(y.probes)
# [1] 7476

z.names = unlist(as.list(z[z.mapped]))
z.probes = names(z.names[match(m2,z.names)])
length(z.probes)
# [1] 7476

Sorry for that, but there is a mistake in the code above :

DON'T USE THIS SYNTAX : unlist(as.list(x[x.mapped])) => because to avoid duplicate names, it adds a number after the probe name (and toTable don't) :

unlist(as.list(x[x.mapped]))[1:1000]
       1007_s_at1        1007_s_at2        1007_s_at3        1007_s_at4 
"ENSG00000137332" "ENSG00000204580" "ENSG00000215522" "ENSG00000230456" 
       1007_s_at5           1053_at            117_at            121_at 
"ENSG00000234078" "ENSG00000049541" "ENSG00000173110" "ENSG00000125618"

toTable(x[x.mapped])[1:1000,]
        probe_id      ensembl_id
1      1007_s_at ENSG00000137332
2      1007_s_at ENSG00000204580
3      1007_s_at ENSG00000215522
4      1007_s_at ENSG00000230456
5      1007_s_at ENSG00000234078

So the right code is :

x.names = toTable(x[x.mapped])
x.probes = x.names$probe_id[match(m2,x.names$ensembl_id)]
length(x.probes)
# [1] 7476

y.names = toTable(y[y.mapped])
y.probes = y.names$probe_id[match(m2,y.names$ensembl_id)]
length(y.probes)
# [1] 7476

z.names = toTable(z[z.mapped])
z.probes = z.names$probe_id[match(m2,z.names$ensembl_id)]
length(z.probes)
# [1] 7476

And then, you can construct a reduced eset with this :

A.eset = new("ExpressionSet",
                  exprs = as.matrix(A.data[
            which(y.probes%in%row.names(A.data)),1:9],
            dimnames=list(c(y.probes,row.names(A.pData)))),
            phenoData = A.phenoData, experimentData = A.expData,
            annotation = "hgug4112a")

B.eset = new("ExpressionSet",
            exprs = as.matrix(B.data[
            which(x.probes%in%row.names(B.data)),7:15],
            dimnames=list(c(x.probes,row.names(B.pData)))),
            phenoData = B.phenoData, experimentData = B.expData,
            annotation = "hgu133a")

etc ...

So this works fine, but it comes with another problem i can't answer now, how to choose between probes for the same ID. For exemple, in Jeremy Leipzig example, we get a data.frame like that :

merge(toTable(x[x.mapped]),toTable(y[y.mapped]),by="ensembl_id")[1:10,]
        ensembl_id  probe_id.x   probe_id.y
1  ENSG00000000419   202673_at  A_23_P68472
2  ENSG00000000457    41329_at  A_23_P74320
3  ENSG00000000457 205607_s_at  A_23_P74320
4  ENSG00000000460 220840_s_at  A_23_P11862
5  ENSG00000000938 208438_s_at  A_24_P51517
6  ENSG00000000938 208438_s_at A_23_P103932
7  ENSG00000000938 208438_s_at A_24_P316634
8  ENSG00000001084   202922_at A_23_P352879
9  ENSG00000001084   202922_at A_23_P145114
10 ENSG00000001084 202923_s_at A_23_P352879
11 ENSG00000001084 202923_s_at A_23_P145114

The dimensions of the resulted data.frame is 35006 x 3, which is higher than the original tables' dimensions that were : x => 21363 x 2 and y => 22712 x 2 (because all mappings are given, see lines 5 to 7, and 8 to 11).

I don't know wich slot of the annotation object is the best for the mapping. I choosed ensembl_id but one could have take SYMBOL, or something else. I think CHRLOC is too precise, but SYMBOL might exclude non annotated genes ? Do you know if we could have a slot in the annotation object to mapped probes to an exon ? That might help to match probesets one to each others ?

Thanks again for the help.

Julien