Question

How to deal with "///" separator in Geo2r when loading data to enrichr

3.7 years ago

vladimir.vinarsky • 0

Hello, I ran into a formatting issue when trying to pass GEO2R output to Enrichr, and I'm stuck. Could anybody help with the immediate practical question (1) and the broader statistical question (2)?

I am reanalyzing microarray datasets obtained from GEO, running differential gene expression with GEO2R, with the idea of uploading the differentially expressed genes to Enrichr to see the changes in GO terms, transcription, etc. However, some of the probes are annotated with more than one gene. These genes are separated by a "///" separator, which interferes with Enrichr's gene list input (one gene per line, probably no special characters).

(1) Does anybody know how to get around this problem? Is there a tool for the conversion? There used to be a GEO2Enrichr extension, but it is no longer supported.

(2) How should these multiple entries per probe be handled statistically? If the probe is for two isoforms of a gene and I simply add both, that would introduce an error in the dataset. If I keep just one of the isoforms, then my data are also incorrect. What is the correct approach?

Thanks for help

Vladimir

geo2r enrichr R • 974 views

updated 3.7 years ago by dsull ★ 5.8k • written 3.7 years ago by vladimir.vinarsky • 0

score 2 · Answer 1 · 2020-08-04

I'm not sure what your data.frame looks like exactly, so I'll make some example data that should show you how to deal with the delimited gene names in the actual data.

df <- data.frame(gene=c("A///B", "C", "D///E///F"), log2FC=rnorm(3, 2, 1))

> df
       gene   log2FC
1     A///B 1.920434
2         C 1.652814
3 D///E///F 2.032746

There's a handy tidyr function that will let you separate those delimited values into separate rows.

library("tidyr")

df <- separate_rows(df, gene, sep="///")

> df
# A tibble: 6 x 2
  gene  log2FC
  <chr>  <dbl>
1 A       1.92
2 B       1.92
3 C       1.65
4 D       2.03
5 E       2.03
6 F       2.03
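To go from the split symbols to something Enrichr accepts (one symbol per line, no duplicates), a minimal base R sketch could look like the one below. It assumes the column is called `gene`, as in the toy data, and the output file name `enrichr_genes.txt` is only a placeholder. The `trimws()` call is there in case the separator comes with surrounding spaces (" /// "), which some annotation files use.

```r
# Toy data in the same shape as the example; "gene" is an assumed column name.
df <- data.frame(gene = c("A///B", "C", "D /// E /// F", "A"),
                 stringsAsFactors = FALSE)

# Split on the literal "///" (fixed = TRUE, so it is not treated as a regex),
# strip surrounding whitespace, then drop missing/empty entries and duplicates.
symbols <- unlist(strsplit(df$gene, "///", fixed = TRUE))
symbols <- trimws(symbols)
symbols <- unique(symbols[!is.na(symbols) & nzchar(symbols)])

# Write one symbol per line; the file name is a placeholder.
out_file <- file.path(tempdir(), "enrichr_genes.txt")
writeLines(symbols, out_file)
```

The contents of that file can then be pasted into the Enrichr input box. Note that this lists every gene annotated to a probe; whether that is appropriate is the statistical question (2).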

As for which isoforms to keep, that is a difficult question to answer, so I'll let someone else chime in if they feel more comfortable.

score 1 · Answer 2 · 2020-08-04

Regarding multiple probes per gene, there's no consensus on what is best, especially since these are microarrays, where different probes don't necessarily represent distinct isoforms. For p-values, you could use something like Fisher's method to aggregate p-values across multiple probes. For log2FC, it's a bit more complicated: you could use the probe with the greatest fold change, you could average the probes, etc., but you have to interpret your results accordingly. For example, let's assume the probes do represent distinct isoforms: one isoform goes up a huge amount while three other isoforms go down a moderate amount. What's your interpretation of these results? And what if those three other isoforms show only a very tiny change?
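As a rough illustration of those options, here is a base R sketch that collapses probe-level results to one row per gene. The data and the column names (`gene`, `log2FC`, `P`) are made up for the example, and `fisher_p` and `max_abs` are hypothetical helper names. Keep in mind that Fisher's method assumes independent p-values; probes for the same gene measured on the same samples are usually correlated, so the combined p-value will tend to be too optimistic.

```r
# Made-up probe-level results; column names are assumptions for illustration.
probes <- data.frame(
  gene   = c("A", "A", "A", "B", "C", "C"),
  log2FC = c(2.1, -0.4, -0.6, 1.3, 0.8, 1.0),
  P      = c(0.001, 0.20, 0.15, 0.03, 0.04, 0.01)
)

# Fisher's method: X = -2 * sum(log(p)) is chi-square distributed with
# 2k degrees of freedom under the null, assuming k independent p-values.
fisher_p <- function(p) {
  pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
}

# log2FC of the probe with the largest absolute change (sign is kept).
max_abs <- function(x) x[which.max(abs(x))]

# tapply() applies each summary within a gene; results are ordered by gene.
p_comb <- tapply(probes$P, probes$gene, fisher_p)
gene_level <- data.frame(
  gene        = names(p_comb),
  p_fisher    = as.vector(p_comb),
  log2FC_mean = as.vector(tapply(probes$log2FC, probes$gene, mean)),
  log2FC_max  = as.vector(tapply(probes$log2FC, probes$gene, max_abs))
)
```

Gene A in this toy table is the awkward case described above: one probe goes up strongly and two go down slightly, so `log2FC_mean` and `log2FC_max` tell quite different stories.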

Many RNA-seq workflows deal with this isoform issue just by counting the reads that align to a gene of interest (disregarding isoform information) or, when working with transcript-level alignments, by summing the transcript abundances. But with microarrays you can't sum probe intensities: for one, different probes don't necessarily represent different isoforms and don't necessarily cover the entire gene; for another, different probes have different affinities.

Alternatively, you can just leave it as is and report probe-level data (unless you have downstream analyses that need gene-level log2 fold changes).
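For contrast, the RNA-seq style of gene-level summarization mentioned above can be sketched in a few lines of base R. The abundance values, transcript IDs, and the `tx2gene` mapping are all hypothetical; this only illustrates the summing step, not any particular pipeline.

```r
# Made-up transcript abundances (e.g. TPM) for two samples; all names are hypothetical.
tx_abund <- matrix(c(10, 5, 0, 20,    # sample1
                     12, 3, 1, 25),   # sample2
                   ncol = 2,
                   dimnames = list(c("tx1", "tx2", "tx3", "tx4"),
                                   c("sample1", "sample2")))

# Hypothetical transcript-to-gene mapping.
tx2gene <- c(tx1 = "geneA", tx2 = "geneA", tx3 = "geneB", tx4 = "geneB")

# rowsum() adds up the rows that share a group label, giving one row per gene.
gene_abund <- rowsum(tx_abund, group = tx2gene[rownames(tx_abund)])
```

This works for transcript abundances because they are on a common scale; probe intensities are not, which is why the same trick does not carry over to microarrays.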