Question

Importing data from GEO

0

Entering edit mode

2.1 years ago

LHA_trash • 0

Hi everyone,

I am struggling since hours to import expression data from GEO. I am trying to work on this dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE151095 It is a Superseries. Some scRNAseq but also bulkRNA seq. I am especially interested in the bulkRNA seq samples (in detail: DC2 and DC3 samples with and without TLR as stated in the sample list).

I have tried this approach in R:

library(DESeq2)
library(tidyverse)
library(GEOquery)
gse<-getGEO(GEO='GSE151095', GSEMatrix=TRUE)
gse <- gse[[1]]
metadata <-pData(phenoData(gse))
exprs(gse)

GSM4565845 GSM4565846 GSM4565847 #and so on

So this does not seem to work, it only gives me the GSM characters. I also could not find a .csv file with countdata to import manually. Any advice, on how I can work on this dataset or on GEO datasets in general? Is there a recommended workflow?

Thanks in advance!

GEO GSE R • 1.6k views

ADD COMMENT • link 2.1 years ago by LHA_trash • 0

score 1 · Answer 1 · 2022-04-14

1

Entering edit mode

2.1 years ago

Kevin Blighe 87k

Hello,

getGEO() only works for retrieving the expression data for array data, not high throughput sequencing [data] (bulk RNA-seq, scRNA-seq, etc).

You will have to retrieve the data that is in the TAR file (see https://ftp.ncbi.nlm.nih.gov/geo/series/GSE151nnn/GSE151095/suppl/) and then use that manually. To import the MTX scRNA-seq data, you can use DropletUtils (R), or, indeed, Seurat, or something else.

Kind regards,

Kevin

ADD COMMENT • link 2.1 years ago by Kevin Blighe 87k

0

Entering edit mode

Hey Kevin, thanks for your answer. Is it correct to import the data for the bulk sequencing from here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE151073, I really only want to compare DC2 and DC3 with and without TLR. THose samples are included.

Using the GSE151073_PG_Bulk_Blood_raw_counts.csv.gz?

gse<-getGEO(GEO='GSE151073', GSEMatrix=TRUE)
data<-read_csv("GSE151073_PG_Bulk_Blood_raw_counts.csv.gz") ##https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE151073


gse<-gse[[1]]
metadata <-pData(phenoData(gse))

I did this and i got totally different results after DESeq2, than the group who published the results. But I really don't unterstand why. Looking back at the raw counts, their results are impossible. So i think, that I might have done something wrong.

ADD REPLY • link 2.1 years ago by LHA_trash • 0

0

Entering edit mode

Hi again. In which way were the results different?

ADD REPLY • link 2.1 years ago by Kevin Blighe 87k

0

Entering edit mode

I Will try to quickly state my workflow and post their results. This might be a bigger post then.

head(data)
# A tibble: 6 × 45
  ...1       `DC2-CD5neg-d1` `DC2-CD5neg-d2` `DC2-CD5neg-d3` `DC2-CD5neg-d4` `DC2-CD5pos-d1` `DC2-CD5pos-d2` `DC2-CD5pos-d3` `DC2-CD5pos-d4` `DC3-d1` `DC3-d2` `DC3-d3` `DC3-d4` `Mono-CD88pos-…` `Mono-CD88pos-…`
  <chr>                <dbl>           <dbl>           <dbl>           <dbl>           <dbl>           <dbl>           <dbl>           <dbl>    <dbl>    <dbl>    <dbl>    <dbl>            <dbl>            <dbl>
1 ENSG00000…               0               2               7               0               2               0               2               0        1        0        0        0                0                3
2 ENSG00000…              32              37              46              32              47              41              31              29       36       33       49       27               28               22
3 ENSG00000…               0               2               0               2               0               0               0               0        0        0        0        0                1                0
4 ENSG00000…               0               0               1               2               0               0               0               0        0        0        0        0                1                3
5 ENSG00000…               0               0               0               0               0               0               0               0        0        0        0        0                0                0
6 ENSG00000…               0               0               0               0               0               0               0               0        0        0        3        0                0                0
# … with 30 more variables: `Mono-CD88pos-d3 ##and so on

I performed tidying to adjust metadata and the data (=countdata)

metadata1 <-select(metadata,c(1,20,48,49,50))
metadata2 <- metadata1 %>%
  filter(title %in% c("DC3_1","DC3_2","DC3_3","DC3_4","DC2_CD5pos_1","DC2_CD5pos_2","DC2_CD5pos_3","DC2_CD5pos_4","DC2_TLR_1","DC2_TLR_3","DC2_TLR_4","DC3_TLR_1","DC3_TLR_3","DC3_TLR_4"))
data2 <-select(data,c(1,6,7,8,9,10,11,12,13,28,29,30,31,32,33)) #those are the corresponding columns to metadata (also DC2 and DC3)
metadata3<-metadata2 %>% 
  mutate(across(where(is.character), str_remove_all, pattern = fixed(" ")))
metadata4<- metadata3 %>%
  dplyr::rename("celltype"="cell type:ch1") %>%
  dplyr::rename("description"="description.1")%>%
  dplyr::rename("donor"="donor:ch1")%>%
  dplyr::rename("tissue"="tissue:ch1")
metadata5 <- metadata4
vec <- c(rep("DC2_ctrl",4),rep("DC3_ctrl",4),rep("DC2_TLR",3), rep("DC3_TLR",3))

vec2 <- c(rep("DC2",4),rep("DC3",4),rep("DC2",3), rep("DC3",3))

vec3 <- c(rep("Ctrl",8),rep("TLR",6))

metadata6 <- cbind(metadata5,cell_group=vec,celltype2=vec2,stimulation=vec3)
rownames(metadata6) <- metadata6$description
data3 <- data2 %>%
  rename("GeneID"="...1")

Getting these final data.frames

head(data3)
# A tibble: 6 × 15
  GeneID    `DC2-CD5pos-d1` `DC2-CD5pos-d2` `DC2-CD5pos-d3` `DC2-CD5pos-d4` `DC3-d1` `DC3-d2` `DC3-d3` `DC3-d4` `DC2-BTLA-S-d1` `DC2-BTLA-S-d3` `DC2-BTLA-S-d4` `DC3-CD163-S-d1` `DC3-CD163-S-d3` `DC3-CD163-S-d4`
  <chr>               <dbl>           <dbl>           <dbl>           <dbl>    <dbl>    <dbl>    <dbl>    <dbl>           <dbl>           <dbl>           <dbl>            <dbl>            <dbl>            <dbl>
1 ENSG0000…               2               0               2               0        1        0        0        0               2               6               2                2                5                0
2 ENSG0000…              47              41              31              29       36       33       49       27               4               6               1                8                9                5
3 ENSG0000…               0               0               0               0        0        0        0        0               0               1               0                2                0                0
4 ENSG0000…               0               0               0               0        0        0        0        0               1               0               0                0                0                0
5 ENSG0000…               0               0               0               0        0        0        0        0               0               0               0                0                0                0
6 ENSG0000…               0               0               0               0        0        0        3        0               0               0               0                0                0                0
head(metadata6)
                 title   description         celltype donor          tissue cell_group celltype2 stimulation
DC2-CD5pos-d1 DC2_CD5pos_1 DC2-CD5pos-d1   CD1c+CD5+cells     1 Peripheralblood   DC2_ctrl       DC2        Ctrl
DC2-CD5pos-d2 DC2_CD5pos_2 DC2-CD5pos-d2   CD1c+CD5+cells     2 Peripheralblood   DC2_ctrl       DC2        Ctrl
DC2-CD5pos-d3 DC2_CD5pos_3 DC2-CD5pos-d3   CD1c+CD5+cells     3 Peripheralblood   DC2_ctrl       DC2        Ctrl
DC2-CD5pos-d4 DC2_CD5pos_4 DC2-CD5pos-d4   CD1c+CD5+cells     4 Peripheralblood   DC2_ctrl       DC2        Ctrl
DC3-d1               DC3_1        DC3-d1 CD1c+CD163+cells     1 Peripheralblood   DC3_ctrl       DC3        Ctrl
DC3-d2               DC3_2        DC3-d2 CD1c+CD163+cells     2 Peripheralblood   DC3_ctrl       DC3        Ctrl

prepareing DESeq

countdata <- data3 %>%
  column_to_rownames("GeneID") %>%
  as.matrix()
design2 <- as.formula(~cell_group)
ddsObj.raw2 <- DESeqDataSetFromMatrix(countData = countdata,
                                     colData = metadata6,
                                     design = design2)
ddsObj2 <- DESeq(ddsObj.raw2)
rDC3vDC2_c<-results(ddsObj2,
                    name="cell_group_DC3_ctrl_vs_DC2_ctrl",
                    alpha=0.05)
rDC3vDC2_t<-results(ddsObj2,
                    contrast=c("cell_group","DC3_TLR","DC2_TLR"),
                    alpha=0.05)

I will stop here and just quickly say what I did next: I used lfcShrink (for the contrast group I used ashr), I used biomaRt to annotate the genes (actually the gene names were listed behind the ENSG-number, so I could see if i did any mistakes and then I plotted the values. My vulcano plot looks totally different then the one from the publication.

Vulcano plot from publication u can see, that theey state e.g. IL1B as upregulated DEG for DC3_TLR, but when i look back at the original raw count data table for IL1B, this is what I see: rawcounts_IL1B I know that those are raw counts, but how can this end up as a DEG upregulated in DC3? (This always refers to the DC3-CD163_s column, which means TLR stimulated)

my plot as comparison, my plot

ADD REPLY • link 2.1 years ago by LHA_trash • 0

1

Entering edit mode

Hmm, the volcano plot from the publication looks completely bogus. For one, the x-axis values all seem positive, and why is that 10 and not -5?

Some of your genes also appear to have the opposite directionality, in terms of fold-change, but --would you believe-- I have seen this more often than I would like to admit. It is clear that a few --possibly many-- published works contain erroneous results.

ADD REPLY • link 2.1 years ago by Kevin Blighe 87k

0

Entering edit mode

Thanks for taking the time to look at it, because I was unsure whether I fucked it up at some place. Well this is sad, as this is a major paper in our field...

ADD REPLY • link 2.1 years ago by LHA_trash • 0