1
1
Entering edit mode
2.8 years ago
zelda ▴ 30

I'm quite new to the bioinformatics world and just started working with the GEO data set GSE119600. My intention is to extract deferentially expressed genes and will start my work using the Lumi package to import ,read and normalize the raw data into R. Then I would use LIMMA package to generate the deferentially expressed genes. My questions are:

1- what file is suitable to import the raw data from? As there are 2 types of files : RAW.tar and non-normalized.txt.gz. 2- what is the general workflow for the lumi pakcage?

Thanks!

R lumi LIMMA • 3.8k views
0
Entering edit mode

hi zelda, i'm having the same problem with raw data from illumina beadchip in GEO. How you solved your problem ?

0
Entering edit mode

Hi Zelda, i'm having the same problem... How do you proceed ?

0
Entering edit mode

Please elaborate on the problem. Please show what you have already tried, and share any warning and / or error messages that have appeared.

0
Entering edit mode

I''m quite new in bioinformatics world and started working with GEO dataset GSE42023. Our goals is to extract deferentially expressed genes . My question is:

The gse42023 provides me 2 types of files : RAW.tar and non-normalized.txt.gz. I know that : i need normalize this data to proceed with dea analisis, but i am really confused with this data.

Specially in non-normalized.txt , i have the genes (rows) , and the samples and p-value detection(collumns), in this case, how to proceed ?

Thanks! Example of non_normalized.txt

ID_REF YTY 82T Detection Pval YTY 84T Detection Pval YTY 88T Detection Pval YTY 78T ILMN_2104295 11777.34 0 15654.76 0 16447.22 0 18152.73 ILMN_1804851 47.92878 0.09337349 64776 0.01506024 50.35888 0.06626506 87.81049 ILMN_2412624 779.1193 0 994.2435 0 752119 0 1202821 ILMN_2402629 66.27144 0.009036144 63.89339 0.01957831 46.61563 0.1325301 62.49742

0
Entering edit mode

Unfortunately, you have the same problem as many people.

For some reason, for the Illumina microarray studies, GEO requires that authors upload data in this non-standard format. I am not sure that you can use the standard Bioconductor package, lumi, for this. Instead, you may have to process this manually, and I provide an advanced workflow here: A: illumina Arrays Illumina HumanHT-12 V3.0 expression beadchip reading data

0
Entering edit mode

Thank's for your response Kevin!!! I am following your workflow. But i have a question: How normalize this data ? Now i have a matrix with ID ref and p-value detection for all samples. Please forgive me for so many questions.

0
Entering edit mode

Hey, you need to remove the detection p-value columns. I explain this in the other post (but I do not provide the code):

"You should then extract out the Detection PVal columns and save them for later, and also set the rownames of the object to be equal to ID_REF. The final x object should be just the expression levels, and it should be a data-matrix."

The normalisation is then performed with the neqc() function, also mentioned in the other post.

Sorry, this is a very non-standard workflow.

0
Entering edit mode

So, i normalize this data: ( ID_REF and the counts of genes)

ID_REF                  YTY 82T YTY 84T YTY 88T YTY 78T YTO 68T YTO 92T YTO 70T YTY 94T
ILMN_2104295    11777.34    15654.76    16447.22    18152.73    17644.91    12811.94    14148.09    28912.88
ILMN_1804851    47.92878    64776   50.35888    87.81049    81.91159    81.92358    54.1439 157.5312
ILMN_2412624    779.1193    994.2435    752119  1202821 1129763 951.2262    1179794 3904947
ILMN_2402629    66.27144    63.89339    46.61563    62.49742    65.48244    68.64259    86.67648    147.1232
ILMN_1713732    132.8008    251.0701    184.7917    406.4644    465953  252.8271    206557  610.4857


Or This data: (ID_REF and just the p-value detection)?

ID_REF                 YTY 82T         YTY 84T         YTY88T            YTY 78T     YTO 68T                YTO 92T
ILMN_2104295    0                    0                              0                 0           0                           0
ILMN_1804851    0.09337349  0.01506024  0.06626506  0.03313253  0.01054217  0.007530121
ILMN_2412624    0   0   0   0   0   0
ILMN_2402629    0.009036144 0.01957831  0.1325301   0.1656626   0.05572289  0.02259036
ILMN_1713732    0.000156345 2.80E-06    0.000959368 3.32E-06    2.33E-06    2.05E-05
ILMN_1697220    0.000163673 0.01506584  0.008168896 0.009110582 0.001960087 0.02678343
ILMN_1807610    0   0.006024096 0.001506024 0.004518072 0.003012048 0.02108434
ILMN_2408039    0.07981928  0.1536145   0.2003012   0.128012    0.2289157   0.2153614
ILMN_1802167    0.009036144 0.004518072 0.009036144 0.05421687  0.01054217  0

1
Entering edit mode

You need to set ID_REF as rownames, and then remove that first column (ID_REF) from the data (in both cases).

You later use the detection p-values in the following section in my other post: 'filter out control probes, those with no symbol, and those that failed' (these detection p-values will be contained in the object detectionpvalues)

The columns of both objects also should be aligned perfectly.

1
Entering edit mode
2.8 years ago

Hey,

Like many other Illumina studies on GEO, this study (GSE119600) does not contain the original raw data IDAT files, which is unfortunate.

GSE119600_RAW.tar just contains the probe annotations for GPL10558 Human HT-12 v4 R1 and R2 (2 separate files). GSE119600_non-normalized.txt.gz just contains a single table of expression values. i am not sure that these are suitable for lumi, but you are welcome to try.

As a beginner, the easiest option for you is to use GEO2R to obtain the expression data as an ExpressionSet object. From the main accession page ( https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE119600 ), click on the Analyze with GEO2R button.

Then, click on the R script tab, which will generate R code that you can use to obtain the data.

## --------------------

Alternatively, you can adapt a previous answer of mine such that you can still use this raw data with limma:

In this case, the annot object should be one of those annotation files in the GSE119600_RAW.tar file. The targetinfo object you will have to create yourself.

Kevin

1
Entering edit mode

Thank you Kevin! It is much appreciated.

0
Entering edit mode

Hi Kevin i had the same problem with GSE140830_raw_data.txt.gz. i created targetinfo and annot matrices and followed your workflow. but i getting an error, when i try to normalize by neqc():

Error in if (alpha <= 0) stop("alpha must be positive") : missing value where TRUE/FALSE needed

It should be great if you can give me any suggestions to fix this error.

Thanks

0
Entering edit mode

Hello, you will have to show all of your code so that I can test it. Also, if you are referring to my code here, A: illumina Arrays Illumina HumanHT-12 V3.0 expression beadchip reading data, then you should be aware that this is for Advanced users (apologies for assuming that you may not be advanced).

If possible, try to retrieve the data via GEO2R ( see blue button on the GEO main accession page: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE140830 )

0
Entering edit mode

Thank you Kevin i don't aimed to find DEGs from this dataset, i intend to extract a normalized expression matrix to test the preservation of the main modules of my wgcna. the supplementary file of dataset GSE140830 contains GSE140830_final_normalized_data.txt.gz and GSE140830_raw_data.txt.gz.
i'm following this code:

library(GEOquery)
filePaths = getGEOSuppFiles("GSE140830")
dim(datFTD)
#47231  2181
targetinfo<-datFTD[,1:13]
exp<-datFTD[,c(1,14:2181)]
exp<-exp[,order(names(exp))]
names(exp)<-toupper(names(exp))
dim(exp)
#47231  1085
names(exp)<-gsub("X","",names(exp))
rownames(exp)<-exp[,1]
exp<-exp[,-1]
annot<-cbind(targetinfo,exp)
dim(annot)
#47231  1097
Exdata<-exp[,grep("SIGNAL",names(exp))]
Pvdata<-exp[,grep("DETECT",names(exp))]
dim(Exdata)
#47231   542
dim(Pvdata)
#47231   542

project <- new('EListRaw')
project@.Data[[1]] <- 'illumina'
project@.Data[[2]] <- targetinfo
project@.Data[[3]] <- annot
project@.Data[[4]] <- Exdata
project@.Data[[5]] <- NULL
project$E <- Exdata project$targets <- targetinfo
project$genes <- annot project$other$Detection <- Pvdata project.bgcorrect.norm <- neqc(project, offset = 16)  As you assumed, I'm not an advanced user, I am PhD candidate in neuroscience and i know bioinformatics as much as doing my thesis. I will be appreciated if you could help me. ADD REPLY 0 Entering edit mode Thanks, I have taken a look and could reproduce the error with this code: library(GEOquery) filePaths = getGEOSuppFiles("GSE140830") datFTD<-read.delim(rownames(filePaths)[2]) dim(datFTD) targetinfo<-datFTD[,1:13] exp<-datFTD[,c(1,14:2181)] exp<-exp[,-grep("BEAD_STDERR",names(exp))] dim(exp) names(exp)<-gsub("X","",names(exp)) rownames(exp)<-exp[,1] exp<-exp[,-1] annot<-cbind(targetinfo) dim(annot) Exdata<-exp[,grep("SIGNAL",names(exp))] Pvdata<-exp[,grep("DETECT",names(exp))] dim(Exdata) dim(Pvdata) project <- new('EListRaw') project@.Data[[1]] <- 'illumina' project@.Data[[2]] <- targetinfo project@.Data[[3]] <- annot project@.Data[[4]] <- Exdata project@.Data[[5]] <- NULL project$E <- Exdata
project$targets <- targetinfo project$genes <- annot
project$other$Detection <- Pvdata

project.bgcorrect.norm <- neqc(project, offset = 16)


Looking closer, it seems that the authors have not provided all control probes with the raw data; so, the normalisation cannot function. I can only see 1 control probe included in the data via table(project\$genes#SOURCE). It seems, according to here https://support.bioconductor.org/p/42502/, that you can still normalise it without control probes via:

project.bgcorrect.norm <- backgroundCorrect(project, method = 'normexp')


## ----------------

I checked the data via GEO2R and the rownames are non-sensical characters - not sure what happened there.

Another option may be to load the already-normalised data that is also available on GEO: GSE140830_final_normalized_data.txt.gz This, at least, appears to be in a usable format, and de-necessitates all of the above complicated code. It can be inferred via the GEO record that this data is normalised as "quantile and VST normalized signal"

1
Entering edit mode

Thank you Kevin I appreciate it.

0
Entering edit mode

Hi, I have been following this thread. Like you, am new to such analysis. Since the data is from Illumina beadchip, it is difficult for mean to perform normalization and post processing. Can you help me with the R script that u finally used?

TIA Shriyansh