GPL6883_HumanRef-8_V3_0_R0_11282963_A (illumina expression beadchip)
3
4
Entering edit mode
3.4 years ago

I''m quite new in bioinformatics world and started working with GEO dataset GSE42023. Our goals is to extract deferentially expressed genes . My question is:

The gse42023 provides me 2 types of files : RAW.tar and non-normalized.txt.gz. I know that : i need normalize this data to proceed with dea analisis, but i am really confused with this data.

Specially in non-normalized.txt , i have the genes (rows) , and the samples and p-value detection(collumns), in this case, how to proceed ?

Thanks! Example of non_normalized.txt

ID_REF YTY 82T Detection Pval YTY 84T Detection Pval YTY 88T Detection Pval YTY 78T ILMN_2104295 11777.34 0 15654.76 0 16447.22 0 18152.73 ILMN_1804851 47.92878 0.09337349 64776 0.01506024 50.35888 0.06626506 87.81049 ILMN_2412624 779.1193 0 994.2435 0 752119 0 1202821 ILMN_2402629 66.27144 0.009036144 63.89339 0.01957831 46.61563 0.1325301 62.49742

R • 4.5k views
ADD COMMENT
4
Entering edit mode
3.4 years ago

Sample code for GSE79404

#!/usr/bin/Rscript

# general config
  baseDir <- 'GSE79404/'
  bgxfile <- 'Annot/HumanHT-12_V4_0_R2_15002873_B.bgx'
  targetsfile <- 'targets.txt'
  setwd(baseDir)
  options(scipen = 99)
  require(limma)
  require(RColorBrewer)
  require(PCAtools)

# read in the data and convert the data to an EListRaw object, which is a data object for single channel data
  x <- read.table(paste0(baseDir, 'raw/GSE79404_non-normalized.txt'),
    header = TRUE, sep = '\t', stringsAsFactors = FALSE, skip = 4)
  newnames <- colnames(read.table(paste0(baseDir, 'raw/GSE79404_non-normalized.txt'),
    header = TRUE, sep = '\t', stringsAsFactors = FALSE, skip = 3))
  detectionpvalues <- x[,grep('Detection.Pval', colnames(x))]
  x <- x[,-grep('Detection.Pval', colnames(x))]
  probes <- x$ID_REF
  x <- data.matrix(x[,2:ncol(x)])
  rownames(x) <- probes
  colnames(x) <- gsub('^X', '', colnames(x))
  colnames(x) <- newnames[grep('GSM', newnames)]

# read in annotation
  annot <- illuminaio::readBGX(bgxfile)$probes
  annot <- annot[,which(colnames(annot) %in% c('Source','Symbol','Transcript','ILMN_Gene','RefSeq_ID',
    'Entrez_Gene_ID','Symbol','Protein_Product','Probe_Id','Probe_Type',
    'Probe_Start','Chromosome','Probe_Chr_Orientation','Probe_Coordinates',
    'Cytoband', 'Definition', 'Ontology_Component', 'Ontology_Process',
    'Ontology_Function', 'Synonyms'))]
  annot <- annot[which(annot$Probe_Id %in% rownames(x)),]
  annot <- annot[match(rownames(x), annot$Probe_Id),]

# update the target info
  targetinfo <- readTargets(targetsfile, sep = '\t')
  rownames(targetinfo) <- targetinfo$IDATfile
  x <- x[,match(rownames(targetinfo), colnames(x))]
  if (!all(colnames(x) == rownames(targetinfo)))
    stop('Target info is not aligned to expression data - they must be in the same order')

# create a custom EListRaw object
  project <- new('EListRaw')
  project@.Data[[1]] <- 'illumina'
  project@.Data[[2]] <- targetinfo
  #project@.Data[[3]] <- annot
  project@.Data[[3]] <- NULL
  project@.Data[[4]] <- x
  project@.Data[[5]] <- NULL
  project$E <- x
  project$targets <- targetinfo
  #project$genes <- annot
  project$genes <- NULL
  project$other$Detection <- detectionpvalues

# generate QC plots of raw intensities
  ... ...  

# for BeadArrays, background correction and normalisation are handled by a single function: neqc()
# this is the same as per Agilent single colour arrays
#
# perform background correction on the fluorescent intensities
#   'normexp' is beneficial in that it doesn't result in negative values, meaning no data is lost
#   for ideal offset, see Gordon Smyth's answer, here: https://stat.ethz.ch/pipermail/bioconductor/2006-April/012554.html
# normalize the data with the 'quantile' method, to be consistent with RMA for Affymetrix arrays
  project.bgcorrect.norm <- neqc(project, offset = 16)

# filter out control probes, those with no symbol, and those that failed
  annot <- annot[which(annot$Probe_Id %in% rownames(project.bgcorrect.norm)),]
  project.bgcorrect.norm <- project.bgcorrect.norm[which(rownames(project.bgcorrect.norm) %in% annot$Probe_Id),]
  annot <- annot[match(rownames(project.bgcorrect.norm), annot$Probe_Id),]
  project.bgcorrect.norm@.Data[[3]] <- annot
  project.bgcorrect.norm$genes <- annot
  Control <- project.bgcorrect.norm$genes$Source=="ILMN_Controls"
  NoSymbol <- project.bgcorrect.norm$genes$Symbol == ""
  isexpr <- rowSums(project.bgcorrect.norm$other$Detection <= 0.05) >= 3
  project.bgcorrect.norm.filt <- project.bgcorrect.norm[!Control & !NoSymbol & isexpr, ]
  dim(project.bgcorrect.norm)
  dim(project.bgcorrect.norm.filt)

# remove annotation columns we no longer need
  project.bgcorrect.norm.filt$genes <- project.bgcorrect.norm.filt$genes[,c(
    'Probe_Id',
    'Definition','Ontology_Component','Ontology_Process','Ontology_Function',
    'Chromosome','Probe_Coordinates','Cytoband','Probe_Chr_Orientation',
    'RefSeq_ID','Entrez_Gene_ID','Symbol')]
  head(project.bgcorrect.norm.filt$genes)

# summarise across genes by mean
# ID is used to identify the replicates
  project.bgcorrect.norm.filt.mean <- avereps(project.bgcorrect.norm.filt,
    ID = project.bgcorrect.norm.filt$genes$Symbol)
  dim(project.bgcorrect.norm.filt.mean)

targets.txt should look like:

IDATfile    Group
GSM2094831  Healthy
GSM2094832  Healthy
GSM2094833  Disease
GSM2094834  Healthy
...         ...
ADD COMMENT
0
Entering edit mode

Hi Kevin! How are you ?
Using your responses, I did the pre process from raw data! (thank you very much for this!!) Now I'm going to proceed with differential expression analysis. Can I use the limma package for this tipe of data ? One more question, why i need summarise across genes by mean?

ADD REPLY
0
Entering edit mode

Olá! Sim / yes, you can use limma in the standard way.

You can use mean or median (or something else). Mean is okay in this scenario because we already filtered out 'bad' probes just after normalisation via neqc(). In other situations, mean can be affected by outlier values.

ADD REPLY
0
Entering edit mode

Ok, in this case, can I use "project.bgcorrect.norm.filt.mean$E" or I have to extract the mean in another way ?

ADD REPLY
0
Entering edit mode

Yes, project.bgcorrect.norm.filt.mean$E is the expression matrix on which you can use limma to perfrom differential expression analysis.

ADD REPLY
0
Entering edit mode

Hey Kevin! When I try proceed with DEA analysis ,I'm having a litle troubles. Can you help-me ?

design<- model.matrix(~ targetinfo$group)  
> fit <- lmFit(project.bgcorrect.norm.filt.mean$E, design)
> contrasts<-makeContrasts(habits-non_habits,levels = design)
Error in makeContrasts(habits - non_habits, levels = design) : 
  The levels must by syntactically valid names in R, see help(make.names).  Non-valid names: targetinfo$groupnon_habits
Além disso: Warning message:
In makeContrasts(habits - non_habits, levels = design) :
  Renaming (Intercept) to Intercept
ADD REPLY
0
Entering edit mode

It means that your variable names cannot have spaces or other characters like hyphens, back or forward slashes, etc.

Bad:

Negative-Control
Negative/Control
Negative Control

Good:

NegativeControl
Negative_Control
Negative.Control
ADD REPLY
0
Entering edit mode

My Variables already have:

habits non_habits

ADD REPLY
0
Entering edit mode

What is the output of:

levels(targetinfo$groupnon_habits)

?

ADD REPLY
0
Entering edit mode

I think I did something wrong...

levels(targetinfo$groupnon_habits) NULL

ADD REPLY
0
Entering edit mode

I fixed the mistakes... now, when I run:


targets <- factor(targets$group) design <- model.matrix(~0+targets) colnames(design) <- levels(targets) fit <- lmFit(project.bgcorrect.norm.filt.mean$E, design) contrasts<-makeContrasts(habits-non_habits,levels = design) fit2 <- contrasts.fit(fit,contrasts) fit2 <-eBayes(fit2,trend = TRUE) summary(decideTests(fit2,method = "global")) toptable(fit2,coef = 1)


I recieve:


summary(decideTests(fit2,method = "global")) habits - non_habits Down 0 NotSig 11333 Up 0

>

                logFC        AveExpr         t                P.Value
  

AGPAT1 -1.775380 15.53613 -3.794783 0.001044925 HOXD1 4.103525 15.24342 3.789864 0.001057287 FSCN1 3.550966 15.32459 3.310087 0.003297495 STT3B 3.315899 14.04364 3.288376 0.003469572 POFUT1 1.942657 13.70937 3.229489 0.003981497 SFTPA1 3.459939 14.72384 3.223289 0.004039488 LOC56902 2.832716 13.73653 3.184775 0.004418548 CLEC12B 3.440431 18.44207 3.153483 0.004751719 MAP4K5 -2.168434 13.90378 -3.136730 0.004939974 GPR97 -3.919494 14.60409 -3.099695 0.005382097 adj.P.Val B AGPAT1 0.8360452 -4.426130 HOXD1 0.8360452 -4.426427 FSCN1 0.8360452 -4.456384 STT3B 0.8360452 -4.457782 POFUT1 0.8360452 -4.461588 SFTPA1 0.8360452 -4.461990 LOC56902 0.8360452 -4.464492 CLEC12B 0.8360452 -4.466530 MAP4K5 0.8360452 -4.467624 GPR97 0.8360452 -4.470047

I returned on GEO2R, and, for this serie, aparently theres no Diferencially expression bethwen groups. Ps. I wanna compare Non-habits whith habits.

ADD REPLY
0
Entering edit mode

Hi kevin!For getting reliable expression matrices, I wanna know how to process the two non-normalized data from GEO database(GSE98278 & GSE47472). Here are the colnames of the two non-normalized data.Could someone share the codes?This will be a great help for me.enter image description hereenter image description here

ADD REPLY
0
Entering edit mode

Hello Kevin, I am pretty new to R and very unfamiliar with Illumina data. This script is helping a lot, I am working on GSE38481 and GSE38484, however, to understand this workflow I downloaded the dataset you mentioned and am running the pipeline. But when I do so I cannot seem to locate a bgx or targets file. How do I get those?

ADD REPLY
0
Entering edit mode

Hi, the bgx file is just an annotation file - you technically don't need it if you can create your own annotation file via another means, like Bioconductor annotation packages.

This code should create a master table for the Illumina Human v4

require(illuminaHumanv4.db)
select(
  illuminaHumanv4.db,
  keys = keys(illuminaHumanv4.db),
  column = c('PROBEID', 'SYMBOL', 'ENTREZID', 'ENSEMBL'),
  keytype = 'PROBEID')

Other annotation packages are listed here: https://www.bioconductor.org/packages/release/BiocViews.html#___AnnotationData

Going this way involves even more customisation, though, and increased risk.

Regarding the targets file, that is for you to create manually - it's basically the metadata file.

ADD REPLY
0
Entering edit mode

Oh alright! Thanks!

ADD REPLY
1
Entering edit mode
3.4 years ago

Please obtain the data via GEO2R: https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE42023

Kevin

ADD COMMENT
0
Entering edit mode

Hey Kevin! I follow the instructions from your workflow, and it's going well! But, when i tried to execute bc correct and norm, R give me this warning messages:

Error in if (p[which.max(y)] < p[which.min(y)]) detection.p <- 1 - detection.p : 
  argumento tem comprimento zero
Além disso: Warning messages:
1: In which.max(y) : NAs introduzidos por coerção
2: In which.min(y) : NAs introduzidos por coerção
ADD REPLY
1
Entering edit mode

Oh, fala português? Cadê o código que esta tentando de executar? Pode responder em inglês

ADD REPLY
0
Entering edit mode

Hey Kevin, yes! I am a brazilian student! i found where i losted my mind!

now, i set ID_REF as rownames, and created an ElisRaw. Now, i proceed with bccorrection and normalisation.But R provides me this warning: Note: inferring mean and variance of negative control probe intensities from the detection p-values.

how to proceed now?

one more question: What is offset, and why 16 ? :)

ADD REPLY
1
Entering edit mode

Obrigado, mas qual linea especificamente produz o erro?

ADD REPLY
0
Entering edit mode

Hey Kevin!

I'm still confused about offset=16 (sorry about this)

When i try to proceed with # filter out control probes, those with no symbol, and those that failed# I put:

> Control <- raw_data.bgcorrect.norm$genes$Source=="ILMN_Controls"
> NoSymbol <- raw_data.bgcorrect.norm$genes$Symbol == ""
> NoSymbol <- raw_data.bgcorrect.norm$genes$Symbol == " "
> isexpr <- rowSums(non_normalized_pvalue <= 0.05) >= 3
> raw_data.bgcorrect.norm.filt <-raw_data.bgcorrect.norm[!Control & !NoSymbol & isexpr, ]

And R give me this warning: Error in object[[a]][i, , drop = FALSE] : número incorreto de dimensões

In this case, what is my fail ?

ADD REPLY
0
Entering edit mode

for my analysis i need this stages ? - filter out control probes, those with no symbol, and those that failed - summarise across genes by mean

ADD REPLY
0
Entering edit mode

You may not need these stages / steps.

Before executing (running) these lines of code, you should have already run:

project.bgcorrect.norm <- neqc(project, offset = 16)

What is the output of

table(Control)
table(NoSymbol)
table(isexpr)

?

The offset is added to the expression values after the background correction so that no negative values appear in the data.

ADD REPLY
0
Entering edit mode

when i run :

raw_data.bgcorrect.norm <-neqc(raw_data, offset = 16)

r provide's me a new elist, and this warning : Note: inferring mean and variance of negative control probe intensities from the detection p-values.

about the second question:

table(Control)    ------> R give me :   FALSE 4098
table(NoSymbol)------>R give me : FALSE 637676
table(isexpr)-----------> R give me:   FALSE 4052     TRUE 13422
ADD REPLY
0
Entering edit mode

what data i should have in:

Control <- raw_data.bgcorrect.norm$genes$controls=="ILMN_Controls"

NoSymbol <- raw_data.bgcorrect.norm$genes$probes == " " ?

ADD REPLY
0
Entering edit mode

There is evidently a problem with your annotation object, raw_data.bgcorrect.norm$genes.

From the other thread, did you perform this part?

If you then read in the BGX file and align it to your x object (as annot)

?

Your annotation object has to be 100% / perfectly aligned with your expression data.

--------------

If you want, you can avoid the filtering steps involving Control, NoSymbol, and isexpr. The Note that appears after you run neqc() indicates that the normalisation process is already utilising the detection p-values; so, this is good news.

16 is a [somewhat] random / aleatório number, but that recommended by Professor Gordon Smyth.

ADD REPLY
2
Entering edit mode

Please keep in mind that you can avoid this advanced procedure by obtaining the normalised data automatically via GEO2R:

library(Biobase)
library(GEOquery)

# load series and platform data from GEO
gset <- getGEO("GSE42023", GSEMatrix =TRUE, getGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL6883", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

dev.new(width=4+dim(gset)[[2]]/5, height=6)
par(mar=c(2+round(max(nchar(sampleNames(gset)))/2),4,2,1))
title <- paste ("GSE42023", '/', annotation(gset), " selected samples", sep ='')
boxplot(exprs(gset), boxwex=0.7, notch=T, main=title, outline=FALSE, las=2)

If you are a beginner in this, there is no reason why you need to go through this advanced process.

ADD REPLY
0
Entering edit mode

I'm afraid not... to import the BGX data, I used: library(illuminaio)

my_idat_files <- paste("/home/daniela/Área de Trabalho/R_GIT_HUB/Scripts-em-R/GSE42023/GPL6883_HumanRef-8_V3_0_R0_11282963_A.bgx")

bgxfile <-my_idat_files

bgx <-readBGX(bgxfile)

After, I just proceed with your workflow:

raw_data <- " " raw_data <- new("EListRaw",raw_data) raw_data@.Data [[1]] <- 'illumina' raw_data@.Data [[2]] <- clin_data raw_data@.Data [[3]] <-bgx raw_data@.Data [[4]] <- non_normalized_counts raw_data@.Data [[5]] <-NULL

raw_data$E <- non_normalized_counts raw_data$targets <- cbind(clin_data$description,clin_data$group) raw_data$genes <-bgx raw_data$other$Detection <- non_normalized_pvalue

I really now that, but I am a very insterest ... so I whant to learning sorry if I'm boring you : (

ADD REPLY
1
Entering edit mode

Basically, the annotation always has to be perfectly aligned to the expression data based on the probe IDs. This is even more important in this case because this is a very advanced procedure, but you indicated that you are a beginner with no local help (?).

You can align objects via match(), which(), or via dplyr: https://www.infoworld.com/article/3454356/how-to-merge-data-in-r-using-r-merge-dplyr-or-datatable.html

As you may understand, for different reasons, I have hesitancy to just provide all code to you.

ADD REPLY
1
Entering edit mode

There's no problem with this... I will read this article and trying to understand . Unfortunately I just have a few people help me( also trying understand ). Thanks for help me

ADD REPLY
1
Entering edit mode

Any luck?

ADD REPLY
1
Entering edit mode

Hey Kevin! I found some articles and I'm watching some classes (videos) about.

I hope return to talk with you soon :)

ADD REPLY
0
Entering edit mode

Hey Kevin! Sorry for my late response!

Finally i used the merge function , to align the probes with the expression matrix (as annot). When I try to merge the controls with the expression matrix , R provides me: Error in merge.data.frame(raw_int, controls, by.x = "Probe_Id", all.x = TRUE, : 'by.x' e 'by.y' especificam números de colunas diferentes I think is because the expression matrix doesn't have the controls probe ID.

So I go on with background correction / normalisation and only (expression matrix) has normalized values, the others don't . Is this correct ?

ADD REPLY
0
Entering edit mode

Olá! Não sei o que está acontecendo, mas, sim, é possivél que é relacionado com os probes de controle. Eu coloquei código em uma nova resposta. Pode tentar de usar ese?

ADD REPLY
1
Entering edit mode

Sim! Tento agora e te retorno!

ADD REPLY
0
Entering edit mode
17 months ago
seta ★ 1.9k

Hi Kevin, thanks for your always help and support!.

I used your workflow to analyze the Illumina microarray data, GSE35088, enter link description here. However, I did not obtain DE genes while GEO2R gave about 1000 DE genes. As I found that GEO2R does not perform the filtering step while when I filtered the probes (control probes, those with no symbol, and those that failed), only about 800 probes remained out of 22523 probes. Also, the normalization step is different between the two workflows. Is it right? so using GEO2R would not be safe for getting accurate results, is it right? kindly let me know if you have any suggestion/advice?

thank you.

ADD COMMENT

Login before adding your answer.

Traffic: 1700 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6