Question: GPL6883_HumanRef-8_V3_0_R0_11282963_A (illumina expression beadchip)
0
gravatar for daniela.paola.s.p
17 days ago by
Brazil/Montes Claros-MG/Universidade Estadual de Montes Claros
daniela.paola.s.p30 wrote:

I''m quite new in bioinformatics world and started working with GEO dataset GSE42023. Our goals is to extract deferentially expressed genes . My question is:

The gse42023 provides me 2 types of files : RAW.tar and non-normalized.txt.gz. I know that : i need normalize this data to proceed with dea analisis, but i am really confused with this data.

Specially in non-normalized.txt , i have the genes (rows) , and the samples and p-value detection(collumns), in this case, how to proceed ?

Thanks! Example of non_normalized.txt

ID_REF YTY 82T Detection Pval YTY 84T Detection Pval YTY 88T Detection Pval YTY 78T ILMN_2104295 11777.34 0 15654.76 0 16447.22 0 18152.73 ILMN_1804851 47.92878 0.09337349 64776 0.01506024 50.35888 0.06626506 87.81049 ILMN_2412624 779.1193 0 994.2435 0 752119 0 1202821 ILMN_2402629 66.27144 0.009036144 63.89339 0.01957831 46.61563 0.1325301 62.49742

R • 280 views
ADD COMMENTlink modified 11 days ago by Kevin Blighe68k • written 17 days ago by daniela.paola.s.p30
1
gravatar for Kevin Blighe
16 days ago by
Kevin Blighe68k
Republic of Ireland
Kevin Blighe68k wrote:

Please obtain the data via GEO2R: https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE42023

Kevin

ADD COMMENTlink modified 15 days ago • written 16 days ago by Kevin Blighe68k

Hey Kevin! I follow the instructions from your workflow, and it's going well! But, when i tried to execute bc correct and norm, R give me this warning messages:

Error in if (p[which.max(y)] < p[which.min(y)]) detection.p <- 1 - detection.p : 
  argumento tem comprimento zero
Além disso: Warning messages:
1: In which.max(y) : NAs introduzidos por coerção
2: In which.min(y) : NAs introduzidos por coerção
ADD REPLYlink modified 16 days ago by Kevin Blighe68k • written 16 days ago by daniela.paola.s.p30
1

Oh, fala português? Cadê o código que esta tentando de executar? Pode responder em inglês

ADD REPLYlink written 16 days ago by Kevin Blighe68k

Hey Kevin, yes! I am a brazilian student! i found where i losted my mind!

now, i set ID_REF as rownames, and created an ElisRaw. Now, i proceed with bccorrection and normalisation.But R provides me this warning: Note: inferring mean and variance of negative control probe intensities from the detection p-values.

how to proceed now?

one more question: What is offset, and why 16 ? :)

ADD REPLYlink modified 15 days ago • written 16 days ago by daniela.paola.s.p30
1

Obrigado, mas qual linea especificamente produz o erro?

ADD REPLYlink written 15 days ago by Kevin Blighe68k

Hey Kevin!

I'm still confused about offset=16 (sorry about this)

When i try to proceed with # filter out control probes, those with no symbol, and those that failed# I put:

> Control <- raw_data.bgcorrect.norm$genes$Source=="ILMN_Controls"
> NoSymbol <- raw_data.bgcorrect.norm$genes$Symbol == ""
> NoSymbol <- raw_data.bgcorrect.norm$genes$Symbol == " "
> isexpr <- rowSums(non_normalized_pvalue <= 0.05) >= 3
> raw_data.bgcorrect.norm.filt <-raw_data.bgcorrect.norm[!Control & !NoSymbol & isexpr, ]

And R give me this warning: Error in object[[a]][i, , drop = FALSE] : número incorreto de dimensões

In this case, what is my fail ?

ADD REPLYlink modified 15 days ago by Kevin Blighe68k • written 15 days ago by daniela.paola.s.p30

for my analysis i need this stages ? - filter out control probes, those with no symbol, and those that failed - summarise across genes by mean

ADD REPLYlink written 15 days ago by daniela.paola.s.p30

You may not need these stages / steps.

Before executing (running) these lines of code, you should have already run:

project.bgcorrect.norm <- neqc(project, offset = 16)

What is the output of

table(Control)
table(NoSymbol)
table(isexpr)

?

The offset is added to the expression values after the background correction so that no negative values appear in the data.

ADD REPLYlink written 15 days ago by Kevin Blighe68k

when i run :

raw_data.bgcorrect.norm <-neqc(raw_data, offset = 16)

r provide's me a new elist, and this warning : Note: inferring mean and variance of negative control probe intensities from the detection p-values.

about the second question:

table(Control)    ------> R give me :   FALSE 4098
table(NoSymbol)------>R give me : FALSE 637676
table(isexpr)-----------> R give me:   FALSE 4052     TRUE 13422
ADD REPLYlink modified 15 days ago by Kevin Blighe68k • written 15 days ago by daniela.paola.s.p30

what data i should have in:

Control <- raw_data.bgcorrect.norm$genes$controls=="ILMN_Controls"

NoSymbol <- raw_data.bgcorrect.norm$genes$probes == " " ?

ADD REPLYlink written 15 days ago by daniela.paola.s.p30

There is evidently a problem with your annotation object, raw_data.bgcorrect.norm$genes.

From the other thread, did you perform this part?

If you then read in the BGX file and align it to your x object (as annot)

?

Your annotation object has to be 100% / perfectly aligned with your expression data.

--------------

If you want, you can avoid the filtering steps involving Control, NoSymbol, and isexpr. The Note that appears after you run neqc() indicates that the normalisation process is already utilising the detection p-values; so, this is good news.

16 is a [somewhat] random / aleatório number, but that recommended by Professor Gordon Smyth.

ADD REPLYlink written 15 days ago by Kevin Blighe68k

Please keep in mind that you can avoid this advanced procedure by obtaining the normalised data automatically via GEO2R:

library(Biobase)
library(GEOquery)

# load series and platform data from GEO
gset <- getGEO("GSE42023", GSEMatrix =TRUE, getGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL6883", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

dev.new(width=4+dim(gset)[[2]]/5, height=6)
par(mar=c(2+round(max(nchar(sampleNames(gset)))/2),4,2,1))
title <- paste ("GSE42023", '/', annotation(gset), " selected samples", sep ='')
boxplot(exprs(gset), boxwex=0.7, notch=T, main=title, outline=FALSE, las=2)

If you are a beginner in this, there is no reason why you need to go through this advanced process.

ADD REPLYlink written 15 days ago by Kevin Blighe68k

I'm afraid not... to import the BGX data, I used: library(illuminaio)

my_idat_files <- paste("/home/daniela/Área de Trabalho/R_GIT_HUB/Scripts-em-R/GSE42023/GPL6883_HumanRef-8_V3_0_R0_11282963_A.bgx")

bgxfile <-my_idat_files

bgx <-readBGX(bgxfile)

After, I just proceed with your workflow:

raw_data <- " " raw_data <- new("EListRaw",raw_data) raw_data@.Data [[1]] <- 'illumina' raw_data@.Data [[2]] <- clin_data raw_data@.Data [[3]] <-bgx raw_data@.Data [[4]] <- non_normalized_counts raw_data@.Data [[5]] <-NULL

raw_data$E <- non_normalized_counts raw_data$targets <- cbind(clin_data$description,clin_data$group) raw_data$genes <-bgx raw_data$other$Detection <- non_normalized_pvalue

I really now that, but I am a very insterest ... so I whant to learning sorry if I'm boring you : (

ADD REPLYlink modified 15 days ago by Kevin Blighe68k • written 15 days ago by daniela.paola.s.p30
1

Basically, the annotation always has to be perfectly aligned to the expression data based on the probe IDs. This is even more important in this case because this is a very advanced procedure, but you indicated that you are a beginner with no local help (?).

You can align objects via match(), which(), or via dplyr: https://www.infoworld.com/article/3454356/how-to-merge-data-in-r-using-r-merge-dplyr-or-datatable.html

As you may understand, for different reasons, I have hesitancy to just provide all code to you.

ADD REPLYlink written 15 days ago by Kevin Blighe68k
1

There's no problem with this... I will read this article and trying to understand . Unfortunately I just have a few people help me( also trying understand ). Thanks for help me

ADD REPLYlink written 15 days ago by daniela.paola.s.p30
1

Any luck?

ADD REPLYlink written 14 days ago by Kevin Blighe68k
1

Hey Kevin! I found some articles and I'm watching some classes (videos) about.

I hope return to talk with you soon :)

ADD REPLYlink written 14 days ago by daniela.paola.s.p30

Hey Kevin! Sorry for my late response!

Finally i used the merge function , to align the probes with the expression matrix (as annot). When I try to merge the controls with the expression matrix , R provides me: Error in merge.data.frame(raw_int, controls, by.x = "Probe_Id", all.x = TRUE, : 'by.x' e 'by.y' especificam números de colunas diferentes I think is because the expression matrix doesn't have the controls probe ID.

So I go on with background correction / normalisation and only (expression matrix) has normalized values, the others don't . Is this correct ?

ADD REPLYlink written 11 days ago by daniela.paola.s.p30

Olá! Não sei o que está acontecendo, mas, sim, é possivél que é relacionado com os probes de controle. Eu coloquei código em uma nova resposta. Pode tentar de usar ese?

ADD REPLYlink written 11 days ago by Kevin Blighe68k
1

Sim! Tento agora e te retorno!

ADD REPLYlink written 11 days ago by daniela.paola.s.p30
1
gravatar for Kevin Blighe
11 days ago by
Kevin Blighe68k
Republic of Ireland
Kevin Blighe68k wrote:

Sample code for GSE79404

#!/usr/bin/Rscript

# general config
  baseDir <- 'GSE79404/'
  bgxfile <- 'Annot/HumanHT-12_V4_0_R2_15002873_B.bgx'
  targetsfile <- 'targets.txt'
  setwd(baseDir)
  options(scipen = 99)
  require(limma)
  require(RColorBrewer)
  require(PCAtools)

# read in the data and convert the data to an EListRaw object, which is a data object for single channel data
  x <- read.table(paste0(baseDir, 'raw/GSE79404_non-normalized.txt'),
    header = TRUE, sep = '\t', stringsAsFactors = FALSE, skip = 4)
  newnames <- colnames(read.table(paste0(baseDir, 'raw/GSE79404_non-normalized.txt'),
    header = TRUE, sep = '\t', stringsAsFactors = FALSE, skip = 3))
  detectionpvalues <- x[,grep('Detection.Pval', colnames(x))]
  x <- x[,-grep('Detection.Pval', colnames(x))]
  probes <- x$ID_REF
  x <- data.matrix(x[,2:ncol(x)])
  rownames(x) <- probes
  colnames(x) <- gsub('^X', '', colnames(x))
  colnames(x) <- newnames[grep('GSM', newnames)]

# read in annotation
  annot <- illuminaio::readBGX(bgxfile)$probes
  annot <- annot[,which(colnames(annot) %in% c('Source','Symbol','Transcript','ILMN_Gene','RefSeq_ID',
    'Entrez_Gene_ID','Symbol','Protein_Product','Probe_Id','Probe_Type',
    'Probe_Start','Chromosome','Probe_Chr_Orientation','Probe_Coordinates',
    'Cytoband', 'Definition', 'Ontology_Component', 'Ontology_Process',
    'Ontology_Function', 'Synonyms'))]
  annot <- annot[which(annot$Probe_Id %in% rownames(x)),]
  annot <- annot[match(rownames(x), annot$Probe_Id),]

# update the target info
  targetinfo <- readTargets(targetsfile, sep = '\t')
  rownames(targetinfo) <- targetinfo$IDATfile
  x <- x[,match(rownames(targetinfo), colnames(x))]
  if (!all(colnames(x) == rownames(targetinfo)))
    stop('Target info is not aligned to expression data - they must be in the same order')

# create a custom EListRaw object
  project <- new('EListRaw')
  project@.Data[[1]] <- 'illumina'
  project@.Data[[2]] <- targetinfo
  #project@.Data[[3]] <- annot
  project@.Data[[3]] <- NULL
  project@.Data[[4]] <- x
  project@.Data[[5]] <- NULL
  project$E <- x
  project$targets <- targetinfo
  #project$genes <- annot
  project$genes <- NULL
  project$other$Detection <- detectionpvalues

# generate QC plots of raw intensities
  ... ...  

# for BeadArrays, background correction and normalisation are handled by a single function: neqc()
# this is the same as per Agilent single colour arrays
#
# perform background correction on the fluorescent intensities
#   'normexp' is beneficial in that it doesn't result in negative values, meaning no data is lost
#   for ideal offset, see Gordon Smyth's answer, here: https://stat.ethz.ch/pipermail/bioconductor/2006-April/012554.html
# normalize the data with the 'quantile' method, to be consistent with RMA for Affymetrix arrays
  project.bgcorrect.norm <- neqc(project, offset = 16)

# filter out control probes, those with no symbol, and those that failed
  annot <- annot[which(annot$Probe_Id %in% rownames(project.bgcorrect.norm)),]
  project.bgcorrect.norm <- project.bgcorrect.norm[which(rownames(project.bgcorrect.norm) %in% annot$Probe_Id),]
  annot <- annot[match(rownames(project.bgcorrect.norm), annot$Probe_Id),]
  project.bgcorrect.norm@.Data[[3]] <- annot
  project.bgcorrect.norm$genes <- annot
  Control <- project.bgcorrect.norm$genes$Source=="ILMN_Controls"
  NoSymbol <- project.bgcorrect.norm$genes$Symbol == ""
  isexpr <- rowSums(project.bgcorrect.norm$other$Detection <= 0.05) >= 3
  project.bgcorrect.norm.filt <- project.bgcorrect.norm[!Control & !NoSymbol & isexpr, ]
  dim(project.bgcorrect.norm)
  dim(project.bgcorrect.norm.filt)

# remove annotation columns we no longer need
  project.bgcorrect.norm.filt$genes <- project.bgcorrect.norm.filt$genes[,c(
    'Probe_Id',
    'Definition','Ontology_Component','Ontology_Process','Ontology_Function',
    'Chromosome','Probe_Coordinates','Cytoband','Probe_Chr_Orientation',
    'RefSeq_ID','Entrez_Gene_ID','Symbol')]
  head(project.bgcorrect.norm.filt$genes)

# summarise across genes by mean
# ID is used to identify the replicates
  project.bgcorrect.norm.filt.mean <- avereps(project.bgcorrect.norm.filt,
    ID = project.bgcorrect.norm.filt$genes$Symbol)
  dim(project.bgcorrect.norm.filt.mean)

targets.txt should look like:

IDATfile    Group
GSM2094831  Healthy
GSM2094832  Healthy
GSM2094833  Disease
GSM2094834  Healthy
...         ...
ADD COMMENTlink written 11 days ago by Kevin Blighe68k

Hi Kevin! How are you ?
Using your responses, I did the pre process from raw data! (thank you very much for this!!) Now I'm going to proceed with differential expression analysis. Can I use the limma package for this tipe of data ? One more question, why i need summarise across genes by mean?

ADD REPLYlink written 3 days ago by daniela.paola.s.p30

Olá! Sim / yes, you can use limma in the standard way.

You can use mean or median (or something else). Mean is okay in this scenario because we already filtered out 'bad' probes just after normalisation via neqc(). In other situations, mean can be affected by outlier values.

ADD REPLYlink written 3 days ago by Kevin Blighe68k

Ok, in this case, can I use "project.bgcorrect.norm.filt.mean$E" or I have to extract the mean in another way ?

ADD REPLYlink written 3 days ago by daniela.paola.s.p30

Yes, project.bgcorrect.norm.filt.mean$E is the expression matrix on which you can use limma to perfrom differential expression analysis.

ADD REPLYlink written 3 days ago by Kevin Blighe68k

Hey Kevin! When I try proceed with DEA analysis ,I'm having a litle troubles. Can you help-me ?

design<- model.matrix(~ targetinfo$group)  
> fit <- lmFit(project.bgcorrect.norm.filt.mean$E, design)
> contrasts<-makeContrasts(habits-non_habits,levels = design)
Error in makeContrasts(habits - non_habits, levels = design) : 
  The levels must by syntactically valid names in R, see help(make.names).  Non-valid names: targetinfo$groupnon_habits
Além disso: Warning message:
In makeContrasts(habits - non_habits, levels = design) :
  Renaming (Intercept) to Intercept
ADD REPLYlink modified 2 days ago by Kevin Blighe68k • written 3 days ago by daniela.paola.s.p30

It means that your variable names cannot have spaces or other characters like hyphens, back or forward slashes, etc.

Bad:

Negative-Control
Negative/Control
Negative Control

Good:

NegativeControl
Negative_Control
Negative.Control
ADD REPLYlink modified 3 days ago • written 3 days ago by Kevin Blighe68k

My Variables already have:

habits non_habits

ADD REPLYlink written 2 days ago by daniela.paola.s.p30

What is the output of:

levels(targetinfo$groupnon_habits)

?

ADD REPLYlink written 2 days ago by Kevin Blighe68k

I think I did something wrong...

levels(targetinfo$groupnon_habits) NULL

ADD REPLYlink written 2 days ago by daniela.paola.s.p30

I fixed the mistakes... now, when I run:


targets <- factor(targets$group) design <- model.matrix(~0+targets) colnames(design) <- levels(targets) fit <- lmFit(project.bgcorrect.norm.filt.mean$E, design) contrasts<-makeContrasts(habits-non_habits,levels = design) fit2 <- contrasts.fit(fit,contrasts) fit2 <-eBayes(fit2,trend = TRUE) summary(decideTests(fit2,method = "global")) toptable(fit2,coef = 1)


I recieve:


summary(decideTests(fit2,method = "global")) habits - non_habits Down 0 NotSig 11333 Up 0

>

                logFC        AveExpr         t                P.Value
  

AGPAT1 -1.775380 15.53613 -3.794783 0.001044925 HOXD1 4.103525 15.24342 3.789864 0.001057287 FSCN1 3.550966 15.32459 3.310087 0.003297495 STT3B 3.315899 14.04364 3.288376 0.003469572 POFUT1 1.942657 13.70937 3.229489 0.003981497 SFTPA1 3.459939 14.72384 3.223289 0.004039488 LOC56902 2.832716 13.73653 3.184775 0.004418548 CLEC12B 3.440431 18.44207 3.153483 0.004751719 MAP4K5 -2.168434 13.90378 -3.136730 0.004939974 GPR97 -3.919494 14.60409 -3.099695 0.005382097 adj.P.Val B AGPAT1 0.8360452 -4.426130 HOXD1 0.8360452 -4.426427 FSCN1 0.8360452 -4.456384 STT3B 0.8360452 -4.457782 POFUT1 0.8360452 -4.461588 SFTPA1 0.8360452 -4.461990 LOC56902 0.8360452 -4.464492 CLEC12B 0.8360452 -4.466530 MAP4K5 0.8360452 -4.467624 GPR97 0.8360452 -4.470047

I returned on GEO2R, and, for this serie, aparently theres no Diferencially expression bethwen groups. Ps. I wanna compare Non-habits whith habits.

ADD REPLYlink written 2 days ago by daniela.paola.s.p30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1291 users visited in the last hour