Missing samples (observations) in time series data
6.4 years ago

Hi all,

I am trying to find differentially expressed genes in a dataset where the samples were treated with a drug and profiled at three time points (baseline, 16 weeks, 52 weeks). There are no controls in the study. I am using limma in R and want to analyze the data as paired samples, using the F-statistic. I have one challenge: not every subject has a sample at every time point, so a few observations are missing. For instance, a subject may have baseline and week 16 while week 52 is missing. The easiest way to handle the missing observations, I thought, was to keep only the subjects that have all three time points. But I think that, from a statistical point of view, this might not be the correct way. Can anybody suggest a method to handle this issue? I have been reading a lot of papers and came across imputation techniques, but I am not sure whether they would be applicable to this scenario.

The sample annotation is shown below:

Accession   Title                     time      timepoints  subjectid
GSM2352693  SUBJ.1720, SLE, baseline  baseline  3           SUBJ.1720
GSM2352694  SUBJ.1720, SLE, week16    week 16   3           SUBJ.1720
GSM2352695  SUBJ.1720, SLE, week52    week 52   3           SUBJ.1720
GSM2352696  SUBJ.0003, SLE, baseline  baseline  3           SUBJ.0003
GSM2352697  SUBJ.0003, SLE, week16    week 16   3           SUBJ.0003
GSM2352698  SUBJ.0003, SLE, week52    week 52   3           SUBJ.0003
GSM2352699  SUBJ.0065, SLE, baseline  baseline  2           SUBJ.0065
GSM2352700  SUBJ.0065, SLE, week52    week 52   2           SUBJ.0065
GSM2352701  SUBJ.1587, SLE, baseline  baseline  3           SUBJ.1587
GSM2352702  SUBJ.1587, SLE, week16    week 16   3           SUBJ.1587
GSM2352703  SUBJ.1587, SLE, week52    week 52   3           SUBJ.1587
GSM2352704  SUBJ.1028, SLE, baseline  baseline  3           SUBJ.1028
GSM2352705  SUBJ.1028, SLE, week16    week 16   3           SUBJ.1028
GSM2352706  SUBJ.1028, SLE, week52    week 52   3           SUBJ.1028
GSM2352707  SUBJ.0901, SLE, baseline  baseline  3           SUBJ.0901
GSM2352708  SUBJ.0901, SLE, week16    week 16   3           SUBJ.0901
GSM2352709  SUBJ.0901, SLE, week52    week 52   3           SUBJ.0901
GSM2352710  SUBJ.1544, SLE, baseline  baseline  3           SUBJ.1544
GSM2352711  SUBJ.1544, SLE, week16    week 16   3           SUBJ.1544
GSM2352712  SUBJ.1544, SLE, week52    week 52   3           SUBJ.1544
GSM2352713  SUBJ.0200, SLE, baseline  baseline  3           SUBJ.0200
GSM2352714  SUBJ.0200, SLE, week16    week 16   3           SUBJ.0200
GSM2352715  SUBJ.0200, SLE, week52    week 52   3           SUBJ.0200
GSM2352716  SUBJ.0032, SLE, baseline  baseline  3           SUBJ.0032
GSM2352717  SUBJ.0032, SLE, week16    week 16   3           SUBJ.0032
GSM2352718  SUBJ.0032, SLE, week52    week 52   3           SUBJ.0032
GSM2352719  SUBJ.1545, SLE, week16    week 16   2           SUBJ.1545
GSM2352720  SUBJ.1545, SLE, week52    week 52   2           SUBJ.1545

The R code I am trying to use is:

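library(limma)

# One column per time point (no intercept). Note that subject is not in the
# design, so this model treats the samples as unpaired.
design <- model.matrix(~0 + eset_filtered$cohort)
colnames(design) <- levels(eset_filtered$cohort)
fit <- lmFit(eset_filtered, design)

# Contrasts of each later time point against baseline. The level names must be
# syntactically valid R names (e.g. "week16", not "week 16").
contrast.matrix <- makeContrasts("week16-baseline", "week52-baseline", levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)

# With no coef given, topTable ranks genes by the F-statistic across both contrasts.
top.active <- topTable(fit2, adjust.method = "BH", number = nrow(eset_filtered))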

Any suggestions would be really appreciated. Thanks in advance.

Microarray Time-series Limma • 1.2k views
6.4 years ago

"The easiest way to handle the missing observations, I thought, was to keep only the subjects that have all three time points. But I think that, from a statistical point of view, this might not be the correct way."

Why do you say that? Having data at all time-points for every subject is actually the desirable situation. The drawback of restricting the analysis to those subjects is that you reduce your statistical power, which is perhaps what you meant (?).
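To see how much data the complete-case restriction would cost, one could count the time points per subject before subsetting. A minimal sketch in base R, using a few rows copied from the sample table in the question; the data frame name `pheno` and its column names are assumptions for illustration:

```r
# A few rows copied from the sample table in the question.
# The object name `pheno` and the column names are assumptions.
pheno <- data.frame(
  accession = c("GSM2352693", "GSM2352694", "GSM2352695",
                "GSM2352699", "GSM2352700",
                "GSM2352719", "GSM2352720"),
  time      = c("baseline", "week16", "week52",
                "baseline", "week52",
                "week16", "week52"),
  subjectid = c("SUBJ.1720", "SUBJ.1720", "SUBJ.1720",
                "SUBJ.0065", "SUBJ.0065",
                "SUBJ.1545", "SUBJ.1545"),
  stringsAsFactors = FALSE
)

# Count the distinct time points observed for each subject.
n_timepoints <- tapply(pheno$time, pheno$subjectid,
                       function(x) length(unique(x)))

# Subjects with all three time points (the complete cases).
complete_subjects <- names(n_timepoints)[n_timepoints == 3]
pheno_complete <- pheno[pheno$subjectid %in% complete_subjects, ]

# Number of samples that the complete-case restriction would discard.
n_dropped <- nrow(pheno) - nrow(pheno_complete)
```

In this small excerpt only SUBJ.1720 is complete, so four of the seven samples would be discarded; in the full table the loss is the four samples from SUBJ.0065 and SUBJ.1545.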

Given the data that you've got, I would not restrict myself to a single type of analysis (I have analysed various datasets like yours in the past):

  • What changes from Baseline to 16 weeks? (simple comparison in Limma)
  • What changes from 16 weeks to 52 weeks? (simple comparison in Limma)
  • What changes from Baseline to 52 weeks? (simple comparison in Limma)
  • What differs between all time-points (ANOVA and F-test)?
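The comparisons listed above could be set up in limma roughly as follows. This is only a sketch under stated assumptions, not necessarily the approach the answer had in mind: it treats subject as a random effect via `duplicateCorrelation()`, which is one option for keeping subjects that lack a time point while still accounting for the pairing. The expression matrix and sample sheet are simulated, and `duplicateCorrelation()` needs the statmod package to be installed.

```r
library(limma)

set.seed(1)

# Simulated stand-in for the real data: 10 subjects, two of them missing one
# time point. All values are made up for illustration.
pheno <- expand.grid(time = c("baseline", "week16", "week52"),
                     subject = paste0("S", 1:10),
                     stringsAsFactors = FALSE)
pheno <- pheno[!(pheno$subject == "S5"  & pheno$time == "week16"), ]
pheno <- pheno[!(pheno$subject == "S10" & pheno$time == "baseline"), ]
expr <- matrix(rnorm(200 * nrow(pheno)), nrow = 200,
               dimnames = list(paste0("gene", 1:200), NULL))

# Time as a factor with syntactically valid level names (no spaces).
time <- factor(pheno$time, levels = c("baseline", "week16", "week52"))
design <- model.matrix(~0 + time)
colnames(design) <- levels(time)

# Estimate the within-subject correlation and pass it to lmFit, so that
# repeated samples from one subject are not treated as independent.
corfit <- duplicateCorrelation(expr, design, block = pheno$subject)
fit <- lmFit(expr, design, block = pheno$subject,
             correlation = corfit$consensus.correlation)

# The three pairwise comparisons.
contrast.matrix <- makeContrasts(
  w16_vs_base = week16 - baseline,
  w52_vs_w16  = week52 - week16,
  w52_vs_base = week52 - baseline,
  levels = design
)
fit2 <- eBayes(contrasts.fit(fit, contrast.matrix))

# One pairwise comparison (moderated t-test).
top_w16 <- topTable(fit2, coef = "w16_vs_base",
                    adjust.method = "BH", number = Inf)

# Any difference between time points (moderated F-test over two contrasts).
top_any <- topTable(fit2, coef = c("w16_vs_base", "w52_vs_base"),
                    adjust.method = "BH", number = Inf)
```

An alternative is to add subject to the design as a fixed effect (`~0 + time + subject`); which of the two is preferable depends on how unbalanced the data are.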

I would also do paired tests in the form of the Wilcoxon signed-rank test between each time-point on a pairwise basis (3 comparisons).
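A per-gene paired Wilcoxon signed-rank test for one of those comparisons might look like the sketch below. The matrices are simulated and their names are hypothetical; the only real requirement is that column i of both matrices belongs to the same subject.

```r
set.seed(2)

# Simulated stand-in: 50 genes in 8 subjects at two time points.
# Column i of each matrix is the same subject.
subjects <- paste0("S", 1:8)
baseline <- matrix(rnorm(50 * 8), nrow = 50,
                   dimnames = list(paste0("gene", 1:50), subjects))
week16 <- baseline + matrix(rnorm(50 * 8, mean = 0.2), nrow = 50)

# A paired test needs both members of each pair, so a subject missing either
# time point has to be left out of this particular comparison.
stopifnot(identical(colnames(baseline), colnames(week16)))

# Wilcoxon signed-rank test per gene on the paired values.
p_values <- vapply(seq_len(nrow(baseline)), function(i) {
  wilcox.test(week16[i, ], baseline[i, ], paired = TRUE)$p.value
}, numeric(1))
names(p_values) <- rownames(baseline)

# Benjamini-Hochberg adjustment across genes.
p_adjusted <- p.adjust(p_values, method = "BH")
```

With only 8 pairs the smallest possible two-sided exact p-value is 2/2^8, about 0.0078, so few genes can survive multiple-testing correction on this test alone.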

Further, if you have drug response data: given the genes that differ statistically significantly between baseline and 16 or 52 weeks, can the levels of these genes in just the baseline samples predict drug response / non-response at 16 or 52 weeks? This could be answered by constructing a logistic regression model with y as DrugResponse (at 16 or 52 weeks) and the x predictors as the baseline expression of the key genes. Then, once you have selected the best predictors, perform ROC analysis on the final model.
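As a rough illustration of that last step, the sketch below fits a logistic regression and computes the area under the ROC curve in base R. Everything in it is simulated: `geneA`, `geneB`, and `response` are hypothetical names, and no drug response data are part of the question.

```r
set.seed(3)

# Simulated stand-in: baseline expression of two hypothetical key genes and a
# binary drug response for 60 subjects.
n <- 60
dat <- data.frame(geneA = rnorm(n), geneB = rnorm(n))
dat$response <- rbinom(n, size = 1, prob = plogis(1.5 * dat$geneA))

# Logistic regression: response as a function of baseline expression.
model <- glm(response ~ geneA + geneB, data = dat, family = binomial)

# Predicted probability of response for each subject.
pred <- predict(model, type = "response")

# Area under the ROC curve via the rank (Mann-Whitney) formula.
n_pos <- sum(dat$response == 1)
n_neg <- sum(dat$response == 0)
auc <- (sum(rank(pred)[dat$response == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)
```

An AUC computed on the same samples used to fit the model is optimistic, so with a small cohort some form of cross-validation or an independent test set would be needed before trusting it.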

Kevin
