Question

Matched paired DEA analysis of BRCA samples using TCGAbiolinks, works for paired = FALSE but not paired = TRUE

2

Entering edit mode

5.8 years ago

aluesley1 ▴ 60

I am attempting to do a DEA analysis of matched paired BRCA normal tumour samples using TCGAbiolinks. When I do an analysis of my samples using this formula

dataDEGsMatched <- TCGAanalyze_DEA(mat1 = MatchedTPdataFilt,
                                   mat2 = MatchedNTdataFilt,
                                   Cond1type = "Tumor",
                                   Cond2type = "Normal",
                                   fdr.cut = 0.01,
                                   logFC.cut = 1.5,
                                   method = "glmLRT")

It works and returns a DEG, however, I do not believe it is matched paired. When I add in the line paired = TRUE I get the following error.

> dataDEGsMatched <- TCGAanalyze_DEA(mat1 = MatchedTPdataFilt,
+                                    mat2 = MatchedNTdataFilt,
+                                    Cond1type = "Tumor",
+                                    Cond2type = "Normal",
+                                    pipeline = "edgeR",
+                                    fdr.cut = 0.01,
+                                    logFC.cut = 1.5,
+                                    method = "glmLRT",
+                                    paired = TRUE)
Batch correction skipped since no factors provided
----------------------- DEA -------------------------------
there are Cond1 type Tumor in  113 samples
there are Cond2 type Normal in  113 samples
there are  14477 features as miRNA or genes 
I Need about  0 seconds for this DEA. [Processing 30k elements /s]  
Error in dimnames(x) <- dn : 
  length of 'dimnames' [2] not equal to array extent

I don't really understand what is causing this error. I have tried to ensure the matrices are both in the correct order so the corresponding columns are paired using -

MatchedTPdataFilt <- MatchedTPdataFilt[, order(colnames(MatchedTPdataFilt))]
MatchedNTdataFilt <- MatchedNTdataFilt[, order(colnames(MatchedNTdataFilt))]

But this doesn't fix the problem.

I also attempted to then change the column names for the matrices so the corresponding names match, but this then returned a different error.

colnames(MatchedTPdataFilt) <- (1:113)
colnames(MatchedNTdataFilt) <- (1:113)
> dataDEGsMatched <- TCGAanalyze_DEA(mat1 = sort(AllMatcheddataFilt[,samplesMatchedTP]),
+                                    mat2 = sort(AllMatcheddataFilt[,samplesMatchedNT]),
+                                    Cond1type = "Tumor",
+                                    Cond2type = "Normal",
+                                    pipeline = "edgeR",
+                                    fdr.cut = 0.01,
+                                    logFC.cut = 1.5,
+                                    method = "glmLRT",
+                                    paired = TRUE)
Error in names(x) <- value : 
  'names' attribute [7] must be the same length as the vector [1]

I have also checked that the dimensions for both matrices are the same -

> dim(MatchedNTdataFilt)
[1] 14477   113
> dim(MatchedTPdataFilt)
[1] 14477   113

I have read the manuals and user guides for TCGAbiolinks and haven't managed to shed any light on the issue, I would appreciate any help.

Thank you.

RNA-Seq TCGAbiolinks R • 3.2k views

ADD COMMENT • link updated 5.8 years ago by Kevin Blighe 87k • written 5.8 years ago by aluesley1 ▴ 60

0

Entering edit mode

Hello again, can you show how you create the MatchedTPdataFilt and MatchedNTdataFilt objects? I just want to re-create the error on my own system. Is it somewhat a continuation of the issue in your other question ( A: Matched Paired Tumour-Normal DEA of BRCA using data downloaded using TCGAbiolink )?

ADD REPLY • link 5.8 years ago by Kevin Blighe 87k

0

Entering edit mode

Hello! Yes it is a continuation from that. I used a workflow for TCGAbiolinks to process the matched data. I have used it before for unmatched normal-tumour samples and it has worked.

Here is what I have used -

Matched.Samples.Normal <- subset(prep.BRCA.normal, 
                                 select = colData(prep.BRCA.normal)$patient 
                                 %in% colData(prep.BRCA.tumour)$patient)

Matched.Samples.Tumour<- subset(prep.BRCA.tumour, 
                                select = colData(prep.BRCA.tumour)$patient 
                                %in% colData(prep.BRCA.normal)$patient)

dataPrepro_MatchedNormal <- TCGAanalyze_Preprocessing(object = Matched.Samples.Normal,
                                                          cor.cut = 0.6,    
                                                          datatype = "raw_count",
                                                          filename = "BRCAmatchednormal.rda")

dataPrepro_MatchedTumour <- TCGAanalyze_Preprocessing(object = Matched.Samples.Tumour,
                                                          cor.cut = 0.6,    
                                                          datatype = "raw_count",
                                                          filename = "BRCAtmatchedtumour.rda")

AllMatcheddataNorm <- TCGAanalyze_Normalization(tabDF = cbind(dataPrepro_MatchedNormal, dataPrepro_MatchedTumour),
                                               geneInfo = TCGAbiolinks::geneInfo,
                                               method = "gcContent")

AllMatcheddataFilt <- TCGAanalyze_Filtering(tabDF = AllMatcheddataNorm,
                                           method = "quantile",
                                           qnt.cut =  0.25)

save(AllMatcheddataFilt, file = paste0("BRCA_Matched_Filt_IlluminaHiSeq.rda"))

# selection of normal samples "NT"
samplesMatchedNT <- TCGAquery_SampleTypes(barcode = colnames(AllMatcheddataFilt),
                                   typesample = c("NT"))

MatchedNTdataFilt <- AllMatcheddataFilt[,samplesMatchedNT]

MatchedNTdataFilt <- MatchedNTdataFilt[, order(colnames(MatchedNTdataFilt))]

colnames(MatchedNTdataFilt) <- (1:113)

# selection of tumor samples "TP"
samplesMatchedTP <- TCGAquery_SampleTypes(barcode = colnames(AllMatcheddataFilt), 
                                   typesample = c("TP"))

MatchedTPdataFilt <- AllMatcheddataFilt[,samplesMatchedTP]

MatchedTPdataFilt <- MatchedTPdataFilt[, order(colnames(MatchedTPdataFilt))]

colnames(MatchedTPdataFilt) <- (1:113)

# Diff.expr.analysis (DEA) of matched paired samples
dataDEGsMatched <- TCGAanalyze_DEA(mat1 = MatchedTPdataFilt,
                                   mat2 = MatchedNTdataFilt,
                                   Cond1type = "Tumor",
                                   Cond2type = "Normal",
                                   fdr.cut = 0.01,
                                   logFC.cut = 1.5,
                                   method = "glmLRT",
                                   paired = TRUE)

ADD REPLY • link 5.8 years ago by aluesley1 ▴ 60

0

Entering edit mode

Okay, I have managed to reproduce the exact same error.... now let me see if I can solve it.

ADD REPLY • link 5.8 years ago by Kevin Blighe 87k

0

Entering edit mode

Posted an answer below

ADD REPLY • link 5.8 years ago by Kevin Blighe 87k

score 2 · Answer 1 · 2018-06-24

Hi aluesley1,

Firstly, you should not be modifying the colnames of your filtered objects, which means not executing the following lines:

colnames(MatchedTPdataFilt) <- (1:113)
colnames(MatchedNTdataFilt) <- (1:113)

So, from the save() command that you run, your code should be (nothing changed here, apart from not changing the colnames):

# selection of normal samples "NT"
samplesMatchedNT <- TCGAquery_SampleTypes(barcode = colnames(AllMatcheddataFilt),
                                   typesample = c("NT"))
MatchedNTdataFilt <- AllMatcheddataFilt[,samplesMatchedNT]
MatchedNTdataFilt <- MatchedNTdataFilt[, order(colnames(MatchedNTdataFilt))]

# selection of tumor samples "TP"
samplesMatchedTP <- TCGAquery_SampleTypes(barcode = colnames(AllMatcheddataFilt), 
                                   typesample = c("TP"))
MatchedTPdataFilt <- AllMatcheddataFilt[,samplesMatchedTP]
MatchedTPdataFilt <- MatchedTPdataFilt[, order(colnames(MatchedTPdataFilt))]

After investigating what was going wrong after that, I found a possible oversight in the TCGAanalyze_DEA() function that only appears when paired=TRUE. I modified this function and was able to run the command. The new function is located here: https://github.com/kevinblighe/BiostarsMisc/blob/master/TCGAanalyze_DEA_Fix_aluesley1.R (only line 31 was edited)

When you download it and load it into R with source("TCGAanalyze_DEA_Fix_aluesley1.R"), you can then run your command:

dataDEGsMatched.paired <- TCGAanalyze_DEA_Fix_aluesley1(mat1 = MatchedTPdataFilt,
                                   mat2 = MatchedNTdataFilt,
                                   Cond1type = "Tumor",
                                   Cond2type = "Normal",
                                   fdr.cut = 0.01,
                                   logFC.cut = 1.5,
                                   method = "glmLRT",
                                   paired = TRUE)

Disclaimer: This was modified for this specific example, so, don't use the modified function for other data.

-------------------------------------------

After having gone through the TCGAbiolinks code, I realised that it's not doing a proper paired analysis. All that it does is filter your dataset so that your 2 conditions have the same matched T-N (based on the shortened TCGA barcode), and then it just compares them with the standard test and same design formula as per the paired=FALSE. This is not a proper analysis because it's not adjusting for the within-patient effects that exist across a matched T-N pair.

Here's the proof:

dataDEGsMatched.unpaired <- TCGAanalyze_DEA_Fix_aluesley1(mat1 = MatchedTPdataFilt,
                                   mat2 = MatchedNTdataFilt,
                                   Cond1type = "Tumor",
                                   Cond2type = "Normal",
                                   fdr.cut = 0.01,
                                   logFC.cut = 1.5,
                                   method = "glmLRT",
                                   paired = FALSE)


head(dataDEGsMatched.paired)
           logFC   logCPM        LR       PValue          FDR
A2ML1  -4.400922 3.092592 113.93648 1.345505e-26 7.448902e-26
AADAC   4.113193 1.559164  82.50130 1.056005e-19 4.100799e-19
ABAT   -1.684259 5.996629  98.78378 2.816324e-23 1.312682e-22
ABCA10  3.628860 3.106490 339.55028 7.994882e-76 3.910199e-74

head(dataDEGsMatched.unpaired)
           logFC   logCPM        LR       PValue          FDR
A2ML1  -4.400922 3.092592 113.93648 1.345505e-26 7.448902e-26
AADAC   4.113193 1.559164  82.50130 1.056005e-19 4.100799e-19
ABAT   -1.684259 5.996629  98.78378 2.816324e-23 1.312682e-22
ABCA10  3.628860 3.106490 339.55028 7.994882e-76 3.910199e-74

...produces the exact same stats. So, when using this DEA function of TCGAbiolinks, simply filtering your own dataset to have matched samples prior to running the DEA and using paired=FALSE is the exact same as using the paired=TRUE parameter.

For breast cancer, I believe it was indeed 112 or 113 for matched breast T-N .

Kevin