Question: Matched paired DEA analysis of BRCA samples using TCGAbiolinks, works for paired = FALSE but not paired = TRUE
1
gravatar for aluesley1
9 months ago by
aluesley140
aluesley140 wrote:

I am attempting to do a DEA analysis of matched paired BRCA normal tumour samples using TCGAbiolinks. When I do an analysis of my samples using this formula

dataDEGsMatched <- TCGAanalyze_DEA(mat1 = MatchedTPdataFilt,
                                   mat2 = MatchedNTdataFilt,
                                   Cond1type = "Tumor",
                                   Cond2type = "Normal",
                                   fdr.cut = 0.01,
                                   logFC.cut = 1.5,
                                   method = "glmLRT")

It works and returns a DEG, however, I do not believe it is matched paired. When I add in the line paired = TRUE I get the following error.

> dataDEGsMatched <- TCGAanalyze_DEA(mat1 = MatchedTPdataFilt,
+                                    mat2 = MatchedNTdataFilt,
+                                    Cond1type = "Tumor",
+                                    Cond2type = "Normal",
+                                    pipeline = "edgeR",
+                                    fdr.cut = 0.01,
+                                    logFC.cut = 1.5,
+                                    method = "glmLRT",
+                                    paired = TRUE)
Batch correction skipped since no factors provided
----------------------- DEA -------------------------------
there are Cond1 type Tumor in  113 samples
there are Cond2 type Normal in  113 samples
there are  14477 features as miRNA or genes 
I Need about  0 seconds for this DEA. [Processing 30k elements /s]  
Error in dimnames(x) <- dn : 
  length of 'dimnames' [2] not equal to array extent

I don't really understand what is causing this error. I have tried to ensure the matrices are both in the correct order so the corresponding columns are paired using -

MatchedTPdataFilt <- MatchedTPdataFilt[, order(colnames(MatchedTPdataFilt))]
MatchedNTdataFilt <- MatchedNTdataFilt[, order(colnames(MatchedNTdataFilt))]

But this doesn't fix the problem.

I also attempted to then change the column names for the matrices so the corresponding names match, but this then returned a different error.

colnames(MatchedTPdataFilt) <- (1:113)
colnames(MatchedNTdataFilt) <- (1:113)
> dataDEGsMatched <- TCGAanalyze_DEA(mat1 = sort(AllMatcheddataFilt[,samplesMatchedTP]),
+                                    mat2 = sort(AllMatcheddataFilt[,samplesMatchedNT]),
+                                    Cond1type = "Tumor",
+                                    Cond2type = "Normal",
+                                    pipeline = "edgeR",
+                                    fdr.cut = 0.01,
+                                    logFC.cut = 1.5,
+                                    method = "glmLRT",
+                                    paired = TRUE)
Error in names(x) <- value : 
  'names' attribute [7] must be the same length as the vector [1]

I have also checked that the dimensions for both matrices are the same -

> dim(MatchedNTdataFilt)
[1] 14477   113
> dim(MatchedTPdataFilt)
[1] 14477   113

I have read the manuals and user guides for TCGAbiolinks and haven't managed to shed any light on the issue, I would appreciate any help.

Thank you.

rna-seq R tcgabiolinks • 560 views
ADD COMMENTlink modified 9 months ago by Kevin Blighe39k • written 9 months ago by aluesley140

Hello again, can you show how you create the MatchedTPdataFilt and MatchedNTdataFilt objects? I just want to re-create the error on my own system. Is it somewhat a continuation of the issue in your other question ( A: Matched Paired Tumour-Normal DEA of BRCA using data downloaded using TCGAbiolink )?

ADD REPLYlink written 9 months ago by Kevin Blighe39k

Hello! Yes it is a continuation from that. I used a workflow for TCGAbiolinks to process the matched data. I have used it before for unmatched normal-tumour samples and it has worked.

Here is what I have used -

Matched.Samples.Normal <- subset(prep.BRCA.normal, 
                                 select = colData(prep.BRCA.normal)$patient 
                                 %in% colData(prep.BRCA.tumour)$patient)

Matched.Samples.Tumour<- subset(prep.BRCA.tumour, 
                                select = colData(prep.BRCA.tumour)$patient 
                                %in% colData(prep.BRCA.normal)$patient)

dataPrepro_MatchedNormal <- TCGAanalyze_Preprocessing(object = Matched.Samples.Normal,
                                                          cor.cut = 0.6,    
                                                          datatype = "raw_count",
                                                          filename = "BRCAmatchednormal.rda")

dataPrepro_MatchedTumour <- TCGAanalyze_Preprocessing(object = Matched.Samples.Tumour,
                                                          cor.cut = 0.6,    
                                                          datatype = "raw_count",
                                                          filename = "BRCAtmatchedtumour.rda")

AllMatcheddataNorm <- TCGAanalyze_Normalization(tabDF = cbind(dataPrepro_MatchedNormal, dataPrepro_MatchedTumour),
                                               geneInfo = TCGAbiolinks::geneInfo,
                                               method = "gcContent")

AllMatcheddataFilt <- TCGAanalyze_Filtering(tabDF = AllMatcheddataNorm,
                                           method = "quantile",
                                           qnt.cut =  0.25)

save(AllMatcheddataFilt, file = paste0("BRCA_Matched_Filt_IlluminaHiSeq.rda"))

# selection of normal samples "NT"
samplesMatchedNT <- TCGAquery_SampleTypes(barcode = colnames(AllMatcheddataFilt),
                                   typesample = c("NT"))

MatchedNTdataFilt <- AllMatcheddataFilt[,samplesMatchedNT]

MatchedNTdataFilt <- MatchedNTdataFilt[, order(colnames(MatchedNTdataFilt))]

colnames(MatchedNTdataFilt) <- (1:113)

# selection of tumor samples "TP"
samplesMatchedTP <- TCGAquery_SampleTypes(barcode = colnames(AllMatcheddataFilt), 
                                   typesample = c("TP"))

MatchedTPdataFilt <- AllMatcheddataFilt[,samplesMatchedTP]

MatchedTPdataFilt <- MatchedTPdataFilt[, order(colnames(MatchedTPdataFilt))]

colnames(MatchedTPdataFilt) <- (1:113)

# Diff.expr.analysis (DEA) of matched paired samples
dataDEGsMatched <- TCGAanalyze_DEA(mat1 = MatchedTPdataFilt,
                                   mat2 = MatchedNTdataFilt,
                                   Cond1type = "Tumor",
                                   Cond2type = "Normal",
                                   fdr.cut = 0.01,
                                   logFC.cut = 1.5,
                                   method = "glmLRT",
                                   paired = TRUE)
ADD REPLYlink written 9 months ago by aluesley140

Okay, I have managed to reproduce the exact same error.... now let me see if I can solve it.

ADD REPLYlink written 9 months ago by Kevin Blighe39k

Posted an answer below

ADD REPLYlink written 9 months ago by Kevin Blighe39k
1
gravatar for Kevin Blighe
9 months ago by
Kevin Blighe39k
Republic of Ireland
Kevin Blighe39k wrote:

Hi aluesley1,

Firstly, you should not be modifying the colnames of your filtered objects, which means not executing the following lines:

colnames(MatchedTPdataFilt) <- (1:113)
colnames(MatchedNTdataFilt) <- (1:113)

So, from the save() command that you run, your code should be (nothing changed here, apart from not changing the colnames):

# selection of normal samples "NT"
samplesMatchedNT <- TCGAquery_SampleTypes(barcode = colnames(AllMatcheddataFilt),
                                   typesample = c("NT"))
MatchedNTdataFilt <- AllMatcheddataFilt[,samplesMatchedNT]
MatchedNTdataFilt <- MatchedNTdataFilt[, order(colnames(MatchedNTdataFilt))]

# selection of tumor samples "TP"
samplesMatchedTP <- TCGAquery_SampleTypes(barcode = colnames(AllMatcheddataFilt), 
                                   typesample = c("TP"))
MatchedTPdataFilt <- AllMatcheddataFilt[,samplesMatchedTP]
MatchedTPdataFilt <- MatchedTPdataFilt[, order(colnames(MatchedTPdataFilt))]

After investigating what was going wrong after that, I found a possible oversight in the TCGAanalyze_DEA() function that only appears when paired=TRUE. I modified this function and was able to run the command. The new function is located here: https://github.com/kevinblighe/BiostarsMisc/blob/master/TCGAanalyze_DEA_Fix_aluesley1.R (only line 31 was edited)

When you download it and load it into R with source("TCGAanalyze_DEA_Fix_aluesley1.R"), you can then run your command:

dataDEGsMatched.paired <- TCGAanalyze_DEA_Fix_aluesley1(mat1 = MatchedTPdataFilt,
                                   mat2 = MatchedNTdataFilt,
                                   Cond1type = "Tumor",
                                   Cond2type = "Normal",
                                   fdr.cut = 0.01,
                                   logFC.cut = 1.5,
                                   method = "glmLRT",
                                   paired = TRUE)

Disclaimer: This was modified for this specific example, so, don't use the modified function for other data.

-------------------------------------------

After having gone through the TCGAbiolinks code, I realised that it's not doing a proper paired analysis. All that it does is filter your dataset so that your 2 conditions have the same matched T-N (based on the shortened TCGA barcode), and then it just compares them with the standard test and same design formula as per the paired=FALSE. This is not a proper analysis because it's not adjusting for the within-patient effects that exist across a matched T-N pair.

Here's the proof:

dataDEGsMatched.unpaired <- TCGAanalyze_DEA_Fix_aluesley1(mat1 = MatchedTPdataFilt,
                                   mat2 = MatchedNTdataFilt,
                                   Cond1type = "Tumor",
                                   Cond2type = "Normal",
                                   fdr.cut = 0.01,
                                   logFC.cut = 1.5,
                                   method = "glmLRT",
                                   paired = FALSE)


head(dataDEGsMatched.paired)
           logFC   logCPM        LR       PValue          FDR
A2ML1  -4.400922 3.092592 113.93648 1.345505e-26 7.448902e-26
AADAC   4.113193 1.559164  82.50130 1.056005e-19 4.100799e-19
ABAT   -1.684259 5.996629  98.78378 2.816324e-23 1.312682e-22
ABCA10  3.628860 3.106490 339.55028 7.994882e-76 3.910199e-74

head(dataDEGsMatched.unpaired)
           logFC   logCPM        LR       PValue          FDR
A2ML1  -4.400922 3.092592 113.93648 1.345505e-26 7.448902e-26
AADAC   4.113193 1.559164  82.50130 1.056005e-19 4.100799e-19
ABAT   -1.684259 5.996629  98.78378 2.816324e-23 1.312682e-22
ABCA10  3.628860 3.106490 339.55028 7.994882e-76 3.910199e-74

...produces the exact same stats. So, when using this DEA function of TCGAbiolinks, simply filtering your own dataset to have matched samples prior to running the DEA and using paired=FALSE is the exact same as using the paired=TRUE parameter.

For breast cancer, I believe it was indeed 112 or 113 for matched breast T-N .

Kevin

ADD COMMENTlink modified 3 months ago • written 9 months ago by Kevin Blighe39k

Thank you so much for this. It's a bit frustrating and misleading that it's not a true paired analysis and I appreciate the time you have put into checking this is the case.

I'm going to have an attempt to perform the analysis myself using edgeR as it is part of my MSc research project and I would really like to be able to describe my method in my write up. Based on my current ability in R I may ask to use your data in the likely event that I am unsuccessful. If I do use your data I will reference in appropriately so long as you are happy for me to use it for this purpose.

ADD REPLYlink written 8 months ago by aluesley140

Okay, yes, we had been wondering how to best communicate this to the TCGAbiolinks developer - it is a clear flaw with the paired analysis only.

If you use DESeq2 or EdgeR, then there are many on this site who can lend support, of course. How are you aiming to get hold of the raw counts? Does TCGAbiolinks provide those?

ADD REPLYlink modified 6 months ago • written 8 months ago by Kevin Blighe39k

I contacted the developer about this and s/he responded. Apparently one of the collaborators on the TCGAbiolinks project was responsible for the paired analysis functionality but I have not heard anything back since then. No easy way to accept that your published work has had a bug this entire time, I suppose.

Welcome to bioinformatics, where we don't believe in proper testing.

ADD REPLYlink written 6 months ago by Kevin Blighe39k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1563 users visited in the last hour