Question

TCGAbiolinks: how to set sample conditions?

1


6.1 years ago

David_emir ▴ 490

Hi all, I hope you are all doing well. I am currently working with the TCGAbiolinks package (please treat my coding knowledge as entry level). I have two sets of barcodes: the first set contains tumor samples without a relapse history and the second contains tumor samples with relapse. My experimental design is to run a differentially methylated region (DMR) analysis comparing set A (without relapse) vs. set B (relapse).

I am following the tutorials in the TCGAbiolinks package, which describe the analysis of tumor vs. normal. My question is: how do I set the condition type? Where should I specify the sample status, i.e. relapsed or not relapsed? Should I prepare a separate file, such as a .csv, or what is the best way to declare sample conditions? The basic code is as follows:

query <- GDCquery(project = CancerProject,
              data.category = "Transcriptome Profiling",
              data.type = "Gene Expression Quantification", 
              workflow.type = "HTSeq - Counts")

samplesDown <- getResults(query,cols=c("cases"))

dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "TP")

dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "NT")
dataSmTP_short <- dataSmTP[1:10]
dataSmNT_short <- dataSmNT[1:10]

queryDown <- GDCquery(project = CancerProject, 
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification", 
                      workflow.type = "HTSeq - Counts", 
                      barcode = c(dataSmTP_short, dataSmNT_short))

GDCdownload(query = queryDown,
            directory = DataDirectory)

dataPrep <- GDCprepare(query = queryDown, 
                       save = TRUE, 
                       directory =  DataDirectory,
                       save.filename = FileNameData)

dataPrep <- TCGAanalyze_Preprocessing(object = dataPrep, 
                                      cor.cut = 0.6,
                                      datatype = "HTSeq - Counts")                      

dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfoHT,
                                      method = "gcContent") 

boxplot(dataPrep, outline = FALSE)

boxplot(dataNorm, outline = FALSE)

dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)   

# mat1 must hold the samples named by Cond1type, and mat2 those named by Cond2type
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmNT_short],
                            mat2 = dataFilt[,dataSmTP_short],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01,
                            logFC.cut = 1,
                            method = "glmLRT")

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,samplesNT],
                        mat2 = dataFilt[,samplesTP],
                        Cond1type = "Normal",
                        Cond2type = "Tumor",
                        fdr.cut = 0.01 ,
                        logFC.cut = 1,
                        method = "glmLRT")

I have no idea how to declare the sample conditions so that I can run the DGE and DMR analyses between these two sets.

Thanks a lot for your Help.

Have a great day ahead !!!

Dave (Confused)!

TCGABolinks TCGA DMR DGE • 3.9k views

ADD COMMENT • link updated 6.1 years ago by Mathias ▴ 90 • written 6.1 years ago by David_emir ▴ 490

score 1 · Answer 1 · 2018-03-16

1


6.1 years ago

Mathias ▴ 90

Does your example work? I recently started working with TCGAbiolinks as well, and some functions throw errors. I haven't had the time to test it out for you, though.

In any case:

dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "TP")

dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "NT")

This is the step where the example specifies its TP (primary solid tumor) and NT (solid tissue normal) barcodes. If you didn't notice this, I suspect you haven't tried reading the help pages for the functions. In R, just type ?some_function and it will show you the documentation for that function. If you check this one, you will notice that the example selects barcodes filtered by sample type. So if you already have the barcodes for your samples, you can just insert them manually at this step:

dataSmTP <- c("barcode1", "barcode2")  # ... and so on; you can rename dataSmTP of course, but make sure to replace it everywhere
# same for NT

Also, that last TCGAanalyze_DEA call uses "samplesNT" and "samplesTP". These are never defined, so I would expect that code to fail.

Otherwise: Do you know where to find the data you need? You can look up what clinical info is available here: https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-entity-list&anchor=clinical

Then there's the function GDCquery_clinic().

In R, type ?GDCquery_clinic. There's not a lot of documentation, but it's there. You should be able to download the clinical data based on the examples, and then select your cases/barcodes conditionally, based on a variable in the clinical info. Are you able to do that?
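As a rough sketch of that selection step in base R, assuming the clinical table is a plain data frame with one row per patient: everything below is toy data, and the column names submitter_id and relapse are placeholders for whatever your clinical download actually contains. The first 12 characters of a TCGA barcode identify the patient, which is what links a sample to its clinical row.

```r
# Toy stand-in for a clinical table (hypothetical values and column names).
clinical <- data.frame(
  submitter_id = c("TCGA-A6-2671", "TCGA-A6-2674", "TCGA-A6-2677", "TCGA-A6-2679"),
  relapse      = c("no", "no", "yes", "yes"),
  stringsAsFactors = FALSE
)

# Sample barcodes as a query might return them (suffixes are made up).
samplesDown <- c("TCGA-A6-2671-01A-01R-1410-07",
                 "TCGA-A6-2674-01A-02R-0821-07",
                 "TCGA-A6-2677-01A-01R-0821-07",
                 "TCGA-A6-2679-01A-02R-1410-07")

# Characters 1-12 of a TCGA barcode are the patient ID.
patient <- substr(samplesDown, 1, 12)

# Look up each sample's status through its patient ID (NA if not found).
status <- clinical$relapse[match(patient, clinical$submitter_id)]

# which() drops samples whose status is NA instead of returning NA entries.
setA <- samplesDown[which(status == "no")]   # without relapse
setB <- samplesDown[which(status == "yes")]  # relapse
```

setA and setB can then be used wherever the example uses dataSmTP and dataSmNT.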

There's another function called GDCprepare_clinic, but I think this downloads the clinical data in the legacy format (not sure about this).

ADD COMMENT • link 6.1 years ago by Mathias ▴ 90

0


Thanks a lot for your suggestion, Mathias.heydt. It was indeed very helpful. I have a set of samples which I have classified as Set A and Set B (both are from colon cancer). Now the problem is that I want to bypass the clinical classification, since I am not comparing normal vs. tumor; instead, I am looking at Set B vs. Set A. Can you please let me know how to go about this? Thanks a lot for your help! (Sorry for my English, I am a non-native English speaker.) Sincerely, Dave

ADD REPLY • link 6.1 years ago by David_emir ▴ 490

1


Yeah, like I said in my answer, if you know the barcodes of your samples in set A and set B, you can just specify them. In the example they define their sets with the TCGAquery_SampleTypes function. You don't need to do that. Just list the barcodes directly. dataSmTP is just a variable name, so you can choose your own, something like:

setA <- c("barcode1", "barcode2")    # ... and so on
setB <- c("barcode20", "barcode21")  # ... and so on

Just make sure that, wherever in the example dataSmTP was written, you change it to setA - the new name of this variable, and do the same for setB. You can choose the barcodes and 'make' your own 2 sets to compare.
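A minimal sketch of that substitution on a toy matrix standing in for dataFilt (the values and the full-length column names are invented). It assumes the matrix columns carry full-length barcodes while your sets hold 16-character sample barcodes, so columns are matched on their first 16 characters. If your column names already match your sets exactly, plain dataFilt[, setA] is enough.

```r
# Toy expression matrix standing in for dataFilt (invented values).
dataFilt <- matrix(1:12, nrow = 3,
                   dimnames = list(c("geneA", "geneB", "geneC"),
                                   c("TCGA-A6-2671-01A-01R-1410-07",
                                     "TCGA-A6-2675-01A-02R-1723-07",
                                     "TCGA-A6-2677-01A-01R-0821-07",
                                     "TCGA-A6-2679-01A-02R-1410-07")))

# Your own two groups; unique() drops accidental duplicates.
setA <- unique(c("TCGA-A6-2671-01A", "TCGA-A6-2675-01A", "TCGA-A6-2671-01A"))
setB <- unique(c("TCGA-A6-2677-01A", "TCGA-A6-2679-01A"))

# Compare on the first 16 characters of each column name.
short <- substr(colnames(dataFilt), 1, 16)

# Fail early if a barcode from either set is missing from the matrix.
stopifnot(all(c(setA, setB) %in% short))

# drop = FALSE keeps a matrix even if a set has only one sample.
matA <- dataFilt[, match(setA, short), drop = FALSE]
matB <- dataFilt[, match(setB, short), drop = FALSE]
```

matA and matB would then take the place of mat1 and mat2 in TCGAanalyze_DEA, with Cond1type and Cond2type set to labels of your own choosing, for example "SetA" and "SetB".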

But if you don't know all the sample barcodes, and you need to find them based on a clinical variable, then you'll want to download the clinical files and write some more code.

You can also add file or case filters in the repository tab of the (TCGA) GDC data portal: https://portal.gdc.cancer.gov/repository

Look up the clinical or biospecimen variable you need in the dictionary: https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-entity-list&anchor=clinical

Then add a case/biospecimen filter in the data portal based on the variables you need. You should then be able to produce a list of all the barcodes with the clinical/biospecimen traits you are interested in.

ADD REPLY • link 6.1 years ago by Mathias ▴ 90

0


I am following the tiagochst code here, and I am finding it hard to replicate. I have defined the left and right samples (Set A and Set B) as follows:

samplelist_left <- c("TCGA-A6-2671-01A", "TCGA-A6-2674-01A", "TCGA-A6-2674-01B",
                     "TCGA-A6-2674-01A", "TCGA-A6-2675-01A", "TCGA-A6-2685-01A",
                     "TCGA-A6-3807-01A", "TCGA-A6-3810-01B", "TCGA-A6-3810-01A",
                     "TCGA-A6-3810-01A", "TCGA-A6-5656-01B")
samplelist_right <- c("TCGA-4N-A93T-01A", "TCGA-5M-AAT4-01A", "TCGA-5M-AATE-01A",
                      "TCGA-A6-2677-01A", "TCGA-A6-2677-01B", "TCGA-A6-2679-01A",
                      "TCGA-A6-2680-01A", "TCGA-A6-2681-01A", "TCGA-A6-2683-01A",
                      "TCGA-A6-2684-01A", "TCGA-A6-2684-01C", "TCGA-A6-2684-01A",
                      "TCGA-A6-3808-01A", "TCGA-A6-4105-01A", "TCGA-A6-4107-01A",
                      "TCGA-A6-5659-01A")

#getProbeInfo(mae)

group.col <- "definition"
group1 <- samplelist_left
group2 <- samplelist_right
dir.out <- "result"

Sig.probes <- get.diff.meth(data = mae,
                            group.col = group.col,
                            group1 = group1,
                            group2 = group2,
                            minSubgroupFrac = 0.2,
                            sig.dif = 0.3,
                            diff.dir = "hypo", # Search for hypomethylated probes in group 1
                            cores = 4,
                            dir.out = dir.out,
                            pvalue = 0.01)

But I am getting the following error, and I am not able to understand where I am going wrong.

    Error in get.diff.meth(data = mae, group.col = group.col, group1 = group1,  : 
    In addition: Warning message:
    In if (!group1 %in% unique(colData(data)[, group.col])) { :
      the condition has length > 1 and only the first element will be used

Please help me understand where I am going wrong in defining the sample groups. I have managed to create the mae object, but I cannot move forward. Please help. Sorry for bugging you, and have a great day! Thanks, Dave

ADD REPLY • link 6.1 years ago by David_emir ▴ 490

0


You're doing something totally different from the example, and different from your first question anyway. The MAE (MultiAssayExperiment) is a separate class, and in the example they create it from SummarizedExperiment objects (lusc.exp and lusc.met). I'm not using these classes, so I can't look into it for you, but in the example the parameters are set to values that are probably stored in the MAE:

group.col <- "definition"
group1 <- "Primary solid Tumor"
group2 <- "Solid Tissue Normal"
dir.out <- "result"

You, on the other hand, are passing whole vectors of barcodes as group1 and group2. The warning you get comes from a check of whether group1 occurs among the values of the group.col column, so the function apparently expects a single label there, not a list of samples.
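To make that concrete, here is a sketch in base R of turning two barcode vectors into a grouping column with one label per sample. The data frame is only a stand-in for the sample annotation (colData) of the MAE, the barcodes are a shortened subset of the lists above, and the column name side and the labels "left" and "right" are made up. Whether and how such a column can be written back into your MAE is something to check in the ELMER and MultiAssayExperiment documentation.

```r
samplelist_left  <- c("TCGA-A6-2671-01A", "TCGA-A6-2674-01A", "TCGA-A6-2675-01A")
samplelist_right <- c("TCGA-4N-A93T-01A", "TCGA-5M-AAT4-01A", "TCGA-A6-2677-01A")

# Stand-in for the sample annotation table, one row per sample.
annot <- data.frame(sample = c(samplelist_left, samplelist_right),
                    stringsAsFactors = FALSE)

# One label per sample; samples in neither list stay NA.
annot$side <- NA_character_
annot$side[annot$sample %in% samplelist_left]  <- "left"
annot$side[annot$sample %in% samplelist_right] <- "right"

# The arguments are now a column name and two single labels,
# the same shape as in the example.
group.col <- "side"
group1    <- "left"
group2    <- "right"

# Both labels are single values that occur in the grouping column.
stopifnot(group1 %in% unique(annot[, group.col]),
          group2 %in% unique(annot[, group.col]))
table(annot$side)
```

With the groups encoded this way, group1 and group2 are length-one strings, which is the shape the check in the warning message appears to expect.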

I suggest you just delve into the documentation to figure stuff out. https://bioconductor.org/packages/release/bioc/manuals/ELMER/man/ELMER.pdf

ADD REPLY • link 6.1 years ago by Mathias ▴ 90