Question

Help With TCGAbiolinks package

2.7 years ago

daniela.paola.s.p ▴ 70

How can I download the count expression table for each sample analyzed using the TCGAbiolinks package?

rnaseq TCGA tcgabiolinks R bioconductor • 2.0k views

ADD COMMENT • link 2.7 years ago by daniela.paola.s.p ▴ 70

What have you tried? Please read this to understand how to ask good questions: How To Ask Good Questions On Technical And Scientific Forums

ADD REPLY • link 2.7 years ago by Ram 43k

Hey, thanks for your answer! I'm quite new to bioinformatics, so I tried to download the data from cBioPortal, but I'm very confused because I read that the data isn't totally complete. So now I'm trying to learn how to download the data using the TCGAbiolinks package.

ADD REPLY • link 2.7 years ago by daniela.paola.s.p ▴ 70

so I tried to download the data from cBioPortal,

Which webpage did you try to download from? What was the link you clicked on? If you want people to help you, they should be able to reproduce what you did.

but I'm very confused because I read that the data isn't totally complete.

Where did you read this? Did that source also state how the data is incomplete i.e. what is missing?

So now I'm trying to learn how to download the data using the TCGAbiolinks package.

What have you tried? What code did you run?

ADD REPLY • link 2.7 years ago by Ram 43k

Hello, I'm so sorry for the incomplete answer.

I downloaded the data using this link: https://cbioportal-datahub.s3.amazonaws.com/hnsc_tcga.tar.gz

I read about some differences between the data obtained from cBioPortal and from the GDC here:

"How is the cBioPortal for Cancer Genomics different from the Genomic Data Commons (GDC)? The cBioPortal is an exploratory analysis tool for exploring large-scale cancer genomic data sets that hosts data from large consortium efforts, like TCGA and TARGET, as well as publications from individual labs. You can quickly view genomic alterations across a set of patients, across a set of cancer types, perform survival analysis and perform group comparisons. If you want to explore specific genes or a pathway of interest in one or more cancer types, the cBioPortal is probably where you want to start.

By contrast, the Genomic Data Commons (GDC) aims to be the definitive place for full-download and access to all data generated by TCGA and TARGET. If you want to download raw mRNA expression files or full segmented copy number files, the GDC is probably where you want to start."

To learn how to download with TCGAbiolinks, I'm using this tutorial: https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/query.html

ADD REPLY • link 2.7 years ago by daniela.paola.s.p ▴ 70

There is no need for apologies - we're all here to help each other out, and many of us (myself included) need help understanding how to ask for help properly.

I see - the statement you're referring to is a disclaimer that cBioPortal is not a primary data source. Getting the data through TCGAbiolinks is definitely better, since it is a source closer to the raw data.

You're looking to download RNAseq data using the package, correct? You must have typed a few lines of code - how far have you gotten?

I can understand a bit of Spanish, but can't have a conversation in it. I have tried to avoid idioms/English specific phrases as much as I can, so please let me know if any of my statements are unclear.

ADD REPLY • link 2.7 years ago by Ram 43k

Of course. Thanks for all your support.

This is the code that I've been working with:

###------------------------------------------------------

library(TCGAbiolinks)
library(dplyr)
library(DT)
library(SummarizedExperiment)
library(plyr)
library(limma)
library(biomaRt)


##  =================    TCGA_nonhabits    ====================

non_habits_listSamples <- c("TCGA-BA-5152", "TCGA-CN-A49A", "TCGA-CQ-7069",   "TCGA-P3-A5QF", "TCGA-P3-A6T6", "TCGA-QK-A6IH", "TCGA-CN-6013", "TCGA-CR-6472","TCGA-P3-A5QE", "TCGA-HD-8224", "TCGA-BB-7871", "TCGA-CQ-6220", "TCGA-CQ-7064",  "TCGA-F7-A624", "TCGA-P3-A6T2", "TCGA-HD-A4C1", "TCGA-D6-A6EN", "TCGA-CQ-7068", "TCGA-CV-6953", "TCGA-CV-7407", "TCGA-CV-A463", "TCGA-KU-A66T", "TCGA-MT-A7BN",  "TCGA-UF-A71E", "TCGA-UF-A7JO", "TCGA-UF-A7JT", "TCGA-CQ-A4C9", "TCGA-QK-A8Z7",  "TCGA-CV-A6JD", "TCGA-CN-A642", "TCGA-D6-A6EO", "TCGA-CV-6948", "TCGA-BA-5558",    "TCGA-QK-A64Z", "TCGA-CQ-7063", "TCGA-BB-A5HZ", "TCGA-CN-6018", "TCGA-CQ-7071",   "TCGA-CQ-A4CD", "TCGA-CR-7380", "TCGA-CV-6942", "TCGA-CV-6955", "TCGA-CV-7252",   "TCGA-CV-7416", "TCGA-CV-7425", "TCGA-CV-A45V", "TCGA-HD-A633", "TCGA-MT-A67F",   "TCGA-P3-A6T3", "TCGA-RS-A6TO", "TCGA-BB-A5HU", "TCGA-CR-6484", "TCGA-CV-7428",    "TCGA-CV-7095", "TCGA-CN-6994", "TCGA-CR-7379", "TCGA-CV-7090", "TCGA-CV-7253",  "TCGA-CV-7409", "TCGA-CV-7413", "TCGA-BA-5557", "TCGA-BB-4224", "TCGA-BB-7863",  "TCGA-C9-A47Z", "TCGA-C9-A480", "TCGA-CN-6996", "TCGA-CQ-5327", "TCGA-CQ-5329",  "TCGA-CQ-6229", "TCGA-CQ-7065", "TCGA-CQ-A4CE", "TCGA-CQ-A4CH", "TCGA-CR-6488", "TCGA-CR-7382", "TCGA-CV-5973", "TCGA-CV-5979", "TCGA-CV-6003", "TCGA-CV-6939",  "TCGA-CV-6959", "TCGA-CV-7104", "TCGA-CV-7238", "TCGA-CV-7243", "TCGA-CV-7255","TCGA-CV-7438", "TCGA-CV-A45P", "TCGA-CV-A465", "TCGA-CV-A6JT", "TCGA-CV-A6K0","TCGA-D6-6515", "TCGA-D6-A6EM", "TCGA-DQ-5624", "TCGA-HD-7831", "TCGA-HD-A6HZ", "TCGA-IQ-A61J", "TCGA-IQ-A6SG", "TCGA-MT-A67A", "TCGA-P3-A5QA", "TCGA-QK-A652", "TCGA-T2-A6WX", "TCGA-UP-A6WW", "TCGA-BA-A6DB", "TCGA-CN-4725", "TCGA-CN-4733", "TCGA-CN-4737", "TCGA-CR-7372", "TCGA-CR-7393", "TCGA-IQ-A61L", "TCGA-BA-6873",    "TCGA-H7-A6C4", "TCGA-DQ-5630", "TCGA-CQ-6222", "TCGA-CX-7085", "TCGA-CR-7391",  "TCGA-CN-6017", "TCGA-4P-AA8J", "TCGA-CQ-7067", "TCGA-CV-7236")

query.exp <- GDCquery(project = "TCGA-HNSC", 
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type = "results",
                      barcode = non_habits_listSamples,
                      experimental.strategy = "RNA-Seq",
                      sample.type = c("Primary Tumor","Solid Tissue Normal"))
GDCdownload(query.exp)

non_habits_HNSC.exp <- GDCprepare(query = query.exp, save = TRUE,
                                  save.filename = "non_habits_HNSC_selectedExp.rda")

# get subtype information 
dataSubt <- TCGAquery_subtype(tumor = "HNSC")

# get clinical data 
dataClin <- GDCquery_clinic(project = "TCGA-HNSC","clinical") 


# Which samples are Primary Tumor
dataSmTP <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"TP") 

# which samples are solid tissue normal
dataSmNT <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"NT")

dataPrep <-TCGAanalyze_Preprocessing(object = non_habits_HNSC.exp, cor.cut = 0.6)                      

dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfo,
                                      method = "gcContent")                
# filter the data:
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)   

######    
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmNT],
                            mat2 = dataFilt[,dataSmTP],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT")  

write.table(dataDEGs, "non_habits_HNSC_selected.txt", sep="\t")

TCGAVisualize_volcano(x = dataDEGs$logFC,
                      y = dataDEGs$FDR,
                      filename = "non_habits_HNSCselected_volcanoexp.png",
                      x.cut = 6,
                      y.cut = 10^-5,
                      names = rownames(dataDEGs),
                      color = c("black","red","darkgreen"),
                      names.size = 2,
                      xlab = " Gene expression fold change (Log2)",
                      legend = "State",
                      title = "Volcano plot (Normal vs Tumor)",
                      width = 10)

**************************************************************************************

With this code I obtained the DE spreadsheet, but I need the counts or logFC for each of the samples that I used. Do you understand what I mean?

ADD REPLY • link updated 2.7 years ago by Ram 43k • written 2.7 years ago by daniela.paola.s.p ▴ 70

score 2 · Answer 1 · 2021-08-02

2.7 years ago

Ram 43k

I understand your statements but I'm not confident I understand your problem. Are you looking to get the normalized counts across samples?

I looked at the analysis functions page (your well-documented code helped a lot in understanding your workflow), and I see this step:

assay(BRCARnaseqSE,"raw_count")

which in your case would be:

res <- assay(GDCprepare(query.exp), "raw_count")

Maybe that would give you a matrix of raw counts that you can work with?
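
As a rough sketch of what that extraction looks like, the snippet below builds a small toy SummarizedExperiment and pulls out its "raw_count" assay as a matrix. The gene names, barcodes, counts, and output file name are all made up for illustration. With real data, the already prepared object non_habits_HNSC.exp from the code above could presumably be passed to assay() directly instead of calling GDCprepare() a second time. The assay name "raw_count" is the one shown in the vignette step quoted here and may differ for other data types, so checking assayNames() first seems prudent.

```r
library(SummarizedExperiment)

# Toy count matrix: 3 genes x 4 samples (values are invented for illustration)
counts <- matrix(
  c(10, 0, 25,
    12, 3, 30,
    50, 1, 5,
    48, 2, 7),
  nrow = 3,
  dimnames = list(
    c("GENE_A", "GENE_B", "GENE_C"),
    c("TCGA-XX-0001-01A", "TCGA-XX-0002-01A",
      "TCGA-XX-0003-11A", "TCGA-XX-0004-11A")
  )
)

# Stand-in for the object returned by GDCprepare() (an assumption for this sketch)
se <- SummarizedExperiment(assays = list(raw_count = counts))

# List the assays that are available before asking for one by name
print(assayNames(se))

# Extract the per-sample count matrix (genes in rows, samples in columns)
raw_counts <- assay(se, "raw_count")

# Write it out as a tab-separated table; col.names = NA keeps the header
# aligned with the columns when row names are written
out_file <- file.path(tempdir(), "toy_raw_counts.txt")
write.table(raw_counts, out_file, sep = "\t", quote = FALSE, col.names = NA)
```

The resulting matrix has one column per sample barcode, so it can be subset by sample in the same way as dataFilt in the workflow above.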

ADD COMMENT • link 2.7 years ago by Ram 43k

Yes!! I need this information! Thank you, Ram!! Now, another question: can you tell me if this code can somehow provide the logFC calculated individually, or the counts of the normal samples used?

ADD REPLY • link 2.7 years ago by daniela.paola.s.p ▴ 70

Sorry, I don't think I can help you with that. If you opened a question over at https://support.bioconductor.org/ with your current code and asked this question, people there might be able to help you better.
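
For readers with the same follow-up question, here is one possible approach in base R, offered only as a sketch and not as something this thread confirms TCGAbiolinks does for you. The logFC reported by a group-wise differential expression test is a single value per gene, so a per-sample value has to be defined separately; the definition assumed here is each tumor sample's log2 CPM minus the mean log2 CPM of the normal samples. The toy counts and barcodes are invented. Normal samples are picked out using the sample type code at characters 14-15 of the TCGA barcode ("01" for Primary Tumor, "11" for Solid Tissue Normal).

```r
# Toy raw counts: genes in rows, samples in columns (invented values)
counts <- matrix(
  c(100, 900,
    200, 1800,
    400, 600,
    300, 700),
  nrow = 2,
  dimnames = list(
    c("GENE_A", "GENE_B"),
    c("TCGA-XX-0001-11A", "TCGA-XX-0002-11A",
      "TCGA-XX-0003-01A", "TCGA-XX-0004-01A")
  )
)

# Characters 14-15 of a TCGA barcode hold the sample type code:
# "01" = Primary Tumor, "11" = Solid Tissue Normal
sample_code <- substr(colnames(counts), 14, 15)
is_normal <- sample_code == "11"
is_tumor  <- sample_code == "01"

# Counts of the normal samples only
normal_counts <- counts[, is_normal, drop = FALSE]

# Scale each sample to counts per million, then take log2 with a pseudocount of 1
# (a simple normalization chosen for this sketch, not the one used by TCGAanalyze_DEA)
cpm     <- t(t(counts) / colSums(counts)) * 1e6
log_cpm <- log2(cpm + 1)

# Per-sample log2 fold change: each tumor sample minus the mean of the normals
normal_mean      <- rowMeans(log_cpm[, is_normal, drop = FALSE])
per_sample_logfc <- log_cpm[, is_tumor, drop = FALSE] - normal_mean

print(normal_counts)
print(per_sample_logfc)
```

In the workflow posted earlier in the thread, dataSmNT and dataSmTP already hold the normal and tumor barcodes, so they could stand in for is_normal and is_tumor when subsetting a real count matrix.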

ADD REPLY • link 2.7 years ago by Ram 43k

OK, no problem! Thank you for all the support! I'm going to use your tips and answers!

ADD REPLY • link 2.7 years ago by daniela.paola.s.p ▴ 70