Question

RNA-seq data analysis

3.4 years ago

Peter ▴ 20

Hello everyone,

This is my first analysis of RNA-seq data. I am using the TCGAbiolinks package with the "TCGA-BRCA" project, working with samples of healthy tissue and primary tumors.

I am downloading the HTSeq-FPKM-UQ data and storing them in the variable "my_data". After downloading the data, I assign the samples to their groups: the TP vector stores the IDs of patients with a primary tumor, and the NT vector stores the IDs of the normal samples.

My question is whether the following steps are adequate:

dataPrep <- TCGAanalyze_Preprocessing(object = my_data, cor.cut = 0.6)
dataFilt <- TCGAanalyze_Filtering(tabDF = dataPrep,
                                  method = "quantile", 
                                  qnt.cut =  0.25)
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmNT],
                            mat2 = dataFilt[,dataSmTP],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT")

After these commands, I get an output containing the logFC, p-value, FDR, and other values. I am asking because I do not perform any data normalization: I am using the "HTSeq-FPKM-UQ" table, and I read that:

Fragments Per Kilobase of transcript per Million mapped reads upper quartile (FPKM-UQ) is a RNA-Seq-based expression normalization method. The FPKM-UQ is based on a modified version of the FPKM normalization method.

In addition, I would like to confirm one thing: with this approach, are the upregulated transcripts (FC greater than 1) the ones that are increased in the CTRL?

Thanks in advance!

R RNA-Seq • 1.2k views

updated 3.4 years ago by Hamid Ghaedi 3.2k • written 3.4 years ago by Peter ▴ 20

score 3 · Answer 1 · 2020-11-27

For differential expression analysis, most packages, such as edgeR (which TCGAbiolinks uses for DE analysis) and DESeq2, need raw, un-normalized counts (HTSeq counts). If you would like to read more about why raw counts are required, see the edgeR and DESeq2 user guides. The following will help you get the raw counts. Once you have them, you are good to go for the rest of your analysis.

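library(TCGAbiolinks)

# Query the raw (un-normalized) HTSeq counts for TCGA-BRCA
query_TCGA <- GDCquery(
  project = "TCGA-BRCA",
  data.category = "Transcriptome Profiling", # parameter enforced by GDCquery
  experimental.strategy = "RNA-Seq",
  workflow.type = "HTSeq - Counts")

# Download the files returned by the query
GDCdownload(query = query_TCGA)

# Build the expression object and save a copy to "exp.rda"
my_data <- GDCprepare(query = query_TCGA, save = TRUE, save.filename = "exp.rda")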
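
To give a rough feel for why the FPKM-UQ table is not interchangeable with the counts table, here is a small base-R sketch on a made-up matrix. Everything in it is illustrative: the counts and gene lengths are invented, the helpers `fpkm_uq_like` and `is_raw_count` are hypothetical (they are not part of TCGAbiolinks or edgeR), and the scaling (count × 10^9 / (75th-percentile count × gene length)) is an assumption based on how the GDC describes FPKM-UQ for the HTSeq workflow, not the exact GDC implementation.

```r
# Toy raw counts (invented numbers, not TCGA data): 4 genes x 2 samples.
# sampleB is sampleA sequenced at exactly twice the depth.
counts <- matrix(
  c(103,  206,
    417,  834,
    811, 1622,
     52,  104),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("sampleA", "sampleB"))
)
gene_length <- c(gene1 = 1000, gene2 = 2000, gene3 = 4000, gene4 = 500)  # bp, invented

# Hypothetical helper: FPKM-UQ-like scaling (assumed formula, see text above)
fpkm_uq_like <- function(counts, gene_length) {
  uq <- apply(counts, 2, quantile, probs = 0.75)   # per-sample upper quartile
  sweep(counts, 2, uq, "/") * 1e9 / gene_length    # scale by depth and by length
}

# Hypothetical helper: TRUE if a matrix looks like raw counts
is_raw_count <- function(x) all(x >= 0 & x == round(x))

fpkm_uq_values <- fpkm_uq_like(counts, gene_length)

# The two samples differ two-fold in depth, yet their scaled values coincide,
# so the depth information carried by the raw counts is no longer visible.
all.equal(fpkm_uq_values[, "sampleA"], fpkm_uq_values[, "sampleB"])  # TRUE

is_raw_count(counts)          # TRUE
is_raw_count(fpkm_uq_values)  # FALSE
```

A check like `is_raw_count()` on the matrix you pass to the filtering and DEA steps is a cheap way to confirm you really loaded the counts table before running the rest of the analysis.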