Question

Analyzing processed RNA-seq data: is this TMM normalized?

0

Entering edit mode

3.7 years ago

Ridha ▴ 130

Greetings! Hope everyone is doing great!

I have a question about TMM normalization(pardon me if it seems a simple question as I am still a newbie to this kind of analysis). I want to analyze a publically available RNA-seq dataset(GSE114686). The RNA-seq data is however not raw counts but already pre-processed using edgeR. The description of the pre-processing is a bit confusing for me. In the details of this pre-processing, the researchers mentioned the following :

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

"Sequenced reads were trimmed for adaptor sequence, and masked for low-quality sequence, then mapped to hg19 genome using STAR in bcbio nextgen package with parameters: analysis: RNAseq, aligner: star, trim_reads: read_through, adapters: [truseq, polya], quality_format: standard, strandedness: firststrand. Based on the feature count table generated from STAR alignment, we did differential expression analysis using edgeR. A filtering criterion of mean Count Per Million (CPM) > 1 of each time point in the vehicle-only controls was used, resulting in 13,227 transcripts, in the ProcessedData.csv. The count table was generated from edgeR analysis after TMM normalization Genome_build: hg19 Supplementary_files_format_and_content: [ProcessedData.csv] TMM-normalized feature counts from STAR alignment"

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

My questions are :

1) Based on the aforementioned details, is the data provided really normalized? Or was the normalization only used for filtering, and then raw counts were provided(I don't think the latter is true)? This concern comes to me because when I plotted the log CPM per sample without normalization(method=TMM), I find the boxplots a bit weird(https://ibb.co/rG4F65S) as I was expecting to see uniform expression distribution in these boxplots assuming this dataset is actually in the TMM scale.

2) If my concern is not true and these boxplots represent the normalized expression distributions, can I then just proceed using limma or edgeR without applying cpm function from edgeR, or are there any precautions to take when analyzing NOT raw data?

Thank you very much in advance!

code used

dg_list<-DGEList(counts = sora_df[,2:ncol(sora_df)],genes=sora_df[,1])

geneIDs<-sora_df[,1]
cpm_log2<-cpm(dg_list,log=T)
cpm_log2<-as.tibble(cpm_log2,rownames="geneIDs")

df.pivot <- pivot_longer(cpm_log2, # dataframe to be pivoted
                              cols = D1A1:S4C3, # column names to be stored as a SINGLE variable
                              names_to = "samples", # name of that new variable (column)
                              values_to = "expression") # name of new variable (column) storing all the values (data)
# plotting effect of cleaning 
ggplot(df.pivot) +
aes(x=samples, y=expression, fill=samples) +
geom_boxplot(trim = FALSE, show.legend = FALSE) +
stat_summary(fun = "median", 
           geom = "point", 
           shape = 95, 
           size = 10, 
           color = "black", 
           show.legend = FALSE) +
 labs(y="log2 expression", x = "sample",
   title="Log2 Counts per Million (CPM)",
   subtitle="Filtered, TMM normalized??") +
 theme_bw()+coord_flip()

rna-seq r • 1.4k views

ADD COMMENT • link updated 3.7 years ago by Kevin Blighe 88k • written 3.7 years ago by Ridha ▴ 130

score 0 · Answer 1 · 2021-02-09

0

Entering edit mode

3.7 years ago

Kevin Blighe 88k

Hello,

Greetings! Hope everyone is doing great!

We are doing fine, friend - thank you very much for asking. I trust that you are doing okay, too.

#######

The data provided in GSE114686_ProcessedData.csv.gz is unlogged, 'normalised' counts, and the normalisation method used was TMM via edgeR. It is frustrating why the authors would not just provide raw counts, as this would even mean a smaller file size due to no decimal places.

If you want to use these, I guess that you could go the CPM route, or just do a log2(x + 1) transformation. Either way, the ultimate step would be to proceed via the limma-trend pipeline.

Others may have other opinions.

Kevin

ADD COMMENT • link 3.7 years ago by Kevin Blighe 88k

0

Entering edit mode

Dear Kevin, Thank you very much for your help, I really appreciate it! I have done what you said and got weird results with only around 250 DEGs. Just to double-check that this is not a problem with the methods I used. I have 3 questions. The dataset(GSE114686) contains 84 samples that are normalized and in the TMM scale. I am not interested in all samples, only 6 are of interest to me so I subsetted the data and analyzed them going the logCPM route and the limma-trend pipeline WITHOUT calling the calcNormFactors command as shown below in the code. My questions now are:

1) TMM-normalization is a cross-sample normalization, meaning that these counts/numbers are relative, and accordingly taking specific samples to analyze while ignoring the other samples used to calculate TMM is not correct methodologically, right? So, the relatively low DEGs could be due to my wrong way of analysis?

2) Eventually I would like to compare the DEGs from this dataset with another dataset that is available with raw counts(thank God!) and I analyzed it using DESeq2. Is it methodologically sound to compare DEGs from the limma-trend pipeline(that is to my understanding is analyzed as if they are microarray data while they are RNA-seq, but not from raw counts) with deseq2 or should I analyze the other dataset of raw counts using limma-voom pipeline?(as this uses a more similar approach to limma-trend I think compared to deseq2??)

3) Is there a way that I can get raw counts via GEO of this dataset other than contacting the researchers because I really find it tedious and not robust(from what I read in Biocondcuter and here) to analyze already processed data.

Again, thank you very much for your help!

######################################################3 Code Used

# subsetting data
sora_df<-sora_df%>%select(starts_with(c("en","D3","S3"))) # **note that sora_df is the subsetted data that is in the TMM scale**.
# dge list
dg_list<-DGEList(counts = sora_df[,2:ncol(sora_df)],genes=sora_df[,1])
group<-factor(c("Ct","Ct","Ct","SOR","SOR","SOR"))
# storing cell group
dg_list$samples$group<-group
# Making logCPM2
lcpm <- cpm(dg_list, log=TRUE)
#making design matric
my_design<-model.matrix(~0+group)
colnames(my_design) <- gsub("group", "", colnames(my_design))
# fitting the model====
fit<-lmFit(lcpm,my_design) 
# making contrast matrix
contrast_mat<-makeContrasts(Diff=SOR-Ct,levels=my_design) 
fit2<-contrasts.fit(fit,contrasts=contrast_mat) # here now you estimate SOR-cONTROL
fit3<-eBayes(fit2,trend=TRUE)

ADD REPLY • link 3.7 years ago by Ridha ▴ 130

0

Entering edit mode

1) TMM-normalization is a cross-sample normalization, meaning that these counts/numbers are relative, and accordingly taking specific samples to analyze while ignoring the other samples used to calculate TMM is not correct methodologically, right? So, the relatively low DEGs could be due to my wrong way of analysis?

I would keep all samples throughout the process for this reason, i.e., the normalisation factors are originally determined based on all samples. limma-trend will be applying some extra processing, but I do not know enough about it right now to comment further. When you then get to the contrasts step in limma, you would then just select your groups of interest [for the contrast].

Are you sure that the cpm() function is not calculating normalisation factors again? - have you seen the normalized.lib.sizes parameter? I am not an EdgeR user.

2) Eventually I would like to compare the DEGs from this dataset with another dataset that is available with raw counts(thank God!) and I analyzed it using DESeq2. Is it methodologically sound to compare DEGs from the limma-trend pipeline(that is to my understanding is analyzed as if they are microarray data while they are RNA-seq, but not from raw counts) with deseq2 or should I analyze the other dataset of raw counts using limma-voom pipeline?(as this uses a more similar approach to limma-trend I think compared to deseq2??)

Perhaps a meta-analysis of the p-values would be appropriate?

3) Is there a way that I can get raw counts via GEO of this dataset other than contacting the researchers because I really find it tedious and not robust(from what I read in Biocondcuter and here) to analyze already processed data.

Sometimes, the auhours put the raw counts on some other website, like Zenodo or Figshare; however, they are often not available, and it is indeed frustrating. In many cases, you would have to re-process from the FASTQ files, if they are available, or indeed ask the authors directly.

ADD REPLY • link 3.7 years ago by Kevin Blighe 88k