Greetings! Hope everyone is doing great!
I have a question about TMM normalization(pardon me if it seems a simple question as I am still a newbie to this kind of analysis). I want to analyze a publically available RNA-seq dataset(GSE114686). The RNA-seq data is however not raw counts but already pre-processed using edgeR. The description of the pre-processing is a bit confusing for me. In the details of this pre-processing, the researchers mentioned the following :
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
"Sequenced reads were trimmed for adaptor sequence, and masked for low-quality sequence, then mapped to hg19 genome using STAR in bcbio nextgen package with parameters: analysis: RNAseq, aligner: star, trim_reads: read_through, adapters: [truseq, polya], quality_format: standard, strandedness: firststrand. Based on the feature count table generated from STAR alignment, we did differential expression analysis using edgeR. A filtering criterion of mean Count Per Million (CPM) > 1 of each time point in the vehicle-only controls was used, resulting in 13,227 transcripts, in the ProcessedData.csv. The count table was generated from edgeR analysis after TMM normalization Genome_build: hg19 Supplementary_files_format_and_content: [ProcessedData.csv] TMM-normalized feature counts from STAR alignment"
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
My questions are :
1) Based on the aforementioned details, is the data provided really normalized? Or was the normalization only used for filtering, and then raw counts were provided(I don't think the latter is true)? This concern comes to me because when I plotted the log CPM per sample without normalization(method=TMM), I find the boxplots a bit weird(https://ibb.co/rG4F65S) as I was expecting to see uniform expression distribution in these boxplots assuming this dataset is actually in the TMM scale.
2) If my concern is not true and these boxplots represent the normalized expression distributions, can I then just proceed using limma or edgeR without applying cpm function from edgeR, or are there any precautions to take when analyzing NOT raw data?
Thank you very much in advance!
code used
dg_list<-DGEList(counts = sora_df[,2:ncol(sora_df)],genes=sora_df[,1])
geneIDs<-sora_df[,1]
cpm_log2<-cpm(dg_list,log=T)
cpm_log2<-as.tibble(cpm_log2,rownames="geneIDs")
df.pivot <- pivot_longer(cpm_log2, # dataframe to be pivoted
cols = D1A1:S4C3, # column names to be stored as a SINGLE variable
names_to = "samples", # name of that new variable (column)
values_to = "expression") # name of new variable (column) storing all the values (data)
# plotting effect of cleaning
ggplot(df.pivot) +
aes(x=samples, y=expression, fill=samples) +
geom_boxplot(trim = FALSE, show.legend = FALSE) +
stat_summary(fun = "median",
geom = "point",
shape = 95,
size = 10,
color = "black",
show.legend = FALSE) +
labs(y="log2 expression", x = "sample",
title="Log2 Counts per Million (CPM)",
subtitle="Filtered, TMM normalized??") +
theme_bw()+coord_flip()
Dear Kevin, Thank you very much for your help, I really appreciate it! I have done what you said and got weird results with only around 250 DEGs. Just to double-check that this is not a problem with the methods I used. I have 3 questions. The dataset(GSE114686) contains 84 samples that are normalized and in the TMM scale. I am not interested in all samples, only 6 are of interest to me so I subsetted the data and analyzed them going the logCPM route and the limma-trend pipeline WITHOUT calling the calcNormFactors command as shown below in the code. My questions now are:
1) TMM-normalization is a cross-sample normalization, meaning that these counts/numbers are relative, and accordingly taking specific samples to analyze while ignoring the other samples used to calculate TMM is not correct methodologically, right? So, the relatively low DEGs could be due to my wrong way of analysis?
2) Eventually I would like to compare the DEGs from this dataset with another dataset that is available with raw counts(thank God!) and I analyzed it using DESeq2. Is it methodologically sound to compare DEGs from the limma-trend pipeline(that is to my understanding is analyzed as if they are microarray data while they are RNA-seq, but not from raw counts) with deseq2 or should I analyze the other dataset of raw counts using limma-voom pipeline?(as this uses a more similar approach to limma-trend I think compared to deseq2??)
3) Is there a way that I can get raw counts via GEO of this dataset other than contacting the researchers because I really find it tedious and not robust(from what I read in Biocondcuter and here) to analyze already processed data.
Again, thank you very much for your help!
######################################################3 Code Used
I would keep all samples throughout the process for this reason, i.e., the normalisation factors are originally determined based on all samples. limma-trend will be applying some extra processing, but I do not know enough about it right now to comment further. When you then get to the contrasts step in limma, you would then just select your groups of interest [for the contrast].
Are you sure that the
cpm()
function is not calculating normalisation factors again? - have you seen thenormalized.lib.sizes
parameter? I am not an EdgeR user.Perhaps a meta-analysis of the p-values would be appropriate?
Sometimes, the auhours put the raw counts on some other website, like Zenodo or Figshare; however, they are often not available, and it is indeed frustrating. In many cases, you would have to re-process from the FASTQ files, if they are available, or indeed ask the authors directly.