Hello all,
Apologies in advance for my lack of expertise, as I am a student bioinformatician. I have read other answers as well as the DESeq2 vignette and publications, and I have managed to run the analysis successfully; my problem is choosing the settings for my particular experiment.
At the moment I have RNA-seq data (ENSG IDs and raw counts) for leukaemia patients in 4 different primary cytogenetic groups: inv(16) (n=17), t(8;21) (n=27), MLL (n=29) and Normal (n=19). Unfortunately I only have one tumour sample per patient and no matching normal tissue data. There is a lot of metadata on these patients: they differ in gender/race/ethnicity/FAB category/protocol, and some have other mutations involved. One variable I am expected to use to form groups is event status: 'Censored' patients act as the control group, and 'Relapsed' patients and patients who 'Failed Induction Therapy' are compared to this group.
I have been working on the inv(16) patients first, among which there are 5 Censored and 10 Relapsed patients. Two patients were removed as they were extreme outliers in the PCA plot and sample-distance heatmap. Here is my code so far:
library(DESeq2)

#Setting up colData object
#Previously identified outliers C10 (column 4) and D1 (column 17) removed
metadatainv16<-data.frame(Patient   = colnames(inv16DF),
                          Gender    = inv16PD$Gender,
                          Race      = inv16PD$Race,
                          Ethnicity = inv16PD$Ethnicity,
                          Event     = inv16PD$First.Event)[-c(4,17),]
#Setting up countData object, with Event as the design variable
dds<-DESeqDataSetFromMatrix(countData=inv16DF[,-c(4,17)],
                            colData=metadatainv16,
                            design=~Event)
dim(dds)
#Removing genes with a total of 5 or fewer reads across all samples
dds<-dds[rowSums(counts(dds))>5,]
#Running DESeq2
dds<-DESeq(dds, fitType="parametric", betaPrior=TRUE, minReplicatesForReplace=7)
#Results ordered by padj, and the number of genes with padj < 0.05
resinv16<-results(dds, cooksCutoff=TRUE, independentFiltering=TRUE, alpha=0.05)
resinv16[order(resinv16$padj),]
table(resinv16$padj<0.05)
summary(resinv16)
#Merge results with normalised counts, order by padj and subset padj < 0.05
resdata<-merge(as.data.frame(resinv16), as.data.frame(counts(dds, normalized=TRUE)), by="row.names")
resdata<-resdata[order(resdata$padj),]
names(resdata)[1]<-"Gene"
resdatainv16sig<-subset(resdata, padj<0.05)
#P value histogram
hist(resdata$pvalue, breaks=40, col="skyblue", xlab="P value", main="Histogram of P values for inv(16)")
#Dispersion plot
plotDispEsts(dds, main="Dispersion Plot inv(16)",ylim=c(1e-6, 1e1))
#Regularised log transformation for heatmap and PCA
rld<-rlogTransformation(dds)
#PCA ntd
ntd<-normTransform(dds)
plotPCA(ntd, intgroup="Event")
#PCA rld
plotPCA(rld, intgroup="Event")
#Sample Distance Heatmap
library(RColorBrewer)
library(pheatmap)
heatmapdat<-assay(rld)
dists<-dist(t(heatmapdat))
distmat<-as.matrix(dists)
colours <- colorRampPalette(rev(brewer.pal(9,"Blues")))(255)
pheatmap(distmat,
clustering_distance_rows=dists,
clustering_distance_cols=dists,
col=colours,
main="Sample-to-sample distances inv(16)")
#MA Plot
plotMA(resinv16, main="MA-plot for Event with inv(16)", ylim=range(resinv16$log2FoldChange, na.rm=TRUE))
#Count plot for gene with minimum padj
plotCounts(dds, gene=which.min(resinv16$padj), intgroup="Event")
Which gives the following outputs:
> summary(resinv16)
out of 28453 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 9, 0.032%
LFC < 0 (down) : 28, 0.098%
outliers [1] : 680, 2.4%
low counts [2] : 7463, 26%
(mean count < 5)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
> resinv16[order(resinv16$padj),]
log2 fold change (MAP): Event Relapse vs Censored
Wald test p-value: Event Relapse vs Censored
DataFrame with 28548 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000159189 129.04193 -2.959093 0.6150973 -4.810772 1.503483e-06 0.007669642
ENSG00000160791 89.11385 -2.126363 0.4185705 -5.080059 3.773175e-07 0.007669642
ENSG00000232422 59.67893 -2.824722 0.5868172 -4.813633 1.482112e-06 0.007669642
ENSG00000264063 40.50715 -2.257767 0.4602950 -4.905043 9.340686e-07 0.007669642
ENSG00000234699 13.63526 -2.632995 0.5546159 -4.747420 2.060274e-06 0.008407977
... ... ... ... ... ... ...
ENSG00000267778 0.8509069 -0.1971960 0.4453060 -0.4428325 0.6578869 NA
ENSG00000267785 0.6669036 -0.4189023 0.4848863 -0.8639187 0.3876326 NA
ENSG00000267791 0.6508162 0.0626245 0.4256238 0.1471358 0.8830248 NA
ENSG00000267793 4.9778852 -0.6359067 0.5809850 -1.0945320 0.2737217 NA
ENSG00000267799 4.6509096 0.7252028 0.5571731 1.3015754 0.1930616 NA
and these graphical outputs:
https://postimg.cc/gallery/2qhrndj0u/
I have a few questions, despite trying my best to learn how to optimise the settings, as my results seem quite poor.
1: I am unsure how to tell whether variables (gender/race/ethnicity/etc.) contribute significantly to my results. I could possibly include them in design= to account for that variation, but I can't see any clear connection between these variables and how the patients cluster or appear in the PCA plots (the snippet below shows how I have been checking this by eye).
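This is only a rough sketch of how I have been eyeballing covariate structure, assuming the colData columns are named as in my code above; pcadat is just my own variable name, and the ggplot2 part is my own plotting on top of plotPCA's returnData output:
library(ggplot2)
#Pull PC1/PC2 coordinates from the rlog data along with the metadata columns
pcadat<-plotPCA(rld, intgroup=c("Event","Gender","Race","Ethnicity"), returnData=TRUE)
percentVar<-round(100*attr(pcadat, "percentVar"))
#Colour by one covariate at a time (Gender shown here) and shape by Event
ggplot(pcadat, aes(PC1, PC2, colour=Gender, shape=Event)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance"))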
2: The patients don't cluster at all, which may be related to my first point, and I think this is a major problem when trying to find DEGs; however, I am still expected to group patients according to event status despite having highlighted this issue. I've tried many combinations of samples and I've tried editing the design, e.g. design=~Gender+Race+Ethnicity+Event or different combinations of this (there is a sketch of how I set this up after this question). The PCA plot and the clustering in the sample-distance heatmap do not seem to improve when I choose only individuals of the same race and ethnicity. One thought I had was to manually pick out samples that look similar on the PCA plot, e.g. 3 Censored patients and 3 Relapsed patients, although I know this reduces the power of the analysis and would probably be heavily biased. It did produce more DEGs and better clustering, e.g.:
> summary(resinv16)
out of 25289 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 3337, 13%
LFC < 0 (down) : 3239, 13%
outliers [1] : 256, 1%
low counts [2] : 4903, 19%
(mean count < 6)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
and produced the following graphical outputs:
https://postimg.cc/gallery/3al8ck7om/
I don't know how to approach this analysis. Could anyone give me advice on this? Are the patients in the Censored/Relapsed/Induction Failure groups considered biological replicates?
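For reference, this is roughly how I set up the covariate-adjusted design mentioned above. It is only a sketch: I am not sure which covariates, if any, belong in the formula, resadj is my own name for the results object, and with so few samples some covariate combinations give a model matrix that is not full rank, which DESeq() reports as an error.
#Covariates go before the variable of interest in the design formula
design(dds)<- ~Gender+Race+Event
dds<-DESeq(dds)
resadj<-results(dds, contrast=c("Event","Relapse","Censored"), alpha=0.05)
summary(resadj)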
3: I have tried minReplicatesForReplace=Inf and cooksCutoff=FALSE when I observed hundreds to thousands of outliers - is this correct? I have also made boxplots of the Cook's distances with these settings and found no other strongly outlying samples, so I assume I don't need to remove any further samples on this basis (the snippet after the boxplot link shows exactly where I put these arguments). This is the code for the boxplot:
par(mar=c(8,5,2,2))
boxplot(log10(assays(dds)[["cooks"]]), range=0, las=2, main="Cook's Distances inv(16)")
and boxplot:
https://postimg.cc/image/5g72qg1px/
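Just to show where I have been putting those two arguments (as far as I understand, outlier replacement is controlled in DESeq() and the Cook's cutoff in results(); this is only how I have been calling them, not necessarily the right way):
#Turn off outlier replacement and Cook's-distance filtering of p values
dds<-DESeq(dds, fitType="parametric", betaPrior=TRUE, minReplicatesForReplace=Inf)
resinv16<-results(dds, cooksCutoff=FALSE, independentFiltering=TRUE, alpha=0.05)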
4: Are the settings appropriate for this analysis and this number of samples?
I apologise once again for any redundant questions, and a massive thank you in advance for any help!