Hello,
I am doing RNAseq of primary tumor and metastases in bone and lung, but I am still confused in how unravel the DEGs exclusive for one organ and vice versa. For example, how should I find the top 25 genes exclusive for bone and the top 25 genes exclusive for lung and plot them in heatmap in order to visualize in a single heatmap these 50 genes (25 for bone and 25 for lung) for the three different conditions (primary, bone and lung)?
Do to this I did a super long code, that I don't know if is correct or not. Can you tell me if this is correct and if there is a shorter way to do what I want to do?
Thank you so much in avdance.
Here my code so far:
dds <- DESeqDataSetFromMatrix(countData=RawCounts,
colData=sampleTable,
design= ~ Tropism)
dds <- dds[rowSums(counts(dds)) >= 10, ]
colData(dds)$Tropism <- relevel(colData(dds)$Tropism,
ref="Breast")
dds <- estimateSizeFactors(dds)
dds <- DESeq(dds)
res.shr.bone <- lfcShrink(dds, coef = "Tropism_Bone_vs_Breast",type="apeglm")
res.shr.lung <- lfcShrink(dds, coef = "Tropism_Lung_vs_Breast",type="apeglm")
resFix.bone <- res.shr.bone[!is.na(res.shr.bone$padj),]
resFix.lung <- res.shr.lung[!is.na(res.shr.lung$padj),]
keys <- keys(org.Hs.eg.db)
s <- AnnotationDbi::select(org.Hs.eg.db,keys=keys,
columns=c("ENSEMBL","SYMBOL","ENTREZID"))
s <- s[!is.na(s$ENSEMBL),c("ENSEMBL","SYMBOL","ENTREZID")]
resDF.bone <- as.data.frame(resFix.bone)
resDF.lung <- as.data.frame(resFix.lung)
resDF.bone$ENSEMBL <- rownames(resDF.bone)
resDF.lung$ENSEMBL <- rownames(resDF.lung)
library(plyr)
resDF.bone <- join(resDF.bone,s,by=c("ENSEMBL"))
resDF.bone <- resDF.bone[!duplicated(resDF.bone$ENSEMBL),]
resDF.lung <- join(resDF.lung,s,by=c("ENSEMBL"))
resDF.lung <- resDF.lung[!duplicated(resDF.lung$ENSEMBL),]
# Here I take up and down gene for bone and lung:
resString.bone <- resDF.bone[abs(resDF.bone$log2FoldChange) >= 1 &
(resDF.bone$padj < 0.01),]
resString.lung <- resDF.lung[abs(resDF.lung$log2FoldChange) >= 1 &
(resDF.lung$padj < 0.01),]
resString.bone.up <- resDF.bone[resDF.bone$log2FoldChange >= 1 &
(resDF.bone$padj < 0.01),]
resString.lung.up <- resDF.lung[resDF.lung$log2FoldChange >= 1 &
(resDF.lung$padj < 0.01),]
resString.bone.down <- resDF.bone[resDF.bone$log2FoldChange <= -1 &
(resDF.bone$padj < 0.01),]
resString.lung.down <- resDF.lung[resDF.lung$log2FoldChange <= -1 &
(resDF.lung$padj < 0.01),]
# In this way I have up and down for each, then I do intersection to obtain shared one:
#Select Shared Genes
library(dplyr)
shared.up<- as.data.frame(intersect(resString.bone.up$SYMBOL, resString.lung.up$SYMBOL))
colnames(shared.up)[1] <- "SYMBOL"
shared.down<- as.data.frame(intersect(resString.bone.down$SYMBOL, resString.lung.down$SYMBOL))
colnames(shared.down)[1] <- "SYMBOL"
# Select exclusive genes for bone and lung
exclusive_bone <- anti_join(resString.bone, shared.up, by = 'SYMBOL')
exclusive_lung <- anti_join(resString.lung, shared.up, by = 'SYMBOL')
exclusive_bone <- anti_join(exclusive_bone, shared.down, by = 'SYMBOL')
exclusive_lung <- anti_join(exclusive_lung, shared.down, by = 'SYMBOL')
# And finally I do normalization and plotting:
#Get normalized counts for the top 25 variable genes
topGenesBone <- exclusive_bone[order(exclusive_bone$padj)[1:25], c("ENSEMBL", "SYMBOL")]
topGenesLung <- exclusive_lung[order(exclusive_lung$padj)[1:25], c("ENSEMBL", "SYMBOL")]
topGenes <- rbind(topGenesBone,topGenesLung)
normalized_counts_top <- counts(dds, normalized = T)[topGenes$ENSEMBL,]
#Identify the samples corresponding to the reference group "Breast" "Primary"
reference_group_samples <- colnames(dds)[dds$Tropism == "Breast" & dds$Condition == "Primary"]
#Calculate mean expression for each gene in the reference group
mean_reference_group <- rowMeans(normalized_counts_top[, reference_group_samples])
#Calculate log2FC for each gene relative to the mean of the reference group
log2FC_top <- normalized_counts_top
for (sample in colnames(log2FC_top)) {
log2FC_top[, sample] <- log2(normalized_counts_top[, sample] / mean_reference_group)
}
rownames(log2FC_top) <- topGenes$SYMBOL
#Customize the appearance of the heatmap
pheatmap(log2FC_top,
annotation_legend = TRUE,
cluster_rows = TRUE,
cluster_cols = FALSE,
show_colnames = TRUE,
color = colorRampPalette(c("navy", "white", "firebrick3"))(100),
fontsize = 10,
border_color = "white",
main = "Top 50 Differentially Expressed Genes",
angle_col = '45)
Can you please make the question clear? Most people will not go through this wall of text and unformatted code (
10101
button is your friend). Just show the experimental design and groups, and then one can tell you how to code that up in DESeq2.I've formatted their post for them but making the question clear is all up to them.
Sorry if the question was not clear. I have three condition (Breast, Bone and Lung) and I want to know which DEG are exclusive for bone and which one for lung, since Breast is my reference group. And once I have these exclusive DEGs I want to plot the top 25 per condition (bone and lung) in the same heatmap. Here the design of the sample table
Not that Im an authority on this but surely before comparing primary v mets you should be comparing normal v cancerous in each tissue to give a baseline. Then compare primary v mets, to see how the baseline has changed?
Otherwise, you are comparing 2 altered conditions (primary + mets), without any negative control (non-cancerous) so the differences in gene expression seen may be the result of differences naturally found between those tissues before cancer was present.
Stupid analagy:
First:
Second
Just an idea. Could be way off.
Thank you very much for the answer, however I only care about expression in the cancer cells, since my data are coming from xenograft models.
However, what I need help for is make a for cycle that for each gene present in resString.bone check if the same gene in resDF.lung has a log2foldchange between -0.2 and 0.2, and if it is true I want to save it in a new dataframe. Thank you so much
Most advise to avoid For Loops in R (don't ask me why, unexpected results and behaviours, poor efficiency I think). I would use tidyverse functions. Something along the lines of what you are already doing
Also, it is my preference to have one large object containing all data with an additional column for tissue. so that I can extract from the same object each time to avoid confusion. Then rather than creating a separate object for each tissue, I can consider grouping by tissue. In this case the grouping isn't doing much but it might be useful for other transformations.
Yeah, I heard that for cycle are not super suggested in R. I can't use tibble since my bone df has a different size than lung df.. However the 2 conditions are super similar so I would like to filter it with more stringent parameters. can you help me with the for cycle?
Really not sure how to tackle it with a for loop. Probably needs a nested for loops with nested if loops as well (no thanks!). I suggest posting on stackoverflow a simple representation of the problem if you insist on using a for loop
You could also post on bioconductor if you'd like to try solving it with dplyr (recommended).
Regarding the different sized dataframes, I'm assuming you mean one or the other has more rows or columns, in which case a full join will work.
Moreover, I see I have several genes with pvalue smaller than 10^-300, is this fine? Should I set I threshold?
P-values are not corrected for the multiple comparisons inherent in differential gene expression analysis. FDR (aka qvalue) is used for selecting deferentially expressed genes (i.e. genes where FDR<0.05). Bear in mind these values are nothing more than selection cut-offs. There is nothing inherently right about 0.05 - or any other cutoff. The cutoff is arbitrary. With noisy data with few results, the cut-off has to be more generous and the results will need additional evidence.
on top of this I have many genes with a padj equal to zero, is this coming from an error or it's just that the value is too small for RStudio? How should I resolve it?
That's normal. Anything lower than the lower limit of R's numerical value range is rounded to 0. No resolution is required. Just be careful if you are plotting a volcano plot. I think 0's are ignored... but I'm not sure about this.
See Jared Andrew's answer on the same query
Please do not add answers unless you're answering the top level question. Instead, use
Add Comment
orAdd Reply
as appropriate. I've moved your post to the right location this time, please be more careful in the future.