Good day everyone,
Before even stating my problem, i would love to thank you all for your help as i am just learning bioinformatics.
What i have.. Samples gotten from different GEO accessions (so basically different studies) that i would love to compare withe my own samples(WT and KO, 3 replicates each). I am thinking that my own samples are going through stem development and so to know the stage, i am using PCA plot to see where it clusters with this publicly available data.
Where i am.. As you can imagine this has been a hassle. I am attempting to use limma to remove the batch effect. My sample metadata has the samples, GEO accession(e.g GSE1245) as the batch effect and another column representing the stem development stage(2i, lif etc). It's not working my samples cluster on the far right by themselves!
Here is my code as performing deseq2(I also tried vst):
mat_rlog <- assay(rld)
mm_rlog <- model.matrix(~Stem_Development, colData(rld))
mat_rlog <- limma::removeBatchEffect(mat_rlog, batch=rld$GEO, design=mm_rlog)
assay(rld) <- mat_rlog
plotPCA(rld, intgroup = c("Stem_Development"))
Weirdly, after i made the bar plot for the library sizes (colsum of each sample) i noticed that my own samples(WT, KO) were higher than the other samples (all 3 replicates for each sample). Please help me... what could the problem be? Is it the confounding from the GEO and stem development?... should i remove the stem development column and change my dds code to ~1 which by the way this is what i have now...
dds <- DESeqDataSetFromMatrix(countData = filtered_counts,
colData = sample_info,
design = ~Stem_Development)
dds
Thank you