I would be very grateful for some assistance processing my first real-date RNA seq workflow.
I'm very new at processing RNA seq experimental data, I've been following the Bioconductor DESeq2 workflow, (inc. tximeta and Salmon - https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html) on existing lab sample data which has always had a very clear Treated vs. Untreated under a common condition (e.g., light exposure, pain etc). Both groups have some number of repeats. I've been very comfortable going through all the stages.
Now, however, I am conducting my first very "real-world" RNA seq workflow and I'm struggling to find some meaning behind the PCA plots that I've generated. The experiment seems fairly routine: all iPSC stem cells are put in a state of "pain" and they are treated with one of 16 treatments (which includes a DMSO control and 4 marketed pain treatments). Treatments were performed as four independent replicates. In total, we got back 48 samples of sequenced reads; not every library prep for RNA sequencing was successful, it's more like 3 repeats per treatment. The RNAseq metadata looks similar to:
ID -- Well-ID -- BatchNumber -- RINValue -- Drug <and a="" few="" others="" concerning="" concentrations="" etc="">?
1 -- A1 -- 1 -- 8.9 DrugA
2 -- A2 -- 1 -- 9.4 DrugA
3 -- A3 -- 1 -- 7.8 DrugA
4 -- A4 -- 1 -- 9.2 DrugB
5 -- A5 -- 1 -- 8.2 DrugB
.. --- etc...
48 -- E9 -- 2 -- 9.0 DMSO
Some experimental assumptions I'm making before I investigate further:
1) expression profiles from the DMSO treated cell is seen as the cell's natural response to pain. Cells in "pain" treated with DMSO do not get "better". 2) each drug with the exception to DMSO is expected to elicit a pain-relieving response - we've identified molecular level responses to treatment of pain on these cells by monitoring activation-potential levels and results from other molecular benchtop assays (I'm not the experimentalist!) 3a) we have four drugs marketed as pain abortive medication, we expect those to separate from the response the DMSO treatment would give and potential from some or all of the other drugs. 3b) the 11 coincidental drugs (non-pain medication) should separate from the response of DMSO (given that they showed pain treating therapeutic properties) but whether they separate from the 4 traditional pain treatments is unknown.
My steps so far: -I checked QC - looks really good. -I aligned to a reference genome using Salmon. -I followed the precise workflow mentioned above - importing data at the transcript level using tximeta and I used the metadata mentioned above, transcript-to-gene mapping and then performed an analysis of differential expression using DESeq2.
As I can't think of any obvious confound, my formula for this was simply "~ Drug"; including DMSO as a "Drug". I have also tried a linear combination of RINValue and BatchNumber. Adding these makes little to no visible difference to the PCA.
Despite exploring the various PCA plots (log2, rlog, vsd - transformations, and also trying Generalised Linear Model PCA) my the first two PCs have fairly low variance (on vsd with ~Drug, PC1 = 14% and PC2 = 6%) and the distribution of the 16 treatments (48 data points) seem to have no pattern. The three data points for DMSO were evenly positioned across PC1, making it impossible to separate some of the drugs from the DMSO.
I hope I've provided enough information to facilitate the answering of these newbie questions:
1) Not being part of this research field my search engine efforts have failed. I don't know of any reliable publications with the results of RNA seq experiments where the researcher tested a good number of drugs and displayed their PC analysis stages. It would be nice to see how they separated out groups of interest. Any suggestions would be most welcome.
2) Given the consistent condition across all cell samples ("in pain") and the single treatment is "~ Drug" an appropriate formula?
3) In terms of defining clustered groups in the PCA plot, I thought I could actually look to see if samples can be separated using the higher-level information, for example: -Drug designed for the treatment of pain? -Drug class/category?
Would this be a reasonable approach? Whether a pattern appears or not, the rational needs to be there and I ask now, because classifying/grouping drugs can be notoriously difficult.
Update Given the helpful comments below and considerably more time reading up on how PCA is used in RNAseq, I now understand that it's used to try and explain the variance between the samples themselves, not necessary to do with the transcriptomic-impact a treatment has on a sample.