Question: PCA separation of drugs from an RNA seq experiment.
gravatar for TonyCN
4 months ago by
UK/Oxford/University of Oxford
TonyCN50 wrote:

I would be very grateful for some assistance processing my first real-date RNA seq workflow.

I'm very new at processing RNA seq experimental data, I've been following the Bioconductor DESeq2 workflow, (inc. tximeta and Salmon - on existing lab sample data which has always had a very clear Treated vs. Untreated under a common condition (e.g., light exposure, pain etc). Both groups have some number of repeats. I've been very comfortable going through all the stages.

Now, however, I am conducting my first very "real-world" RNA seq workflow and I'm struggling to find some meaning behind the PCA plots that I've generated. The experiment seems fairly routine: all iPSC stem cells are put in a state of "pain" and they are treated with one of 16 treatments (which includes a DMSO control and 4 marketed pain treatments). Treatments were performed as four independent replicates. In total, we got back 48 samples of sequenced reads; not every library prep for RNA sequencing was successful, it's more like 3 repeats per treatment. The RNAseq metadata looks similar to:

ID -- Well-ID -- BatchNumber -- RINValue -- Drug <and a="" few="" others="" concerning="" concentrations="" etc="">?

1 -- A1 -- 1 -- 8.9 DrugA

2 -- A2 -- 1 -- 9.4 DrugA

3 -- A3 -- 1 -- 7.8 DrugA

4 -- A4 -- 1 -- 9.2 DrugB

5 -- A5 -- 1 -- 8.2 DrugB

.. --- etc...

48 -- E9 -- 2 -- 9.0 DMSO

Some experimental assumptions I'm making before I investigate further:

1) expression profiles from the DMSO treated cell is seen as the cell's natural response to pain. Cells in "pain" treated with DMSO do not get "better". 2) each drug with the exception to DMSO is expected to elicit a pain-relieving response - we've identified molecular level responses to treatment of pain on these cells by monitoring activation-potential levels and results from other molecular benchtop assays (I'm not the experimentalist!) 3a) we have four drugs marketed as pain abortive medication, we expect those to separate from the response the DMSO treatment would give and potential from some or all of the other drugs. 3b) the 11 coincidental drugs (non-pain medication) should separate from the response of DMSO (given that they showed pain treating therapeutic properties) but whether they separate from the 4 traditional pain treatments is unknown.

My steps so far: -I checked QC - looks really good. -I aligned to a reference genome using Salmon. -I followed the precise workflow mentioned above - importing data at the transcript level using tximeta and I used the metadata mentioned above, transcript-to-gene mapping and then performed an analysis of differential expression using DESeq2.

As I can't think of any obvious confound, my formula for this was simply "~ Drug"; including DMSO as a "Drug". I have also tried a linear combination of RINValue and BatchNumber. Adding these makes little to no visible difference to the PCA.

Despite exploring the various PCA plots (log2, rlog, vsd - transformations, and also trying Generalised Linear Model PCA) my the first two PCs have fairly low variance (on vsd with ~Drug, PC1 = 14% and PC2 = 6%) and the distribution of the 16 treatments (48 data points) seem to have no pattern. The three data points for DMSO were evenly positioned across PC1, making it impossible to separate some of the drugs from the DMSO.

I hope I've provided enough information to facilitate the answering of these newbie questions:

1) Not being part of this research field my search engine efforts have failed. I don't know of any reliable publications with the results of RNA seq experiments where the researcher tested a good number of drugs and displayed their PC analysis stages. It would be nice to see how they separated out groups of interest. Any suggestions would be most welcome.

2) Given the consistent condition across all cell samples ("in pain") and the single treatment is "~ Drug" an appropriate formula?

3) In terms of defining clustered groups in the PCA plot, I thought I could actually look to see if samples can be separated using the higher-level information, for example: -Drug designed for the treatment of pain? -Drug class/category?

Would this be a reasonable approach? Whether a pattern appears or not, the rational needs to be there and I ask now, because classifying/grouping drugs can be notoriously difficult.

Update Given the helpful comments below and considerably more time reading up on how PCA is used in RNAseq, I now understand that it's used to try and explain the variance between the samples themselves, not necessary to do with the transcriptomic-impact a treatment has on a sample.

rna-seq • 295 views
ADD COMMENTlink modified 4 months ago • written 4 months ago by TonyCN50
gravatar for swbarnes2
4 months ago by
United States
swbarnes29.3k wrote:

1) I'd start with believing your data. If your data shows no large differences between treatments, then that might be the truth

2) There might be solid DE genes even if they don't drive any of the top PCs.

Are you running your PCA with blind = TRUE? I think if you run it with a design of ~ Drug with blind = FALSE, it's going to try and remove the effect of drug from your PCA, which is the opposite of what you want.

I'd leave batch in your design(make sure it's a factor, not a number), I would not include RIN, but if PCA shows a strong outlier, you can check its RIN to see if you have evidence of a botched RNA prep.

ADD COMMENTlink modified 4 months ago • written 4 months ago by swbarnes29.3k

Thank you. Those comments were very helpful.

1) I just managed to find an RNA seq drug screening publication with over 200 drug library preps. The distribution of their data points was very similar (i.e., immediate pattern) to my own. It was interesting to read "...PCA of gene expression for the 227 libraries revealed no shared principal components underlying a large portion of the gene expression variance for most drugs assayed, suggesting a diverse set of transcriptomic responses." Given, the above, I think I need to brush up on what the distances the PCA plots are measuring and how that fits with a DE gene expression formula. As I mentioned earlier, until today I've only seen clear case vs control separations in PCAs.

2) I've set it to blind=TRUE and I've left batch in as a factor. Thanks on the points concerning "batch" and "RIN" - noted.

ADD REPLYlink written 4 months ago by TonyCN50
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1306 users visited in the last hour