Question

RNA seq analysis using linear regression

0

Entering edit mode

5.2 years ago

evelyn ▴ 230

Hello All,

I want to do RNA seq analysis for multiple files together without any replication and control. I do not want to perform a comparison with control, I want to basically run a linear regression. I have input files (RNA paired end) for 50 varieties. There are 15 chemical compounds concentration each for all 50 varieties. I am using salmon to create TPM values and then I want to test for regression relationship between independent variable (each chemical compound which is a vector of all 50 varieties) and dependent variable as TPM values from all the varieties. I have generated the TPM values in quant.sf files using salmon and I need some help about the steps to move forward to perform linear regression analysis in R as this is my first time doing it.

Thank you in advance for any suggestion/comment/guidance!

RNA-Seq • 2.4k views

ADD COMMENT • link updated 5.2 years ago by josephbrown • 0 • written 5.2 years ago by evelyn ▴ 230

1

Entering edit mode

TPM values for each sample would be a vector of size ~25K (number of genes). How do you plan on regressing that with a vector of 15 values (per sample again)?

ADD REPLY • link 5.2 years ago by Ram 45k

0

Entering edit mode

Thank you! It is what I need but I am not sure what I said is the best way to get a linear relationship for TPM values and chemical compounds.

ADD REPLY • link 5.2 years ago by evelyn ▴ 230

score 0 · Answer 1 · 2020-04-17

(I'm new at this, so keep that in mind and please correct me if I'm wrong.)

I would recommend using the tximeta package to import your data and analyzing with DESeq2 to start.

If the concentration is encoded in your metadata table as integer, DESeq2 writes in the console:

the design formula contains one or more numeric variables with integer values, specifying a model with increasing fold change for higher values.

(With numeric values, it does not output this message.)

I believe this is what you're asking for?

Here's how I would solve that:

You'll need to make a metadata table for your data to use with tximeta (including a column of file paths to the quant.sf files:

library(tidyverse)
library(tximeta)
library(DESeq2)

meta <- tribble(
  ~sample, ~drug, ~concentration, ~files,
  "sampA", "A",   1,              "filepath",
  "sampB", "A",   2,              "filepath",
  "sampC", "A",   3,              "filepath",
  "sampD", "A",   4,              "filepath",
  "sampE", "A",   5,              "filepath",
  "sampF", "B",   2,              "filepath",
  "sampG", "B",   3,              "filepath",
  "sampH", "B",   4,              "filepath",
  "sampI", "B",   5,              "filepath",
  "sampJ", "B",   6,              "filepath")

Using a combined dataset:

se_combined <- tximeta(meta)
gse_combined <- summarizeToGene(se_combined)

dds_combined <- DESeqDataSet(gse_combined, design = ~ drug + concentration + drug:concentration)
dds_combined <- DESeq(dds_combined)

# Effect of concentration on drug A:
results(dds_combined, name = "concentration")

# Effect of concentration on drug B:
results(dds_combined, contrast = list(c("concentration", "drugB.concentration")))

From the DESeq2 vignette:

The key point to remember about designs with interaction terms is that, unlike for a design ~genotype + condition, where the condition effect represents the overall effect controlling for differences due to genotype, by adding genotype:condition, the main condition effect only represents the effect of condition for the reference level of genotype (I, or whichever level was defined by the user as the reference level). The interaction terms genotypeII.conditionB and genotypeIII.conditionB give the difference between the condition effect for a given genotype and the condition effect for the reference genotype.

and

The condition effect for genotype II (or III) is obtained by adding the main condition effect and the interaction term for that genotype.

Unfortunately, in the ?results help, it says to use name for continuous variables, which is fine for the effects of drugA, but the effects of drugB used contrast. You might need to break up your dataset by drug to get around this.

?results:

name
the name of the individual effect (coefficient) for building a results table. Use this argument rather than contrast for continuous variables, individual effects or for individual interaction terms. The value provided to name must be an element of resultsNames(object).

To get around using contrast, subset your data and change the design formula:

dds_drug_X <- dds_combined[, dds_combined$drug == "X"]
design(dds_drug_X) <- ~ concentration

dds_drug_X <- DESeq(dds_drug_X)
results(dds_drug_X, name = "concentration")

I would further explore the vignettes for all the necessary steps. The authors of these tools comment here regularly, so maybe they'll chime in with more info.