Hello,
I am new to R and DESeq2 and have been experiencing problems with inputting raw data and creating the metadata to construct a valid dds. Here is my code:
# DESeq2 Analysis
# Load in libraries
library(tidyverse)
library(DESeq2)
library(RColorBrewer)
library(SummarizedExperiment)
# Load in data file in the .csv format
setwd("~/Documents/Bioengineering Research/DESeq2 Analysis/Test 1")
all_countdata <- read.csv("Test 1 Raw Count Data.csv", header = TRUE)
countdata <- as.matrix(all_countdata[,-1], header = TRUE, row.names = 1)
head(countdata)
metadata <- read.csv("Test 1 Metadata DESeq2.csv", header = TRUE)
head(metadata)
# Reorder data
idx = match(colnames(countdata), rownames(metadata))
reordered_metadata = metadata[idx,]
# Analysis with DESeq2 -------------------------------------------------------
# Initiate DESeq2 Object
dds <- DESeqDataSetFromMatrix(countData = countdata, colData = reordered_metadata,
                              design = ~Sample)
My raw count data was in an Excel file that I exported as a CSV. Because I was having problems creating the data frame for the metadata on my own, I manually created a metadata file that I also exported as a CSV for input.
The original raw count data (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE140102) includes 2000 rows, with the respective gene names as row names, and two columns, one for each sample. One sample has the raw counts of cells expressing high FOXP3 levels and the other has the raw counts for low FOXP3 levels. There is no wt control group.
The metadata file I created has two columns. One is labeled samples (so it has two rows: Sample 1 and Sample 2). The other is FOXP3 Expression (so it has two rows: low and high).
When I try running the above code, I receive the error: "Error in DESeqDataSet(se, design = design, ignoreRank) : variables in design formula cannot contain NA: Sample"
I have been unable to find information regarding this error on the Bioconductor support page or elsewhere thus far, and any help regarding this issue would be much appreciated. Thank you!
Yes, I understand the data isn't good enough for normal DESeq2 analysis. My task is only to recapitulate the findings of the study that published this data. It's supposed to be kind of a test run for my first time doing DESeq2 that my post-doc gave me. So, any help with this would help me understand the general work flow of DESeq2 and be much appreciated.
The error message should help you to diagnose the problem (in fact, it diagnoses the problem for you):
I can see that. However, I do not understand what I am doing wrong. If you understand the error I am making, it would be nice if you could share it rather than restating what R has already told me and what I have already tried to fix on my own.
It means that there are NA values in reordered_metadata$Sample, but there cannot be. You will have to trace back a few steps in order to understand why.
What swbarnes2 is saying is important, too, i.e., you should really follow a tutorial first. In the past, when I was learning, I typically followed a tutorial, studied the input and output of each command, and commented my own code. Then, it became easier to adapt these tutorials to other / new datasets.
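To make the "trace back a few steps" advice concrete, here is a minimal base R sketch of how the NA values in reordered_metadata$Sample can arise. It does not use the real files: the toy counts, the gene names, the count column names Sample.1 and Sample.2, and the assignment of low/high to each sample are all assumptions made for illustration. The likely culprit is that a data frame read with read.csv() and no row names specified gets the default row names "1", "2", ..., so match() against the count matrix column names finds nothing.

```r
# Toy stand-ins for the two CSV files. Counts, gene names, column names and
# the low/high assignment are made up for illustration.
countdata <- matrix(
  c(10L, 0L, 5L, 20L, 3L, 7L), ncol = 2,
  dimnames = list(c("geneA", "geneB", "geneC"), c("Sample.1", "Sample.2"))
)
metadata <- data.frame(
  Sample = c("Sample 1", "Sample 2"),
  FOXP3.Expression = c("low", "high")
)

# Step 1: compare the two sets of names that match() is asked to line up.
colnames(countdata)   # "Sample.1" "Sample.2"
rownames(metadata)    # "1" "2" (default row names, since none were set)

# Step 2: no row name equals a column name, so every lookup fails.
idx <- match(colnames(countdata), rownames(metadata))
idx                   # NA NA
reordered_metadata <- metadata[idx, ]
anyNA(reordered_metadata$Sample)   # TRUE: this is what the error refers to

# One possible fix: use the sample labels as row names, spelled the same
# way as the count matrix columns (make.names() turns "Sample 1" into "Sample.1").
rownames(metadata) <- make.names(metadata$Sample)
idx_fixed <- match(colnames(countdata), rownames(metadata))
stopifnot(!anyNA(idx_fixed))
fixed_metadata <- metadata[idx_fixed, ]
all(rownames(fixed_metadata) == colnames(countdata))   # TRUE
```

With the real files, printing colnames(countdata) and rownames(metadata) side by side should show the same kind of mismatch. Note also that as.matrix() ignores header and row.names arguments; if the gene names are in the first column of the CSV, passing row.names = 1 to read.csv() is one way to keep them.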
For DESeq2, I even have a very simple introduction indirectly via one of my own packages: https://bioconductor.org/packages/release/bioc/vignettes/EnhancedVolcano/inst/doc/EnhancedVolcano.html#quick-start
I'd learn on a tutorial dataset, not this. I'm not sure this will run with only two samples.
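A rough way to see why two samples are a concern, sketched in base R only: with the design ~Sample and one sample per level, the design matrix has as many coefficients as samples, which leaves no residual degrees of freedom for estimating variability. The metadata below is hypothetical, and this sketch does not call DESeq2 itself, so how DESeq2 reports the situation on the real data is not shown here.

```r
# Hypothetical two-sample metadata, mirroring the design ~Sample from the question.
metadata <- data.frame(Sample = factor(c("Sample.1", "Sample.2")))

# Design matrix: an intercept plus one indicator column for the second sample.
mm <- model.matrix(~ Sample, data = metadata)
dim(mm)          # 2 2: two samples (rows), two coefficients (columns)

# Residual degrees of freedom = samples minus independent coefficients.
residual_df <- nrow(mm) - qr(mm)$rank
residual_df      # 0: nothing is left over to estimate dispersion from
```

A tutorial dataset with replicates in each group avoids this, since the number of samples then exceeds the number of coefficients.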