How do I correctly format my limma/eBayes code?
2
0
Entering edit mode
11 months ago
John • 0

Here is part of my data table, which I will call "df":

    Subject.ID    Sex           V1           V2           V3           V4           V5
1   GTEX-1117F female   4.24944534   0.18307358   1.99276843   0.21785110   0.74777407
2   GTEX-1128S male    3.286585483  0.165944088  1.843983844  0.444455046  0.307311939
3   GTEX-11EMC female   3.31947343   0.27846061   1.46519089   0.33753984   0.91708799
4   GTEX-11GSP female  3.232353768  0.215492230  1.778208576  0.166458372  0.644225472
5   GTEX-11TTK male    2.934705200  0.256406854  1.712815854  0.256165277  0.948301812
6   GTEX-11ZUS male     3.77188558   0.18434378   2.43189035   0.37373172   1.14339342
7   GTEX-12WSD female   4.22032995   0.21933892   0.93900085   0.08687886   1.06901468
8   GTEX-12ZZX female   3.43616184   0.39473363   2.73205207   0.33525457   1.51399598
9   GTEX-1313W male     3.66334462   0.39418485   2.47248777   0.53545580   0.64634702
10  GTEX-131XW female  2.956614288  0.276317993  1.977096712  0.167743280  1.528071165

How do I correctly format the following code to account for the kind of dataframe I'm working with? I'm using sex as the factors to be interacted. Here is what I have so far:

design <- model.matrix(~ Sex, data = df)
fit <- lmFit(df, design)
fit <- eBayes(fit)
topTable(fit)

The second line gives me the error Expression object should be numeric, instead it is a data.frame with n non-numeric columns (and I know why I'm getting this error; I'm not sure how to handle the error, though).

Note: I also asked on SE: https://bioinformatics.stackexchange.com/questions/21169/how-do-i-correctly-format-my-limma-ebayes-code-reposting-because-previous-post

R limma eBayes • 2.4k views
ADD COMMENT
0
Entering edit mode

Try subsetting df so it's df[,-c(1,2)] - that will exclude the non-numeric columns

ADD REPLY
0
Entering edit mode

Gives the error "object 'Sex' not found" if I do it like that.

Doing lmFit(data.matrix(new_dataset[,-c(1,2)]), design) gives the error "row dimension of design doesn't match column dimension of data object", and doing fit <- lmFit(new_dataset[,-c(1,2)], design[,-c(1,2)]) gives the same "Expression object should be numeric, instead it is a data.frame with n non-numeric columns" error.

ADD REPLY
0
Entering edit mode

John, I am the author of the limma package. The format of your data is a bit mysterious. Can you explain it a bit more? limma operates on gene expression matrices where the rows are genes and the columns are samples, but your data seems to be the other way around.

How many rows and columns does your data.frame df have? What do the columns V1, V2 represent? How many V columns are there?

It might help if you explained how you created the data.frame df in the first place, i.e., what was the original file that you read into R.

ADD REPLY
0
Entering edit mode

Hello, I sent you a direct email because I do not want the entirety of my code (or even more than what's necessary) in public forums.

ADD REPLY
0
Entering edit mode

You're shotgunning limma/TPM posts on the exact same issue here and on Bioinfo StackExchange. Please stop that and focus content on a single thread. It's just double and tripple effort for the same underlying issue. Read the manuals, use standarf file formats as it recommends (expression matrix) and follow the best practices.

ADD REPLY
2
Entering edit mode

I made OP aware of this on a discord channel off forum. They're new and on my suggestion, they added links to the cross-posts.

ADD REPLY
0
Entering edit mode

perfect, thanks!

ADD REPLY
2
Entering edit mode
11 months ago
Gordon Smyth ★ 7.2k

The error problems you are having are not really to do with limma. The problem is that you have misformated your data into a form that could not be input to any analysis software tool.

You simply need an expression matrix containing counts or log-expression values with rows for genes and columns for samples (the standard format used by all Bioconductor packages for the past 20 years). The matrix should not be a data.frame and it should not contain any annotation columns, just expression.

Subject annotation should be in a separate data.frame and, indeed, Sex could just be factor, it doesn't even need to be in a data.frame.

GTEX provides expression values in one file and subject annotation in another, which means that the data are already in essentially the right format when you read them into R in the first place. All you need to do is not change the format so much.

It is not correct to insert subject annotation into the same data.frame as the expression values. It is not correct to transpose the expression matrix. It is obviously not correct to remove all the gene names because what would an analysis mean with no gene IDs?

Given the data.frame you have, it would be easy to run limma by:

y <- as.matrix(df[,-c(1,2)]
row.names(y) <- df$Subject.ID
y <- t(y)
fit <- lmFit(y, design)

etc. However it would be far better for you to process the original GTEX data files more simply and correctly in the first place. Hopefully these general comments will be enough to push you on the right path.

Finally, let me note that GTEX provides read count files, so there is no need for you to analyse log(TPM+1) as you seem to be doing. You could do a regular limma or edgeR analysis with regular read counts.

ADD COMMENT
0
Entering edit mode

Okay, I finally got it. Here's what was done:

df <- t(ogTable)
df <- as.data.frame(df)
colnames(df) <- ogTable$Name
df[['Subject.ID']] <- rownames(df)

And then I did what Gordon Smyth suggested.

ADD REPLY
0
Entering edit mode
11 months ago
LauferVA 4.2k

This code block works with DESeq2, it will NOT work as is for you. For one thing, it looks like your counts have the patients as rows, whereas DESeq2 puts patients in columns, and each row is a gene. But, the overall approach should be highly similar.

# load any libraries you will require:
libs<-c('DESeq2', 'dplyr')
lapply(libs, library, character.only = TRUE)

# enter file names
setwd('/dir/containing/your/data')
MdFname<-'pathToMdFoo.tsv'
CtsFname<-'pathToCtsFoo.tsv'

# load the files
Md <- read.table(MdFname, fill = TRUE, header = TRUE, sep = "\t")       # Read in raw metadata
Cts <- read.table(CtsFname, fill = TRUE, header = TRUE, sep = "\t")         # Read in raw counts

# Metadata processing:  ************************ relevant for you
row.names(Md)<-Md$PtID; Md$PtID<-NULL       # Re-assign PtID column to be row.names of the metadata object.

# Assign each MetaData column to an appropriate data format:
Md <- Md %>% mutate_if(is.character,as.factor)       # from package dplyr, v.1.1.2. ***** relevant for you ******
Md <- Md %>% mutate_if(is.integer,as.numeric)

# Md$ChemoRad<-Md$Radiation * Md$Tmz        # Craft an interaction variable, if needed.
# Md$Condition<-as.factor(paste(Md$Col1, Md$Col2, sep='-') ) # Paste together factor variables to make a 'Condition' variable, if needed.

# For all factor variables in your metadata table, assign a reference level to avoid confusion regarding effect direction.
Md$Grade<-relevel(as.factor(Md$Grade), ref='2')
Md$Treatment<-relevel(as.factor(Md$Treatment), ref='Neither')
Md$IdhStatus<-relevel(as.factor(Md$IdhStatus), ref='Wt')
Md$Codel<-relevel(as.factor(Md$Codel), ref='Negative')
Md$Sex<-relevel(as.factor(Md$Sex), ref='Female')
# End of metadata processing ^.

# Now, process the Count Data as well ***** relevant for you ******
row.names(Cts)<-Cts$gene_name; Cts$gene_name<-NULL. # Like your data, we have metadata in what should be the count matrix. In my case, though, it is the gene names, not the patient IDs, and you have sex.
Cts$Sex<-NULL # You'd need to add something like this; I dont have it in mine. ***** relevant for you ******

# Finally, harmonize the metadata to the count data by trimming counts to the samples found in the metadata
Cts<-Cts[row.names(Md)]   # this will also order the columns of the Cts matrix in the same order as the rows of your Md. # possibly relevant for you.

# Port the harmonized counts and metadata into DESeq2:
if (all(row.names(Md) == colnames(Cts)) == TRUE ) { ddsInitial<-DESeqDataSetFromMatrix(countData = Cts, colData = Md, design=~1) }  # MUST return true, or cannot (and should not) proceed.

I gave you a bit of code you did not ask for. My hope is in time, you will understand why I use it if you return to this.

Good luck!

ADD COMMENT
0
Entering edit mode

I can see what my version of the "Md" variable would be, but I'm not seeing what I would use for the Cts (counts) variable. Beforehand, I combined all variables I need (TPM's of the genes, subject ID's and gender) into one variable (which is df in what I posted).

And just to be clear, I do need to use limma (tmk it looks like dds is a different process).

ADD REPLY

Login before adding your answer.

Traffic: 1421 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6