salmon output to linear models in R
1
0
Entering edit mode
4.6 years ago
evelyn ▴ 230

Hello Everyone,

I want to run linear models in R using salmon output using the TPM values (dependent variable) to check for a possible regression relationship with another independent variable.

I used to import:

dir <- "path of salmon output directory"
list.files(dir)

It lists following salmon output files:

"593_transcripts"
"594_transcripts"
"595_transcripts"
"596_transcripts"
"597_transcripts"
"598_transcripts"
"599_transcripts"
"600_transcripts"

Then I loaded the file for independent variable which looks like:

sample dose
593   0.067
594   0.563
595   4.789
596  10.12
597   12.67
598    0.783
599    5.465
600    6.234

Then to perform linear regression, I want to access TPM values from transcript quant.sf files to do:

model <- lm(TPM ~ dose)

I need help to perform the linear regression part. Thank you in advance!

R • 3.0k views
ADD COMMENT
2
Entering edit mode
4.6 years ago
ashish ▴ 680

Starting with a TPM expression matrix, if you convert the data into long format (tidy) then you can use lm the way you have mentioned. Here is an example with dummy data:

 # create TPM matrix with 100 genes and 5 samples
df <- data.frame(names = paste("gene", 1:100, sep = "_"), 
                               matrix(rnorm(500, 100, 3),
                                      nrow = 100,ncol = 5))
colnames(df) <- c("genes", paste("dose", 0:4, sep = "_"))
head(df)
   genes    dose_0    dose_1    dose_2    dose_3    dose_4
1 gene_1 101.20585 100.89322  99.97335  99.00884  97.33674
2 gene_2  98.53161  96.90215 102.41959 104.73646  97.76265
3 gene_3  96.44946 102.37975  95.46167 104.92316 106.92175
4 gene_4  93.86651  96.95756 102.77468 100.03604 102.40414
5 gene_5  98.92690 100.13105 101.63389  99.56012  98.30926
6 gene_6 101.38702  99.75266 103.63846 100.90292  99.08118

# convert data into long format
library(reshape2)
df1 <- melt(df, id.vars = "genes")
colnames(df1) <- c("genes", "dose", "tpm")
head(df1)
   genes   dose       tpm
1 gene_1 dose_0 101.20585
2 gene_2 dose_0  98.53161
3 gene_3 dose_0  96.44946
4 gene_4 dose_0  93.86651
5 gene_5 dose_0  98.92690
6 gene_6 dose_0 101.38702

# create lm, dose_0 will be assumed as reference
model <- lm(tpm ~ dose, data = df1)
summary(model)

Call:
lm(formula = tpm ~ dose, data = df1)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5597 -2.0167  0.2412  1.9823 10.3174 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 100.25062    0.29414 340.821   <2e-16 ***
dosedose_1   -0.43665    0.41598  -1.050    0.294    
dosedose_2    0.09950    0.41598   0.239    0.811    
dosedose_3    0.23839    0.41598   0.573    0.567    
dosedose_4    0.07475    0.41598   0.180    0.857    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.941 on 495 degrees of freedom
Multiple R-squared:  0.0061,    Adjusted R-squared:  -0.001931 
F-statistic: 0.7595 on 4 and 495 DF,  p-value: 0.552
ADD COMMENT
0
Entering edit mode

Thank you! So I have TPM values for each sample for dose located in different files. Differentially expressed "Genes" are not same for all doses. I was thinking to run it separately for each dose. The TPM values in each file are located as:

Name    Length  EffectiveLength TPM NumReads

Can you please help me to make a matrix for TPM values for each dose wit respect to the differentially expressed genes.

ADD REPLY
0
Entering edit mode

You need to explain more about your experimental design. What are trying to achieve and are all these doses related like different doses of the same substance? If that is the case have you tried finding DE genes using all the samples together like we do in time-series expression analysis. Also what kind of tpm matrix do you expect - one with DE genes from every condition or one with DE genes which are common in all dose contrasts.

ADD REPLY
0
Entering edit mode

Thank you! My experiment comprises of multiple varieties (15). The are different doses which are not related to each other and are not time related either (10 doses of different substances). I am running salmon for all 15 varieties together which will give 15 quant.sf files containing TPM values for each variety. Then I want to check if ether is a linear relationship between all the varieties and each dose. So, I would need a TPM matrix for all 15 varieties with respect to each does. I think there will be 10 such TPM matrices for each of different dose. Then I want to run linear model say for all variety TPM matrix with dose 1. and do the same thing for rest of the doses. Please let me know if I missed something that you needed to know.

ADD REPLY
1
Entering edit mode

So you have 15 varieties and 10 different doses and want to create 10 matrices each representing TPM values across 15 varieties for every individual dose. TPM values for each variety are stored in quant.sf files. I haven't used salmon so I am not familiar with the output file structure. If you want to create a TPM matrix by joining columns from different files you can modify this bash command according to your needs:

# assuming all 15 quant.sf files are in the same directory and are tab delimited
# paste all the files side by side, select the required columns and then remove the header
paste *sf | cut -f 1,4,9,14,19,24,29,34,39,44,49,54,59,64,69 | grep -v "TPM"  > dose_1_tpm.tab

# now load this in R and specify column names as different varieties
df <- read.table("dose_1_tpm.tab", header = FALSE, row.names = 1)
colnames(df) <- paste("variety", 1:15,  sep = "_")

You can run this in a for loop to get tpm matrices for each dose quickly

ADD REPLY
0
Entering edit mode

Thank you! I will try it.

ADD REPLY
0
Entering edit mode

Thank you! I have done it using:

paste *sf | cut -f 1,4,9,14,19,24,29,34,39,44,49,54,59,64,69 | grep -v "TPM"  > dose_1_tpm.tab
df <- read.delim("dose_1_tpm.tab", header=FALSE)
colnames(df) <- c("gene", "variety1", "variety2", "variety3", "variety4", "variety5", "variety6")
head(df)
library(reshape2)
df1 <- melt(df, id.vars = "gene")
colnames(df1) <- c("gene", "variety", "tpm")
model <- lm(tpm ~ variety, data = df1)
summary(model)

and got:

> summary(model)

Call:
lm(formula = tpm ~ variety, data = df1)

Residuals:
   Min     1Q Median     3Q    Max 
   -29    -29    -29    -17 188694 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            2.943e+01  3.188e+00   9.233   <2e-16 ***
variety1               -3.062e-09  4.508e+00   0.000        1    
variety2               -2.737e-09  4.508e+00   0.000        1    
variety3                4.412e-10  4.508e+00   0.000        1    
variety4               -3.149e-09  4.508e+00   0.000        1    
variety5               -1.765e-09  4.508e+00   0.000        1    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 587.6 on 203850 degrees of freedom
Multiple R-squared:  6.034e-24, Adjusted R-squared:  -2.453e-05 
F-statistic: 2.46e-19 on 5 and 203850 DF,  p-value: 1

I think there is something wrong with p values.

ADD REPLY
0
Entering edit mode

Have you checked the input data, pleaste post the code you used to create the input.

ADD REPLY
0
Entering edit mode

Here it is using salmon:

salmon quant -i index -l A -1 R1_paired.fastq -2 R2_paired.fastq -o transcript

Produced transcript directories for each variety, each of which has a quant.sf file.

Can you please explain cut -f 1,4,9,14,19,24,29,34,39,44,49,54,59,64,69. Why you used so many numbers and what are they relating to? I tried finding about cut -f and it says we can get specific columns. Based on the code you shared, I am not quite sure if I understood it. I used six file to test first but I still used all the numbers you shared. Can that affect the analysis?

ADD REPLY
0
Entering edit mode

Numbers in front of cut are column numbers. paste *sf put all files with extension sf side by side as single file, meaning column1-5 are of variety1_quant.sf then column 6-10 are variety2_quant.sf and so on. Now using cut we can select the columns containg tpm values which are 4th column (var 1), 9th column (var 2) etc. I wrote those numbers for 15 files. For six files you should write upto the number 29.

ADD REPLY
0
Entering edit mode

Got it, thank you! Even after changing the numbers I still got the same p values.

ADD REPLY
0
Entering edit mode

You will have to post a reproducible example with all the R code you used. Can you also check tpm values in the the quant.sf files and the melted tpm matrix . You can also do it like this: The output should be all true and zero false.

# quant_1 is quant.sf file of variety 1, df1 is the melted tpm matrix
table(quant_1$TPM == df1$tpm[df1$variety == "variety_1"])
ADD REPLY
0
Entering edit mode

Thank you! The above code gives

FALSE  TRUE 
28340  6365
ADD REPLY
0
Entering edit mode

So there is problem in creating the matrix. Can you just create a tpm matrix in excel, read and melt it into long format in R and do the lm.

ADD REPLY
0
Entering edit mode

Thank you! I have tried it in excel but the TPM value format automatically changes when I copy it to excel, e.g., 268.499447 changes to 268.499.447

I tried to change the decimal places but it is not changing.

ADD REPLY
0
Entering edit mode

Try paste as values instead of just paste. if this doesn't work do this before pasting ctrl+a > right click > format cells > numbers > ok

ADD REPLY
0
Entering edit mode

Thank you! quant.sf files are opened in text editor like Xcode/bbedit. when I copy these to excel from the editor then this happens. I tried different ways but still not working.

ADD REPLY
0
Entering edit mode

Okay, I got the data done and now the code gives

 TRUE 
33976

But the p values are still the same as before.

ADD REPLY
0
Entering edit mode

Please post the model summary.

ADD REPLY
0
Entering edit mode

@ashish, I tried using paste command on more files but it does not give the expected output. Can you suggest why it is so? It is giving Name column from multiple *sf files in the output file in multiple columns which is not what I need.

ADD REPLY
0
Entering edit mode

Maybe you are writing the wrong column numbers, name is the first column and TPM is 4th column so there has to be a difference of 5 between two column numbers like 4,9,14,19 etc.

ADD REPLY
0
Entering edit mode

Thank you! I made matrix df using TPM values for all varieties in linux. Can you help to run linear regression for all genes across all varieties for dose?

ADD REPLY
0
Entering edit mode

Just melt the matrix in long format and do lm using tpm ~ dose + variety or tpm ~ dose + variety + dose:variety (will also check for interaction effects).

ADD REPLY
0
Entering edit mode

Thank you, I do not want to add variety as a factor. Just a linear regression for each gene across all varieties.

ADD REPLY

Login before adding your answer.

Traffic: 814 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6