Hello Everyone,
I want to fit linear models in R on salmon output, using the TPM values as the dependent variable,
to check for a possible linear relationship with an independent variable.
I used the following to import the data:
dir <- "path of salmon output directory"
list.files(dir)
It lists the following salmon output files:
"593_transcripts"
"594_transcripts"
"595_transcripts"
"596_transcripts"
"597_transcripts"
"598_transcripts"
"599_transcripts"
"600_transcripts"
Then I loaded the file for the independent variable, which looks like this:
sample dose
593 0.067
594 0.563
595 4.789
596 10.12
597 12.67
598 0.783
599 5.465
600 6.234
Then to perform linear regression, I want to access TPM values from transcript quant.sf files to do:
model <- lm(TPM ~ dose)
I need help to perform the linear regression part. Thank you in advance!
Thank you! So I have TPM values for each sample, for each dose, located in different files. The differentially expressed genes are not the same for all doses, so I was thinking of running it separately for each dose. The TPM values in each file are located as:
Can you please help me make a matrix of TPM values for each dose with respect to the differentially expressed genes?
You need to explain more about your experimental design. What are you trying to achieve, and are all these doses related, like different doses of the same substance? If that is the case, have you tried finding DE genes using all the samples together, like we do in time-series expression analysis? Also, what kind of TPM matrix do you expect: one with DE genes from every condition, or one with DE genes that are common to all dose contrasts?
Thank you! My experiment comprises multiple varieties (15). There are different doses which are not related to each other and are not time related either (10 doses of different substances). I am running salmon for all 15 varieties together, which will give 15 quant.sf files containing TPM values for each variety. Then I want to check if there is a linear relationship between all the varieties and each dose. So, I would need a TPM matrix for all 15 varieties with respect to each dose. I think there will be 10 such TPM matrices, one for each dose. Then I want to run a linear model for, say, the all-variety TPM matrix with dose 1, and do the same thing for the rest of the doses. Please let me know if I missed something that you needed to know.
So you have 15 varieties and 10 different doses and want to create 10 matrices each representing TPM values across 15 varieties for every individual dose. TPM values for each variety are stored in quant.sf files. I haven't used salmon so I am not familiar with the output file structure. If you want to create a TPM matrix by joining columns from different files you can modify this bash command according to your needs:
You can run this in a for loop to get TPM matrices for each dose quickly.
Thank you! I will try it.
Thank you! I have done it using:
and got:
I think there is something wrong with p values.
Have you checked the input data? Please post the code you used to create the input.
Here it is using salmon:
This produced a transcript directory for each variety, each of which has a quant.sf file.
Can you please explain cut -f 1,4,9,14,19,24,29,34,39,44,49,54,59,64,69? Why did you use so many numbers, and what do they relate to? I read up on cut -f and it says we can get specific columns, but based on the code you shared, I am not quite sure I understood it. I used six files to test first, but I still used all the numbers you shared. Can that affect the analysis?
The numbers after cut -f are column numbers. paste *sf puts all files with the extension sf side by side as a single file, meaning columns 1-5 are from variety1_quant.sf, columns 6-10 are from variety2_quant.sf, and so on. Using cut, we can then select the columns containing TPM values, which are the 4th column (var 1), the 9th column (var 2), etc. I wrote those numbers for 15 files. For six files you should write up to the number 29.
Got it, thank you! Even after changing the numbers I still got the same p values.
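To make the paste and cut explanation above concrete, here is a minimal sketch on two tiny mock files. The scratch directory, the file names (593_quant.sf, 594_quant.sf) and all the values are made up for illustration; only the five-column layout (Name, Length, EffectiveLength, TPM, NumReads) follows the quant.sf format. Note that paste joins lines by position, not by transcript name, so this assumes every quant.sf lists the transcripts in the same order, which is worth checking on real data.

```shell
# Hypothetical scratch directory holding two mock quant.sf files.
workdir=$(mktemp -d)

# printf reuses the format string for every group of five arguments,
# so each call writes one header row plus two transcript rows.
printf '%s\t%s\t%s\t%s\t%s\n' \
  Name Length EffectiveLength TPM NumReads \
  tx1 1000 800.000 268.499447 50 \
  tx2 500 300.000 0.000000 0 > "$workdir/593_quant.sf"
printf '%s\t%s\t%s\t%s\t%s\n' \
  Name Length EffectiveLength TPM NumReads \
  tx1 1000 800.000 12.500000 7 \
  tx2 500 300.000 3.250000 2 > "$workdir/594_quant.sf"

# paste puts the files side by side: columns 1-5 come from the first file,
# columns 6-10 from the second. cut keeps Name (column 1) and the two
# TPM columns (4 and 9).
paste "$workdir"/*_quant.sf | cut -f 1,4,9 > "$workdir/tpm_matrix.tsv"
cat "$workdir/tpm_matrix.tsv"
```

The result has one Name column followed by one TPM column per file, in the order the shell expands the glob, so the column order should be matched against the sample order in the dose table before modelling.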
You will have to post a reproducible example with all the R code you used. Can you also check the TPM values in the quant.sf files against the melted TPM matrix? You can also do it like this:
The output should be all TRUE and no FALSE.
Thank you! The above code gives
So there is a problem in creating the matrix. Can you just create a TPM matrix in Excel, read it into R, melt it into long format and do the lm?
Thank you! I have tried it in Excel, but the TPM value format changes automatically when I copy it to Excel, e.g., 268.499447 changes to 268.499.447.
I tried to change the decimal places but it is not changing.
Try "paste as values" instead of just paste. If this doesn't work, do this before pasting: ctrl+a > right click > format cells > numbers > ok
Thank you! The quant.sf files are opened in a text editor like Xcode/BBEdit. When I copy them from the editor to Excel, this happens. I tried different ways but it is still not working.
Okay, I got the data done and now the code gives
But the p values are still the same as before.
Please post the model summary.
@ashish, I tried using the paste command on more files but it does not give the expected output. Can you suggest why? It puts the Name column from multiple *sf files into multiple columns of the output file, which is not what I need.
Maybe you are writing the wrong column numbers. Name is the first column and TPM is the 4th column of each file, so there has to be a difference of 5 between consecutive column numbers, like 4,9,14,19 etc.
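Since the TPM columns are always 5 apart, the list can be generated rather than typed by hand, which avoids off-by-one mistakes when the number of files changes. This is only a sketch: the variable names are made up, and it assumes every pasted file has exactly five columns.

```shell
# n is the number of quant.sf files being pasted (6, as in the test run).
n=6

# TPM is column 4 of each 5-column block, so the last TPM column is 5*n - 1.
last=$((5 * n - 1))

# seq prints 4, 9, 14, ... one per line; paste -sd, joins them with commas.
tpm_cols=$(seq 4 5 "$last" | paste -sd, -)

# Prepend column 1 to keep the Name column from the first file.
cols="1,$tpm_cols"
echo "$cols"   # prints 1,4,9,14,19,24,29

# Usage against real files (commented out because it needs *sf files present):
# paste *sf | cut -f "$cols" > tpm_matrix.tsv
```

With n=15 the same lines give the list ending in 69 that was used for the full set of files.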
Thank you! I made a matrix (df) of TPM values for all varieties in Linux. Can you help me run a linear regression for all genes across all varieties against dose?
Just melt the matrix into long format and do lm using tpm ~ dose + variety or tpm ~ dose + variety + dose:variety (the latter will also check for interaction effects).
Thank you, I do not want to add variety as a factor. Just a linear regression for each gene across all varieties.