TPM from StringTie
4.2 years ago
gozrom ▴ 80

I have extracted all the TPM values from the GTF files generated by StringTie for all replicates; however, those TPM values are per transcript, not per gene.

Now I have one huge csv file with 12 replicates and their corresponding TPM values, and I want to get per-gene TPM values to use in a subsequent visualization.

File looks like this:

 X1    TPM transcript_id ref_gene_name  TPM.1 transcript_id.1 ref_gene_name.1  TPM.2 transcript_id.2
<int>  <dbl> <chr>         <chr>          <dbl> <chr>           <chr>            <dbl> <chr>
1     1  2.60  MSTRG.1.1     <NA>           3.78  MSTRG.1.1       <NA>             4.22  MSTRG.1.1
2     2 NA     MSTRG.1.1     <NA>          NA     MSTRG.1.1       <NA>            NA     MSTRG.1.1
3     3  2.01  MSTRG.2.1     <NA>           1.17  MSTRG.2.1       <NA>             1.48  MSTRG.2.1
4     4 NA     MSTRG.2.1     <NA>          NA     MSTRG.2.1       <NA>            NA     MSTRG.2.1
5     5  0.402 ENSMUST00000~ Gm10568        0.316 ENSMUST0000019~ Gm10568          0.183 ENSMUST0000019~
6     6 NA     ENSMUST00000~ Gm10568       NA     ENSMUST0000019~ Gm10568         NA     ENSMUST0000019~
7     7  0.253 ENSMUST00000~ Gm7357         0.    ENSMUST0000020~ Rp1              2.66  ENSMUST0000018~
8     8 NA     ENSMUST00000~ Gm7357        NA     ENSMUST0000020~ Rp1             NA     ENSMUST0000018~
9     9 NA     ENSMUST00000~ Gm7357        NA     ENSMUST0000020~ Rp1              0.    ENSMUST0000019~
10    10  0.182 ENSMUST00000~ Gm6119        NA     ENSMUST0000020~ Rp1             NA     ENSMUST0000019~
... with 1,135,291 more rows,


I'm not sure how to do that, or whether it's possible at all...

I guess it could be a for loop that runs over each ref_gene_name and sums all the values from the TPM column before it, but I need it to run over all the ref_gene_name columns, create the corresponding columns in a new data frame, and then export the new data frame to a csv file. The code below is just to illustrate the idea; it isn't meant to be correct code.

df <- as.data.frame.matrix(df)
i = 2
for i to i = file$ref_gene_name$end
{
  if ref_gene_name$i == ref_gene_name$(i+1)
    df$gene$i <- file$ref_gene_name$i
    df$condition1.TPM <- file$TPM$i + file$TPM$(i+1)
  i+1
  if df$gene$i == file$TPM$(i+1)
    df$condition1.TPM <- df$condition1.TPM + file$TPM$(i+1)
    df$gene$i <- file$ref_gene_name$i
}

Any help is appreciated, thank you.

It is hard for me to fully understand the data format and what you tried, but I can give some generic advice. Take a look at the R function aggregate. If you have a simple structure with all the transcripts, genes and TPM values in a single data frame and want to sum the TPM of the same gene, just try something like this:

aggregate(df$TPM, by = list(df$gene_name), FUN = sum)

Thanks, that seems simpler than what I wrote. I tried aggregate but got an error:

Error in match.fun(FUN) : 'length(genes_list$ref_gene_name)' is not a function, character or symbol

If I run length(genes_list$ref_gene_name) on its own, it gives me the length of that column, but when I do it through aggregate

TEST <- aggregate(genes_list$TPM, by = list(genes_list$ref_gene_name), FUN = length(genes_list$ref_gene_name))


I get an error.
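The error message suggests that FUN received the result of a call (a number) rather than a function. A minimal sketch of the difference, using a made-up three-row data frame named genes_list as in the post above (the values are hypothetical, not taken from the real file):

```r
# Toy data with hypothetical values: two transcripts for geneA, one for geneB.
genes_list <- data.frame(
  ref_gene_name = c("geneA", "geneA", "geneB"),
  TPM           = c(1.5, 2.5, 4.0),
  stringsAsFactors = FALSE
)

# FUN must be a function object such as sum. Writing
# FUN = length(genes_list$ref_gene_name) passes the number 3 instead,
# which match.fun() rejects.
per_gene <- aggregate(genes_list$TPM,
                      by  = list(ref_gene_name = genes_list$ref_gene_name),
                      FUN = sum)

# aggregate() names the summarised column "x" when given a plain vector.
names(per_gene)[2] <- "TPM"
print(per_gene)
```

Naming the element of the by list keeps the gene names in a readable column next to the summed TPM.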


I figured out the error: when I set the FUN argument to an actual function it works, but it only aggregates the gene names without showing the TPM values...

I need both: the sum of all the TPM values from all the transcripts of each gene, and also the gene list.
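One possible way to get both the gene names and the summed TPM for every replicate, sketched in base R. It assumes the wide layout shown in the question, where each TPM column (TPM, TPM.1, ...) has a partner column ref_gene_name, ref_gene_name.1, and so on. The data frame wide, its values and the output file name are made up for illustration.

```r
# Toy stand-in for the wide table in the question (values are made up).
wide <- data.frame(
  TPM             = c(2.0, NA, 0.5, 1.0),
  ref_gene_name   = c("Gm10568", "Gm10568", "Gm7357", "Gm7357"),
  TPM.1           = c(3.0, NA, 0.25, NA),
  ref_gene_name.1 = c("Gm10568", "Gm10568", "Rp1", "Rp1"),
  stringsAsFactors = FALSE
)

# Find the TPM columns (TPM, TPM.1, ...).
tpm_cols <- grep("^TPM(\\.[0-9]+)?$", names(wide), value = TRUE)

per_replicate <- lapply(tpm_cols, function(tpm_col) {
  # Derive the partner gene-name column, e.g. TPM.1 -> ref_gene_name.1.
  gene_col <- sub("^TPM", "ref_gene_name", tpm_col)
  # Sum TPM per gene, ignoring NA TPM entries.
  out <- aggregate(wide[[tpm_col]],
                   by  = list(gene = wide[[gene_col]]),
                   FUN = sum, na.rm = TRUE)
  names(out)[2] <- tpm_col
  out
})

# Outer-join the per-replicate tables on gene name.
gene_tpm <- Reduce(function(a, b) merge(a, b, by = "gene", all = TRUE),
                   per_replicate)
print(gene_tpm)

# Hypothetical output file name:
# write.csv(gene_tpm, "gene_tpm.csv", row.names = FALSE)
```

Note that aggregate drops rows whose grouping value is NA, so transcripts without a ref_gene_name (the MSTRG entries in the question) would not appear in the result unless they are given a gene label first.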


Can you please tell me how you filtered out the TPM values from the StringTie output?


Hi, I would also be interested in the same thing (but at the transcript level). Is there a convenient way to extract the TPM values for all transcripts in all samples to feed into a Ballgown DE analysis? Thank you very much.
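A rough sketch of one way to pull transcript-level TPM out of a StringTie GTF in base R. It assumes the TPM is stored as a TPM "..." attribute in the ninth column of the transcript lines. The two mock lines and the helper name extract_tpm are made up for illustration, and this is not necessarily how the original poster did it.

```r
# Two mock lines in the assumed StringTie GTF layout (values are made up).
gtf_lines <- c(
  'chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; cov "5.1"; FPKM "1.9"; TPM "2.60";',
  'chr1\tStringTie\texon\t100\t900\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1"; cov "5.1";'
)
gtf_file <- tempfile(fileext = ".gtf")
writeLines(gtf_lines, gtf_file)

# Hypothetical helper: returns one row per transcript with its TPM.
extract_tpm <- function(path) {
  # quote = "" keeps the double quotes inside the attribute column intact;
  # comment.char = "#" skips header lines starting with '#'.
  gtf <- read.delim(path, header = FALSE, comment.char = "#",
                    quote = "", stringsAsFactors = FALSE)
  # Only transcript lines are expected to carry a TPM attribute.
  tx <- gtf[gtf$V3 == "transcript", ]
  attrs <- tx$V9
  get_attr <- function(key) {
    pattern <- paste0(".*", key, " \"([^\"]*)\".*")
    ifelse(grepl(pattern, attrs), sub(pattern, "\\1", attrs), NA)
  }
  data.frame(transcript_id = get_attr("transcript_id"),
             TPM           = as.numeric(get_attr("TPM")),
             stringsAsFactors = FALSE)
}

tpm <- extract_tpm(gtf_file)
print(tpm)
```

For several samples, the same helper could be applied to each file with lapply and the results merged on transcript_id.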


I think you can use the -A flag when you run StringTie; it writes gene-level abundances to a separate tab-delimited file.
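A small sketch of reading such a gene abundance file in R. The header used in the mock file below (Gene ID, Gene Name, Reference, Strand, Start, End, Coverage, FPKM, TPM) is my assumption about what StringTie writes, so check it against your own output. The gene IDs and values are made up.

```r
# Mock gene abundance file; the header layout is an assumption.
abund_lines <- c(
  "Gene ID\tGene Name\tReference\tStrand\tStart\tEnd\tCoverage\tFPKM\tTPM",
  "MSTRG.1\t-\tchr1\t+\t100\t900\t5.1\t1.9\t2.60",
  "GENE0001\tGeneX\tchr3\t-\t1000\t5000\t30.2\t10.5\t14.2"
)
abund_file <- tempfile(fileext = ".tab")
writeLines(abund_lines, abund_file)

# check.names = FALSE keeps column names with spaces, such as "Gene ID".
abund <- read.delim(abund_file, check.names = FALSE,
                    stringsAsFactors = FALSE)

# Keep only the gene identifiers and the gene-level TPM.
gene_tpm <- abund[, c("Gene ID", "Gene Name", "TPM")]
print(gene_tpm)
```

With one such file per replicate, the per-sample tables could be merged on "Gene ID" to build a gene-by-sample TPM matrix without summing transcripts by hand.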