Question: TPM from StringTie
11 months ago, gozrom wrote:

I have extracted the TPM values for all replicates from the GTF files generated by StringTie. However, those TPM values are per transcript, not per gene.

Now I have one huge csv file with 12 replicates and their corresponding TPM values, and I want to compute TPM values per gene to use in a subsequent visualization.

File looks like this:

   X1    TPM   transcript_id ref_gene_name TPM.1 transcript_id.1 ref_gene_name.1 TPM.2 transcript_id.2
   <int> <dbl> <chr>         <chr>         <dbl> <chr>           <chr>           <dbl> <chr>
1  1     2.60  MSTRG.1.1     <NA>          3.78  MSTRG.1.1       <NA>            4.22  MSTRG.1.1
2  2     NA    MSTRG.1.1     <NA>          NA    MSTRG.1.1       <NA>            NA    MSTRG.1.1
3  3     2.01  MSTRG.2.1     <NA>          1.17  MSTRG.2.1       <NA>            1.48  MSTRG.2.1
4  4     NA    MSTRG.2.1     <NA>          NA    MSTRG.2.1       <NA>            NA    MSTRG.2.1
5  5     0.402 ENSMUST00000~ Gm10568       0.316 ENSMUST0000019~ Gm10568         0.183 ENSMUST0000019~
6  6     NA    ENSMUST00000~ Gm10568       NA    ENSMUST0000019~ Gm10568         NA    ENSMUST0000019~
7  7     0.253 ENSMUST00000~ Gm7357        0.    ENSMUST0000020~ Rp1             2.66  ENSMUST0000018~
8  8     NA    ENSMUST00000~ Gm7357        NA    ENSMUST0000020~ Rp1             NA    ENSMUST0000018~
9  9     NA    ENSMUST00000~ Gm7357        NA    ENSMUST0000020~ Rp1             0.    ENSMUST0000019~
10 10    0.182 ENSMUST00000~ Gm6119        NA    ENSMUST0000020~ Rp1             NA    ENSMUST0000019~
... with 1,135,291 more rows

I am not sure how to do that, or whether it is possible at all.

I guess it could be a for loop that goes through each ref_gene_name and sums up all the TPM values from the TPM column before it. However, I need it to run on all the ref_gene_name columns, create the appropriate columns in a new data frame, and then export the new data frame to a csv file. The code below only illustrates the idea; it is not meant to be correct or complete.

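# Sketch of the idea for the first replicate only (columns TPM and ref_gene_name).
# `file` is the data frame read from the csv; rows without a gene name or TPM are skipped.
rows <- file[!is.na(file$ref_gene_name) & !is.na(file$TPM), ]
df <- data.frame(gene = character(0), condition1.TPM = numeric(0))
for (i in seq_len(nrow(rows))) {
  n <- nrow(df)
  if (n > 0 && df$gene[n] == rows$ref_gene_name[i]) {
    # same gene as the previous row: add this transcript's TPM to the running sum
    df$condition1.TPM[n] <- df$condition1.TPM[n] + rows$TPM[i]
  } else {
    # a new gene starts here: append a new row for it
    df <- rbind(df, data.frame(gene = rows$ref_gene_name[i], condition1.TPM = rows$TPM[i]))
  }
}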

Any help is appreciated, thank you.


It is hard for me to fully understand the data format and what you tried, but I can give some generic advice. Have a look at the R function aggregate. If you have a simple structure with all the transcripts, genes and TPM values in a single data frame and want to sum the TPM of each gene, just try something like this:

aggregate(df$TPM, by = list(df$gene_name), FUN = sum)
written 11 months ago by Fabio Marroni
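
A minimal sketch of this suggestion on made-up data may help show what comes back. The data frame, gene names and values below are hypothetical; only aggregate and the column names TPM and gene_name come from the post. Naming the grouping element in by gives the gene column a readable name, and na.rm = TRUE is passed on to sum so that NA rows like those in the question do not turn a gene's total into NA.

```r
# Hypothetical toy data: three transcripts of GeneA (one with missing TPM), one of GeneB
df <- data.frame(
  gene_name = c("GeneA", "GeneA", "GeneB", "GeneA"),
  TPM       = c(1.5, 2.5, 3.0, NA)
)

# Sum TPM within each gene; extra arguments (na.rm) are forwarded to FUN
per_gene <- aggregate(df$TPM,
                      by = list(gene_name = df$gene_name),
                      FUN = sum, na.rm = TRUE)

# The summed column is called "x" by default; rename it for clarity
names(per_gene)[2] <- "TPM"
print(per_gene)
```

The result is a data frame with one row per gene: the first column holds the gene names and the second the summed TPM.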

Thanks, that seems simpler than what I wrote.

I tried aggregate but got an error:

Error in match.fun(FUN) : 'length(genes_list$ref_gene_name)' is not a function, character or symbol

If I run length(genes_list$ref_gene_name) on its own, it gives me the length of that column,

but when I do it through aggregate

TEST <- aggregate(gene_list$TPM,by=list(gene_list$ref_gene_name), FUN = length(gene_list$ref_gene_name))

I get an error.

written 11 months ago by gozrom
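
The error message suggests that FUN received the result of calling length(...), which is a number, rather than a function. The sketch below reproduces this on a small made-up gene_list (the values are hypothetical) and shows the difference between passing a call and passing the function itself.

```r
# Hypothetical stand-in for gene_list
gene_list <- data.frame(
  ref_gene_name = c("Gm10568", "Gm10568", "Gm7357"),
  TPM           = c(0.4, 0.3, 0.25)
)

# Passing a call: length(...) is evaluated first, so FUN is the number 3
# and match.fun() rejects it. tryCatch keeps the error message instead of stopping.
err <- tryCatch(
  aggregate(gene_list$TPM, by = list(gene = gene_list$ref_gene_name),
            FUN = length(gene_list$ref_gene_name)),
  error = function(e) conditionMessage(e)
)

# Passing the function itself: aggregate applies it to the TPM values of each gene
n_transcripts <- aggregate(gene_list$TPM, by = list(gene = gene_list$ref_gene_name),
                           FUN = length)  # number of transcript rows per gene
tpm_sum       <- aggregate(gene_list$TPM, by = list(gene = gene_list$ref_gene_name),
                           FUN = sum)     # summed TPM per gene
```

Here FUN = length counts rows per gene, while FUN = sum gives the per-gene TPM total, which is what the question asks for.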

I figured out the error.

When I replace the FUN argument with an actual function it works, but it only aggregates the gene names without showing the TPM values.

I need both: the sum of the TPM values of all the transcripts belonging to each gene, and the list of genes.

written 11 months ago by gozrom
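
One possible way to get both the gene list and the summed TPM for every replicate is sketched below, assuming the wide layout shown in the question, where replicate k has its own TPM.k and ref_gene_name.k columns. The toy values, the object names wide, per_rep and gene_tpm, and the output file name are made up for illustration. Because the gene in a given row can differ between replicates (for example Gm7357 and Rp1 in row 7), each replicate is aggregated against its own gene column and the per-replicate results are then merged by gene.

```r
# Toy stand-in for the wide csv (values are made up; column names follow the post)
wide <- data.frame(
  TPM             = c(0.4, NA, 0.25, 0.1),
  ref_gene_name   = c("Gm10568", "Gm10568", "Gm7357", "Gm7357"),
  TPM.1           = c(0.3, NA, 0.5, NA),
  ref_gene_name.1 = c("Gm10568", "Gm10568", "Rp1", "Rp1")
)

# Column suffixes, one per replicate; extend to ".2", ..., ".11" for 12 replicates
suffixes <- c("", ".1")

per_rep <- lapply(suffixes, function(s) {
  rep_df <- data.frame(
    gene = wide[[paste0("ref_gene_name", s)]],
    TPM  = wide[[paste0("TPM", s)]]
  )
  # Drop rows without a gene name or without a TPM value
  rep_df <- rep_df[!is.na(rep_df$gene) & !is.na(rep_df$TPM), ]
  # Sum TPM over all transcripts of each gene in this replicate
  out <- aggregate(TPM ~ gene, data = rep_df, FUN = sum)
  names(out)[2] <- paste0("TPM", s)
  out
})

# Full outer join on gene: a gene missing from a replicate gets NA there
gene_tpm <- Reduce(function(a, b) merge(a, b, by = "gene", all = TRUE), per_rep)

# write.csv(gene_tpm, "gene_tpm.csv", row.names = FALSE)  # hypothetical output file
```

Note that this sketch drops transcripts whose ref_gene_name is NA (such as the MSTRG entries in the sample), so they would need a separate identifier if they should be kept.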