Question

Merge TPM for transcripts to genes Kallisto

8.4 years ago

frida.danielsson ▴ 50

Hi,

I have an output file from Kallisto with RNA transcripts and their corresponding TPMs. To enable comparison with previous results (mass spectrometry and FPKM values at the gene level), I would like to merge all transcripts that belong to the same gene and simply sum the TPMs for each gene. I have run biomaRt to generate a table with all the transcript IDs and corresponding gene IDs (Ensembl), and I now wonder what would be the fastest way to sum all TPMs that are linked to the same ENSG ID. Please help!

TPM FPKM Kallisto RNA-Seq software-error • 6.8k views

updated 20 months ago by Ram 43k • written 8.4 years ago by frida.danielsson ▴ 50

Thanks! No, I haven't considered that; I will check it out.

updated 4.4 years ago by Ram 43k • written 8.4 years ago by frida.danielsson ▴ 50

Do you know the quickest way to do this in R (I mean, which function to use)?

updated 4.4 years ago by Ram 43k • written 8.4 years ago by frida.danielsson ▴ 50

I've updated my answer with a simple R solution.

written 8.4 years ago by andrew.j.skelton73 6.5k

Ram · Answer 1 · 2015-12-07

Have you considered Salmon? It uses a similar methodology and can output both gene-level and transcript-level counts. As for your question, your approach seems right: if you sum the TPMs of Ensembl transcripts a, b, and c for their associated Ensembl gene x, that will work.

Edit: R example.

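# Toy annotation: one row per transcript, with the gene it belongs to
foo <- data.frame(gene=c(rep("A",3),
                         rep("B",2),
                         rep("C",1),
                         rep("D",4)),
                  transcript=c(paste0("A", 1:3),
                               paste0("B", 1:2),
                               paste0("C", 1),
                               paste0("D", 1:4)),
                  stringsAsFactors = FALSE)

# Toy expression table: one row per transcript, one column per sample
set.seed(1)  # make the random example reproducible
doo <- data.frame(SampleA = sample(1:100, 10),
                  SampleB = sample(1:100, 10),
                  SampleC = sample(1:100, 10))
rownames(doo) <- foo$transcript

# For each gene, select the rows of its transcripts and sum them per sample
out <- lapply(unique(foo$gene),
              function(x) {
                tmp       <- foo[foo$gene == x,]
                tmp_count <- doo[match(tmp$transcript,
                                       rownames(doo)),]
                tmp_out   <- colSums(tmp_count)
                return(tmp_out)
              })

# Stack the per-gene sums into a gene-by-sample matrix
gene_counts <- matrix(unlist(out),
                      ncol  = ncol(doo),
                      byrow = TRUE)
rownames(gene_counts) <- unique(foo$gene)
colnames(gene_counts) <- colnames(doo)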
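
For a large transcript table, the per-gene loop above can probably be replaced by a single call to base R's rowsum(), which sums the rows of a matrix that share a group label. The following is only a minimal sketch under stated assumptions: the transcript and gene IDs are made up, the matrix `tpm` stands in for TPM values collected from Kallisto output, and the column names of `tx2gene` are assumed to follow the usual biomaRt attribute names (`ensembl_transcript_id`, `ensembl_gene_id`).

```r
# Hypothetical TPM matrix: rows are transcripts, columns are samples.
tpm <- matrix(c(10, 5, 2.5, 0, 7,
                 1, 2, 3,   4, 5),
              ncol = 2,
              dimnames = list(paste0("ENST", 1:5),
                              c("SampleA", "SampleB")))

# Hypothetical transcript-to-gene table, shaped like a biomaRt export.
tx2gene <- data.frame(
  ensembl_transcript_id = paste0("ENST", 1:5),
  ensembl_gene_id       = c("ENSG_A", "ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C"),
  stringsAsFactors = FALSE
)

# Look up the gene of every transcript; transcripts missing from the
# table get NA and are dropped before summing.
gene <- tx2gene$ensembl_gene_id[match(rownames(tpm),
                                      tx2gene$ensembl_transcript_id)]
keep <- !is.na(gene)

# Sum the TPMs of all transcripts that map to the same gene, per sample.
gene_tpm <- rowsum(tpm[keep, , drop = FALSE], group = gene[keep])
gene_tpm
```

One thing to check before trusting the result on real data: depending on how the Kallisto index was built, the transcript IDs may carry a version suffix (for example ".2") that the biomaRt IDs lack, in which case match() returns NA for them and those transcripts are silently dropped. Comparing sum(keep) with nrow(tpm) is a quick sanity check.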