Question

How to use FPKM on DESeq processed data?

0

Entering edit mode

2.4 years ago

Leo • 0

Hello,

I am trying to redo what a paper did. For this I must perform FPKM on DESeq data.

I have tried:

dds <- DESeqDataSetFromMatrix(mrna_df, colData = mrna_meta, design = ~ Condition)
fpkmres <- fpkm(dds)

ddsDESeq <- DESeq(dds)
fpkmres <- fpkm(dds)

dds_SizeFactors <- estimateSizeFactors(dds)
fpkmres <- fpkm(dds_SizeFactors)

But all of the three above options give me the following warning:

Error in fpkm(dds) : rowRanges(object) has all ranges of zero width.
the user should instead supply a column, mcols(object)$basepairs,
which will be used to produce FPKM values

I was wondering if someone has an example or a template where DESeq2 is used together with FPKM.

Thanks in advance

FPKM R deseq2 • 2.5k views

ADD COMMENT • link 2.3 years ago by Leo • 0

0

Entering edit mode

Please edit your post and add a link to the paper you're trying to replicate.

ADD REPLY • link 2.4 years ago by Ram 43k

score 0 · Answer 1 · 2022-01-04

0

Entering edit mode

2.4 years ago

ATpoint 82k

The problem here is that the DESeqDataSet has no information on how long the genes are, so it cannot calculate FPKM. Either use fpm() or get the gene lengths from databases and put it into the DESeqDataSet. The length info should actually provided by the preprocessing pipeline. How did you get the counts?

ADD COMMENT • link 2.4 years ago by ATpoint 82k

0

Entering edit mode

it's not the complete preprocessing, but I think this should provide enough information. How can I change this script to include the gene lengths?

mrna_query <- GDCquery(project = "TCGA-KIRC",
                       data.category = "Transcriptome Profiling",
                       data.type = "Gene Expression Quantification",
                       workflow.type = "HTSeq - Counts",
                       experimental.strategy = "RNA-Seq")

GDCdownload(mrna_query, method = "api", files.per.chunk = 100,
            directory = "~/Desktop/TCGA/mRNA/")

mrna_df <- GDCprepare(mrna_query, directory = "~/Desktop/TCGA/mRNA/")

mrna_df <- assay(mrna_df)

mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl") #,host="useast.ensembl.org")

mrna_attributes <- getBM(attributes=c("external_gene_name",
                                      "ensembl_gene_id",
                                      "gene_biotype"),
                         filters = c("ensembl_gene_id"),
                         values = rownames(mrna_df),
                         mart = mart)

mrna_df <- mrna_df[which(rownames(mrna_df) %in% mrna_attributes$ensembl_gene_id),]
mrna_df$Gene_id <- mrna_attributes$external_gene_name

ADD REPLY • link 2.4 years ago by Leo • 0

0

Entering edit mode

I would probably add transcript_start and transcript_end to the query, then calculate the length for each transcript based on that, and then average (median is probably best) the obtained lengths for all the transcripts per gene (because most genes have > 1 transcript). I guess that is sufficient for your purpose. Then feed this into DESeq2. Or calcualte fpm() and then divide each gene (so the output of fpm) by the calculated length in kb (so basepair length * 0.001), as after all FPKM is just FPM / length, see https://github.com/mikelove/DESeq2/blob/master/R/helper.R#L279

ADD REPLY • link 2.4 years ago by ATpoint 82k

0

Entering edit mode

hi! thanks for your help. I have been trying to do this and I think it worked! However, the FPKM and the FPM results aren't exactly the same.. Is this correct? Can I neglect the difference or should it be exactly the same. There is always some difference in the range of 1e4.

What I did is this:

mrna_attributes <- getBM(attributes=c("external_gene_name",
                                      "ensembl_gene_id",
                                      "gene_biotype","chromosome_name",
                        "transcript_start", "transcript_end"),
                         filters = c("ensembl_gene_id"),
                         values = rownames(mrna_df),
                         mart = mart)
mrna_attributes$length <- mrna_attributes$transcript_end - mrna_attributes$transcript_start
mrna_attributes$medlength <- NA

for (gene in unique(mrna_attributes$external_gene_name)) {
  mrna_attributes$medlength[which(mrna_attributes$external_gene_name == gene)] <- median(mrna_attributes$length[which(mrna_attributes$external_gene_name == gene)])
}

uniqueattributes <- mrna_attributes[!duplicated(mrna_attributes$external_gene_name),]
mrna_df <- mrna_df[which(rownames(mrna_df) %in% uniqueattributes$ensembl_gene_id),]


dds <- DESeqDataSetFromMatrix(mrna_df, colData = mrna_meta, design = ~ Condition)


df <- data.frame(chr=uniqueattributes$chromosome_name, start=uniqueattributes$transcript_start, end=uniqueattributes$transcript_end,gene_symbols = uniqueattributes$external_gene_name)
test<-makeGRangesListFromDataFrame(df, names.field='gene_symbols')
rowRanges(dds)<- GRangesList(test)

uniqueattributes$medlength2 <- uniqueattributes$transcript_end - uniqueattributes$transcript_start

fpkmm <-fpkm(dds)
fpmm <- fpm(dds)
nom <- fpmm/(uniqueattributes$medlength2*0.001)

can you see where it went wrong maybe?

ADD REPLY • link 2.3 years ago by Leo • 0