Question: Get FPKM across replicates without doing a differential expression
crouch.k wrote (5 months ago):


I am trying to perform a within-sample analysis of expression data using FPKM - essentially I want to be able to rank my genes. I am not doing differential expression.

I have 3 biological replicates of my sample. What I would like to be able to do is generate an FPKM for each gene that is representative of the three replicates and rank based on that.

I have three approaches:

  1. Run Cuffquant on the three alignments separately and ask Cuffnorm to treat them as replicates. This would be my ideal, but I would need something to compare to in order to get Cuffnorm to run. Could I just compare my replicates to themselves?
  2. Run Cuffquant on the three alignments separately, ask Cuffnorm to treat them as separate samples (giving me FPKM values per sample) and then take a mean
  3. Run Cuffquant on the merged alignments. This won't account in any way for variation between the replicates.

I have mostly discounted 3 as pooling the data seems to defeat the purpose of replicating. So my questions are:

For approach 1, does anyone know of a good way of generating representative FPKM values from replicates without doing a differential expression analysis? Happy to use software other than Cufflinks.

For approach 2, is taking a mean valid? Cuffnorm is using a more sophisticated normalisation method, but perhaps I don't need this here - my feeling is that even if library sizes are different between the replicates, the rank order should still be similar and that is what I am interested in. If this is a sensible approach, is mean a good measure of central tendency here, or should I consider something else?
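As a rough illustration of approach 2, the sketch below collapses per-replicate FPKM values into one value per gene and ranks on it, using three candidate summaries (mean, median, and a geometric mean with a pseudocount). The gene names, the FPKM numbers, the `PSEUDOCOUNT` value, and the helper names `summarise` and `rank_genes` are all made up for illustration; none of them comes from Cuffnorm output.

```python
from statistics import mean, median, geometric_mean

# Hypothetical per-replicate FPKM values for four genes (made-up numbers).
fpkm = {
    "geneA": [120.0, 100.0, 140.0],
    "geneB": [5.0, 7.0, 6.0],
    "geneC": [0.0, 0.5, 1.0],
    "geneD": [30.0, 330.0, 30.0],  # one outlying replicate
}

PSEUDOCOUNT = 1.0  # assumption: lets the geometric mean cope with zero FPKMs


def summarise(values, method="median"):
    """Collapse the replicate FPKMs of one gene into a single value."""
    if method == "mean":
        return mean(values)
    if method == "median":
        return median(values)
    if method == "geomean":
        shifted = [v + PSEUDOCOUNT for v in values]
        return geometric_mean(shifted) - PSEUDOCOUNT
    raise ValueError(f"unknown method: {method}")


def rank_genes(table, method="median"):
    """Return gene names ordered from highest to lowest summary FPKM."""
    summary = {gene: summarise(vals, method) for gene, vals in table.items()}
    return sorted(summary, key=summary.get, reverse=True)


for method in ("mean", "median", "geomean"):
    print(method, rank_genes(fpkm, method))
```

In this toy data the single outlying replicate of `geneD` is enough to move it above `geneA` under the mean, while the median and geometric mean keep `geneA` first. With only three replicates, comparing the rankings from more than one summary may be a cheap way to see how sensitive the rank order is to that choice.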

Does anyone have any other approaches that I haven't thought of?

Thanks for your help!

rna-seq dna-seq fpkm • 361 views
Kevin Blighe (London / Brazil) wrote (5 months ago):


The way that you're aiming to do it seems likely to cause problems further down the line. From what I can infer, you have the raw FASTQ files, right? I would make your life a lot simpler by using an 'alignment-free' quantification program like Kallisto, which pseudoaligns the reads in your FASTQ files (single- or paired-end) to a reference transcriptome in FASTA format and estimates transcript abundances. For human, it is better to download the GENCODE reference transcriptome (look for the heading 'Fasta files').

Once you have estimated counts over the GENCODE transcripts, which total just over 199,000, you can easily read these into DESeq2 or edgeR for further analyses. There, you'll easily be able to rank the genes and do other analyses. Note, however, that these programs don't use FPKM by default. FPKM has in fact lost appeal for RNA-seq data in recent years because it doesn't deal well with large count abundances.

Hope that this helps!




Thanks for the reply.

Yes, I am aware of the pitfalls of FPKM and have used DESeq2 for differential expression for a long time.

In this case, I am working with a non-model organism that has a very unusual genome architecture. Specifically, genes are transcribed in large polycistronic units to make a pre-RNA, which is then spliced into mature mRNAs, kind of like huge bacterial operons. I am using an alignment-based method because I also have an interest in the fate of the intergenic pre-RNA, and Kallisto will miss this. That is a separate analysis and not related to this question, but I am just re-using that alignment here. In any event, I have never been able to make Kallisto work well for organisms in this clade.

The hypothesis I am trying to investigate here is that there is some spatial regulation of transcription; that is, genes that are closer to certain features in the genome are more abundant than genes that are further away. My feeling is that in order to compare different genes within one sample like this I need to normalise for gene length, which as far as I am aware DESeq2 doesn't do. For this I would have to use TPM or FPKM. Hence the question.
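To make the length-normalisation point concrete, here is a minimal sketch of how TPM and FPKM can be computed from per-gene fragment counts and gene lengths for a single sample. The counts, lengths, and gene names are invented, and the functions `tpm` and `fpkm` are illustrative helpers, not the implementation used by Cufflinks or any other tool (which typically use effective lengths and handle multi-mapping reads).

```python
def tpm(counts, lengths_bp):
    """Convert per-gene fragment counts to TPM for one sample."""
    # Fragments per base of gene length.
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Scale so that the values sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def fpkm(counts, lengths_bp):
    """Fragments per kilobase of gene per million counted fragments."""
    total_fragments = sum(counts.values())
    if total_fragments == 0:
        return {gene: 0.0 for gene in counts}
    return {
        gene: counts[gene] * 1e9 / (lengths_bp[gene] * total_fragments)
        for gene in counts
    }


# Hypothetical counts and gene lengths (made-up numbers).
counts = {"geneA": 1000, "geneB": 1000, "geneC": 500}
lengths_bp = {"geneA": 2000, "geneB": 500, "geneC": 500}

tpm_values = tpm(counts, lengths_bp)
fpkm_values = fpkm(counts, lengths_bp)

tpm_rank = sorted(tpm_values, key=tpm_values.get, reverse=True)
fpkm_rank = sorted(fpkm_values, key=fpkm_values.get, reverse=True)
print(tpm_rank, fpkm_rank)
```

Within one sample, TPM and FPKM computed this way differ only by a constant factor, so they give the same rank order. In the toy data, `geneA` and `geneB` have equal raw counts, but the shorter `geneB` ranks first once length is taken into account.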

Reply written 5 months ago by crouch.k

Hey, it sounds very interesting. So, it's not quite a bacterium but its expression patterns behave in that way? I recently analysed bacterial cDNA and used a few different programs including RockHopper, Velvet/Oases, MaSuRCA, and a customised method. The customised method may be of interest to you:

I used TopHat2/Cufflinks to identify existing and novel transcripts, but I configured it in such a way that it could identify likely operons in the data. It seemed to pick out quite a few. I then ran the gffread utility over my assembled GTF in order to produce a new reference transcriptome (gffread takes a GTF and a genome FASTA and produces a FASTA sequence for each transcript in the GTF). I then used Kallisto and DESeq2 to do a very simple analysis using these.
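For reference, the gffread and Kallisto steps described above might look roughly like the commands assembled in this sketch. It only builds and prints the command lines; it does not run them. All file names are placeholders, the helper `build_commands` is hypothetical, and the exact options used in the original analysis are not given in this thread, so treat the flags as a typical invocation and check them against the installed versions.

```python
import shlex


def build_commands(gtf, genome_fasta, reads_1, reads_2, out_dir="kallisto_out"):
    """Return the three command lines as argument lists (nothing is executed)."""
    transcripts = "transcripts.fa"  # placeholder output name
    index = "transcripts.idx"       # placeholder index name
    return [
        # Write one FASTA record per transcript in the assembled GTF.
        ["gffread", "-w", transcripts, "-g", genome_fasta, gtf],
        # Build a Kallisto index from the new reference transcriptome.
        ["kallisto", "index", "-i", index, transcripts],
        # Quantify one paired-end library against that index.
        ["kallisto", "quant", "-i", index, "-o", out_dir, reads_1, reads_2],
    ]


commands = build_commands(
    "assembled.gtf", "genome.fa", "rep1_R1.fastq.gz", "rep1_R2.fastq.gz"
)
for cmd in commands:
    print(shlex.join(cmd))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` if the tools are on the `PATH`; the quantification step would be repeated once per replicate with its own output directory.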

Given the seemingly unique situation that you have, it may be a great opportunity to actually write a new method/program. Also, you may consider updating to HISAT2 (as I should also have done)!

Reply written 5 months ago by Kevin Blighe