Question

Get FPKM across replicates without doing a differential expression

0

6.6 years ago

crouch.k ▴ 30

Hi,

I am trying to perform a within-sample analysis of expression data using FPKM - essentially I want to be able to rank my genes. I am not doing differential expression.

I have 3 biological replicates of my sample. What I would like to be able to do is generate an FPKM for each gene that is representative of the three replicates and rank based on that.

I have three approaches:

1. Run Cuffquant on the three alignments separately and ask Cuffnorm to treat them as replicates. This would be my ideal, but I would need something to compare to in order to get Cuffnorm to run. Could I just compare my replicates to themselves?
2. Run Cuffquant on the three alignments separately, ask Cuffnorm to treat them as separate samples (giving me FPKM values per sample) and then take a mean.
3. Run Cuffquant on the merged alignments. This won't account in any way for variation between the replicates.

I have mostly discounted 3 as pooling the data seems to defeat the purpose of replicating. So my questions are:

For approach 1, does anyone know of a good way of generating representative FPKM values from replicates without doing a differential expression? Happy to use software other than Cufflinks.

For approach 2, is taking a mean valid? Cuffnorm is using a more sophisticated normalisation method, but perhaps I don't need this here - my feeling is that even if library sizes are different between the replicates, the rank order should still be similar and that is what I am interested in. If this is a sensible approach, is mean a good measure of central tendency here, or should I consider something else?
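To make the choice in approach 2 concrete, here is a minimal sketch that collapses per-replicate FPKM values into one value per gene and ranks on it. The gene names and FPKM values are made up for illustration, the function names are hypothetical, and the pseudocount of 1 in the geometric mean is an assumption rather than a recommendation.

```python
from statistics import mean, median, geometric_mean


def summarise_replicates(fpkm_by_gene, method="median", pseudocount=1.0):
    """Collapse per-replicate FPKM values into one value per gene."""
    summary = {}
    for gene, values in fpkm_by_gene.items():
        if method == "mean":
            summary[gene] = mean(values)
        elif method == "median":
            summary[gene] = median(values)
        elif method == "geometric":
            # The pseudocount (an assumption) keeps zero-FPKM replicates valid.
            shifted = [v + pseudocount for v in values]
            summary[gene] = geometric_mean(shifted) - pseudocount
        else:
            raise ValueError(f"unknown method: {method}")
    return summary


def rank_genes(summary):
    """Return gene IDs ordered from highest to lowest summary value."""
    return sorted(summary, key=summary.get, reverse=True)


# Hypothetical FPKM values for three replicates (made-up numbers).
fpkm = {
    "geneA": [120.0, 110.0, 130.0],
    "geneB": [5.0, 400.0, 6.0],  # one replicate is an outlier
    "geneC": [50.0, 55.0, 45.0],
}

for method in ("mean", "median", "geometric"):
    print(method, rank_genes(summarise_replicates(fpkm, method)))
```

In this toy data the arithmetic mean lets a single outlier replicate push geneB to the top, while the median and the geometric mean both rank it last, which is the kind of difference worth checking before settling on a measure of central tendency.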

Does anyone have any other approaches that I haven't thought of?

Thanks for your help!

RNA-Seq DNA-seq FPKM • 2.0k views

updated 6.6 years ago by Kevin Blighe 87k • written 6.6 years ago by crouch.k ▴ 30

score 0 · Answer 1 · 2017-09-28

0

6.6 years ago

Kevin Blighe 87k

Hey,

The way that you're aiming to do it seems likely to result in some problems later down the line. From what I can infer, you have the raw data FASTQ files, right? I would just make your life a lot simpler by using an 'alignment-free' count abundance program like Kallisto, which counts reads in your FASTQ (single or paired) over a reference transcriptome in FASTA. For human, better to download the GENCODE reference transcriptome (look for the heading 'Fasta files').

Once you have estimated counts over the GENCODE transcripts, which total just over 199,000 transcripts and their isoforms, you can easily read these into DESeq2 or edgeR for further analyses. There, you'll easily be able to rank the genes and do other analyses. Note, however, that these programs don't derive FPKM values by default - FPKM has in fact lost appeal for RNA-seq data in recent years.
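As a rough illustration, Kallisto writes an `abundance.tsv` file per sample with the columns `target_id`, `length`, `eff_length`, `est_counts` and `tpm`, so a length-normalised value is already available per transcript. The sketch below reads the `tpm` column using only the standard library; the helper name and the example rows are hypothetical, and the values are made up.

```python
import csv
import io


def read_kallisto_tpm(handle):
    """Map each target_id to its TPM from a Kallisto abundance.tsv handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["target_id"]: float(row["tpm"]) for row in reader}


# Made-up rows in the abundance.tsv layout, used here instead of a real file.
example_rows = [
    ["target_id", "length", "eff_length", "est_counts", "tpm"],
    ["tx1", "1000", "850.0", "100.0", "600000.0"],
    ["tx2", "2000", "1850.0", "145.1", "400000.0"],
]
example_tsv = "\n".join("\t".join(row) for row in example_rows)

tpm_by_transcript = read_kallisto_tpm(io.StringIO(example_tsv))
print(tpm_by_transcript)
```

With a real run you would pass an open file handle for each sample's `abundance.tsv` instead of the in-memory example. These values are per transcript, so a gene-level ranking would still need them summed per gene.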

Hope that this helps!

Kevin

5.6 years ago by Kevin Blighe 87k

0

Hi

Thanks for the reply.

Yes, I am aware of the pitfalls of FPKM and have used DESeq2 for differential expression for a long time.

In this case, I am working with a non-model organism that has a very unusual genome architecture. Specifically, genes are transcribed in large polycistronic units to make a pre-RNA which is then spliced into mature mRNAs, kind of like huge bacterial operons. I am using an alignment-based method because I also have an interest in the fate of the intergenic pre-RNA and Kallisto will miss this. That is a separate analysis and not related to this question, but I am just re-using that alignment for this. In any event, I have never been able to make Kallisto work well for organisms in this clade anyway.

The hypothesis I am trying to investigate here is that there is some spatial regulation of transcription - that is, genes that are closer to certain features in the genome are more abundant than genes that are further away. My feeling is that in order to compare different genes within one sample like this I need to normalise for gene length, which as far as I am aware DESeq2 doesn't do - for this I would have to use TPM or FPKM. Hence the question.
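For reference, the length normalisation itself is simple enough to sketch directly from per-gene read counts and gene lengths. The function names, gene names and numbers below are made up for illustration, and the sketch uses full gene length rather than an effective length, which is a simplifying assumption.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of gene per million mapped fragments."""
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total) for g in counts}


def tpm(counts, lengths_bp):
    """Transcripts per million: per-base rates rescaled to sum to 1e6."""
    rate = {g: counts[g] / lengths_bp[g] for g in counts}
    scale = sum(rate.values())
    return {g: r * 1e6 / scale for g, r in rate.items()}


# Hypothetical counts and gene lengths for one replicate (made-up numbers).
example_counts = {"geneA": 1000, "geneB": 1000, "geneC": 3000}
example_lengths = {"geneA": 1000, "geneB": 4000, "geneC": 2000}

fpkm_values = fpkm(example_counts, example_lengths)
tpm_values = tpm(example_counts, example_lengths)

# Rank genes from most to least abundant under each measure.
print(sorted(fpkm_values, key=fpkm_values.get, reverse=True))
print(sorted(tpm_values, key=tpm_values.get, reverse=True))
```

Within a single sample, FPKM and TPM differ only by a constant factor, so both give the same gene ranking; here geneB drops below geneA despite having the same count because it is four times longer.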

6.6 years ago by crouch.k ▴ 30