Interpretation of RNA seq pipeline in a paper
4
1
Entering edit mode
18 months ago

The paper I refer to is: https://doi.org/10.1038/s41467-018-08083-z

A couple of snippets from the results & methods section:

"We first adopt a highly stringent pair-wise comparison method5, in which a gene with a much higher RPKM (reads per kilobase million) value in one domain than all the other 8 domains (log2FC > 1, false discovery rate (FDR) < 0.05) is considered as domain specific."

"Reads were mapped to the Arabidopsis Information Resource TAIR10 reference genome build with TopHat2 (version 2.0.9) and BOWTIE (version 2.1.0) allowing up to two mismatches68 after filtering the low-quality reads (PHRED quality score < 20). The gene locus expression levels were calculated based on mapping outputs after removing reads mapped to ribosomal RNAs and transfer RNAs using Cuffdiff2 (version 2.1.1)69, and expression levels were normalized to the RPKM unit using edgeR with significant expression cutoff value set to RPKM > 1. Differential expression was assessed with edgeR and the cutoff value was >2-fold change in expression with Benjamini–Hochberg adjusted FDR < 0.01. We used Cuffdiff2 to quantify the abundance of annotated isoforms. For the identification of domain-specifically expressed genes, three methods were performed independently. Pair-wise comparison was carried out using the differential expression assessed with edgeR."

I will address my queries as parts:

1) As far as I understand, edgeR uses TMM to normalise across libraries and fits the normalised raw counts to a NB distribution (I hope I got that part right) before testing for significance across samples and/or treatment. From the methods section in the above paper, all I gather is, the authors have somehow integrated the use of RPKM along with the edgeR package. Could someone clarify this for me?

2) And as far as I have been able to tell from the plethora of posts here and elsewhere, there doesn't seem to be any situation where FPKM/RPKM are to be used for sound qualitative/quantitative analysis of RNA seq data. Am I wrong to say this? Or are there situations where converting to FPKM units would give us some that edgeR and deseq2 don't?

3) Is this a suitable extension of what the authors of edgeR originally intended the package to be used for? I ask because I am in the middle of some analysis myself and this paper serves as one reference point for analysis.

4) Is it proper to use FPKM after normalising to library sizes (as done in edgeR)? I refer to What the FPKM? A review of RNA-Seq expression units (https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/)

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments.

.... 4) does this imply that post between-sample normalisation, one could possibly utilise FPKM values for analysis? This seems to me what the authors might have done in the paper.

RNA-Seq edgeR R • 628 views
2
Entering edit mode
18 months ago
colin.kern ▴ 970

As others have said, the description is not specific enough to know for sure what they did, however there is an rpkm function in the edgeR package which returns a count table with the TMM-normalized counts further normalized by gene length, so I would guess that's what they did.

In response to your question #2, it makes sense to me why they used RPKMs here. They wanted to set an expression cut-off to filter genes with expression too low to be biologically significant. Without normalizing by gene lengths, the cut-off would be biased towards keeping longer genes that have similar transcript abundances as shorter genes which would be filtered.

5
Entering edit mode
18 months ago
shawn.w.foley ★ 1.2k

I'm not familiar with edgeR specifically, but I know that while DESeq2 uses fits raw counts to a NB distribution when calculating differential expression, it includes a function to simply calculate and export RPKM/FPKM. ATPoint raises a good point about proceeding with caution, but my interpretation of the methods is that edgeR was used to calculate the RPKM expression, and that was used as a filtering criteria in addition to log2FC and FDR.

It sounds like they only called genes "significant" if they met three criteria: RPKM > 1 & |log2FC| > 1 & FDR < 0.01.

Personally, I've used similar filtering criteria (after generating a density curve of RPKM/FPKM to empirically define our cutoff for expression). The logic is that you can have substantial fold changes in genes going from 1 read to 10 reads, but in a 50M read experiment that's well within the realm of noise, therefore we filtered those genes out of the analysis.

0
Entering edit mode

Out of interest, how did you fit this curve? I mean just a histogram and by-eye cutoff or something automated like mclustBIC?)

2
Entering edit mode

I simply run plot(density(log10(fpkm))) in R. In my experience a bimodal curve is generated with a trough right near 1 FPKM. I always use this as a sanity check, I do have several datasets where the trough appears new 2 or 3 FPKM and adjust my filtering criteria accordingly.

I've gone through the exercise once or twice to get an exact minimum value, but it's so close to the integer value that I round (it also makes the methods section a bit more intuitive than using a cutoff of 1.034 FPKM).

2
Entering edit mode
18 months ago
ATpoint 49k

Method sections tend to be a mess for computational methods. It is impossible to say what exactly they did without access to the script they used. Maybe they mean CPM when saying RPKM and CPM would be the default normalized counts of edgeR, who knows. This reads like a wetlab person wrote this or at least changed what the bioinformatician wrote. I suggest you download raw data and perform analysis yourself, never trust published results :-) If you can reproduce the findings => good, if not => be careful and perform confirmation experiments or verify with independent datasets.

2
Entering edit mode
18 months ago

Granted that mine is a guess since I haven't read the paper...

the authors have somehow integrated RPKM along with the edgeR package

Mmm... no, I don't think so... They used edgeR for detecting differentially expressed genes and they considered as biologically relevant genes with FDR < 0.01 and logFC > 1 (or more likely |logFC| > 1?).

Independently of the differential analysis, they normalized gene expression to RPKM and they considered as biologically relevant genes with RPKM > 1. This normalization can be used (with some assumption, simplification etc) to tell, for example, how many genes are expressed in different cell types as the authors did in Fig 1d.

2
Entering edit mode

Just to clarify, the methods specify >2-fold change, which would be a |log2FC| > 1, not >2 (unless I misunderstood your answer).

1
Entering edit mode

You are right, well spotted - I corrected my answer

0
Entering edit mode

I understand. However, I have a line here from the paper which adds to my confusion: "We first adopt a highly stringent pair-wise comparison method5, in which a gene with a much higher RPKM (reads per kilobase million) value in one domain than all the other 8 domains (log2FC > 1, false discovery rate (FDR) < 0.05) is considered as domain specific.".

Could you kindly offer an opinion regarding this? (the question has been updated to reflect this additional line from the paper)