Gene expression RNASeq
2.9 years ago

Hi,

I'm trying to make a box plot of gene expression from bulk RNA-seq data. The pipeline I followed was STAR -> StringTie. The raw counts were normalized with DESeq2 (disease vs. controls), and the normalized counts are used for the plot.

I'm plotting the expression of one gene, gene A, in disease and control samples. There are 150 disease samples and 30 control samples; the normalized counts range from 0 to 3000 in disease and from 0 to 30 in controls.

The distribution is not normal: many samples have values between about 0 and 10, and very few have values above 10. How can I make a better box plot? What should be considered an outlier?

[Three images attached]

Thanks, Vinisha

DESEQ2 RNAseq • 1.4k views
2.9 years ago
Papyrus ★ 2.9k

You can use VST- or rlog-transformed values from DESeq2 (see the example in the vignette). Because these values are on a log2-like scale, the range is compressed and the extreme values will most likely stand out much less. To show the distribution of the data, try violin plots (for example ggplot2 with geom_violin()), violin plots with box plots overlaid (examples here), or box plots with the individual data points overlaid.
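On the "what counts as an outlier" part: box plots usually draw a point separately when it lies more than 1.5 × IQR beyond the quartiles (this is the default in ggplot2's geom_boxplot()). Below is a minimal sketch of that rule in plain Python, on made-up counts, showing how a log-scale transform changes what gets flagged. Note the assumptions: log2(count + 1) is only a rough stand-in for VST/rlog, the numbers are hypothetical, and the quartile method here may differ slightly from the one R uses.

```python
import math
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical normalized counts for one gene (not the poster's data).
counts = [0, 1, 2, 2, 3, 4, 5, 6, 8, 10, 40, 250, 3000]

# log2(x + 1) as a simple stand-in for a variance-stabilizing transform.
log_counts = [math.log2(c + 1) for c in counts]

raw_outliers = flag_outliers(counts)
log_outliers = flag_outliers(log_counts)

print("flagged on raw scale:", raw_outliers)
print("flagged on log scale:", [round(v, 2) for v in log_outliers])
```

In this toy example fewer points are flagged on the log scale, but the most extreme ones still are, so a transform compresses the range without deciding for you whether those samples matter.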


@papyrus Thanks for the input. I tried VST, and the box plot looked like the one below. The higher values in the disease group are drawn as outliers, but they might be important for interpreting a biological difference. What should I do in this case?

[Image attached: box plot of the VST values]


Nobody's going to be able to answer that for you. You have the biological knowledge and metadata for these samples. Are there any differences in those outlier samples that might be meaningful? In what other ways do they differ from the other disease samples?


Agree. In a more general setting, I would consider removing outliers for plotting only when it is clear that they are NOT driving your differential expression result.

For your gene, it looks like the differences in expression may be particularly driven by the outliers (the median of the boxplot for the control looks similar or even higher than for the disease), so that if you were to remove those samples the gene would not be differentially expressed. If this is the case, I would plot the figure with outliers, and then look for a biological explanation.
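One rough way to check whether a few samples are driving the difference is to compare group means and medians with and without the highest values. This is only a sketch in plain Python on made-up numbers; the group sizes, values, and the choice to drop the top three samples are all assumptions for illustration, and it is not a substitute for the DESeq2 test itself.

```python
import statistics

# Hypothetical normalized counts for one gene (not the poster's data).
disease = [0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 900, 1500, 3000]
control = [0, 2, 3, 4, 5, 6, 8, 10, 30]


def summarize(values):
    """Mean is sensitive to extreme values; median is not."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def drop_top(values, n):
    """Return the values with the n largest removed."""
    ordered = sorted(values)
    return ordered[: len(ordered) - n]


full = summarize(disease)
trimmed = summarize(drop_top(disease, 3))  # assumption: 3 extreme samples
ctrl = summarize(control)

print("disease (all samples):  ", full)
print("disease (top 3 removed):", trimmed)
print("control:                ", ctrl)
```

If the disease mean is far above the control mean but falls below it once a handful of samples are removed, while the medians barely move, the difference is coming from those samples. That is the situation where I would keep them in the plot and look at the metadata.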

Edit: of course, this refers to seeing an outlier pattern in one particular gene. If you were seeing the same samples as outliers across many genes, you would probably have to address the issue in earlier steps of the pipeline (sample filtering, model adjustment during testing, etc.).
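To tell a gene-specific pattern from a sample-level problem, you could count, for each sample, in how many genes it falls above the upper box plot fence. A minimal sketch in plain Python follows; the gene and sample names, the values, and the 1.5 × IQR threshold are all hypothetical choices for illustration.

```python
import statistics
from collections import Counter


def upper_fence(values, k=1.5):
    """Upper Tukey fence: Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def count_high_outlier_genes(expr):
    """expr maps gene -> {sample: value}.

    Returns a Counter of sample -> number of genes in which that
    sample lies above the gene's upper fence.
    """
    hits = Counter()
    for by_sample in expr.values():
        fence = upper_fence(list(by_sample.values()))
        for sample, value in by_sample.items():
            if value > fence:
                hits[sample] += 1
    return hits


samples = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]

# Hypothetical transformed expression values (not real data).
expr = {
    "geneA": dict(zip(samples, [5, 6, 60, 7, 6, 5, 8, 90])),
    "geneB": dict(zip(samples, [10, 12, 11, 13, 12, 10, 11, 70])),
    "geneC": dict(zip(samples, [3, 4, 3, 5, 4, 3, 4, 40])),
}

hits = count_high_outlier_genes(expr)
for sample, n_genes in hits.most_common():
    print(sample, "is a high outlier in", n_genes, "of", len(expr), "genes")
```

A sample that is flagged in most genes (s8 in this toy data) points to a sample-level issue, whereas one flagged in a single gene (s3) is more likely a gene-specific effect.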
