Hi,
I'm trying to make a box plot of gene expression from bulk RNA-seq data. This is the pipeline I followed: STAR -> StringTie. The raw counts were normalized with DESeq2 (disease vs. controls), and the normalized counts are used for the plot.
I'm plotting the expression of one gene, gene A, in disease and control samples. There are 150 disease samples and 30 control samples. The normalized counts range from 0 to 3000 in disease and from 0 to 30 in controls.
The distribution is not normal: most samples have values between 0 and 10, and very few have values above 10. How can I make a better box plot, and what should be considered an outlier?
Thanks, Vinisha
@papyrus Thanks for the input. I tried VST and the box plot looked like the one below. The higher values in the disease group are drawn as outliers, but they might be important for interpreting a biological difference. What should I do in this case?
Nobody's going to be able to answer that for you. You have the biological knowledge and metadata for these samples. Are there any differences in those outlier samples that might be meaningful? In what other ways do they differ from the other disease samples?
Agree. In a more general setting, I would consider removing outliers for plotting only when it is clear that they are NOT driving your differential expression result.
For your gene, it looks like the difference in expression may be driven mainly by the outliers (the median of the control boxplot looks similar to, or even higher than, the disease median). If you removed those samples, the gene would probably no longer be differentially expressed. If this is the case, I would plot the figure with the outliers included, and then look for a biological explanation.
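One rough way to check whether the flagged samples drive the difference is to compare group summaries with and without them. The sketch below uses only the Python standard library and made-up transformed values (for example VST output). The data, the function name `tukey_keep`, and the 1.5 x IQR rule are assumptions for illustration, and quartile conventions differ slightly between plotting tools. It is not a substitute for the DESeq2 test itself.

```python
import statistics

def tukey_keep(values, k=1.5):
    """Keep values inside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]

# Hypothetical transformed expression values for one gene (not real data).
disease = [5.0, 5.2, 5.5, 5.8, 6.0, 6.1, 6.3, 11.5, 12.0]
control = [5.4, 5.9, 6.0, 6.2, 6.5]

# Disease samples that the boxplot rule would not draw as outliers.
disease_trimmed = tukey_keep(disease)

for label, values in [("disease (all)", disease),
                      ("disease (no outliers)", disease_trimmed),
                      ("control", control)]:
    print(label, "n =", len(values),
          "mean =", round(statistics.mean(values), 2),
          "median =", statistics.median(values))
```

In this toy example the disease mean is above the control mean only while the two high samples are included, and the medians barely move. That is the pattern described above, where the outliers carry the effect.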
Edit: of course, this refers to seeing an outlier pattern in one particular gene. If you were seeing the same samples appear as outliers across many genes, you would probably have to address the issue in earlier steps of the pipeline (sample filtering, model adjustment during testing, etc.).
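To tell a gene-specific pattern from a sample-level problem, you could count how often each sample is flagged across genes. This is a minimal sketch with hypothetical gene and sample names, using the same 1.5 x IQR rule as an assumed outlier definition. With a real matrix you would loop over all genes of interest.

```python
import statistics
from collections import Counter

def flagged_samples(expr_by_sample, k=1.5):
    """Return the set of sample names outside the 1.5 x IQR fences for one gene."""
    values = list(expr_by_sample.values())
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return {s for s, v in expr_by_sample.items() if v < lower or v > upper}

# Hypothetical transformed expression: gene -> {sample: value}.
expression = {
    "geneA": {"s1": 5.0, "s2": 5.2, "s3": 5.1, "s4": 5.3, "s5": 11.0},
    "geneB": {"s1": 2.0, "s2": 2.1, "s3": 1.9, "s4": 2.2, "s5": 8.0},
    "geneC": {"s1": 7.0, "s2": 7.1, "s3": 12.0, "s4": 6.9, "s5": 7.2},
}

# Number of genes in which each sample is flagged as an outlier.
outlier_counts = Counter()
for gene, per_sample in expression.items():
    outlier_counts.update(flagged_samples(per_sample))

print(outlier_counts.most_common())
```

A sample that is flagged in a large share of genes is more likely a technical or quality issue than a gene-specific biological signal.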