My PCA plot is weird: PC1 with 98.684% variance and PC2 with 0.582% variance
3 months ago

Hi all, I'm an undergrad learning how to process some data given to me by my mentor. This is bulk RNA-seq data, aligned with STAR and quantified with featureCounts. I pre-filtered using rowSums(counts(dds)) >= 10, ran DESeq(), and set alpha = 0.05. However, I got a weird-looking MA plot with diagonal lines, and I'm not sure why it looks like that. The MA plot was made with plotMA(). Even weirder, my PCA plot has PC1 with 98.684% variance and PC2 with 0.582% variance. I'm quite confused and not sure where to go from here. I already ran vst(dds, blind=FALSE) before passing the data to plotPCA(). My biggest question is: how likely is it that PC1 actually captures 98% of the variation without this being due to some technical artefact? What can I do to make sure this is an actual biological difference? And if my PCA plot is incorrect, what steps can I take to check what went wrong?

I'm still learning and always happy to learn more but I am quite new to bulk RNA-seq so I would appreciate any guidance on what I can do to troubleshoot this. I am more than happy to provide any details about the data. Thank you.

[Figure: PCA plot made with plotPCA]
[Figure: MA plot made with plotMA]

r pca rna-seq visualization • 1.8k views

It is very unlikely unless something bad (very bad) happened during sample preparation: extraction, library preparation, or sequencing.


Hi, sorry, what do you mean by that? Do you mean that my results are likely caused by something very bad happening during sample preparation?


Check the RNA quality (RIN); check if RNA is completely degraded in one group of samples compared to the other.

Ensure that the same library preparation and sequencing protocols have been used for all samples.

Explain the type of samples you are comparing. There could be a biological explanation for why PC1 in the PCA plot accounts for 98.684% of the variance, assuming all samples are independent biological replicates and not multiple sequencing runs of RNA extracted from just two samples (one per treatment).


One group is an immortalized cell line. The other is the same cell type, but derived from stem cells. I only have access to the fastq files. Is there any other way to check RNA quality?


> Is there any other way to check RNA quality?

You should ask the person who prepared the RNA samples.

> One group is an immortalized cell line. The other is the same cell type, but derived from stem cells. I only have access to the fastq files. Is there any other way to check RNA quality?

That could explain the large variance you see in PC1. However, you should discuss this with your mentor, as I am not an expert in this specific field. I primarily work with RNA-seq on bacteria.

Perhaps this task was given to you simply as an exercise to learn how to process RNA-seq data.


Judging by the MA plot, maybe try setting a higher threshold for filtering, i.e. the rowSums(counts(dds))>=10 part?


Hmm, I tried 20 and 30. I still got the same result.


This is not the issue, from my experience.

11 days ago
Kevin Blighe ★ 90k

Your PCA plot indicates that the first principal component explains a very high proportion of the variance in your dataset. This can occur due to a strong biological signal, but it is also common with technical artifacts, such as batch effects, sample outliers, or differences in library preparation. Given that your groups consist of an immortalized cell line versus a stem cell-derived line of the same type, a large biological difference is plausible, as such cells may exhibit distinct transcriptional profiles.

To determine if the variance captured by the first principal component is biological rather than artifactual, perform the following checks. First, generate a heatmap of sample-to-sample distances to visualize overall similarities:

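library(DESeq2)
library(pheatmap)
# blind = TRUE ignores the design formula, which suits unsupervised QC
vsd <- vst(dds, blind = TRUE)
# Euclidean distances between samples (samples are columns, hence t())
sampleDists <- dist(t(assay(vsd)))
pheatmap(as.matrix(sampleDists))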

This may reveal whether one group clusters tightly while the other does not. Next, examine the loadings for the first principal component to identify the genes driving the separation. DESeq2's plotPCA() does not return loadings and by default uses only the 500 most variable genes, so run base R's prcomp() on the transformed matrix instead:

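# PCA on all genes; prcomp() expects samples in rows
pca <- prcomp(t(assay(vsd)), scale. = FALSE)
# Loadings (gene weights) for PC1
loadings <- pca$rotation[, 1]
# The 100 genes with the largest absolute loading on PC1
topGenes <- names(sort(abs(loadings), decreasing = TRUE))[1:100]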

Annotate these top genes and assess if they relate to known biological differences between immortalized and stem cell-derived lines, such as proliferation or differentiation pathways.
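To get a feel for how a strong group difference translates into the percentage shown on the PC1 axis, here is a minimal base R sketch on simulated data. The matrix `expr` stands in for `assay(vsd)`, and the group sizes, the fraction of shifted genes, and the effect sizes are all made-up assumptions for illustration, not values from your dataset.

```r
set.seed(1)
n_genes <- 2000
n_per_group <- 3
group <- rep(c("A", "B"), each = n_per_group)

# Baseline log-scale expression shared by all samples
baseline <- rnorm(n_genes, mean = 8, sd = 2)
# Hypothetical group effect: roughly 30% of genes shift in group B
shift <- ifelse(runif(n_genes) < 0.3, rnorm(n_genes, mean = 0, sd = 3), 0)

# genes x samples matrix, standing in for assay(vsd)
expr <- sapply(group, function(g) {
  baseline + (g == "B") * shift + rnorm(n_genes, sd = 0.2)
})
rownames(expr) <- paste0("gene", seq_len(n_genes))
colnames(expr) <- paste0(group, seq_len(n_per_group))

# Same call as above: samples in rows, no scaling
pca <- prcomp(t(expr), scale. = FALSE)

# Percent variance per component, as reported on PCA plot axes
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
print(round(pct_var[1:2], 1))

# Top PC1 genes by absolute loading, and the fraction of them
# that actually carry a simulated group shift
loadings <- pca$rotation[, 1]
top_genes <- names(sort(abs(loadings), decreasing = TRUE))[1:100]
frac_shifted <- mean(shift[match(top_genes, rownames(expr))] != 0)
print(frac_shifted)
```

In this toy setup, replicates within a group differ only by small noise, so almost all variance lies along the group axis. A very high PC1 percentage is therefore not by itself evidence of an artifact; what matters is whether the genes behind it make biological sense.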

To rule out technical artifacts, since you have FASTQ files, run quality control with FastQC and aggregate results using MultiQC:

# FastQC does not create the output directory itself
mkdir -p fastqc_results
fastqc *.fastq.gz -o fastqc_results
multiqc fastqc_results/

Inspect the reports for low base quality, adapter content, or per-base sequence biases. Also review the STAR alignment logs (the "Uniquely mapped reads %" line in each Log.final.out file) for mapping rates; rates that differ markedly between groups suggest technical issues.
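If there are many samples, the mapping rates can be collected in R rather than read by eye. The sketch below assumes each log contains a line of the form `Uniquely mapped reads % | 89.50%`; the helper name `read_star_unique_pct`, the sample names, and the percentages in the mock files are hypothetical and only there to make the example runnable.

```r
# Extract the uniquely mapped percentage from one STAR Log.final.out file
read_star_unique_pct <- function(path) {
  lines <- readLines(path)
  hit <- grep("Uniquely mapped reads %", lines, fixed = TRUE, value = TRUE)
  if (length(hit) != 1) return(NA_real_)
  # The value follows the "|" separator; strip whitespace and the "%" sign
  value <- trimws(sub(".*\\|", "", hit))
  as.numeric(sub("%", "", value, fixed = TRUE))
}

# Mock logs with made-up values, written to a temporary directory
log_dir <- file.path(tempdir(), "star_logs")
dir.create(log_dir, showWarnings = FALSE)
mock <- c(cellline_1 = "91.20%", stemcell_1 = "62.75%")
for (s in names(mock)) {
  writeLines(
    c("                          Number of input reads |\t1000000",
      paste0("                        Uniquely mapped reads % |\t", mock[[s]])),
    file.path(log_dir, paste0(s, "_Log.final.out"))
  )
}

# Collect the rate for every log found in the directory
files <- list.files(log_dir, pattern = "Log\\.final\\.out$", full.names = TRUE)
rates <- vapply(files, read_star_unique_pct, numeric(1))
names(rates) <- sub("_Log\\.final\\.out$", "", basename(files))
print(rates)
```

For real data, point `log_dir` at the directory holding your STAR output and compare the rates between the two groups. MultiQC can also summarize STAR logs if they are in the directory it scans.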

For the MA plot showing diagonal lines, this pattern often arises from genes with low counts or zeros in one group, which leads to extreme log fold changes. Increase your pre-filtering threshold to rowSums(counts(dds)) >= 50 and re-run DESeq2. If the pattern persists, plot the dispersion estimates:

plotDispEsts(dds)

Unusually high dispersions, or estimates that scatter far from the fitted trend, may indicate normalization problems.
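To illustrate why genes that are silent in one group line up diagonally, and how a per-sample filter treats them, here is a small base R sketch. It uses a simulated count matrix in place of `counts(dds)` and a simple pseudocount-based log ratio rather than DESeq2's model-based estimates; group labels, gene numbers, and count distributions are assumptions made up for the example.

```r
set.seed(42)
group <- factor(rep(c("cell_line", "stem_derived"), each = 3))
n_genes <- 500

# Simulated counts standing in for counts(dds): genes x samples
counts <- matrix(rnbinom(n_genes * 6, mu = 50, size = 5), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
# Make the first 100 genes low-count in one group and silent in the other
silent <- seq_len(100)
counts[silent, group == "cell_line"] <- rpois(300, lambda = 3)
counts[silent, group == "stem_derived"] <- 0

# Pseudocount-based M (log ratio) and A (average log expression)
log_a <- log2(rowMeans(counts[, group == "cell_line"]) + 1)
log_b <- log2(rowMeans(counts[, group == "stem_derived"]) + 1)
M <- log_a - log_b
A <- (log_a + log_b) / 2

# For silent genes log_b is 0, so M = 2 * A: they all fall on one diagonal
on_diagonal <- isTRUE(all.equal(unname(M[silent]), unname(2 * A[silent])))
print(on_diagonal)

# Alternative filter: at least 10 counts in at least as many samples
# as the smallest group, instead of a threshold on the row total
smallest_group <- min(table(group))
keep <- rowSums(counts >= 10) >= smallest_group
print(sum(keep[silent]))   # silent low-count genes that survive the filter
print(sum(keep[-silent]))  # remaining genes that survive the filter
```

On a real DESeqDataSet the same filter would read rowSums(counts(dds) >= 10) >= smallest_group. Because it requires support in several samples, it is less easily passed by a gene whose row total comes from low counts scattered across only a few samples.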

Kevin
