Question

What do the principal components represent in this paper? I am confused by this RNA-seq analysis of variation.

0

7.5 years ago

ac1025 • 0

Hello, I am trying to understand Figure 2A in this study: mbio.asm.org/content/6/4/e00749-15.abstract

I am trying to understand what the principal components represent in relation to the large amount of RNA-seq data. Each gene has variation relative to the normalized PA01 strain. The data points include a large set of 150 transcriptomes of clinical isolates and 50 of the lab strain in different environmental conditions.

So how is an entire transcription map turned into a single point, and what do the 3 principal components represent?

Any help would be greatly appreciated. Thanks!!!!

rna-seq • 1.3k views

ADD COMMENT • link updated 7.5 years ago by Persistent LABS ▴ 750 • written 7.5 years ago by ac1025 • 0

1

@Persistent Labs gives a good explanation of the PCA procedure, but doesn't quite describe what these PCs represent...

The entire transcriptome is not represented by a single number... Each strain is plotted using 3 values (its coordinates on PC1, PC2, and PC3), and strictly speaking it takes (# of strains) - 1 such values to represent it fully. However, we cannot plot in a ~200-dimensional space (see @Persistent Labs' answer below), so we use the three components that show the most difference between the 202 strains.
Strictly speaking, the first 3 PCs represent "hidden/pseudo" variables that account for the most variation between the strains. In a biological sense, what I just said is useless... To get an idea of what biological phenomenon each PC reflects, we'd have to look at which genes contribute most to it. Each PC likely represents one or more classes of genes being co-expressed/co-regulated. Ideally, this would be visible based on where the strain came from (i.e. clinical, environmental condition, etc.).
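To make "look at which genes contribute most to each PC" concrete, here is a minimal sketch on synthetic data (not the paper's data). The function name `top_loading_genes`, the gene names, and the planted "co-regulated module" are all made up for illustration, and I am assuming a samples x genes expression matrix.

```python
import numpy as np

def top_loading_genes(expr, gene_names, n_components=3, n_top=5):
    """Return, for each PC, the genes with the largest absolute loadings.

    expr is assumed to be a (samples x genes) matrix.
    """
    centered = expr - expr.mean(axis=0)  # center each gene across samples
    # Rows of vt are the PC directions, i.e. one loading per gene.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    result = []
    for pc in range(n_components):
        order = np.argsort(-np.abs(vt[pc]))[:n_top]
        result.append([gene_names[i] for i in order])
    return result

# Synthetic example: 20 samples, 50 genes of low-level noise.
rng = np.random.default_rng(0)
n_samples, n_genes = 20, 50
expr = rng.normal(scale=0.1, size=(n_samples, n_genes))

# Hypothetical co-regulated module: genes 0-4 go up and down together.
module = rng.normal(scale=3.0, size=(n_samples, 1))
expr[:, :5] += module

gene_names = [f"gene_{i}" for i in range(n_genes)]
top = top_loading_genes(expr, gene_names)
print("Genes driving PC1:", top[0])
```

In this toy case PC1 is dominated by the planted module, which is the kind of co-expressed gene class a PC might correspond to in real data.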

ADD REPLY • link 7.5 years ago by ejm32 ▴ 450

0

I'm just looking for a ballpark explanation. I was told it might represent protein clusters....??

ADD REPLY • link 7.5 years ago by ac1025 • 0

score 2 · Answer 1 · 2016-10-28

Hi

PCA is used for dimension reduction, so that a high-dimensional feature space can be viewed in a lower-dimensional space. In your case the input dataset to PCA was a feature matrix of dimension (1121 x 202), i.e. 1121 observations/genes and 202 features/strains (151 clinical samples and 51 PA14 strains). It is very difficult to visualize data in such a high-dimensional space (202-D). So can we convert this 202-D data to another frame with fewer dimensions, so that we can understand the data better? PCA rotates the input feature space into a new PCA space, which also has 202 dimensions. However, the first few dimensions/principal components (PC1, PC2 and PC3) of this new PCA space capture a large share of the variance in the input space.

In your case, the first three principal components (PC1 to PC3) displayed account for ~47% of the total variance of the data. So the remaining components (PC4 onward) cover the remaining ~53% of the variance.

So in the principal component space (Fig. 2A), there are 202 points corresponding to the 202 samples. Each point is color-coded by its class type. The coordinates of each point in PCA space (PC1, PC2, PC3) can be derived from the input space using the rotation matrix.

Hope this clears it up.
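As a rough illustration of the rotation and the "variance explained" numbers, here is a minimal NumPy sketch on random data with the same shape as described above. It is not the paper's pipeline: the function name `pca_scores` is made up, the data are synthetic, and I am assuming samples are arranged as rows (202 x 1121) so that each sample becomes one point.

```python
import numpy as np

def pca_scores(data, n_components=3):
    """PCA via SVD on a (samples x genes) matrix.

    Returns the sample coordinates on the first n_components PCs and
    the fraction of total variance explained by every PC.
    """
    centered = data - data.mean(axis=0)  # center each gene
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Rows of vt form the rotation; projecting onto the first few rows
    # gives each sample's coordinates (PC1, PC2, PC3).
    scores = centered @ vt[:n_components].T
    explained = s**2 / np.sum(s**2)
    return scores, explained

# Synthetic stand-in: 202 samples x 1121 genes of random noise.
rng = np.random.default_rng(1)
data = rng.normal(size=(202, 1121))

scores, explained = pca_scores(data)
print(scores.shape)  # (202, 3): one 3-D point per sample
print("Fraction of variance in PC1-PC3:", explained[:3].sum())
```

With pure noise the first three PCs explain only a small fraction of the variance; a figure like ~47% in real data indicates strong shared structure across samples.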

Thanks

Priyabrata

Persistent LABS