Question

Data Reduction-Statistical Technique

0

11.7 years ago

jackuser1979 ▴ 890

id          sno1    sno2    sno3    sno4
gene1       23.42   23.4    88.8    98.21
gene2       0       0       99.7    95.5
gene3       77.4    100     44.4    65.6
gene4       0       0       0       0
gene5       100     100     100     100
 :
 :
gene16000   58.3    33.8    78.8    56.6

I have 16000 rows (one per gene ID) and four columns (samples sno1, sno2, sno3 and sno4), with values given as percentages. I want to compare the four samples: how many genes occur (i.e. are at 100%) in both sno1 and sno2, in both sno3 and sno4, and in all four samples. Even after reducing the data to rows containing only 0 (absent) and 100 (present), I still have around 1000 rows. Is there a statistical technique (like normalization) to reduce the dimensionality of the data, so that it is easier to generate a heatmap?
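For reference, the presence counts described in the question could be computed along these lines. This is only a minimal sketch: the toy rows are copied from the example table above, while the helper `present_in` and the `threshold` parameter are hypothetical names introduced for illustration.

```python
# Toy rows copied from the example table (percentage per sample).
SAMPLES = ["sno1", "sno2", "sno3", "sno4"]
table = {
    "gene1": [23.42, 23.4, 88.8, 98.21],
    "gene2": [0, 0, 99.7, 95.5],
    "gene3": [77.4, 100, 44.4, 65.6],
    "gene4": [0, 0, 0, 0],
    "gene5": [100, 100, 100, 100],
}

def present_in(table, samples, threshold=100.0):
    """Return the genes whose value reaches `threshold` in every listed sample.

    `present_in` and `threshold` are illustrative names, not from the thread.
    """
    idx = [SAMPLES.index(s) for s in samples]
    return [gene for gene, values in table.items()
            if all(values[i] >= threshold for i in idx)]

in_sno1_sno2 = present_in(table, ["sno1", "sno2"])
in_sno3_sno4 = present_in(table, ["sno3", "sno4"])
in_all = present_in(table, SAMPLES)

print(len(in_sno1_sno2), len(in_sno3_sno4), len(in_all))
```

Lowering `threshold` (for example to 95) would count near-complete genes as present, which may or may not be appropriate for this data.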

statistics • 2.3k views

ADD COMMENT • link updated 11.7 years ago by Sean Davis 26k • written 11.7 years ago by jackuser1979 ▴ 890

1

Why do you want to reduce the number of rows or columns? Do you just want a smaller heatmap? Around 1000 rows and 4 dimensions actually seems very manageable.

ADD REPLY • link 11.7 years ago by Damian Kao 16k

0

Yes, I want a smaller heatmap.

ADD REPLY • link 11.7 years ago by jackuser1979 ▴ 890

0

I could be completely off base, but maybe you could aggregate by biological function (using the gene ontology). I think it will be difficult to reduce your dataset in an unbiased way.

ADD REPLY • link 11.7 years ago by Zev.Kronenberg 12k

score 1 · Answer 1 · 2012-08-07

1

11.7 years ago

Sean Davis 26k

If you simply want to make a heatmap of genes and samples, you can use variance across samples to rank genes by variation and then choose as many as you like to get the picture you want.

Perhaps your problem is one of visualization, though. If that is the case, try making a larger heatmap (large pdf, for example) or use an interactive viewer such as Java Treeview, TIGR MEV, or Partek.

If you don't care so much about the individual genes (which is what you are suggesting when you invoke dimensionality reduction), you could try PCA, NNMF (non-negative matrix factorization), k-means clustering, or self-organizing maps to generate "pseudogenes" (summary profiles that each stand in for a group of genes) that could then be plotted. Note that if you have only four samples in your dataset, such dimensionality reduction techniques may be of limited value.
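To illustrate the k-means option, here is a minimal pure-Python sketch in which each cluster centroid plays the role of one summary profile. The six profiles are made-up values, and the evenly spaced starting centroids are an assumption chosen for reproducibility; real analyses would normally use a library implementation with a better initialisation.

```python
def kmeans(rows, k, n_iter=20):
    """Plain Lloyd's k-means; returns (centroids, labels).

    Assumes len(rows) >= k. The start is a crude, deterministic choice
    (k evenly spaced rows), made here only so the example is reproducible.
    """
    step = len(rows) // k
    centroids = [list(rows[i * step]) for i in range(k)]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assignment step: attach each gene profile to its nearest centroid.
        for j, row in enumerate(rows):
            dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
            labels[j] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical gene profiles (sno1..sno4), not taken from the thread.
profiles = [
    [0, 0, 100, 100], [5, 0, 95, 100], [0, 5, 100, 95],
    [100, 100, 0, 0], [95, 100, 5, 0], [100, 95, 0, 5],
]
centroids, labels = kmeans(profiles, k=2)
for c in centroids:
    print([round(v, 1) for v in c])
```

Plotting the `k` centroids instead of every gene gives a heatmap with `k` rows, at the cost of hiding the individual genes.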

ADD COMMENT • link 11.7 years ago by Sean Davis 26k

0

I like the idea of using variance across samples to rank genes and then producing a heatmap. Could you explain a bit more?

ADD REPLY • link 11.7 years ago by jackuser1979 ▴ 890

0

Simply calculate the variance for each gene and then choose the N genes with the largest variance. N can be whatever you like.
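A minimal sketch of that ranking, using only the Python standard library. The toy rows are the example values from the question, and `top_variable_genes` is a hypothetical helper name; the sample variance (n - 1 denominator) is assumed, although the ranking is the same with the population variance.

```python
from statistics import variance

# Example rows from the question (percentage per sample: sno1..sno4).
table = {
    "gene1": [23.42, 23.4, 88.8, 98.21],
    "gene2": [0, 0, 99.7, 95.5],
    "gene3": [77.4, 100, 44.4, 65.6],
    "gene4": [0, 0, 0, 0],
    "gene5": [100, 100, 100, 100],
}

def top_variable_genes(table, n):
    """Return the n gene IDs with the largest variance across samples."""
    ranked = sorted(table, key=lambda gene: variance(table[gene]), reverse=True)
    return ranked[:n]

selected = top_variable_genes(table, 3)
print(selected)
```

Genes that are constant across samples (such as gene4 and gene5 here) have zero variance and fall to the bottom of the ranking, so they are the first to be dropped from the heatmap.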

ADD REPLY • link 11.7 years ago by Sean Davis 26k

0

I have calculated the variance for each gene. The maximum variance is 3000 and the minimum is 180. I chose a threshold of 2500 to select genes, which leaves around 50, and plotted the heatmap. Are there any criteria for selecting the threshold?

ADD REPLY • link 11.7 years ago by jackuser1979 ▴ 890

0

I do not know of a general approach that fits all situations.

ADD REPLY • link 11.7 years ago by Sean Davis 26k

0

An ANOVA test gives a variance for each gene. Can I use the mean square between groups (which comes to around 2500) as the threshold to filter the top-ranking genes?

ADD REPLY • link 11.7 years ago by jackuser1979 ▴ 890

0

If you have groups and want to find genes that are different between groups, I suggest doing a statistical test to find differentially-expressed genes. Then plot the genes with the most significant p-values.
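As a rough sketch of that idea, the example below computes a pooled-variance two-sample t statistic per gene, assuming the groups are (sno1, sno2) versus (sno3, sno4) as in the question. Because the degrees of freedom are the same for every gene, ranking by |t| gives the same order as ranking by the two-sided p-value, so no p-value is computed here. The helper `pooled_t` is a hypothetical name, and with only two samples per group such a test has very little power.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance

# Example rows from the question (sno1, sno2, sno3, sno4).
table = {
    "gene1": [23.42, 23.4, 88.8, 98.21],
    "gene2": [0, 0, 99.7, 95.5],
    "gene3": [77.4, 100, 44.4, 65.6],
    "gene4": [0, 0, 0, 0],
    "gene5": [100, 100, 100, 100],
}

def pooled_t(group_a, group_b):
    """Student's two-sample t statistic with pooled variance."""
    na, nb = len(group_a), len(group_b)
    diff = mean(group_a) - mean(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    if pooled_var == 0:
        # No within-group spread: the statistic is undefined; treat identical
        # groups as 0 and any difference as infinitely large.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / sqrt(pooled_var * (1 / na + 1 / nb))

# Assumed grouping: first two columns versus last two columns.
t_stats = {gene: pooled_t(v[:2], v[2:]) for gene, v in table.items()}
ranked = sorted(t_stats, key=lambda gene: abs(t_stats[gene]), reverse=True)
print(ranked[:3])
```

The genes at the top of `ranked` would be the ones to plot; a dedicated differential-expression method is likely to be a better choice for real data.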

ADD REPLY • link 11.7 years ago by Sean Davis 26k