Question

How to visualize a large set of data (FPKM of genes, 80 samples)

2

10.8 years ago

l0o0 ▴ 220

I have 80 RNA-seq datasets from different tissues under different treatments. After running TopHat and Cufflinks, I retrieved each gene's FPKM from the genes.fpkm_tracking file produced by Cufflinks. There are about 20,000 genes in each sample.

I have no idea how to visualize the data in a clear, comprehensive way. I want to display these data in one graph. Any advice is appreciated! Thanks in advance!

visualization • 8.6k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 10.8 years ago by l0o0 ▴ 220

3

What is it you want to visualize? Do you want to visualize how samples can be classified based on expression markers? Then try an unsupervised hierarchical clustering heatmap of the 500 genes with the most variance across all your samples. Do you want to visualize what sample groups there are in your data? Make a 3D scatter plot of the first 3 principal components (PCA). Or do you want to visualize the genes that are differentially expressed across the samples? Use these genes to make a clustering heatmap.
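For anyone wanting to try the top-variance plus PCA route, here is a minimal sketch. It assumes the FPKM values are already in a genes-by-samples NumPy array, uses log2(FPKM + 1) (the pseudocount of 1 is my assumption, not something stated above), and runs PCA through a plain SVD. The function name `top_variance_pca` and the toy matrix are made up for illustration.

```python
import numpy as np


def top_variance_pca(fpkm, n_genes=500, n_components=3):
    """Return (sample coordinates, explained-variance ratios).

    fpkm: array of shape (genes, samples) holding raw FPKM values.
    """
    # Log-transform; the pseudocount of 1 is an assumption to avoid log(0).
    logged = np.log2(np.asarray(fpkm, dtype=float) + 1.0)

    # Keep the n_genes rows with the largest variance across samples.
    variances = logged.var(axis=1)
    keep = np.argsort(variances)[::-1][:n_genes]
    selected = logged[keep]

    # PCA treats samples as observations: transpose to (samples, genes)
    # and centre each gene on its mean before the SVD.
    centred = selected.T - selected.T.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)

    coords = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return coords, explained[:n_components]


# Toy data standing in for the real matrix: 200 genes x 8 samples, where
# 20 genes are shifted upwards in the last 4 samples (two artificial groups).
rng = np.random.default_rng(0)
toy = rng.gamma(shape=2.0, scale=5.0, size=(200, 8))
toy[:20, 4:] += 200.0

coords, explained = top_variance_pca(toy, n_genes=50, n_components=3)
print(coords.shape)   # one row per sample, one column per component
print(explained)      # fraction of variance carried by each component
```

The three columns of `coords` are what you would pass to a 3D scatter plot, and the rows of the selected gene matrix are what you would hand to a clustering heatmap function.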

ADD REPLY • link 10.8 years ago by Irsan ★ 7.8k

0

Thank you for your reply! I just want to visualize the distribution of the FPKM values of 20,000 genes across 80 samples. I will try a 3D scatter plot of sample, FPKM and gene ID.

ADD REPLY • link 10.8 years ago by l0o0 ▴ 220

1

Sounds like a heatmap would work for you.

ADD REPLY • link 10.8 years ago by Devon Ryan 104k

0

If you want to see the distribution of FPKM values in each sample, make a histogram or 1D density plot of the logged (!) FPKM values of each sample. That way you will get a feel for the mean, median, variance, minimum and maximum FPKM values in each sample. To me, it sounds like you do not know exactly what you want from the data. I advise you to set your goals very clearly. Begin with very high-level goals and then define more specific subgoals.
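A rough sketch of the per-sample summary and histogram described here, using only the standard library. The log2(FPKM + 1) transform, the helper names and the two toy samples are my assumptions for illustration; the bin counts can be passed to whatever plotting library you prefer.

```python
import math
import statistics


def log_transform(fpkm_values, pseudocount=1.0):
    """log2(FPKM + pseudocount); the pseudocount avoids log(0)."""
    return [math.log2(v + pseudocount) for v in fpkm_values]


def summarize(logged):
    """Summary statistics of already-logged FPKM values for one sample."""
    return {
        "mean": statistics.mean(logged),
        "median": statistics.median(logged),
        "variance": statistics.pvariance(logged),
        "min": min(logged),
        "max": max(logged),
    }


def histogram_counts(values, n_bins, lo, hi):
    """Count values per equal-width bin on [lo, hi]; hi falls in the last bin."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return counts


# Toy FPKM values (invented); real samples would have about 20,000 each.
samples = {
    "liver_control": [0.0, 1.0, 3.0, 7.0, 15.0, 1023.0],
    "liver_treated": [0.0, 0.0, 1.0, 3.0, 31.0, 255.0],
}

results = {}
for name, values in samples.items():
    logged = log_transform(values)
    results[name] = (summarize(logged), histogram_counts(logged, 5, 0.0, 10.0))
    print(name, results[name][0]["median"], results[name][1])
# liver_control 2.5 [2, 2, 1, 0, 1]
# liver_treated 1.5 [3, 1, 1, 0, 1]
```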

ADD REPLY • link 10.8 years ago by Irsan ★ 7.8k

Ram · Answer 1 · 2014-12-30

2

9.6 years ago

Alex Reynolds 35k

Given that you are working with tissues, the heatmap approach may work well. Here is an example of a visualization I did for expression data for human fetal and adult tissues, for a set of genes of interest:

< image not found >

Here, we show the relative FPKM values for different fetal tissue expression data for BCL6. Expression is relatively enriched in thymus, but there is signal elsewhere, also.

As another example, here is a heatmap of expression data for CEBPA:

< image not found >

The expression data shown here suggest more tissue specificity.

The methodology condenses expression data for a set of tissues from various timepoints. You can explore other genes at the Gene Expression Atlas here: https://expressionatlas.org

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Alex Reynolds 35k

0

Whoa, that's cool. Where does the data come from? The (clean and pretty) website doesn't have the details about data generation. I guess it's RNA-Seq but where did you find a human fetus !? stamlab.org also doesn't have specifics...

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by karl.stamm 4.1k

0

It is RNA-seq data. Kyle has a publication in review. When there's a citation, I'll add it to this post.

ADD REPLY • link 9.6 years ago by Alex Reynolds 35k

score 1 · Answer 2 · 2013-09-24

You have 80 data sets with 20,000 genes each, so you want to visualize 1.6 million data points. Consider that this may be more data than there are pixels on most computer screens, and assess the value of seeing all of the data in a single plot, especially if much of the data is unchanged between samples. Often, a first step is examining the data for variance in gene expression across your conditions and performing some kind of data reduction. What percentage of genes are relatively unchanged across the 80 samples? What fraction of genes changes in any particular sample? If you have genes as rows and conditions as columns, you might find that some conditions contribute large numbers of gene expression changes, while others contribute very little. How you plot the data will depend on the questions you want to bring to it, but if you can apply a filter to weed out rows with little variance in gene expression, you could shrink your 20,000 genes down to something that would fit into a heat map (I would say 80 conditions by 1,000 genes or fewer is reasonable). Other than that, you might first tackle some summary statistics to characterize which samples are contributing which properties to your matrix, or try something like PCA.
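To make the filtering step concrete, here is a small sketch of a variance filter. It assumes a dictionary mapping gene IDs to per-sample FPKM lists and computes variance on log2(FPKM + 1); the transform, the threshold of 0.1 and all names are illustrative assumptions, not a recommended cutoff.

```python
import math
import statistics


def log_variances(fpkm_by_gene, pseudocount=1.0):
    """Variance of log2(FPKM + pseudocount) across samples, per gene."""
    return {
        gene: statistics.pvariance([math.log2(v + pseudocount) for v in values])
        for gene, values in fpkm_by_gene.items()
    }


def most_variable(variances, max_genes=1000):
    """Gene IDs sorted by decreasing variance, truncated to max_genes."""
    return sorted(variances, key=variances.get, reverse=True)[:max_genes]


def fraction_unchanged(variances, threshold):
    """Fraction of genes whose variance is at or below the threshold."""
    return sum(v <= threshold for v in variances.values()) / len(variances)


# Size of the full matrix versus a filtered heat map.
print(80 * 20_000)  # 1600000 values in total
print(80 * 1_000)   # 80000 cells after keeping 1,000 genes

# Toy matrix (invented values): 3 genes x 4 samples.
toy = {
    "geneA": [0.0, 0.0, 15.0, 15.0],  # logged: 0, 0, 4, 4
    "geneB": [7.0, 7.0, 7.0, 7.0],    # constant across samples
    "geneC": [1.0, 3.0, 1.0, 3.0],    # logged: 1, 2, 1, 2
}
variances = log_variances(toy)
kept = most_variable(variances, max_genes=2)
print(kept)                                # ['geneA', 'geneC']
print(fraction_unchanged(variances, 0.1))  # one of three genes
```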

score 0 · Answer 3 · 2013-09-24

0

10.8 years ago

Chris Cabanski ▴ 330

You could bin the FPKM values and plot the KDE curve (smoothed histogram) of each sample. 80 samples (curves) may result in a lot of overplotting, so it may be useful to combine the samples into groups (treatment/tissue) and draw one curve per group. This is a simple way to check whether the distributions differ between groups. Another option is to use boxplots.
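As a sketch of the one-curve-per-group idea: pool the logged FPKM values of all samples in a group, then evaluate a Gaussian KDE on a grid. Everything here is an assumption for illustration: the log2(FPKM + 1) transform, the normal-reference bandwidth (1.06 × standard deviation × n^(-1/5)), the helper names and the toy samples.

```python
import math
import statistics
from collections import defaultdict


def pool_by_group(fpkm_by_sample, group_of, pseudocount=1.0):
    """Collect log2(FPKM + pseudocount) values of all samples in each group."""
    pooled = defaultdict(list)
    for sample, values in fpkm_by_sample.items():
        pooled[group_of[sample]].extend(
            math.log2(v + pseudocount) for v in values
        )
    return dict(pooled)


def gaussian_kde(values, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    n = len(values)
    if bandwidth is None:
        # Normal-reference rule of thumb; needs non-constant values.
        bandwidth = 1.06 * statistics.stdev(values) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


# Toy data (invented): 4 samples in 2 groups, 4 genes each.
fpkm_by_sample = {
    "leaf_rep1": [0.0, 1.0, 3.0, 7.0],
    "leaf_rep2": [1.0, 1.0, 7.0, 15.0],
    "root_rep1": [0.0, 0.0, 31.0, 63.0],
    "root_rep2": [0.0, 3.0, 15.0, 127.0],
}
group_of = {
    "leaf_rep1": "leaf", "leaf_rep2": "leaf",
    "root_rep1": "root", "root_rep2": "root",
}

pooled = pool_by_group(fpkm_by_sample, group_of)
grid = [i * 0.25 for i in range(41)]  # 0 to 10 on the log2 scale
curves = {group: gaussian_kde(vals, grid) for group, vals in pooled.items()}
for group, density in curves.items():
    print(group, len(density), round(max(density), 3))
```

Each entry of `curves` is one line to draw against `grid`, so 80 samples collapse to as many curves as you have groups. The pooled lists in `pooled` can also be passed directly to a boxplot function.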

A few different examples of these types of graphs are shown in the GENCODE paper.

ADD COMMENT • link 10.8 years ago by Chris Cabanski ▴ 330

0

Yeah! The 80 samples can be combined into different groups. I will give it a try! Thank you for your reply.

ADD REPLY • link 10.8 years ago by l0o0 ▴ 220