Question: EDA of TCGA RNASeqV2 data
12 months ago, msrtrs wrote:

I am currently working with the TCGA dataset and have trained several machine learning classifiers on it. I noticed that samples from anatomically neighboring tissues are often confused with each other (e.g. READ and COAD). I would now like to investigate these results further with exploratory data analysis (EDA).

Unfortunately, I am not very familiar with high-dimensional genomic data. For this reason I have done the following:

To get a visual impression, I grouped the samples by cancer type and calculated the mean expression of each gene within each group. This resulted in a 33 x 20500 matrix in which the entry in the i-th row and j-th column is the mean of the j-th gene across all samples of the i-th cancer type. Afterwards, I selected the cancer types that were most often misclassified as each other and plotted their per-gene means against each other in a scatter plot. Here is a zoomed-in version of the plot:

[Image: scatter plot of READ means vs. COAD means]
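
The grouping step described above could be sketched roughly as follows. This is only an illustration on toy data, not the actual pipeline: the function name `mean_by_cancer_type`, the gene count, and the random values are assumptions made for the example, and the real matrix would be 33 x 20500.

```python
import numpy as np

def mean_by_cancer_type(expr, labels):
    """Average the rows of expr (samples x genes) within each cancer type.

    Returns the sorted cancer type names and a (types x genes) matrix whose
    entry (i, j) is the mean of gene j over all samples of type i.
    """
    labels = np.asarray(labels)
    types = np.unique(labels)  # sorted unique cancer type names
    means = np.vstack([expr[labels == t].mean(axis=0) for t in types])
    return types, means

# Toy stand-in data (hypothetical values): 6 samples x 4 genes.
rng = np.random.default_rng(0)
expr = rng.gamma(shape=2.0, scale=100.0, size=(6, 4))
labels = ["COAD", "COAD", "COAD", "READ", "READ", "READ"]

types, means = mean_by_cancer_type(expr, labels)
coad_mean = means[list(types).index("COAD")]  # one value per gene
read_mean = means[list(types).index("READ")]
```

Plotting `coad_mean` against `read_mean` then gives one point per gene, which is the kind of scatter plot shown above.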

Additionally, I calculated the Pearson correlation coefficient r between the two vectors of means, which came out at 0.99 (a high value, which I expected after seeing the plot).
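
For reference, a minimal sketch of how such a Pearson r can be computed with NumPy. The two vectors here are simulated placeholders, not TCGA values. The log2(x + 1) variant is an extra I am assuming could be useful, since on the raw scale a few highly expressed genes can dominate the coefficient.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    return float(np.corrcoef(x, y)[0, 1])

# Simulated per-gene means (hypothetical): the second vector is the first
# one multiplied by a small amount of log-normal noise.
rng = np.random.default_rng(1)
coad_mean = rng.gamma(shape=2.0, scale=100.0, size=1000)
read_mean = coad_mean * rng.lognormal(mean=0.0, sigma=0.1, size=1000)

r_raw = pearson_r(coad_mean, read_mean)
# Same statistic on log2(x + 1); the pseudocount of 1 is an assumption.
r_log = pearson_r(np.log2(coad_mean + 1), np.log2(read_mean + 1))
print(f"raw scale r = {r_raw:.3f}, log scale r = {r_log:.3f}")
```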

After that, I used the same data to create an MA-Plot that looks like this:

[Image: MA-Plot of COAD means and READ means]
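
In case it helps to make the plot reproducible, this is roughly how the M and A values of an MA-Plot are usually defined: M is the log2 ratio of the two means and A is the average of the two log2 means. The direction of the ratio (COAD over READ) and the pseudocount of 1 are assumptions for this sketch.

```python
import numpy as np

def ma_values(x, y, pseudocount=1.0):
    """Return (M, A) for two vectors of non-negative expression means.

    M = log2(x + c) - log2(y + c)           (log fold change)
    A = 0.5 * (log2(x + c) + log2(y + c))   (average log expression)
    """
    lx = np.log2(np.asarray(x, dtype=float) + pseudocount)
    ly = np.log2(np.asarray(y, dtype=float) + pseudocount)
    return lx - ly, 0.5 * (lx + ly)

# Tiny hand-checkable example (hypothetical values, two genes).
coad_mean = [3.0, 7.0]
read_mean = [1.0, 1.0]
M, A = ma_values(coad_mean, read_mean)
```

Genes with M close to 0 across the whole range of A are the ones with little difference between the two group means.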

Using the above plots, I would like to argue that the misclassifications stem from the high correlation between the per-gene mean values of the two cancer types.

To be on the safe side, I would like to know whether these visualizations are meaningful, or whether the correlation and low differential expression shown by the plots could have arisen by chance. I am also wondering whether there are other useful tests and visualizations that I could use to confirm this conclusion.

rna-seq next-gen • 200 views