I am currently working with the TCGA dataset and have examined the data using different machine learning algorithms. I noticed that samples from anatomically neighboring regions are often misclassified as each other (e.g. READ and COAD). I would now like to investigate these results further with exploratory data analysis (EDA).
Unfortunately, I am not very familiar with high-dimensional genomic data. For this reason I have done the following:
To get a visual impression, I grouped the samples by cancer type and calculated the mean of each gene within each group. This resulted in a 33 x 20500 matrix in which the entry in the i-th row and j-th column is the mean of the j-th gene across all samples of the i-th cancer type. Afterwards, I selected the cancer types that were most often misclassified as each other and visualized their mean values with a scatter plot. Here is a zoomed-in version of the plot:
Additionally, I calculated the Pearson correlation coefficient r, which came out at 0.99 (a high value, which I expected after seeing the plot).
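A minimal sketch of the aggregation and correlation steps described above, assuming the data is available as a samples x genes NumPy array with one cancer-type label per sample. The function names `mean_profiles` and `pearson_r` and the toy data are made up for illustration and are not taken from TCGA or from any particular library:

```python
import numpy as np


def mean_profiles(expr, labels):
    """Return (types, means), where means[i, j] is the mean of gene j
    over all samples whose label equals types[i]."""
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    types = np.unique(labels)  # sorted unique cancer types
    means = np.vstack([expr[labels == t].mean(axis=0) for t in types])
    return types, means


def pearson_r(x, y):
    """Pearson correlation coefficient between two mean profiles."""
    return float(np.corrcoef(x, y)[0, 1])


# Toy data standing in for the real matrix (assumption: 12 samples, 200 genes).
rng = np.random.default_rng(0)
expr = rng.gamma(shape=2.0, scale=50.0, size=(12, 200))
labels = np.array(["COAD"] * 6 + ["READ"] * 6)

types, means = mean_profiles(expr, labels)
coad = means[types == "COAD"][0]
read = means[types == "READ"][0]
r = pearson_r(coad, read)
print(means.shape, round(r, 3))
```

On the real data, `means` would have shape 33 x 20500, and `coad` and `read` would be the two rows shown against each other in the scatter plot.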
After that, I used the same data to create an MA plot, which looks like this:
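For reference, the values behind an MA plot of two mean profiles can be sketched as follows: M is the log2 ratio of the two means per gene and A is the average of their log2 values. The pseudocount of 1 (to avoid taking the log of zero), the function name `ma_values`, and the small input vectors are assumptions for illustration, not part of the analysis above:

```python
import numpy as np


def ma_values(x, y, pseudocount=1.0):
    """Return (M, A) for two per-gene mean vectors.

    M = log2(x) - log2(y)        (log fold change)
    A = (log2(x) + log2(y)) / 2  (mean log intensity)
    A pseudocount is added first so that zero means do not yield -inf.
    """
    lx = np.log2(np.asarray(x, dtype=float) + pseudocount)
    ly = np.log2(np.asarray(y, dtype=float) + pseudocount)
    return lx - ly, 0.5 * (lx + ly)


# Hypothetical per-gene means for two cancer types (four genes).
coad_mean = np.array([0.0, 3.0, 15.0, 255.0])
read_mean = np.array([0.0, 7.0, 15.0, 63.0])

m, a = ma_values(coad_mean, read_mean)
print(m)  # log fold change per gene
print(a)  # mean log intensity per gene
```

Genes with M close to 0 have nearly identical means in both cancer types; plotting `m` against `a` gives the MA plot.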
Using the plots above, I would like to argue that the misclassifications originate from the high correlation between the per-gene mean values of these cancer types.
To be on the safe side, I would like to know whether these visualizations are meaningful or whether the correlation and low differential expression shown in the plots could be due to chance. I am also wondering whether there are other useful tests and visualizations that I could use to confirm this conclusion.