Hi friends
I want to make a correlation matrix in R for HT-Seq data and a continuous variable (each patient has one value), to see how the genes correlate with the variable. I have HT-Seq TCGA data. What should the input file look like? If rows are genes and columns are patients, where does my continuous variable go?
Thanks
Hi Kevin, thanks for responding. Actually, I did not try because I don't know how to make my input file. What goes in the columns and rows?
I am comparing differentially expressed genes with the cognition variable of TCGA patients.
I don't understand this part of your answer: "check your continuous variables for any outlier values, which could bias the correlation." Do you mean that the continuous variable may have outliers, or that the RNA-seq data has outliers?
Also, in this part: "I would be correlating this variable to the normalised and transformed HTseq data using Pearson correlation (or Spearman if your dataset is small)". Is it necessary to normalise the HT-seq data, given that it follows a negative binomial distribution and not a normal distribution? If yes, is log2 transformation a good way to normalise HT-seq data?
Thanks
Hi. Have you already processed the data, or do you just have HTseq raw counts?
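No, you should not use the logged HTseq raw counts. Taking log2 of raw counts does not account for differences in sequencing depth between samples. You need to normalise and transform the raw counts, and then use the transformed data for the correlation.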
Either or both can contain outliers: the continuous variable and the RNA-seq data.
As for the input file, this is why you need to run some test examples on your computer, so that you can understand the correct input format.
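One possible layout, as a minimal sketch with toy data: keep the expression values as a genes x patients matrix and hold the continuous variable in a separate named vector, one value per patient. The names `expr`, `score` and `res` are made up for illustration, and the random numbers only stand in for normalised and transformed expression values.

```r
set.seed(1)

# Toy stand-in for normalised + transformed expression:
# genes in rows, patients in columns
expr <- matrix(rnorm(5 * 8), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("patient", 1:8)))

# Continuous variable: one value per patient, named by patient ID
# (deliberately stored in a different order from the matrix columns)
score <- setNames(rnorm(8), paste0("patient", 8:1))

# Align the variable to the column order of the matrix before correlating
score <- score[colnames(expr)]
stopifnot(identical(names(score), colnames(expr)))

# cor() correlates columns, so transpose to patients x genes
r <- cor(t(expr), score, method = "pearson")[, 1]

# One p-value per gene, then adjust for multiple testing
p <- apply(expr, 1, function(g) cor.test(g, score, method = "pearson")$p.value)

res <- data.frame(gene = rownames(expr), r = r, p = p,
                  padj = p.adjust(p, method = "BH"))
res[order(res$padj), ]
```

So the continuous variable does not need a row or column inside the expression matrix. It only has to be in the same patient order as the columns. Swap `method = "pearson"` for `"spearman"` if that suits your data better.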
---------
Thanks!
I have raw count HT-seq data. So, if I should not use logged HT-seq data (as you mentioned), how should I normalise? For normalisation, I usually do a log2 transformation.
Also, how should I deal with an outlier? Can I remove it manually?
For normalisation and transformation, try DESeq2, which is user friendly: https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html.
You will also find much other help online for this.
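A rough sketch of what that could look like with DESeq2, assuming the package is installed from Bioconductor. The counts here are simulated and only stand in for a real HTSeq count matrix. The names `counts`, `coldata` and `score` are placeholders, and `design = ~ 1` is an assumption made because no group comparison is being fitted in this step.

```r
library(DESeq2)

set.seed(1)
n_genes <- 2000
n_patients <- 12

# Simulated raw counts standing in for the HTSeq matrix (genes x patients)
mu <- rep(2^runif(n_genes, 4, 12), times = n_patients)
counts <- matrix(rnbinom(n_genes * n_patients, mu = mu, size = 1 / (0.1 + 4 / mu)),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("patient", seq_len(n_patients))))

# Sample table: one row per patient, in the same order as the count columns
coldata <- data.frame(score = rnorm(n_patients), row.names = colnames(counts))

# Build the DESeq2 object from the raw counts
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ 1)

# Variance-stabilising transformation: normalises for library size and
# returns values on a log2-like scale
vsd <- vst(dds, blind = TRUE)

# Genes x patients matrix of transformed values, ready for correlation
expr <- assay(vsd)
```

The `expr` matrix from this step, not the log2 of the raw counts, is what would go into the correlation.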
At the most basic level, you can check for an outlier sample via a PCA bi-plot that is generated on the normalised + transformed count data. For the continuous variable that you have, you can check for an outlier by simply plotting the values and generating summary statistics like mean, median, IQR, range, standard deviation, and more.
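Both checks can be sketched in base R. This is toy data with one outlier planted on purpose in the last patient. `expr` and `score` are illustrative names, and the 1.5 x IQR rule used by `boxplot.stats()` is just one common convention for flagging values, not a rule from this thread.

```r
set.seed(1)

# Toy transformed expression (genes x patients), last patient shifted
expr <- matrix(rnorm(200 * 10), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("patient", 1:10)))
expr[, 10] <- expr[, 10] + 3

# Toy continuous variable with one extreme value
score <- setNames(c(rnorm(9, mean = 50, sd = 5), 120), colnames(expr))

# PCA on samples: prcomp() wants samples in rows, so transpose
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x), pos = 3)

# Summary statistics and a quick look at the continuous variable
summary(score)
c(sd = sd(score), IQR = IQR(score))
range(score)
boxplot(score, main = "Continuous variable")

# Patients whose value falls outside the boxplot whiskers (1.5 x IQR)
flagged <- names(score)[score %in% boxplot.stats(score)$out]
flagged
```

A sample that sits far away from the rest on the PCA plot, or a value listed in `flagged`, is a candidate to look at more closely before running the correlation.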