Question

correlation matrix for HTSeq data

0

3.6 years ago

Rob ▴ 170

Hi friends

I want to make a correlation matrix in R for HT-Seq data and a continuous variable (each patient has one value), to see how each gene correlates with the variable. I have HT-Seq TCGA data. What should the input file look like? If rows are genes and columns are patients, where does my continuous variable go?

Thanks

RNA-Seq • 814 views

score 2 · Answer 1 · 2020-09-28

2

3.6 years ago

Kevin Blighe 87k

Can you confirm that you have already tried a few scenarios on how to do this, please? At the simplest level, you just need the cor() function in the stats package in R. I think that it's feasible that you test its usage in relation to the data that you have.

By the way, I would be correlating this variable to the normalised and transformed HTseq data using Pearson correlation (or Spearman if your dataset is small). Also, you should check your continuous variables for any outlier values, which could bias the correlation.
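On the input format question, here is a minimal sketch of one way to set this up with cor(). It assumes a genes x patients matrix of already normalised and transformed expression values, and keeps the continuous variable in a separate named vector rather than inside the matrix. The data are simulated, and the names expr and trait are placeholders, not anything from TCGA.

```r
# Simulated stand-in for normalised + transformed expression values:
# rows are genes, columns are patients.
set.seed(1)
n_genes <- 5
n_patients <- 20
expr <- matrix(rnorm(n_genes * n_patients), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes),
                               paste0("patient", 1:n_patients)))

# The continuous variable is a separate named vector, one value per patient
# (simulated here; 'trait' is a placeholder name).
trait <- setNames(rnorm(n_patients), colnames(expr))

# Make sure the variable is in the same patient order as the matrix columns.
trait <- trait[colnames(expr)]
stopifnot(identical(names(trait), colnames(expr)))

# cor() works column-wise, so transpose to patients x genes.
# The result has one correlation coefficient per gene.
gene_cor <- cor(t(expr), trait, method = "pearson")[, 1]

# Optional: a p-value per gene from cor.test().
p_values <- apply(expr, 1, function(x) cor.test(x, trait)$p.value)

result <- data.frame(gene = rownames(expr), r = gene_cor, p = p_values)
print(result)
```

Switching to method = "spearman" in both cor() and cor.test() gives the rank-based version. With thousands of genes, the p-values would normally also need multiple-testing adjustment, for example with p.adjust().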

Kevin

0

Hi Kevin, thanks for responding. Actually, I did not try because I am unsure how to build my input file. What goes in the columns and rows?

I am comparing differentially expressed genes with the cognition variable of TCGA patients.

I don't understand this part of your answer: "check your continuous variables for any outlier values, which could bias the correlation." Do you mean that the continuous variable may have outliers, or that the RNA-seq data has outliers?

Also, in this part: "I would be correlating this variable to the normalised and transformed HTseq data using Pearson correlation (or Spearman if your dataset is small)", is it necessary to normalize the HT-Seq data, given that it follows a negative binomial distribution and not a normal distribution? If yes, is log2 transformation a good way to normalize HT-Seq data?

Thanks

ADD REPLY • link 3.6 years ago by Rob ▴ 170

0

Hi. You have already processed the data, correct? Or do you just have HTSeq raw counts?

"Also, in this part: 'I would be correlating this variable to the normalised and transformed HTseq data using Pearson correlation (or Spearman if your dataset is small)', is it necessary to normalize the HT-Seq data, given that it follows a negative binomial distribution and not a normal distribution? If yes, is log2 transformation a good way to normalize HT-Seq data?"

No, you should not use the logged HTseq raw counts. You need to normalise and transform this data, and then use the transformed data for the correlation.

"I don't understand this part of your answer: 'check your continuous variables for any outlier values, which could bias the correlation.' Do you mean that the continuous variable may have outliers, or that the RNA-seq data has outliers?"

Either or both can contain outliers.

"Actually, I did not try because I am unsure how to build my input file. What goes in the columns and rows?"

This is why you need to run some test examples on your computer, so that you can understand the correct input format.

---------

Thanks!

ADD REPLY • link 3.6 years ago by Kevin Blighe 87k

0

I have raw count HT-Seq data. So, if I should not use logged HT-Seq data (as you mentioned), how should I normalize? For normalization, I usually do a log2 transformation.

Also, how should I deal with an outlier? Can I remove it manually?

ADD REPLY • link 3.6 years ago by Rob ▴ 170

1

For normalisation and transformation, try DESeq2, which is user friendly: https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html.
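As a rough sketch of what that step can look like, the block below runs DESeq2's variance-stabilising transformation on simulated counts produced by makeExampleDESeqDataSet(), standing in for real HTSeq output. It assumes DESeq2 is installed from Bioconductor. The commented-out line shows how a real count matrix would be loaded; counts and sample_info there are hypothetical object names.

```r
library(DESeq2)  # Bioconductor package; also attaches SummarizedExperiment

# Simulated raw counts standing in for HTSeq output: 1000 genes x 12 samples.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)

# With real data the object would instead be built from the raw count matrix
# ('counts' and 'sample_info' are hypothetical names):
# dds <- DESeqDataSetFromMatrix(countData = counts,
#                               colData = sample_info,
#                               design = ~ 1)

# Variance-stabilising transformation: accounts for library size via size
# factors and returns values on a log2-like scale.
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)

# Extract the transformed matrix (genes x samples) for use with cor().
vsd_mat <- assay(vsd)
dim(vsd_mat)
```

For a dataset the size of TCGA, the vst() wrapper does the same job faster by estimating the dispersion trend on a subset of genes.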

You will also find much other help online for this.

At the most basic level, you can check for an outlier sample via a PCA bi-plot that is generated on the normalised + transformed count data. For the continuous variable that you have, you can check for an outlier by simply plotting the values and generating summary statistics like mean, median, IQR, range, standard deviation, and more.
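Both checks can be sketched in base R as follows. The data are simulated, with one sample and one value of the continuous variable deliberately made extreme so that the checks have something to find. The names expr and trait are placeholders, and the 1.5 x IQR rule used by boxplot.stats() is just one common convention, not a rule from this thread.

```r
set.seed(1)

# Simulated transformed expression: 200 genes x 15 patients.
expr <- matrix(rnorm(200 * 15), nrow = 200,
               dimnames = list(paste0("gene", 1:200),
                               paste0("patient", 1:15)))
# Make one patient deliberately aberrant for illustration.
expr[, "patient15"] <- expr[, "patient15"] + 4

# PCA on samples: prcomp() expects samples in rows, so transpose.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
pc1 <- pca$x[, "PC1"]

# The sample furthest from the rest along PC1 is the first one to inspect.
suspect_sample <- names(which.max(abs(pc1)))
print(suspect_sample)

# Simulated continuous variable with one extreme value.
trait <- setNames(c(rnorm(14, mean = 50, sd = 5), 120), colnames(expr))

# Summary statistics: min, quartiles, median, mean, max, then spread.
print(summary(trait))
print(c(sd = sd(trait), IQR = IQR(trait)))
print(range(trait))

# Values beyond 1.5 x IQR from the hinges, as drawn on a boxplot.
flagged <- boxplot.stats(trait)$out
print(flagged)

# Plots, shown only in an interactive session.
if (interactive()) {
  plot(pca$x[, 1:2], main = "PCA of samples")
  text(pca$x[, 1:2], labels = rownames(pca$x), pos = 3)
  boxplot(trait, main = "Continuous variable")
}
```

A flagged value is a prompt to look at that sample or measurement more closely, not an automatic reason to drop it.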

ADD REPLY • link 3.6 years ago by Kevin Blighe 87k