Question

Find all genes that significantly correlate with a particular geneA?

0

7.0 years ago

tolgaturant ▴ 20

Hello,

I want to find all genes that positively or negatively correlate with a particular gene A in an RNA-seq dataset. Can I use the normalized expression values of gene A as a continuous variable and perform expression profiling using DESeq2? Would gene length and GC content be an issue in such a case?

Thank you,

DESeq2 Continuous variable Correlation RNA-Seq • 3.5k views

ADD COMMENT • link updated 7.0 years ago by theobroma22 ★ 1.2k • written 7.0 years ago by tolgaturant ▴ 20

1

Your question is not very clear. Are you saying that you have dosage levels of a gene across different samples in different conditions in your RNA-Seq data? Or do you simply want to find, from your differential expression results, genes that are highly correlated with others?

For the former, you need to include the levels of the gene with varying dosage in your model matrix, regress on it, and find the differential expression across your conditions of interest. This way you can understand the effect of that gene's dosage on your samples across conditions and how the transcriptome is affected.

If you simply want to find the genes co-expressed with your particular gene of interest, which is also differentially expressed, then the approach is slightly different. Project all your DE genes, using normalized values such as TPM or FPKM, in a PCA and compute the distances between your gene of interest and the other genes with KNN methods. The nearest genes are the co-expressed ones.

However, you can also take a look at WGCNA.

You have to be a bit clearer with the question.

ADD REPLY • link 7.0 years ago by ivivek_ngs ★ 5.2k

0

Thank you for your answer. It is the first one. I am interested in gene-to-gene correlation without considering any case vs control model. For example, in a clinical dataset from a uniform cohort, which genes are high when the gene of interest is high, or low when the gene of interest is low? My question is: can DESeq2 normalization be used for gene-to-gene correlation, as it is for case vs control designs?

ADD REPLY • link 7.0 years ago by tolgaturant ▴ 20

0

Why not just calculate Pearson/Spearman correlations between your gene A and all other genes and keep those with a satisfactory R^2 coefficient or p-value?

ADD REPLY • link 7.0 years ago by WouterDeCoster 47k
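
A minimal sketch of the correlation approach suggested above, in base R. The expression matrix is simulated and the gene names are hypothetical; with real data, `expr` would be a normalized (ideally log-scale) matrix with genes in rows and samples in columns. Because every gene is tested, the p-values are adjusted for multiple testing, which the comment does not mention but is usually needed.

```r
# Toy data: rows are genes, columns are samples (hypothetical values).
set.seed(1)
n_samples <- 30
geneA <- rnorm(n_samples, mean = 8)
expr <- rbind(
  geneA = geneA,
  geneB = geneA * 0.9 + rnorm(n_samples, sd = 0.2),  # built to track geneA
  geneC = -geneA + rnorm(n_samples, sd = 0.2),       # built to oppose geneA
  geneD = rnorm(n_samples, mean = 8)                 # unrelated noise
)

# Correlate geneA with every other gene; keep the estimate and p-value.
others <- setdiff(rownames(expr), "geneA")
res <- do.call(rbind, lapply(others, function(g) {
  ct <- cor.test(expr["geneA", ], expr[g, ], method = "spearman")
  data.frame(gene = g, rho = unname(ct$estimate), p = ct$p.value)
}))

# Adjust for testing many genes (Benjamini-Hochberg).
res$padj <- p.adjust(res$p, method = "BH")
print(res[order(res$padj), ])
```

Swapping `method = "spearman"` for `"pearson"` gives the linear version; Spearman is used here only because it is less sensitive to outlier samples.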

0

That's exactly what I want to do. But should gene length and GC content be an issue?

ADD REPLY • link 7.0 years ago by tolgaturant ▴ 20

1

I don't think that's a real issue. You just want to know if gene A goes up, which other genes go up as well. If these genes are longer they'll have more reads but that's not relevant, it's about the correlation.

Of course, to correct for gene length you could use TPM values. For GC content there is not much you can do.

A simple correlation could be sufficient, but I think a more robust framework for your analysis would be WGCNA, and then find the cluster of genes which (anti)correlate to gene A.

ADD REPLY • link 7.0 years ago by WouterDeCoster 47k
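
The point above, that a longer gene having more reads does not matter for correlation, can be checked with a small sketch. The counts and gene lengths below are made up for illustration. Dividing a gene's values by a constant such as its length rescales that gene identically in every sample, so its Pearson correlation with gene A should be unchanged.

```r
set.seed(2)
n_samples <- 20
geneA <- rpois(n_samples, lambda = 200)
geneB <- rpois(n_samples, lambda = geneA * 5)  # hypothetical longer gene tracking geneA

len_A <- 1000  # hypothetical gene lengths in bp
len_B <- 5000

r_raw    <- cor(geneA, geneB)                  # correlation on raw counts
r_scaled <- cor(geneA / len_A, geneB / len_B)  # correlation after length scaling

# Compare the two coefficients up to floating-point tolerance.
same <- isTRUE(all.equal(r_raw, r_scaled))
print(c(raw = r_raw, scaled = r_scaled))
print(same)
```

This only covers per-gene constants. Per-sample factors such as sequencing depth do change the correlation, which is why normalization across samples still matters.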

0

Thank you very much, I will try WGCNA and compare with DESeq2!

ADD REPLY • link 7.0 years ago by tolgaturant ▴ 20

1

To be honest, simple correlation will work, but why do you then want to use DESeq2? What is the purpose of using DESeq2 here? I do not see any. If you simply want to find which genes correlate and anti-correlate with your gene A, then just do what @WouterDeCoster said and compute the "Pearson/Spearman correlations between your gene A and all other genes and get those with a satisfactory R^2 coefficient or p-value".

Alternatively, you can use WGCNA to find the clusters of co-expressed genes; the genes that cluster with your gene A will be the ones that best correlate or anti-correlate with changes in expression of gene A.

DESeq2 is used for differential expression: you normalize the counts and then fit a model whose design matrix includes the factors needed for the differential expression. I do not see the need here unless you have 2 conditions whose differences you want to see. If you are worried about gene length, use TPM or FPKM values, which are normalized for gene length. Or you can use the DESeq2 normalization on its own, before any DE analysis, and use those normalized counts for the correlation described by @WouterDeCoster. DESeq2 has normalization functions you can use without running the differential expression itself. Your query needs to be clearly set.

ADD REPLY • link 7.0 years ago by ivivek_ngs ★ 5.2k
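
To make "normalize first, then correlate" concrete, here is a rough base-R sketch of median-of-ratios size factors, the idea behind DESeq2's across-sample normalization. It is a simplified stand-in and not DESeq2's own code: the count matrix is simulated, the gene and sample names are hypothetical, and 1 is added to every count so that no gene has a zero. With a real DESeqDataSet, the package's own `estimateSizeFactors()` followed by `counts(dds, normalized = TRUE)` would be the usual route.

```r
# Hypothetical raw count matrix: genes in rows, samples in columns.
set.seed(3)
base  <- rpois(200, lambda = 100) + 1   # per-gene baseline, kept above zero
depth <- c(0.5, 1, 1.5, 2, 3, 4)        # simulated sequencing depth per sample
counts <- sapply(depth, function(d) rpois(length(base), lambda = base * d) + 1)
rownames(counts) <- paste0("gene", seq_len(nrow(counts)))
colnames(counts) <- paste0("sample", seq_along(depth))

# Median-of-ratios: compare each sample to the per-gene geometric mean.
log_geo_mean <- rowMeans(log(counts))
size_factors <- apply(counts, 2, function(s) exp(median(log(s) - log_geo_mean)))

# Divide each sample (column) by its size factor, then log-transform.
norm_counts <- sweep(counts, 2, size_factors, "/")
log_norm <- log2(norm_counts + 1)

# Correlate the gene of interest (here gene1) with all other genes.
rho <- apply(log_norm[-1, ], 1, function(g) {
  cor(log_norm["gene1", ], g, method = "spearman")
})
print(round(size_factors, 2))
```

In this simulation the genes are independent, so the coefficients in `rho` should scatter around zero once depth is removed. Running the same correlation on the raw `counts` would instead make nearly every gene look correlated with gene1, purely because of sequencing depth.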

1

I want to use DESeq2 for normalization across samples. Plus I have other comparisons that included 2 conditions where I used DESeq2 on the same dataset. So if it is possible to use DESeq2 for gene to gene correlation I want to use DESeq2 normalization here as well.

ADD REPLY • link 7.0 years ago by tolgaturant ▴ 20

score 0 · Answer 1 · 2017-04-05

0

7.0 years ago

theobroma22 ★ 1.2k

You can use the variables plot of PCA. Also, using a 3D version will help to better discern the correlation since a 2D plot can be somewhat misleading.

ADD COMMENT • link 7.0 years ago by theobroma22 ★ 1.2k

0

Thank you. Are you referring to principal component analysis? Plotting and reducing the data would also help, but I am looking for p-values and estimates for each gene in the dataset. I am assuming some genes will significantly correlate with geneA and others will not. My question is whether such a model is reliable, considering that each gene has a different length and GC content. I adapted the code below from Bioconductor:

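library(GenomicAlignments)  # provides summarizeOverlaps()
library(DESeq2)
se <- summarizeOverlaps(features=ebg, reads=bamfiles1, mode="Union", singleEnd=TRUE, ignore.strand=TRUE)
# normalizedExpressionGeneA must be a numeric column of colData(se)
se_dds <- DESeqDataSet(se, design = ~ normalizedExpressionGeneA)
deseq_se <- DESeq(se_dds)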

ADD REPLY • link 7.0 years ago by tolgaturant ▴ 20
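
For readers who want per-gene estimates and p-values without a DESeq2 run, the same idea (expression of each gene regressed on gene A as a continuous covariate) can be sketched with ordinary linear models on log-scale values. This swaps an ordinary least squares fit in for DESeq2's negative binomial model, so it illustrates the design and does not reproduce the results. All values and gene names below are simulated.

```r
set.seed(4)
n_samples <- 25
geneA <- rnorm(n_samples, mean = 8)  # log-scale expression of the gene of interest
log_expr <- rbind(
  geneB = 2 + 0.8 * geneA + rnorm(n_samples, sd = 0.3),   # rises with geneA
  geneC = 12 - 0.5 * geneA + rnorm(n_samples, sd = 0.3),  # falls with geneA
  geneD = rnorm(n_samples, mean = 8)                      # unrelated
)

# Fit expression ~ geneA for one gene and return the slope and its p-value.
fit_one <- function(y) {
  co <- summary(lm(y ~ geneA))$coefficients
  c(slope = co["geneA", "Estimate"], p = co["geneA", "Pr(>|t|)"])
}

fits <- as.data.frame(t(apply(log_expr, 1, fit_one)))
fits$padj <- p.adjust(fits$p, method = "BH")
print(fits)
```

Gene A itself is left out of the response matrix, since regressing a gene on its own expression is circular. Each gene gets its own slope, so a constant per-gene effect such as length is absorbed into that gene's intercept on the log scale.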

0

Typically GC content and length are used to normalize the data. Gene correlation is typically based on the normalized expression values. So, if you want to correlate genes based on RNA-seq data first you normalize the data, then you can build a correlation network or other method using the normalized expression values. Other than PCA you can also build the correlation using PLS, or partial least squares. So, the GC content and length were already accounted for in the normalization process and are no longer necessary in your correlation model.

ADD REPLY • link 7.0 years ago by theobroma22 ★ 1.2k

0

Hi theobroma22, that's why I am asking. As far as I know, DESeq2 normalizes across samples but does not normalize for gene length or GC content, with the rationale that every gene is modeled separately. There are other approaches that do normalize across genes, such as EDASeq or FPKM. Should I use them in this case?

ADD REPLY • link 7.0 years ago by tolgaturant ▴ 20

0

Typically GC content and length are used to normalize the data.

It would be nice if you could provide a reference for this. As far as I know, common tools such as edgeR and DESeq2 don't normalize for these factors.

ADD REPLY • link 7.0 years ago by WouterDeCoster 47k

0

Sorry for not being clear. I was thinking of microarray platforms, where one has to consider these biases to normalize the data. However, it seems the OP was considering GC content and length as they pertain to gene expression; higher GC content can be correlated with higher gene expression. So, for this case, yes, it would be prudent to normalize based on these factors. However, this approach may "over-normalize" the data among samples, since geneA has the same GC content and length across the samples. That being said, even for RNA-seq one could use quantile normalization on the GC content and length, and then use this information to "smooth" the variation among samples to account for these biases. But your approach of just using Pearson's correlation is well accepted, so there is no requirement to account for these biases. The OP's question was somewhat confusing from the beginning.

ADD REPLY • link 7.0 years ago by theobroma22 ★ 1.2k

0

higher GC content can be correlated to higher gene expression

That's simply incorrect. Higher GC content would hamper amplification during library prep, as such showing a lower expression value.

ADD REPLY • link 7.0 years ago by WouterDeCoster 47k

0

http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.0040180

ADD REPLY • link 7.0 years ago by theobroma22 ★ 1.2k

1

GC content normalization can be performed because of the biases during PCR amplification and library prep. So if a gene is detected less because it has very high or very low GC content, that might be a problem when compared to other genes. The GC content normalization is not done to correct for the higher physiological expression of a gene when compared to the same gene artificially engineered to contain lower GC but the same coding sequence.

ADD REPLY • link 7.0 years ago by tolgaturant ▴ 20

0

Again I was initially referring to GC content bias for gene expression quantification in microarray platforms. Thanks.

ADD REPLY • link 7.0 years ago by theobroma22 ★ 1.2k

0

Definitely interesting.

But if the expression actually is higher then there is no reason to try and correct for that. That's not how normalisation works. You would correct for "technical" differences such as gene length (gene getting relatively more reads) and amplification bias (fragment getting amplified less). Not for true biological differences.

ADD REPLY • link 7.0 years ago by WouterDeCoster 47k

0

Smoothing and normalization are different things. If you have GC bias, then after normalization you can include the GC values in your design matrix, regress your model, and see how the expression of genes changes across conditions. But you do not have conditions here, so that is irrelevant. The only thing to do is take normalized data, either TPM/FPKM or normalized counts, compute the Pearson/Spearman correlation of gene A with the other genes, and look at the coefficients. Genes above a certain threshold are considered correlated; those below it are not. That's how it should be done.

ADD REPLY • link 7.0 years ago by ivivek_ngs ★ 5.2k