Question: Differential expression gene analysis
0
gravatar for Uday Madappa
4 months ago by
Uday Madappa40
National Institute of Technology, Calicut, India
Uday Madappa40 wrote:

I input a count matrix which does not contain any gene id information in it to deseq. I obtained the result table.

  1. Now, how can i obtain the row numbers corresponding to a particular p value, as the row numbers corresponds to gene id in the original data set. Please help.

  2. How can i verify that the subset of genes selected by deseq2 is the best subset?

rna-seq deseq2 • 380 views
ADD COMMENTlink modified 4 months ago by h.mon16k • written 4 months ago by Uday Madappa40

How will you get any info if you are using a count matrix without gene id?

ADD REPLYlink written 4 months ago by arup410

Like i already said, row 1 corresponds to first gene in the original dataset. Likewise row 20530 corresponds to 20530th gene. Okay, can you tell me how to feed a data set in excel format to deseq2? Thanks.

ADD REPLYlink written 4 months ago by Uday Madappa40
1

Why you removed the gene name in the first place? If you are sure about 20530 will map to 20530th gene then merge the gene_id with res.

ADD REPLYlink written 4 months ago by arup410

I need to train a classifier using the output of deseq2 which is why i decided to go with row numbers that way it is easy for me to extract them. how do you suggest i go ahead?

ADD REPLYlink written 4 months ago by Uday Madappa40

what about the 2nd part of my question? GSEA perhaps?

ADD REPLYlink written 4 months ago by Uday Madappa40

your earlier question states that you've got gene ids present in your excel files. Just add them back into your dataset. Presumably your counts are stored in a DGEList (or similar). So you can just add them into the genes entry of that DGEList, or put them back into the rownames. (ps, GSEA will not validate anything about a classifier; all you can do is split up your dataset and cross-validate or compare to a related study or get back to the bench with your selected gene sets)

ADD REPLYlink modified 4 months ago • written 4 months ago by russhh3.6k
1
gravatar for h.mon
4 months ago by
h.mon16k
Brazil
h.mon16k wrote:
  1. Now, how can i obtain the row numbers corresponding to a particular p value, as the row numbers corresponds to gene id in the original data set. Please help.

As arup said, you don't need to remove the gene names. If you want to subset by p-value:

dds <- DESeqDataSetFromMatrix( countData = counts, 
                               colData = coldata, 
                               design = ~ treatment )
dds <- DESeq( dds )
res <- results( dds, contrast = c( "treatment", "A", "B" ), alpha = 0.05 )

select <- res[( res$padj < 0.05 & !is.na( res$padj ) ), ]

counts( dds )[ rownames( select ), ]
  1. How can i verify that the subset of genes selected by deseq2 is the best subset?

Define "best".

ADD COMMENTlink written 4 months ago by h.mon16k

Thanks for that sir. I still have a couple of clarifications!!

  1. I intend to do differential expression analysis based on pairs ie; untreated vs treated of the same sample. How do you suggest i go ahead?

  2. I had a total of 20530 genes in the data set. Out of which almost 12000 are up or down regulated. What about the remaining 8000 odd genes?

Thanks in advance.

ADD REPLYlink modified 4 months ago • written 4 months ago by Uday Madappa40
1

page 42 of the limma use'rs guide will give you an appropriate design

ADD REPLYlink written 4 months ago by russhh3.6k

Thanks, I'll go through that. Any thoughts about the other questions?

ADD REPLYlink written 4 months ago by Uday Madappa40

no - I didn't see any valid questions. I think you should get the basics done well first.

ADD REPLYlink written 4 months ago by russhh3.6k

May be I'm not making myself clear enough to you.

  1. Which criteria is used to decide whether a gene is up or down regulated?

  2. I had a total of 20530 genes in the data set. Out of which almost 12000 are up or down regulated. What about the remaining 8000 odd genes? Do they fail to reject the null hypothesis? If so, what is the definition of null hypothesis?

I'm pretty sure this is defined somewhere in the manual. Please spare me the time.

ADD REPLYlink written 4 months ago by Uday Madappa40

You define the null hypothesis and you encode it within your contrasts and your p-value thresholds. Since you haven't posted a line of code, I have absolutely no idea how you selected which genes are up or downregulated. In h.mons code genes were filtered after fdr adjustment using a threshold of 0.05.

ADD REPLYlink written 4 months ago by russhh3.6k
#create a dge list
dds <- DGEList(counts = rkich, genes = genes)

#initialize DESeqDataSet
countData <- dds$counts

condition <- factor(rep(c("Tum","Norm"),each=25))

dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)

#main
dds <- DESeq(dds)

res <- results(dds, alpha = 0.05)

summary(res)

out of 19325 with nonzero total read count

adjusted p-value < 0.05

LFC > 0 (up)     : 5988, 31% 

LFC < 0 (down)   : 5990, 31% 

outliers [1]     : 0, 0% 

low counts [2]   : 4, 0.021% 

(mean count < 0)

[1] see 'cooksCutoff' argument of ?results

[2] see 'independentFiltering' argument of ?results

Where can i define the null hypothesis?

ADD REPLYlink modified 4 months ago by genomax52k • written 4 months ago by Uday Madappa40
2

In the line res <- results(...) you defined the null hypothesis; this was done implicitly by DESeq2. Based on your design, (dds <- ... ~ condition), and that you haven't specified any alternative contrast of interest inyour workflow, DESeq2 has assumed that you want to check the null that the coefficient corresponding to the second column of your design matrix is zero (curiously this is: Normal minus Tumour, it might be more natural to define the 'levels' of your condition factor more naturally)

your design is wrong for reasons that you have outlined above - the tumour/normal samples are matched. As I said, get the basics fixed before you read too much into your results. That is, put indicators for patient-of-origin into your design. Please read how to do this in the limma user's guide, in the pages I indicated.

ADD REPLYlink written 4 months ago by russhh3.6k
#create a dge list

dds <- DGEList(counts = rkich, genes = genes)

#initialize DESeqDataSet

df <- data.framePatient.ID = factor(rep(1:3,each=2)), Treatment = 

factor(rep(c("Pre","On"),3),levels=c("Pre","On")))

countData <- dds$counts

dds <- DESeqDataSetFromMatrix(countData, DataFrame(df), ~ Patient.ID + Treatment )

#main

dds <- DESeq(dds)


res <- results(dds, name ="Treatment_On_vs_Pre", alpha = 0.05)

summary(res)

out of 18545 with nonzero total read count

adjusted p-value < 0.05

LFC > 0 (up)     : 2531, 14% 

LFC < 0 (down)   : 2010, 11% 

outliers [1]     : 0, 0% 

low counts [2]   : 2145, 12%

Thanks for your advice. I have accordingly changed my design matrix as follows.

"curiously this is: Normal minus Tumour" Should it be the other way around?

The whole purpose of this experiment is to find out the genes which is significantly affected on its transition from before and after treatment. Does the above mentioned design achieve this goal?

ADD REPLYlink modified 4 months ago • written 4 months ago by Uday Madappa40
1

Where can i define the null hypothesis?

If you want to use DESeq2:

1) read carefully DESeq2 documentation: Analyzing RNA-seq data with DESeq2

2) ?results

3) search for DESeq2 paired samples - you will find plenty of useful resources, like here or here.

ADD REPLYlink written 4 months ago by h.mon16k

Correct me if I'm wrong. As for my understanding,

  1. alpha is a thresh hold value. If the p value of a particular gene is less than alpha, the null hypothesis is rejected and the gene becomes a part of the result. Does that mean that the definition of null hypothesis is that all genes are not significant?

  2. What is the significance of adjusted p value?

  3. Is the concept of up or down regulated genes decided by the log fold change value?

  4. What is the significance of plotMA() in this scenario?

Thanks a lot for your help.

ADD REPLYlink modified 4 months ago • written 4 months ago by Uday Madappa40
1

Answer to 1) is taken from from DESeq2 documentation I linked above:

DESeq2 offers two kinds of hypothesis tests: the Wald test, where we use the estimated standard error of a log2 fold change to test if it is equal to zero, and the likelihood ratio test (LRT). The LRT examines two models for the counts, a full model with a certain number of terms and a reduced model, in which some of the terms of the full model are removed. The test determines if the increased likelihood of the data using the extra terms in the full model is more than expected if those extra terms are truly zero.

Answer to 2) is taken from ?results:

pAdjustMethod: the method to use for adjusting p-values, see ?p.adjust

Answer to 3) is taken from ?results:

The results table when printed will provide the information about the comparison, e.g. "log2 fold change (MAP): condition treated vs untreated", meaning that the estimates are of log2(treated / untreated), as would be returned by contrast=c("condition","treated","untreated"). Multiple results can be returned for analyses beyond a simple two group comparison, so results takes arguments contrast and name to help the user pick out the comparisons of interest for printing a results table. The use of the contrast argument is recommended for exact specification of the levels which should be compared and their order.

ADD REPLYlink written 4 months ago by h.mon16k

If you could tell me in simple words the answer to my questions I'd be really greatful.

How do i know if a particular gene is up or down regulated?

Thanks for you quick response.

ADD REPLYlink modified 4 months ago • written 4 months ago by Uday Madappa40

You never really know

ADD REPLYlink written 4 months ago by russhh3.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 715 users visited in the last hour