Differential analysis between two cell lines
0
1
Entering edit mode
6.2 years ago
Biologist ▴ 290

Hi,

I have raw counts data for two prostrate cancer cell lines. I want to get differential expressed genes between this two cell lines. Just doing a simple t-test is fine or I should use edgeR/DeSeq2?

If t-test is fine I want to make a violin plot so, do I need to convert raw counts to log cpm or RPKM?

R RNA-Seq edgeR differential analysis • 3.7k views
ADD COMMENT
1
Entering edit mode

Is this single-cell data, or regular "bulk" RNAseq data? If regular RNAseq, go with edgeR or DESeq2, if single cell, you have to look for a single-cell analysis pipeline.

Do not use a t-test.

ADD REPLY
0
Entering edit mode

ok. but it is only 1 cell-line vs another cell-line right. Is it possible to go with edgeR or Deseq2? If yes how?

ADD REPLY
0
Entering edit mode

Look in the vignette for edgeR, specifically chapter 4 (page 39.) https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

ADD REPLY
0
Entering edit mode

Hi, I followed the tutorial. But there is an error.

y <- estimateDisp(y, design)
Error: $ operator is invalid for atomic vectors
ADD REPLY
0
Entering edit mode

@h.mon I want to check expression of specific genes between those two cell lines. Just to know which gene is more expressed in which cell line. Why not use a t-test?

ADD REPLY
0
Entering edit mode

It appears that you are just comparing 1 sample versus another (you have not mentioned anything about replicates), in which case you cannot faithfully conduct any type of statistical comparison. Your best bet is to calculate ratios for each gene in one sample versus the other.

ADD REPLY
0
Entering edit mode

Thanks Kevin. In the nature paper t-test between basal cell-lines and non-basal lines please check Figure1C and also its legend. They have just done students t-test to check the expression between basal and non-basal.

Similarly, in my analysis I have only two prostrate cell-lines and want to check which genes are more expressed in which cell-line. Do you think t-test is fine to find which genes are significant b/w both cell-lines? calculate ratios for each gene? What to do for that I have raw counts data like following:

Genes             Prostrate cell1   Prostrate cell2
MIR137HG            989                         896
RP11-576D8.4      3                           0
SIX3-AS1              0                          11
AC012354.6    0                           1
AC010967.2    0                           0
RP11-19E11.1    126                            0
RP4-555D20.4      0                         167
LINC00973            31                        2240
RP11-615J4.3      0                            0
CTD-2297D10.2     0                            0
CTC-254B4.1   4                            0
DLX6-AS1             53                           13
RP5-884M6.1   5                          394
RP11-10A14.5      3                            74
RP11-115J16.1     0                             0
RP11-563N12.2     0                             0
RP11-359E19.2     0                             0
RP11-513O17.2    3277                       0
SNORD3D           90                               52
HOXB-AS4               0                                0
RP11-357H14.1      0                                 0
CTD-2319I12.5       0                                1
LINC00511             309                      1951
ADD REPLY
1
Entering edit mode

Thanks for pointing that out. I note that they are doing a t-test using RPKM, which is incorrect. However, that's in the Nature journal series, the publications of which frequently contain improper statistical methods.

If you are going to use a t-test for your data, then at least normalise to CPM and neither FPKM nor RPKM.

ADD REPLY
0
Entering edit mode

Small help. So, first I converted counts data to logCPM. Lets say the above table is in a dataframe tin2.

tin2 <- data.frame(tin2[,-1], row.names=tin2[,1])
logCPM <- cpm(tin2, prior.count=2, log=TRUE)

So, now I have genes as rows and two cell-lines as columns with logCPM values. From this data I want to make violin plot for each gene showing the significance between two cell-lines. Could you please help me how to do it.

ADD REPLY
1
Entering edit mode

You can generate a violin plot for each gene but it would not look normal. It would just contain a single line to represent the median, or just a 'blob' of a value - you only have 2 samples.

Getting back to Figure 1C in the manuscript that you mentioned, the legend states:

(c) Violin plot of LINP1 expression in basal (n=26) vs non-basal (n=20) breast cancer cell lines from CCLE. Two-tailed Student’s t-test; p-value=0.046.

Those guys have 26 basal samples and 20 non-basal. So, their plot is plotting 26 values versus 20, and that's also what they compare with their t-test.

As you only have 2 samples, you cannot generate the same plot - sorry. Your best hope is to just derive the ratio for each gene between both cell-lines that you have.

ADD REPLY
0
Entering edit mode

Hi Kevin,

Thankyou. In edgeR tutorial I saw section 2.11 which can be followed if there are no replicates.

   library(edgeR)
    y <- DGEList(counts=tin2[,2:3], genes=tin2[,1], group = 1:2)
    y <- calcNormFactors(y,method = "TMM")
    bcv <- 0.1

    et <- exactTest(y, dispersion=bcv^2)
    topTags(et,n=100)
    tab <- topTags(et,n=Inf)
    summary(decideTestsDGE(et))

           1+2
    Down     5
    NotSig  12
    Up       6

    keep <- tab$table$FDR <= 0.05
    rt <- tab$table[keep,]

                   genes       logFC   logCPM        PValue           FDR
    18     RP11-513O17.2 -15.1503421 18.64105 4.992433e-149 1.148260e-147
    8          LINC00973   5.3437230 17.31458  2.130951e-72  2.450594e-71
    13       RP5-884M6.1   5.4476608 14.85419  3.373919e-36  2.586671e-35
    6       RP11-19E11.1 -10.4504538 13.89255  5.819664e-32  3.346307e-31
    7       RP4-555D20.4  10.0299482 13.65357  3.821036e-24  1.757676e-23
    23         LINC00511   1.8309535 17.43327  5.836966e-16  2.237504e-15
    12          DLX6-AS1  -2.8395750 12.84278  8.261722e-10  2.714566e-09
    14      RP11-10A14.5   3.7576291 12.65649  1.468446e-09  4.221782e-09
    19           SNORD3D  -1.6157071 13.85609  8.499783e-07  2.172167e-06
    1           MIR137HG  -0.9696249 17.51049  7.723187e-06  1.776333e-05
    3           SIX3-AS1   6.1251140 10.34147  4.642655e-03  9.707370e-03

So, from the results which how to say which gene is upregulated in which cell-line?

ADD REPLY
0
Entering edit mode

Okay, but in which World is it a robust study to compare just 1 sample versus the other? Think of this scenario: you manage to get a replicate for each of your cell lines and then repeat the analysis (now 2 versus 2) and then find that virtually all of your fold changes have flipped direction. What then? You then get a third replicate, repeat it, and find that one of your original samples behaves like an outlier, and many of your fold-changes have again flipped direction. What then?

With just 2 samples, you certainly:

  • cannot generate meaningful box- or scatter plots
  • cannot judge whether a sample is an outlier or not
  • cannot generate credible statistics

You should, at minimum, aim for 3 versus 3.

Thanks!

ADD REPLY
0
Entering edit mode

I see your point here. But I have only two prostrate cell-lines. And to knockdown the some particular genes in one of the cell-line, first I have to know in which cell-line those genes are highly expressed. so, it will be easier to me to select that cell-line and knockdown those genes. This was my idea for some experiments. So, I got the CCLE data for two prostrate cell-lines and thought of doing like above.

ADD REPLY
0
Entering edit mode

I see. In that case, the best that you can do is follow that tutorial by the EdgeR authors and to just be aware of the limitations. The genes with the largest absolute fold-change differences should reflect the ones in which you will be interested.

You may also consider transforming your logCPM data to the Z-scale and then take transcripts that have absolute Z-score>3 or >4 in either cell-line. Hopefully, the fold-change results and those o the Z-scores will match on many genes.

Thanks!

ADD REPLY
0
Entering edit mode

Sure. I will follow this. So, in my above EdgeR analysis

group = 1:2 which is 
                               group
prostrate cell-line1     1 
prostrate cell-line2     2

in EdgeR results above the genes with positive logFC are upregulated in prostrate cell-line1 and genes with negative logFC are unregulated in prostrate cell-line2. Am I right?

And as you said with above given counts data first I transformed them into logCPM like below:

tin2 <- data.frame(tin2[,-1], row.names=tin2[,1])
logCPM <- cpm(tin2, prior.count=2, log=TRUE)

                     Prostrate_Cell-line1 Prostrate_Cell-line2
MIR137HG           17.566449       17.156450
RP11-576D8.4        9.881363        8.473547
SIX3-AS1            8.473547       11.066461
AC012354.6          8.473547        9.017102
AC010967.2          8.473547        8.473547
RP11-19E11.1       14.611886        8.473547

From this I transformed to Z-scale.

logCPM <- t(scale(t(logCPM)))

                    Prostrate_Cell-line1 Prostrate_Cell-line2
MIR137HG           0.7071068      -0.7071068
RP11-576D8.4       0.7071068      -0.7071068
SIX3-AS1          -0.7071068       0.7071068
AC012354.6        -0.7071068       0.7071068
AC010967.2               NaN             NaN
RP11-19E11.1       0.7071068      -0.7071068

Did I go wrong some where?

ADD REPLY
0
Entering edit mode

Hey, regarding the Z-scores, that's just a consequence of having too few dimensions in your data - apologies for suggesting that idea.

Regarding the fold changes, it will depend on the direction of comparison. A useful way to check is to just look at your normalised counts and contrast them to the fold-change directions for the purpose of inferring this.

ADD REPLY
0
Entering edit mode

Ok. please see this example.

logCPM of MIR137HG

                     Prostrate_Cell-line1 Prostrate_Cell-line2
MIR137HG           17.566449             17.156450

The expression is high in Prostrate_Cell-line1.

And differential analysis between both cell-lines gave -0.9696249 logFC for MIR137HG [the value is negative because DEA is Prostrate_Cell-line2 vs Prostrate_Cell-line1. If I do Prostrate_Cell-line1 vs Prostrate_Cell-line2 the value will be positive]

Is this the way you told me above?

ADD REPLY
0
Entering edit mode

In that case, it actually looks like it is Cell-Line2 / Cell-Line1, and that Cell-Line 1 is the 'reference' level.

17.156450 / 17.566449 = 0.97666010928

The fold-change indicates that cell-line 2 has fractionally less expression.

You may want to check other genes with large fold-changes just to confirm This can be changed where you have stored your group variable with:

group <- factor(group, levels=c(1, 2)) #1 is reference level

group <- factor(group, levels=c(2, 1)) #2 is reference level
ADD REPLY
0
Entering edit mode

Sorry, could you please clarify one thing in the above analysis.

group = 1:2 which is 
                               group
prostrate cell-line1     1 
prostrate cell-line2     2

Summary shows 1+2 6 genes were Upregulated in which cell-line?

ADD REPLY
1
Entering edit mode

I assume up-regulated in cell-line 2. Just check the expression of each gene though, to double check.

ADD REPLY
0
Entering edit mode

Sure. will look into that. thank you

ADD REPLY
0
Entering edit mode

Good luck with it and keep me updated!

ADD REPLY

Login before adding your answer.

Traffic: 1377 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6