Question

RSEM_R differential expression and PCA plot

0

Entering edit mode

5.2 years ago

mxlsherry1992 ▴ 80

Dear all,

I have a species without reference genome, so I did the RNA-SEQ analysis using " Trinity-align_and_estimate_abundance.pl -RSEM" pipeline, and got two files:

RSEM.isoforms.results 
RSEM.genes.results

Then, I used abundance_estimates_to_matrix.pl and run_DE_analysis.pl to do the differential expression analysis.

But now, I want to use R studio to do some PCA plot and heatmap, here is part of my R script, I just want to know which file should I used as the input file for R? (in the position of "pan_all_counts")

source("https://bioconductor.org/biocLite.R")
biocLite("pheatmap")
library("DESeq2")
library(ggplot2)
library(dplyr)
library(pheatmap)

## Input the count data
countData <- read.table("pan_all_counts", header=TRUE, row.names=1)
dim(countData)
head(countData)

##Import the metadata that inlcudes the treatment assignments and if single or paired reads.
colData <- read.table("pan_all_count_sample", header=TRUE, row.names=1)
dim(colData)
head(colData)

# check the order of treatment assignments in colData and countData
all(rownames(colData) %in% colnames(countData))
countData <- countData[, rownames(colData)]
all(rownames(colData) == colnames(countData))

##e Creat the dataset and define model
?DESeqDataSetFromMatrix
dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData = colData,
                              design = ~ condition)
dds

##########remove genes with no counts
dds <- dds[rowSums(counts(dds)) > 1,]
dds

###########PCA plot
#rlog This function transforms the count data to the log2 scale in a way which minimizes differences between samples for rows with small counts, 
#and which normalizes with respect to library size.
#rld<-rlog(dds)
#plotPCA(rld)
############

#perform DESeq, DESeq will do the following things:
#estimating dispersions
#found already estimated dispersions, replacing these
#gene-wise dispersion estimates
#mean-dispersion relationship
#final dispersion estimates
#fitting model and testing
jumpy <- DESeq(dds)
jumpy
resultsNames(jumpy)
#make sure results look okay
res <- results(jumpy)
write.csv(as.data.frame(res),file="res1.csv")
head(res)
summary(res)

#reduce alpha to .05 for analysis of results
?results
res05 <- results(jumpy, alpha =.05)
head(res05)
names(res05)
#find out how many genes are differentially expressed below the maximum adjusted p-value cutoff
#sum(res05$padj < .05, na.rm=TRUE)
sum(res05$padj < .05, na.rm=TRUE)
#order all the genes by padj
#resOrdered <- res05[order(res05$padj),]
resOrdered <- res05[order(res05$padj),]
#store the number of DE genes as a variable
#numDEgenes <- sum(res05$padj < .05, na.rm=TRUE) should use padj
numDEgenes <- sum(res05$padj < .05 & (res05$log2FoldChange > 1 | res05$log2FoldChange < -1), na.rm=TRUE)
numDEgenes
#get only the DEgenes ordered by padj
DEgenes <- head(resOrdered, numDEgenes)
head(DEgenes)
#write only these genes to a CSV file
write.csv(as.data.frame(DEgenes), file="starfishtopDEgenes1.csv")

RNA-Seq • 2.2k views

ADD COMMENT • link 5.2 years ago by mxlsherry1992 ▴ 80

0

Entering edit mode

But now, I want to use R studio to do some PCA plot and heatmap,

If above is your aim, I wonder why you want to do DEseq as you already did differential analysis using RSEM. You can use your .results files to generate heatmap and pca plot. I assume it has count matrix where rows are genes / isoforms and columns are samples.

ADD REPLY • link 5.2 years ago by Chirag Parsania ★ 2.0k

0

Entering edit mode

Hi, Thanks for the reply! so you means I can use the results of

RSEM.genes.results

directly for PCA and heatmap?

Here is part of the RSEM.genes.results, if I need to extracted one columns to do that? ( And I have 7 sample, 2 replicated, so I got 14 RSEM.genes.results files together, will want to do the PCA and heatmap together, if you know which software can do that ..)

gene_id transcript_id(s)    length  effective_length    expected_count  TPM FPKM
TRINITY_DN0_c0_g1   TRINITY_DN0_c0_g1_i1    249.00  80.67   0.00    0.00    0.00
TRINITY_DN100000_c0_g1  TRINITY_DN100000_c0_g1_i1   851.00  680.67  8.00    1.19    2.55

ADD REPLY • link 5.2 years ago by mxlsherry1992 ▴ 80

0

Entering edit mode

It seems RSEM.genes.results contains normalised expression values which are TPM and FPKM. As you said, you have already performed differential analysis using run_DE_analysis.pl you must have fold change matrix as well. Depending upon your purpose of analysis you can use either FPKM/ TPM or fold change value to plot the heatmap. One of the purposes of PCA analysis is to check how good replicates are. If you also want to use PCA for same purpose, I would suggest you to use log2(FPKM) values to generate PCA plot. There are several tutorial available online for PCA. For I heamap my personal favourite us ComplexHeatmap

ADD REPLY • link 5.2 years ago by Chirag Parsania ★ 2.0k

0

Entering edit mode

Thanks for your kindly reply!! So it seems FPKM/FPM and log2foldchange value works for heatmap plotting..? I just copy and paste my results for different samples and input them to do heatmap is ok...?:)

Thank you!

ADD REPLY • link 5.2 years ago by mxlsherry1992 ▴ 80

0

Entering edit mode

This is very vague question to answer. It depends on for what purpose you want to plot the heatmap

ADD REPLY • link 5.2 years ago by Chirag Parsania ★ 2.0k