Question

Differential Gene Expression

0

Entering edit mode

2.7 years ago

anasjamshed ▴ 120

I have this data: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE162562&fbclid=IwAR0iZQhttG8HzGhFIIMWbFgNszQrVDgiyVChYzQ_ypCx_d-1pn_tm7STjGs

and I want to compare:

healthy vs Mild
healthy vs Highly exposed seronegative (ishgl)
Healthy vs Asymptomatic covid19 patient
healthy vs Highly exposed seronegative (non ishgl)

from this data.

I want to find DEGs as well. Is it possible from this data as i am not seeing the Geo2R option at the link?

DEG R • 4.3k views

ADD COMMENT • link 2.5 years ago by anasjamshed ▴ 120

1

Entering edit mode

The count files are provided for each sample as a tsv in supplementary files, which you can aggregate and use as input to DESeq2 or edgeR. Alternatively, GEO provides links to the accompanying SRA entry containing the fastq files for those samples. With the fastq files you can run through a workflow such as Salmon + DESeq2 to find differentially expressed genes.

ADD REPLY • link 2.7 years ago by rpolicastro 13k

0

Entering edit mode

i am not seeing any count file

ADD REPLY • link 2.7 years ago by anasjamshed ▴ 120

0

Entering edit mode

Go to a sample page (such as this), go to the bottom where it says "Supplementary file" and there should be a text file corresponding to the sample (e.g. GSM4954457_A_1_Asymptom.txt.gz).

ADD REPLY • link 2.7 years ago by rpolicastro 13k

0

Entering edit mode

but how can I aggregate them and can aggregate and use as input to DESeq2 or edgeR? I have tried this: DEGs

Can I proceed with this?

ADD REPLY • link 2.7 years ago by anasjamshed ▴ 120

0

Entering edit mode

When I click on SRA run selector I did not find any fastq files .plz help me

ADD REPLY • link 2.7 years ago by anasjamshed ▴ 120

0

Entering edit mode

For sra-explorer

Check out that link again: sra-explorer : find SRA and FastQ download URLs in a couple of clicks

You have to use SRA-explorer to search your SRA number for your dataset of interest.

So if you go to your dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE162562&fbclid=IwAR0iZQhttG8HzGhFIIMWbFgNszQrVDgiyVChYzQ_ypCx_d-1pn_tm7STjGs

and then scroll to where it says SRA (above Download Famiy). You will see SRP295561 this is the SRA number you search and then I think you can follow the tutorial using the links above. There may be a better way such as using the sra-toolkit or something, but I kind of like this way because you can use aspera, which allows you to download the large fastq files faster.

So what sra-explorer.info will provide are commands for your to make into a bash/shell script to run to download. Make sure you have enough storage on your computer/server/etc....

ADD REPLY • link 2.7 years ago by Pratik ★ 1.0k

0

Entering edit mode

After finding DEGs through deseq2, how can I apply topgo to do functional enrichment analysis?

ADD REPLY • link 2.6 years ago by anasjamshed ▴ 120

score 8 · Answer 1 · 2021-08-26

8

Entering edit mode

2.6 years ago

ATpoint 81k

Can you analyze in GEO2R? => No, because this is RNA-seq and not microarrays.

You are lucky thought that the authors seem to provide raw counts so you can easily fede them into DESeq2. Here is a code suggestion, for details please read the DESeq2 vignette extensively, it contains answers to 99.99% of all questions that might arise: http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

I also included some basic QC via Principal Component Analysis (PCA) and as you will see there is an obvious batch effect that needs to be corrected. Please go through the code, read the manual, try to understand what is going on. Be sure to really get a background.

This assumes that you downloaded the entire _RAW folder via the browser and then simply decompressed it by double-clicking on it, no magic here.



#/ download via the browser and uncompress the archive.
#/ we assume that the decompressed folder is in ~/Downloads/
listed_files <- list.files("~/Downloads/GSE162562_RAW", pattern=".gz", full.names=TRUE)

#/ load into R:
loaded <- lapply(listed_files, function(x) read.delim(x, header=FALSE, row.names = "V1"))

#/ bind everything into a single count matrix and assign colnames:
raw_counts <- do.call(cbind, loaded)
colnames(raw_counts) <- gsub(".txt.*", "", basename(listed_files))

#/ there is some nonsense in these files, probably due to the tool that was used (HTSeq?),
#/ such as rows with names "__no_feature", lets remove that:
raw_counts <- raw_counts[!grepl("^__", rownames(raw_counts)),]

#/ make a DESeq2 object. We can parse the group membership (Mild etc) from the colnames,
#/ which are based on the original filenames:
library(DESeq2)
dds <- DESeqDataSetFromMatrix(
  countData=raw_counts,
  colData=data.frame(group=factor(unlist(lapply(strsplit(colnames(raw_counts), split="_"), function(x) x[4])))),
  design=~group)

#/ some QC via PCA using the in-built PCA function from DESeq2:
vsd <- vst(dds, blind=TRUE)
#/ plot the PCA:
plotPCA(vsd, "group")

#/ extract the data:
pcadata <- plotPCA(vsd, "group", returnData=TRUE)

#/ there is a very obvious batch effect, that we can correct, simply everything that is greater or less than zero in the first principal component, very obvious just by looking at the plot:
dds$batch <- factor(ifelse(pcadata$PC1 > 0, "batch1", "batch2"))

#/ lets use limma::removeBatchEffect to see whether it can be removed:
library(limma)
vsd2 <- vsd
assay(vsd2) <- removeBatchEffect(assay(vsd), batch=dds$batch)

#/ repeat PCA, looks much better. this means we modify our design to include batch and group,
#/ again, please see the vignette what this all means:
plotPCA(vsd2, "group")

design(dds) <- ~batch + group

#/ standard workflow, see vignette:
dds <- DESeq(dds)

#/ extract the results, here for example Mild vs Asymptom, using "ashr" shrinkage, see vignette:
library(ashr)
res_Mild_vs_Asymptom <- lfcShrink(dds=dds, contrast=c("group", "Mild", "Asymptom"), type="ashr")

This is the first PCA that reveals the batch effect:

enter image description here

...and the second one that demonstrates that the batch effect can be removed:

enter image description here

That is now a lot of code, but this is basically how you do a standard RNA-seq differential analysis.

ADD COMMENT • link 2.6 years ago by ATpoint 81k

0

Entering edit mode

How can I find differentially expressed genes after this ? Should I use topTable()?

ADD REPLY • link 2.6 years ago by anasjamshed ▴ 120

2

Entering edit mode

Have you even looked at the code and vignette? Read the manual. It is the last line in my code that contains the DEGs for that contrast.

ADD REPLY • link 2.6 years ago by ATpoint 81k

0

Entering edit mode

Yes I looked at it but it is showing a lot of genes but I want to find upregulated and downregulated

ADD REPLY • link 2.6 years ago by anasjamshed ▴ 120

0

Entering edit mode

Read the manual -- I am out.

ADD REPLY • link 2.6 years ago by ATpoint 81k

0

Entering edit mode

when i tried last line of code:

res_Mild_vs_Asymptom <- lfcShrink(dds=dds, contrast=c("group", "Mild", "Asymptom"), type="ashr")

i get this error:

Error in mixem_rcpp(L, w, z, x, eps, numiter, zero.threshold, verbose): function 'Rcpp_precious_remove' not provided by package 'Rcpp'
Traceback:

1. lfcShrink(dds = dds, contrast = c("group", "Mild", "Asymptom"), 
 .     type = "ashr")
2. ashr::ash(betahat, sebetahat, mixcompdist = "normal", method = "shrink", 
 .     ...)
3. ash.workhorse(betahat, sebetahat, mixcompdist = mixcompdist, 
 .     df = df, ...)
4. estimate_mixprop(data, g, prior, optmethod = optmethod, control = control, 
 .     weights = weights)
5. do.call(optmethod, args = list(matrix_lik = matrix_lik, prior = prior, 
 .     pi_init = pi_init, control = control, weights = weights))

ADD REPLY • link 2.6 years ago by anasjamshed ▴ 120

0

Entering edit mode

Hello,

This is really off-topic but I just would like to know more about the batch effect. I started to understand the concept of it but would like to know what made you state that your first PCA plot has the batch effect. Is this because there's two obvious clusters in the first PCA plot?

Also, I am just wondering if the two clustered got seperated by some other hidden factors authors didn't mension such as gender, then could we say we don't have the batch effect?

Thanks!

ADD REPLY • link 2.6 years ago by jkim ▴ 170

1

Entering edit mode

I assumed it was a batch effect because it separates on the first PC (so the one that explains the most variance) and both batches or "groups" in the plot have members of all groups. Yes, it could i theory be that it is not a technical batch effect but something like sex (gender is a word from social science, sex is the biological term based on presence of X/Y chromosomes), but the results would be the same, we would include it into the model as we want to compare the four groups and want no confounding by this "batch" or whatever the reason for the separation is.

ADD REPLY • link 2.6 years ago by ATpoint 81k

0

Entering edit mode

Thank you for your explanation!

ADD REPLY • link 2.6 years ago by jkim ▴ 170

0

Entering edit mode

can you please explain this line :

colData=data.frame(group=factor(unlist(lapply(strsplit(colnames(raw_counts), split="_"), function(x) x[4]))))

ADD REPLY • link 2.5 years ago by anasjamshed ▴ 120

score 4 · Answer 2 · 2021-08-22

4

Entering edit mode

2.7 years ago

Pratik ★ 1.0k

Hey this should get you started to aggregate the counts together, it may be a little quick and dirty since I am using cbind and not merge by genes, but this should get you started.

You first want to download the supplementary file here: enter image description here

and then you can download that file "GSE162562_RAW.tar", extract it to your Desktop so you have the a GSE162562_RAW folder on your Desktop, then the following commands should aggregate the counts together.

#Un GZIP the count files
system("gunzip ~/Desktop/GSE162562_RAW/*.gz")

#get the list of sample names
GSMnames <- t(list.files("~/Desktop/GSE162562_RAW", full.names = F))

#remove .txt from file/sample names
GSMnames <- gsub(pattern = ".txt", replacement = "", GSMnames)

#make a vector of the list of files to aggregate
files <- list.files("~/Desktop/GSE162562_RAW", full.names = TRUE)

#check if there is the same number of rows in all samples
system("cd ~/Desktop/GSE162562_RAW | wc -l ~/Desktop/GSE162562_RAW/*.txt")
#there are 26369 rows so by extension there should be 26369 genes

#load the gene names up
genes <- read.table(files[1], header=FALSE, sep=",")[,1]

#make the raw aggregated data frame of all the counts
df    <- do.call(cbind,lapply(files,function(fn)read.table(fn,header=FALSE, sep="\t")[,2]))

#bind it together with genes
df <- cbind(genes,df)

#change row names to gene names
row.names(df)<- df[,1]

#remove remaining gene column
df = subset(df, select = -c(genes))

#change column names to sample names
colnames(df)<- data.frame(GSMnames)

#cleanup
rm(files, genes, GSMnames)

Then you can plug these counts into DESeq2 or EdgeR , you may have to make an appropriate meta data so you can setup your comparisons accordingly to generate a list of differentially expressed genes after followin the DESeq2 workflow.

Ideally though, you may want to disregard everything I typed above this, because I think it could be in your best interest to do what rpolicastro was mentioning in the first comment here:

Alternatively, GEO provides links to the accompanying SRA entry containing the fastq files for those samples. With the fastq files you can run through a workflow such as Salmon + DESeq2 to find differentially expressed genes.

which is to download the raw FASTQ files and then plugging them into Salmon + DESeq2. You have so much more control of everything that way, in my opinion. I, personally, like to be in control... This may require a bunch of more hoops to jump through through like installing conda and also snakemake if you use the tutorial rpolicastro linked. It's not too bad though.

To download the fastq files, I, personally, use the sra-explorer website and aspera (aspera allows you to download fastq files much faster): sra-explorer : find SRA and FastQ download URLs in a couple of clicks

You could google how to download and install aspera... or check out just Step 1 of this tutorial: [Deprecated] Fast download of FASTQ files from the European Nucleotide Archive (ENA) (mind you, there is a newer version of apsera out now so some of the step might be a bit different, but if you want go down this route, this should get you started) (Remember you only need Step 1 in this tutorial)

EDIT 08.27.2021: basically if you want to go the fastq file route you should download, install aspera, and add it to your $PATH and then use sra-explorer to get the aspera download links/commands. you can make the aspera download commands to a .sh file and run it in terminal to quickly download all of the fastq files you need

ADD COMMENT • link 2.6 years ago by Pratik ★ 1.0k

0

Entering edit mode

It is showing error: error

ADD REPLY • link 2.7 years ago by anasjamshed ▴ 120

1

Entering edit mode

Hey the above script is meant for R and MacOS or Linux. The thought process here is to just extract the tar file to your desktop so you have a folder GSE162561_RAW, and then extract the gzip files all together. I think Google may be able to help you with this hoop to jump through ; )

ADD REPLY • link 2.7 years ago by Pratik ★ 1.0k

0

Entering edit mode

I run this all script in R(Linux):

#Un GZIP the count files
system("gunzip ~/Desktop/GSE162562_RAW/*.gz")

#get the list of sample names
GSMnames <- t(list.files("~/Desktop/GSE162562_RAW", full.names = F))

#remove .txt from file/sample names
GSMnames <- gsub(pattern = ".txt", replacement = "", GSMnames)

#make a vector of the list of files to aggregate
files <- list.files("~/Desktop/GSE162562_RAW", full.names = TRUE)

#check if there is the same number of rows in all samples
system("cd ~/Desktop/GSE162562_RAW | wc -l ~/Desktop/GSE162562_RAW/*.txt")
#there are 26369 rows so by extension there should be 26369 genes

#load the gene names up
genes <- read.table(files[1], header=FALSE, sep=",")[,1]

#make the raw aggregated data frame of all the counts
df    <- do.call(cbind,lapply(files,function(fn)read.table(fn,header=FALSE, sep="\t")[,2]))

#bind it together with genes
df <- cbind(genes,df)

#change row names to gene names
row.names(df)<- df[,1]

#remove remaining gene column
df = subset(df, select = -c(genes))

#change column names to sample names
colnames(df)<- data.frame(GSMnames)

#cleanup
rm(files, genes, GSMnames)

write.csv(df,countsgen.csv)

But this giving me : counts

This data seems to me useless as it does not show any gene or sample names.

Plz help me

ADD REPLY • link 2.7 years ago by anasjamshed ▴ 120

1

Entering edit mode

Hi anasjamshed1994, another note, because it seems you may be doing this all in terminal R, which could be good if you have a large dataset and want to make something to run on a cluster or so. Since this isn't a crazy big dataset you could have an easier time doing this in RStudio so when you create the "df" data frame using the commands I provided, you can visualize the dataframe in RStudio by clicking on "df" in your RStudio environment or typing View(df). You should be able to see sample as column names and genes as row names:

enter image description here

There is really no need to write your data frame as a csv here, because your data frame will serve as input into DESeq2. You may have to rename columns on your df (counts matrix) and make a coldata data frame to plug into DESeq2.

The counts matrix and the coldata are the two key components of beginning a DESeq2 workflow:

Check out this section of the DESeq2 tutorial to give you some ideas on how to proceed: "Count matrix input" http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#count-matrix-input

Ultimately, you are having trouble with understanding some basics in R. No problem, you just have to google things here and there to figure out which I'm sure you have been doing with success.

You want to get to this point in the tutorial:

library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)

where you replace cts with df, you make a coldata data frame, and you choose your appropriate design.

So your commands might look like this, just to be a little bit more explicit:

library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = df,
                              colData = coldata,
                              design = ~ yourcomparisonhere)

Another note is that usually people have the most trouble with choosing the design. I think you have a straight forward design. You should construct your coldata so then you can easily do your designs/comparisons. If I remember correctly, you will have to do a DESeq2 object and a new design for each comparison, using the same counts matrix (df) and the same coldata, just a different design.

This might help for coldata so will the DESeq2 tutorial I linked above... https://support.bioconductor.org/p/115078/

Just judging by your questions, Google searching biostars/bioconductor forums/stack overflow might be your friend for making the coldata.

Good luck!

-Pratik

ADD REPLY • link 2.7 years ago by Pratik ★ 1.0k

0

Entering edit mode

I tried this in windows(R):

URL <- "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE162nnn/GSE162562/suppl/GSE162562_RAW.tar"
FILE <- file.path(tempdir(), basename(URL))


utils::download.file(URL, FILE, mode = "wb")

exdir <- path.expand("~/GSE162562_RAW")

dir.create(exdir, showWarnings = FALSE)

utils::untar(FILE, exdir = exdir)

unlink(FILE, recursive = TRUE, force = TRUE)

files <- list.files(exdir, full.names = TRUE)


GSMnames <- t(list.files(exdir, full.names = F))

GSMnames <- gsub(pattern = ".txt", replacement = "", GSMnames)

GSMnames <- gsub(pattern = ".gz", replacement = "", GSMnames)

names(files) <- sub("\\.txt\\.gz$", "", basename(files))
files

genes <- read.table(files[1], header=FALSE, sep=",")[,1]


df    <- do.call(cbind,lapply(files,function(fn)read.table(fn,header=FALSE, sep="\t")[,2]))

df <- cbind(genes,df)

row.names(df)<- df[,1]

df = subset(df, select = -c(genes))

colnames(df)<- data.frame(GSMnames)

rm(files, genes, GSMnames)

and get this : output

Is this ok?

ADD REPLY • link 2.7 years ago by anasjamshed ▴ 120