Hey, this should get you started on aggregating the counts. It is a little quick and dirty, since I am using cbind rather than a merge by gene, so it assumes every file lists the same genes in the same order.
First, go to the supplementary file here:
Download the file "GSE162562_RAW.tar" and extract it to your Desktop so that you have a GSE162562_RAW folder there. The following commands should then aggregate the counts.
#Un GZIP the count files (assuming the extracted files end in .gz)
system("gunzip ~/Desktop/GSE162562_RAW/*.gz")
#get the list of sample names
GSMnames <- list.files("~/Desktop/GSE162562_RAW", full.names = FALSE)
#remove .txt from file/sample names (escape the dot so it only matches a literal ".txt" at the end)
GSMnames <- sub(pattern = "\\.txt$", replacement = "", GSMnames)
#make a vector of the list of files to aggregate
files <- list.files("~/Desktop/GSE162562_RAW", full.names = TRUE)
#check if there is the same number of rows in all samples
system("wc -l ~/Desktop/GSE162562_RAW/*.txt")
#there are 26369 rows so by extension there should be 26369 genes
#load the gene names up
genes <- read.table(files[1], header = FALSE, sep = "\t")[, 1]
#make the raw aggregated data frame of all the counts
df <- do.call(cbind, lapply(files, function(fn) read.table(fn, header = FALSE, sep = "\t")[, 2]))
#convert the count matrix to a data frame
df <- as.data.frame(df)
#change row names to gene names (this needs the gene names to be unique)
rownames(df) <- genes
#change column names to sample names
colnames(df) <- GSMnames
#clean up the helper objects
rm(files, genes, GSMnames)
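If you would rather not rely on every file having its genes in the same order, here is a rough sketch of the merge-by-gene version. The helper names (read_counts, merge_counts) and the two demo files are made up for illustration, and I am assuming each file has two tab-separated columns (gene, count) with no header, like above.

```r
# Hypothetical helper: read one two-column count file (gene, count), no header.
# The count column is named after the file, minus the .txt extension.
read_counts <- function(path) {
  sample_name <- sub("\\.txt$", "", basename(path))
  read.table(path, header = FALSE, sep = "\t",
             col.names = c("gene", sample_name),
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Hypothetical helper: merge any number of count files on the gene column.
# all = TRUE keeps genes that are missing from some files (they get NA).
merge_counts <- function(paths) {
  tables <- lapply(paths, read_counts)
  merged <- Reduce(function(x, y) merge(x, y, by = "gene", all = TRUE), tables)
  rownames(merged) <- merged$gene
  merged[, -1, drop = FALSE]
}

# Tiny demo with two made-up files whose genes are in different orders
demo_dir <- file.path(tempdir(), "demo_counts")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("geneA\t5", "geneB\t0", "geneC\t12"), file.path(demo_dir, "GSM_demo1.txt"))
writeLines(c("geneC\t7", "geneA\t3", "geneB\t1"), file.path(demo_dir, "GSM_demo2.txt"))

demo <- merge_counts(list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE))
print(demo)
```

For the real data you would call merge_counts(files) on the file list built above. It is slower than cbind, but the counts stay attached to the right gene even if the files are ordered differently.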
You can then plug these counts into DESeq2 or edgeR. You will probably need to build a matching metadata table so you can set up your comparisons and generate a list of differentially expressed genes by following the DESeq2 workflow.
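As a rough idea of what that metadata could look like, here is a minimal sketch. The sample names, the groupA/groupB labels, and the tiny count matrix are all made up; you would swap in your real GSM names and the conditions listed on the GEO page. The DESeq2 step only runs if the package is installed.

```r
# Hypothetical sample names and groups; replace with the real GSM names
# and the conditions from the GEO series page
sample_names <- c("GSM_demo1", "GSM_demo2", "GSM_demo3", "GSM_demo4")
coldata <- data.frame(
  condition = factor(c("groupA", "groupA", "groupB", "groupB"),
                     levels = c("groupA", "groupB")),  # first level is the reference
  row.names = sample_names
)

# Made-up integer count matrix, only so the sketch runs end to end
counts <- matrix(as.integer(c(5, 0, 12, 3, 1, 7, 20, 4, 9, 18, 2, 11)),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), sample_names))

# The columns of the counts must be in the same order as the rows of the metadata
stopifnot(identical(colnames(counts), rownames(coldata)))

# Build the DESeq2 object only if DESeq2 is available
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
                                        colData = coldata,
                                        design = ~ condition)
}
```

From there, the DESeq2 workflow continues with the dds object on your real counts.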
Ideally though, you may want to disregard everything I typed above this, because I think it could be in your best interest to do what rpolicastro was mentioning in the first comment here:
Alternatively, GEO provides links to the accompanying SRA entry
containing the fastq files for those samples. With the fastq files you
can run through a workflow such as Salmon + DESeq2 to find
differentially expressed genes.
which is to download the raw FASTQ files and then plug them into Salmon + DESeq2. You have much more control over everything that way, in my opinion, and I personally like to be in control... It may mean jumping through a few more hoops, like installing conda and also snakemake if you use the tutorial rpolicastro linked. It's not too bad though.
To download the fastq files, I personally use the sra-explorer website and aspera (aspera lets you download fastq files much faster): sra-explorer : find SRA and FastQ download URLs in a couple of clicks
You could google how to download and install aspera... or check out just Step 1 of this tutorial: [Deprecated] Fast download of FASTQ files from the European Nucleotide Archive (ENA). Mind you, there is a newer version of aspera out now, so some of the steps might be a bit different, but if you want to go down this route, this should get you started. (Remember, you only need Step 1 in this tutorial.)
EDIT 08.27.2021: Basically, if you want to go the fastq route, download and install aspera, add it to your $PATH, and then use sra-explorer to get the aspera download links/commands. You can save the aspera download commands to a .sh file and run it in the terminal to quickly download all of the fastq files you need.
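If you want to stay in R for this, here is a rough sketch of the "save the commands to a .sh file and run it" step. The echo lines are placeholders I made up so the sketch runs anywhere; you would paste in the real download commands that sra-explorer gives you. The script name download_fastq.sh is also just an example.

```r
# Check whether the aspera command-line client (ascp) is on your PATH
ascp_found <- nzchar(Sys.which("ascp"))
if (!ascp_found) message("ascp not found on PATH; install aspera and add it to $PATH first")

# Hypothetical placeholder commands; replace with the ones copied from sra-explorer
download_commands <- c(
  "echo 'replace with the first download command from sra-explorer'",
  "echo 'replace with the second download command from sra-explorer'"
)

# Write the commands to a shell script (here in a temp folder) and make it executable
script_path <- file.path(tempdir(), "download_fastq.sh")
writeLines(c("#!/usr/bin/env bash", "set -e", download_commands), script_path)
Sys.chmod(script_path, mode = "0755")

# Run it from R; running `bash download_fastq.sh` in a terminal does the same thing
exit_status <- system2("bash", script_path)
```

The set -e line makes the script stop at the first failed download instead of carrying on silently.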