Hello, I am a complete novice when it comes to DESeq2. Is there any way to extract data from various GSEXXXXX RNA-seq experiments in order to run a DESeq2 analysis on them? Right now I have a list of GSE accession numbers, all of them RNA-seq experiments. How would I go about using this list to prepare the data for DESeq2? I've tried the getGEO() function, which returns a series_matrix.txt file, but I am not sure where to go from there. From what I've seen in other posts, this is the wrong file to use for DESeq2, and you instead need the raw counts available through the getGEOSuppFiles() function. If that is the case, my problem is that in the post discussing this, the supplementary file is in CSV format, whereas when I run this function on my list of GSE accession numbers, most of the results are .tar files containing numerous .cel files. How would I go about preparing this data for DESeq2 analysis? I apologize if I got this all wrong, but I cannot seem to find any comprehensive answer on how to run DESeq2 starting from a GSE accession number.
OK, so instead I have opted to use the download script provided by ARCHS4 to generate my own raw count files. The problem is that the column names are all GSM accession numbers rather than the WT or Treatment labels I am trying to get. As of now, the only way I can tell WT samples from Treatment samples is by calling getGEO() and reading the !Sample_title annotations. Do you happen to know an easier way to get the GSM sample annotations?
I would probably just create a named vector mapping the GSM accession numbers to their annotations (e.g. `c(GSM1223456="WT", GSM987654="Treatment", ...)`) and use `match` with the column names to rename the columns. This Stack Overflow post provides a clear example of how to do that.
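For what it's worth, here is a minimal sketch of that renaming step in base R. The count matrix, the gene names, the third accession (GSM555555), and the WT/Treatment assignments are all made up for illustration; in practice `counts` would be the matrix produced by the ARCHS4 script and `annotations` would be filled in from the sample titles.

```r
# Toy count matrix standing in for the ARCHS4 output (hypothetical values);
# the columns are named by GSM accession, as in the real file.
counts <- matrix(
  c(10L, 0L, 5L, 12L, 3L, 7L, 8L, 1L, 6L),
  nrow = 3,
  dimnames = list(
    c("geneA", "geneB", "geneC"),
    c("GSM987654", "GSM1223456", "GSM555555")
  )
)

# Hypothetical lookup: GSM accession -> condition label.
annotations <- c(GSM1223456 = "WT", GSM987654 = "Treatment", GSM555555 = "WT")

# For each column name, find its position among the names of the lookup.
idx <- match(colnames(counts), names(annotations))

# Stop early if any column has no annotation, rather than silently getting NA.
stopifnot(!anyNA(idx))

# Condition labels, now in the same order as the columns of `counts`.
condition <- unname(annotations[idx])

# Column names must be unique, so repeated labels get a numeric suffix.
colnames(counts) <- make.unique(condition, sep = "_")

# Sample table that keeps the original accession next to each new name.
coldata <- data.frame(
  gsm = names(annotations)[idx],
  condition = factor(condition, levels = c("WT", "Treatment")),
  row.names = colnames(counts)
)
```

Keeping `coldata` alongside the renamed matrix means the GSM accessions are not lost, and a table like this (one row per column of the count matrix, in the same order) is the kind of sample table that `DESeqDataSetFromMatrix()` takes as its `colData` argument, with something like `design = ~ condition`.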