Preparing GSE data for DESeq2 Analysis
3.7 years ago

Hello, I am a complete novice with DESeq2. I was wondering whether there is any way to extract data from various GSEXXXXX RNA-seq experiments in order to run DESeq2 on them. Right now I have a list of GSE accession numbers that are all RNA-seq specific. How would I go about using this list to prepare the data for DESeq2 analysis? I've tried the getGEO() function, which returns a series_matrix.txt file, but I am not sure where to go from here. From what I've seen in other posts, this is the wrong file to use for DESeq2, and instead you need the raw counts available through the getGEOSuppFiles() function. If so, my problem is that in the post discussing this, the supplementary file is in CSV format, whereas when I run this function on my list of GSE accessions, most of the results are .tar files containing numerous .cel files. How would I go about preparing this data for DESeq2 analysis? I apologize if I got this all wrong, but I cannot seem to find a comprehensive answer on how to run DESeq2 starting from a GSE accession number.

GEO DESEQ2 RNA-Seq • 1.6k views
3.7 years ago

CEL files are not RNA-seq data. They are raw output from RNA microarrays, which are better processed with limma and cannot be directly compared to RNA-seq data.

Additionally, there will be serious batch/technical effects when trying to compare data generated by different people in different labs at different institutions using different methods. Integrating just two datasets that are supposedly identical but were prepped in different labs is challenging - with more than that, any results you get should be taken with several grains of salt. If possible, a better approach would be to perform your differential expression analyses within each dataset and compare the results between them, though this assumes the samples were all sorted similarly, the experimental setup was similar, etc.

Your approach is correct though - getGEO and getGEOSuppFiles are typically the way to go, depending on how the GEO record is organized. Note that getGEOSuppFiles only downloads the files; you then have to import them yourself as needed. If the submitter provided gene counts (from salmon, kallisto, htseq, etc.), then this process is pretty straightforward and prepping the files for DESeq2 isn't a difficult task. Otherwise, you will have to download the raw data (FASTQ or BAM files) and generate the counts yourself.
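As a rough illustration of the "prepping" step, here is a minimal sketch that assumes a genes-by-samples count table is already in hand. The toy data frame stands in for a supplementary file fetched with getGEOSuppFiles(); the GSM IDs, gene names, counts, and WT/Treatment labels are all made up for the example, and real supplementary files vary a lot in layout. The DESeq2 call is guarded so the rest runs even without the package installed.

```r
# Toy stand-in for a supplementary counts table (genes x samples).
# In practice this would be read from the downloaded file, e.g. with
# read.csv(<path to file>, row.names = 1); the values here are invented.
counts_df <- data.frame(
  GSM0000001 = c(10L, 0L, 250L),
  GSM0000002 = c(12L, 1L, 300L),
  GSM0000003 = c(40L, 0L, 120L),
  GSM0000004 = c(35L, 2L, 110L),
  row.names = c("geneA", "geneB", "geneC")
)

# DESeq2 expects a matrix of raw (non-normalized) integer counts.
counts <- as.matrix(counts_df)
storage.mode(counts) <- "integer"

# Sample table: one row per column of the count matrix.
# The condition labels are hypothetical; WT is set as the reference level.
coldata <- data.frame(
  condition = factor(c("WT", "WT", "Treatment", "Treatment"),
                     levels = c("WT", "Treatment")),
  row.names = colnames(counts)
)

# Row names of coldata must line up with the column names of counts.
stopifnot(identical(rownames(coldata), colnames(counts)))

# Build the DESeq2 object only if the package is available.
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = counts,
    colData   = coldata,
    design    = ~ condition
  )
  # With real data, DESeq2::DESeq(dds) would follow; the toy matrix
  # above is far too small for that to be meaningful.
}
```

The part that most often goes wrong is the order check: the sample table and the count matrix columns have to be in the same order before they are handed to DESeq2.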


OK, so instead I have opted to use the download script provided by ARCHS4 to generate my own raw count files. The problem with this is that the column names are all GSM accession numbers instead of the WT or Treatment labels I am trying to get. As of now, the only way I can differentiate WT from Treatment samples is through the getGEO() function and the !Sample_title annotations. Do you happen to know an easier way to get the GSM sample annotations?


I would probably just create a named vector of the GSM accession numbers and the annotations (e.g. c(GSM1223456="WT", GSM987654="Treatment", etc.)) and use `match` with the column names to rename the columns. This stack overflow post provides a clear example of how to do that.
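A minimal base-R sketch of that idea, with made-up GSM IDs and a toy count matrix (nothing here comes from a real GEO record):

```r
# Toy count matrix whose columns are GSM accessions (hypothetical IDs),
# deliberately in a different order than the annotation vector below.
counts <- matrix(
  1:8, nrow = 2,
  dimnames = list(c("geneA", "geneB"),
                  c("GSM0000003", "GSM0000001", "GSM0000002", "GSM0000004"))
)

# Named vector: names are GSM accessions, values are the sample labels.
annot <- c(GSM0000001 = "WT", GSM0000002 = "WT",
           GSM0000003 = "Treatment", GSM0000004 = "Treatment")

# match() returns, for each column name, its position in names(annot).
idx <- match(colnames(counts), names(annot))
stopifnot(!anyNA(idx))  # every column must have an annotation

# Labels in the same order as the columns of the count matrix.
labels <- unname(annot[idx])

# Column names should stay unique, so repeated labels get a numeric suffix.
colnames(counts) <- make.unique(labels, sep = "_")
print(colnames(counts))
```

For DESeq2 itself it may be simpler to leave the GSM IDs as column names and put `labels` into a `condition` column of the sample table, since the design formula works from that table rather than from the column names.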
