Question

Preparing GSE data for DESeq2 Analysis

3.7 years ago

thewisechineseguru • 0

Hello, I am a complete novice with DESeq2. I was wondering whether there is any way to extract data from various GSEXXXXX RNA-seq experiments in order to run DESeq2 on them. Right now I have a list of GSE accession numbers that are all RNA-seq specific. How would I go about using this list to prepare the data for DESeq2 analysis? I've tried the getGEO() function, which returns a series_matrix.txt file, but I am not sure where to go from here. From what I've seen in other posts, this is the wrong file to use for DESeq2, and you instead need the raw counts available through the getGEOSuppFiles() function. If that is the case, my question is this: in the post discussing it, the supplementary file is in CSV format, but when I run this function on my list of GSE accession numbers, most of the results are .tar files containing numerous .cel files. How would I go about preparing this data for DESeq2 analysis? I apologize if I got this all wrong, but I cannot seem to find any comprehensive answer on how to run DESeq2 starting from a GSE accession number.

GEO DESEQ2 RNA-Seq • 1.6k views

updated 3.7 years ago by jared.andrews07 ★ 16k • written 3.7 years ago by thewisechineseguru • 0

score 1 · Answer 1 · 2020-07-28

CEL files are not RNA-seq data. They come from Affymetrix expression microarrays, which are better processed with limma, and their values cannot be directly compared to RNA-seq counts.
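One quick way to screen a list of accessions is to look at the names of the supplementary files before trying to load anything. The sketch below is only a heuristic: `classify_supp_files()` is a hypothetical helper (not part of GEOquery), and the file-name patterns are assumptions about how submitters commonly name their files, so treat the result as a hint and confirm against the GEO record.

```r
# Hypothetical helper: guess the data type from supplementary file names.
# For a downloaded .tar archive, the member names can be listed with
# utils::untar(path, list = TRUE) and passed in here.
classify_supp_files <- function(file_names) {
  lower <- tolower(file_names)
  if (any(grepl("\\.cel(\\.gz)?$", lower))) {
    return("microarray")        # CEL files: microarray data, not for DESeq2
  }
  if (any(grepl("count", lower))) {
    return("possible_counts")   # may hold a count table; inspect before use
  }
  "unknown"                     # check the GEO record by hand
}

# Example file names are invented for illustration.
classify_supp_files(c("GSM0000001_sampleA.CEL.gz", "GSM0000002_sampleB.CEL.gz"))
classify_supp_files("GSE000000_raw_counts.csv.gz")
```

Accessions flagged as "microarray" can be set aside for a limma workflow, leaving the rest to be inspected for count tables.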

Additionally, there will be serious batch/technical effects when trying to compare data generated by different people in different labs at different institutions using different methods. Integrating just two datasets that are supposedly identical but were prepped in different labs is challenging. With more than that, any results you get should be taken with several grains of salt. If possible, a better approach would be to perform your differential expression analyses within each dataset and compare the results between them, though this assumes the samples were all sorted similarly, the experimental setup was similar, etc.
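The "analyze each dataset separately, then compare the results" idea can be sketched roughly as follows. The two small tables stand in for per-dataset result tables (for example, what `as.data.frame(results(dds))` gives in DESeq2, with gene IDs moved into a column). The gene names, fold changes, and adjusted p-values are invented, and the 0.05 cutoff is just a common convention, not a recommendation from this thread.

```r
# Toy stand-ins for two per-dataset result tables (all values invented).
res_a <- data.frame(gene = c("g1", "g2", "g3", "g4"),
                    log2FoldChange = c(2.1, -1.5, 0.2, 1.0),
                    padj = c(0.001, 0.010, 0.800, 0.030))
res_b <- data.frame(gene = c("g1", "g2", "g3", "g5"),
                    log2FoldChange = c(1.8, 1.2, -0.1, 2.0),
                    padj = c(0.004, 0.020, 0.900, 0.001))

# Keep only genes that were tested in both datasets.
both <- merge(res_a, res_b, by = "gene", suffixes = c("_a", "_b"))

# Significant in both datasets at the chosen cutoff.
alpha <- 0.05
both$sig_both <- both$padj_a < alpha & both$padj_b < alpha

# Same direction of change in both datasets.
both$same_direction <-
  sign(both$log2FoldChange_a) == sign(both$log2FoldChange_b)

# Genes that replicate: significant in both and changing the same way.
# which() drops rows where padj is NA, as can happen in real results.
replicated <- both$gene[which(both$sig_both & both$same_direction)]
print(replicated)
```

Comparing at the level of results sidesteps merging raw counts across labs, but it still inherits the caveat above: the contrasts in each dataset need to mean the same thing.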

Your approach is correct, though: getGEO and getGEOSuppFiles are typically the way to go, depending on how the GEO record is organized. Note that getGEOSuppFiles only downloads the files; you then import them yourself as needed. If the submitter provided gene counts (from salmon, kallisto, htseq, etc.), the process is fairly straightforward, and prepping the files for DESeq2 isn't a difficult task. Otherwise, you will have to download the raw data (FASTQ or BAM files) and generate the counts yourself.
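As a rough sketch of the straightforward case, where the submitter uploaded a gene-by-sample count table, the steps might look like the following. `getGEOSuppFiles()`, `DESeqDataSetFromMatrix()`, `DESeq()`, and `results()` are real GEOquery/DESeq2 functions; the wrapper names, the assumed file layout (first column gene IDs, one column per sample), and the `condition` labels are all hypothetical, and real supplementary files vary a lot. Only the file-reading step is actually run here, on a tiny invented table.

```r
# Step 1 (defined, not run here): download supplementary files for one GSE.
# getGEOSuppFiles() returns a data frame whose row names are the local paths.
download_supp <- function(gse_id) {
  files <- GEOquery::getGEOSuppFiles(gse_id)
  rownames(files)
}

# Step 2: read a count table into an integer matrix.
# Assumed layout: first column = gene IDs, other columns = one per sample.
read_count_matrix <- function(path, sep = ",") {
  tab <- read.table(path, header = TRUE, sep = sep, row.names = 1,
                    check.names = FALSE)
  counts <- round(as.matrix(tab))   # DESeq2 expects whole-number counts
  storage.mode(counts) <- "integer"
  counts
}

# Step 3 (defined, not run here): build the DESeq2 object, test one factor.
# `condition` is a hypothetical vector with one label per sample column.
run_deseq <- function(counts, condition) {
  col_data <- data.frame(condition = factor(condition),
                         row.names = colnames(counts))
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
                                        colData = col_data,
                                        design = ~ condition)
  DESeq2::results(DESeq2::DESeq(dds))
}

# Tiny invented table, written to a temp file to show step 2 on its own.
tmp <- tempfile(fileext = ".csv")
writeLines(c("gene,s1,s2,s3,s4",
             "geneA,10,12,30,28",
             "geneB,0,1,0,2"), tmp)
counts <- read_count_matrix(tmp)
print(counts)
```

With a real accession, the sample labels would come from the GEO sample metadata rather than being typed by hand. For salmon or kallisto quantification output (as opposed to an already assembled count table), the tximport package is the usual way into DESeq2 rather than rounding by hand.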