Question

Questions about supplementary file contents of a GEO dataset

16 months ago

Josh ▴ 20

Hi all, I am interested in the supplementary files from the GEO dataset: GSE145840.

However, I have some questions about the contents of the files.

Do those files contain the raw counts in tables?
The tables in the supplementary files of that GEO dataset have no column names. From your experience, do you know what each column corresponds to? (The first one is the gene name, but what do the second and third ones refer to?) Or do you know where in the dataset I might find information about those columns? For example, here are 2 supplementary files with different numbers of columns:

From GSM4337084_BAT_105.txt:

Col1	Col2
4933401J01Rik	0
Gm26206	0
Xkr4	4

From GSM4337104_LIVER-HFD1.txt:

Col1	Col2	Col3
4933401J01Rik	1070	0
Gm26206	110	0
Xkr4	6094	1

Thanks for your time and help :)

GEO • 832 views

I recommend using the raw data from here for your analysis.

16 months ago by DareDevil ★ 4.3k

score 3 · Accepted Answer · 2022-11-29

I agree. It's frustrating that there's no direct description of the supplementary files; the record simply says STAR alignment, featureCounts, gene-level extraction. The three-column files have geneID, geneLength, readCount, while the two-column files have only geneID, readCount. The two-column files also have 91 more genes counted than the three-column files (for the few I looked at). I'd guess they may have been aligned by different people, or with slightly different transcriptome versions. Or maybe whoever assembled the submission changed their process along the way (including or excluding the alignment target length from the file).
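To make the two layouts concrete, here is a minimal Python sketch that reads either kind of file into a gene-to-count mapping. It assumes the column meanings guessed above (the read count is always the last column) and that the files are tab-separated with no header row; neither is documented in the GEO record. The function name `read_counts` is made up for illustration, and the rows are the ones quoted in the question.

```python
import csv
import io


def read_counts(handle):
    """Return {gene_id: read_count} from a headerless tab-separated table.

    Assumed layouts (inferred in this thread, not documented by GEO):
      2 columns: geneID, readCount
      3 columns: geneID, geneLength, readCount
    """
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) not in (2, 3):
            raise ValueError(f"unexpected column count: {row!r}")
        # In both assumed layouts the read count is the last column.
        counts[row[0]] = int(row[-1])
    return counts


# Rows copied from the two example files in the question.
bat = read_counts(io.StringIO("4933401J01Rik\t0\nGm26206\t0\nXkr4\t4\n"))
liver = read_counts(
    io.StringIO("4933401J01Rik\t1070\t0\nGm26206\t110\t0\nXkr4\t6094\t1\n")
)
```

For real files you would pass an open file handle, for example `read_counts(open("GSM4337084_BAT_105.txt", newline=""))`.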

This is a good case for simply downloading the raw data and aligning and counting it yourself, especially if you have to analyze the entire data set, given that not all the files have counts for the same genes. On the other hand, for a quick analysis you could extract the counts for the common gene set, but do some QC to make sure the data sets are comparable.
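The quick route could look roughly like the sketch below: keep only genes present in every sample, then check how much of each sample's total count survives that filter as one simple comparability check. The helper names and the extra gene `HypotheticalGene` are invented for illustration; the other counts are the read counts from the question's example rows, assuming the last column is the count.

```python
def common_gene_matrix(samples):
    """samples: {sample_name: {gene_id: count}}.

    Returns (sample_names, {gene_id: [count per sample]}) restricted to
    genes that appear in every sample.
    """
    if not samples:
        return [], {}
    names = sorted(samples)
    shared = set.intersection(*(set(samples[n]) for n in names))
    matrix = {g: [samples[n][g] for n in names] for g in sorted(shared)}
    return names, matrix


def retained_fraction(samples):
    """Fraction of each sample's total count that falls in the common gene set."""
    names, matrix = common_gene_matrix(samples)
    fractions = {}
    for i, name in enumerate(names):
        total = sum(samples[name].values())
        kept = sum(row[i] for row in matrix.values())
        fractions[name] = kept / total if total else float("nan")
    return fractions


samples = {
    "BAT_105": {"4933401J01Rik": 0, "Gm26206": 0, "Xkr4": 4,
                "HypotheticalGene": 7},  # invented gene, absent elsewhere
    "LIVER-HFD1": {"4933401J01Rik": 0, "Gm26206": 0, "Xkr4": 1},
}
names, matrix = common_gene_matrix(samples)
fractions = retained_fraction(samples)
```

A sample that loses a large share of its counts after the intersection is a hint that the gene sets differ in more than a handful of low-count genes, in which case recounting from the raw data is probably the safer option.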