Question

Parsing issues with single cell file from GEO

0


2.7 years ago

iddryg • 0

Update: I solved my own problem, solution at the bottom.

Hi, I'm trying to load a tab-delimited text file of single cell data from here:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120575

(The file is: GSE120575_Sade_Feldman_melanoma_single_cells_TPM_GEO.txt.gz)

I unzipped the file and loaded the .txt file into R using read.delim() and read_tsv(). However, I noticed some cells (columns) are missing.

I'm trying to subset just the CD8+ T-cells from the dataset using the cell IDs from their supplementary data, Table S2, which has a sheet with all the CD8+ T-cell IDs. (paper and supp data here: https://www.sciencedirect.com/science/article/pii/S0092867418313941?via%3Dihub#app2 )

Oddly, some of the CD8+ T-cells are not present in the single cell dataframe. The number of cells loaded in the single cell dataframe is 16292, the number of T-cells from the supplement is 6350, and the number of those T-cell IDs found in the single cell dataframe is 5410. Apparently some cells (columns) were just not loaded.

I went back to check the original txt file, and the missing T-cell IDs are in the txt file. This makes me think there's an issue with loading/parsing the txt file and some columns were dropped.

Any ideas for correctly loading this txt file so the columns aren't dropped and I can access all of the T-cells? Here is my code.

# Import single cell data
library(readr)  # provides read_tsv()
hacohen.data <- read.delim("GSE120575_Sade_Feldman_melanoma_single_cells_TPM_GEO.txt")
hacohen.data2 <- read_tsv("GSE120575_Sade_Feldman_melanoma_single_cells_TPM_GEO.txt")
dim(hacohen.data) # 55738 16292
dim(hacohen.data2) # 55738 16292

# Import csv with the Good and Bad identities of the CD8 cells
CD8_GB_idents <- read.csv(file = 'Cd8_GB_idents.csv')
head(CD8_GB_idents)

  Cell.Name Cluster
1 A2_P4_M11   CD8_G
2 A4_P3_M11   CD8_B
3 A4_P4_M11   CD8_B
4 A4_P6_M11   CD8_G
5 A6_P6_M11   CD8_B
6 A7_P2_M11   CD8_G

# Does the single cell dataset contain all of the CD8 cell labels?
length(CD8_GB_idents$Cell.Name) # 6350
length(colnames(hacohen.data)) # 16292
length(intersect(CD8_GB_idents$Cell.Name, colnames(hacohen.data))) # 5410

# Looks like not all of the CD8 cell labels are in the single cell dataset...
# Which ones are and aren't?
in_data <- CD8_GB_idents$Cell.Name %in% colnames(hacohen.data)
common_cells <- CD8_GB_idents[in_data, ]
missing_cells <- CD8_GB_idents[!in_data, ]

head(common_cells)
  Cell.Name Cluster
1 A2_P4_M11   CD8_G
2 A4_P3_M11   CD8_B
3 A4_P4_M11   CD8_B
4 A4_P6_M11   CD8_G
5 A6_P6_M11   CD8_B
6 A7_P2_M11   CD8_G

head(missing_cells)

                Cell.Name Cluster
1453 A10_P3_MMD1-84A_L001   CD8_G
1454 A11_P2_MMD1-84A_L001   CD8_G
1455  A2_P1_MMD1-84A_L001   CD8_B
1456  A2_P2_MMD1-84A_L001   CD8_G
1457  A2_P3_MMD1-84A_L001   CD8_B
1458  A2_P4_MMD1-84A_L001   CD8_G

cell single parsing R • 909 views

ADD COMMENT • link updated 2.7 years ago by Kevin Blighe 87k • written 2.7 years ago by iddryg • 0

score 1 · Accepted Answer · 2021-07-28

1


2.7 years ago

iddryg • 0

Okay, as I wrote this out I realized the issue. By default, read.delim() turns hyphens in column names into periods (with check.names=TRUE, the header is passed through make.names()), which is why the names didn't match. I learned that read.delim() is just a wrapper for read.table(), and you can disable the hyphen-to-period conversion using read.table(......, check.names=FALSE).

see: https://stackoverflow.com/questions/25471567/how-to-prevent-read-table-from-changing-underscores-and-hyphens-to-dots
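To illustrate, here is a minimal sketch on a toy file. The gene names, values, and the two-cell header are made up for illustration; only the hyphenated cell ID format is taken from the missing cells listed in the question.

```r
# Toy tab-delimited file with one hyphenated cell ID in the header (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("gene\tA2_P4_M11\tA10_P3_MMD1-84A_L001",
             "GeneA\t1.5\t0",
             "GeneB\t0\t2.25"), tmp)

# Default behaviour: check.names = TRUE makes the column names syntactically
# valid, so the hyphen becomes a period
mangled <- read.delim(tmp)
colnames(mangled)

# check.names = FALSE keeps the header exactly as written in the file
intact <- read.delim(tmp, check.names = FALSE)
colnames(intact)

# Matching against the original cell ID only works for the second version
"A10_P3_MMD1-84A_L001" %in% colnames(mangled)  # FALSE
"A10_P3_MMD1-84A_L001" %in% colnames(intact)   # TRUE
```

Columns with hyphens in their names then have to be accessed with backticks or by string, e.g. intact[["A10_P3_MMD1-84A_L001"]].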

I guess I will just post this anyway, in case someone else runs into a similar problem. Cheers.

ADD COMMENT • link 2.7 years ago by iddryg • 0

0


For big files, you may also want to add data.table::fread() to your repertoire of functions. By default it reads data into a data.table, but you can read into a standard data.frame via fread(..., data.table = FALSE).
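A minimal sketch of that suggestion, assuming the data.table package is installed. The toy file below is made up for illustration; fread() does not check column names by default (its check.names argument defaults to FALSE), so the hyphenated cell IDs should survive as-is.

```r
library(data.table)

# Toy tab-delimited file with a hyphenated cell ID in the header (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("gene\tA2_P4_M11\tA10_P3_MMD1-84A_L001",
             "GeneA\t1.5\t0",
             "GeneB\t0\t2.25"), tmp)

# data.table = FALSE returns a plain data.frame instead of a data.table
df <- fread(tmp, sep = "\t", header = TRUE, data.table = FALSE)

class(df)
colnames(df)
```

For the real GEO file the same call would take the path to the unzipped .txt file in place of tmp.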

ADD REPLY • link 2.7 years ago by Kevin Blighe 87k