Update: I solved my own problem, solution at the bottom.
Hi, I'm trying to load a tab-delimited text file of single cell data from here:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120575
(The file is: GSE120575_Sade_Feldman_melanoma_single_cells_TPM_GEO.txt.gz)
I unzipped the file and loaded the .txt file into R using read.delim() and read_tsv(). However, I noticed some cells (columns) are missing.
I'm trying to subset just the CD8+ T-cells from the dataset using the cell IDs from their supplementary data, Table S2, which has a sheet with all the CD8+ T-cell IDs. (paper and supp data here: https://www.sciencedirect.com/science/article/pii/S0092867418313941?via%3Dihub#app2 )
Oddly, some of the CD8+ T-cells are not present in the single cell dataframe. The number of cells loaded in the single cell dataframe is 16292, the number of T-cells from the supplement is 6350, and the number of T-cell IDs found in the single cell dataframe is 5410. Apparently some cells (columns) were not loaded.
I went back to check the original txt file, and the missing T-cell IDs are in the txt file. This makes me think there's an issue with loading/parsing the txt file and some columns were dropped.
Any ideas for correctly loading this txt file so the columns aren't dropped and I can access all of the T-cells? Here is my code.
# Import single cell data
hacohen.data <- read.delim("GSE120575_Sade_Feldman_melanoma_single_cells_TPM_GEO.txt")
library(readr) # provides read_tsv()
hacohen.data2 <- read_tsv("GSE120575_Sade_Feldman_melanoma_single_cells_TPM_GEO.txt")
dim(hacohen.data) # 55738 16292
dim(hacohen.data2) # 55738 16292
# Import csv with the Good and Bad identities of the CD8 cells
CD8_GB_idents <- read.csv(file = 'Cd8_GB_idents.csv')
head(CD8_GB_idents)
Cell.Name Cluster
1 A2_P4_M11 CD8_G
2 A4_P3_M11 CD8_B
3 A4_P4_M11 CD8_B
4 A4_P6_M11 CD8_G
5 A6_P6_M11 CD8_B
6 A7_P2_M11 CD8_G
# Does the single cell dataset contain all of the CD8 cell labels?
length(CD8_GB_idents$Cell.Name) # 6350
length(colnames(hacohen.data)) # 16292
length(intersect(CD8_GB_idents$Cell.Name, colnames(hacohen.data))) # 5410
# Looks like not all of the CD8 cell labels are in the single cell dataset...
# Which ones are and aren't?
common_cells <- CD8_GB_idents[CD8_GB_idents$Cell.Name %in% colnames(hacohen.data), ]
missing_cells <- CD8_GB_idents[!(CD8_GB_idents$Cell.Name %in% colnames(hacohen.data)), ]
head(common_cells)
Cell.Name Cluster
1 A2_P4_M11 CD8_G
2 A4_P3_M11 CD8_B
3 A4_P4_M11 CD8_B
4 A4_P6_M11 CD8_G
5 A6_P6_M11 CD8_B
6 A7_P2_M11 CD8_G
head(missing_cells)
Cell.Name Cluster
1453 A10_P3_MMD1-84A_L001 CD8_G
1454 A11_P2_MMD1-84A_L001 CD8_G
1455 A2_P1_MMD1-84A_L001 CD8_B
1456 A2_P2_MMD1-84A_L001 CD8_G
1457 A2_P3_MMD1-84A_L001 CD8_B
1458 A2_P4_MMD1-84A_L001 CD8_G
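One pattern worth checking in the output above: every missing ID shown contains a hyphen (MMD1-84A), while the IDs that matched do not. read.delim() defaults to check.names = TRUE, which runs column names through make.names() and replaces characters such as "-" with ".". The sketch below uses a small made-up table (the values are hypothetical, only the ID format is taken from the output above) to show the effect and how check.names = FALSE avoids it. Whether this explains all of the missing cells in the real file is an assumption that needs to be verified against the data.

```r
# Toy tab-delimited input: the numbers are made up, only the ID style comes from the post
toy <- "gene\tA2_P4_M11\tA10_P3_MMD1-84A_L001\nGENE1\t1.5\t0\nGENE2\t0\t2.3\n"

# Default behaviour: check.names = TRUE passes headers through make.names()
mangled <- read.delim(text = toy)
colnames(mangled)  # the hyphen in the third name becomes a dot

# check.names = FALSE keeps the headers exactly as written in the file
intact <- read.delim(text = toy, check.names = FALSE)
colnames(intact)

# Look up the original IDs in both versions
ids <- c("A2_P4_M11", "A10_P3_MMD1-84A_L001")
sum(ids %in% colnames(mangled))  # 1: the hyphenated ID no longer matches
sum(ids %in% colnames(intact))   # 2: both IDs match
```

If this is the cause, re-reading the real file with check.names = FALSE, or applying make.names() to the IDs from the supplement before matching, should make the counts agree.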
For big files, you may also want to add data.table::fread() to your repertoire of functions. By default it reads the data into a data.table, but you can get a standard data frame via fread(..., data.table = FALSE).
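A minimal sketch of the two return types, assuming the data.table package is installed. The toy input is made up for illustration. For the real file you would pass the path to the unzipped .txt instead of text =.

```r
library(data.table)

# Toy tab-delimited input (hypothetical values)
toy <- "gene\tA2_P4_M11\tA10_P3_MMD1-84A_L001\nGENE1\t1.5\t0\nGENE2\t0\t2.3\n"

# Default: fread() returns a data.table
dt <- fread(text = toy, sep = "\t")
class(dt)

# data.table = FALSE returns a plain data.frame instead
df <- fread(text = toy, sep = "\t", data.table = FALSE)
class(df)

# fread() defaults to check.names = FALSE, so headers are kept as written
colnames(df)
```

Because fread() leaves column names untouched by default, IDs containing characters such as "-" can be matched directly against an external ID list.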