Update: I solved my own problem, solution at the bottom.
Hi, I'm trying to load a tab-delimited text file of single cell data from here:
(The file is: GSE120575_Sade_Feldman_melanoma_single_cells_TPM_GEO.txt.gz)
I unzipped the file and loaded the .txt file into R using read.delim() and read_tsv(). However, I noticed some cells (columns) are missing.
I'm trying to subset just the CD8+ T-cells from the dataset using the cell IDs from their supplementary data, Table S2, which has a sheet with all the CD8+ T-cell IDs. (paper and supp data here: https://www.sciencedirect.com/science/article/pii/S0092867418313941?via%3Dihub#app2 )
Oddly, some of the CD8+ T-cells are not present in the single cell dataframe. The single cell dataframe has 16292 columns, the supplement lists 6350 T-cell IDs, and only 5410 of those IDs appear among the dataframe's column names. Apparently some cells (columns) were just not loaded.
I went back to check the original txt file, and the missing T-cell IDs are in the txt file. This makes me think there's an issue with loading/parsing the txt file and some columns were dropped.
Any ideas for correctly loading this txt file so the columns aren't dropped and I can access all of the T-cells? Here is my code.
# Import single cell data
hacohen.data <- read.delim("GSE120575_Sade_Feldman_melanoma_single_cells_TPM_GEO.txt")
hacohen.data2 <- read_tsv("GSE120575_Sade_Feldman_melanoma_single_cells_TPM_GEO.txt")
dim(hacohen.data)
# 55738 16292
dim(hacohen.data2)
# 55738 16292

# Import csv with the Good and Bad identities of the CD8 cells
CD8_GB_idents <- read.csv(file = 'Cd8_GB_idents.csv')
head(CD8_GB_idents)
  Cell.Name Cluster
1 A2_P4_M11   CD8_G
2 A4_P3_M11   CD8_B
3 A4_P4_M11   CD8_B
4 A4_P6_M11   CD8_G
5 A6_P6_M11   CD8_B
6 A7_P2_M11   CD8_G

# Does the single cell dataset contain all of the CD8 cell labels?
length(CD8_GB_idents$Cell.Name)
# 6350
length(colnames(hacohen.data))
# 16292
length(intersect(CD8_GB_idents$Cell.Name, colnames(hacohen.data)))
# 5410
# Looks like not all of the CD8 cells labels are in the single cell dataset...

# Which ones are and aren't?
common_cells <- CD8_GB_idents[CD8_GB_idents$Cell.Name %in% intersect(CD8_GB_idents$Cell.Name, colnames(hacohen.data)),]
missing_cells <- CD8_GB_idents[!(CD8_GB_idents$Cell.Name %in% intersect(CD8_GB_idents$Cell.Name, colnames(hacohen.data))),]
head(common_cells)
  Cell.Name Cluster
1 A2_P4_M11   CD8_G
2 A4_P3_M11   CD8_B
3 A4_P4_M11   CD8_B
4 A4_P6_M11   CD8_G
5 A6_P6_M11   CD8_B
6 A7_P2_M11   CD8_G
head(missing_cells)
                Cell.Name Cluster
1453 A10_P3_MMD1-84A_L001   CD8_G
1454 A11_P2_MMD1-84A_L001   CD8_G
1455  A2_P1_MMD1-84A_L001   CD8_B
1456  A2_P2_MMD1-84A_L001   CD8_G
1457  A2_P3_MMD1-84A_L001   CD8_B
1458  A2_P4_MMD1-84A_L001   CD8_G
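One thing worth checking, offered as a guess rather than a confirmed diagnosis: every ID in head(missing_cells) contains a hyphen, and read.delim() defaults to check.names = TRUE, which passes column names through make.names() and turns "-" into ".". If that is what happened, the columns were loaded but renamed, so intersect() no longer finds them. The minimal sketch below uses a tiny in-memory table as a stand-in for the real file (the gene name and values are made up for illustration) and only shows the renaming for read.delim(); it says nothing about the read_tsv() result.

```r
# Two IDs copied from the output above: one without and one with a hyphen.
ids <- c("A2_P4_M11", "A10_P3_MMD1-84A_L001")

# make.names() is what check.names = TRUE applies to the header.
mangled <- make.names(ids)
print(mangled)

# Hypothetical stand-in for the real file: one gene row, two cell columns.
tsv <- "gene\tA2_P4_M11\tA10_P3_MMD1-84A_L001\nG1\t1\t2\n"

# Default behaviour: column names are made syntactically valid.
default_names <- colnames(read.delim(text = tsv))

# With check.names = FALSE the header is kept exactly as written.
raw_names <- colnames(read.delim(text = tsv, check.names = FALSE))

print(default_names)
print(raw_names)

# How many of the IDs are found under each naming?
n_found_default <- length(intersect(ids, default_names))
n_found_raw <- length(intersect(ids, raw_names))
```

On the real data, comparing make.names(CD8_GB_idents$Cell.Name) against colnames(hacohen.data) would show whether the 940 "missing" IDs are present under mangled names. If so, re-reading with check.names = FALSE is one way to keep the original IDs.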