Parsing issues with single cell file from GEO
1
0
Entering edit mode
2.7 years ago
iddryg • 0

Update: I solved my own problem, solution at the bottom.

Hi, I'm trying to load a tab-delimited text file of single cell data from here:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120575

(The file is: GSE120575_Sade_Feldman_melanoma_single_cells_TPM_GEO.txt.gz)

I unzipped the file and loaded the .txt file into R using read.delim() and read_tsv(). However, I noticed some cells (columns) are missing.

I'm trying to subset just the CD8+ T-cells from the dataset using the cell IDs from their supplementary data, Table S2, which has a sheet with all the CD8+ T-cell IDs. (paper and supp data here: https://www.sciencedirect.com/science/article/pii/S0092867418313941?via%3Dihub#app2 )

Oddly, some of the CD8+ T-cells are not present in the single cell dataframe. The single cell dataframe has 16292 columns (cells), the supplement lists 6350 CD8+ T-cell IDs, and only 5410 of those IDs are found among the dataframe's column names. Apparently some cells (columns) were just not loaded.

I went back to check the original txt file, and the missing T-cell IDs are in the txt file. This makes me think there's an issue with loading/parsing the txt file and some columns were dropped.

Any ideas for correctly loading this txt file so the columns aren't dropped and I can access all of the T-cells? Here is my code.

# Import single cell data
library(readr)  # provides read_tsv()
hacohen.data <- read.delim("GSE120575_Sade_Feldman_melanoma_single_cells_TPM_GEO.txt")
hacohen.data2 <- read_tsv("GSE120575_Sade_Feldman_melanoma_single_cells_TPM_GEO.txt")
dim(hacohen.data) # 55738 16292
dim(hacohen.data2) # 55738 16292

# Import csv with the Good and Bad identities of the CD8 cells
CD8_GB_idents <- read.csv(file = 'Cd8_GB_idents.csv')
head(CD8_GB_idents)

  Cell.Name Cluster
1 A2_P4_M11   CD8_G
2 A4_P3_M11   CD8_B
3 A4_P4_M11   CD8_B
4 A4_P6_M11   CD8_G
5 A6_P6_M11   CD8_B
6 A7_P2_M11   CD8_G

# Does the single cell dataset contain all of the CD8 cell labels?
length(CD8_GB_idents$Cell.Name) # 6350
length(colnames(hacohen.data)) # 16292
length(intersect(CD8_GB_idents$Cell.Name, colnames(hacohen.data))) # 5410

# Looks like not all of the CD8 cells labels are in the single cell dataset...
# Which ones are and aren't?
common_cells <- CD8_GB_idents[CD8_GB_idents$Cell.Name %in% colnames(hacohen.data), ]
missing_cells <- CD8_GB_idents[!(CD8_GB_idents$Cell.Name %in% colnames(hacohen.data)), ]

head(common_cells)
  Cell.Name Cluster
1 A2_P4_M11   CD8_G
2 A4_P3_M11   CD8_B
3 A4_P4_M11   CD8_B
4 A4_P6_M11   CD8_G
5 A6_P6_M11   CD8_B
6 A7_P2_M11   CD8_G

head(missing_cells)

                Cell.Name Cluster
1453 A10_P3_MMD1-84A_L001   CD8_G
1454 A11_P2_MMD1-84A_L001   CD8_G
1455  A2_P1_MMD1-84A_L001   CD8_B
1456  A2_P2_MMD1-84A_L001   CD8_G
1457  A2_P3_MMD1-84A_L001   CD8_B
1458  A2_P4_MMD1-84A_L001   CD8_G
cell single parsing R • 906 views
2.7 years ago
iddryg • 0

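Okay, as I wrote this out I realized the issue. By default, read.delim() passes the column names through make.names() (check.names = TRUE), which replaces characters that are not valid in R names, such as hyphens, with periods. That is why IDs like A10_P3_MMD1-84A_L001 didn't match. I learned that read.delim() is just a wrapper for read.table(), and you can turn off the hyphen-to-period conversion with read.table(..., check.names = FALSE); the same argument works in read.delim().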

see: https://stackoverflow.com/questions/25471567/how-to-prevent-read-table-from-changing-underscores-and-hyphens-to-dots
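
To make the renaming concrete, here is a minimal sketch on a made-up tab-delimited file. The gene names and values are invented for illustration; only the two cell IDs are taken from the question above.

```r
# Write a tiny tab-delimited file with one plain and one hyphenated cell ID.
# (Toy data: the genes and values are invented for this example.)
tmp <- tempfile(fileext = ".txt")
writeLines(c("gene\tA2_P4_M11\tA10_P3_MMD1-84A_L001",
             "GeneA\t1.5\t0.0",
             "GeneB\t0.0\t2.3"), tmp)

# Default behaviour: check.names = TRUE, so names go through make.names()
mangled <- read.delim(tmp)

# Header kept exactly as written in the file
intact <- read.delim(tmp, check.names = FALSE)

colnames(mangled)  # the hyphen in the last ID becomes a period
colnames(intact)   # the hyphen is preserved

# make.names() shows the same conversion on a single ID
make.names("A10_P3_MMD1-84A_L001")
```

If reloading a large file is inconvenient, another option under the same assumptions is to apply make.names() to the supplementary IDs before matching them against the column names.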

I guess I will just post this anyway, in case someone else runs into a similar problem. Cheers.


For big files, you may also want to add data.table::fread() to your repertoire of functions. By default it returns a data.table, but you can get a standard data frame instead with fread(..., data.table = FALSE).
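
A minimal sketch of that suggestion, assuming the data.table package is installed and again using a made-up toy file rather than the real GEO download:

```r
library(data.table)  # assumption: data.table is installed

# Toy tab-delimited file (invented genes and values, cell IDs from the question)
tmp <- tempfile(fileext = ".txt")
writeLines(c("gene\tA2_P4_M11\tA10_P3_MMD1-84A_L001",
             "GeneA\t1.5\t0.0",
             "GeneB\t0.0\t2.3"), tmp)

# data.table = FALSE returns a plain data.frame instead of a data.table
df <- fread(tmp, sep = "\t", header = TRUE, data.table = FALSE)

class(df)
colnames(df)  # fread() defaults to check.names = FALSE, so hyphens are kept
```

Since fread() leaves column names unchanged by default, the hyphenated cell IDs can be matched directly against the supplementary table.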
