I have a list of genes (long_list) from which I want to filter a list of genes of interest (goi) to retrieve a filtered list of genes (genes_filtered). All gois are expected to be contained in long_list. However, when I filter for the gois, I always retrieve four extra genes.
    genes_filtered <- long_list[long_list$ensembl %in% goi$ensembl,]
Dimensions of the lists:
    dim(long_list)
    #> 15251
    dim(goi)
    #> 11221
    dim(genes_filtered)
    #> 11225   # Should be 11221 (same as goi)
I have tried the following to get to the bottom of this.
Checking for duplicates:
    dim(genes_filtered[duplicated(genes_filtered$ensembl),])
    #> 0
    dim(long_list[duplicated(long_list$ensembl),])
    #> 0
    dim(goi[duplicated(goi$ensembl),])
    #> NULL
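One possible reading of the NULL result is that goi has a single column, in which case row subsetting returns a plain vector and dim() says nothing about duplicates. That is an assumption, not something confirmed here. A minimal sketch on made-up IDs (the toy long_list and goi below are hypothetical stand-ins for the real data) that counts duplicates on the ID vector instead:

```r
# Toy data with hypothetical IDs, standing in for the real long_list and goi
long_list <- data.frame(ensembl = c("G1", "G2", "G2", "G3"), value = 1:4)
goi <- data.frame(ensembl = c("G1", "G2"))

# Counting duplicates on the ID vector does not depend on the data frame's shape
n_dup_long <- sum(duplicated(long_list$ensembl))
n_dup_goi <- sum(duplicated(goi$ensembl))

# A one-column data frame collapses to a vector on row subsetting,
# so dim() returns NULL regardless of how many duplicates there are
dim_dropped <- dim(goi[duplicated(goi$ensembl), ])

# drop = FALSE keeps the data frame, so dim() reports rows and columns
dim_kept <- dim(goi[duplicated(goi$ensembl), , drop = FALSE])
```

On the real data, sum(duplicated(...)) could be run on each of the three ensembl columns to confirm whether the zero counts above hold.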
Checking missing values:
    sum(is.na(genes_filtered))
    #> 0
    sum(is.na(long_list))
    #> 0
    sum(is.na(goi))
    #> 0
Checking values contained in genes_filtered but not in goi:
    # 1: Using lists
    genes_filtered[!(genes_filtered$ensembl %in% goi$ensembl)]
    #> data frame with 0 columns and 11125 rows

    # 2: Extracting columns first from lists
    f <- genes_filtered$ensembl
    g <- goi$ensembl
    g[!(g %in% f)]
    #> "ENSG00000283208" "ENSG00000284292" "ENSG00000262633" ...
Method 2 returns a total of 96 genes, which is not expected.
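As a cross-check, the comparison could be run in both directions and per ID rather than per row. The sketch below uses made-up IDs (the toy long_list and goi are hypothetical, not the real data) and only illustrates the technique. It shows that the row count of a %in% filter can match or exceed nrow(goi) even when some gois are missing, if other IDs occur more than once.

```r
# Toy data with hypothetical IDs, standing in for the real long_list and goi
long_list <- data.frame(ensembl = c("G1", "G2", "G2", "G3", "G4"), value = 1:5)
goi <- data.frame(ensembl = c("G1", "G2", "G5"))

# Same filter as in the question
genes_filtered <- long_list[long_list$ensembl %in% goi$ensembl, ]

# IDs in goi that never matched anything in long_list
missing_from_long <- setdiff(goi$ensembl, long_list$ensembl)

# IDs in the filtered result that are absent from goi
# (empty by construction, since the filter keeps only IDs found in goi)
extra_in_filtered <- setdiff(genes_filtered$ensembl, goi$ensembl)

# IDs that occur more than once in the filtered result
id_counts <- table(genes_filtered$ensembl)
repeated_ids <- names(id_counts[id_counts > 1])
```

In this toy case genes_filtered has three rows, the same as goi, although one goi is missing and another ID appears twice. Comparing length(unique(genes_filtered$ensembl)) with nrow(genes_filtered) on the real data would show whether something similar is happening there.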
Can anyone explain why the filtering method at the top of the post does not work and possibly suggest a correct way?