Tutorial: Factors in R - How do they work, and how did I break my data?
14 months ago

Of the object types in R, factors tend to give people the most grief. I wanted to provide a quick (but not too quick) primer on factors in R to help alleviate some of the confusion.

# Factors are slightly different.

### Base objects

R has a few base object types, including common vector types such as numeric, character, integer, and logical.

They are recognized as a base object (using a numeric value as an example).

> sloop::otype(1)
[1] "base"


Base objects can have a class.

> class(1)
[1] "numeric"


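But importantly, a bare base object like this one has no attributes. (Base objects can carry attributes such as names or dim, but never a class attribute.)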

> attributes(1)
NULL
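
A small check of this using only base R. One nuance it is meant to show: a base vector can pick up attributes such as names and still be a base object; what it lacks is a class attribute. Here `is.object()` is used as a base-R stand-in for `sloop::otype()`, and the variable names are only illustrative.

```r
x <- 1
attributes(x)            # NULL: a bare numeric carries no attributes
is.object(x)             # FALSE: no class attribute, so a base object

# Adding names gives the vector an attribute, but it is still a base object.
named <- c(a = 1, b = 2)
attributes(named)        # a list holding only $names
is.object(named)         # FALSE

# A factor has a class attribute, which is what makes it an S3 object.
is.object(factor("A"))   # TRUE
```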


### Factors are S3 objects.

Factors, on the other hand, are a little different: they are S3 objects (which are not base objects).

f <- factor(c("A", "B"))
> sloop::otype(f)
[1] "S3"


S3 objects have a class.

> class(f)
[1] "factor"


But unlike the numeric value above, S3 objects always have attributes, including (confusingly) the class attribute that makes them S3 objects in the first place.

> attributes(f)
$levels
[1] "A" "B"

$class
[1] "factor"


# Factors are fancy integers.

What does this distinction between base and S3 objects mean for us? There is no "pure" factor base object; instead, factors are just an extension of the integer base object. If you check the type of a factor, it will indeed return "integer".

> typeof(f)
[1] "integer"


If we take a peek behind the scenes of an integer, we see that it’s of type INTSXP. An L after a number in R is just a shortcut to make it an integer.

> lobstr::sxp(1L)
[1:0x55ecb6fd27d0] <INTSXP[1]> (named:5)


If we do the same for a factor, you will see something similar (but with extra "stuff").

> lobstr::sxp(f)
[1:0x55ecb4e9ee18] <INTSXP[2]> (object named:12)
_attrib [2:0x55ecb4a1af50] <LISTSXP> (named:1)
levels [3:0x55ecb6691a18] <STRSXP[2]> (named:65535)
class [4:0x55ecb4e9eda8] <STRSXP[1]> (named:65535)


What this means is that the original character values of c("A", "B") are now stored as the integers 1 and 2. In other words, our character vector has become the integer vector c(1L, 2L) with two attributes: the class "factor" and the levels c("A", "B").
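
To see this for yourself without extra packages, here is a minimal sketch using `unclass()`, which strips the class attribute and leaves the underlying integer vector. The variable names `codes` and `labels` are only illustrative.

```r
f <- factor(c("A", "B"))

# unclass() strips the class attribute, exposing the integer codes
# together with the levels attribute that is still attached.
codes <- unclass(f)
typeof(codes)                        # "integer"
attr(codes, "levels")                # "A" "B"

# The printed labels are simply a look-up of the codes in the levels.
labels <- levels(f)[as.integer(f)]
identical(labels, as.character(f))   # TRUE
```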

# How are my chromosome names being converted to a factor?

Let’s say we have a character vector of chromosome names.

chrm <- sprintf("chr%s", 1:12)

> chrm
[1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"
[10] "chr10" "chr11" "chr12"


We want to convert this to a factor. From the previous section we know that this creates an integer vector that has a levels attribute and the S3 class factor. I will present an abbreviated (and slightly modified) representation of what the factor function does when creating a factor.

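First, the levels are created. Only unique values are retained, the values are sorted in alphabetical or numerical order, and they are converted to character (if they are not already). Note that character values are sorted as text, which is why "chr10" ends up before "chr2".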

levels <- unique(chrm)
levels <- sort(levels)
levels <- as.character(levels)

> levels
[1] "chr1"  "chr10" "chr11" "chr12" "chr2"  "chr3"  "chr4"  "chr5"  "chr6"
[10] "chr7"  "chr8"  "chr9"


The function then goes back and assigns an integer to each chromosome based on the order in which that chromosome appears in the levels. So for example, "chr2" is the 5th element in the sorted levels, so will be assigned the integer 5.

int <- match(chrm, levels)

> int
[1]  1  5  6  7  8  9 10 11 12  2  3  4


A levels attribute is then added to the integer, because we know that a factor must have a levels attribute.

levels(int) <- levels

> int
[1]  1  5  6  7  8  9 10 11 12  2  3  4
attr(,"levels")
[1] "chr1"  "chr10" "chr11" "chr12" "chr2"  "chr3"  "chr4"  "chr5"  "chr6"
[10] "chr7"  "chr8"  "chr9"


And finally, the class for the object is set to factor.

class(int) <- "factor"

> int
[1] chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chr10 chr11 chr12
Levels: chr1 chr10 chr11 chr12 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9


Confusingly, the variable named int is now a properly formed factor.

> sloop::otype(int); class(int)
[1] "S3"
[1] "factor"
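
The steps above can be collected into one small function and compared against the built-in. This is only a sketch: `make_factor` is a hypothetical helper name, and the real `factor()` handles more cases (NA values, `exclude`, `labels`, `ordered`) that are ignored here.

```r
# Hypothetical helper that mirrors the steps above; not the real factor().
make_factor <- function(x) {
  lev <- as.character(sort(unique(x)))  # unique, sorted, character levels
  codes <- match(as.character(x), lev)  # position of each value in the levels
  levels(codes) <- lev                  # attach the levels attribute
  class(codes) <- "factor"              # mark the integer vector as a factor
  codes
}

chrm <- sprintf("chr%s", 1:12)
by_hand <- make_factor(chrm)
built_in <- factor(chrm)

# Both routes should give the same levels and the same integer codes.
identical(levels(by_hand), levels(built_in))
identical(as.integer(by_hand), as.integer(built_in))
```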


Notice that the integers are now being displayed as the chromosome names on screen, but they are still indeed the integers we generated earlier.

> typeof(int)
[1] "integer"
> as.integer(int)
[1]  1  5  6  7  8  9 10 11 12  2  3  4
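
If the text-sorted level order is not what you want, one option is to pass the levels explicitly. The sketch below uses the `levels` argument of `factor()` and assumes the chromosome names are already in the desired order; `default_f` and `ordered_f` are illustrative names.

```r
chrm <- sprintf("chr%s", 1:12)

# Default: levels are sorted as text, so "chr10" comes before "chr2".
default_f <- factor(chrm)
levels(default_f)[1:3]                 # "chr1" "chr10" "chr11"

# Explicit levels: keep the order in which the names were supplied.
ordered_f <- factor(chrm, levels = unique(chrm))
levels(ordered_f)[1:3]                 # "chr1" "chr2" "chr3"
as.integer(ordered_f)                  # 1 to 12, in order

# Sorting a factor follows its levels, not the alphabet.
as.character(sort(ordered_f))[10:12]   # "chr10" "chr11" "chr12"
```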


# Factors broke my data and I hate R!

Now that we know factors are a little different, and that factors are fancy integers, we must keep this in mind when working with factors. Be mindful when converting factors between different object types!

### Simple example.

Let’s consider a slightly less confusing example first.

f <- c("C", "A", "B")
f <- factor(f)

> f
[1] C A B
Levels: A B C


We know that the factor levels are in alphabetical order, and the letters C, A, B are actually the integers 3, 1, 2. Converting the factor to a character will just return a character vector of C, A, B.

> as.character(f)
[1] "C" "A" "B"


Converting this using either as.numeric or as.integer will return the underlying integers for C, A, B.

> as.numeric(f)
[1] 3 1 2


### Confusing example.

I’ve built you up using a simple example, and now it’s time to break you down using a confusing one.

f <- c(10, 5, 7)
f <- factor(f)

> f
[1] 10 5  7
Levels: 5 7 10


The factor levels are in ascending order, which means the underlying integers for the numbers 10, 5, and 7 would be 3, 1, 2. Converting this factor to a character gives us sensible results, although the original numbers are now characters instead of numeric.

> as.character(f)
[1] "10" "5"  "7"


However, if we were to convert this to a number or integer, it will return the underlying integer representation, and not the original numbers!

> as.numeric(f)
[1] 3 1 2


For this particular example, the safe way to convert the factor back into the original numbers is to confusingly convert to a character first, and then back to a number.

> as.numeric(as.character(f))
[1] 10  5  7
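
An alternative route, mentioned in the help page for `factor`, converts only the levels and then indexes them with the factor. A minimal sketch comparing the two (the names `via_character` and `via_levels` are illustrative):

```r
f <- factor(c(10, 5, 7))

# Route 1 (from the text): convert every element to character, then to numeric.
via_character <- as.numeric(as.character(f))

# Route 2: convert only the levels, then index them with the factor.
# Indexing by a factor uses its integer codes, so this picks the right level.
via_levels <- as.numeric(levels(f))[f]

via_character                          # 10 5 7
identical(via_character, via_levels)   # TRUE
```

Both give the original numbers; the second touches each level once instead of each element, which can matter for long vectors with few levels.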


# Ending notes.

If you want to read more about factors I recommend the base types and S3 types chapters from Advanced R.

The forcats package (part of the tidyverse) will give you more control when working with factors, and is highly recommended.

# R session information.

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /geode2/home/u070/rpolicas/Carbonate/.conda/envs/R/lib/libopenblasp-r0.3.10.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] lobstr_1.1.1 sloop_1.0.1

loaded via a namespace (and not attached):
[1] compiler_4.0.3 tools_4.0.3    Rcpp_1.0.5     rlang_0.4.9


One thing that I would mention is that some functions will tacitly convert data to factors, yet when printing it, the content may still look like a string or integer. Moreover, sometimes subsequent operations still "work" even when the data has the wrong type. Alas, the results computed when data is treated as a factor are usually different.

That silent behavior that leads to wrong results is the greatest danger.
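
One illustration of this kind of silent failure, as a sketch with made-up values: indexing a named vector with a factor uses the factor's integer codes, not its labels, so the look-up "works" but returns the wrong elements. The names `lengths_mb` and `keys` are hypothetical.

```r
# A named look-up table (made-up values) and a factor of keys to look up.
lengths_mb <- c(chrA = 100, chrB = 200, chrC = 300)
keys <- factor(c("chrC", "chrA"))

# Indexing with the factor silently uses the integer codes (2, 1),
# because the levels of keys are "chrA", "chrC".
wrong <- lengths_mb[keys]
names(wrong)                          # "chrB" "chrA"

# Converting to character first gives the intended look-up by name.
right <- lengths_mb[as.character(keys)]
names(right)                          # "chrC" "chrA"
```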


The eternal struggle in a weakly typed language. This is great advice, and more generally people should always be aware of what is being returned by a function, and what a function expects as input.

You can figure out what your inputs or outputs are with functions like str, class, and typeof. The internals of functions are a little tougher, since you rely on the maintainer to enforce input types that make sense. There are usually hints or explicit statements about the format of inputs and outputs in the function documentation (? or help()) for the arguments. The "Value" section should tell you what object type is returned by a function too.
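
A quick sketch of that kind of inspection on a small data frame with made-up values. `stringsAsFactors = TRUE` is set explicitly here to force the conversion; since R 4.0.0 the default for `data.frame()` is `FALSE`.

```r
# Made-up example data; the character column is converted to a factor
# only because stringsAsFactors = TRUE is requested explicitly.
df <- data.frame(
  chrom = c("chr2", "chr10"),
  score = c(1.5, 2.5),
  stringsAsFactors = TRUE
)

str(df)               # compact overview of every column
class(df$chrom)       # "factor"
typeof(df$chrom)      # "integer"
sapply(df, class)     # class of each column at a glance
```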