Of the object types in R, factors tend to give people the most grief. I wanted to provide a quick (but not too quick) primer on factors in R to help alleviate some of the confusion.
Other R tutorials:
- data.frames in R: How vectorization can save you time and heartache
- Long versus wide data for plotting genomics data in R.
- Pipes in R: An introduction and some advanced tips and tricks.
Factors are slightly different.
R has a few base objects, and these include common data structures such as numeric, character, integer, and logical.
They are recognized as a base object (using a numeric value as an example).
> sloop::otype(1)
[1] "base"
Base objects can have a class.
> class(1)
[1] "numeric"
But importantly, a base object like this one has no attributes (and base objects never have a class attribute).
> attributes(1)
NULL
Factors are S3 objects.
Factors, on the other hand, are a little different: they are S3 objects, not base objects.
f <- factor(c("A", "B"))
> sloop::otype(f)
[1] "S3"
S3 objects have a class.
> class(f)
[1] "factor"
But unlike the base object above, S3 objects have attributes (confusingly, including a formal class attribute).
> attributes(f)
$levels
[1] "A" "B"

$class
[1] "factor"
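The same distinction can be checked without sloop. The sketch below uses only base R. It assumes that is.object(), which reports whether a class attribute is set, is a fair stand-in for the base-versus-S3 split in these two examples; it is not meant as a description of how sloop::otype() is implemented.

```r
# A bare numeric is a base object: no class attribute is stored on it.
x <- 1
is.object(x)          # FALSE
attr(x, "class")      # NULL
class(x)              # "numeric" (an implicit class, not a stored attribute)

# A factor is an S3 object: the class is stored as an attribute.
f <- factor(c("A", "B"))
is.object(f)          # TRUE
attr(f, "class")      # "factor"
names(attributes(f))  # the levels and class attributes
```

The point to take away is that class() answers for every object, but only S3 objects actually store a class attribute.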
Factors are fancy integers.
What does this distinction between base and S3 objects mean for us? There is no "pure" factor base object, but instead factors are just an extension of the integer base object. If you check the type of a factor it will indeed return an integer.
> typeof(f)
[1] "integer"
If we take a peek behind the scenes of an integer, we see that it's of type INTSXP. The L after a number in R is just a shortcut to make it an integer.
> lobstr::sxp(1L)
[1:0x55ecb6fd27d0] <INTSXP> (named:5)
If we do the same for a factor, you will see something similar (but with extra "stuff").
> lobstr::sxp(f)
[1:0x55ecb4e9ee18] <INTSXP> (object named:12)
  _attrib [2:0x55ecb4a1af50] <LISTSXP> (named:1)
    levels [3:0x55ecb6691a18] <STRSXP> (named:65535)
    class [4:0x55ecb4e9eda8] <STRSXP> (named:65535)
What this means is that the original character values c("A", "B") are now stored as the integers 1, 2. This integer vector has two attributes: a class of factor, and levels of "A", "B". So our original character vector c("A", "B") is actually the integer vector c(1L, 2L) with a levels attribute of c("A", "B").
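One way to convince yourself of this is to build the same factor by hand. This is a minimal sketch using base R's structure() and unclass(); the name by_hand is just for illustration.

```r
f <- factor(c("A", "B"))

# Strip the class and look at what is left: integers plus a levels attribute.
unclass(f)

# Build the same object from an integer vector and two attributes.
by_hand <- structure(c(1L, 2L), levels = c("A", "B"), class = "factor")
identical(by_hand, f)      # TRUE

# Indexing the levels by the integer codes recovers the original labels.
levels(f)[as.integer(f)]   # "A" "B"
```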
How are my chromosome names being converted to a factor?
Let’s say we have a character vector of chromosome names.
chrm <- sprintf("chr%s", 1:12)
> chrm
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"
[10] "chr10" "chr11" "chr12"
We want to convert this to a factor, and in doing so we know that an integer is created, this integer has a levels attribute, and this integer has the S3 object class of factor. I will present an abbreviated (and slightly modified) representation of what the factor function does when creating a factor.
First, the levels are created. Only unique values are retained, the values are sorted by alphabetic or numeric order, and converted to a character (if it’s not a character).
levels <- unique(chrm)
levels <- sort(levels)
levels <- as.character(levels)
> levels
 [1] "chr1"  "chr10" "chr11" "chr12" "chr2"  "chr3"  "chr4"  "chr5"  "chr6"
[10] "chr7"  "chr8"  "chr9"
The function then goes back and assigns an integer to each chromosome based on the order in which that chromosome appears in the levels. So for example, "chr2" is the 5th element in the sorted levels, so will be assigned the integer 5.
int <- match(chrm, levels)
> int
 [1]  1  5  6  7  8  9 10 11 12  2  3  4
A levels attribute is then added to the integer, because we know that a factor must have a levels attribute.
levels(int) <- levels
> int
 [1]  1  5  6  7  8  9 10 11 12  2  3  4
attr(,"levels")
 [1] "chr1"  "chr10" "chr11" "chr12" "chr2"  "chr3"  "chr4"  "chr5"  "chr6"
[10] "chr7"  "chr8"  "chr9"
And finally, the class for the object is set to factor.
class(int) <- "factor"
> int
 [1] chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chr10 chr11 chr12
Levels: chr1 chr10 chr11 chr12 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9
The int variable is now, confusingly, a properly formatted factor.
> sloop::otype(int); class(int)
[1] "S3"
[1] "factor"
Notice that the integers are now being displayed as the chromosome names on screen, but they are still indeed the integers we generated earlier.
> typeof(int)
[1] "integer"
> as.integer(int)
 [1]  1  5  6  7  8  9 10 11 12  2  3  4
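The steps above can be collected into a small helper and checked against factor() itself. This is a sketch only: manual_factor() is a hypothetical name, and it skips everything else the real factor() handles (such as the labels, exclude, and ordered arguments). The last few lines show one common way to avoid the alphabetical ordering, by passing the levels explicitly.

```r
# Hypothetical helper mirroring the walkthrough; not the real factor() source.
manual_factor <- function(x) {
  lv <- as.character(sort(unique(x)))  # unique, sorted, character levels
  int <- match(as.character(x), lv)    # position of each value in the levels
  levels(int) <- lv                    # attach the levels attribute
  class(int) <- "factor"               # mark the integer vector as a factor
  int
}

chrm <- sprintf("chr%s", 1:12)
identical(manual_factor(chrm), factor(chrm))   # TRUE

# Supplying the levels in the desired order keeps chr2 ahead of chr10.
chrm_ordered <- factor(chrm, levels = unique(chrm))
levels(chrm_ordered)[1:3]    # "chr1" "chr2" "chr3"
as.integer(chrm_ordered)     # 1 through 12, in order
```

With explicit levels, the underlying integers follow the order you chose rather than the sorted order, which is usually what you want for chromosome names.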
Factors broke my data and I hate R!
Now that we know factors are a little different, and that factors are fancy integers, we must keep this in mind when working with factors. Be mindful when converting factors between different object types!
Let’s consider a slightly less confusing example first.
f <- c("C", "A", "B")
f <- factor(f)
> f
[1] C A B
Levels: A B C
We know that the factor levels are in alphabetical order, and the letters C, A, B are actually the integers 3, 1, 2. Converting the factor to a character will just return a character vector of C, A, B.
> as.character(f)
[1] "C" "A" "B"
Converting this using either as.numeric or as.integer will return the underlying integers for C, A, B.
> as.numeric(f)
[1] 3 1 2
I’ve built you up using a simple example, and now it’s time to break you down using a confusing example.
f <- c(10, 5, 7)
f <- factor(f)
> f
[1] 10 5  7
Levels: 5 7 10
The factor levels are in ascending order, which means the underlying integers for the numbers 10, 5, and 7 would be 3, 1, 2. Converting this factor to a character gives us sensible results, although the original numbers are now characters instead of numeric.
> as.character(f)
[1] "10" "5"  "7"
However, if we were to convert this to a number or integer, it will return the underlying integer representation, and not the original numbers!
> as.numeric(f)
[1] 3 1 2
For this particular example, the safe way to convert the factor back into the original numbers is to confusingly convert to a character first, and then back to a number.
> as.numeric(as.character(f))
[1] 10  5  7
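The two conversions can be put side by side. The sketch below also shows the alternative given on the ?factor help page, as.numeric(levels(f))[f], which converts each level once and then indexes the result by the underlying integers.

```r
f <- factor(c(10, 5, 7))

as.numeric(f)                  # 3 1 2   (the underlying integer codes)
as.numeric(as.character(f))    # 10 5 7  (labels first, then numbers)

# Alternative from ?factor: convert the levels once, then index by the codes.
as.numeric(levels(f))[f]       # 10 5 7
```

Both safe forms give the same answer; the second does less work when a factor has many elements but few levels.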
If you want to read more about factors I recommend the base types and S3 types chapters from Advanced R.
The forcats library (part of the tidyverse) will give you more control when working with factors, and is highly recommended.
R session information.
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /geode2/home/u070/rpolicas/Carbonate/.conda/envs/R/lib/libopenblasp-r0.3.10.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] lobstr_1.1.1 sloop_1.0.1

loaded via a namespace (and not attached):
[1] compiler_4.0.3 tools_4.0.3    Rcpp_1.0.5     rlang_0.4.9
One thing that I would mention is that some functions will tacitly convert data to factors, yet when printing it, the content may still look like a string or integer. Moreover, sometimes subsequent operations still "work" even when the data has the wrong type. Alas, the results computed when data is treated as a factor are usually different.
That silent behavior that leads to wrong results is the greatest danger.
The eternal struggle in a weakly typed language. This is great advice, and more generally people should always be aware of what is being returned by a function, and what a function expects as input.
You can figure out what your inputs or outputs are with functions like typeof. The internals of functions are a little tougher, since you rely on the maintainer to enforce input types that make sense. There are usually hints or explicit statements about the format of inputs and outputs in the function documentation (help()) for the arguments. The "Value" section should tell you what object type is returned by a function too.
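As a concrete case of a silent conversion, data.frame() turns character columns into factors when stringsAsFactors = TRUE (which was the default before R 4.0.0). The sketch below uses a made-up data frame, df, with a made-up column, depth, to show how a quick check with class() or typeof() catches the problem before the numbers go wrong.

```r
# Made-up data: numbers stored as text, converted to a factor on the way in.
df <- data.frame(depth = c("10", "5", "7"), stringsAsFactors = TRUE)

df$depth            # prints like the original values
class(df$depth)     # "factor"
typeof(df$depth)    # "integer"

# This runs without error but averages the integer codes, not the values.
mean(as.numeric(df$depth))                  # 2
# Going through character first averages the real values (22 / 3).
mean(as.numeric(as.character(df$depth)))

# Inspect the class of every column in one go.
sapply(df, class)
```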