
R with data.table

Forgive this rant about the tidyverse. I started out learning R from Roger Peng’s excellent intro on Coursera. However, it wasn’t long before the tidyverse was being suggested as the ideal workflow for data science. Everything I could possibly want was offered by Hadley and Co. at RStudio. Need to plot stuff? There’s ggplot2. Need to work with dates? There’s lubridate. Need models? There’s caret (now parsnip). Need data manipulation? There’s dplyr. Need I/O? There’s readr. The underlying message was that there was no need to go anywhere else. Keep it all in house. This worked for a while, but as I began to write code for production environments, I realized a few things:

  • The tidyverse is constantly being updated, and those updates sometimes break old tidyverse code.

  • Tidyverse code may look pretty, but it ain’t fast.

  • Tidyverse packages have a ton of dependencies that make installation a hassle.

My first exposure to a crack in the ‘matrix’ was when I stumbled onto this Stack Overflow post. At first I thought it was just a bunch of cranks criticizing dplyr, and I continued to write reams of dplyr code, but I started to sneak in snippets of data.table code where I could. I especially loved data.table’s fread. I had always heard people complain about R being slow, and the tidyverse fulfilled that expectation very nicely. But I discovered that data.table’s fread / fwrite functions made I/O unbelievably fast. That was curious to me. How could anything in R be that fast?
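Here is a minimal sketch of the kind of comparison that won me over. The file name big.csv is just a placeholder for any large CSV, and the exact timings will of course vary by machine:

library(data.table)

system.time(df <- read.csv("big.csv"))   # base R reader, single-threaded
system.time(dt <- fread("big.csv"))      # data.table's multi-threaded reader

system.time(write.csv(df, "out_base.csv", row.names = FALSE))
system.time(fwrite(dt, "out_dt.csv"))    # fwrite is likewise multi-threaded

On any reasonably large file, the fread / fwrite lines finish in a fraction of the time of their base R counterparts.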

Eventually I started needing code to run faster, and I had the choice of either switching to another language or finding a better way of using R. I knew people had talked about data.table being fast, and I started using it for some simpler projects. I especially loved that it has no dependencies, and that updates only made things better without breaking old code. The more I used data.table, the more I forgot how to use dplyr. It was just a different paradigm.

I started seeing other people I respected in the data science community saying the same thing. See this and this. It made me feel a little like the wool had been pulled over my eyes.

These days I try to avoid any package that comes out of the tidyverse. I don’t have any desire to ‘pipe’ anything:

myDataFrame %>% group_by(myGrp) %>% filter(myColumn == 123) %>% mutate(newColumn = oldColumn + 7)

This feels more like drawing on a whiteboard than writing code! Compare the data.table equivalent:

myDataFrame[myColumn == 123, newColumn := oldColumn + 7, by = myGrp]
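For concreteness, here is a tiny self-contained version of the same operation; the table and column values are mine, purely for illustration. One semantic difference worth knowing: the dplyr chain drops the non-matching rows, while the data.table := keeps them and simply leaves the new column NA there.

library(data.table)

dt <- data.table(myGrp = c("a", "a", "b"),
                 myColumn = c(123, 5, 123),
                 oldColumn = c(1, 2, 3))

# Filter in i, create the column by reference in j, group with by:
dt[myColumn == 123, newColumn := oldColumn + 7, by = myGrp]
dt   # rows where myColumn != 123 get NA in newColumn

Because := modifies the table by reference, there’s no copy and no reassignment, which is a big part of why data.table stays fast on large data.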

I think at a more fundamental level, I’m confused by the philosophy that new coders should only use the packages from one group, a group with a very opinionated way of coding. The whole idea behind the open-source community on CRAN is that we all contribute packages that we all can use. There’s an assumption that my package will do something well, and your package will do something different well, and it’s up to the user to decide which packages to use. Instead, with Hadley and Co., there’s somehow an implicit assumption that ‘our’ packages are the best, and that you should really only use tidyverse code. And new students should most definitely only be taught the tidyverse. I’m not sure what’s behind this. Is it a power grab by RStudio? Is it just a control thing? As long as it’s all free, I guess it doesn’t matter, but I wonder whether the idea is for us all to become so dependent on the tidyverse that one day it stops being free. Somehow the philosophy just doesn’t seem as encouraging of the free, open development of packages by the entire community. Whatever it is, I’d prefer to avoid it. And I think my code is better for it.