At a high level, the tidyverse is a language for solving data science challenges with R code. Its primary goal is to facilitate a conversation between a human and a computer about data. Less abstractly, the tidyverse is a collection of R packages that share an underlying design philosophy, grammar, and data structures, so that learning one package should make it easier to learn the next.
The tidyverse encompasses the repeated tasks at the heart of every data science project: data import, tidying, manipulation, visualisation, and programming. We expect that almost every project will use multiple domain-specific packages outside of the tidyverse: our goal is to provide tooling for the most common challenges, not to solve every possible problem. Notably, the tidyverse doesn’t include tools for statistical modelling or communication. These toolkits are critical for data science, but are so large that they merit separate treatment.
This paper describes the tidyverse package, the components of the tidyverse, and some of the underlying philosophy.
As well as being a collection of packages, the tidyverse itself is also a package. This provides a convenient way of downloading all tidyverse packages with a single R command:
install.packages("tidyverse")
The core tidyverse includes the packages that you’re likely to use in everyday data analyses, and these are loaded when you load the tidyverse package:
library(tidyverse)
#> ── Attaching packages ───────────────────────────── tidyverse 1.2.1.9000 ──
#> ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
#> ✔ tibble  2.1.3     ✔ dplyr   0.8.2
#> ✔ tidyr   0.8.3     ✔ stringr 1.4.0
#> ✔ readr   1.3.1     ✔ forcats 0.4.0
#> ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
As of tidyverse 1.2.0, the core packages include ggplot2 (Wickham 2016a), dplyr (Wickham et al. 2018), tidyr (Wickham and Henry 2018), readr (Wickham, Hester, and François 2018), purrr (Henry and Wickham 2018), tibble (Müller and Wickham 2018), stringr (Wickham 2018b), and forcats (Wickham 2018a).
Non-core packages are installed with install.packages("tidyverse"), but are not loaded with library(tidyverse). They play more specialised roles, so will be individually loaded by the analyst as needed.
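As a small illustration of loading a non-core package on demand, the sketch below attaches readxl explicitly and reads a sample workbook that ships with that package. The choice of readxl and of its bundled example file is ours, made only to keep the example self-contained.

```r
# readxl is installed alongside the tidyverse but is not attached by
# library(tidyverse), so it is loaded explicitly here.
library(readxl)

# readxl_example() returns the path to a sample workbook bundled with readxl.
path <- readxl_example("datasets.xlsx")

# List the sheets in the workbook, then read the first one into a tibble.
sheets <- excel_sheets(path)
first_sheet <- read_excel(path, sheet = sheets[1])

first_sheet
```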
How do the component packages of the tidyverse fit together? We use the model of data science tools from “R for Data Science” (Wickham and Grolemund 2017):
Every analysis starts with data import: if you can’t get your data into R, you can’t do data science on it! Data import takes data stored in a file, database, or web API, and reads it into a data frame in R. Data import is supported by the core readr (Wickham, Hester, and François 2018) package for flat files (like csv, tsv, and fwf). Additional non-core packages, such as readxl (Wickham and Bryan 2018), rvest (Wickham 2016b), and haven (Wickham and Miller 2018), make it possible to import data stored in other common formats.
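A minimal sketch of flat-file import with readr follows. The file contents and the column names (site, year, count) are invented for illustration; the file is written to a temporary path so that the example runs on its own.

```r
library(readr)

# Write a small, made-up CSV to a temporary file so the example is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(
  c("site,year,count",
    "A,2017,12",
    "B,2017,7",
    "A,2018,15"),
  path
)

# Declaring the column types up front keeps the import explicit and
# avoids relying on readr's type guessing.
surveys <- read_csv(
  path,
  col_types = cols(
    site  = col_character(),
    year  = col_integer(),
    count = col_double()
  )
)

surveys
```

Spelling out the column specification is optional, since readr will guess types if it is omitted, but stating it makes the import step easier to reproduce.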
Next, we recommend that you tidy your data, getting it into a consistent form that makes the rest of the analysis easier. Most functions in the tidyverse work with tidy data (Wickham 2014), where every column is a variable, every row is an observation, and every cell contains a single value. If your data is not already in this form (almost always!), the core tidyr (Wickham and Henry 2018) package provides tools to tidy it up.
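To make the definition of tidy data concrete, here is a hedged sketch that reshapes a small, invented table in which one variable (the year) is stored in the column names. It assumes tidyr 1.0.0 or later, which provides pivot_longer(); the tidyr release cited above offers gather() for the same job.

```r
library(tibble)
library(tidyr)

# Hypothetical untidy table: the year variable is spread across column names.
counts_wide <- tribble(
  ~site, ~`2017`, ~`2018`,
  "A",        12,      15,
  "B",         7,       9
)

# Gather the year columns into a (year, count) pair so that every column
# is a variable and every row is one site-year observation.
counts_tidy <- pivot_longer(
  counts_wide,
  cols      = c(`2017`, `2018`),
  names_to  = "year",
  values_to = "count"
)

counts_tidy
```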
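Data transformation is supported by the core dplyr (Wickham et al. 2018) package. dplyr provides verbs that work with whole data frames, such as mutate() to create new variables, filter() to find observations matching given criteria, and left_join() and friends to combine multiple tables. dplyr is paired with packages that provide tools for specific column types:
stringr (Wickham 2018b) for strings.
forcats (Wickham 2018a) for factors, R’s categorical data type.
lubridate (Spinu, Grolemund, and Wickham 2018) for dates and date-times.
hms (Müller 2018) for time-of-day values.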
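The dplyr verbs named above can be sketched on two small, invented tables. The table and column names (counts, sites, area_ha, density) are assumptions made for this example only.

```r
library(dplyr)

# Hypothetical observations and a lookup table describing each site.
counts <- tibble(
  site  = c("A", "B", "A", "B"),
  year  = c(2017L, 2017L, 2018L, 2018L),
  count = c(12, 7, 15, 9)
)
sites <- tibble(
  site    = c("A", "B"),
  area_ha = c(4, 2)
)

result <- counts %>%
  filter(year == 2018L) %>%             # keep observations matching a criterion
  left_join(sites, by = "site") %>%     # add the site-level columns
  mutate(density = count / area_ha)     # create a new variable

result
```

Each verb takes a data frame as its first argument and returns a data frame, which is what allows the steps to be chained with the pipe.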
There are two main tools for understanding data: visualisation and modelling. The tidyverse provides the ggplot2 (Wickham 2016a) package for visualisation. ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics (Wilkinson 2005). You provide the data, tell ggplot2 how to map variables to aesthetics and what graphical primitives to use, and it takes care of the details. Modelling is outside the scope of this paper, but is part of the closely affiliated tidymodels (Kuhn and Wickham 2018) project, which shares interface design and data structures with the tidyverse.
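A minimal sketch of that declarative style: the data, the aesthetic mappings, and the graphical primitive are each stated once, and ggplot2 works out scales, legends, and axes. The mpg dataset ships with ggplot2; the particular variables plotted are our choice for illustration.

```r
library(ggplot2)

# Data: mpg. Aesthetics: displacement on x, highway mileage on y,
# vehicle class as colour. Primitive: points.
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  labs(
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon"
  )

# Evaluating `p` at the console (or calling print(p)) draws the plot.
```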
Finally, you’ll need to communicate your results to someone else. Communication is one of the most important parts of data science, but is not included within the tidyverse. Instead, we expect people will use other R packages, like rmarkdown (Allaire et al. 2018) and shiny (Chang et al. 2018), which support dozens of static and dynamic output formats.
Surrounding all these tools is programming. Programming is a cross-cutting tool that you use in every part of a data science project. Programming tools in the tidyverse include:
purrr (Henry and Wickham 2018), which enhances R’s functional programming toolkit.
tibble (Müller and Wickham 2018), which provides a modern re-imagining of the venerable data frame, keeping what time has proven to be effective, and throwing out what it has not.
reprex (Bryan et al. 2018), which helps programmers get help when they get stuck by easing the creation of reproducible examples.
magrittr (Bache and Wickham 2014), which provides the pipe, %>%, used throughout the tidyverse. The pipe is a tool for function composition, making it easier to solve large problems by breaking them into small pieces.
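A short sketch of how some of these programming tools combine: purrr maps a function over a list, tibble turns the result into a data frame, and the pipe composes the steps. The input list and the names group and mean are invented for this example.

```r
library(purrr)
library(tibble)

# Hypothetical measurements, one numeric vector per group.
measurements <- list(
  a = c(1, 2, 3),
  b = c(10, 20),
  c = c(5, 5, 5, 5)
)

# map_dbl() applies mean() to each element and returns a named double vector.
means <- measurements %>% map_dbl(mean)

# The pipe is function composition: x %>% f(y) is equivalent to f(x, y).
same_result <- identical(means, map_dbl(measurements, mean))

# enframe() converts the named vector into a two-column tibble.
summary_tbl <- means %>% enframe(name = "group", value = "mean")

summary_tbl
```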
We are still working to explicitly describe the unifying principles that make the tidyverse so consistent, and you can read our latest thoughts at https://principles.tidyverse.org/. But there is one particularly important principle that we want to call out here: the tidyverse is fundamentally human centered. That is, the tidyverse is designed to support the activities of a human data analyst, so to be effective tool builders, we must explicitly recognise and acknowledge the strengths and weaknesses of human cognition.
This is particularly important for R, because it’s a language that’s used primarily by non-programmers, and we want to make it as easy as possible for first-time and end-user programmers to learn the tidyverse. This means that we spend a lot of time thinking about interface design, and have recently started experimenting with surveys to help guide interface choices.
Similarly, the tidyverse is not just the collection of packages; it is also the community of people who use them. We want the tidyverse to be a diverse, inclusive, and welcoming community. We are still developing our skills in this area, but our existing approaches include active use of Twitter to solicit feedback, announce updates, and generally listen to the community. We also keep users apprised of major upcoming changes through the tidyverse blog, run developer days, and support lively discussions on RStudio Community.
We are grateful for the financial support of RStudio, Inc.
Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2018. rmarkdown: Dynamic Documents for R. https://rmarkdown.rstudio.com.
Bache, Stefan Milton, and Hadley Wickham. 2014. magrittr: A Forward-Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.
Bryan, Jennifer, Jim Hester, David Robinson, and Hadley Wickham. 2018. reprex: Prepare Reproducible Example Code via the Clipboard. https://CRAN.R-project.org/package=reprex.
Chang, Winston, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson. 2018. shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.
Henry, Lionel, and Hadley Wickham. 2018. purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.
Kuhn, Max, and Hadley Wickham. 2018. tidymodels: Easily Install and Load the ’Tidymodels’ Packages. https://CRAN.R-project.org/package=tidymodels.
Müller, Kirill. 2018. hms: Pretty Time of Day. https://CRAN.R-project.org/package=hms.
Müller, Kirill, and Hadley Wickham. 2018. tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.
Spinu, Vitalie, Garrett Grolemund, and Hadley Wickham. 2018. lubridate: Make Dealing with Dates a Little Easier. https://CRAN.R-project.org/package=lubridate.
Wickham, Hadley. 2014. “Tidy Data.” The Journal of Statistical Software 59.
———. 2016a. ggplot2: Elegant Graphics for Data Analysis. useR. Springer.
———. 2016b. rvest: Easily Harvest (Scrape) Web Pages. https://CRAN.R-project.org/package=rvest.
———. 2018a. forcats: Tools for Working with Categorical Variables (Factors). https://CRAN.R-project.org/package=forcats.
———. 2018b. stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.
Wickham, Hadley, and Jennifer Bryan. 2018. readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.
Wickham, Hadley, and Lionel Henry. 2018. tidyr: Easily Tidy Data with ’Spread()’ and ’Gather()’ Functions. https://CRAN.R-project.org/package=tidyr.
Wickham, Hadley, Jim Hester, and Romain François. 2018. readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.
Wickham, Hadley, and Evan Miller. 2018. haven: Import and Export "Spss", "Stata" and "Sas" Files. https://CRAN.R-project.org/package=haven.
Wilkinson, Leland. 2005. The Grammar of Graphics. Berlin, Heidelberg: Springer-Verlag.