R is an open-source programming language that allows you to analyse and manipulate data, create state-of-the-art graphics, and much more. It supports large data sets, reads a wide range of data formats, and runs on multiple platforms (Windows, Mac, Linux) and CPU architectures (x86_64, arm64). R makes it easier to automate tasks, organize projects, ensure reproducibility, and find and fix errors, and anyone can contribute packages to extend its functionality. Moreover, the following points are worth emphasizing:
For newcomers to programming, the learning curve is rather flat at the beginning, meaning that progress can feel slow. One reason is that R tools are spread across many packages, which can overwhelm beginners. There is no centralized support, and the members of the helpful and active online community come from different backgrounds. It can be difficult for beginners to find the right solution because there are often many different ways to tackle the same problem. Moreover, R can be slower than languages like Python, MATLAB, C/C++ or Java.
There are many different approaches to learning R. Which one suits you depends on your preferences, needs, goals, prerequisites and limitations. It is up to you to search for and find a suitable way to achieve your learning goals. While I hope you find my notes helpful, I additionally provide in Section 1.3 a list of other resources that are worth considering. To start with, I recommend my swirl courses, which provide an interactive learning environment, see Chapter 4.
Learning a programming language can, like learning a foreign language, be daunting and frustrating. However, anybody who puts in the effort and is not afraid to make mistakes can learn it. You don’t have to be a nerd. Having a guide next to you can help and speed up your progress significantly. The key is taking action and getting involved: write code. Try to copy the code that you read here and elsewhere. Explore what the code does on your machine. Don’t be afraid to make errors. Your PC will not explode. In this paper, most of the code is written so that you can easily copy it and reproduce the output on your PC. Take advantage of this opportunity and go for it! Hands-on practice is far more enjoyable than merely reading through the material.
Here are some comments that may help you to learn efficiently:
Let me illustrate what I mean: Suppose you send your grandfather the following message:
“Let’s eat grandpa.”
He will probably understand that you’re inviting him to dinner. However, if you sent the same message to a computer, it would interpret the sentence literally because the comma is missing. What you actually meant to write is:
“Let’s eat, grandpa.”
The comma makes all the difference in clarifying that you’re speaking to your grandpa, not about eating him! Similarly, in programming, an incorrectly placed comma can break your code or change the meaning of your code.
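To see the same effect in R, here is a small sketch of my own (the matrix m is just an illustrative object and is not used elsewhere in these notes). It suggests that a single comma can change what an expression returns, and that a missing comma can stop the code from running at all:

```r
# A small matrix with 2 rows and 3 columns (filled column by column)
m <- matrix(1:6, nrow = 2)

# With the comma: "row 1, all columns"
first_row <- m[1, ]
first_row

# Without the comma: "the first element of the matrix"
first_element <- m[1]
first_element

# A missing comma between two values is a syntax error.
# try() catches the error so that the script keeps running.
parse_result <- try(parse(text = "c(1 2)"), silent = TRUE)
inherits(parse_result, "try-error")
```

The first call returns the three values of the first row, the second call returns a single value, and the third expression cannot even be parsed.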
Thousands of freely available books and resources exist. bookdown.org and the Big Book of R are two vast collections of links to R books that support this claim.
In RStudio, the panel at the bottom right is called Help. It offers many links, manuals, and references for learning R for free, including education.rstudio.com and Links for Getting Help with R. At the top right of RStudio you find a panel called Tutorial. Here you can install the learnr package, which offers some nice interactive tutorials.
Since you may feel overwhelmed by the number of resources, I would like to highlight some books:
Some other sources that are worth mentioning are these:
R is a functional programming language. If you want R to do something, you need to use a function. Or, in the words of Chambers (2017, p. 4):
“Everything that happens is a function call.”
For example, when you want to exit R, you do it with the function q():
> q()
Save workspace image? [y/n/c]:
If you want to specify what exactly you want R to do for you, you need to refer to the arguments of a function. For example, if you don’t want to be asked interactively what to do with your workspace (this is the place where you store all your objects, see Section 1.5), you can do this with an argument of the q() function:
> q(save = "no")
To learn more about a function, you can access its documentation by typing a question mark followed by the function name into the Console:
?q()
Unfortunately, the documentation can sometimes be a bit confusing for beginners in applied contexts. However, the documentation for all functions is structured similarly, typically featuring several key sections:
Understanding these sections can significantly enhance your ability to navigate and utilize R.
An excerpt of the R documentation for the function q() is shown in Figure 1.3. It shows that the function has three arguments that you can set. If you do not specify an argument explicitly, R uses the default value shown there.
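If you prefer to inspect the arguments directly in the Console rather than in the help page, the following sketch may help. It uses the base R functions args() and formals(); the object names arg_names and save_default are my own and purely illustrative:

```r
# Show the arguments of q() together with their default values
args(q)

# Extract the names of the arguments
arg_names <- names(formals(q))
arg_names

# Look up the default value of a single argument
save_default <- formals(q)$save
save_default
```

Note that none of these lines call q() itself, so R keeps running.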
R is an object-oriented programming language. That means,
“everything that exists in R is an object” (Chambers, 2017, p. 4).
Objects are the fundamental units that are used to store information. Objects can be of a variety of types, including vectors, matrices, data frames, lists, and functions. Moreover, you can store empirical results, tables, figures and much more in the form of objects. All objects are stored in the workspace, whose content is shown in the Environment panel.
In R, you can show the content of the workspace with ls(). The function rm() removes objects, and with rm(list=ls()) you clear all objects from the workspace.
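A minimal sketch of how these functions work together is shown here. The object names my_numbers and my_text are made up for illustration:

```r
# Create two objects in the workspace
my_numbers <- c(1, 2, 3)
my_text <- "hello"

# List the objects that currently exist in the workspace
ls()

# Remove one object and check that it is gone
rm(my_numbers)
exists("my_numbers")
exists("my_text")

# rm(list = ls()) would remove everything, so use it with care
```

After rm(my_numbers), only my_text is left, and you should see the same change in the Environment panel of RStudio.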
While R has a command line interface, there are multiple third-party graphical user interfaces available that improve the user experience a lot. The most successful graphical user interface or integrated development environment (IDE) is RStudio. Throughout this book, I will assume that you are using R via RStudio. First-time users often confuse the two. At its simplest, R is like a car’s engine while RStudio is like a car’s dashboard, as illustrated in Figure 1.4.
More precisely, R is a functional programming language that runs computations, while RStudio is an integrated development environment (IDE) that provides an interface by adding many convenient features and tools. Just as having access to a speedometer, rearview mirrors, and a navigation system makes driving much easier, using RStudio’s interface makes using R much easier as well.
Much as we don’t drive a car by interacting directly with the engine but rather by interacting with elements on the car’s dashboard, we won’t be using R directly but rather we will use RStudio’s interface. After you install R and RStudio on your computer, you’ll have two new programs (also called applications) you can open. We’ll always work in RStudio and not in the R application. Figure 1.5 shows which icon you should click on your computer.
After you open RStudio, you should see something similar to Figure 1.6, with three or four panels dividing the screen.
The Console panel will contain R’s startup message, which shows information about which version of R you’re running. My startup message at the time of writing was as follows:
You can resize the panels as you like, either by clicking and dragging their borders or by using the minimise/maximise buttons in the upper right corner of each panel. Pressing Ctrl++ and Ctrl+- makes the font larger or smaller.
In the Console you can type in code and press Enter to run the line of code. For example, you can calculate:
1+4
[1] 5
While working in the Console is possible, we usually work in RStudio using so-called scripts. These scripts are plain text files with the file extension “.R”. Scripts are discussed in detail in Chapter 3. To create a script, go to the File menu, select New File and then choose R Script.
With the keyboard shortcut Ctrl+Enter for Windows and Linux users or Cmd+Enter for macOS users (or by clicking Run) you can run a line of a script, which means that you send one line of code to the Console. Figure 1.7 shows how this looks in RStudio.
As shown in Figure 1.8, setting up R on your personal computer (Windows, Mac, Linux) is a three-step process: You will first need to download and install R. After that has been successful, you can download and install RStudio. Please note that it is important that you install R first and then install RStudio. As a third but optional step, you can install R packages.
If you don’t want to install R on your PC, if you don’t have admin rights to do so, or if you want to run R on your tablet (iPad or Chromebook) or even your smartphone, you can use RStudio online via cloud computing on https://posit.cloud. Posit Cloud (formerly RStudio Cloud) is a cloud-based solution that allows anyone to use RStudio online through a web browser. It is free for individuals, with some restrictions and limited capacities.
A package is a collection of functions, data sets and other R objects that are all grouped together under a common name. More than 20,000 packages are available at the official repository (CRAN). CRAN is a network of ftp and web servers around the world that store identical, up-to-date versions of code and documentation for R, see: https://cran.r-project.org.
However, before we get started, there’s a critical distinction that you need to understand, which is the difference between having a package installed on your computer, and having a package loaded in R. When you install R on your computer only a small number of packages come bundled with the basic R installation. The installed packages are on your computer. The critical thing to remember is that just because something is on your computer doesn’t mean R can use it. In order for R to be able to use one of your installed packages, that package must also be loaded. Generally, when you open up R, only a few of these packages (about 7 or 8) are actually loaded.
We only need to install a package once on our computer. However, to use the package, we need to load it every time we start a new R session, for example by restarting RStudio.
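The difference between installed and loaded packages can be checked in the Console. The following lines are only a sketch: they use the base R functions installed.packages() and search(), and the object names installed and attached are my own choice:

```r
# Names of all packages that are installed on this computer
installed <- rownames(installed.packages())
length(installed)

# Packages (and environments) that are attached in the current session
attached <- search()
attached

# A package can be installed without being loaded:
# "tools" ships with R but is usually not attached at startup
"tools" %in% installed
"package:tools" %in% attached
```

The list returned by search() is usually much shorter than the list of installed packages, which is exactly the distinction described above.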
To install an R package you can use the GUI of RStudio or the command line. In RStudio you can click on the Packages tab, then on the Install button, then search for a package and click Install. An alternative way to install a package is by typing
install.packages("package_name")
in the console pane of RStudio and pressing Return/Enter on your keyboard. Note that you must include the quotation marks around the name of the package.
If you want to update a previously installed package to a newer version, you need to re-install it by repeating the earlier steps, or you can use update.packages(). To uninstall packages you can use remove.packages().
The installation of packages can take some time. However, if your CPU has many cores, you can speed up the process a lot using the argument Ncpus like this: update.packages(ask = FALSE, Ncpus = 4L). This option allows you to adjust the number of parallel processes R can use on your PC. So, if you have a CPU with many cores, you can increase that number. A tutorial on how to set the number of cores used by R permanently can be found here.
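As a rough sketch of how the number of cores could be chosen, the following lines use detectCores() from the parallel package, which ships with R, and store the result in the Ncpus option for the current session. Leaving one core free is my own assumption and not a fixed rule:

```r
# Count the available cores (may be NA on some systems)
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 1L

# Leave one core free for other work, but use at least one
options(Ncpus = max(1L, n_cores - 1L))

# install.packages() uses this option as the default for its Ncpus argument
getOption("Ncpus")
```

The option only lasts until you close R. No packages are installed by these lines.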
Recall that after you’ve installed a package, you need to load it. We do this by using the library() command. For example, to load the ggplot2 package, run the following code in the console pane. What do we mean by “run the following code”? Either type or copy-and-paste the following code into the console pane and then hit the Enter key.
library("ggplot2")
If, after running the code above, a blinking cursor returns next to the > “prompt” sign, you were successful and the ggplot2 package is now loaded and ready to use. If, however, you get a red “error message” that reads
Error in library(ggplot2) : there is no package called ‘ggplot2’
it means that you didn’t successfully install it. If you get this error message, go back to Section 1.9.1 on R package installation and make sure to install the ggplot2 package before proceeding.
One very common mistake new R users make when wanting to use particular packages is that they forget to load them first by using the library() command we just saw. Remember: you have to load each package you want to use every time you start RStudio. If you don’t first load a package, but attempt to use one of its features, you’ll see an error message similar to:
Error: could not find function
R is informing you that you are attempting to use a function from a package that has not yet been loaded. Forgetting to load packages is a common mistake made by new users, and it can be a bit frustrating to get used to at first. However, with practice, it will become second nature for you. Unloading packages can be done with detach(package:ggplot2, unload=TRUE).
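You can reproduce this behaviour without installing anything. The sketch below uses the tools package, which comes with R but is usually not loaded at startup, and its function file_ext(). The object names msg and ext are my own:

```r
# Calling a function from a package that is not loaded fails.
# tryCatch() captures the error message instead of stopping the script.
msg <- tryCatch(
  file_ext("script.R"),
  error = function(e) conditionMessage(e)
)
msg

# After loading the package, the same call works
library(tools)
ext <- file_ext("script.R")
ext

# Detach the package again when it is no longer needed
detach("package:tools")
```

If the package was not loaded before, msg contains the "could not find function" message, while ext contains the file extension.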
p_load
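I recommend installing and loading packages with the p_load() function of the pacman package. It is superior because it installs missing packages automatically and loads several packages in one call, without quoting their names or wrapping them in the c() function. For example, instead of the traditional approach: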
pkgs <- c("tidyverse", "janitor", "haven", "readxl")
install.packages(pkgs)
# library() loads only one package per call, so we apply it to each name
invisible(lapply(pkgs, library, character.only = TRUE))
You can streamline the process as follows:
if (!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, janitor, haven, readxl)
The line if (!require(pacman)) install.packages("pacman") ensures that the pacman package is installed, which is necessary for using the p_load() function.
Before you load packages in a script, I recommend unloading all other packages with
pacman::p_unload(all)
to avoid conflicts of functions (see Section 7.5).
Throughout the lecture notes and in the exercises, I will use different packages. The installation can be time-consuming, and hence I recommend installing all packages at once by running the following lines of code in the Console. This takes some minutes depending on your PC and your internet connection. Afterwards, you have all packages that are used in my exercises, my lecture notes How to Use R for Data Science, and the book R for Data Science (2e) by Wickham & Grolemund (2023).
if (!require(pacman)) install.packages("pacman")
pacman::p_load(
  arrow, babynames, car, curl, devtools, dplyr, duckdb,
  expss, gapminder, ggplot2, ggrepel, ggridges, ggpubr,
  ggstats, ggthemes, haven, HH, janitor, kableExtra, knitr,
  Lahman, labelled, likert, magick, maps, MASS, nycflights13,
  openxlsx, palmerpenguins, papaja, plm, psych,
  remotes, rempsyc, repurrrsive, rstatix, skimr, sjlabelled,
  sjmisc, sjPlot, stargazer, texreg, tidymodels, tidyr,
  tidyverse, tinylabels, usethis, WDI, wbstats, writexl
)
In addition to these packages, I recommend installing a package that I created to offer you some tutorials and functions. I host this package on my GitHub account and you can install it as follows:
devtools::install_github("hubchev/hubchev")
Upon successfully installing R, you gain access to the functions that are part of Base R. This includes standard packages that are automatically installed and loaded with each R session, such as stats, utils, and graphics, which provide a broad range of functionality for statistical analysis and graphics (see Venables et al., 2022). However, the syntax of Base R can become complex and unintuitive. Consequently, many developers, including Hadley Wickham, the Chief Data Scientist at Posit (formerly RStudio), and his team, have developed an alternative suite of packages known as the tidyverse. These packages share a common philosophy and syntax, emphasizing readability and ease of use. We will rely heavily on the tidyverse in the following sections.
The R package tidyverse (see Figure 1.9) is a comprehensive collection of R packages, including popular packages such as ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats, which together offer extensive capabilities for data modeling, transformation, and visualization.
How to do data science with the tidyverse is the subject of multiple books and tutorials. In particular, the popular book R for Data Science by Wickham & Grolemund (2023) is all about the tidyverse universe. Thus, I highly recommend reading the sections Workflow: basics, Data transformation, and Data tidying. Additionally, explore www.tidyverse.org for more resources, and consider completing the tidyverse module in my swirl package, swirl-it, as detailed in Chapter 4.
To install and load tidyverse, run the following lines of code:
if (!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse)
To instruct R to perform a task, we use function calls. When we want R to utilize data, we refer to that data within an object. This leads to two important questions:
The assignment operator “<-” is used to store data in an object or to overwrite an existing object. For example, we can calculate the square root of 5 using the sqrt() function and assign the result to an object named square_root_of_five as follows:
square_root_of_five <- sqrt(5)
Now, if you call the object, R will return the result:
square_root_of_five
[1] 2.236068
The assignment operator is also explained in Section 7.1.1.
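Overwriting an existing object works in exactly the same way. In this short sketch, the object name counter is hypothetical and chosen only for illustration:

```r
# Store the value 5 in a new object
counter <- 5

# Overwrite the object: the right-hand side is evaluated first,
# then the result replaces the old content
counter <- counter + 1

counter
```

After the second assignment, the old value is gone and counter holds the new one.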
There are different ways to chain function calls in R. Base R allows you to nest one function within another. For example, to calculate the sum of the square root of 5 and the square root of 9 using the sum() function, you can write:
sum(sqrt(5), sqrt(9))
[1] 5.236068
In this case, we sum the square roots of 5 and 9, with the two functions nested as arguments within sum(). If you want to round the result to two decimal places, you can use the round(x, digits = 2) function like this:
round(sum(sqrt(5), sqrt(9)), digits = 2)
[1] 5.24
This is another example of function nesting.
Alternatively, you can use the pipe operator, which is written as “|>” in base R (available since R 4.1.0) or as “%>%” if you load the magrittr package. The pipe operator passes the output of one function to the next function as its input. Here’s how you can perform the same calculation using the pipe operator:
c(5, 9) |>
sqrt() |>
sum() |>
round(digits = 2)
[1] 5.24
This can be interpreted step by step as follows:
c(5, 9): We combine the values 5 and 9 into a vector using the c() function …AND THEN…
sqrt(): We calculate the square root of each value in the vector …AND THEN…
sum(): We sum the square roots …AND THEN…
round(digits = 2): We round the result to two decimal places.
As you can see, the pipe operator allows us to read the code as “and then”. This method of sequentially executing tasks has several advantages: it mimics how humans typically approach problems and makes the code easier to read and understand. Consequently, we will frequently utilize this operator throughout the book. The pipe operator is also explained in Section 7.3.2.
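To convince yourself that nesting and piping give the same answer, you can compare the two versions directly. This is only a small check of my own; the object names nested and piped are illustrative, and the base pipe requires R 4.1.0 or newer:

```r
# Nested version: read from the inside out
nested <- round(sum(sqrt(5), sqrt(9)), digits = 2)

# Piped version: read from top to bottom
piped <- c(5, 9) |>
  sqrt() |>
  sum() |>
  round(digits = 2)

# Both approaches return the same value
identical(nested, piped)
```

Which style you use is mostly a matter of readability. The result does not change.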
Defining your own functions in R is straightforward. For example, if you frequently need to perform the calculation described in Section 1.11, you can create a custom function like this:
process_numbers <- function(num1, num2) {
  c(num1, num2) |>
    sqrt() |>
    sum() |>
    round(digits = 2)
}
The function process_numbers() takes two numbers as arguments, performs the calculation, and returns the result. Now, you can use this function to process any two numbers with ease. For example, for the numbers 5 and 9, the function call is:
process_numbers(5, 9)
[1] 5.24
And for the numbers 4 and 9, it is:
process_numbers(4, 9)
[1] 5
This approach not only simplifies your code but also ensures consistency when performing the same operation multiple times.
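Functions can also have default values for their arguments, just like q(). The following variant is a sketch of my own and not part of the original example: the name process_values(), the use of ... to accept any number of values, and the digits argument are all assumptions made for illustration:

```r
# Accept any number of values via "..." and make the rounding adjustable
process_values <- function(..., digits = 2) {
  c(...) |>
    sqrt() |>
    sum() |>
    round(digits = digits)
}

# Uses the default of two decimal places
process_values(5, 9)

# Works with more than two numbers
process_values(4, 9, 16)

# Overrides the default number of decimal places
process_values(5, 9, digits = 4)
```

Because digits has a default value, you only need to specify it when you want something other than two decimal places.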
How to write user-defined functions is explained in greater detail in Section 7.5.