1  …R

Learning Objectives
  • Understand the reasons for choosing R as a programming language.
  • Learn effective strategies and resources for mastering R programming.
  • Learn the basic principles of R.
  • Distinguish between R and RStudio.
  • Demonstrate how to write and execute code in R and RStudio.
  • Learn how to install R, RStudio, and R packages.
  • Explore options to use R and RStudio without installing them on your machine, such as through cloud-based platforms.

1.1 Why R?

R is an open-source programming language that allows you to analyse and manipulate data, create state-of-the-art graphics, and much more. It handles large data sets, reads a wide range of data formats, and runs on multiple platforms (Windows, Mac, Linux) and CPU architectures (x86_64, arm64). R makes it easier to automate tasks, organize projects, ensure reproducibility, and find and fix errors, and anyone can contribute packages to improve its functionality. Moreover, the following points are worth emphasizing:

  • R is an artist! Check out:
  • R is employment insurance! Programming is a core skill in research, economics, and business. If you can write code, you have plenty of opportunities to earn a decent salary. R is one of the most widely used programming languages in the world today. It is used in almost every industry, such as finance, banking, medicine, or manufacturing. In finance and banking, for example, it is used for portfolio management and risk analytics. Even if you need to learn a new programming language later, knowing R makes it much easier to pick up another one.
  • R uses the computer and computers are great! Doing statistics on a computer is faster, easier and more powerful than doing it by hand. Computers are an extension of your brain and can do repetitive tasks better and faster without making logical errors. The only reason to do statistical calculations with pencil and paper is for learning purposes.
  • Low-code and no-code applications such as Excel are limited! Using spreadsheet software like Microsoft Excel for research can be problematic. It’s easy to lose track of operations, making the process difficult to oversee and document. Command-line programs may not be as easy to learn, but they offer a more straightforward approach that allows the results to be replicated easily.
  • R is open source! Proprietary software is expensive, and support can only be provided by the copyright owner, which means that if the software expires there is nothing you can do about it. Moreover, security issues cannot be checked as the source code is not available, and possibilities for customization are limited. R is yours and everybody can contribute to its success.
  • R is big! When you download and install R, you get some basic packages whose functions already allow you to do a lot of things. Beyond that, you can write your own packages or install user-written packages that extend your possibilities. With more than 20,000 packages on the CRAN repository and many more available on GitHub and other platforms, R’s extensive library supports a wide variety of data science tasks. Its widespread use and open-source availability have cemented R as a standard tool in data science and ensured that there are multiple approaches to most data handling processes. These can be easily adopted.
R has weaknesses

For newcomers to programming, the learning curve is rather flat at the beginning, that is, progress is slow at first. One reason is that R tools are spread across many packages, which can overwhelm beginners. There is no centralized support, and the members of the helpful and active online community have different backgrounds. It can be difficult for beginners to find the right solution, as there are often many different ways to tackle the same problem. Moreover, R can be slower than languages like Python, MATLAB, C/C++ or Java.

1.2 How to learn R

There are many different approaches to learning R. Which one works best depends on your preferences, needs, goals, prerequisites and limitations. It is up to you to search and find a suitable way to achieve your learning goals. While I hope you find my notes helpful, I additionally provide in Section 1.3 a list of other resources that are worth considering. To start with, I recommend my swirl courses that provide an interactive learning environment, see Chapter 4.

Get your hands dirty!

Learning a programming language can, like learning a foreign language, be daunting and frustrating. However, if you put in the effort and are not afraid to make mistakes, you can learn it. You don’t have to be a nerd. Having a guide next to you can help and speed up your progress significantly. The key is taking action and getting involved. I mean, do write code. Try to copy the code that you read here and elsewhere. Explore what the code does on your machine. Don’t be afraid to make errors. Your PC will not explode. In these notes, most of the code is written in a manner that allows you to effortlessly copy and reproduce the output on your PC. Take advantage of this opportunity and go for it! Hands-on practice is far more enjoyable than merely reading through the material.

Figure 1.1: Play around with code

Here are some comments that may help you to learn efficiently:

  • Computers need clear and precise instructions to work: They can’t handle mistakes or unclear directions. They are actually sort of stupid as they do not have an intuition. They just take you literally. Even small errors like a missing comma or an unclosed bracket can cause your code to not work. Computers do exactly what you tell them, no more and no less.
Computers take you literally

Let me illustrate what I mean: Suppose you send your grandfather the following message:

“Let’s eat grandpa.”

He will probably understand that you’re inviting him to dinner. However, if you sent the same message to a computer, it would interpret the sentence literally, as a suggestion to eat your grandfather, because the comma is missing. What you meant to write was:

“Let’s eat, grandpa.”

The comma makes all the difference in clarifying that you’re speaking to your grandpa, not about eating him! Similarly, in programming, an incorrectly placed comma can break your code or change the meaning of your code.

  • Copy, paste, and tweak: While learning code from scratch is sometimes essential, you can speed up your work by modifying code that already exists. I call this the “copy, paste, and tweak” approach. While this is not the only way to learn code, it gets a job done quickly, and it is fun, see Figure 1.1.
  • Have a purpose when coding: Rather than learning to code for its own sake, it is more fun and you’ll probably learn faster when you have a goal in mind. Try to analyze data that you are interested in. Another good exercise is replicating a research paper.
  • Practice is key: The best way to improve your coding skills is through lots of practice. Consequently, these notes give you plenty of exercises.
  • Use ChatGPT: The usage of supporting tools is not forbidden. ChatGPT can help you to understand code and brainstorm solutions. However, it’s important to know that ChatGPT might suggest complex methods when there are shorter and more elegant solutions available. Absolute beginners might find ChatGPT’s solutions overwhelming and have difficulty tweaking the proposed sketch of a solution. So, use it thoughtfully.
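The remark above about commas and unclosed brackets can be tried out directly in R. The following is a minimal sketch that uses only base R functions; the object names are chosen for illustration only.

```r
# With a comma, c() combines two separate values.
with_comma <- c(1, 5)
length(with_comma)
sum(with_comma)

# Without the comma, R reads one single number.
without_comma <- c(15)
length(without_comma)

# An unclosed bracket is a syntax error. parse() only checks the syntax,
# and tryCatch() catches the error so that the script keeps running.
check <- tryCatch(
  parse(text = "sum(1, 5"),
  error = function(e) "syntax error"
)
check
```

Both versions of c() are valid code, so R does not complain about the missing comma. It simply does something different from what you intended.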

1.3 Learning resources

Thousands of freely available books and resources exist. bookdown.org and the Big Book of R are two vast collections of links to R books that support this claim.

In RStudio, the Help panel at the bottom right contains many links, manuals, and references that offer plenty of free resources to learn R, including education.rstudio.com and Links for Getting Help with R. At the top right of RStudio, you find a panel called Tutorial. There you can install the learnr package, which offers some nice interactive tutorials.

Since you may feel overwhelmed by the number of resources, I would like to highlight some books:

Figure 1.2: A collection of textbooks

  1. Wickham & Grolemund (2023): R for Data Science: Import, Tidy, Transform, Visualize, and Model Data is the most popular source to learn R. It focuses on introducing the tidyverse package and is freely available online.
  2. Healy (2018): Data Visualization: A Practical Introduction is a hands-on introduction to the principles and practice of looking at and presenting data using R and ggplot.
  3. Irizarry (2022): Introduction to Data Science: Data Analysis and Prediction Algorithms With R is a complete, up-to-date, and applied introduction.
  4. Venables et al. (2022): An Introduction to R: Notes on R: A Programming Environment for Data Analysis and Graphics is a manual from the R Core Team that shows how to use R without having to install and load additional packages.
  5. Neth (2023): Data Science for Psychologists is a comprehensive introduction to R and data science for non-experts in both programming and data science. It uses a variety of data types and includes many examples and exercises.
  6. Kabacoff (2024): Modern Data Visualization with R teaches how to create graphs from scratch providing a lot of examples that you can copy, paste and tweak.
Healy, K. (2018). Data visualization: A practical introduction. Accessed January 30, 2023; Princeton University Press. https://socviz.co/
Irizarry, R. A. (2022). Introduction to data science: Data analysis and prediction algorithms with R. Accessed January 30, 2023; CRC Press. https://rafalab.github.io/dsbook/
Neth, H. (2023). ds4psy: Data science for psychologists. Social Psychology and Decision Sciences, University of Konstanz. https://doi.org/10.5281/zenodo.7229812
Kabacoff, R. (2024). Modern data visualization with R. Chapman & Hall/CRC. https://rkabacoff.github.io/datavis/

Some other sources that are worth mentioning are these:

  • The search engine www.rseek.org is R-specific and often better than www.google.com, as it only searches for content related to the programming language R.
  • On rdocumentation.org you can find the complete documentation of all R packages.
  • Many find these cheatsheets helpful.

1.4 What is a function in R?

R is a functional programming language. If you want R to do something, you need to use a function. Or, in the words of Chambers (2017, p. 4):

“Everything that happens is a function call.”

For example, when you want to exit R, you do it with the function q():

> q()
Save workspace image? [y/n/c]: 

If you want to specify what exactly you want R to do for you, you need to refer to the arguments of a function. For example, if you don’t want to be asked interactively what you want to do with your workspace (this is the place where you store all your objects, see Section 1.5), you can do this with an argument that is part of the q() function:

> q(save = "no")
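Because q() ends the session, it is not a convenient function to experiment with. As a minimal sketch of how arguments work in general, the following lines use the base R function round() instead (chosen here for illustration only). It shows that arguments can be passed by name or by position, and that arguments you leave out take their default value.

```r
# Arguments can be matched by name ...
by_name <- round(x = 3.14159, digits = 2)

# ... or by position, in the order given in the documentation.
by_position <- round(3.14159, 2)

# An argument that is left out takes its default (digits = 0 for round()).
with_default <- round(3.14159)

by_name
by_position
with_default
```

Naming the arguments is more to type, but it makes the code easier to read and protects against mixing up their order.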

To learn more about a function, you can access its documentation by typing a question mark followed by the function name into the Console:

?q

Unfortunately, the documentation can sometimes be a bit confusing for beginners in applied contexts. However, the documentation for all functions is structured similarly, typically featuring several key sections:

  • Description: A brief overview of what the function does.
  • Usage: How to use the function, including the function name and its arguments.
  • Arguments: Detailed descriptions of each argument the function accepts, including what types of values are expected.
  • Details: Additional details about the function’s behavior and any important notes.
  • Examples: Practical examples demonstrating how to use the function in various contexts.

Understanding these sections can significantly enhance your ability to navigate and utilize R.

An excerpt of the R Documentation for the function q() is shown in Figure 1.3. Here, we observe that the function has three arguments that you can manipulate. If you do not specify any of these arguments explicitly, we see that by default, R sets the three arguments as shown.

Figure 1.3: The R Documentation of q()
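The arguments and defaults listed in the documentation can also be inspected from the Console. The following sketch uses the base R functions args() and formals(); the object name defaults is only chosen for illustration.

```r
# args() prints the arguments of a function together with their defaults.
args(q)

# formals() returns the same information as a list that can be inspected.
defaults <- formals(q)
names(defaults)   # the names of the arguments
defaults$save     # the default value of the argument save
```

This works for any function written in R, so it is a quick alternative to opening the help page when you only need to recall the argument names.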

1.5 What are objects in R?

R is an object-oriented programming language. That means,

“everything that exists in R is an object” (Chambers, 2017, p. 4).

Chambers, J. M. (2017). Extending R. CRC Press.

Objects are the fundamental units that are used to store information. Objects can be of a variety of types, including vectors, matrices, data frames, lists, and functions. Moreover, you can store empirical results, tables, figures and much more in the form of objects. All objects are part of the workspace, which is shown in the Environment panel.

In R, you can show the content of the workspace with ls(). The function rm() allows you to remove objects, and with rm(list=ls()) you clear all objects from the workspace.
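A minimal sketch of these workspace functions is shown here. The object names x and y are arbitrary, and the example removes only one object so that nothing else in your workspace is deleted.

```r
# Create two objects in the workspace.
x <- 1:3
y <- "hello"

# List the names of all objects in the workspace, including x and y.
ls()

# Remove only the object x.
rm(x)

# Check which of the two objects still exist.
x_exists <- "x" %in% ls()
y_exists <- "y" %in% ls()
x_exists
y_exists
```

Be careful with rm(list=ls()): it removes every object without asking for confirmation.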

1.6 What are R and RStudio?

While R has a command line interface, there are multiple third-party graphical user interfaces available that improve the user experience a lot. The most successful graphical user interface or integrated development environment (IDE) is RStudio. Throughout this book, I will assume that you are using R via RStudio. First-time users often confuse the two. At its simplest, R is like a car’s engine while RStudio is like a car’s dashboard, as illustrated in Figure 1.4.

Figure 1.4: Analogy of difference between R and RStudio

More precisely, R is a functional programming language that runs computations, while RStudio is an integrated development environment (IDE) that provides an interface by adding many convenient features and tools. So just as the way of having access to a speedometer, rearview mirrors, and a navigation system makes driving much easier, using RStudio’s interface makes using R much easier as well.

Much as we don’t drive a car by interacting directly with the engine but rather by interacting with elements on the car’s dashboard, we won’t be using R directly but rather we will use RStudio’s interface. After you install R and RStudio on your computer, you’ll have two new programs (also called applications) you can open. We’ll always work in RStudio and not in the R application. Figure 1.5 shows which icon you should click on your computer.

Figure 1.5: Icons of R versus RStudio on your computer

After you open RStudio, you should see something similar to Figure 1.6, where three or four panels divide the screen.

Figure 1.6: A sketch of RStudio interface to R

  1. The Environment panel, where a list of all objects is shown.
  2. The Files, Plots and Help panel, which allows you to manage files, preview plots, and find help for different functions of R.
  3. The Console panel, used for running code.
  4. The Script panel, used for writing code.

If the Script panel is not shown, open it by opening an existing R script or creating a new one. You can create a new one by pressing Ctrl+Shift+N (alternatively, you can use the menu: File → New File → R Script).

The Console panel will contain R’s startup message, which shows information about which version of R you’re running. My startup message at the time of writing was as follows:

 R version 4.3.3 (2024-02-29) -- "Angel Food Cake"
 Copyright (C) 2024 The R Foundation for Statistical Computing
 Platform: x86_64-pc-linux-gnu (64-bit)

 R is free software and comes with ABSOLUTELY NO WARRANTY.
 You are welcome to redistribute it under certain conditions.
 Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

 R is a collaborative project with many contributors.
 Type 'contributors()' for more information and
 'citation()' on how to cite R or R packages in publications.

 Type 'demo()' for some demos, 'help()' for on-line help, or
 'help.start()' for an HTML browser interface to help.
 Type 'q()' to quit R.

You can resize the panels as you like, either by clicking and dragging their borders or by using the minimise/maximise buttons in the upper right corner of each panel. Pressing Ctrl++ or Ctrl+- makes the font larger or smaller.

1.7 How to write and run code in R and RStudio

In the Console, you can type in code and press Enter to run the line of code. For example, you can calculate:

1+4
[1] 5

While working in the Console is possible, we usually work in RStudio using so-called scripts. These scripts are plain text files with the file extension “.R”. Scripts are discussed in detail in Chapter 3. To create a script, go to the File menu, select New File and then choose R Script.

With the keyboard shortcut Ctrl+Enter for Windows and Linux users or Cmd+Enter for macOS users (or by clicking Run), you can run a line of a script, which means that you send one line of code to the Console. Figure 1.7 shows what this looks like in RStudio.

Figure 1.7: One plus four in an R script
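A script usually contains more than one line. The following sketch shows what the content of a small script could look like; the file name first_script.R and the object names are hypothetical. You can run it line by line with the shortcut described above.

```r
# first_script.R (hypothetical file name)
# Lines starting with # are comments and are ignored by R.

a <- 1 + 4      # store the result of the calculation
b <- a * 2      # reuse the stored result
print(b)        # show the value of b in the Console
```

Unlike code typed into the Console, a script can be saved, so the same steps can be repeated later.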

1.8 How to install R, RStudio, and R packages

Figure 1.8: Set up R in three steps

As shown in Figure 1.8, setting up R on your personal computer (Windows, Mac, Linux) is a three step process: You will first need to download and install R. After that has been successful you can download and install RStudio. Please note that it is important that you install R first and then install RStudio. As a third but optional step you can install R packages.

  1. Do this first: Download and install R here.
    • If you are a Windows user: Click on “Download R for Windows”, then click on “base”, then click on the Download link.
    • If you are a macOS user: Click on “Download R for (Mac) OS X”, then under “Latest release:” click on R-X.X.X.pkg, where R-X.X.X is the version number. For example, the latest version of R as of March 29, 2024 was R-4.3.3.
    • If you are a Linux user: Click on “Download R for Linux” and choose your distribution for more information on installing R for your setup.
  2. Do this second: Download and install RStudio here.
    • Scroll down to “Installers for Supported Platforms” near the bottom of the page.
    • Click on the download link corresponding to your computer’s operating system.
  3. Do this third: Install R packages. This step is optional, as you can install R packages at any time. However, it may be a good idea to install frequently used packages in one go because the installation of some packages can be time-consuming. Therefore, I recommend reading Section 1.9 and following the instructions therein.

If you don’t want to install R on your PC, if you don’t have admin rights to do so, or if you want to run R on your tablet (iPad or Chromebook) or even your smartphone, you can use RStudio online at https://posit.cloud. Posit Cloud (formerly RStudio Cloud) is a cloud-based solution that allows anyone to use RStudio through a web browser. It is free for individuals with some restrictions and limited capacities.

1.9 What are R packages?

A package is a collection of functions, data sets and other R objects that are all grouped together under a common name. More than 20,000 packages are available at the official repository (CRAN). CRAN is a network of ftp and web servers around the world that store identical, up-to-date versions of code and documentation for R, see https://cran.r-project.org.

However, before we get started, there’s a critical distinction that you need to understand, which is the difference between having a package installed on your computer, and having a package loaded in R. When you install R on your computer only a small number of packages come bundled with the basic R installation. The installed packages are on your computer. The critical thing to remember is that just because something is on your computer doesn’t mean R can use it. In order for R to be able to use one of your installed packages, that package must also be loaded. Generally, when you open up R, only a few of these packages (about 7 or 8) are actually loaded.

Package management
  1. A package must be installed before it can be loaded.
  2. A package must be loaded before it can be used.

We only need to install a package once on our computer. However, to use the package, we need to load it every time we start a new R session or RStudio, respectively.

1.9.1 Package installation

To install an R package, you can use the GUI of RStudio or the command line. In RStudio, click on the Packages tab, then on the Install button, then search for a package and click Install. An alternative way to install a package is by typing

install.packages("package_name")

in the console pane of RStudio and pressing Return/Enter on your keyboard. Note you must include the quotation marks around the name of the package.

If you want to update a previously installed package to a newer version, you need to re-install it by repeating the earlier steps or use update.packages(). To uninstall packages you can use remove.packages().

The installation of packages can take some time. However, if your CPU has many cores, you can speed up the process considerably using the argument Ncpus, like this: update.packages(ask = FALSE, Ncpus = 4L). This argument sets the number of parallel processes R can use for the installation. So, if you have a CPU with many cores, you can increase that number. A tutorial on how to set the number of cores used by R permanently can be found here.
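Instead of passing Ncpus in every call, the number of cores can also be set as an option, because install.packages() takes its default for Ncpus from the option of the same name. The following is a minimal sketch, not the linked tutorial; keeping one core free is my own assumption, and the object names are chosen for illustration only.

```r
# Number of cores reported by the operating system (can be NA on some systems).
n_cores <- parallel::detectCores()

# Keep one core free and fall back to 1 if the number is unknown.
n_use <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)

# install.packages() and update.packages() pick up this option by default.
options(Ncpus = n_use)
getOption("Ncpus")
```

The option only lasts for the current session. Placing the options() line in your .Rprofile file is one common way to make the setting persistent.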

1.9.2 Package loading

Recall that after you’ve installed a package, you need to load it. We do this by using the library() command. For example, to load the ggplot2 package, run the following code in the console pane. What do we mean by “run the following code”? Either type or copy-and-paste the following code into the console pane and then hit the Enter key.

library("ggplot2")

If after running the earlier code, a blinking cursor returns next to the > “prompt” sign, it means you were successful and the ggplot2 package is now loaded and ready to use. If, however, you get a red “error message” that reads

Error in library(ggplot2) : there is no package called ‘ggplot2’

it means that you didn’t successfully install it. If you get this error message, go back to Section 1.9.1 on R package installation and make sure to install the ggplot2 package before proceeding.

One very common mistake new R users make when wanting to use particular packages is they forget to load them first by using the library() command we just saw. Remember: you have to load each package you want to use every time you start RStudio. If you don’t first load a package, but attempt to use one of its features, you’ll see an error message similar to:

Error: could not find function

R is informing you that you are attempting to use a function from a package that has not yet been loaded. Forgetting to load packages is a common mistake made by new users, and it can be a bit frustrating to get used to at first. However, with practice, it will become second nature for you. Unloading packages can be done with detach(package:ggplot2, unload=TRUE).
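The difference between an installed and a loaded package can also be checked in code. The following sketch uses the stats package, which ships with every R installation, so that it runs without installing anything; the object names are chosen for illustration only.

```r
pkg <- "stats"   # a package that ships with every R installation

# Installed? requireNamespace() returns TRUE if the package is installed
# and its namespace can be loaded.
is_installed <- requireNamespace(pkg, quietly = TRUE)

# Loaded with library()? search() lists everything on the search path.
is_loaded <- paste0("package:", pkg) %in% search()

is_installed
is_loaded
```

Replacing "stats" with the name of another package, for example "ggplot2", tells you whether the "could not find function" error comes from a missing installation or a missing library() call.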

1.9.3 Simplified package management with p_load

I recommend installing and loading packages using the p_load() function of the pacman package. It is superior because

  • it only installs a package if it has not been installed yet,
  • it loads the package, and
  • it requires neither quotes nor the c() function.

For example, instead of the traditional approach:

install.packages(
  c("tidyverse", "janitor", "haven", "readxl")
  )
# library() accepts only one package per call
library("tidyverse")
library("janitor")
library("haven")
library("readxl")

You can streamline the process as follows:

if (!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, janitor, haven, readxl)

The line if (!require(pacman)) install.packages("pacman") ensures the installation of the pacman package, which is necessary for using the p_load function.
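To make the idea behind p_load() more concrete, the following sketch reproduces the "install only if missing, then load" logic with base R functions. It is not the actual implementation of pacman, and it uses the base packages stats and utils so that nothing has to be downloaded when you try it.

```r
# Base packages are used here so that the sketch runs without a download.
pkgs <- c("stats", "utils")

for (pkg in pkgs) {
  # Install only if the package is not available yet.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE tells library() that pkg holds the name as a string.
  library(pkg, character.only = TRUE)
}

# All packages should now appear on the search path.
paste0("package:", pkgs) %in% search()
```

With p_load(), the whole loop shrinks to a single line, which is why I use it throughout these notes.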

Before you load packages in a script, I recommend unloading all other packages with

pacman::p_unload(all)

to avoid conflicts of functions (see Section 7.5).

Tip 1.1: Install everything now

Throughout the lecture notes and in the exercises, I will use different packages. The installation can be time-consuming, and hence I recommend installing all packages by running the following lines of code in the Console. This takes some minutes depending on your PC and your internet connection. However, after installing all these packages you have all packages that are used in my exercises, my lecture notes How to Use R for Data Science, and the book R for Data Science (2e) by Wickham & Grolemund (2023).

if (!require(pacman)) install.packages("pacman")
pacman::p_load(
      arrow, babynames, car, curl, devtools, dplyr, duckdb,
      expss, gapminder, ggplot2, ggrepel, ggridges, ggpubr, 
      ggstats, ggthemes, haven, HH, janitor, kableExtra, knitr, 
      Lahman, labelled, likert, magick, maps, MASS, nycflights13,
      openxlsx, palmerpenguins, papaja, plm, psych,
      remotes, rempsyc, repurrrsive, rstatix, skimr, sjlabelled, 
      sjmisc, sjPlot, stargazer, texreg, tidymodels, tidyr, 
      tidyverse, tinylabels, usethis, WDI, wbstats, writexl
)

In addition to these packages, I recommend installing a package that I created to offer you some tutorials and functions. I host this package on my GitHub account and you can install it as follows:

devtools::install_github("hubchev/hubchev")

1.10 Base R and the tidyverse universe

Upon successfully installing R, you gain access to functions that are part of Base R. This includes standard packages automatically installed and loaded with each R session, such as stats, utils, and graphics, providing a broad spectrum of functionalities for statistical analysis and graphical capabilities (see Venables et al., 2022). However, the syntax in Base R can become complex and less intuitive for users. Consequently, many individuals, including Hadley Wickham, the Chief Data Scientist at Posit (formerly RStudio), and his team, have developed an alternative suite of packages known as the tidyverse. These packages share a common philosophy and syntax, emphasizing readability and ease of use. We will heavily utilize the tidyverse in the following sections.

Venables, W. N., Smith, D. M., & R Core Team. (2022). An introduction to R: Notes on R: A programming environment for data analysis and graphics (Version 4.3.2 (2023-10-31)). http://cran.r-project.org/doc/manuals/R-intro.pdf

The R package tidyverse (see Figure 1.9) is a comprehensive collection of R packages, including popular packages such as ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats, which together offer extensive capabilities for data modeling, transformation, and visualization.

How to do data science with tidyverse is the subject of multiple books and tutorials. In particular, the popular book R for Data Science by Wickham & Grolemund (2023) is all about the tidyverse universe. Thus, I highly recommend reading the sections Workflow: basics, Data transformation, and Data tidying. Additionally, explore www.tidyverse.org for more resources, and consider completing the tidyverse module in my swirl package, swirl-it, as detailed in Chapter 4.

Wickham, H., & Grolemund, G. (2023). R for data science (2e). https://r4ds.hadley.nz/

To install and load tidyverse run the following lines of code:

if (!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse)

Exercise 1.1 Set up R, RStudio, and R packages

Open this interactive tutorial and work through it.

1.11 Two key programming operators

To instruct R to perform a task, we use function calls. When we want R to utilize data, we refer to that data within an object. This leads to two important questions:

  1. How do we create objects?
  2. How can we instruct R to execute multiple steps in sequence?

1.11.1 Assignment operator: “<-”

The assignment operator “<-” is used to store data in an object or overwrite an existing object. For example, we can calculate the square root of 5 using the sqrt() function and assign the result to an object named square_root_of_five as follows:

square_root_of_five <- sqrt(5)

Now, if you call the object, R will return the result:

square_root_of_five 
[1] 2.236068

The assignment operator is also explained in Section 7.1.1.
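The statement that the assignment operator can also overwrite an existing object is illustrated by the following minimal sketch. The object name x is arbitrary.

```r
x <- 10        # create the object x
x <- x + 5     # overwrite x with a new value based on the old one
x
```

R does not warn you when an object is overwritten, so it pays to choose descriptive names that you are unlikely to reuse by accident.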

1.11.2 Pipe operator: “|>”

There are different ways to chain function calls in R. Base R allows you to nest one function within another. For example, to calculate the sum of the square root of 5 and the square root of 9 using the sum() function, you can write:

sum(sqrt(5), sqrt(9))
[1] 5.236068

In this case, we sum the square roots of 5 and 9, with the two functions nested as arguments within sum(). If you want to round the result to two decimal places, you can use the round(x, digits = 2) function like this:

round(sum(sqrt(5), sqrt(9)), digits = 2)
[1] 5.24

This is another example of function nesting.

Alternatively, you can use the pipe operator, which is written as “|>” in base R (available since R version 4.1.0) or as “%>%” if you load the magrittr package. The pipe operator passes the output of one function to serve as the input for the next. Here’s how you can perform the same calculation using the pipe operator:

c(5, 9) |>
  sqrt() |> 
  sum() |> 
  round(digits = 2)  
[1] 5.24

This can be interpreted step by step as follows:

  • c(5, 9) : We combine the values 5 and 9 into a vector using the c() function …AND THEN…
  • sqrt(): We calculate the square root of each value in the vector …AND THEN…
  • sum(): We sum the square roots …AND THEN…
  • round(digits = 2): We round the result to two decimal places.

As you can see, the pipe operator allows us to read the code as “and then”. This method of sequentially executing tasks has several advantages: it mimics how humans typically approach problems and makes the code easier to read and understand. Consequently, we will frequently utilize this operator throughout the book. The pipe operator is also explained in Section 7.3.2.
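That nesting and piping are two notations for the same computation can be checked directly. The following sketch assumes R version 4.1.0 or later, where the native pipe is available; the object names nested and piped are chosen for illustration only.

```r
# Nested version: read from the inside out.
nested <- round(sum(sqrt(c(5, 9))), digits = 2)

# Piped version: read from top to bottom.
piped <- c(5, 9) |>
  sqrt() |>
  sum() |>
  round(digits = 2)

# Both notations give exactly the same result.
identical(nested, piped)
```

The pipe places the left-hand side into the first argument of the function on the right, which is why round(digits = 2) does not need the number to be written out again.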

1.12 Write your own function

Defining your own functions in R is straightforward. For example, if you frequently need to perform the calculation described in Section 1.11, you can create a custom function like this:

process_numbers <- function(num1, num2) {
  c(num1, num2) |> 
    sqrt() |> 
    sum() |> 
    round(digits = 2)
    } 

The function process_numbers() takes two numbers as arguments, performs the calculation, and returns the result. Now, you can use this function to process any two numbers with ease. For example, for the numbers 5 and 9, the function call is:

process_numbers(5, 9)
[1] 5.24

And for the numbers 4 and 9, it is:

process_numbers(4, 9)
[1] 5

This approach not only simplifies your code but also ensures consistency when performing the same operation multiple times.
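Like the built-in functions discussed in Section 1.4, your own functions can have arguments with default values. The following sketch is a hypothetical variant of process_numbers(), named process_numbers2() for illustration, in which the number of decimal places becomes an optional argument.

```r
# digits = 2 is the default and is used when the argument is not specified.
process_numbers2 <- function(num1, num2, digits = 2) {
  c(num1, num2) |>
    sqrt() |>
    sum() |>
    round(digits = digits)
}

default_digits <- process_numbers2(5, 9)              # uses digits = 2
four_digits    <- process_numbers2(5, 9, digits = 4)  # overrides the default

default_digits
four_digits
```

A default keeps the simple call short while still allowing the behavior to be adjusted when needed.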

How to define your own functions is explained in greater detail in Section 7.5.