6  Pitfalls

R newbies often make the same small mistakes that can lead to major confusion, frustration and inefficiency. Some of them can be easily avoided. In this section, I will outline common pitfalls that I have repeatedly observed as an R instructor and offer practical solutions to avoid them.

6.1 No clue about the “working directory”

Problem: Students start their R sessions unaware of their current working directory. This can lead to difficulties when reading and writing files.

Solution: At the beginning of your R script, set a working directory using setwd(). Consider using RStudio projects, see Section A.4. For more information, see Appendix A: Navigating the file system and Workflow: scripts and projects of Wickham & Grolemund (2023).
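As a minimal sketch of this check, the following lines print the current working directory, switch to a project folder, and confirm the change. The folder used here is a temporary directory standing in for your own project path, which is an assumption for illustration only.

```r
# Where does R currently read from and write to?
start_dir <- getwd()
print(start_dir)

# Stand-in for your project folder (assumption: replace with your own path,
# e.g. "~/your-directory/").
project_dir <- tempdir()

# setwd() invisibly returns the previous working directory.
previous_dir <- setwd(project_dir)

# Confirm that the switch worked; normalizePath() resolves symbolic links.
same_dir <- normalizePath(getwd()) == normalizePath(project_dir)
print(same_dir)

# Relative paths are now resolved against the project folder.
writeLines("hello", "test-file.txt")
file_written <- file.exists(file.path(project_dir, "test-file.txt"))

# Clean up and go back to where we started.
unlink("test-file.txt")
setwd(previous_dir)
```

Running getwd() once at the top of a session is usually enough to notice that R is looking in an unexpected place.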

6.2 No consistent directory structure

Problem: Students save files in different directories without a clear scheme. This disorganization often leads to problems: scripts and data get lost, and code breaks.

Solution: Organize your project into a clear directory structure from the beginning. Here is my suggestion for a directory structure but feel free to come up with your own:

Table 6.1: Typical folder structure
Sub-directory   What to save here
doc/            documentation
dta/            processed data
fig/            figures
lit/            literature and pdfs
ori/            original raw data that you should never change
qmd/            reports
scr/            R scripts
tab/            tables
tmp/            temporary files

This structure will save time and headaches when navigating projects.
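The folders from Table 6.1 can also be created from within R. The following is a sketch under the assumption that the project lives in a folder called my-project inside a temporary directory; both the name and the location are placeholders.

```r
# Sub-directories from Table 6.1
sub_dirs <- c("doc", "dta", "fig", "lit", "ori", "qmd", "scr", "tab", "tmp")

# Hypothetical project root inside a temporary directory (assumption:
# replace with the path of your real project).
project_root <- file.path(tempdir(), "my-project")

# Create each folder; recursive = TRUE also creates the root if needed,
# showWarnings = FALSE keeps reruns quiet when folders already exist.
for (d in file.path(project_root, sub_dirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# List what was created.
created <- list.dirs(project_root, full.names = FALSE, recursive = FALSE)
print(created)
```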

Tip 6.1: Do not save processed data unless necessary

It may seem reasonable to save data after editing, but this often isn’t necessary if you’re using scripts to create your data. These scripts can be rerun whenever needed, regenerating the dataset each time. Save processed datasets only when the preprocessing steps are time-consuming. This avoids wasting disk space, keeps your project folder organized, and ensures that your data analyses always reflect the latest version of your code.
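If a preprocessing step really is slow, one possible pattern is to save its result once and reuse it on later runs. The sketch below assumes a hypothetical function slow_preprocessing() as a stand-in for your own code and a cache file in a temporary directory.

```r
# Hypothetical location for the cached dataset (assumption: in a real project
# this would be something like "dta/df_processed.rds").
cache_file <- file.path(tempdir(), "df_processed.rds")

# Stand-in for a time-consuming preprocessing step.
slow_preprocessing <- function() {
  data.frame(id = 1:3, value = c(10, 20, 30))
}

# Read the cached result if it exists, otherwise compute and save it.
load_processed <- function(path) {
  if (file.exists(path)) {
    readRDS(path)               # reuse the saved result
  } else {
    df <- slow_preprocessing()  # recompute ...
    saveRDS(df, path)           # ... and save it for next time
    df
  }
}

df_first  <- load_processed(cache_file)  # computes and saves
df_second <- load_processed(cache_file)  # reads from disk
```

Remember to delete the cached file whenever the preprocessing code changes, otherwise you keep working with an outdated dataset.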

6.3 Working manually outside R

Problem: Students want to get their work done quickly, so they sometimes rely on manual processes that they have already mastered. This approach can seriously undermine the reproducibility of their data work.

Consider a typical three-step process for loading data: (1) downloading the data, (2) unpacking the data, and finally (3) importing the data into R. Many students take a manual approach by using their Internet browser to download the data, then using their operating system’s unpacking application, and finally importing the data into an R script. While this method is not inherently wrong, there is a risk that students forget to unpack newly downloaded data and accidentally work with outdated data. In the kickstart example provided by Section 5.2, I show that all three steps can be performed seamlessly in R. This way you ensure that you are always working with the most up-to-date data.

Solution: Do as much as possible in the script. Invest some time to find out how to download and manipulate the data within R. If it is not possible or if alternatives are superior, describe what you do outside of R explicitly and write a warning note at the top of your script.
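The three steps can be sketched in base R as follows. This is a generic illustration, not the kickstart example from Section 5.2; the function name load_zipped_csv and the URL in the usage comment are hypothetical.

```r
# Download a zip archive, unpack it, and read one CSV file from it.
# All argument values are supplied by the caller; nothing here refers to a
# real data source.
load_zipped_csv <- function(url, csv_name, dir = tempdir()) {
  zip_path <- file.path(dir, basename(url))

  # (1) download; mode = "wb" keeps binary files intact on Windows
  download.file(url, destfile = zip_path, mode = "wb")

  # (2) unpack into dir
  unzip(zip_path, exdir = dir)

  # (3) import
  read.csv(file.path(dir, csv_name))
}

# Usage with a hypothetical URL and file name (not run):
# df <- load_zipped_csv("https://example.org/data.zip", "data.csv")
```

Because every run downloads and unpacks the archive again, the imported data cannot silently lag behind the source.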

6.4 No active R Packages management

Problem: Students often forget to install and/or load the packages correctly at the beginning of a script. Some unnecessarily install packages repeatedly when running a script. All this can lead to errors and interruptions.

Solution: At the beginning of each script, make sure that all required packages are loaded correctly. Use the pacman package, which provides the p_load() function to load and, if necessary, install packages and the p_unload(all) function to unload all packages.

Tip 6.2: Start your script with
if (!require(pacman)) install.packages("pacman")
pacman::p_unload(all)
pacman::p_load(tidyverse, janitor)
setwd("~/your-directory/")
rm(list = ls())

6.5 Confusion between console and script

Problem: Alternating between running code in the console and from the script without a systematic approach can lead to untracked changes and confusion about the current state of objects in the workspace. Additionally, students often borrow code snippets from others and run only the sections that seem immediately relevant. This practice can lead to errors or unexpected results, as such code often relies on previous commands or setups.

Solution: Develop the habit of testing small blocks of code in the console, but run the complete script regularly to ensure everything works in sequence. In RStudio, use shortcuts like Ctrl + Alt + R to run the entire script, Ctrl + Alt + B to run it from the beginning up to the current line, or Ctrl + Alt + E to run it from the current line to the end.
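Outside of keyboard shortcuts, a complete script can also be run with source(). The sketch below writes a tiny stand-in script to a temporary file and runs it in its own environment; the file name analysis.R is an assumption for illustration.

```r
# Write a tiny stand-in script to a temporary file (assumption: in practice
# this is your own script, e.g. "scr/analysis.R").
script_path <- file.path(tempdir(), "analysis.R")
writeLines(c(
  "x <- 1:10",
  "x_mean <- mean(x)"
), script_path)

# Run the whole script top to bottom in its own environment, so that its
# results are kept apart from objects created in the console.
run_env <- new.env()
source(script_path, local = run_env, echo = TRUE)

# Objects created by the script
print(ls(run_env))
```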

6.6 Misunderstanding data types and formats

Problem: Misusing or misunderstanding R’s data types and structures can lead to errors in data manipulation and analysis. Many functions require certain types of data. For example, the tidyverse packages require data to be “tidy”. Moreover, data often comes with errors and/or missing values (NA). Beginners often overlook data cleaning and the handling of missing values.

Solution: Familiarize yourself with basic data types and structures like vectors, lists, data frames/tibbles, and factors. For more information, see Section 7.2 and Data tidying of Wickham & Grolemund (2023). Moreover, spend adequate time on data cleaning and preprocessing. Techniques such as handling missing values, normalizing data, and correcting data types are critical. For more information, see Missing values of Wickham & Grolemund (2023).
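A short base R sketch with made-up values shows how a wrong data type and a missing value typically surface, and how to check for both:

```r
# Numbers read in as text: a common import problem
raw <- c("12", "7", "n/a", "30")
class(raw)

# Convert to numeric; entries that cannot be parsed become NA
# (suppressWarnings() hides the coercion warning for this demo).
values <- suppressWarnings(as.numeric(raw))
class(values)

# Count missing values
n_missing <- sum(is.na(values))

# Many functions return NA unless told to drop missing values
mean_with_na    <- mean(values)
mean_without_na <- mean(values, na.rm = TRUE)

# Categorical data is often better stored as a factor
grp <- factor(c("a", "b", "a", "c"))
levels(grp)
```

Checking class() and counting NA values right after importing data catches many problems before they reach the analysis.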

6.7 Lack of knowledge about data identification

Problem: Students often handle data without understanding which variables uniquely identify the information contained in other variables. It is crucial to recognize these identifying variables and verify their uniqueness to ensure data integrity.

Solution: Perform checks for uniqueness at the beginning of your data analysis process. See exercise Names and duplicates in Section 9.16 and the get_dupes function introduced in Section 7.4.3.2.

Tip 6.3: Always check your data with get_dupes

For example, you expect that your dataframe df is a panel dataset. With

get_dupes(df, country, year)

you can check whether the two variables country and year indeed identify each row uniquely.
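If the janitor package is not at hand, a comparable check can be sketched in base R. The data below is made up for illustration, and this is an alternative to get_dupes, not a description of how it works internally.

```r
# Toy panel data (made up for illustration); the last row repeats an
# existing country-year pair.
df <- data.frame(
  country = c("DE", "DE", "FR", "FR", "FR"),
  year    = c(2020, 2021, 2020, 2021, 2021),
  gdp     = c(1.0, 1.1, 2.0, 2.1, 2.2)
)

id_vars <- c("country", "year")

# anyDuplicated() returns 0 if the combination identifies every row uniquely,
# otherwise the index of the first duplicated row.
first_dup <- anyDuplicated(df[id_vars])

# Show all rows that share their key with another row.
is_dup <- duplicated(df[id_vars]) | duplicated(df[id_vars], fromLast = TRUE)
dupes <- df[is_dup, ]
print(dupes)
```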

6.8 Losing track of data due to excessive overwriting

Problem: Students often manipulate their data by repeatedly overwriting the same object. This can lead to confusion about the data’s current state and the transformations applied.

Solution: Minimize the number of assignments to a single object. Instead, create a new object with a descriptive and concise name each time you alter the data. This practice helps maintain clarity about each stage of data manipulation.

For example, if you’re working with data df, you might store the cleaned data as df_cln, then after filtering for specific criteria, you could use df_cln_flt, and finally, if you aggregate the data, name it df_cln_flt_agg. A consistent naming convention shows at a glance what each dataset represents and which transformations it has undergone.
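The naming scheme can be sketched in base R as follows. The data and the cleaning, filtering, and aggregation steps are made up for illustration.

```r
# Toy data (made up for illustration)
df <- data.frame(
  group = c("a", "a", "b", "b", NA),
  value = c(1, NA, 3, 5, 2)
)

# Stage 1: cleaned data, rows with missing values removed
df_cln <- df[complete.cases(df), ]

# Stage 2: filtered for a specific criterion
df_cln_flt <- df_cln[df_cln$value > 1, ]

# Stage 3: aggregated by group
df_cln_flt_agg <- aggregate(value ~ group, data = df_cln_flt, FUN = mean)

# Every intermediate stage is still available for inspection.
print(c(nrow(df), nrow(df_cln), nrow(df_cln_flt), nrow(df_cln_flt_agg)))
```

If a result looks odd, you can inspect each stage separately instead of rerunning the script from the top.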

Tip 6.4: Do clear the environment at the beginning of a script with
rm(list = ls())

6.9 No documentation

Problem: Students do not comment code. This makes it hard to remember the purpose of various lines of code and difficult for other people to read and understand the code.

Solution: Regularly comment your code, explaining why something is done, not just what is done. Use clear, concise comments to improve readability and maintainability.
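The difference between a "what" comment and a "why" comment can be illustrated with a small, made-up example:

```r
# Made-up prices for illustration
prices <- c(19.99, 5.49, 120.00, 7.25)

# Less helpful: the comment only repeats what the code already says.
# take the log of prices
log_prices <- log(prices)

# More helpful: the comment explains why the step is there.
# One item is far more expensive than the rest; on the log scale it no
# longer dominates the summary statistics computed later.
log_prices <- log(prices)
```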

6.10 Ignoring error messages and warnings

Problem: Students see that their code doesn’t work but do not read the error message which often contains hints for solving the problem.

Solution: Read and follow error messages. Do not ignore warnings or errors unless you know what they mean. Study what the error message might mean. Use online resources such as Google and ChatGPT, see Figure 6.1. Finally, have the confidence to implement the suggested solution. Don’t be frustrated if the first attempt does not work: Try again and play around.

Figure 6.1: Googling the error message
Tip 6.5: Common error messages

Students often come to me with error messages suggesting that they install the tinytex package or the Rtools compiler for Windows. Since they are not familiar with these packages and tools, they wonder if it is safe to proceed with the installation. My answer is: yes, installing them is advisable.

Based on the report by Noam Ross, who examined roughly 10,000 R error messages, the most frequently encountered R error messages are shown in Table 6.2, together with some ideas of mine on what to do.

Table 6.2: Most frequent errors
Error type               Some suggestions on what to do
Could not find function  Check the spelling of the function and whether the respective packages are loaded properly.
Error in if              This suggests an issue with non-logical or missing values in a conditional statement. Check syntax and spelling. Maybe use ChatGPT to debug the code.
Error in eval            Points to references to non-existent objects.
Cannot open              Check whether the files exist at the location you are trying to read them from.
No applicable method     Check your data type and whether it fits the requirements of the functions you’re trying to use.
Package errors           This can stem from issues with installing, compiling, or loading a package. Try reinstalling the package and its dependencies, or updating your packages.
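To get a feeling for these messages, you can provoke some of them on purpose and read what R reports. The sketch below uses a hypothetical helper, error_message(), that returns the message text instead of stopping the script; the function and file names that trigger the errors are deliberately non-existent.

```r
# Helper (hypothetical, for this demo only): run an expression and return the
# error message instead of stopping.
error_message <- function(expr) {
  tryCatch(
    {
      expr            # evaluate the expression passed in
      NA_character_   # no error occurred: return NA
    },
    error = function(e) conditionMessage(e)
  )
}

# Calling a function that is misspelled or whose package is not loaded
msg_fun <- error_message(undefined_function_xyz(1))

# A missing value inside if()
msg_if <- error_message(if (NA) "yes" else "no")

# A file that does not exist; suppressWarnings() hides the extra warning
msg_file <- suppressWarnings(error_message(read.csv("no-such-file-xyz.csv")))

print(msg_fun)
print(msg_if)
print(msg_file)
```

The exact wording depends on your R version and language settings, but each message should be recognizable as one of the error types in Table 6.2.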

6.11 No attempt to identify the problem and troubleshoot

Problem: Students often do not fully understand the problems they encounter, which can lead to difficulties in seeking solutions. It’s common for students to feel overwhelmed and seek help without attempting to find the source of the problem and without attempting to work out possible solutions first.

Solution: When you encounter an issue in your R code, it’s crucial to dissect the problem methodically. Troubleshooting is a problem-solving skill that is essential for becoming proficient in programming. Here’s how to do it effectively:

  • Identify the problem: Attempt to identify the issue to better understand its nature. Once you know which line of code is causing some trouble, you are often close to a solution. Commenting out parts of your script or going back until there is no error can help here.
  • Active solution search: Once you’ve identified the problem, actively look for solutions. This can include consulting the R documentation, searching for similar issues online, or asking others for help.
  • Trial and error process: Don’t hesitate to experiment with different solutions to see what works best.
  • Seek Help: If you’re stuck, ask for help from more experienced R users or communities.
  • Minimal Reproducible Example (MRE): When asking for help, explain your problem precisely and provide an MRE. That is the simplest version of the code that still produces the error, including only essential data and code. This practice not only aids in self-troubleshooting but also makes it easier for others to help by providing a clear, concise context.
  • Additional information: Sometimes the interplay of the packages you have loaded, your operating system, the version of R, and/or the RStudio version plays a role in your problem. Thus, when seeking help, be sure to provide information about your machine, including the operating system, the version of R, and the packages you have loaded. You can use the sessionInfo() function to gather this information.
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Debian GNU/Linux 12 (bookworm)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.21.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.4.1    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.4.1       htmltools_0.5.8.1 rstudioapi_0.16.0 rmarkdown_2.26   
 [9] knitr_1.46        jsonlite_1.8.8    xfun_0.43         digest_0.6.35    
[13] rlang_1.1.3       evaluate_0.23    

For more information, see Workflow: getting help of Wickham & Grolemund (2023).

For example, the following script is an MRE:

library(ggplot2)
data <- data.frame(x = 1:4, y = c(2, 3, 5, 3.4))
ggplot(data, aes(x, y))

+geom_point()
Error:
! Cannot use `+` with a single argument.
ℹ Did you accidentally put `+` on a new line?

Everybody who copies these few lines of code can reproduce the shown error message and hence can work on a solution. Here, the + was misplaced: it must go at the end of the ggplot() line, not at the start of the next one:

library(ggplot2)
data <- data.frame(x = 1:4, y = c(2, 3, 5, 3.4))
ggplot(data, aes(x, y)) +
  geom_point()
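When the error depends on your data, an MRE also needs a small dataset that others can recreate. One common way to share it is base R's dput(), sketched here with made-up data; big_df stands in for a dataset you cannot share in full.

```r
# Stand-in for a large dataset (made-up values)
big_df <- data.frame(id = 1:1000, x = seq(0.5, 500, by = 0.5))

# Keep only the few rows needed to trigger the problem
small_df <- head(big_df, 3)

# dput() prints R code that recreates the object; paste that output into
# your question so others can rebuild the data with one assignment.
dput(small_df)

# Check: evaluating the deparsed text gives back the same data
rebuilt <- eval(parse(text = deparse(small_df)))
```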

6.12 Unstylish code

To avoid issues while programming in R, it’s essential to understand and adhere to various conventions, rules, and best practices specific to the language. Following these conventions makes your code more readable and simplifies your own experience with R. Below, you will find a non-exhaustive list of these guidelines.

  1. Do remember that the R programming language is case sensitive.
  2. Do start names of objects such as vectors, numbers, variables, and data frames with a letter, not a number.
  3. Do avoid using dots in names of objects.
  4. Do avoid using certain keywords in naming objects, such as if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, and NA.
  5. Do use forward slash / instead of backslash \ for navigating the file system (see Appendix A).
  6. Do not use whitespace in the names of files, directories, or objects.
  7. Do define objects to represent hard-coded values instead of using them directly in code.
  8. Do remember to (install and) load packages that contain functions you want to use.
  9. Do use <- instead of = for assignment.
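A few of these rules can be illustrated in a short base R sketch. All values and file names are made up; make.names() is used only to show which names R would have to alter to make them valid.

```r
# Rule 1: R is case sensitive, so these are two different objects
value <- 1
Value <- 2

# Rule 7: name hard-coded values instead of scattering them through the code
vat_rate    <- 0.19   # made-up value for illustration
net_price   <- 100
gross_price <- net_price * (1 + vat_rate)

# Rule 5: file.path() builds paths with forward slashes for you
fig_path <- file.path("fig", "plot-01.png")

# Rules 2, 4 and 6: a leading digit, a reserved word, and whitespace are not
# valid in object names; make.names() shows how R would have to change them
valid_names <- make.names(c("2nd_try", "if", "my data"))
print(valid_names)
```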
Tip 6.6

There are two packages, styler and lintr, that support you in writing code according to The tidyverse style guide of Wickham (2024).