5  Descriptive statistics

Required reading

Lane (2023, ch. 1)

Lane, D. M. (2023). Introduction to statistics: Online statistics education: A multimedia course of study. http://onlinestatbook.com (accessed January 30, 2023)

Learning objectives:

5.1 Univariate data

5.1.1 Arithmetic mean

The arithmetic mean (\(\bar{x}\)) is calculated as the sum of all the values in a dataset divided by the total number of values:

\[\bar{x} = \frac{{\sum_{i=1}^{n} x_i}}{n}\]

where \(\bar{x}\) represents the arithmetic mean, \(x_i\) represents each individual value in the dataset, and \(n\) represents the total number of values in the dataset.
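As a quick illustrative sketch in Python (using the age data that appears later in Exercise 5.2), the mean is simply the sum divided by the count:

```python
# Arithmetic mean: sum of all values divided by the number of values.
data = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]  # ages from Exercise 5.2
mean = sum(data) / len(data)
print(mean)  # 37.0
```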

5.1.2 Median

The median is the middle value of a dataset when it is sorted in ascending or descending order. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.
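The odd/even distinction can be made explicit in a small Python sketch (the function name is illustrative):

```python
def median(values):
    """Median: middle value of the sorted data; average of the two
    middle values when the number of observations is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([3, 1, 2]))     # 2   (odd n: middle value)
print(median([4, 1, 3, 2]))  # 2.5 (even n: average of 2 and 3)
```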

5.1.3 Mode

The mode is the value or values that appear most frequently in a dataset.
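Since a dataset can have several modes, a sketch that returns all most frequent values (the helper name is illustrative) looks like this:

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the maximal frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

print(modes([25, 21, 18, 37, 21, 34]))  # [21]
```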

5.1.4 Range

The range is the difference between the maximum and minimum values in a dataset.

\[ \text{Range} = \max(x_i) - \min(x_i) \]

where \(\text{Range}\) represents the range value, and \(x_i\) represents each individual value in the dataset.
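In Python this is a one-liner (again using the age data from Exercise 5.2 as an example):

```python
data = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]
value_range = max(data) - min(data)
print(value_range)  # 71  (89 - 18)
```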

5.1.5 Variance

The variance represents the average of the squared deviations of a random variable from its mean. It quantifies the extent to which a set of numbers deviates from their average value. Variance is commonly denoted as \(Var(X)\), \(\sigma^2\), or \(s^2\). For a full population with mean \(\mu\), the variance is \[ \sigma^2={\frac{1}{n}}\sum _{i=1}^{n}(x_{i}-\mu )^{2}. \] When working with a sample, however, it is better to use the sample variance \[ s^2={\frac{1}{n-1}}\sum _{i=1}^{n}(x_{i}-\bar{x} )^{2}. \]

The use of \(n - 1\) instead of \(n\) in the formula for the sample variance is known as Bessel’s correction. It corrects the bias in the estimation of the population variance and some, but not all, of the bias in the estimation of the population standard deviation. Consequently, the standard deviation calculated this way is called the sample standard deviation or the unbiased estimation of standard deviation. The correction is needed because, when working with a sample instead of the full population, the observations tend to be closer to the sample mean than to the population mean, so dividing by \(n\) underestimates the variance; see Figure 5.1. Bessel’s correction takes that into account.
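The two denominators correspond to two different functions in Python’s standard library, which can serve as a quick sketch of the difference:

```python
import statistics

data = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]

# Population variance: divide the sum of squared deviations by n.
print(statistics.pvariance(data))  # 436.8
# Sample variance with Bessel's correction: divide by n - 1.
print(statistics.variance(data))   # ≈ 485.33
```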

For a detailed explanation, you can watch the video by StatQuest with Josh Starmer: Why Dividing By N Underestimates the Variance:

Figure 5.1: Bias when using the sample mean1

1 Picture is taken from the video https://youtu.be/sHRBg6BhKjI

5.1.6 Standard deviation

As the variance is hard to interpret, the standard deviation is a more commonly used measure of dispersion. A low standard deviation indicates that the values tend to be close to the mean. It is often abbreviated as \(sd\) or \(SD\), or denoted by the Greek letter sigma, \(\sigma\). The underlying idea is to measure the typical deviation from the mean. For a sample, it is calculated as the square root of the sample variance: \[ s=\sqrt{s^2}={\sqrt {{\frac {1}{n-1}}\sum _{i=1}^{n}\left(x_{i}-{\bar{x}}\right)^{2}}} \]
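A sketch in Python: taking the square root of the sample variance brings the measure back to the original unit of the data.

```python
import statistics

data = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]
# Sample standard deviation = sqrt of the sample variance (n - 1 denominator).
print(statistics.stdev(data))  # ≈ 22.03
```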

5.1.7 Standard error

The standard deviation (SD) measures the amount of variability, or dispersion, of a set of data around its mean, while the standard error of the mean (SEM) measures how far the sample mean of the data is likely to be from the true population mean. The SEM is always smaller than the SD (for \(n > 1\)). It matters because it helps you estimate how well your sample data represent the whole population.

The standard error of the mean (SEM) can be expressed as: \[ sd(\bar{x})=\sigma_{\bar {x}} = {\frac {\sigma }{\sqrt {n}}} \] where \(\sigma\) is the standard deviation of the population and \(n\) is the size (number of observations) of the sample. In practice, \(\sigma\) is usually unknown and is replaced by the sample standard deviation \(s\).
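As a sketch, the SEM can be estimated from a sample by dividing the sample standard deviation by \(\sqrt{n}\):

```python
import math
import statistics

data = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]
sd = statistics.stdev(data)        # sample standard deviation
sem = sd / math.sqrt(len(data))    # standard error of the mean
print(sem)  # ≈ 6.97, much smaller than sd ≈ 22.03
```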

Also see the video by StatQuest with Josh Starmer: Standard Deviation vs Standard Error, Clearly Explained!!!:

Tip 5.1

Why divide by the square root of \(n\)?

Let \(x_{i}\) be independent draws from a distribution with mean \(\mu\) and variance \(\sigma^{2}\). What is the variance of the sample mean \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i\)?

By definition: \[ \operatorname{Var}(x_i)=E\left[\left(x_{i}-E\left[x_{i}\right]\right)^{2}\right]=\sigma^{2} \] so \[\begin{align*} \operatorname{Var}(\bar{x})&=E\left[\left(\frac{\sum x_{i}}{n}-E \left[\frac{\sum x_{i}}{n}\right]\right)^{2}\right]\\ &=E\left[\left(\frac{\sum x_{i}}{n}-\frac{1}{n} E\left[ \sum x_{i}\right]\right)^{2}\right]\\ &=\frac{1}{n^{2}} E\left[\left(\sum x_{i}-E\left[\sum x_{i}\right]\right)^{2}\right]\\ &=\frac{1}{n^{2}} E\left[\left(\sum x_{i}- \sum \mu\right)^{2}\right]\\ &=\frac{1}{n^{2}} E\left[\left(\sum (x_{i}-\mu)\right)^{2}\right]\\ &=\frac{1}{n^{2}} E\left[\sum\left(x_{i}-\mu\right)^{2}\right]\quad\text{(the cross terms vanish in expectation because the draws are independent)}\\ &=\frac{1}{n^{2}} \sum E\left[\left(x_{i}-\mu\right)^{2}\right]\\ &=\frac{1}{n^{2}} \underbrace{\sum \sigma^{2}}_{n\cdot \sigma^{2}}\\ &=\frac{1}{n} \sigma^{2} \end{align*}\] and hence \[ sd(\bar x)=\sqrt{\operatorname{Var}(\bar{x})}={\frac {\sigma }{\sqrt {n}}} \]
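The result \(\operatorname{Var}(\bar{x})=\sigma^2/n\) can also be checked empirically. The following Python sketch (all names and the chosen distribution are illustrative) draws many samples of size \(n\) and compares the variance of their means with \(\sigma^2/n\):

```python
import random
import statistics

# Draw many samples of size n from Normal(0, sd=2), so sigma^2 = 4,
# and check that the variance of the sample means is close to sigma^2 / n.
random.seed(1)
n, reps = 25, 20_000
sigma2 = 4.0
means = [statistics.fmean(random.gauss(0, 2) for _ in range(n))
         for _ in range(reps)]
print(statistics.pvariance(means))  # close to sigma2 / n = 0.16
```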

5.1.8 Coefficient of variation

The coefficient of variation (\(CoV\)) is a relative measure of variability, calculated as the ratio of the standard deviation to the mean (often multiplied by 100 and reported as a percentage):

\[ CoV = \frac{\sigma}{\bar{x}} \]

where \(CoV\) represents the coefficient of variation, \(\sigma\) represents the standard deviation, and \(\bar{x}\) represents the arithmetic mean.
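Because the \(CoV\) is unit-free, it allows comparing the spread of variables measured on different scales. A minimal Python sketch:

```python
import statistics

data = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]
cov_ratio = statistics.stdev(data) / statistics.fmean(data)
print(cov_ratio)  # ≈ 0.5954, i.e. about 59.5 %
```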

5.1.9 Skewness

Skewness is a measure of the asymmetry of a distribution. There are different formulas to calculate skewness, but one common method is using the third standardized moment (\(\gamma_1\)):

\[ \gamma_1 = \frac{{\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma}\right)^3}}{n} \] where \(\gamma_1\) represents the skewness, \(x_i\) represents each individual value in the dataset, \(\bar{x}\) represents the arithmetic mean, \(\sigma\) represents the standard deviation, and \(n\) represents the total number of values in the dataset.
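A small sketch of this formula in Python (using the population standard deviation to stay consistent with the \(1/n\) moments; the function name and example data are illustrative):

```python
import statistics

def skewness(values):
    """Third standardized moment (dividing by n)."""
    n = len(values)
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population sd to match the 1/n moments
    return sum(((x - m) / sd) ** 3 for x in values) / n

# A long right tail gives positive skewness:
print(skewness([1, 2, 3, 4, 100]))
```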

5.1.10 Kurtosis

Kurtosis measures the tailedness of a probability distribution, often informally described as its peakedness or flatness. There are different formulations for kurtosis, and one of the common ones is the fourth standardized moment. The formula for kurtosis is given by:

\[ \text{Kurtosis} = \frac{{\frac{1}{n} \sum_{i=1}^{n}(x_i - \bar{x})^4}}{{\left(\frac{1}{n} \sum_{i=1}^{n}(x_i - \bar{x})^2\right)^2}} \] where \(\text{Kurtosis}\) represents the kurtosis value, \(x_i\) represents each individual value in the dataset, \(\bar{x}\) represents the mean of the dataset, and \(n\) represents the total number of values in the dataset.
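The moment-ratio form translates directly into Python (a sketch; the function name is illustrative). A normal distribution has kurtosis 3, so 3 is often subtracted to obtain "excess" kurtosis:

```python
import statistics

def kurtosis(values):
    """Fourth standardized moment: m4 / m2^2 (equals 3 for a normal
    distribution; subtract 3 for 'excess' kurtosis)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2

print(kurtosis([1, 2, 3, 4, 5]))  # 1.7: flatter than a normal distribution
```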

5.2 Bivariate data

5.2.1 Covariance

Covariance \(Cov(X,Y)\) (or \(\sigma_{XY}\)) is a measure of the joint variability of two variables \(X\) and \(Y\) across their paired observations \(i\). The covariance is positive when larger values of one variable tend to correspond with larger values of the other variable, or when smaller values of one variable tend to correspond with smaller values of the other variable. On the other hand, a negative covariance suggests an inverse relationship, where larger values of one variable tend to correspond with smaller values of the other variable.

It’s important to note that the magnitude of the covariance is influenced by the units of measurement, making it challenging to interpret directly. Additionally, the spread of the variables also affects the covariance. The formula for calculating covariance is as follows: \[ \operatorname{Cov}(X,Y)=\sigma_{XY}={\frac {1}{n-1}}\sum _{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y}) \] where \(\operatorname{Cov}(X,Y)\) represents the covariance, \(\sigma_{XY}\) is an alternative notation, \(x_i\) and \(y_i\) are the individual observations of variables \(X\) and \(Y\), \(\bar{x}\) and \(\bar{y}\) are the means of variables \(X\) and \(Y\), and \(n\) is the total number of observations.

Tip 5.2

To gain a better understanding of the concept and calculation of covariance, I highly recommend watching Josh Starmer’s informative and visually engaging video titled Covariance and Correlation Part 1: Covariance:

5.2.2 The correlation coefficient (Bravais-Pearson)

The Pearson correlation coefficient measures the linear relationship between two variables. It is calculated as the covariance of the variables divided by the product of their standard deviations. \[ \rho_{X,Y} = \frac{{\text{Cov}(X, Y)}}{{\sigma_X \sigma_Y}} \] where \(\rho\) represents the Pearson correlation coefficient, \(\text{Cov}(X, Y)\) denotes the covariance between variables \(X\) and \(Y\), \(\sigma_X\) denotes the standard deviation of variable \(X\), and \(\sigma_Y\) denotes the standard deviation of variable \(Y\). It has a value between +1 and -1.

By dividing the covariance of \(X\) and \(Y\) by the product of the standard deviations of \(X\) and \(Y\), the correlation coefficient is normalized to lie between -1 and 1. This removes the problem of the covariance that its magnitude depends on the scale (unit of measurement) of the variables.

Tip 5.3

I highly recommend watching the video Pearson’s Correlation, Clearly Explained!!! StatQuest with Josh Starmer. It provides a clear and engaging explanation of the meaning of correlation. The video features informative animations that help visualize the concept:

In interpreting correlations, it is important to remember that they…

  1. … only reflect the strength and direction of linear relationships,
  2. … do not provide information about the slope of the relationship, and
  3. … fail to explain important aspects of nonlinear relationships.
Figure 5.2: Correlations are blind in one eye

Figure 5.2 shows that correlation coefficients are limited in explaining the relationship between two variables. For example, when a relationship is perfectly flat (the slope is zero), the correlation coefficient is undefined because the variance of \(Y\) is zero. Furthermore, Pearson’s correlation coefficient is sensitive to outliers, and all correlation coefficients are prone to sample-selection biases. It is crucial to be careful when attempting to correlate two variables, particularly when one represents a part and the other represents the total. It is also worth noting that small correlation values do not necessarily indicate a lack of association between variables. For example, Pearson’s correlation coefficient can underestimate the association between variables exhibiting a quadratic relationship. Therefore, it is always advisable to examine scatterplots in conjunction with correlation analysis.

In Figure 5.3 you see various datasets that all have the same correlation coefficient and share other statistical properties, as investigated in Exercise 5.1.

Figure 5.3: These diagrams all have the same statistical properties2

2 This graph was produced employing the datasauRus R package.

5.2.3 Rank correlation coefficient (Spearman)

Spearman’s rank correlation coefficient is a measure of the strength and direction of the monotonic relationship between two variables. It can be calculated for a sample of size \(n\) by converting the \(n\) raw scores \(X_i, Y_i\) to ranks \(\text{R}(X_i), \text{R}(Y_i)\), then using the following formula:

\[ r_s = \rho_{\operatorname{R}(X),\operatorname{R}(Y)} = \frac{\text{cov}(\operatorname{R}(X), \operatorname{R}(Y))}{\sigma_{\operatorname{R}(X)} \sigma_{\operatorname{R}(Y)}}, \] where \(\rho\) denotes the usual Pearson correlation coefficient, but applied to the rank variables, \(\operatorname{cov}(\operatorname{R}(X), \operatorname{R}(Y))\) is the covariance of the rank variables, \(\sigma_{\operatorname{R}(X)}\) and \(\sigma_{\operatorname{R}(Y)}\) are the standard deviations of the rank variables.

If all \(n\) ranks are distinct integers, you can use the handy formula: \[ \rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} \] where \(\rho\) denotes the correlation coefficient, \(d_i = \operatorname{R}(X_i) - \operatorname{R}(Y_i)\) is the difference between the two ranks of each observation, and \(n\) represents the number of pairs of observations.

The coefficient ranges from -1 to 1. A value of 1 indicates a perfect increasing monotonic relationship, while a value of -1 indicates a perfect decreasing monotonic relationship. A value of 0 suggests no monotonic relationship between the variables.

Spearman’s rank correlation coefficient is a non-parametric measure and is often used when the relationship between variables is not linear or when the data is in the form of ranks or ordinal categories.

5.3 Exercises

Exercise 5.1 DatasauRus (Solution online.)

The following exercise shows how to create Figure 5.3 using the programming language R.

Figure 5.4: The logo of the DatasauRus package
  1. Load the packages datasauRus and tidyverse. If necessary, install these packages.

  2. The package datasauRus comes with a dataset in two different formats: datasaurus_dozen and datasaurus_dozen_wide. Store them as ds and ds_wide.

  3. Open and read the R vignette of the datasauRus package. Also open the R documentation of the dataset datasaurus_dozen.

  4. Explore the dataset: What are the dimensions of this dataset? Look at the descriptive statistics.

  5. How many unique values does the variable dataset of the tibble ds have? Hint: The function unique() returns the unique values of a variable, and the function length() returns the length of a vector, such as the vector of unique elements.

  6. Compute the mean values of the x and y variables for each entry in dataset. Hint: Use the group_by() function to group the data by the appropriate column and then the summarise() function to calculate the mean.

  7. Compute the standard deviation, the correlation, and the median in the same way. Round the numbers.

  8. What can you conclude?

  9. Plot all datasets of ds. Hide the legend. Hint: Use the facet_wrap() and the theme() function.

  10. Create a loop that generates separate scatter plots for each unique dataset of the tibble ds. Export each graph as a png file.

  11. Watch the video Animating the Datasaurus Dozen Dataset in R from The Data Digest on YouTube:

Exercise 5.2 Summary statistics (Solution 5.1)

Calculate for the following datasets: the mode, the median, the 20% quantile, the range, the interquartile range, the variance, the arithmetic mean, the sample standard deviation, the coefficient of variation.

  1. For ten participants in a scientific conference the age has been noted: [25, 21, 18, 37, 56, 89, 46, 23, 21, 34].
  2. A random sample of 128 visitors of the Cupcake festival yielded the following frequencies regarding the cupcake consumption during their visit:
Table 5.1: Random sample of 128 visitors

Cupcakes consumed    1    2    3    4    5    6
Abs. freq.           2   30   37   28   23    8

Solution 5.1.

Metric            Dataset 1   Dataset 2
Mode              21          3
Median            29.5        3
P20               21          2
Range             71          5
IQR               25          1.5
Arithmetic mean   37          3.5
\(s^2\)           485.33      1.5591
\(s\)             22.030      1.248621
CoV               0.5954      0.3567
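The first column of the solution can be verified with a short Python sketch (quantiles and IQR are omitted here because quantile conventions differ between implementations):

```python
import statistics

ages = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]
print(statistics.mode(ages))                 # 21
print(statistics.median(ages))               # 29.5
print(max(ages) - min(ages))                 # 71
print(statistics.fmean(ages))                # 37.0
print(round(statistics.variance(ages), 2))   # 485.33
print(round(statistics.stdev(ages), 3))      # 22.03
```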

Exercise 5.3 Summary statistics in spreadsheet software (Solution 5.2)

Given is the following dataset: [0, 0, 40, 50, 50, 60, 70, 90, 100, 100]. Compute the following summary statistics of the data set using spreadsheet software: mean, median, mode, quartiles (Q1, Q2, Q3), range, interquartile range, variance, standard deviation, mean absolute deviation, coefficient of variation, and skewness.

Solution 5.2.

Mean                  56
Standard error        11.4698
Mode                  0
Median                55
First quartile        42.5
Third quartile        85
Variance              1315.5556
Standard deviation    36.2706
Kurtosis              -0.7315
Skewness              -0.4171
Range                 100
Minimum               0
Maximum               100
Sum                   560
Count                 10

Exercise 5.4 Guess the summary statistics (Solution 5.3)

Given are the following variables:

Table 5.2: Some variables with observations
a b c d e
97 70 1 1 970
98 80 50 2 980
99 90 50 3 990
100 100 50 4 1000
101 110 50 5 1010
102 120 50 6 1020
103 130 99 7 1030

Rank the variables without calculating concrete numbers accordingly to the values of the following descriptive statistics: mode, median, mean, range, variance, standard deviation, coefficient of variation.

Solution 5.3.

varlabel a b c d e
Variance 2 (4.6666) 4 (466.66) 5 (800.33) 1 (4.6666) 3 (466.66)
COV 1 (.02160) 3 (.21602) 5 (.56580) 4 (.54006) 2 (.02160)
Mean 3 (100) 4 (100) 2 (50) 1 (4) 5 (1000)
Median 3 (100) 4 (100) 2 (50) 1 (4) 5 (1000)
Range 1 (6) 4 (60) 5 (98) 2 (6) 3 (60)
SD 1 (2.1602) 4 (21.602) 5 (28.290) 2 (2.1602) 3 (21.602)