2  Data Pitfalls

Recommended reading

Please read “Quantitative Methods” by Huber (2024c). In particular, the chapter “Identification” can help you understand how causes of effects can be identified empirically.

The following sections highlight common pitfalls when dealing with data. Being aware of these pitfalls helps build best practices to maintain the integrity of analyses and visualizations. The structure and content align with the excellent book by Jones (2020). I recommend reading this book.

2.1 Epistemic Errors

Learning objectives
  • Understand the concept of epistemic errors and their impact on data analysis.
  • Identify and describe common types of epistemic errors.
  • Apply strategies to avoid epistemic errors in your data work.

Epistemic errors occur when there are mistakes in our understanding and conceptualization of data. These errors arise from cognitive biases, misunderstandings, and incorrect assumptions about the nature of data and the reality it represents. Recognizing and addressing these errors is crucial for accurate data analysis and effective decision-making.

One significant type of epistemic error is the data-reality gap, which refers to the difference between the data we collect and the reality it is supposed to represent. For instance, a survey on customer satisfaction that only includes responses from a self-selected group of highly engaged customers may not accurately reflect the overall customer base. To avoid this pitfall, it is essential to ensure that your data collection methods are representative and unbiased, and to validate your data against external benchmarks or additional data sources.

The Data-Reality Gap

The difference between the data we collect and the reality it is supposed to represent.

Another common epistemic error involves the influence of human biases during data collection and interpretation. Known as the all too human data error, this occurs when personal biases or inaccuracies affect the data. An example would be a researcher’s personal bias influencing the design of a study or the interpretation of its results. To mitigate this, implement rigorous protocols for data collection and analysis, and consider using double-blind studies and peer reviews to minimize bias.

All Too Human Data

Errors introduced by human biases or inaccuracies during data collection and interpretation.

Inconsistent ratings can also lead to epistemic errors. This happens when there is variability in data collection methods, resulting in inconsistent or unreliable data. For example, different evaluators might rate the same product using different criteria or standards. To avoid this issue, standardize data collection processes and provide training for all individuals involved in data collection to ensure consistency.

Inconsistent Ratings

Variability in data collection methods that leads to inconsistent or unreliable data.

The black swan pitfall refers to the failure to account for rare, high-impact events that fall outside regular expectations. Financial models that did not predict the 2008 financial crisis due to the unexpected nature of the events that led to it are an example of this error. To prevent such pitfalls, consider a wide range of possible outcomes in your models and incorporate stress testing to understand the impact of rare events.

The Black Swan Pitfall

The failure to account for rare, high-impact events that fall outside the realm of regular expectations.

Falsifiability and the God pitfall involve the tendency to accept hypotheses that cannot be tested or disproven. This error might occur when assuming that a correlation between two variables implies causation without the ability to test alternative explanations. To avoid this, ensure that your hypotheses are testable and that you actively seek out potential falsifications. Use control groups and randomized experiments to validate causal relationships.

Falsifiability and the God Pitfall

The tendency to accept hypotheses that cannot be tested or disproven.

To avoid epistemic errors, critically assess your assumptions, methodologies, and interpretations. Engage in critical thinking by regularly questioning your assumptions and seeking alternative explanations for your findings. Employ methodological rigor by using standardized and validated methods for data collection and analysis. Engage with peers to review and critique your work, providing a fresh perspective and identifying potential biases. Finally, stay updated with the latest research and best practices in your field to avoid outdated or incorrect methodologies.

Understanding and addressing epistemic errors can significantly improve the reliability and accuracy of your data analyses, leading to better decision-making and more trustworthy insights.

Exercise 2.1  

Figure 2.1: Bananas in various stages of ripeness

Source: Jones (2020, p. 33)

  1. Rate the ripeness level of the bananas pictured by Figure 2.1. Compare your assessment to that of a colleague and discuss any differences in your ratings. What might account for the variance in perception of the bananas’ ripeness between you and your colleague?

  2. How did you rate the second and the last bananas on the ripeness scale?

  3. Upon reevaluation, it appears that the second and the last bananas are identical in ripeness. How would you justify your initial decision now? This scenario underscores an important lesson for interpreting polls and surveys: it illustrates how subjective assessments can lead to variance in results. It highlights the necessity of ensuring clarity and consistency in the criteria used for evaluations to minimize subjective discrepancies.

2.2 Technical Trespasses

The second pitfall refers to the mishandling of data through a lack of technical expertise or the misapplication of statistical methods. It can lead to significant distortions in data analysis and result in flawed decisions. This might include using inappropriate statistical tests, excessive model complexity, or improper data manipulation. Understanding essential statistics and maintaining data integrity are vital to prevent these errors. Business leaders need to ensure that their teams have the requisite technical knowledge and to foster a culture of cooperation with data science experts to safeguard against these common technical blunders. This will ultimately support more reliable and actionable insights for strategic decision-making.

Here are some examples of things that frequently cause wrong conclusions when analyzing data:

Misinterpreting correlation as causation

A classic mistake is interpreting a correlation between two variables as evidence that one causes the other. Without proper experimental design or further statistical analysis, such claims can lead to ill-informed business decisions.

Inappropriate use of average

Averages can be misleading when data is skewed or has outliers. For example, using the mean salary in a company where the CEO’s salary is drastically higher than the rest of the employees can give a false impression of the typical compensation.
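As a quick illustration with invented salary figures, Python’s `statistics` module shows how a single outlier drags the mean far away from the typical value:

```python
import statistics

# Hypothetical salaries: nine employees plus one CEO (all figures invented)
salaries = [48_000, 50_000, 52_000, 51_000, 49_000,
            53_000, 47_000, 50_000, 52_000, 1_000_000]

mean_salary = statistics.mean(salaries)      # pulled up by the CEO outlier
median_salary = statistics.median(salaries)  # robust against the outlier

print(f"mean:   {mean_salary:,.0f}")    # 145,200
print(f"median: {median_salary:,.0f}")  # 50,500
```

Reporting the median (or both statistics together) gives a fairer picture of typical compensation whenever the distribution is skewed.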

Neglecting the test assumptions

Applying statistical tests without checking the necessary assumptions can invalidate the results. For instance, using a t-test on data that is not normally distributed or has unequal variances between groups might lead to incorrect conclusions.
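The following sketch, using invented data, computes both the pooled (Student’s) t statistic and Welch’s t statistic by hand to show that the equal-variance assumption materially changes the result when group sizes differ:

```python
import math
import statistics

def t_statistics(a, b):
    """Return (pooled Student's t, Welch's t) for two independent samples."""
    na, nb = len(a), len(b)
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances

    # Student's t assumes equal variances and pools them.
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t_student = (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

    # Welch's t drops the equal-variance assumption.
    t_welch = (ma - mb) / math.sqrt(va / na + vb / nb)
    return t_student, t_welch

# Invented data: one tight group, one highly variable group of different size
low_var = [10.1, 9.9, 10.0, 10.2, 9.8]
high_var = [5.0, 15.0, 2.0, 18.0, 8.0, 12.0, 3.0, 16.0]

t_s, t_w = t_statistics(low_var, high_var)
print(f"Student's t: {t_s:.4f}")  # assumes equal variances -- violated here
print(f"Welch's t:   {t_w:.4f}")  # noticeably different
```

In practice one would check the assumptions first and, when variances clearly differ, prefer Welch’s test over the classic pooled t-test.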

Overfitting predictive models

Creating a model that performs extremely well on historical data but fails to predict future outcomes undermines its utility. Overfit models are often too complex, capturing noise rather than underlying trends in the training data.
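As an illustration with purely synthetic data, a 1-nearest-neighbor “model” that memorizes the training set achieves perfect training accuracy but loses to a much simpler rule on new data:

```python
import random

random.seed(0)

def noisy_label(x):
    """True rule: y = 1 if x > 0.5; 20% of labels are random noise."""
    y = 1 if x > 0.5 else 0
    return y if random.random() < 0.8 else 1 - y

train_x = [random.random() for _ in range(200)]
train_y = [noisy_label(x) for x in train_x]
test_x = [random.random() for _ in range(200)]
test_y = [noisy_label(x) for x in test_x]

def predict_1nn(x):
    """Overfit 'model': memorize the training set, copy the nearest label."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def simple_rule(x):
    """A much simpler model: a single threshold."""
    return 1 if x > 0.5 else 0

def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)

print(f"1-NN, training data: {accuracy(predict_1nn, train_x, train_y):.2f}")  # 1.00
print(f"1-NN, new data:      {accuracy(predict_1nn, test_x, test_y):.2f}")
print(f"threshold, new data: {accuracy(simple_rule, test_x, test_y):.2f}")
```

The memorizing model has captured the label noise, not the underlying rule, which is exactly what overfitting means.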

Ignoring the effect of sample size

Small samples can produce volatile or non-representative results; business decisions made on such shaky ground may not stand up when scaled to the full population or market.
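A small simulation with synthetic satisfaction scores makes the volatility of small samples concrete:

```python
import random
import statistics

random.seed(42)

# Hypothetical population: 10,000 customer satisfaction scores around 7
population = [random.gauss(7.0, 1.5) for _ in range(10_000)]

def survey_mean(n):
    """Mean of one random survey with n respondents."""
    return statistics.mean(random.sample(population, n))

# Repeat the survey 1,000 times at two different sample sizes
small_surveys = [survey_mean(10) for _ in range(1_000)]
large_surveys = [survey_mean(500) for _ in range(1_000)]

# Small surveys scatter far more widely around the true mean
print(f"spread with n=10:  {statistics.stdev(small_surveys):.3f}")
print(f"spread with n=500: {statistics.stdev(large_surveys):.3f}")
```

Any single small survey can land far from the truth, which is why decisions based on it may not survive contact with the full population.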

Cherry-picking data

Selecting data that supports a preconceived hypothesis, while ignoring data that doesn’t, compromises the objectivity of the analysis and can mislead stakeholders about the reality of the situation.

Improper data visualization

Misrepresenting data through poor chart choices or scales can mislead viewers. For example, truncating the y-axis can exaggerate trends, and using pie charts for too many categories can confuse rather than clarify.
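A short numeric sketch shows how much a truncated y-axis can exaggerate a small difference (the revenue figures are invented):

```python
# Two nearly identical quarterly revenues (hypothetical figures, in millions)
q1, q2 = 98.0, 100.0

# With an honest axis starting at 0, bar heights mirror the values.
honest_ratio = q2 / q1                 # ~1.02: the bars look almost equal

# With the axis truncated at 97, bars show only the part above the cutoff.
cutoff = 97.0
truncated_ratio = (q2 - cutoff) / (q1 - cutoff)  # 3.0: Q2 looks 3x taller

print(f"actual change: {honest_ratio - 1:.1%}")
print(f"apparent bar-height ratio on the truncated chart: {truncated_ratio:.0f}x")
```

A 2% change is rendered as a threefold visual difference, which is why truncated axes should at minimum be clearly marked.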

Exercise 2.2 Survivorship bias

Read “How Successful Leaders Think” by Roger Martin (2007) and the chapter “Identification” of “Quantitative Methods” by Huber (2024c).

Here is a summary of Martin (2007) taken from the Harvard Business Review Store:

In search of lessons to apply in our own careers, we often try to emulate what effective leaders do. Roger Martin says this focus is misplaced, because moves that work in one context may make little sense in another. A more productive, though more difficult, approach is to look at how such leaders think. After extensive interviews with more than 50 of them, the author discovered that most are integrative thinkers–that is, they can hold in their heads two opposing ideas at once and then come up with a new idea that contains elements of each but is superior to both. Martin argues that this process of consideration and synthesis (rather than superior strategy or faultless execution) is the hallmark of exceptional businesses and the people who run them. To support his point, he examines how integrative thinkers approach the four stages of decision making to craft superior solutions. First, when determining which features of a problem are salient, they go beyond those that are obviously relevant. Second, they consider multidirectional and nonlinear relationships, not just linear ones. Third, they see the whole problem and how the parts fit together. Fourth, they creatively resolve the tensions between opposing ideas and generate new alternatives. According to the author, integrative thinking is an ability everyone can hone. He points to several examples of business leaders who have done so, such as Bob Young, co-founder and former CEO of Red Hat, the dominant distributor of Linux open-source software. Young recognized from the beginning that he didn’t have to choose between the two prevailing software business models. Inspired by both, he forged an innovative third way, creating a service offering for corporate customers that placed Red Hat on a path to tremendous success.

  1. Discuss the concepts introduced by Martin (2007) critically:
  • Does he provide evidence for his ideas to work?
  • Is there a proof that his suggestions can yield success?
  • Is there some evidence about whether his ideas are superior to alternative courses of action?
  • What can we learn from the article?
  • Does his argumentation fulfill highest academic standards?
  • What is his identification strategy with respect to the causes of effects and the effects of causes?
  • Martin (2007, p. 81) speculates:

“At some point, integrative thinking will no longer be just a tacit skill (cultivated knowingly or not) in the heads of a select few.”

  2. If teachers in business schools had followed his idea that integrative thinkers are more successful, then, almost 20 years later, this should be the dominant way for business leaders to think. Is that the case? And if so, can you still gain a competitive advantage by thinking that way?

Figure 2.2: Distribution of bullet holes in returned aircraft

Source: Martin Grandjean (vector), McGeddon (picture), Cameron Moll (concept), CC BY-SA 4.0, Link

  1. Figure 2.2 visualizes the distribution of bullet holes in aircraft that returned from combat in World War II. Imagine you are an aircraft engineer. What does this picture teach you?

  2. Inform yourself about the concept of survivorship bias explained in Wikipedia (2024).

  3. In Martin (2007), the author provides an example of a successful company to support his management ideas. Discuss whether this article relates to survivorship bias.

2.3 Mathematical Miscues

This pitfall deals with the errors that can occur when mathematical reasoning is flawed or when calculations are incorrectly applied within data analysis. This encompasses a range of issues, from simple arithmetic mistakes to more complex misunderstandings of mathematical principles that underlie data models. It might involve misjudging how certain metrics are calculated, which could significantly alter an analysis’s outcomes.

Additionally, this pitfall includes a lack of understanding of the nuanced relationships within mathematical functions, which can result in poor model performance or misestimation of forecasts and trends. Business leaders should be keenly aware that even advanced data analytics rely on the precise and proper application of mathematics. Therefore, ensuring precise calculations and fostering a thorough understanding of the mathematical techniques used in data analysis is vital to avoid undermining data-driven business decisions with slip-ups in mathematical reasoning.

Here is an incomplete list of things that are often misused:

Misuse of averages

Arithmetic Mean vs. Median: Incorrectly using the arithmetic mean without considering outliers or skewed distributions, which can misrepresent the central tendency of data.

Weighted Averages: Failing to appropriately weight averages based on sample sizes or importance of data points, leading to biased results.
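A minimal example with invented review counts shows how naively averaging two means misleads when their sample sizes differ:

```python
# Hypothetical ratings: a small boutique store and a large chain store
store_a = {"mean_rating": 4.8, "n_reviews": 20}
store_b = {"mean_rating": 3.9, "n_reviews": 2_000}

# Naive: average the two means as if they deserved equal weight
naive = (store_a["mean_rating"] + store_b["mean_rating"]) / 2

# Correct: weight each mean by the number of reviews behind it
n_total = store_a["n_reviews"] + store_b["n_reviews"]
weighted = (store_a["mean_rating"] * store_a["n_reviews"]
            + store_b["mean_rating"] * store_b["n_reviews"]) / n_total

print(f"naive average of averages: {naive:.2f}")     # 4.35
print(f"properly weighted average: {weighted:.2f}")  # 3.91
```

The 20 boutique reviews carry as much weight in the naive average as the 2,000 chain reviews, inflating the overall rating by almost half a point.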

Improper scaling and normalization

Scaling data improperly without considering the original distribution or range, which can distort comparisons and statistical analyses. Improper normalization techniques that fail to account for the underlying data structure or variance, leading to skewed interpretations.
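The following sketch with invented values shows how a single outlier distorts min-max scaling, compressing all ordinary observations into a tiny range:

```python
import statistics

# Invented measurements with one extreme outlier
values = [10, 12, 11, 13, 12, 11, 500]

# Min-max scaling maps everything into [0, 1] -- but the outlier
# squeezes all ordinary values into a sliver near 0.
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]
print([round(v, 3) for v in minmax])  # six values below 0.01, then 1.0

# Z-scores are dragged as well, since the mean and standard deviation
# are themselves sensitive to the outlier.
mu, sigma = statistics.mean(values), statistics.pstdev(values)
zscores = [(v - mu) / sigma for v in values]
```

Inspecting the distribution first, and considering robust alternatives (for example, scaling based on the median and interquartile range), avoids this distortion.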

Misapplication of probability and statistics

Misunderstanding statistical significance and confidence intervals, leading to incorrect conclusions about the reliability of findings. Inappropriate use of probability distributions or assumptions, which can invalidate statistical tests or predictions.

Errors in calculating percentages and ratios

Incorrectly calculating percentages or ratios without ensuring consistent bases or understanding the context, leading to misrepresentation of trends or relationships.
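A tiny arithmetic sketch makes the point: percentage changes on shifting bases do not cancel out.

```python
price = 100.0

after_drop = price * (1 - 0.50)       # a 50% drop: 100 -> 50
after_rise = after_drop * (1 + 0.50)  # a 50% rise -- of the NEW base: 50 -> 75

print(after_rise)  # 75.0, not 100: the two percentages refer to different bases
```

Quoting a “50% drop followed by a 50% recovery” therefore still leaves the price 25% below where it started.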

2.4 Statistical Slipups

This pitfall is about common errors when applying statistical methods. These errors include misinterpreting statistical significance, sampling biases, ignoring multicollinearity in regression analysis, misapplying hypothesis tests, and failing to adjust for multiple comparisons.

For example, misinterpreting a small p-value as evidence of causation rather than correlation can lead to misleading conclusions. Similarly, using a convenience sample that does not accurately represent the population can skew results and compromise the generalizability of findings. In regression analysis, overlooking multicollinearity among predictor variables can result in inaccurate assessments of their individual impacts.

2.4.1 Descriptive debacles

This pitfall refers to errors in data analysis stemming from inadequate or misleading use of descriptive statistics. This bias occurs when analysts present summary statistics that are incomplete, misleading, or not properly contextualized. It also includes misrepresenting data through inappropriate visualizations or failing to provide sufficient detail in describing data characteristics.

In practical terms, this bias leads to misunderstandings or incorrect interpretations of data. For example, using measures like the mean or median without considering data variability can oversimplify complex datasets. Similarly, misleading graphs or charts may obscure important trends or relationships within the data.

To mitigate this bias, analysts should employ a range of appropriate descriptive statistics tailored to data types and research questions. They should ensure visual representations accurately depict data characteristics without distorting interpretations. Providing adequate context and explanations helps stakeholders understand data limitations and implications more effectively.

2.4.2 Inferential infernos

Frequently, researchers draw erroneous or exaggerated conclusions from statistical analyses. This bias manifests when statistical significance is misinterpreted, leading to unwarranted claims of causation or generalization beyond the scope of the data.

Researchers may overlook alternative explanations or fail to consider confounding variables, thereby overstating the implications of their findings. To mitigate this bias, it is essential to apply statistical tests accurately, interpret results cautiously, and acknowledge the limitations and uncertainties of the data.

Recommended reading

Read Wysocki et al. (2022), which is freely available here. A good summary of the paper can be found here.

2.4.3 Slippery sampling

Systematic errors can be introduced in research studies due to flawed or biased sampling methods. This bias occurs when the sample selected for study does not accurately represent the target population, leading to skewed or misleading conclusions.

In practice, researchers may inadvertently use convenience sampling methods or fail to account for selection biases, resulting in a sample that is not representative of the broader population. As a result, findings based on such samples may not generalize well beyond the specific sample group, undermining the external validity of the study.

Recommended reading

Read the chapter “Sampling” of Huber (2024c).

2.4.4 Insensitivity to sample size

This bias refers to the tendency to overlook or underestimate the impact of sample size on the reliability and validity of study results. This bias occurs when analysts or researchers fail to recognize that smaller sample sizes may produce more variable or less representative data, leading to less reliable conclusions.

In practical terms, studies with smaller sample sizes are more prone to random fluctuations and may not accurately reflect the characteristics of the larger population. Consequently, findings based on small samples may lack statistical power and may not be generalizable to broader contexts or populations.

Here is an incomplete summary of common statistical slipups:

Misunderstanding statistical significance

Incorrect interpretation of p-values and confidence intervals can lead to erroneous conclusions about the strength of relationships in data.

Sampling errors

Biased sampling or inadequate sample size selection can distort results and impair the generalizability of findings.

Ignoring multicollinearity

Failure to account for correlations between predictor variables in regression analysis can lead to unreliable coefficient estimates and misjudgment of variable impacts.

Misapplication of hypothesis tests

Improper formulation of hypotheses or disregarding test assumptions can invalidate statistical results and mislead interpretations.

Failure to adjust for multiple comparisons

Conducting multiple tests without adjusting significance levels increases the likelihood of false positives and misleading conclusions.
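A short simulation with purely synthetic p-values illustrates how quickly the familywise false-positive rate grows, and how a Bonferroni correction restores it:

```python
import random

random.seed(1)

def null_p_value():
    """Simulated p-value for a test where the null hypothesis is true:
    under the null, p-values are uniform on (0, 1)."""
    return random.random()

alpha, n_tests, n_runs = 0.05, 20, 10_000

# How often does at least one of 20 null tests come out 'significant'?
hits = sum(
    any(null_p_value() < alpha for _ in range(n_tests))
    for _ in range(n_runs)
)

# Bonferroni correction: test each hypothesis at alpha / n_tests instead
hits_bonferroni = sum(
    any(null_p_value() < alpha / n_tests for _ in range(n_tests))
    for _ in range(n_runs)
)

print(f"familywise false-positive rate: {hits / n_runs:.3f}")             # ~0.64
print(f"with Bonferroni correction:     {hits_bonferroni / n_runs:.3f}")  # ~0.05
```

With 20 independent tests at the 5% level, roughly two thirds of analyses report at least one spurious “discovery” even though every null hypothesis is true.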

2.5 Analytical Aberrations

2.5.1 The intuition/analysis false dichotomy

This error stems from the misconception that intuition and analytical thinking are mutually exclusive approaches to decision-making. In reality, effective decision-making often integrates both intuitive insights and analytical rigor. Intuition can provide valuable initial impressions or creative ideas, while analysis offers systematic evaluation and validation of hypotheses. Acknowledging and balancing these two aspects enhances decision-making processes by leveraging diverse perspectives and ensuring comprehensive consideration of factors influencing outcomes.

Recommended reading

Read the chapter “Decision making basics” of Huber (2024b).

While intuition comes with multiple cognitive biases, it remains valuable in decision-making for several reasons. Firstly, it often draws on subconscious processing of vast amounts of information, leading to quick insights that can guide effective decisions. Secondly, intuition can fill gaps left by incomplete data, offering solutions when data is scarce or ambiguous. Thirdly, it allows for creativity and innovation by providing unconventional perspectives that data alone may not reveal. Fourthly, intuition serves as a valuable complement to analytical thinking, providing a holistic approach to problem-solving. Lastly, it facilitates swift decision-making in fast-paced environments where immediate action is crucial. Integrating intuition with data-driven approaches enhances decision-making by leveraging both analytical rigor and intuitive wisdom.

2.5.2 Exuberant extrapolations

Exuberant Extrapolations refer to the tendency to make overly optimistic or exaggerated predictions based on limited data or trends. This pitfall occurs when analysts or decision-makers extend current trends far into the future without considering potential deviations or external factors that could influence outcomes. It can lead to unrealistic expectations, flawed projections, and poor decision-making due to overconfidence in extrapolated outcomes. Mitigating this pitfall requires cautious interpretation of trends, consideration of uncertainties, and validation through robust analysis and scenario planning to ensure more realistic and reliable forecasts.
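A small sketch with invented market-share figures: a linear trend fitted to early, already-slowing growth produces an absurd long-range projection.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical market share (%) in years 1-5: growth is already flattening
years = [1, 2, 3, 4, 5]
share = [2.0, 4.0, 5.5, 6.5, 7.0]

slope, intercept = linear_fit(years, share)

# Exuberant extrapolation: push the early linear trend out to year 20
year_20 = slope * 20 + intercept
print(f"fitted trend: {slope:.2f} share points per year")
print(f"naive year-20 projection: {year_20:.1f}% market share")  # ~26%
```

The data itself already hints at saturation near 7-8%, yet the straight-line projection promises more than triple that; scenario analysis and saturating growth models guard against such exuberance.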

Exercise 2.3 Past and present

Again, let us consider “How Successful Leaders Think” by Roger Martin (2007).

Discuss the insights that he puts forward. Can we really learn from the experience of successful leaders?

2.5.3 Ill-advised interpolations

Ill-advised interpolations occur when analysts or decision-makers infer trends or make assumptions between data points without sufficient evidence or justification. This pitfall involves filling gaps in data with speculative assumptions or linear extrapolations that do not accurately reflect the underlying patterns or relationships. It can lead to misleading interpretations, erroneous conclusions, and misguided decision-making based on incomplete or flawed information.

2.5.4 Funky forecasts

This pitfall arises when forecasters fail to account for changing circumstances, unexpected events, or complex interactions among factors influencing future trends. It can lead to unreliable projections, erroneous planning, and ineffective decision-making based on overly simplistic or deterministic forecasts. To mitigate this pitfall, analysts should use robust forecasting techniques, incorporate scenario analysis to account for uncertainties, and continuously update forecasts based on new information and changing conditions.

2.5.5 Moronic measures

Moronic measures refer to the use of inappropriate or inadequate metrics for evaluating performance or assessing outcomes. This pitfall occurs when decision-makers rely on metrics that do not accurately capture the intended goals or fail to account for relevant factors influencing performance. It can lead to misleading assessments, misguided resource allocation, and ineffective strategies due to a lack of alignment between chosen metrics and organizational objectives. To mitigate this pitfall, organizations should carefully select metrics that are relevant, measurable, and aligned with strategic priorities. They should also consider using complementary metrics and qualitative assessments to provide a comprehensive understanding of performance outcomes. By using appropriate measures and periodically reviewing their relevance, organizations can enhance decision-making and ensure that performance evaluations contribute meaningfully to achieving desired outcomes.

2.6 Graphical Gaffes

Recommended reading

Read the chapter Visualize data of Huber (2024a) and the literature that you find cited therein.

2.6.1 Challenging charts

Pitfalls in data visualization are quite common: charts are frequently poorly designed, misleading, or difficult to interpret. Communicating insights effectively through graphical visualizations of data is genuinely difficult; cluttered layouts, inappropriate chart types, distorted scales, and ambiguous labeling all stand in the way. These flaws can lead to confusion, misinterpretation of data trends, and inaccurate decision-making based on flawed visual presentations. To mitigate this pitfall, analysts should adhere to best practices in data visualization, such as using clear and simple designs, choosing appropriate chart types for the data being presented, ensuring accurate scaling and labeling, and providing contextual information to aid interpretation.

2.6.2 Data dogmatism

Data Dogmatism refers to the pitfall of rigidly adhering to data-driven decisions without considering broader context, qualitative insights, or expert judgment. This pitfall occurs when organizations or decision-makers prioritize data alone as the sole basis for decision-making, ignoring nuanced factors that may influence outcomes. It can lead to overly deterministic decisions, missed opportunities, and ineffective strategies that fail to account for human expertise, market dynamics, or unforeseen variables. To mitigate this pitfall, organizations should adopt a balanced approach that integrates data-driven insights with qualitative analysis, expert opinions, and contextual understanding. By recognizing the limitations of data and embracing a more holistic decision-making framework, organizations can enhance flexibility, innovation, and strategic agility in response to complex challenges.

2.6.3 The optimize/satisfice false dichotomy

This dichotomy refers to the misconception that decision-making strategies must exclusively prioritize either optimizing for the best possible outcome or satisficing by accepting a satisfactory but not necessarily optimal solution. This pitfall arises when decision-makers perceive a dichotomy between these approaches, potentially overlooking opportunities for adaptive strategies that blend elements of both optimization and satisficing. It can lead to suboptimal decisions, missed opportunities for innovation, or inefficient resource allocation due to rigid adherence to a singular decision-making approach. To mitigate this pitfall, organizations should adopt a flexible and adaptive decision-making framework that considers both optimizing for excellence where feasible and satisficing when constraints or uncertainties warrant. By embracing a nuanced approach that balances ambition with pragmatism, organizations can enhance resilience, innovation, and effectiveness in achieving strategic objectives.

2.7 Design Dangers

Recommended reading

Read the chapter Visualize data of Huber (2024a) and the literature that you find cited therein.

This section encompasses pitfalls related to the design and implementation of data-driven projects or experiments. It includes errors such as biased sampling methods, inadequate experimental design, or flawed data collection protocols.

2.7.1 Confusing Colors

In data visualization, the choice of colors matters. When colors are poorly contrasted, misaligned with the meaning of the data, or inconsistently applied across visual elements, it may be difficult to distinguish different data categories or trends. The wrong colors can lead to misunderstandings and inaccurate conclusions. Moreover, a substantial share of the population is color-blind, so palettes should remain distinguishable for them as well.

2.7.2 Usability uh-ohs

This pitfall occurs when systems lack intuitive interfaces, are overly complex, or fail to meet user needs and expectations. It can result in frustration, inefficiencies, or resistance to adoption among users, leading to underutilization of valuable data resources. To mitigate this pitfall, organizations should prioritize user-centric design principles, conduct usability testing with diverse user groups, and solicit feedback to refine interfaces and functionalities. By enhancing usability, organizations can improve user satisfaction, facilitate smoother workflows, and maximize the utility of data-driven tools for informed decision-making.

2.7.2.1 Good visualizations are discoverable and understandable

Good visualizations are designed to be easily discovered and understood by users. They prioritize clarity and accessibility, allowing users to quickly grasp the presented information without ambiguity or unnecessary complexity.

2.7.2.2 Don’t blame people for getting confused or making errors

Users should not be blamed for confusion or errors in understanding visualizations. Instead, designers should strive to create intuitive and user-friendly designs that minimize cognitive load and facilitate accurate interpretation of data.

2.7.2.3 Designing for pleasure and emotion is important

Effective visualizations consider the emotional and aesthetic impact on users. Design elements that evoke positive emotions and engage users can enhance comprehension and retention of information, making the visualization more impactful and memorable.

2.7.2.4 Complexity is good, confusion is bad

Complexity in visualizations can enrich understanding by presenting detailed information and relationships. However, complexity should not lead to confusion or overwhelm users. Designers must balance complexity with clarity to ensure that insights are communicated effectively without causing cognitive overload.

2.7.2.5 Absolute precision isn’t always necessary

Visualizations do not always need to convey absolute precision in every detail. Depending on the context and audience, approximations or visual representations that convey trends and patterns effectively may be sufficient. Designers should prioritize meaningful insights over exhaustive accuracy in all data points.