Experiment Design: You Thought the Experiment Was Fair, but It Never Was

In 1924, the Western Electric Company ran an experiment at its Hawthorne factory outside Chicago: they wanted to see whether improving lighting conditions would boost worker productivity.

The experimental group worked under better lighting, and productivity went up. Then they dimmed the lights — productivity still went up. The researchers kept trying various changes — shorter hours, different break schedules, rearranged workstations. Nearly every change improved productivity, regardless of the direction of the change.

Eventually they realized: what boosted productivity was not the lighting or the hours — it was “being observed” itself. Workers knew they were being studied, so they worked harder.

This is the thorniest problem in experiment design: you ask a question, and the way you ask it has already shaped the answer. You observe a behavior, and the act of observing has already changed that behavior. You design an experiment, and the structure of the experiment itself has already determined what you can see.

No experiment is perfectly neutral. Good experiment design does not eliminate these problems — it brings them into known territory so their impact can be estimated and controlled.


42. Hawthorne Effect: Being Observed Changes Behavior

Subjects change their behavior because they are aware they are being observed. In app user research, the act of inviting users to “come test a new feature” already makes this group different from normal users — they are more engaged, more willing to try things, less likely to give up. What you are measuring is not “how ordinary users interact with the new feature” but “how users who know they are being tested behave.”

In remote work culture, this effect is even harder to handle: when you announce “this week we are observing engineering productivity,” what you measure is the productivity of “engineers who know they are being observed,” not their usual output. A common mistake managers make is treating data from the observation period as a baseline, then setting targets that are impossible to sustain under normal conditions.
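
One way to keep this in view is to compare the announced observation window against a baseline collected when nobody knew they were being watched. Below is a minimal sketch in Python; the file name, column names, and dates ("daily_productivity.csv", "prs_merged", the March week) are made up for illustration:

```python
import pandas as pd

# Hypothetical daily output metric (e.g. merged PRs per engineer per day).
# File name, column names, and dates are assumptions for illustration.
df = pd.read_csv("daily_productivity.csv", parse_dates=["date"])

obs_start, obs_end = "2024-03-04", "2024-03-08"    # the announced observation week
observed = df[(df["date"] >= obs_start) & (df["date"] <= obs_end)]
baseline = df[df["date"] < obs_start].tail(20)      # roughly the prior four work weeks

lift = observed["prs_merged"].mean() / baseline["prs_merged"].mean() - 1
print(f"Observation week vs. unannounced baseline: {lift:+.0%}")

# A large positive lift is a hint that observation itself inflated the numbers:
# set targets from the unannounced baseline, not from this week.
```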

43. Placebo Effect: Nothing Changed, but Users Feel Different

Subjects experience a real effect because they believe they received an intervention — even if the intervention itself does nothing.

After receiving a notification saying “We’ve updated our algorithm to provide a more personalized experience,” users may genuinely feel the experience improved — even if the backend changed nothing. In drug clinical trials, it is common for 20–30% of placebo group patients to report symptom improvement — they received no real medication, but they believed they did.

The Hawthorne effect and the placebo effect look similar but have different mechanisms: the Hawthorne effect is subjects changing because they “know they are being observed”; the placebo effect is subjects changing because they “believe the intervention is effective.” The former is a change in performance; the latter is a change in subjective experience — sometimes both happen simultaneously.

44. Experimenter Expectancy Effect: The Researcher’s Expectations Influence the Subject

The researcher’s eye contact, tone of voice, and follow-up questions unconsciously telegraph the “correct answer” to subjects. Even without deliberate guidance, subtle body language and word choices can systematically influence results.

The Rosenthal effect (also called the Pygmalion effect) is a famous example: teachers were told that certain students, actually selected at random, had exceptional potential and would learn quickly, and a year later those students genuinely performed better than their peers. The teachers’ expectations changed their teaching behavior, which in turn affected student performance.

Double-blind design exists precisely to counter the placebo effect and the experimenter expectancy effect: neither the subjects nor the researchers know which group is which. When even the researcher does not know, there are no expectations to transmit.
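
As a rough illustration, blinding can be carried all the way into the analysis stage by giving groups neutral codes and keeping the code-to-condition key out of the analysts' hands until the analysis is locked. The sketch below is a simplified illustration with hypothetical user IDs, not a full trial protocol:

```python
import random

# Assign users to neutrally labeled groups; the labels carry no hint of
# which one is the treatment. User IDs are hypothetical.
user_ids = [f"user_{i}" for i in range(1000)]
random.seed(42)
random.shuffle(user_ids)

half = len(user_ids) // 2
assignment = {uid: ("group_A" if i < half else "group_B")
              for i, uid in enumerate(user_ids)}

# The unblinding key is stored separately, held by someone outside the
# analysis team, and only applied after the analysis plan is locked.
unblinding_key = {"group_A": "treatment", "group_B": "control"}

# During the experiment, researchers and analysts see only "group_A" and
# "group_B", so they have no expectations to transmit to either group.
```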

45. Intervention Bias: Something Other Than Your Variable Was Different

The control group and experimental group were treated differently in ways beyond the experimental variable.

The most common form: users in the experimental group received a “Thank you for participating in our new feature test” email, while the control group did not. That email is an intervention — regardless of whether the feature is good, the mere act of receiving special attention can influence user behavior. You cannot separate “the effect of the feature” from “the effect of being treated as a special user.”

In A/B testing, if the experimental and control groups were acquired at different times (say, the experimental group on Friday and the control group on Monday), the characteristics of those days could become an uncontrolled intervention.
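
A common way to avoid the time-of-recruitment problem is to assign every incoming user to a group deterministically, by hashing their ID, so that both groups are drawn from the same days and the same traffic. A minimal sketch, with a made-up experiment salt:

```python
import hashlib

def assign_group(user_id: str, salt: str = "experiment_42") -> str:
    """Deterministic, concurrent assignment: the same user always lands in
    the same group, and both groups fill up from the same traffic at the
    same time. The salt name is hypothetical."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

print(assign_group("user_123"))
# The same principle applies to every touchpoint: if one group gets a
# "thanks for testing" email, the other group should get the identical
# email, or neither should.
```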

46. Non-Response Bias: The People Who Did Not Fill Out the Survey Are Fundamentally Different

A survey with a 20% response rate — are the 80% who did not respond the same kind of people as the 20% who did? Almost certainly not.

Why do people fill out surveys? They have time, they have strong feelings (very satisfied or very dissatisfied), they care about the product, or they are simply more cooperative with research. Why do people skip them? They are too busy, their feelings are not strong, they have long since stopped using the product, or they cannot even find the survey link.

In satisfaction surveys, that silent 80% may be exactly the people you most need to understand: the “invisible majority” who are neither particularly satisfied nor dissatisfied, or people who churned and never interacted with the product again. Listening only to the 20% gives you a systematically skewed picture.
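
A quick sanity check is to compare respondents and non-respondents on data you already have for everyone, such as activity and tenure, before trusting the survey numbers. A minimal sketch, with hypothetical file and column names:

```python
import pandas as pd

# Hypothetical export: one row per user the survey was sent to, with a
# "responded" flag plus behavioral covariates you already track.
users = pd.read_csv("survey_population.csv")

comparison = users.groupby("responded")[["weekly_sessions", "tenure_days"]].mean()
print(comparison)

# If respondents are far more active or longer-tenured than non-respondents,
# the 20% who answered do not stand in for the silent 80%: the results need
# reweighting or targeted follow-up, not a face-value reading.
```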

47. Questionnaire Bias: The Wording and Design of Questions Determine the Direction of Answers

The phrasing of questions, design of response options, and order of questions can all systematically influence answers. This is not necessarily deliberate, but the effect is the same.

“Do you like this new feature?” and “What are your thoughts on this new feature?” generate completely different data. The former already implies a positive lean — users must actively “disagree” to give a negative answer; the latter is open-ended, allowing users to naturally express any feeling.

“Do you think the price is reasonable? (1-5)” and “How much do you think this feature is worth?” also yield different results. The former uses the word “reasonable,” implying you think it should be reasonable; the latter lets users form their own judgment.

Option design matters too: if your satisfaction scale offers “Very Satisfied, Satisfied, Okay,” you have already removed the negative options, forcing users to choose among positive ones.

48. Information Bias: Your Data Labels Are Wrong

Garbage in, garbage out. In machine learning, if training data labels are completed by humans, the annotators’ biases, fatigue, and inconsistent understanding will systematically contaminate your model.

Different annotators may define “positive sentiment” differently. One annotator labels “this app is okay” as positive; another labels it as neutral. Your 10 million training examples are a mixture labeled by different people, under different standards, in different states of fatigue and attention. The model learns the annotators’ biases, not actual user sentiment.
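
A standard way to make this visible is to have multiple annotators label the same sample and measure how often they agree, for example with Cohen's kappa. A minimal sketch using scikit-learn, with made-up labels:

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators labeling the same six reviews (made-up data).
annotator_1 = ["positive", "neutral", "positive", "negative", "positive", "neutral"]
annotator_2 = ["neutral",  "neutral", "positive", "negative", "neutral",  "neutral"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")

# Kappa near 1.0 means the labeling standard is consistent; values around
# 0.4-0.6 mean a chunk of your "ground truth" reflects who happened to
# label each example, and that is what the model will learn.
```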

In medical data, diagnostic inconsistencies (different hospitals, different doctors) mean the same symptoms get labeled as different diseases, or the same disease gets different diagnostic codes. A model trained on this data is actually learning the patterns of “diagnostic inconsistency,” not the patterns of the disease itself.

49. Detection Bias: Looking Harder Finds More, but That Does Not Mean the Phenomenon Is More Common

When some groups are searched more intensively than others, differences in detection rates get mistaken for differences in the phenomenon itself.

Cancer screening is more intensive in wealthy areas, so wealthy areas “discover” more early-stage cancers. This is not because wealthy areas have more cancer — it is because they looked harder. If you only look at “diagnosis rates,” you reach the completely wrong conclusion that “wealthy areas have higher cancer incidence.”

The same happens in software: if you run detailed error tracking only for paying users while free users get basic logging, you will find that paying users encounter “more” bugs. In reality, you are simply watching them more closely. Concluding from this gap that paying users have more problems that need solving leads to misallocated resources.
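
Before acting on such a gap, it helps to hold detection effort constant, for instance by comparing only sessions that were instrumented to the same depth. A minimal sketch, with hypothetical column names:

```python
import pandas as pd

# Hypothetical per-session export with a segment label, the depth of error
# tracking that session had, and whether an error was recorded.
sessions = pd.read_csv("sessions.csv")

# Compare error rates only where both segments were watched equally closely.
comparable = sessions[sessions["tracking_level"] == "basic"]
error_rates = comparable.groupby("segment")["had_error"].mean()
print(error_rates)

# If the gap shrinks or vanishes once detection effort is equal, the
# "paying users hit more bugs" finding was detection bias, not a real
# difference in product quality.
```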

50. Exclusion Bias: Removing Outliers May Remove the Most Important Signal

During data cleaning, “removing obvious outliers” sounds reasonable. But those “anomalies” may be exactly the users you most need to understand.

If you remove “users with session times over 8 hours” as outliers when analyzing user behavior, you may be deleting your most loyal power users. Their usage pattern simply differs from normal users — that is their defining characteristic, not a data error.

If you exclude “crashed session data” as noise, you are excluding the very fact that “the app crashed.” A user leaving because of a crash is not “abnormal” — it is the problem you need to fix.

There are two kinds of outliers: genuine data errors (a clearly impossible value was recorded) and extreme but real situations in the actual world. The former should be cleaned; the latter needs to be understood, not deleted.
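
In practice this usually means flagging extreme records instead of dropping them, and reserving deletion for values that are physically impossible. A minimal sketch, with hypothetical thresholds and column names:

```python
import pandas as pd

sessions = pd.read_csv("sessions.csv")

# Impossible values are data errors: negative durations, or longer than a day.
impossible = (sessions["duration_hours"] < 0) | (sessions["duration_hours"] > 24)

# Extreme but real values get flagged and kept, not deleted.
clean = sessions[~impossible].copy()
clean["is_power_session"] = clean["duration_hours"] > 8

print(f"Dropped {impossible.sum()} impossible records; "
      f"flagged {int(clean['is_power_session'].sum())} extreme-but-real sessions")

# Analyze the flagged sessions separately (or use robust statistics) rather
# than deleting them: they may describe your most important users.
```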


The challenge of experiment design almost always centers on “how to make what you measure as close as possible to what you truly want to know.” Observation itself changes what is observed, questions themselves influence answers, exclusion criteria themselves shape what you can see. Good experiment design does not find a method immune to these problems — it quantifies and controls them so that their contamination of your conclusions is known rather than unknown.