High-yield overview

False Positives | False Negatives | Error Rates

Type I: False Positive (Alpha)

Type II: False Negative (Beta)

5%: Conventional Alpha

20%: Conventional Beta (80% Power)

Error Types by Truth

Outcome	Pattern	Meaning
Type I Error (Alpha)	Reject null when null is true	False Positive - find a difference that does not exist
Type II Error (Beta)	Fail to reject null when alternative is true	False Negative - miss a difference that exists
Correct Rejection	Reject null when alternative is true	True Positive - correctly find a real difference
Correct Non-Rejection	Fail to reject null when null is true	True Negative - correctly find no difference

Critical Must-Knows

Type I Error (Alpha): Concluding there IS an effect when there is NOT (false positive). Set before study, usually 0.05.
Type II Error (Beta): Concluding there is NO effect when there IS (false negative). Related to power: Power = 1 minus Beta.
Trade-off: Reducing alpha (e.g., 0.01) reduces Type I error but increases Type II error risk unless sample size increases.
Clinical Consequences: Type I leads to adopting ineffective treatments; Type II leads to discarding effective treatments.
Multiple Comparisons: Testing many hypotheses inflates Type I error (family-wise error) - need correction (Bonferroni).

Clinical Pearls

Alpha is set BEFORE the study (usually 0.05); the p-value is calculated AFTER from the data
Underpowered studies have a high Type II error risk - they may miss real treatment effects
Type I error is often considered worse - adopting an ineffective or harmful treatment can be worse than missing a beneficial one - but this depends on context
Multiple testing without correction inflates the family-wise Type I error above 0.05

Critical Error Concepts

Type I Error (False Positive)

Definition: Rejecting the null hypothesis when the null is actually true. Example: Concluding a new treatment is better when it actually is not. Alpha = 0.05 accepts a 5% risk of this error when the null is true.

Type II Error (False Negative)

Definition: Failing to reject the null when the alternative is true. Example: Concluding treatments are equivalent when the new treatment is actually better. Beta = 0.20 (power 80%) accepts a 20% risk of missing a true effect of the assumed size.

Alpha-Beta Trade-off

Relationship: Reducing alpha (stricter threshold) increases beta (Type II error risk) unless sample size increases. Cannot minimize both errors simultaneously with fixed sample.

Clinical Consequences

Type I Consequence: Adopt ineffective or harmful treatment. Type II Consequence: Discard effective treatment. Which is worse depends on context - severity of disease, treatment risks.

Mnemonic

CRWA: Type I vs Type II Errors

C	Crying Wolf	Type I = False alarm (say difference exists when it does not)
R	Reality check	Check if null is actually true - if yes and you reject, Type I error
W	Wolf present but missed	Type II = Missing real threat (say no difference when there is one)
A	Acceptance when shouldn't	Accept null when alternative is true = Type II error

Hook: The Boy Who Cried Wolf - Type I is crying wolf falsely (false positive), Type II is missing the real wolf (false negative)!

Mnemonic

PAWS: Error Consequences and Prevention

P	Pre-set Alpha	Set Type I error rate before study (usually 0.05)
A	Adequate Power	Aim for at least 80% power to limit Type II error (beta = 0.20 or less)
W	Watch Multiple Comparisons	Bonferroni correction for multiple tests to control Type I error
S	Sample Size	Larger sample reduces Type II error at a fixed alpha, and lets you tighten alpha without losing power

Hook: Use your PAWS to prevent errors - proper planning prevents poor performance!

Overview/Introduction

What is Type I Error?

Definition: Rejecting the null hypothesis when the null hypothesis is actually true.

Common Name: False Positive

Example: Concluding a new surgical technique is superior when it actually has no benefit.

Consequences:

Adopt ineffective or harmful treatment
Waste resources implementing change
Potential harm to patients
False confidence in intervention

Alpha Level Selection

Alpha Thresholds and Implications

Alpha	Type I Error Risk	When Used	Trade-off
0.01	1% false positive rate	When Type I error is very costly (e.g., drug approval)	Requires larger sample or accepts higher Type II error
0.05	5% false positive rate	Conventional in most research	Balance between Type I and Type II errors
0.10	10% false positive rate	Exploratory or pilot studies	Easier to find significance but higher false positive risk

Key Point: Alpha is set BEFORE the study. The p-value is calculated AFTER from the data. If p is less than alpha, reject the null.

Principles of Error Testing

Core Principles

The Error Trade-Off:

Decreasing Type I error (lower alpha) increases Type II error risk
Decreasing Type II error (higher power) increases sample size needed
Cannot minimize both simultaneously without increasing sample size

Control Strategies:

Type I (Alpha): Pre-specify alpha, use appropriate corrections for multiple testing
Type II (Beta): Adequate sample size, appropriate effect size assumptions

Clinical Decision Framework: When is each error more serious?

Type I more serious: Invasive treatment, irreversible decision, expensive intervention
Type II more serious: Missing life-saving treatment, rare disease with few options

Understanding these principles guides appropriate study design.
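
The trade-off above can be made concrete with a rough sample-size calculation. The sketch below is illustrative only: it assumes a two-sided comparison of two group means, a normal approximation, and a standardised effect size (difference divided by SD) of 0.5 chosen purely for the example. The function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(alpha: float, power: float, effect_size: float) -> int:
    """Approximate patients per arm (two-sided test, normal approximation).

    effect_size is the standardised difference (difference / SD).
    This is an illustrative approximation, not a substitute for a
    formal power calculation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


# Hypothetical scenario: standardised effect size of 0.5
for alpha in (0.05, 0.01):
    for power in (0.80, 0.90):
        n = n_per_group(alpha, power, 0.5)
        print(f"alpha={alpha}, power={power:.0%}: about {n} per group")
```

Under these assumptions, tightening alpha from 0.05 to 0.01 at 80% power raises the requirement from roughly 63 to roughly 94 patients per arm: the price of a lower Type I error rate is either more patients or more Type II error.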

Understanding Type II Error (Beta)

What is Type II Error?

Definition: Failing to reject the null hypothesis when the alternative hypothesis is actually true.

Common Name: False Negative

Example: Concluding two treatments are equivalent when one is actually superior.

Consequences:

Discard effective treatment
Delay progress in patient care
Wasted research effort (failed trial)
Miss therapeutic opportunity

Relationship to Power: Power = 1 minus Beta

Beta and Power

Beta and Power Relationship

Beta	Power	Interpretation	Sample Size
0.05	95%	Very high power - 95% chance detecting real effect	Very large sample needed
0.10	90%	High power - 90% chance detecting real effect	Large sample needed
0.20	80%	Adequate power - 80% chance detecting real effect	Moderate sample, conventional target
0.50	50%	Underpowered - coin flip chance of detection	Small sample, high Type II error risk

Understanding Type II error is critical for interpreting negative study results.
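
To see how beta depends on sample size, the following sketch estimates power for a two-sided comparison of two means using a normal approximation. The effect size of 0.5 and the group sizes are assumptions picked for illustration, and `power_two_sample` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(n_per_group: int, effect_size: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample comparison of means.

    effect_size is the true standardised difference (difference / SD).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)  # expected z statistic under H1
    # Probability the test statistic lands in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical group sizes, assumed effect size 0.5
for n in (20, 64, 100):
    pw = power_two_sample(n, 0.5)
    print(f"n={n} per group: power about {pw:.2f}, beta about {1 - pw:.2f}")
```

With 20 patients per arm the approximate power is only about 35%, so a non-significant result says little; around 64 per arm is needed to reach the conventional 80% under these assumptions.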

Error Matrix and Decision Framework

The 2x2 Truth Table

Statistical Decision vs Reality Matrix

Type I and Type II Errors

Your Decision	Null is TRUE	Alternative is TRUE
Reject Null (p less than alpha)	TYPE I ERROR (False Positive) - Alpha = 0.05	CORRECT DECISION (True Positive) - Power
Fail to Reject Null (p greater than alpha)	CORRECT DECISION (True Negative) - 1 minus Alpha	TYPE II ERROR (False Negative) - Beta = 0.20

Key Insight: We never know which column we are in (true state of nature is unknown). We set alpha and beta to control error rates.
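
Although the true column is unknown in a real study, a simulation can fix the truth and count how often each error occurs. The sketch below is a simplified illustration: it assumes normally distributed outcomes with a known SD of 1, 64 patients per arm, and a z-test; all of these settings and the function name `rejection_rate` are assumptions for the example.

```python
import random
from math import sqrt
from statistics import NormalDist


def rejection_rate(true_diff, n_per_group=64, alpha=0.05, trials=2000, seed=1):
    """Fraction of simulated trials that reject the null (two-sided z-test, SD = 1)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(2 / n_per_group)  # standard error of the difference in means
    rejections = 0
    for _ in range(trials):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_diff, 1.0) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        if abs(diff / se) > z_crit:
            rejections += 1
    return rejections / trials


# Column 1: null is true (no difference) -> every rejection is a Type I error
type_1_rate = rejection_rate(true_diff=0.0)
# Column 2: alternative is true (difference of 0.5 SD) -> every non-rejection is a Type II error
power = rejection_rate(true_diff=0.5)
print(f"Type I error rate: about {type_1_rate:.3f}")
print(f"Power: about {power:.3f}, Type II error rate: about {1 - power:.3f}")
```

The simulated Type I rate should sit near alpha and the Type II rate near 0.20, matching the two error cells of the matrix.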

Multiple Comparisons and Type I Error Inflation

The Multiple Testing Problem

Problem: Testing multiple hypotheses inflates overall Type I error rate.

Example: Testing 20 different outcomes at alpha = 0.05 each.

Expected false positives: 20 × 0.05 = 1 false positive on average
Family-wise error rate (FWER): Probability of at least one Type I error increases with each test

Formula for FWER (assuming n independent tests): FWER = 1 minus (1 minus alpha)^n

For 20 tests at alpha = 0.05: FWER = 1 minus 0.95^20 = 0.64 (64% chance of at least one false positive)
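
The arithmetic above is easy to check in a few lines. This sketch assumes the tests are independent, which is what the formula requires; correlated outcomes in a real trial would inflate the rate less. The function name is illustrative.

```python
def family_wise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 10, 20):
    expected_false_positives = n * alpha if (alpha := 0.05) else 0
    fwer = family_wise_error_rate(n)
    print(f"{n} tests: expected false positives {expected_false_positives:.2f}, FWER {fwer:.2f}")
```

For 20 tests this gives a FWER of about 0.64, and for 10 tests about 0.40.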

Bonferroni Correction

Method: Divide alpha by number of tests to maintain overall Type I error.

Formula: Adjusted alpha = alpha / n (e.g., 0.05 / n)

Example: Testing 5 outcomes → Adjusted alpha = 0.05 / 5 = 0.01

Use p less than 0.01 as threshold for each test to maintain overall Type I error at 0.05

Trade-off: Conservative - may increase Type II error (reduce power).
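
A minimal sketch of the correction follows. The five p-values are invented for illustration and do not come from any study; `bonferroni` is a hypothetical helper name.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the adjusted threshold and a significance flag for each p-value."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]


# Hypothetical p-values for five outcomes in one trial
p_values = [0.04, 0.008, 0.20, 0.03, 0.60]
threshold, significant = bonferroni(p_values)

print(f"Adjusted alpha: {threshold:.3f}")
for p, flag in zip(p_values, significant):
    print(f"p = {p}: {'significant' if flag else 'not significant'} after correction")
```

Three of these hypothetical results fall below 0.05, but only p = 0.008 survives the adjusted threshold of 0.01, which illustrates both the protection against false positives and the loss of power.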

When to Correct for Multiple Comparisons

Correct: When testing multiple related hypotheses (e.g., multiple outcome measures in same trial).

May NOT need correction: A single pre-specified primary outcome is tested at the full alpha = 0.05 without adjustment; secondary and exploratory outcomes are then interpreted as hypothesis-generating.

Understanding multiple comparisons prevents inflated Type I error rates.

Clinical Application

Which Error is Worse?

Context-Dependent: Type I (false positive) often considered worse - adopting harmful treatment. But Type II (false negative) can be worse if missing life-saving treatment. Balance depends on disease severity and treatment risk.

Screening Tests

Type I in Screening: False positive → unnecessary workup, anxiety. Type II: False negative → missed diagnosis, delayed treatment. Serious diseases (cancer) prioritize minimizing Type II (high sensitivity).

Underpowered Studies

High Beta Risk: Many orthopaedic trials underpowered (power under 80%, beta greater than 0.20). Negative results may be Type II errors. Always check power before accepting negative result.

Meta-Analysis Solution

Combining Studies: Meta-analysis increases power by pooling data from multiple studies. Reduces Type II error risk, provides more precise effect estimate.
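
How pooling raises power can be sketched with a simple fixed-effect (inverse-variance) combination. This is one common pooling approach, used here only as an illustration; the three trials and their numbers are invented, and real meta-analyses must also assess heterogeneity. The helper name `pool_fixed_effect` is hypothetical.

```python
from math import sqrt


def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled_se


# Three hypothetical small trials, each with the same effect estimate and SE
estimates = [0.3, 0.3, 0.3]
standard_errors = [0.2, 0.2, 0.2]

for est, se in zip(estimates, standard_errors):
    print(f"Single trial: z = {est / se:.2f} (below 1.96, not significant)")

pooled, pooled_se = pool_fixed_effect(estimates, standard_errors)
print(f"Pooled: estimate {pooled:.2f}, SE {pooled_se:.3f}, z = {pooled / pooled_se:.2f}")
```

Each hypothetical trial alone is non-significant, yet the pooled standard error shrinks enough for the same effect to exceed the 1.96 threshold, which is the sense in which pooling reduces Type II error.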

Controversies and Areas of Uncertainty

Should alpha stay at 0.05?

A 2017 proposal argued for lowering the default threshold for new claims to 0.005 to curb false positives; critics countered this simply trades a higher Type I rate for a higher Type II rate and demands much larger samples. No global consensus exists, and 0.05 remains the working convention.

Abandon significance testing?

Some statisticians advocate retiring the word "significant" altogether in favour of estimation (effect sizes with confidence intervals) and Bayesian reasoning. Exam answers should still command the classical framework but can acknowledge this debate.

When to correct for multiplicity

Whether and how to adjust for multiple comparisons is genuinely contested (Perneger vs proponents of strict family-wise control). The defensible middle ground: pre-specify one primary outcome; treat all else as hypothesis-generating.

Post-hoc power

Calculating power after a non-significant result using the observed effect is statistically circular and discouraged - it merely re-expresses the p-value. Judge underpowering from the a priori calculation and the confidence interval width instead.
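
The circularity can be shown directly: for a simple two-sided z-test, "observed power" is a function of the p-value alone. The sketch below assumes that setting (a z-test with the observed effect treated as the true effect); `observed_power` is a hypothetical helper name.

```python
from statistics import NormalDist


def observed_power(p_value: float, alpha: float = 0.05) -> float:
    """Post-hoc power of a two-sided z-test, treating the observed effect as true.

    Expects 0 < p_value <= 1.
    """
    z = NormalDist()
    z_obs = z.inv_cdf(1 - p_value / 2)  # observed |z| recovered from the p-value
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(z_obs - z_crit) + z.cdf(-z_obs - z_crit)


for p in (0.01, 0.05, 0.28, 0.60):
    print(f"p = {p}: observed power about {observed_power(p):.2f}")
```

A result sitting exactly at p = 0.05 always yields observed power of about 50%, and any non-significant result yields less, so the calculation adds no information beyond the p-value itself.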

Guidelines, Registries & Global Practice

Global Reporting Standards

Error control is enforced internationally through reporting and regulatory frameworks rather than country-specific rules - the concepts are universal across FRCS, FRACS, EBOT, ABOS, DNB and SICOT curricula.

How Major Frameworks Address Type I and Type II Errors

Framework / Body	Scope	Type I control	Type II control
CONSORT 2010 (global)	Reporting of parallel-group RCTs	Pre-specified primary outcome; declare subgroup/multiple analyses	Mandatory sample-size justification (effect size, alpha, power)
ICH E9 (international regulatory)	Statistical principles for clinical trials	Pre-defined analysis plan, multiplicity strategy, alpha spending	Power and sample-size assumptions stated a priori
FDA / EMA guidance	Drug and device approval (US / Europe)	Often demands two adequate well-controlled trials or stricter alpha	Adequate power required for pivotal endpoints
Cochrane / GRADE	Evidence synthesis and certainty rating	Meta-analysis reduces spurious single-study positives	Pooling raises power; imprecision downgrades certainty
STROBE	Observational study reporting	Encourages reporting of all analyses to limit selective positives	Reporting of study size and its rationale

Registries and Large Datasets

National joint replacement registries (NJR for England/Wales, AOANJRR Australia, SHAR Sweden, the Norwegian and New Zealand registries, and AJRR in the US) hold hundreds of thousands of procedures. Their value for this topic is power: rare events such as implant revision are detectable with adequate precision, dramatically reducing Type II error compared with single-centre series. The trade-off is that with such large samples, trivial differences become statistically significant, so the emphasis shifts to clinical significance and effect size (e.g. hazard ratios for revision) rather than the p-value alone.
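
The point about very large samples can be illustrated numerically. The sketch below holds a trivially small standardised difference fixed (0.02 SD, an arbitrary value for illustration) and shows how the two-sided p-value from a z-test changes as the number of patients per group grows. The function name and sample sizes are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_size: float, n_per_group: int) -> float:
    """Two-sided p-value of a z-test when the observed standardised difference is effect_size."""
    z_stat = effect_size * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))


# Same tiny observed difference (0.02 SD), increasing sample size
for n in (100, 10_000, 100_000):
    print(f"n={n} per group: p = {two_sided_p(0.02, n):.6f}")
```

The difference is identical in every row; only the sample size changes. At registry scale it becomes highly significant, which is why effect size and clinical relevance matter more than the p-value in these datasets.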

High- vs Limited-Resource Practice Variation

Setting	Typical reality	Error implication
Well-resourced / registry-linked	Multicentre RCTs, registries, pre-registration	Better powered; main risk is over-interpreting tiny but significant effects (Type I in spirit)
Limited-resource	Small single-centre series, few RCTs	High Type II error risk; negative results frequently inconclusive
Global synthesis	Cochrane reviews pool across regions	Improves power and generalisability; heterogeneity must be assessed

The teaching point is universal: interpret a "negative" study in the light of its power, and a "positive" study in the light of multiplicity and effect size - independent of country.

Evidence Base

Type-II Error Rates of Randomised Trials in Orthopaedic Trauma

Lochner HV, Bhandari M, Tornetta P • J Bone Joint Surg Am (2001)

Key Findings:

Systematic review of 117 randomised fracture-care trials (1968 to 1999) enrolling 19,942 patients
Mean study power for the primary outcome was only 24.65 percent (range 2 to 99 percent)
Type-II (beta) error rate for primary outcomes was 90.52 percent - the great majority were underpowered
Sample sizes were small (mean 95 patients) and primary outcomes were often not pre-specified
A priori threshold for acceptable power was set at 80 percent (beta 0.20 or less)

Clinical Implication: Many negative orthopaedic trauma trials are critically underpowered, so a non-significant result may well reflect Type II error rather than true equivalence. Always check the power calculation before accepting a negative trial.

Limitation: Reflects trials up to 1999; methodological reporting has improved since, though underpowering remains common.

Verify on PubMed (PMID 11701786)

What's Wrong with Bonferroni Adjustments

Perneger TV • BMJ (1998)

Key Findings:

Routine Bonferroni correction is often too conservative and inflates the Type II error rate
Bonferroni controls the family-wise error rate but reduces power to detect real effects
The pre-specified primary outcome does not require multiplicity adjustment
Hypothesis-driven secondary outcomes should be reported with effect sizes and interpreted cautiously rather than mechanically corrected
What constitutes the relevant family of tests is itself ambiguous, making blanket correction problematic

Clinical Implication: Multiplicity control is a trade-off: aggressive Type I control via Bonferroni buys false negatives. Define a single primary outcome rather than over-correcting.

Limitation: A viewpoint article; opposing statisticians argue uncorrected multiple testing seriously inflates false positives.

Verify on PubMed (PMID 9553006)

Multiplicity in Randomised Trials II: Subgroup and Interim Analyses

Schulz KF, Grimes DA • Lancet (2005)

Key Findings:

Testing enough subgroups makes at least one false-positive (Type I) result almost inevitable by chance alone
Subgroup claims should rest on tests of interaction, not separate within-subgroup p-values
Repeated interim looks inflate the false-positive rate unless formal stopping rules are used
O'Brien-Fleming and Peto group-sequential boundaries preserve the intended alpha and power
Trials stopped early for benefit systematically exaggerate the treatment effect (a random high)

Clinical Implication: Be sceptical of significant subgroup effects and early-stopped trials in viva and journal club - both are classic sources of inflated Type I error.

Limitation: Methodological guidance rather than empirical data; assumes adherence to pre-specified analysis plans.

Verify on PubMed (PMID 15885299)

Exam Viva Scenarios

Use these scenarios to practise clinical reasoning and management decisions

CLINICAL SCENARIO: Standard

Scenario 1: Error Type Identification

CLINICAL PROMPT

"A study concludes that a new fixation technique reduces nonunion rates compared to standard technique (p = 0.03). However, the new technique actually has the same nonunion rate as standard. What type of error has occurred?"

PRACTICAL APPROACH

This is a Type I error, also known as a false positive. The study concluded there is a difference - they rejected the null hypothesis of no difference - when in reality the null hypothesis is true and there is no difference between techniques. This means they found a statistically significant result (p = 0.03 less than alpha = 0.05) purely by chance, despite the treatments being truly equivalent. The probability of this happening is alpha, which is conventionally set at 0.05 or 5 percent. This means we accept a 5 percent risk of false positive findings. The consequences of this Type I error would be adopting the new technique unnecessarily, potentially incurring higher costs, longer operative time, or different complications, without any actual benefit in terms of nonunion reduction. To minimize Type I error risk, we could use a lower alpha threshold like 0.01, but this would require a larger sample size and would increase Type II error risk. This example highlights why we need replication studies and meta-analyses - a single statistically significant finding could be a Type I error.

KEY CLINICAL POINTS

Type I error = False Positive = Reject null when null is true

Occurred because p less than alpha (0.03 less than 0.05) by chance alone

Alpha = 0.05 means 5% risk of Type I error is accepted

Consequence = Adopt new technique unnecessarily without real benefit

COMMON PITFALLS

Confusing Type I and Type II errors

Not explaining that p-value less than alpha by chance

Not mentioning alpha = 0.05 convention and its meaning

Not discussing clinical consequences of the error

FURTHER QUESTIONS

"How could you reduce the risk of Type I error?"

"What is the difference between alpha and p-value?"

"What would be a Type II error in this scenario?"

CLINICAL SCENARIO: Challenging

Scenario 2: Multiple Comparisons

CLINICAL PROMPT

"You are reviewing an RCT that tested 10 different outcome measures. One outcome showed p = 0.04. How do you interpret this result?"

PRACTICAL APPROACH

This requires careful interpretation because of the multiple comparisons problem. When testing 10 outcomes at alpha = 0.05 each, the family-wise error rate - the probability of at least one false positive - is inflated above 0.05. Using the formula 1 minus 0.95 to the power of 10, the FWER is approximately 0.40 or 40 percent. This means there is a 40 percent chance of finding at least one significant result (p less than 0.05) purely by chance even if all null hypotheses are true. The single finding of p = 0.04 could easily be a Type I error. To properly interpret this, I would first ask whether a primary outcome was pre-specified. If yes and this is the primary outcome, p = 0.04 is acceptable at alpha = 0.05 without correction. If this is one of ten secondary outcomes with no correction, I would apply Bonferroni correction: adjusted alpha = 0.05 divided by 10 = 0.005. Since p = 0.04 is greater than 0.005, this result is NOT significant after correction. Alternatively, if the authors used hierarchical testing or pre-specified that only 2-3 outcomes would be tested, the correction would be less stringent. The key point is that multiple testing without correction inflates Type I error, and I would interpret this p = 0.04 finding with skepticism unless it was the pre-specified primary outcome or survives Bonferroni correction.

KEY CLINICAL POINTS

Multiple comparisons inflate family-wise Type I error rate

10 tests at alpha 0.05 each gives 40% chance of at least one false positive

Bonferroni correction: adjusted alpha = 0.05 / 10 = 0.005

p = 0.04 greater than 0.005, so NOT significant after correction

Primary pre-specified outcome does not require correction

COMMON PITFALLS

Not recognizing the multiple comparisons problem

Not calculating or explaining family-wise error rate inflation

Not applying Bonferroni correction

Not distinguishing primary vs secondary outcomes

FURTHER QUESTIONS

"How do you calculate family-wise error rate?"

"What is the difference between primary and secondary outcomes?"

"Are there alternatives to Bonferroni correction?"

CLINICAL SCENARIO: Challenging

Scenario 3: Interpreting a Negative Trial

CLINICAL PROMPT

"A single-centre RCT of 40 patients compares a new locking plate with a standard plate for distal radius fractures and finds no significant difference in function (p = 0.28). The authors conclude the implants are equivalent. The examiner asks: is that conclusion justified?"

PRACTICAL APPROACH

No, that conclusion is not justified, and the central issue is Type II error. A non-significant p-value does not prove that there is no difference - absence of evidence is not evidence of absence. With only 40 patients this trial is almost certainly underpowered, so a true clinically important difference could easily have been missed; this is precisely the pattern shown in the orthopaedic literature, where Lochner, Bhandari and Tornetta found a mean power of about 25 percent and a Type II error rate around 90 percent in trauma trials. To judge the result properly I would do three things. First, look for the a priori power calculation - what effect size and power were assumed, and was the achieved sample size adequate? Post-hoc power calculated from the observed effect is circular and unhelpful. Second, examine the confidence interval around the treatment effect rather than the p-value alone: a wide interval that still includes a clinically important benefit means the trial cannot exclude a real difference, whereas a narrow interval tightly around zero would be more reassuring. Third, I would note that to legitimately claim equivalence the authors would need a purpose-designed equivalence or non-inferiority trial with a pre-specified margin and its own power calculation - which this superiority trial is not. The appropriate conclusion is that the trial is inconclusive, not that the implants are equivalent, and the data would be best contributed to a prospective meta-analysis to gain power.

KEY CLINICAL POINTS

Non-significant result indicates possible Type II error, not proven equivalence

Small single-centre trials are typically underpowered (echoes Lochner/Bhandari/Tornetta)

Judge underpowering from a priori power and the confidence interval, not post-hoc power

Claiming equivalence requires a dedicated equivalence/non-inferiority design with a pre-set margin

Correct conclusion: inconclusive; pool in meta-analysis for power

COMMON PITFALLS

Accepting p greater than 0.05 as proof of no difference

Quoting post-hoc power to defend the negative result

Confusing a failed superiority trial with an equivalence trial

Ignoring the width of the confidence interval

FURTHER QUESTIONS

"What is the difference between a superiority and a non-inferiority trial?"

"Why is post-hoc power calculation discouraged?"

"How would a confidence interval help you interpret this result?"

MCQ Practice Points

Type I Error Definition

Q: What is a Type I error? A: Rejecting null hypothesis when null is actually true (false positive). Concluding there IS a difference when there is NOT. Probability is alpha (usually 0.05 or 5%).

Type II Error Definition

Q: What is a Type II error? A: Failing to reject null hypothesis when alternative is true (false negative). Concluding there is NO difference when there IS. Probability is beta (usually 0.20 or 20% for power = 80%).

Multiple Comparisons

Q: Why does testing multiple outcomes increase Type I error risk? A: Each test has 5% chance of false positive. Testing 20 outcomes means expecting 20 × 0.05 = 1 false positive on average. Family-wise error rate (probability of at least one false positive) increases with each additional test. Bonferroni correction divides alpha by number of tests to control overall Type I error.

Management Algorithm

TYPE I AND TYPE II ERRORS

Clinical summary

Error Definitions

•Type I = False Positive = Reject null when null is true = Alpha
•Type II = False Negative = Fail to reject null when alternative is true = Beta
•Power = 1 minus Beta = Probability of correctly rejecting false null
•Alpha set BEFORE study (usually 0.05), p-value calculated AFTER from data
•If p less than alpha, reject null (risk Type I if null actually true)

Error Consequences

•Type I consequence = Adopt ineffective or harmful treatment
•Type II consequence = Discard effective treatment, miss opportunity
•Type I often considered worse (false adoption) but context-dependent
•Screening: Type II worse for serious diseases (miss cancer)
•Treatment: Type I worse for risky interventions (adopt harmful therapy)

Error Control

•Reduce Type I = Lower alpha (0.01 instead of 0.05); a larger sample does not change alpha, it only restores the power lost
•Reduce Type II = Increase power (0.90 instead of 0.80) OR increase sample
•Trade-off: Lowering alpha increases beta unless sample increases
•Conventional: Alpha = 0.05 (5% Type I), Beta = 0.20 (20% Type II, 80% power)
•Larger sample reduces Type II error at a fixed alpha and makes a stricter alpha affordable

Multiple Comparisons

•Testing n outcomes inflates Type I error (family-wise error rate)
•FWER = 1 minus (1 minus alpha)^n
•20 tests at alpha 0.05: FWER = 64% (not 5%)
•Bonferroni correction: Adjusted alpha = 0.05 / n
•Primary outcome: No correction. Secondary outcomes: Correct or interpret cautiously

Clinical Application

•Underpowered studies have high Type II error risk (beta greater than 0.20)
•Negative result from underpowered study = Inconclusive, NOT definitive
•Pre-specify primary outcome to avoid multiple comparison issues
•Meta-analysis reduces Type II error by pooling studies (increases power)
•Always check power when interpreting negative results

Study Conclusion	If H₀ is TRUE (no real difference)	If H₁ is TRUE (new implant better)
New implant is better (p less than 0.05)	TYPE I ERROR - Adopt new implant unnecessarily, higher cost for no benefit	CORRECT - Adopt superior implant, improve patient outcomes
No difference found (p greater than 0.05)	CORRECT - Continue with standard implant, avoid unnecessary change	TYPE II ERROR - Miss opportunity to improve outcomes, continue inferior treatment

Concept	What it is	Controls / measures	Classic trap
Type I error (alpha)	Probability of a false positive when null is true	Set a priori, usually 0.05	Confusing alpha (pre-set) with the p-value (data-derived)
Type II error (beta)	Probability of a false negative when alternative is true	Determined by sample size, effect size, variance and alpha (beta = 1 minus power)	Treating a non-significant result as proof of no effect
p-value	Probability of data at least this extreme if null were true	Calculated from the observed data	Reading p as the probability the null is true
Statistical power (1 minus beta)	Chance of detecting a real effect of given size	Increased by larger n, larger effect, lower variance	Reporting post-hoc power to explain a negative result
Confidence interval	Range of plausible values for the true effect	Width reflects precision (driven by sample size)	Ignoring a wide CI that crosses no-effect in a small study
Equivalence / non-inferiority margin	Pre-defined limit within which treatments are deemed equal	Specified before the study, with its own power	Claiming equivalence from a failed superiority trial (absence of evidence is not evidence of absence)
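
The relationship between alpha, the p-value and power in the table above can be checked by simulation. The sketch below is a hedged illustration rather than a recommended analysis: it assumes normally distributed outcomes with a known SD of 1 and a simple z-test, and the function names, sample sizes and number of simulated trials are all hypothetical choices.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sided p-value for a difference in means, assuming known sigma."""
    se = sigma * (1 / len(sample_a) + 1 / len(sample_b)) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(true_diff, n_per_group, alpha=0.05, n_trials=2000, seed=0):
    """Fraction of simulated trials in which the null is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        new = [rng.gauss(true_diff, 1.0) for _ in range(n_per_group)]
        standard = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        # alpha is fixed in advance; the p-value comes from the simulated data
        if z_test_p_value(new, standard) < alpha:
            rejections += 1
    return rejections / n_trials


# Null true (no real difference): rejections are Type I errors
type_i_rate = rejection_rate(true_diff=0.0, n_per_group=64)
# Alternative true (difference of 0.5 SD): rejections are correct detections
power = rejection_rate(true_diff=0.5, n_per_group=64)
print(f"Type I error rate: {type_i_rate:.3f}")
print(f"Power: {power:.3f}, Type II error rate: {1 - power:.3f}")
```

When the null is true, the rejection rate should sit close to the pre-set alpha of 0.05. When a real 0.5 SD difference exists, the rejection rate estimates power, and its complement estimates beta.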
