High-yield overview

Study Planning | Adequate Sampling | Effect Detection

80%: Conventional Power Target

5%: Type I Error (Alpha)

20%: Type II Error (Beta)

MCID: Minimal Clinically Important Difference

Power and Sample Size Relationships

High Power (80-90%)

Pattern: Adequate sample to detect true effect

Treatment: Well-designed study

Underpowered (under 80%)

Pattern: Too small sample, high Type II error risk

Treatment: May miss real effect

Overpowered (over 95%)

Pattern: Excessively large sample

Treatment: Detects trivial differences

Critical Must-Knows

Power: Probability of detecting a true effect (1 minus Beta). Conventional target is 80 percent.
Sample Size Calculation Requires: Effect size (MCID), Alpha (usually 0.05), Power (usually 0.80), Variability (SD)
Effect Size: The magnitude of difference you want to detect - must be clinically meaningful (MCID), not just statistically significant
Underpowered Studies: Risk Type II error (false negative) - failing to detect real treatment effect
Factors Increasing Sample Size: Smaller effect size, higher power, higher variability, lower alpha

Clinical Pearls

Power = 80% means 20% chance of Type II error (missing a true effect)
MCID (Minimal Clinically Important Difference) defines what effect size matters to patients
Larger sample size increases power but also increases cost and time
Pilot studies help estimate variability (SD) for sample size calculations

Critical Power Concepts

What is Power?

Power = 1 minus Beta. Probability of correctly rejecting null hypothesis when alternative is true. Power = 80% means 80% chance of detecting real effect if it exists.

Sample Size Determinants

Four key inputs: (1) Alpha (Type I error, usually 0.05), (2) Power (1-Beta, usually 0.80), (3) Effect Size (MCID), (4) Variability (Standard Deviation).

Underpowered Studies

Risk: Type II error (false negative). Study may fail to detect real treatment benefit. Common in orthopaedic trials with small sample sizes.

Clinical vs Statistical Significance

Statistical Significance: p less than 0.05. Clinical Significance: Difference exceeds MCID. Large studies detect trivial differences; small studies miss important ones.

Mnemonic

APES: Sample Size Calculation Inputs

A	Alpha	Type I error rate (usually 0.05 or 5%)
P	Power	1 minus Beta (usually 0.80 or 80%)
E	Effect Size	MCID - clinically meaningful difference
S	Standard Deviation	Variability of outcome measure

Hook: APES calculate sample size - Alpha, Power, Effect, SD are the four essentials!

Mnemonic

SHAPE: Factors that Increase Required Sample Size

S	Smaller effect size	Detecting small differences needs more patients
H	Higher power	90% vs 80% power requires more patients
A	Alpha reduction	0.01 vs 0.05 needs larger sample
P	Population variability	Higher SD increases sample needed
E	Expected dropout	Must inflate for anticipated loss to follow-up

Hook: SHAPE your sample size - these five factors determine how many participants you need!
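The SHAPE factors can be checked numerically. The sketch below is illustrative only: it assumes the two-group normal-approximation formula given under Sample Size Calculation, and the function name `n_per_group` and the example values (MCID 10, SD 20) are chosen here for illustration rather than taken from any particular trial.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(mcid, sd, alpha=0.05, power=0.80, dropout=0.0):
    """Per-group n for comparing two means (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = z.inv_cdf(power)           # z-score matching the target power
    n = 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / mcid ** 2
    # Round up to whole patients, then inflate for anticipated dropout
    return ceil(ceil(n) / (1 - dropout))


baseline = n_per_group(mcid=10, sd=20)
scenarios = {
    "S: smaller effect size (MCID 5)": n_per_group(mcid=5, sd=20),
    "H: higher power (0.90)": n_per_group(mcid=10, sd=20, power=0.90),
    "A: alpha reduction (0.01)": n_per_group(mcid=10, sd=20, alpha=0.01),
    "P: population variability (SD 30)": n_per_group(mcid=10, sd=30),
    "E: expected dropout (15%)": n_per_group(mcid=10, sd=20, dropout=0.15),
}

print(f"Baseline: {baseline} per group")
for label, n in scenarios.items():
    print(f"{label}: {n} per group")
```

Each scenario changes one factor at a time, so every printed value should exceed the baseline.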

Overview/Introduction

What is Power?

Definition: Statistical power is the probability that a study will detect an effect when there truly is an effect to detect.

Formula: Power = 1 minus Beta (Type II error rate)

Interpretation:

Power = 80%: 80% chance of detecting true effect, 20% chance of missing it (Type II error)
Power = 50%: Coin flip - as likely to miss effect as to find it (underpowered)
Power = 95%: 95% chance of detecting true effect, but requires much larger sample

Power Levels and Interpretation

Power	Meaning	Adequacy	Sample Size
Greater than 90%	Very high chance of detecting true effect	Excellent but may be excessive	Very large sample needed
80-90%	High chance of detecting true effect	Conventional and adequate	Moderate sample size
50-80%	Moderate chance, meaningful risk of missing effect	Underpowered - risky	Smaller sample
Under 50%	More likely to miss effect than find it	Severely underpowered	Very small sample

Understanding power is essential for designing adequately powered studies.

Principles of Power Analysis

Core Principles

Relationship Between Power and Sample Size:

Larger sample size increases power
Doubling sample size does NOT double power (diminishing returns)
Power increases steeply initially, then plateaus
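The diminishing returns can be illustrated with a short sketch. It assumes a two-group comparison of means under a normal approximation and ignores the negligible probability of rejecting in the wrong direction; the function name `power_two_groups` and the example values (MCID 10, SD 20) are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def power_two_groups(n_per_group, mcid, sd, alpha=0.05):
    """Approximate power to detect a true difference of `mcid` between two means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)           # two-tailed critical value
    standard_error = sd * sqrt(2 / n_per_group)  # SE of the difference in means
    return z.cdf(mcid / standard_error - z_alpha)


# Each step doubles (approximately) the sample size per group
for n in (16, 32, 63, 126, 252):
    print(f"n = {n:>3} per group: power = {power_two_groups(n, mcid=10, sd=20):.2f}")
```

Doubling from 63 to 126 per group raises power by far less than a factor of two, because power is already close to its ceiling of 100%.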

Trade-offs in Study Design:

Higher power requires larger sample (more cost, time)
Smaller effect size (more clinically conservative) requires larger sample
Lower alpha (more statistically conservative) requires larger sample

Key Relationships:

Power increases with Sample Size
Power increases with Effect Size
Power increases with Alpha
Power decreases as Variability (SD) increases
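These relationships can be read off one approximate expression. The block below is a sketch for the two-group comparison of means under a normal approximation, using the same symbols as the sample size formula later on this page; Φ denotes the standard normal cumulative distribution function, which is notation introduced here.

```latex
\text{Power} \approx \Phi\!\left(\frac{\mathrm{MCID}}{\mathrm{SD}\,\sqrt{2/n}} - Z_\alpha\right)
\quad\Longleftrightarrow\quad
n = \frac{2\,(Z_\alpha + Z_\beta)^2\,\mathrm{SD}^2}{\mathrm{MCID}^2},
\qquad Z_\beta = \Phi^{-1}(\text{Power})
```

A larger n or MCID increases the term inside Φ, a larger SD decreases it, and a larger alpha lowers Zα, so each relationship is monotonic rather than strictly proportional.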

Understanding these principles allows rational study design decisions.

Sample Size Calculation

Four Essential Inputs

Every sample size calculation requires four inputs: alpha, power, effect size (MCID) and variability (SD). Alpha is covered in detail here.

Alpha: Type I Error Rate

Definition: Probability of falsely rejecting null hypothesis (false positive).

Conventional Choice: Alpha = 0.05 (5%)

Meaning: Willing to accept 5% chance of finding difference when none exists.

Trade-off: Lower alpha (e.g., 0.01) reduces false positives but requires larger sample size.

Bonferroni Correction: When testing multiple outcomes, divide alpha by number of tests to maintain overall Type I error rate.
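As a minimal sketch of the Bonferroni rule, the helper below divides the overall alpha by the number of tests. The function name `bonferroni_alpha` and the choice of five outcomes are illustrative assumptions.

```python
def bonferroni_alpha(overall_alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return overall_alpha / n_tests


per_test_alpha = bonferroni_alpha(overall_alpha=0.05, n_tests=5)
print(f"Each of 5 outcomes is tested at alpha = {per_test_alpha:.3f}")
```

Because the per-test alpha is lower, each test needs a larger sample to keep the same power, which is one reason to pre-specify a single primary outcome.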

Understanding alpha is critical for interpreting p-values and planning studies.

Performing Sample Size Calculation

Sample Size Formula (Continuous Outcome, Two Groups)

For comparing means between two groups:

n = 2 × (Zα + Zβ)² × SD² / MCID²

Where:

n = sample size per group
Zα = Z-score for alpha (1.96 for alpha = 0.05 two-tailed)
Zβ = Z-score for beta (0.84 for power = 0.80)
SD = standard deviation
MCID = effect size (minimal clinically important difference)

Worked Example

Question: How many patients needed per group to detect 10-point improvement in WOMAC score?

Given:

MCID = 10 points
SD = 20 points (from literature)
Alpha = 0.05 (Zα = 1.96)
Power = 0.80 (Zβ = 0.84)

Calculation:

n = 2 × (1.96 + 0.84)² × 20² / 10²
n = 2 × 7.84 × 400 / 100
n = 2 × 31.36
n = 63 patients per group

Accounting for Dropout:

If expecting 15% dropout: n = 63 / 0.85 ≈ 74.1, rounded up to 75 patients per group
Total enrollment: 150 patients
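The worked example can be reproduced in a few lines. This is a sketch under the same assumptions as the formula above (two groups, continuous outcome, normal approximation); the function names are illustrative. Rounding up at each step is the conservative convention, so 63 / 0.85 ≈ 74.1 becomes 75 per group here.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(mcid, sd, alpha=0.05, power=0.80):
    """n per group = 2 * (Z_alpha + Z_beta)^2 * SD^2 / MCID^2, rounded up."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05 two-tailed
    z_beta = z.inv_cdf(power)           # about 0.84 for power = 0.80
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / mcid ** 2)


def inflate_for_dropout(n, dropout_rate):
    """Patients to enrol per group so that n remain after dropout."""
    return ceil(n / (1 - dropout_rate))


n_analysed = sample_size_per_group(mcid=10, sd=20)
n_enrolled = inflate_for_dropout(n_analysed, dropout_rate=0.15)
print(f"Analysed per group: {n_analysed}")
print(f"Enrolled per group: {n_enrolled}")
print(f"Total enrolment:    {2 * n_enrolled}")
```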

Understanding sample size calculation ensures adequately powered studies.

Types of Power Analysis

A Priori vs Post Hoc Power Analysis

Types of Power Analysis

Type	When Performed	Purpose	Validity
A Priori (Prospective)	Before study begins	Calculate required sample size	Valid and recommended
Post Hoc (Retrospective)	After study completed	Calculate achieved power	Controversial - often misleading
Sensitivity Analysis	During planning	Assess power across range of assumptions	Useful for uncertainty

A Priori Power Analysis (Recommended):

Calculate sample size BEFORE enrolling patients
Uses estimated effect size and SD from literature or pilot
Ensures study designed with adequate power

Post Hoc Power Analysis (Problematic):

Calculating power AFTER study complete using observed data
Often done to explain non-significant results
Mathematically redundant - p-value and post hoc power are directly related
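The redundancy can be shown directly. The sketch below assumes a two-sided z-test and computes "observed power" by treating the observed effect as if it were the true effect; the function name `observed_power_from_p` is illustrative. The only input is the p-value, which is the point.

```python
from statistics import NormalDist


def observed_power_from_p(p_value, alpha=0.05):
    """Post hoc (observed) power of a two-sided z-test, from the p-value alone."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    z_obs = z.inv_cdf(1 - p_value / 2)  # |z| implied by the two-sided p-value
    # Probability of landing beyond either critical value if the true z were z_obs
    return z.cdf(z_obs - z_crit) + z.cdf(-z_obs - z_crit)


for p in (0.01, 0.05, 0.15, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power_from_p(p):.2f}")
```

A result sitting exactly at p = 0.05 has observed power of about 50%, and any larger p-value gives less, so a non-significant trial always appears "underpowered" by this measure.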

Clinical Application

Underpowered Studies in Orthopaedics

Common Problem: Many orthopaedic RCTs are underpowered. Small sample sizes fail to detect clinically meaningful differences. Results are inconclusive, not negative.

MCID vs Statistical Significance

Clinical Relevance: A statistically significant finding (p less than 0.05) may not be clinically important. Always check if difference exceeds MCID.
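A short sketch makes the distinction concrete. It assumes a two-group comparison of means with a known SD and a normal approximation; the function name `compare_means` and both sets of numbers are invented for illustration and are not data from any trial.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def compare_means(diff, sd, n_per_group, mcid):
    """Two-sided p-value and 95% CI for a difference in means, judged against the MCID."""
    se = sd * sqrt(2 / n_per_group)
    p = 2 * (1 - Z.cdf(abs(diff) / se))
    half_width = Z.inv_cdf(0.975) * se
    ci = (diff - half_width, diff + half_width)
    return {
        "p": p,
        "ci": ci,
        "significant": p < 0.05,
        "ci_reaches_mcid": ci[1] >= mcid,  # could the true effect be as large as the MCID?
    }


# Very large trial, trivial 1-point difference
large = compare_means(diff=1, sd=20, n_per_group=5000, mcid=10)
# Small trial, 8-point difference
small = compare_means(diff=8, sd=20, n_per_group=20, mcid=10)

for name, result in (("large", large), ("small", small)):
    low, high = result["ci"]
    print(f"{name}: p = {result['p']:.3f}, 95% CI {low:.1f} to {high:.1f}, "
          f"significant = {result['significant']}, CI reaches MCID = {result['ci_reaches_mcid']}")
```

The large trial is statistically significant although its whole confidence interval lies below the MCID, while the small trial is non-significant although its interval still includes the MCID.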

Pilot Studies

Purpose: Estimate variability (SD) and feasibility before full trial. Helps refine sample size calculation. Do NOT use pilot for hypothesis testing.

Multicenter Trials

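Solution: When a single center cannot recruit an adequate sample, multicenter collaboration achieves power. National registries such as the AOANJRR (Australian Orthopaedic Association National Joint Replacement Registry) and international registries provide large samples.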

Software and Calculation Tools

Common Power Analysis Software

Sample Size Calculation Tools

Software	Cost	Features	Best For
G*Power	Free	Wide range of tests, user-friendly	Academic researchers, most designs
PS (Power and Sample Size)	Free	Simple interface, basic designs	Quick calculations, beginners
nQuery	Commercial	Comprehensive, regulatory accepted	Industry trials, complex designs
PASS	Commercial	Extensive documentation, FDA submissions	Regulatory submissions

Online Calculators:

ClinCalc sample size calculator (free online)
OpenEpi power calculation (epidemiological studies)
Sealed Envelope (clinical trial tools)

Manual Calculation Reference

Statistic	Formula Component	Value (Common)
Zα (two-tailed, α=0.05)	Z-score for alpha	1.96
Zα (one-tailed, α=0.05)	Z-score for alpha	1.645
Zβ (power=0.80)	Z-score for beta	0.84
Zβ (power=0.90)	Z-score for beta	1.28
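The tabulated Z-scores can be regenerated rather than memorised. This sketch uses Python's standard-library `statistics.NormalDist`; the dictionary labels are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution

z_values = {
    "Z_alpha, two-tailed, alpha = 0.05": z.inv_cdf(1 - 0.05 / 2),
    "Z_alpha, one-tailed, alpha = 0.05": z.inv_cdf(1 - 0.05),
    "Z_beta, power = 0.80": z.inv_cdf(0.80),
    "Z_beta, power = 0.90": z.inv_cdf(0.90),
}

for label, value in z_values.items():
    print(f"{label}: {value:.3f}")
```

The same two calls give the Z-scores for any other alpha or power, for example alpha = 0.01 or power = 0.95.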

Addressing Underpowered Studies

Strategies to Increase Power

Methods to Address Low Power

Strategy	Approach	Advantages	Disadvantages
Increase sample size	Enroll more participants	Direct power increase	More cost, time, resources
Multicenter collaboration	Pool recruitment across sites	Achieves larger sample	Heterogeneity, logistics complexity
Reduce variability	Stricter inclusion criteria, standardized protocols	Increases precision	Reduces generalizability
Use more sensitive outcome	Choose outcome with lower SD	More precise measurement	May not be clinically preferred

When Power Cannot Be Achieved

Rare conditions: May need registry-based or multi-national studies
Ethical constraints: Cannot enroll more for safety reasons
Resource limitations: Accept lower power with pre-specified disclosure

Alternative Approaches:

Bayesian analysis (can provide evidence even with small samples)
Meta-analysis (combine with existing studies)
Confidence interval interpretation (focus on precision)

Common Pitfalls and Errors

Errors in Sample Size Calculation

Common Pitfalls in Power Analysis

Pitfall	Problem	Consequence	Solution
Underestimating dropout	Sample shrinks below powered size	Underpowered final analysis	Inflate by 15-25% for attrition
Unrealistic effect size	MCID too large or optimistic	Study underpowered for true effect	Use conservative, validated MCID
Wrong SD estimate	Variability higher than expected	Lower power than calculated	Use upper bound of SD estimate
Multiple comparisons ignored	Many outcomes without correction	Inflated Type I error	Adjust alpha (Bonferroni) or define primary outcome
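Two of these pitfalls, a wrong SD estimate and underestimated dropout, can be explored with a small sensitivity grid. The sketch assumes the two-group normal-approximation formula; the function names, the MCID of 10 and the ranges of SD and dropout are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(mcid, sd, alpha=0.05, power=0.80):
    """Per-group n for comparing two means (normal approximation), rounded up."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * z_sum ** 2 * sd ** 2 / mcid ** 2)


def enrol_per_group(n_analysed, dropout):
    """Patients to enrol per group so that n_analysed remain after dropout."""
    return ceil(n_analysed / (1 - dropout))


grid = {}
for sd in (15, 20, 25):            # lower, central and upper SD estimates
    for dropout in (0.15, 0.25):   # range of anticipated attrition
        grid[(sd, dropout)] = enrol_per_group(n_per_group(mcid=10, sd=sd), dropout)
        print(f"SD = {sd}, dropout = {dropout:.0%}: enrol {grid[(sd, dropout)]} per group")
```

Planning against the upper SD estimate and the higher dropout rate gives the most conservative enrolment target in the grid.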

Interpretation Errors

Common Mistakes:

Concluding treatments are "equivalent" from underpowered negative study
Using post hoc power to justify non-significant results
Ignoring confidence intervals when assessing clinical relevance
Confusing statistical significance with clinical importance

Evidence Base

Type-II Error Rates in Orthopaedic Trauma RCTs

Lochner HV, Bhandari M, Tornetta P • J Bone Joint Surg Am (2001)

Key Findings:

Systematic survey of 117 randomised fracture-care trials (1968-1999) including 19,942 patients
Mean study power for the primary outcome was only 24.65% (range 2% to 99%)
The Type-II error rate (beta) for primary outcomes was 90.52% - far above the accepted 20% threshold
Most trials were grossly underpowered; performing a priori power and sample-size calculations is the corrective

Clinical Implication: The seminal demonstration that orthopaedic trauma trials are overwhelmingly underpowered, so their negative results cannot be read as evidence of no difference.

Limitation: Trauma-specific cohort ending in 1999; reporting has improved with CONSORT, but underpowering remains common across subspecialties.

Verify on PubMed (PMID 11701786)

Understanding the MCID: Concepts and Methods

Copay AG, Subach BR, Glassman SD, Polly DW, Schuler TC • Spine J (2007)

Key Findings:

Defines the MCID as the smallest improvement a patient considers worthwhile - the effect size that should anchor sample-size calculations
MCID is derived by anchor-based methods (linked to an external patient-reported criterion) or distribution-based methods (e.g. 0.5 standard deviation, standard error of measurement)
Three core limitations: multiplicity of MCID estimates, loss of the patient's perspective, and dependence on the baseline score
No single MCID is universal; the value depends on the instrument, population and method used

Clinical Implication: Power calculations must use a validated, condition-specific MCID rather than an arbitrary or statistically convenient effect size, or the trial will be powered to detect a difference patients do not value.

Limitation: A methodological review rather than a source of definitive MCID values; reported MCIDs vary widely by population and instrument.

Verify on PubMed (PMID 17448732)

CONSORT 2010: Reporting Standard for Sample Size

Guideline

Schulz KF, Altman DG, Moher D (CONSORT Group) • BMJ (2010)

Key Findings:

CONSORT Item 7a requires authors to report how the sample size was determined
CONSORT Item 7b requires reporting of interim analyses and stopping guidelines where applicable
Sample-size justification should state the target effect size, power, alpha and the assumed standard deviation
Adopted worldwide by leading journals to make trial adequacy transparent to readers

Clinical Implication: CONSORT 2010 is the international reference standard against which examiners and journals judge whether an RCT's sample size was adequately justified.

Limitation: A reporting guideline, not a method - adherence varies and does not by itself correct an underpowered design.

Verify on PubMed (PMID 20332509)

Exam Viva Scenarios

Use these scenarios to practise clinical reasoning and management decisions

CLINICAL SCENARIO: Standard

Scenario 1: Sample Size Calculation

CLINICAL PROMPT

"You are planning an RCT to compare two surgical approaches for rotator cuff repair. What information do you need to calculate the required sample size?"

PRACTICAL APPROACH

To calculate sample size, I need four key pieces of information. First, **Alpha** - the Type I error rate, conventionally set at 0.05, meaning I accept a 5 percent chance of false positive result. Second, **Power** or 1 minus Beta - the probability of detecting a true effect if it exists, conventionally 0.80 or 80 percent, meaning 20 percent risk of Type II error or missing a real difference. Third, **Effect Size** - the minimal clinically important difference (MCID) that I want to detect. For rotator cuff outcomes, this might be 10 points on the ASES score or Constant score. This should be based on what patients consider meaningful improvement, not just statistical detectability. Fourth, **Variability** - the standard deviation of the outcome measure in the population. I would estimate this from published literature on the same outcome measure or from a pilot study. Once I have these four inputs, I can use the formula n equals 2 times the quantity Zα plus Zβ squared times SD squared divided by MCID squared to calculate the sample size per group. I would also inflate this by approximately 15 to 20 percent to account for anticipated loss to follow-up.

KEY CLINICAL POINTS

Four inputs: Alpha, Power, Effect Size (MCID), SD

Conventional values: Alpha = 0.05, Power = 0.80

MCID must be clinically meaningful, not just statistically detectable

Inflate for dropout (typically 15-20%)

COMMON PITFALLS

Not mentioning all four required inputs

Confusing effect size with p-value or alpha

Not inflating for dropout

Not explaining what MCID means and why it matters

FURTHER QUESTIONS

"What happens if your study is underpowered?"

"How would you estimate the standard deviation if no prior data exists?"

"What is the trade-off between power and sample size?"

CLINICAL SCENARIO: Challenging

Scenario 2: Interpreting Underpowered Study

CLINICAL PROMPT

"You read an RCT comparing two implants for THA. The study found no significant difference (p = 0.15) with 40 patients per group. The power calculation shows the study had 35 percent power. How do you interpret this result?"

PRACTICAL APPROACH

This is a classic underpowered study, and the negative result is inconclusive, not definitive. With power of only 35 percent, this study had a 65 percent chance of missing a true difference of the size it was designed to detect, which is worse than a coin flip. The p-value of 0.15 means the data are compatible with no difference, but with this sample size they are also compatible with a real one. I cannot conclude that the implants are equivalent, only that this study was too small to detect a difference. This is a Type II error risk situation. To properly interpret this, I would examine the confidence interval around the effect estimate. If the confidence interval is wide and includes both no difference and clinically important differences, the study is inconclusive. If I wanted to definitively answer this question, I would calculate the sample size required for adequate power, typically 80 percent, using the MCID as the target effect size and the variability seen in this trial as the SD estimate. This might require 200 to 300 patients per group. Alternatively, a meta-analysis combining this study with other similar trials could increase power. The key message is: absence of evidence is not evidence of absence when a study is underpowered.

KEY CLINICAL POINTS

Power of 35% means 65% risk of Type II error (missing true difference)

Negative result from underpowered study is inconclusive, not definitive

Examine confidence intervals for clinical relevance

Need adequate power (80%) to draw conclusions about equivalence or difference

COMMON PITFALLS

Concluding implants are equivalent based on underpowered negative study

Not explaining Type II error risk

Not mentioning confidence intervals

Not suggesting solutions (larger study or meta-analysis)

FURTHER QUESTIONS

"What is the difference between statistical equivalence and lack of statistical difference?"

"How would you design a study to prove two treatments are equivalent (non-inferiority trial)?"

"What is the relationship between confidence intervals and power?"

CLINICAL SCENARIO: Challenging

Scenario 3: Post Hoc Power and the Fragility Index

CLINICAL PROMPT

"A reviewer asks you to add a post hoc power calculation to your non-significant orthopaedic RCT to show it was adequately powered. The trial had 50 patients per arm. How do you respond, and how else might you convey the robustness of your findings?"

PRACTICAL APPROACH

I would respectfully decline to add an observed (post hoc) power calculation, and explain why. Post hoc power computed from the observed effect size is a direct, one-to-one mathematical function of the p-value: a non-significant result will always yield low observed power, so the calculation is circular and adds no information beyond the p-value itself. It cannot tell us whether the trial was adequately designed. The correct way to convey the design adequacy is the a priori power calculation that was specified in the protocol, stating the target MCID, alpha, power and assumed standard deviation. To communicate what the data actually show, I would present the point estimate of the treatment effect with its 95 percent confidence interval and interpret it against the MCID. If the confidence interval excludes the MCID, I can argue the difference is unlikely to be clinically important; if it includes the MCID, the trial is genuinely inconclusive rather than negative. For a dichotomous primary outcome I could also report the Fragility Index - the number of events that would need to change to flip statistical significance - which gives readers an intuitive sense of how robust or delicate the result is. Khan and colleagues showed the median Fragility Index in orthopaedic sports trials was only two patients, so this context matters. The overarching message is that confidence intervals, the pre-specified power calculation and fragility metrics, not post hoc power, are the legitimate tools.

KEY CLINICAL POINTS

Post hoc (observed) power is a deterministic function of the p-value and is uninformative

Design adequacy is judged by the a priori, protocol-specified power calculation

Confidence intervals interpreted against the MCID convey clinical meaning of a result

The Fragility Index quantifies robustness of a significant dichotomous outcome
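The Fragility Index in the last point can be sketched in code. This is an illustrative implementation under stated assumptions, not a reference method: it assumes a two-arm trial with a dichotomous outcome, uses a two-sided Fisher exact test written from scratch, and converts non-events to events one at a time in the arm with fewer events until p reaches alpha. The function names and the example counts are invented for illustration.

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, p)


def fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Event switches needed to lose significance; None if not significant to begin with."""
    # Work on the arm with fewer events
    if events_1 > events_2:
        events_1, n_1, events_2, n_2 = events_2, n_2, events_1, n_1

    def p_value(e1):
        return fisher_two_sided(e1, n_1 - e1, events_2, n_2 - events_2)

    if p_value(events_1) >= alpha:
        return None
    switches = 0
    while events_1 < n_1 and p_value(events_1) < alpha:
        events_1 += 1  # convert one non-event into an event
        switches += 1
    return switches


# Hypothetical trial: 0/50 complications in one arm vs 10/50 in the other
fi = fragility_index(0, 50, 10, 50)
print(f"Fragility Index: {fi}")
```

A small index means the significant result depends on the outcomes of only a handful of patients, which is why it is reported alongside, not instead of, the effect size and confidence interval.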

COMMON PITFALLS

Agreeing to compute post hoc power to defend a non-significant result

Equating a non-significant p-value with proof of no effect

Failing to distinguish a priori from post hoc power

Not offering confidence intervals or the Fragility Index as the correct alternatives

FURTHER QUESTIONS

"Why is observed power mathematically tied to the p-value?"

"How is the Fragility Index calculated and what are its limitations?"

"When is a wide confidence interval more informative than a single p-value?"

MCQ Practice Points

Power Definition

Q: What is statistical power? A: The probability of detecting a true effect when it exists, calculated as 1 minus Beta (Type II error rate). Power = 80% means 80% chance of finding real difference if present, 20% risk of missing it.

Sample Size Determinants

Q: Which factor does NOT increase required sample size? A: Higher alpha (e.g., 0.10 vs 0.05) actually decreases required sample size. Factors that increase sample size: smaller effect size, higher power, higher variability (SD), lower alpha.

MCID Importance

Q: Why is MCID important for sample size calculation? A: MCID defines the clinically meaningful effect size - the smallest difference that matters to patients. Using MCID ensures study is powered to detect differences that are clinically relevant, not just statistically significant. Without MCID, large studies may detect trivial differences.

Controversies and Areas of Uncertainty

Where Experts Still Disagree

Unsettled Questions in Power and Sample Size

Controversy	One View	Opposing View	Pragmatic Position
Is 80% power enough?	80% is the long-standing convention and keeps trials feasible	20% chance of missing a true effect is too high for definitive surgical questions; 90% should be standard	Use 90% for pivotal or hard-to-repeat trials; justify the choice explicitly
Distribution-based MCID (0.5 SD)	Convenient when no anchor-based MCID exists	Detached from the patient's perspective; may not reflect what patients value	Prefer validated anchor-based MCID; use 0.5 SD only as a transparent fallback
Value of small pilot/feasibility trials	Estimate SD, recruitment and feasibility before a full trial	Pilots estimate SD imprecisely and are misused for hypothesis testing	Use adequately sized pilots for feasibility only, never for efficacy claims
Fragility Index as a routine metric	Intuitive measure of how robust a significant result is	Conflates significance with sample size and ignores effect magnitude	Report alongside, not instead of, effect size and confidence intervals
p-value threshold of 0.05	Familiar, regulator-accepted default	Arbitrary; some call for 0.005 or abandoning thresholds for estimation	Pre-specify alpha; emphasise estimation with confidence intervals over dichotomising at 0.05

The Examiner's Favourite Trap

A non-significant result in a small trial means the study was inconclusive, not that the treatments are equivalent. Proving equivalence requires a purpose-designed equivalence or non-inferiority trial with a pre-specified margin - it can never be inferred from a failure to reach p less than 0.05 in an underpowered superiority trial.

Guidelines, Registries & Global Practice

Why Power and Sample Size Are a Global Concern

Underpowering is not confined to any one country. The landmark survey of orthopaedic trauma RCTs found a mean primary-outcome power of roughly 25 percent and a Type-II error rate over 90 percent, and fragility analyses of sports-surgery trials worldwide report a median Fragility Index of only two patients. Adequately powered, registry-linked and multinational collaborations are the international response.

How Major Bodies Frame Sample Size and Reporting

Body / Standard	Region	Core Requirement	Emphasis
CONSORT 2010	Global (journals worldwide)	Items 7a/7b: report how sample size was determined and any interim analyses	Transparency of a priori justification
ICH E9 (Statistical Principles)	Global regulatory (FDA, EMA, PMDA)	Pre-specify primary outcome, effect size, alpha, power and analysis set	Confirmatory trial rigour
NICE / UK NIHR-HTA	UK	Funded trials must show a robust, MCID-anchored sample-size calculation	Value for public research funding
SPIRIT 2013	Global	Protocols must state sample-size assumptions before recruitment	Pre-registration of design
AO Foundation / clinical research units	Global trauma	Promote multicentre recruitment to reach powered samples	Overcoming single-centre limits

Registries as a Power Solution

Large national arthroplasty and trauma registries supply the sample sizes single centres cannot. The AOANJRR (Australia), NJR (England, Wales, Northern Ireland and Isle of Man), AJRR (USA), SHAR (Sweden), the Norwegian Arthroplasty Register and the NZJR pool hundreds of thousands of procedures, giving the statistical power to detect small but clinically important differences in implant survival and revision rates that no individual RCT could achieve.

High- versus Limited-Resource Practice Variation

Well-Resourced Settings

Access to dedicated trial units, biostatisticians, multicentre networks and mature registries makes adequately powered RCTs and registry studies feasible. Power and MCID-anchored calculations are an expected norm for grant funding and publication.

Limited-Resource Settings

Smaller catchments, fewer statisticians and funding constraints make large RCTs difficult. Pragmatic responses include international collaboration, registry-based studies, Bayesian designs that extract more information from small samples, and honest pre-specified disclosure of limited power.

Management Algorithm

STATISTICAL POWER AND SAMPLE SIZE

Clinical summary

Core Concepts

• Power = 1 minus Beta = Probability of detecting true effect
• Conventional power = 80% (20% risk of Type II error)
• Sample size needs 4 inputs: Alpha, Power, Effect Size (MCID), SD
• Underpowered study = High risk of missing real effect (Type II error)

Sample Size Calculation Inputs

• Alpha = Type I error (usually 0.05) - false positive rate
• Power = 1 minus Beta (usually 0.80) - true positive rate
• Effect Size = MCID (clinically meaningful difference)
• SD = Variability (from literature or pilot study)
• Inflate by 15-20% for anticipated dropout

Factors Increasing Sample Size

• Smaller effect size (harder to detect)
• Higher power (90% vs 80%)
• Lower alpha (0.01 vs 0.05)
• Higher variability (larger SD)
• Expected dropout or loss to follow-up

Interpreting Power

• Power greater than 90% = Excellent, may be excessive
• Power 80-90% = Adequate and conventional
• Power 50-80% = Underpowered, risky
• Power under 50% = Severely underpowered, likely to fail
• Negative result from underpowered study = Inconclusive

Clinical Application

• MCID defines clinical relevance, not just statistical significance
• Many orthopaedic RCTs are underpowered (power under 80%)
• Pilot studies estimate SD and feasibility, NOT for hypothesis testing
• Absence of evidence is NOT evidence of absence (underpowered studies)
• Wide confidence intervals indicate insufficient precision

Type-II Error Rates in Orthopaedic Trauma RCTs

Lochner HV, Bhandari M, Tornetta P • J Bone Joint Surg Am (2001)

Key Findings:

Systematic survey of 117 randomised fracture-care trials (1968-1999) including 19,942 patients
Mean study power for the primary outcome was only 24.65% (range 2% to 99%)
The Type-II error rate (beta) for primary outcomes was 90.52% - far above the accepted 20% threshold
Most trials were grossly underpowered; performing a priori power and sample-size calculations is the corrective

Clinical Implication: The seminal demonstration that orthopaedic trauma trials are overwhelmingly underpowered, so their negative results cannot be read as evidence of no difference.

Limitation: Trauma-specific cohort ending in 1999; reporting has improved with CONSORT, but underpowering remains common across subspecialties.

Verify on PubMed (PMID 11701786)

Understanding the MCID: Concepts and Methods

Copay AG, Subach BR, Glassman SD, Polly DW, Schuler TC • Spine J (2007)

Key Findings:

Defines the MCID as the smallest improvement a patient considers worthwhile - the effect size that should anchor sample-size calculations
MCID is derived by anchor-based methods (linked to an external patient-reported criterion) or distribution-based methods (e.g. 0.5 standard deviation, standard error of measurement)
Three core limitations: multiplicity of MCID estimates, loss of the patient's perspective, and dependence on the baseline score
No single MCID is universal; the value depends on the instrument, population and method used

Limitation: A methodological review rather than a source of definitive MCID values; reported MCIDs vary widely by population and instrument.

Verify on PubMed (PMID 17448732)

CONSORT 2010: Reporting Standard for Sample Size

Guideline

Schulz KF, Altman DG, Moher D (CONSORT Group) • BMJ (2010)

Key Findings:

CONSORT Item 7a requires authors to report how the sample size was determined
CONSORT Item 7b requires reporting of interim analyses and stopping guidelines where applicable
Sample-size justification should state the target effect size, power, alpha and the assumed standard deviation
Adopted worldwide by leading journals to make trial adequacy transparent to readers

Clinical Implication: CONSORT 2010 is the international reference standard against which examiners and journals judge whether an RCT's sample size was adequately justified.

Limitation: A reporting guideline, not a method - adherence varies and does not by itself correct an underpowered design.

Verify on PubMed (PMID 20332509)
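
The four quantities a sample-size justification should report map directly onto the usual normal-approximation formula for two equal groups with a continuous outcome, n per group = 2 (z_alpha/2 + z_beta)^2 SD^2 / delta^2. The sketch below is a minimal version under that assumption; the function name and example inputs are hypothetical, and dedicated software using the t distribution will typically return a slightly larger n.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta: float, sd: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Patients per group for a two-sided comparison of two means.

    delta: target difference (ideally the MCID)
    sd:    assumed standard deviation of the outcome
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # 0.84 for power = 0.80
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)


# Hypothetical inputs: MCID 5 points, SD 10 points
print(n_per_group(delta=5, sd=10))              # 63
print(n_per_group(delta=5, sd=10, power=0.90))  # 85
```

Changing any one input changes the answer, which is why a report that omits one of the four cannot be checked by the reader.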

Controversy	One View	Opposing View	Pragmatic Position
Is 80% power enough?	80% is the long-standing convention and keeps trials feasible	20% chance of missing a true effect is too high for definitive surgical questions; 90% should be standard	Use 90% for pivotal or hard-to-repeat trials; justify the choice explicitly
Distribution-based MCID (0.5 SD)	Convenient when no anchor-based MCID exists	Detached from the patient's perspective; may not reflect what patients value	Prefer validated anchor-based MCID; use 0.5 SD only as a transparent fallback
Value of small pilot/feasibility trials	Estimate SD, recruitment and feasibility before a full trial	Pilots estimate SD imprecisely and are misused for hypothesis testing	Use adequately sized pilots for feasibility only, never for efficacy claims
Fragility Index as a routine metric	Intuitive measure of how robust a significant result is	Conflates significance with sample size and ignores effect magnitude	Report alongside, not instead of, effect size and confidence intervals
p-value threshold of 0.05	Familiar, regulator-accepted default	Arbitrary; some call for 0.005 or abandoning thresholds for estimation	Pre-specify alpha; emphasise estimation with confidence intervals over dichotomising at 0.05
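
For readers unfamiliar with how a Fragility Index is obtained, here is a minimal sketch for a two-arm trial with a binary outcome. It repeatedly converts one non-event to an event in the arm with the lower event rate and re-runs a two-sided Fisher exact test until the result is no longer significant. The function names and the example counts are hypothetical, and choosing the arm by event rate is a simplifying assumption (it matches the usual "arm with fewer events" rule when the arms are equal in size).

```python
from math import comb


def fisher_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    n1, n2, k = a + b, c + d, a + c
    # Unnormalised hypergeometric weight for each possible count in cell a
    weights = {x: comb(n1, x) * comb(n2, k - x)
               for x in range(max(0, k - n2), min(k, n1) + 1)}
    observed = weights[a]
    # Add up every table that is no more likely than the observed one
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / comb(n1 + n2, k)


def fragility_index(events_1: int, n_1: int, events_2: int, n_2: int,
                    alpha: float = 0.05) -> int:
    """Number of outcome flips needed to lose significance (0 if none)."""
    e1, e2, flips = events_1, events_2, 0
    while fisher_two_sided(e1, n_1 - e1, e2, n_2 - e2) < alpha:
        # Convert one non-event to an event in the lower-rate arm
        if e1 / n_1 <= e2 / n_2:
            e1 += 1
        else:
            e2 += 1
        flips += 1
    return flips


# Hypothetical trial: 1/20 events versus 9/20 events
print(round(fisher_two_sided(1, 19, 9, 11), 4))  # 0.0084
print(fragility_index(1, 20, 9, 20))             # 2
```

In this made-up example a p-value below 0.01 would become non-significant if only two patients had a different outcome, which shows why the index is read alongside the effect size and confidence interval rather than on its own.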

Body / Standard	Region	Core Requirement	Emphasis
CONSORT 2010	Global (journals worldwide)	Items 7a/7b: report how sample size was determined and any interim analyses	Transparency of a priori justification
ICH E9 (Statistical Principles)	Global regulatory (FDA, EMA, PMDA)	Pre-specify primary outcome, effect size, alpha, power and analysis set	Confirmatory trial rigour
NICE / UK NIHR-HTA	UK	Funded trials must show a robust, MCID-anchored sample-size calculation	Value for public research funding
SPIRIT 2013	Global	Protocols must state sample-size assumptions before recruitment	Pre-registration of design
AO Foundation / clinical research units	Global trauma	Promote multicentre recruitment to reach powered samples	Overcoming single-centre limits

Study Design	Power Analysis Method	Key Considerations	Complexity
Parallel RCT	Standard two-group comparison	Effect size, SD, alpha, power	Basic
Crossover RCT	Paired comparison	Within-subject SD (smaller), carryover	Moderate
Cluster RCT	Account for clustering	ICC, cluster size, number of clusters	Complex
Non-inferiority	One-sided, margin defined	Non-inferiority margin, assay sensitivity	Complex
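
For the cluster RCT row, the usual adjustment is to inflate the individually randomised sample size by the design effect, 1 + (m - 1) x ICC, where m is the average cluster size. The sketch below applies that formula; the function names and the example values (63 per arm, clusters of 20, ICC 0.05) are hypothetical, and equal cluster sizes are assumed.

```python
from math import ceil


def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


def cluster_rct_size(n_individual: int, cluster_size: int,
                     icc: float) -> tuple[int, int]:
    """Return (patients per arm, clusters per arm) after inflation."""
    n_inflated = ceil(n_individual * design_effect(cluster_size, icc))
    clusters = ceil(n_inflated / cluster_size)
    return n_inflated, clusters


# Hypothetical: 63 per arm if individually randomised, 20 per cluster, ICC 0.05
print(cluster_rct_size(63, cluster_size=20, icc=0.05))  # (123, 7)
```

Even a small ICC nearly doubles the requirement here, which is why the number of clusters matters more than the number of patients per cluster.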

STATISTICAL POWER AND SAMPLE SIZE

Core Concepts

Sample Size Calculation Inputs