Power and Sample Size Relationships
Critical Must-Knows
- Power: Probability of detecting a true effect (1 minus Beta). Conventional target is 80 percent.
- Sample Size Calculation Requires: Effect size (MCID), Alpha (usually 0.05), Power (usually 0.80), Variability (SD)
- Effect Size: The magnitude of difference you want to detect - must be clinically meaningful (MCID), not just statistically significant
- Underpowered Studies: Risk Type II error (false negative) - failing to detect real treatment effect
- Factors Increasing Sample Size: Smaller effect size, higher power, higher variability, lower alpha
Clinical Pearls
- Power = 80% means 20% chance of Type II error (missing a true effect)
- MCID (Minimal Clinically Important Difference) defines what effect size matters to patients
- Larger sample size increases power but also increases cost and time
- Pilot studies help estimate variability (SD) for sample size calculations
Critical Power Concepts
What is Power?
Power = 1 minus Beta. Probability of correctly rejecting null hypothesis when alternative is true. Power = 80% means 80% chance of detecting real effect if it exists.
Sample Size Determinants
Four key inputs: (1) Alpha (Type I error, usually 0.05), (2) Power (1-Beta, usually 0.80), (3) Effect Size (MCID), (4) Variability (Standard Deviation).
Underpowered Studies
Risk: Type II error (false negative). Study may fail to detect real treatment benefit. Common in orthopaedic trials with small sample sizes.
Clinical vs Statistical Significance
Statistical Significance: p less than 0.05. Clinical Significance: Difference exceeds MCID. Large studies detect trivial differences; small studies miss important ones.
APES: Sample Size Calculation Inputs
| Letter | Input | Meaning |
|---|---|---|
| A | Alpha | Type I error rate (usually 0.05 or 5%) |
| P | Power | 1 minus Beta (usually 0.80 or 80%) |
| E | Effect Size | MCID - clinically meaningful difference |
| S | Standard Deviation | Variability of outcome measure |
Hook: APES calculate sample size - Alpha, Power, Effect, SD are the four essentials!
SHAPE: Factors that Increase Required Sample Size
| Letter | Factor | Explanation |
|---|---|---|
| S | Smaller effect size | Detecting small differences needs more patients |
| H | Higher power | 90% vs 80% power requires more patients |
| A | Alpha reduction | 0.01 vs 0.05 needs larger sample |
| P | Population variability | Higher SD increases sample needed |
| E | Expected dropout | Must inflate for anticipated loss to follow-up |
Hook: SHAPE your sample size - these five factors determine how many participants you need!
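Illustration (a sketch, not a validated calculator): the SHAPE factors can be checked numerically with the standard two-group formula for a continuous outcome, n = 2 × (Zα + Zβ)² × SD² / MCID². The base case (MCID 10, SD 20, alpha 0.05, power 0.80), the 15% dropout figure and the function name `n_per_group` are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(mcid, sd, alpha=0.05, power=0.80):
    """Patients per group for a two-group comparison of means (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / mcid ** 2)


# Hypothetical base case, then one SHAPE factor changed at a time.
base = n_per_group(mcid=10, sd=20)
scenarios = {
    "S: smaller effect size (MCID 5)": n_per_group(mcid=5, sd=20),
    "H: higher power (0.90)": n_per_group(mcid=10, sd=20, power=0.90),
    "A: alpha reduction (0.01)": n_per_group(mcid=10, sd=20, alpha=0.01),
    "P: population variability (SD 30)": n_per_group(mcid=10, sd=30),
    "E: expected dropout (15%)": ceil(base / (1 - 0.15)),
}

print(f"Base case: {base} per group")
for label, n in scenarios.items():
    print(f"{label}: {n} per group")
```

Every scenario needs more patients than the base case. Halving the MCID has the largest effect because n scales with 1 / MCID².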
Overview/Introduction
What is Power?
Definition: Statistical power is the probability that a study will detect an effect when there truly is an effect to detect.
Formula: Power = 1 minus Beta (Type II error rate)
Interpretation:
- Power = 80%: 80% chance of detecting true effect, 20% chance of missing it (Type II error)
- Power = 50%: Coin flip - as likely to miss effect as to find it (underpowered)
- Power = 95%: 95% chance of detecting true effect, but requires much larger sample
Power Levels and Interpretation
| Power | Meaning | Adequacy | Sample Size |
|---|---|---|---|
| Greater than 90% | Very high chance of detecting true effect | Excellent but may be excessive | Very large sample needed |
| 80-90% | High chance of detecting true effect | Conventional and adequate | Moderate sample size |
| 50-80% | Moderate chance, meaningful risk of missing effect | Underpowered - risky | Smaller sample |
| Under 50% | More likely to miss effect than find it | Severely underpowered | Very small sample |
Understanding power is essential for designing adequately powered studies.
Principles of Power Analysis
Core Principles
Relationship Between Power and Sample Size:
- Larger sample size increases power
- Doubling sample size does NOT double power (diminishing returns)
- Power increases steeply initially, then plateaus
Trade-offs in Study Design:
- Higher power requires larger sample (more cost, time)
- Smaller effect size (more clinically conservative) requires larger sample
- Lower alpha (more statistically conservative) requires larger sample
Key Relationships:
- Power increases with sample size (but not in direct proportion)
- Power increases with effect size
- Power increases with alpha
- Power decreases as variability (SD) increases
Understanding these principles allows rational study design decisions.
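Illustration of diminishing returns (a sketch under stated assumptions): the code below approximates power for a two-sided z-test comparing two group means as the sample size doubles. The inputs (MCID 10, SD 20) and the function name `power_two_groups` are hypothetical. A t-test based calculation would give slightly lower power at small n.

```python
from math import sqrt
from statistics import NormalDist


def power_two_groups(n_per_group, mcid, sd, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected test statistic if the true difference equals the MCID.
    delta = mcid / (sd * sqrt(2 / n_per_group))
    # Probability of landing in either rejection region.
    return z.cdf(delta - z_crit) + z.cdf(-delta - z_crit)


# Hypothetical inputs: MCID 10, SD 20; n doubles at each step.
curve = {n: power_two_groups(n, mcid=10, sd=20) for n in (16, 32, 64, 128, 256)}
for n, p in curve.items():
    print(f"n = {n:>3} per group -> power = {p:.2f}")
```

Each doubling adds less power than the one before: the gain from 16 to 32 per group is far larger than the gain from 128 to 256, where power is already close to 1.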
Sample Size Calculation
Four Essential Inputs
Every sample size calculation requires four inputs: alpha, power, effect size (MCID) and variability (SD). Alpha is detailed here:
Alpha: Type I Error Rate
Definition: Probability of falsely rejecting null hypothesis (false positive).
Conventional Choice: Alpha = 0.05 (5%)
Meaning: Willing to accept 5% chance of finding difference when none exists.
Trade-off: Lower alpha (e.g., 0.01) reduces false positives but requires larger sample size.
Bonferroni Correction: When testing multiple outcomes, divide alpha by number of tests to maintain overall Type I error rate.
Understanding alpha is critical for interpreting p-values and planning studies.
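Illustration of the alpha trade-off (a sketch, assuming a two-group comparison of means with the normal approximation): the code divides alpha by the number of tests (Bonferroni) and shows how the required sample size grows. The inputs (MCID 10, SD 20, power 0.80) and the function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def bonferroni_alpha(alpha, n_tests):
    """Per-test alpha that keeps the overall Type I error rate at `alpha`."""
    return alpha / n_tests


def n_per_group(mcid, sd, alpha, power=0.80):
    """Patients per group for a two-group comparison of means (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / mcid ** 2)


# Hypothetical trial: MCID 10, SD 20, overall alpha 0.05.
results = {}
for n_tests in (1, 3, 5):
    adjusted = bonferroni_alpha(0.05, n_tests)
    results[n_tests] = n_per_group(mcid=10, sd=20, alpha=adjusted)
    print(f"{n_tests} test(s): alpha = {adjusted:.4f}, n = {results[n_tests]} per group")
```

This is one reason to define a single primary outcome: each additional corrected comparison raises the sample size needed for the same power.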
Performing Sample Size Calculation
Sample Size Formula (Continuous Outcome, Two Groups)
For comparing means between two groups:
n = 2 × (Zα + Zβ)² × SD² / MCID²
Where:
- n = sample size per group
- Zα = Z-score for alpha (1.96 for alpha = 0.05 two-tailed)
- Zβ = Z-score for beta (0.84 for power = 0.80)
- SD = standard deviation
- MCID = effect size (minimal clinically important difference)
Worked Example
Question: How many patients needed per group to detect 10-point improvement in WOMAC score?
Given:
- MCID = 10 points
- SD = 20 points (from literature)
- Alpha = 0.05 (Zα = 1.96)
- Power = 0.80 (Zβ = 0.84)
Calculation:
- n = 2 × (1.96 + 0.84)² × 20² / 10²
- n = 2 × 7.84 × 400 / 100
- n = 2 × 31.36
- n = 62.72, rounded up to 63 patients per group
Accounting for Dropout:
- If expecting 15% dropout: n = 63 / 0.85 = 74.1, rounded up to 75 patients per group
- Total enrollment: 150 patients
Understanding sample size calculation ensures adequately powered studies.
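The worked example can be reproduced in a few lines (a minimal sketch using only the Python standard library; the function names are illustrative, not from any statistics package). Z-scores are computed exactly rather than rounded to 1.96 and 0.84, and results are rounded up at each step because a fraction of a patient cannot be enrolled. That gives 63 per group before dropout and 75 per group after inflating for 15% dropout (63 / 0.85 = 74.1).

```python
from math import ceil
from statistics import NormalDist


def n_per_group(mcid, sd, alpha=0.05, power=0.80):
    """n = 2 * (Z_alpha + Z_beta)^2 * SD^2 / MCID^2, rounded up."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05, two-tailed
    z_beta = z.inv_cdf(power)           # 0.84 for power = 0.80
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / mcid ** 2)


def inflate_for_dropout(n, dropout_rate):
    """Enrol enough patients that n remain after the expected dropout."""
    return ceil(n / (1 - dropout_rate))


# Inputs from the WOMAC example: MCID 10 points, SD 20 points.
n = n_per_group(mcid=10, sd=20)
enrolled = inflate_for_dropout(n, dropout_rate=0.15)
total = 2 * enrolled

print(f"Per group before dropout: {n}")
print(f"Per group after 15% dropout inflation: {enrolled}")
print(f"Total enrolment: {total}")
```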
Types of Power Analysis
A Priori vs Post Hoc Power Analysis
| Type | When Performed | Purpose | Validity |
|---|---|---|---|
| A Priori (Prospective) | Before study begins | Calculate required sample size | Valid and recommended |
| Post Hoc (Retrospective) | After study completed | Calculate achieved power | Controversial - often misleading |
| Sensitivity Analysis | During planning | Assess power across range of assumptions | Useful for uncertainty |
A Priori Power Analysis (Recommended):
- Calculate sample size BEFORE enrolling patients
- Uses estimated effect size and SD from literature or pilot
- Ensures study designed with adequate power
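Because the SD taken from literature or a pilot is only an estimate, a sensitivity analysis at the planning stage can show how power changes if that estimate is wrong. The sketch below assumes a two-sided z-test, a fixed 63 patients per group and an MCID of 10. The candidate SD values (15, 20, 25) and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_groups(n_per_group, mcid, sd, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    delta = mcid / (sd * sqrt(2 / n_per_group))  # expected test statistic
    return z.cdf(delta - z_crit) + z.cdf(-delta - z_crit)


# Planned sample of 63 per group; what if the true SD differs from the assumed 20?
sensitivity = {sd: power_two_groups(63, mcid=10, sd=sd) for sd in (15, 20, 25)}
for sd, p in sensitivity.items():
    flag = "adequate" if p >= 0.80 else "underpowered"
    print(f"SD = {sd}: power = {p:.2f} ({flag})")
```

If the true SD turns out to be 25 rather than 20, the same 63 patients per group no longer reach 80% power, which is why a conservative SD estimate is usually preferred.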
Post Hoc Power Analysis (Problematic):
- Calculating power AFTER study complete using observed data
- Often done to explain non-significant results
- Mathematically redundant - for a given test and alpha, post hoc (observed) power is a direct function of the p-value, so it adds no new information
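The redundancy can be shown directly (a sketch, assuming a two-sided z-test and that the observed effect is treated as the true effect, which is what post hoc power calculations do). The function name `observed_power` is illustrative.

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """Post hoc power implied by a two-sided p-value from a z-test."""
    z = NormalDist()
    z_obs = z.inv_cdf(1 - p_value / 2)   # test statistic recovered from p
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(z_obs - z_crit) + z.cdf(-z_obs - z_crit)


for p in (0.01, 0.05, 0.15, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.2f}")
```

No sample size, SD or effect size is needed: the p-value alone fixes the answer. A result sitting exactly at p = 0.05 always has observed power of about 50%, and any non-significant result has observed power below that, so low post hoc power cannot explain a negative finding.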
Clinical Application
Underpowered Studies in Orthopaedics
Common Problem: Many orthopaedic RCTs are underpowered. Small sample sizes fail to detect clinically meaningful differences. Results are inconclusive, not negative.
MCID vs Statistical Significance
Clinical Relevance: A statistically significant finding (p less than 0.05) may not be clinically important. Always check if difference exceeds MCID.
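A numerical illustration of the distinction (a sketch with invented numbers, assuming a z approximation for a difference in means): a very large trial can make a 2-point difference statistically significant even though it is far below an MCID of 10, while a very small trial can miss a difference equal to the MCID. All trial figures and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def compare_to_mcid(diff, sd, n_per_group, mcid, alpha=0.05):
    """Return (p_value, confidence_interval, exceeds_mcid) for a difference in means."""
    z = NormalDist()
    se = sd * sqrt(2 / n_per_group)            # standard error of the difference
    p_value = 2 * (1 - z.cdf(abs(diff) / se))  # two-sided p-value
    margin = z.inv_cdf(1 - alpha / 2) * se
    return p_value, (diff - margin, diff + margin), abs(diff) >= mcid


# Hypothetical large trial: trivial 2-point difference, 2000 patients per group.
p_large, ci_large, ok_large = compare_to_mcid(diff=2, sd=20, n_per_group=2000, mcid=10)
# Hypothetical small trial: 10-point difference, 10 patients per group.
p_small, ci_small, ok_small = compare_to_mcid(diff=10, sd=20, n_per_group=10, mcid=10)

print(f"Large trial: p = {p_large:.4f}, 95% CI {ci_large[0]:.1f} to {ci_large[1]:.1f}, reaches MCID: {ok_large}")
print(f"Small trial: p = {p_small:.4f}, 95% CI {ci_small[0]:.1f} to {ci_small[1]:.1f}, reaches MCID: {ok_small}")
```

In the large trial the whole confidence interval lies below the MCID despite p less than 0.05. In the small trial the interval spans both zero and values well above the MCID, so the result is inconclusive rather than negative.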
Pilot Studies
Purpose: Estimate variability (SD) and feasibility before full trial. Helps refine sample size calculation. Do NOT use pilot for hypothesis testing.
Multicenter Trials
Solution: When single center cannot recruit adequate sample, multicenter collaboration achieves power. AOANJRR and international registries provide large samples.
Software and Calculation Tools
Common Power Analysis Software
Sample Size Calculation Tools
| Software | Cost | Features | Best For |
|---|---|---|---|
| G*Power | Free | Wide range of tests, user-friendly | Academic researchers, most designs |
| PS (Power and Sample Size) | Free | Simple interface, basic designs | Quick calculations, beginners |
| nQuery | Commercial | Comprehensive, regulatory accepted | Industry trials, complex designs |
| PASS | Commercial | Extensive documentation, FDA submissions | Regulatory submissions |
Online Calculators:
- ClinCalc sample size calculator (free online)
- OpenEpi power calculation (epidemiological studies)
- Sealed Envelope (clinical trial tools)
Manual Calculation Reference
| Statistic | Formula Component | Value (Common) |
|---|---|---|
| Zα (two-tailed, α=0.05) | Z-score for alpha | 1.96 |
| Zα (one-tailed, α=0.05) | Z-score for alpha | 1.645 |
| Zβ (power=0.80) | Z-score for beta | 0.84 |
| Zβ (power=0.90) | Z-score for beta | 1.28 |
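The tabulated Z-scores are quantiles of the standard normal distribution and can be recomputed for any alpha or power (a minimal sketch using `statistics.NormalDist` from the Python standard library; the helper names `z_alpha` and `z_beta` are illustrative).

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, SD 1


def z_alpha(alpha, two_tailed=True):
    """Critical Z-score for a given Type I error rate."""
    tail = alpha / 2 if two_tailed else alpha
    return STANDARD_NORMAL.inv_cdf(1 - tail)


def z_beta(power):
    """Z-score corresponding to the desired power (1 minus Beta)."""
    return STANDARD_NORMAL.inv_cdf(power)


print(f"Z_alpha, two-tailed, alpha 0.05: {z_alpha(0.05):.3f}")
print(f"Z_alpha, one-tailed, alpha 0.05: {z_alpha(0.05, two_tailed=False):.3f}")
print(f"Z_beta, power 0.80: {z_beta(0.80):.3f}")
print(f"Z_beta, power 0.90: {z_beta(0.90):.3f}")
```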
Addressing Underpowered Studies
Strategies to Increase Power
Methods to Address Low Power
| Strategy | Approach | Advantages | Disadvantages |
|---|---|---|---|
| Increase sample size | Enroll more participants | Direct power increase | More cost, time, resources |
| Multicenter collaboration | Pool recruitment across sites | Achieves larger sample | Heterogeneity, logistics complexity |
| Reduce variability | Stricter inclusion criteria, standardized protocols | Increases precision | Reduces generalizability |
| Use more sensitive outcome | Choose outcome with lower SD | More precise measurement | May not be clinically preferred |
When Power Cannot Be Achieved
- Rare conditions: May need registry-based or multi-national studies
- Ethical constraints: Cannot enroll more for safety reasons
- Resource limitations: Accept lower power with pre-specified disclosure
Alternative Approaches:
- Bayesian analysis (can provide evidence even with small samples)
- Meta-analysis (combine with existing studies)
- Confidence interval interpretation (focus on precision)
Common Pitfalls and Errors
Errors in Sample Size Calculation
Common Pitfalls in Power Analysis
| Pitfall | Problem | Consequence | Solution |
|---|---|---|---|
| Underestimating dropout | Sample shrinks below powered size | Underpowered final analysis | Inflate by 15-25% for attrition |
| Unrealistic effect size | MCID too large or optimistic | Study underpowered for true effect | Use conservative, validated MCID |
| Wrong SD estimate | Variability higher than expected | Lower power than calculated | Use upper bound of SD estimate |
| Multiple comparisons ignored | Many outcomes without correction | Inflated Type I error | Adjust alpha (Bonferroni) or define primary outcome |
Interpretation Errors
Common Mistakes:
- Concluding treatments are "equivalent" from underpowered negative study
- Using post hoc power to justify non-significant results
- Ignoring confidence intervals when assessing clinical relevance
- Confusing statistical significance with clinical importance
Evidence Base
Type-II Error Rates in Orthopaedic Trauma RCTs
- Systematic survey of 117 randomised fracture-care trials (1968-1999) including 19,942 patients
- Mean study power for the primary outcome was only 24.65% (range 2% to 99%)
- The Type-II error rate (beta) for primary outcomes was 90.52% - far above the accepted 20% threshold
- Most trials were grossly underpowered; performing a priori power and sample-size calculations is the corrective
Understanding the MCID: Concepts and Methods
- Defines the MCID as the smallest improvement a patient considers worthwhile - the effect size that should anchor sample-size calculations
- MCID is derived by anchor-based methods (linked to an external patient-reported criterion) or distribution-based methods (e.g. 0.5 standard deviation, standard error of measurement)
- Three core limitations: multiplicity of MCID estimates, loss of the patient's perspective, and dependence on the baseline score
- No single MCID is universal; the value depends on the instrument, population and method used
CONSORT 2010: Reporting Standard for Sample Size
- CONSORT Item 7a requires authors to report how the sample size was determined
- CONSORT Item 7b requires reporting of interim analyses and stopping guidelines where applicable
- Sample-size justification should state the target effect size, power, alpha and the assumed standard deviation
- Adopted worldwide by leading journals to make trial adequacy transparent to readers
Exam Viva Scenarios
Use these scenarios to practise clinical reasoning and management decisions
Scenario 1: Sample Size Calculation
"You are planning an RCT to compare two surgical approaches for rotator cuff repair. What information do you need to calculate the required sample size?"
Scenario 2: Interpreting Underpowered Study
"You read an RCT comparing two implants for THA. The study found no significant difference (p = 0.15) with 40 patients per group. The power calculation shows the study had 35 percent power. How do you interpret this result?"
Scenario 3: Post Hoc Power and the Fragility Index
"A reviewer asks you to add a post hoc power calculation to your non-significant orthopaedic RCT to show it was adequately powered. The trial had 50 patients per arm. How do you respond, and how else might you convey the robustness of your findings?"
MCQ Practice Points
Power Definition
Q: What is statistical power? A: The probability of detecting a true effect when it exists, calculated as 1 minus Beta (Type II error rate). Power = 80% means 80% chance of finding real difference if present, 20% risk of missing it.
Sample Size Determinants
Q: Which factor does NOT increase required sample size? A: Higher alpha (e.g., 0.10 vs 0.05) actually decreases required sample size. Factors that increase sample size: smaller effect size, higher power, higher variability (SD), lower alpha.
MCID Importance
Q: Why is MCID important for sample size calculation? A: MCID defines the clinically meaningful effect size - the smallest difference that matters to patients. Using MCID ensures study is powered to detect differences that are clinically relevant, not just statistically significant. Without MCID, large studies may detect trivial differences.
Controversies and Areas of Uncertainty
Where Experts Still Disagree
Unsettled Questions in Power and Sample Size
| Controversy | One View | Opposing View | Pragmatic Position |
|---|---|---|---|
| Is 80% power enough? | 80% is the long-standing convention and keeps trials feasible | 20% chance of missing a true effect is too high for definitive surgical questions; 90% should be standard | Use 90% for pivotal or hard-to-repeat trials; justify the choice explicitly |
| Distribution-based MCID (0.5 SD) | Convenient when no anchor-based MCID exists | Detached from the patient's perspective; may not reflect what patients value | Prefer validated anchor-based MCID; use 0.5 SD only as a transparent fallback |
| Value of small pilot/feasibility trials | Estimate SD, recruitment and feasibility before a full trial | Pilots estimate SD imprecisely and are misused for hypothesis testing | Use adequately sized pilots for feasibility only, never for efficacy claims |
| Fragility Index as a routine metric | Intuitive measure of how robust a significant result is | Conflates significance with sample size and ignores effect magnitude | Report alongside, not instead of, effect size and confidence intervals |
| p-value threshold of 0.05 | Familiar, regulator-accepted default | Arbitrary; some call for 0.005 or abandoning thresholds for estimation | Pre-specify alpha; emphasise estimation with confidence intervals over dichotomising at 0.05 |
The Examiner's Favourite Trap
A non-significant result in a small trial means the study was inconclusive, not that the treatments are equivalent. Proving equivalence requires a purpose-designed equivalence or non-inferiority trial with a pre-specified margin - it can never be inferred from a failure to reach p less than 0.05 in an underpowered superiority trial.
Guidelines, Registries & Global Practice
Why Power and Sample Size Are a Global Concern
Underpowering is not confined to any one country. The landmark survey of orthopaedic trauma RCTs found a mean primary-outcome power of roughly 25 percent and a Type-II error rate over 90 percent, and fragility analyses of sports-surgery trials worldwide report a median Fragility Index of only two patients. Adequately powered, registry-linked and multinational collaborations are the international response.
How Major Bodies Frame Sample Size and Reporting
| Body / Standard | Region | Core Requirement | Emphasis |
|---|---|---|---|
| CONSORT 2010 | Global (journals worldwide) | Items 7a/7b: report how sample size was determined and any interim analyses | Transparency of a priori justification |
| ICH E9 (Statistical Principles) | Global regulatory (FDA, EMA, PMDA) | Pre-specify primary outcome, effect size, alpha, power and analysis set | Confirmatory trial rigour |
| NICE / UK NIHR-HTA | UK | Funded trials must show a robust, MCID-anchored sample-size calculation | Value for public research funding |
| SPIRIT 2013 | Global | Protocols must state sample-size assumptions before recruitment | Pre-registration of design |
| AO Foundation / clinical research units | Global trauma | Promote multicentre recruitment to reach powered samples | Overcoming single-centre limits |
Registries as a Power Solution
Large national arthroplasty and trauma registries supply the sample sizes single centres cannot. The AOANJRR (Australia), NJR (England, Wales, Northern Ireland and Isle of Man), AJRR (USA), SHAR (Sweden), the Norwegian Arthroplasty Register and the NZJR pool hundreds of thousands of procedures, giving the statistical power to detect small but clinically important differences in implant survival and revision rates that no individual RCT could achieve.
High- versus Limited-Resource Practice Variation
Well-Resourced Settings
Access to dedicated trial units, biostatisticians, multicentre networks and mature registries makes adequately powered RCTs and registry studies feasible. Power and MCID-anchored calculations are an expected norm for grant funding and publication.
Limited-Resource Settings
Smaller catchments, fewer statisticians and funding constraints make large RCTs difficult. Pragmatic responses include international collaboration, registry-based studies, Bayesian designs that extract more information from small samples, and honest pre-specified disclosure of limited power.
STATISTICAL POWER AND SAMPLE SIZE
Clinical summary
Core Concepts
- Power = 1 minus Beta = Probability of detecting true effect
- Conventional power = 80% (20% risk of Type II error)
- Sample size needs 4 inputs: Alpha, Power, Effect Size (MCID), SD
- Underpowered study = High risk of missing real effect (Type II error)
Sample Size Calculation Inputs
- Alpha = Type I error (usually 0.05) - false positive rate
- Power = 1 minus Beta (usually 0.80) - true positive rate
- Effect Size = MCID (clinically meaningful difference)
- SD = Variability (from literature or pilot study)
- Inflate by 15-20% for anticipated dropout
Factors Increasing Sample Size
- Smaller effect size (harder to detect)
- Higher power (90% vs 80%)
- Lower alpha (0.01 vs 0.05)
- Higher variability (larger SD)
- Expected dropout or loss to follow-up
Interpreting Power
- Power greater than 90% = Excellent, may be excessive
- Power 80-90% = Adequate and conventional
- Power 50-80% = Underpowered, risky
- Power under 50% = Severely underpowered, likely to fail
- Negative result from underpowered study = Inconclusive
Clinical Application
- MCID defines clinical relevance, not just statistical significance
- Many orthopaedic RCTs are underpowered (power under 80%)
- Pilot studies estimate SD and feasibility, NOT for hypothesis testing
- Absence of evidence is NOT evidence of absence (underpowered studies)
- Wide confidence intervals indicate insufficient precision