High-yield overview

Question | validity | effect size | applicability

PICO: turns a vague question into an answerable one

Bias: decides whether the estimate can be believed

95% CI: shows precision and plausible effect range

MCID: asks whether the difference matters clinically

Appraisal Domains

Clinical question

Pattern: What problem, patient, intervention, comparator and outcome are being tested?

Treatment: Use PICO before reading methods.

Internal validity

Pattern: Are the methods protected from bias?

Treatment: Assess allocation, blinding, follow-up, measurement and confounding.

Result size

Pattern: How large and precise is the effect?

Treatment: Use absolute effect, relative effect and confidence intervals.

Applicability

Pattern: Does the result apply to the patient, surgeon, implant, system and outcome that matter?

Treatment: Change practice only when benefit, harm and feasibility align.

Critical Must-Knows

Critical appraisal is not a memory test of study designs. It is a structured judgement about whether a result should influence care.
A randomised trial can still be unreliable. Poor allocation concealment, missing data, crossover and selective reporting can destroy credibility.
A significant p value is not enough. Report effect size, absolute risk, confidence interval and clinical importance.
Different questions need different designs. RCTs suit treatment efficacy; cohort studies suit prognosis; diagnostic studies need a reference standard.
Evidence-based practice combines evidence, clinical expertise and patient values. It is not blind obedience to a paper.

Clinical Pearls

Start journal club by stating the PICO in one sentence.
Always separate internal validity from external applicability.
For binary outcomes, ask for absolute risk reduction and number needed to treat or harm.
A clinically trivial difference can be statistically significant in a large study.
A negative study may be underpowered rather than proof of no difference.

Do not confuse statistical significance with clinical importance

A p value answers whether the observed difference is compatible with chance under a statistical model. It does not tell you whether the effect is large enough to matter, whether harms are acceptable, or whether the result applies to your patient.
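The contrast between significance and importance can be made concrete with a small calculation. The sketch below is illustrative only: the score, the standard deviation, the sample sizes and the `MCID` of 5 points are all hypothetical, and a normal (z) approximation is assumed where a t-distribution would be more appropriate for small samples.

```python
from math import sqrt
from statistics import NormalDist


def summarise_mean_difference(mean_diff, sd, n_per_group, alpha=0.05):
    """Two-sided z-test and confidence interval for a difference in means.

    Assumes two equal-sized groups with a common standard deviation.
    """
    se = sd * sqrt(2 / n_per_group)  # standard error of the difference
    z = mean_diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * se
    return {
        "p": p_value,
        "ci_low": mean_diff - half_width,
        "ci_high": mean_diff + half_width,
    }


MCID = 5.0  # hypothetical threshold on a 100-point outcome score

# Large trial, 1-point difference: statistically significant,
# but the whole interval sits below the MCID.
large = summarise_mean_difference(mean_diff=1.0, sd=10.0, n_per_group=2000)

# Small trial, 6-point difference: not statistically significant,
# but the interval still contains the MCID.
small = summarise_mean_difference(mean_diff=6.0, sd=10.0, n_per_group=20)

for label, r in (("large", large), ("small", small)):
    print(
        f"{label}: p={r['p']:.3f}, "
        f"95% CI {r['ci_low']:.1f} to {r['ci_high']:.1f}, "
        f"CI includes MCID={r['ci_low'] <= MCID <= r['ci_high']}"
    )
```

The large trial is "positive" but clinically trivial; the small trial is "negative" but cannot exclude an important effect.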

Critical appraisal workflow: a safe appraisal sequence moves from clinical question to PICO, study design, internal validity, effect size, clinical importance and patient applicability. Credit: Original OrthoVellum illustration

Mnemonic

PAPER: Read A Paper

PICO

Define the patient, intervention, comparator and outcome.

Appraise validity

Look for bias before trusting the result.

Precision

Read the confidence interval, not just the p value.

Effect size

Translate relative and absolute effect into clinical terms.

Relevance

Decide whether it applies to your patient and setting.

Hook: Do not finish the PAPER until you know whether it should change practice.

Mnemonic

BIASED: Bias Screen

Baseline balance

Were groups similar at the start?

Intervention fidelity

Were treatments delivered as intended?

Allocation concealment

Could enrolment be predicted or manipulated?

Selective reporting

Were all prespecified outcomes reported?

Endpoint blinding

Were outcomes measured fairly?

Dropouts

Was follow-up complete and balanced?

Hook: A BIASED paper can have a beautiful p value.

Mnemonic

CARES: Apply Evidence

Clinical importance

Is the effect larger than a meaningful threshold?

Applicability

Does the population match the patient?

Risks

Balance complications, reoperation and downstream harm.

Expertise

Can the technique be delivered safely in your system?

Shared decision

Does the evidence fit the patient's values and goals?

Hook: Evidence CARES only when it changes a real decision safely.

Overview

Evidence-based orthopaedics means using the best available research, clinical judgement and patient values to make decisions. It does not mean automatically following the newest paper, the biggest trial, the loudest conference presentation or the most quoted meta-analysis.

The practical question is always:

Can I Believe It?

This is internal validity. Ask whether the methods protected the result from bias, confounding, measurement error and missing data.

Should I Use It?

This is applicability. Ask whether the patient, intervention, comparator, outcome, surgeon skill and health system match your clinical decision.

The one-sentence appraisal opening

Begin by saying: "This paper asks whether intervention X compared with Y improves outcome Z in patients like this, and the key question is whether the methods and effect size are strong enough to change practice."

Concepts and Study Design

PICO before methods

PICO prevents vague appraisal. A paper claiming that "fixation is better" is not appraisable until you define the patient, intervention, comparator and outcome.

PICO in Orthopaedics
Element	Question	Example
Patient	Who exactly is being treated?	Older adults with displaced intracapsular femoral neck fracture who were ambulatory before injury.
Intervention	What is the treatment, implant or pathway?	Total hip arthroplasty through a specified approach.
Comparator	What is it being compared with?	Hemiarthroplasty, non-operative treatment, another implant or another rehabilitation pathway.
Outcome	What matters and when?	Reoperation, function, pain, dislocation, infection, mortality, revision, cost and patient-reported outcome at a defined time.

Match study design to question

Treatment Questions
Best Designs	What To Check	Orthopaedic Trap
Randomised trial or high-quality systematic review	Random sequence, allocation concealment, blinding where possible, intention-to-treat and follow-up.	Surgical trials may be hard to blind, so outcome assessment and crossover matter.
Registry or cohort study	Confounding control, selection bias, surgeon/implant learning curve and outcome definition.	Registry survival may not capture pain, function or radiographic failure.

Effect size reporting in orthopaedic critical appraisal: effect size should be reported in a form that matches the outcome type. Absolute effect and confidence interval are usually more useful than a p value alone. Credit: Original OrthoVellum illustration

Superiority, equivalence and non-inferiority

Not every trial sets out to show one treatment is better. Orthopaedic device and technique studies are frequently non-inferiority trials: a new implant only needs to be "not meaningfully worse" than the gold standard while offering another advantage (lower cost, simpler technique, fewer complications).

Three Trial Objectives
Design	Question	How it is judged
Superiority	Is the new treatment better?	The confidence interval for the difference excludes zero (no difference)
Equivalence	Is it neither better nor worse, within a margin?	The confidence interval lies entirely within plus or minus the equivalence margin
Non-inferiority	Is it not meaningfully worse?	The worse-side limit of the confidence interval does not cross the pre-set non-inferiority margin

The non-inferiority margin (delta) must be defined a priori and clinically justified - it is the largest loss of efficacy that would still be acceptable. Setting it too wide can let a genuinely inferior treatment "pass".
The conservative analysis population is reversed. In a superiority trial, intention-to-treat is conservative because crossover and non-compliance dilute the difference between groups. In a non-inferiority trial the same dilution is anti-conservative: it blurs the groups together and falsely favours non-inferiority, so a per-protocol analysis must agree with intention-to-treat before non-inferiority is accepted.
Assay sensitivity: a non-inferiority trial is only interpretable if the gold-standard comparator genuinely works in that setting; otherwise "non-inferior" may simply mean "both ineffective".
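The confidence-interval rules for the three trial objectives can be sketched in a few lines. This is an illustration under stated assumptions, not a validated analysis tool: the difference is taken as new minus standard on an outcome where higher is better, and the function names, the margin of 5 points and the example intervals are all hypothetical.

```python
def judge_interval(ci_low, ci_high, margin):
    """Classify a confidence interval for (new - standard), higher = better.

    `margin` is the pre-set, clinically justified margin (a positive number).
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    return {
        # Superiority: the whole interval lies above zero.
        "superior": ci_low > 0,
        # Non-inferiority: the worse-side limit stays above -margin.
        "non_inferior": ci_low > -margin,
        # Equivalence: the whole interval lies within +/- margin.
        "equivalent": -margin < ci_low and ci_high < margin,
    }


def non_inferiority_accepted(itt_ci, pp_ci, margin):
    """Require intention-to-treat and per-protocol analyses to agree."""
    itt = judge_interval(itt_ci[0], itt_ci[1], margin)
    pp = judge_interval(pp_ci[0], pp_ci[1], margin)
    return itt["non_inferior"] and pp["non_inferior"]


# Hypothetical trial: ITT looks non-inferior, per-protocol crosses the margin.
itt_ci = (-3.0, 2.0)
pp_ci = (-6.5, 1.0)
print(judge_interval(*itt_ci, margin=5.0))
print(judge_interval(*pp_ci, margin=5.0))
print(non_inferiority_accepted(itt_ci, pp_ci, margin=5.0))
```

In this example the intention-to-treat interval passes but the per-protocol interval does not, so non-inferiority would not be accepted. Assay sensitivity is a judgement about the comparator and cannot be checked by code.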

Why intention-to-treat is not always the safe choice

Q: Why is intention-to-treat analysis insufficient on its own in a non-inferiority trial?
A: Intention-to-treat dilutes differences between groups, which is conservative when proving superiority but anti-conservative when proving non-inferiority - it makes a worse treatment look acceptably close. A non-inferiority conclusion should therefore be supported by both intention-to-treat and per-protocol analyses, with a clinically justified margin and demonstrated assay sensitivity.

Clinical Relevance

Internal validity: can I believe the result?

Bias checks before believing a study: bias checks should be performed before accepting the conclusion. A large effect estimate can still be wrong if bias drives the result. Credit: Original OrthoVellum illustration

Core Bias Checks
Bias	Question To Ask	Orthopaedic Example
Selection bias	Were patients allocated or selected in a way that created unfair groups?	Healthier patients receive surgery while frailer patients receive non-operative care.
Performance bias	Were co-interventions and rehabilitation similar?	One ACL group receives more supervised physiotherapy than the other.
Detection bias	Were outcomes assessed fairly and blindly?	Surgeon-assessed radiographic union favours their preferred implant.
Attrition bias	Who was lost to follow-up?	Painful failures do not return to clinic and are counted as successes.
Confounding	What else explains the result?	High-volume surgeons use one implant and low-volume surgeons use another.
Reporting bias	Were all prespecified outcomes reported?	The published paper reports range of motion but omits reoperation.

Effect size: what does the number mean?

Interpreting Common Results
Reported Result	What To Translate	Decision Question
Mean difference	Difference in points, degrees, millimetres or time.	Is it greater than the minimum clinically important difference?
Risk ratio or odds ratio	Relative change plus absolute baseline risk.	How many events are actually prevented or caused?
Absolute risk reduction	Event rate difference between groups.	What is the number needed to treat or harm?
Hazard ratio	Relative event rate over time.	Are proportional hazards plausible and follow-up long enough?
Sensitivity and specificity	Test performance against reference standard.	How does the result change post-test probability?
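The final row of the table asks how a test result changes post-test probability. A minimal sketch of that conversion follows, using likelihood ratios and odds. The sensitivity and specificity are the figures quoted in the meniscal-tear viva prompt later on this page; the pre-test probability of 30 percent is an assumption chosen purely for illustration.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a test with the given sensitivity and specificity."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.60)

pre_test = 0.30  # assumed prevalence in the clinic where the test is applied
after_positive = post_test_probability(pre_test, lr_pos)
after_negative = post_test_probability(pre_test, lr_neg)

print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
print(f"Post-test probability after a positive test: {after_positive:.2f}")
print(f"Post-test probability after a negative test: {after_negative:.2f}")
```

With high sensitivity and modest specificity, a negative result lowers the probability substantially while a positive result raises it only moderately. The same test applied in a population with a different pre-test probability gives different post-test probabilities, which is why the validation setting matters.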

Applicability: should I use it?

A valid result still may not apply. Check:

Patient match: age, frailty, bone quality, comorbidity, activity level and pathology severity.
Intervention match: implant, surgical approach, rehabilitation protocol and perioperative care.
Surgeon/system match: volume, learning curve, imaging access, theatre resources and follow-up capability.
Outcome match: patient-reported outcomes, revision, reoperation, complications, cost and survivorship.
Time horizon: short-term function may conflict with long-term revision risk.

Registry data and RCTs answer different questions

Registry studies often excel at large-scale implant survivorship and rare revision outcomes. Randomised trials better test efficacy in controlled populations. Neither replaces the other.

Differential: easily confused concepts

The commonest appraisal errors in vivas come from mixing up paired concepts that sound similar but answer different questions. Knowing the distinction is high yield.

Distinguishing Commonly Confused Appraisal Concepts
Often Confused	What It Actually Means	Why It Matters
Statistical significance vs clinical importance	A p value tests compatibility with chance; clinical importance tests whether the effect exceeds a meaningful threshold.	A large trial can make a trivial difference significant; a small trial can miss an important one.
Relative vs absolute effect	Relative risk or odds ratio is a ratio; absolute risk reduction is the actual event-rate difference.	A halving of risk (relative) may be a fraction of a percent (absolute) when baseline risk is low.
Allocation concealment vs blinding	Concealment hides the upcoming assignment before enrolment; blinding hides the received treatment afterwards.	Concealment protects randomisation integrity; blinding protects performance and detection.
Confidence interval vs p value	A confidence interval shows the range of effects compatible with the data; the p value gives a single threshold answer.	A non-significant result with a wide interval is uncertainty, not proof of no effect.
Per-protocol vs intention-to-treat	Intention-to-treat keeps patients in their randomised group; per-protocol analyses only compliant patients.	Per-protocol can reintroduce selection bias and exaggerate surgical benefit.
Efficacy vs effectiveness	Efficacy is performance under ideal trial conditions; effectiveness is performance in routine practice.	A result from an expert centre may not transfer to a general unit (the efficacy-effectiveness gap).
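The relative versus absolute distinction is easiest to see with numbers. The sketch below applies the same halving of risk to two hypothetical baseline risks; the event rates are invented for illustration and the function name is an assumption.

```python
def effect_measures(control_risk, treated_risk):
    """Return relative risk, absolute risk reduction and number needed to treat.

    A negative absolute risk reduction means harm; the third value is then
    read as a number needed to harm.
    """
    relative_risk = treated_risk / control_risk
    absolute_risk_reduction = control_risk - treated_risk
    if absolute_risk_reduction == 0:
        number_needed = float("inf")  # no difference: no finite NNT
    else:
        number_needed = 1 / abs(absolute_risk_reduction)
    return relative_risk, absolute_risk_reduction, number_needed


# Same relative effect (risk halved), very different absolute effect.
common = effect_measures(control_risk=0.20, treated_risk=0.10)
rare = effect_measures(control_risk=0.01, treated_risk=0.005)

for label, (rr, arr, nnt) in (("common event", common), ("rare event", rare)):
    print(f"{label}: RR={rr:.2f}, ARR={arr:.3f}, NNT~{nnt:.0f}")
```

Both scenarios report a relative risk of 0.5, yet about ten patients need treatment to prevent one common event and about two hundred to prevent one rare event. By convention the number needed to treat is rounded up to a whole patient when reported.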

Type I and Type II Error, and Statistical Power

Behind every p value and every "underpowered" comment sit two error types examiners expect you to define precisely.

The Two Error Types
Error	Definition	Governed by	Orthopaedic consequence
Type I (alpha)	False positive - concluding a difference exists when it does not	The significance level (conventionally 0.05)	Adopting an implant or technique that is in truth no better
Type II (beta)	False negative - missing a real difference	Sample size and effect size (beta conventionally 0.10 to 0.20)	Discarding a genuinely better treatment because the trial was too small

Power equals one minus beta - the probability of detecting a true effect of a given size. Trials are usually designed for 80 to 90 percent power.
The sample-size (power) calculation is set a priori from the expected effect size (often the MCID), the baseline event rate or variance, alpha and the desired power. A small expected effect or a rare event demands a large sample - the central reason so many surgical RCTs end up underpowered.
Absence of evidence is not evidence of absence: a non-significant result in an underpowered study does not prove equivalence. Check whether the study had the power to detect a clinically important difference, and read the confidence interval.
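The points above can be illustrated with a rough a priori sample-size estimate for a binary outcome. This is a sketch using the common normal-approximation formula for comparing two proportions with equal group sizes; the event rates are hypothetical, and a real trial would add allowances for dropout and use dedicated software.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_two_proportions(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate patients per group to detect p_control vs p_treated.

    Two-sided test, equal allocation, normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls Type I error
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    effect = p_control - p_treated
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical reoperation rates, each scenario halving the risk.
n_common = n_per_group_two_proportions(0.20, 0.10)
n_moderate = n_per_group_two_proportions(0.10, 0.05)
n_rare = n_per_group_two_proportions(0.02, 0.01)
n_moderate_90 = n_per_group_two_proportions(0.10, 0.05, power=0.90)

print(f"20% vs 10%, 80% power: {n_common} per group")
print(f"10% vs 5%,  80% power: {n_moderate} per group")
print(f"2% vs 1%,   80% power: {n_rare} per group")
print(f"10% vs 5%,  90% power: {n_moderate_90} per group")
```

The required sample grows steeply as the event becomes rarer or the desired power rises, which is the arithmetic behind underpowered surgical trials.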

The underpowered 'negative' trial

Q: A surgical RCT reports 'no significant difference' (p = 0.3) between two implants. Does this prove they are equivalent?
A: No. This is most often a Type II error in an underpowered study. Check the a priori power calculation and the confidence interval - if the interval still includes a clinically important difference, the trial simply could not exclude one. Proving equivalence requires a purpose-designed equivalence or non-inferiority trial, not a failed superiority trial.

Guidelines, Registries and Global Practice

Critical appraisal is the engine behind every guideline and registry, so candidates from any system should know how the major bodies build and grade recommendations and how their orthopaedic registries are used as evidence.

How major bodies grade evidence

Evidence-Grading Approaches Across Bodies
Body	Approach	What To Note
AAOS (US)	Clinical practice guidelines using a structured strength-of-recommendation system from systematic reviews.	Recommendation strength reflects evidence quality plus consistency; many orthopaedic topics rest on limited or moderate evidence.
NICE and BOA/BOAST (UK)	NICE uses GRADE-based methods; BOAST standards translate evidence into auditable care standards.	BOAST documents convert evidence into concise, measurable practice points.
AO Foundation	Education and consensus around fracture management with evidence summaries.	Strong on technique and classification; consensus may outpace high-level trial data.
EFORT and European societies	Consensus statements and instructional reviews across Europe.	Useful where regional practice and implant availability differ.
Cochrane / GRADE working group	Systematic reviews with explicit GRADE certainty ratings.	Often rate orthopaedic surgical evidence as low or moderate certainty.

Registries as evidence

National joint replacement registries are a defining global source of orthopaedic effectiveness and safety data. They capture rare revision events and long-term survivorship that trials cannot.

AOANJRR (Australia), NJR (England, Wales, Northern Ireland and the Isle of Man), AJRR (US), SHAR (Sweden), NARA (Nordic) and NZJR (New Zealand) provide implant survivorship, revision rates and bearing or fixation comparisons across hundreds of thousands of procedures.
Strengths: large numbers, real-world populations, early detection of poorly performing implants (the metal-on-metal hip signal is the classic example).
Limitations: confounding by indication, limited patient-reported outcomes, variable capture of function and pain, and dependence on accurate data entry.

High-resource versus limited-resource practice

Applying Evidence Across Resource Settings
Dimension	Higher-resource setting	Limited-resource setting
Access to evidence	Subscription journals, guideline databases and registries.	Reliance on open-access sources, WHO guidance and society summaries.
Applicability of trials	Populations often match trial cohorts.	Implant availability, follow-up capacity and case mix may differ from trial settings.
Outcome priorities	Revision, patient-reported outcomes and survivorship.	Limb salvage, infection control and return to function may dominate.

Global framing

A guideline or registry finding from any one country is evidence contributing to a global picture; the appraisal question is always whether the population, implant and system match the patient in front of you, not which country produced the data.

Controversies and Areas of Uncertainty

Critical appraisal is itself debated. Examiners reward candidates who can discuss the limits of the evidence hierarchy rather than reciting it.

Hierarchy vs real-world evidence

The classical pyramid places randomised trials above observational data, yet large registries and target-trial-emulation methods can answer questions (rare implant failure, long-term survivorship) that no feasible trial can. Many surgical questions cannot be ethically or practically randomised.

Surgical RCTs are hard

Blinding the surgeon is impossible, learning curves bias early results, equipoise is often lacking, and expertise-based designs are uncommon. A poorly conducted surgical trial may be weaker than a well-designed cohort study.

Reproducibility and reporting

Selective outcome reporting, spin in abstracts, underpowered studies and unregistered protocols remain common in the orthopaedic literature. Trial registration and core outcome sets are only partial remedies.

MCID is not fixed

The minimum clinically important difference varies by population, anchor method and baseline severity, so the same point change can be "important" in one study and not another. Treat any single MCID value with caution.

A mature appraisal answer

Say explicitly that the evidence hierarchy ranks designs on average but that a specific study must be judged on its own conduct, and that for many surgical questions high-quality observational and registry data are the best obtainable evidence.

Clinical Scenarios

Practise clinical reasoning and management decisions out loud

Viva scenario: Standard

Clinical prompt

“You are shown a randomised trial comparing two fixation methods. The conclusion says one implant is statistically superior with p = 0.04.”

Viva scenario: Advanced

Clinical prompt

“A meta-analysis reports that a surgical technique reduces revision risk. The forest plot looks convincing, but the included studies are heterogeneous observational cohorts.”

Viva scenario: Advanced

Clinical prompt

“A paper reports that a new clinical test for a meniscal tear has a sensitivity of 90 percent and a specificity of 60 percent, validated against arthroscopy in a specialist sports clinic.”

Exam day cheat sheet

Critical Appraisal Cheat Sheet

Start

State the PICO
Identify study design
Ask if design matches question
Find primary outcome
Check follow-up duration

Believe

Selection bias
Allocation concealment
Blinding/outcome assessment
Missing data
Confounding and reporting bias

Use

Absolute and relative effect
Confidence interval
Clinical importance
Benefits versus harms
Applicability to patient and setting

“Define the question, test the validity, quantify the effect and decide whether it applies.”

Evidence Base

Evidence

Evidence-based medicine definition

Foundational article

Key Findings:

Evidence-based medicine integrates best evidence with clinical expertise and patient values.
It is not cookbook medicine.
External evidence can inform but not replace clinical judgement.

Clinical implication: Treat evidence as decision support, not automatic rule-following.

Limitation: Conceptual article rather than empirical trial.

Source: Sackett et al., BMJ, 1996

Verify on PubMed (PMID 8555924)

Evidence

GRADE approach

Methodology consensus

Key Findings:

GRADE separates certainty of evidence from strength of recommendation.
Evidence can be downgraded for risk of bias, inconsistency, indirectness, imprecision and publication bias.
Recommendations also depend on values, harms and resource use.

Clinical implication: A strong recommendation is not just a high-level study; it is a judgement across benefits, harms and certainty.

Limitation: Methodology framework requiring judgement.

Source: Guyatt et al., BMJ, 2008

Verify on PubMed (PMID 18436948)

Evidence

Reporting guidelines

Reporting standards

Key Findings:

Different study types require different reporting checklists.
Transparent reporting helps readers judge bias and applicability.
Poor reporting does not always mean poor methods, but it prevents confident appraisal.

Clinical implication: Use the appropriate checklist when reading a trial, observational study, diagnostic study or systematic review.

Limitation: Reporting standards improve transparency but do not guarantee methodological quality.

Source: CONSORT 2010, STROBE 2007, STARD 2015, PRISMA 2020

Evidence

AMSTAR 2

Critical appraisal tool

Key Findings:

AMSTAR 2 provides a structured method for appraising systematic reviews of healthcare interventions.
It distinguishes critical from non-critical weaknesses.
A meta-analysis can be misleading if the review question, search, bias assessment or synthesis is weak.

Clinical implication: Do not trust a forest plot before appraising the review methods.

Limitation: Tool requires informed judgement and does not produce a simple numerical quality score.

Source: Shea et al., BMJ, 2017

Verify on PubMed (PMID 28935701)

Evidence

Levels of evidence in orthopaedics

Level V (editorial framework)

Key Findings:

Major orthopaedic journals adopted a five-level hierarchy (Level I randomised trials through Level V expert opinion) applied separately to therapeutic, prognostic, diagnostic and economic questions.
The level assigned depends on both study design and methodological quality, so a flawed randomised trial can drop below Level I.
The grading was introduced to help readers rapidly gauge the strength of orthopaedic evidence.

Clinical implication: Use the level of evidence as a starting filter, but still appraise the methods rather than trusting the label.

Limitation: A simple hierarchy oversimplifies bias; a Level I tag does not guarantee freedom from allocation, blinding or attrition problems.

Source: Wright, Swiontkowski, Heckman, JBJS Br, 2006

Verify on PubMed (PMID 16943485)

Evidence

Large orthopaedic RCT as appraisal exemplar (SPRINT)

Level I (multicentre randomised trial design)

Key Findings:

A multicentre randomised trial enrolling 1200 skeletally mature patients across 29 sites in Canada, the United States and the Netherlands compared reamed with non-reamed intramedullary nailing of tibial shaft fractures.
Patients, outcome assessors and data analysts were blinded, and a blinded committee adjudicated the composite reoperation primary outcome.
An a priori subgroup analysis separated open from closed fractures, illustrating prespecified rather than data-driven subgrouping.

Clinical implication: Demonstrates the methodological safeguards (concealed allocation, blinded adjudication, prespecified subgroups, standardised co-interventions) that a credible surgical trial should show.

Limitation: This is the rationale and design publication; effect estimates are reported in the trial's separate results paper, and surgical interventions cannot blind the operating surgeon.

Source: SPRINT Investigators (Bhandari et al.), BMC Musculoskelet Disord, 2008

Verify on PubMed (PMID 18573205)

PICO in Orthopaedics

Element	Question	Example
Patient	Who exactly is being treated?	Older adults with displaced intracapsular femoral neck fracture who were ambulatory before injury.
Intervention	What is the treatment, implant or pathway?	Total hip arthroplasty through a specified approach.
Comparator	What is it being compared with?	Hemiarthroplasty, non-operative treatment, another implant or another rehabilitation pathway.
Outcome	What matters and when?	Reoperation, function, pain, dislocation, infection, mortality, revision, cost and patient-reported outcome at a defined time.

Treatment Questions

Best Designs	What To Check	Orthopaedic Trap
Randomised trial or high-quality systematic review	Random sequence, allocation concealment, blinding where possible, intention-to-treat and follow-up.	Surgical trials may be hard to blind, so outcome assessment and crossover matter.
Registry or cohort study	Confounding control, selection bias, surgeon/implant learning curve and outcome definition.	Registry survival may not capture pain, function or radiographic failure.

Three Trial Objectives

Design	Question	How it is judged
Superiority	Is the new treatment better?	The confidence interval for the difference excludes zero (no difference)
Equivalence	Is it neither better nor worse, within a margin?	The confidence interval lies entirely within plus or minus the equivalence margin
Non-inferiority	Is it not meaningfully worse?	The worse-side limit of the confidence interval does not cross the pre-set non-inferiority margin
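
The three judgements above can be sketched as simple checks on a confidence interval. This is a minimal illustration, not a trial analysis plan: the function name `judge_trial`, the sign convention (difference = new minus control, higher values favour the new treatment) and the example numbers are all assumptions made for this sketch.

```python
def judge_trial(ci_low, ci_high, margin=None):
    """Classify a confidence interval for the difference (new minus control).

    Assumptions for this sketch: higher values favour the new treatment,
    and `margin` is a positive, pre-specified clinical margin.
    """
    # Superiority: the whole interval sits above zero (no difference).
    result = {"superior": ci_low > 0}
    if margin is not None:
        # Equivalence: the whole interval lies inside (-margin, +margin).
        result["equivalent"] = -margin < ci_low and ci_high < margin
        # Non-inferiority: only the worse-side limit is compared with -margin.
        result["non_inferior"] = ci_low > -margin
    return result


# Hypothetical intervals, each judged against a margin of 3 points.
clearly_better = judge_trial(0.5, 4.0, margin=3.0)
similar = judge_trial(-1.5, 2.0, margin=3.0)
possibly_worse = judge_trial(-4.0, 1.0, margin=3.0)
```

Note that the same interval can satisfy non-inferiority without showing superiority, which is why the margin must be fixed before the data are seen.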

Core Bias Checks

Bias	Question To Ask	Orthopaedic Example
Selection bias	Were patients allocated or selected in a way that created unfair groups?	Healthier patients receive surgery while frailer patients receive non-operative care.
Performance bias	Were co-interventions and rehabilitation similar?	One ACL group receives more supervised physiotherapy than the other.
Detection bias	Were outcomes assessed fairly and blindly?	Surgeon-assessed radiographic union favours their preferred implant.
Attrition bias	Who was lost to follow-up?	Painful failures do not return to clinic and are counted as successes.
Confounding	What else explains the result?	High-volume surgeons use one implant and low-volume surgeons use another.
Reporting bias	Were all prespecified outcomes reported?	The published paper reports range of motion but omits reoperation.

Interpreting Common Results

Reported Result	What To Translate	Decision Question
Mean difference	Difference in points, degrees, millimetres or time.	Is it greater than the minimum clinically important difference?
Risk ratio or odds ratio	Relative change plus absolute baseline risk.	How many events are actually prevented or caused?
Absolute risk reduction	Event rate difference between groups.	What is the number needed to treat or harm?
Hazard ratio	Relative event rate over time.	Are proportional hazards plausible and follow-up long enough?
Sensitivity and specificity	Test performance against reference standard.	How does the result change post-test probability?
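
For binary outcomes, the translation from relative to absolute effect is simple arithmetic. The sketch below uses the standard definitions (risk ratio = treated risk / control risk, absolute risk reduction = control risk minus treated risk, number needed to treat = 1 / absolute risk reduction). The function name `binary_effects` and the event counts are hypothetical and do not come from any trial.

```python
import math


def binary_effects(events_treat, n_treat, events_ctrl, n_ctrl):
    """Return risks, risk ratio, absolute risk reduction and NNT (or NNH)."""
    risk_treat = events_treat / n_treat
    risk_ctrl = events_ctrl / n_ctrl
    arr = risk_ctrl - risk_treat          # positive when treatment prevents events
    rr = risk_treat / risk_ctrl           # relative effect
    # 1 / |ARR| is the NNT when ARR > 0 and the NNH when ARR < 0.
    nnt = math.inf if arr == 0 else 1 / abs(arr)
    return {"risk_treat": risk_treat, "risk_ctrl": risk_ctrl,
            "rr": rr, "arr": arr, "nnt": nnt}


# Hypothetical example 1: common event, 10/200 versus 20/200.
common = binary_effects(10, 200, 20, 200)
# Hypothetical example 2: rare event, 1/1000 versus 2/1000.
rare = binary_effects(1, 1000, 2, 1000)

for label, r in (("common", common), ("rare", rare)):
    print(f"{label}: RR={r['rr']:.2f}, ARR={r['arr']:.3f}, NNT={r['nnt']:.0f}")
```

Both examples halve the risk, yet the number needed to treat is about 20 in the first and about 1000 in the second, because the baseline risk differs.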

Distinguishing Commonly Confused Appraisal Concepts

Often Confused	What It Actually Means	Why It Matters
Statistical significance vs clinical importance	A p value measures how compatible the data are with chance alone (the null hypothesis of no difference); clinical importance asks whether the effect exceeds a meaningful threshold such as the minimum clinically important difference.	A large trial can make a trivial difference significant; a small trial can miss an important one.
Relative vs absolute effect	Relative risk or odds ratio is a ratio; absolute risk reduction is the actual event-rate difference.	A halving of risk (relative) may be a fraction of a percent (absolute) when baseline risk is low.
Allocation concealment vs blinding	Concealment hides the upcoming assignment until the patient is enrolled; blinding hides the treatment received after allocation.	Concealment protects the integrity of randomisation; blinding protects against performance and detection bias.
Confidence interval vs p value	A confidence interval shows the range of effects compatible with the data; the p value reduces the result to a single comparison against a threshold.	A non-significant result with a wide interval is uncertainty, not proof of no effect.
Per-protocol vs intention-to-treat	Intention-to-treat keeps patients in their randomised group; per-protocol analysis includes only patients who received their allocated treatment as planned.	Per-protocol can reintroduce selection bias and exaggerate surgical benefit.
Efficacy vs effectiveness	Efficacy is performance under ideal trial conditions; effectiveness is performance in routine practice.	A result from an expert centre may not transfer to a general unit (the efficacy-effectiveness gap).
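
The first and fourth distinctions above can be made concrete with a small calculation. This is a sketch under stated assumptions: it uses a normal approximation for the difference between two group means with a common standard deviation (a t-test would be more appropriate for small samples), and the function name `mean_difference_summary`, the score scale, the standard deviation and the MCID of 8 points are all hypothetical.

```python
import math
from statistics import NormalDist


def mean_difference_summary(diff, sd, n_per_group, mcid, alpha=0.05):
    """Summarise a mean difference: CI, p value and comparison with the MCID."""
    se = sd * math.sqrt(2 / n_per_group)           # standard error of the difference
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    ci = (diff - z_crit * se, diff + z_crit * se)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return {
        "ci": ci,
        "p_value": p_value,
        "significant": p_value < alpha,
        "exceeds_mcid": abs(diff) >= mcid,         # judged on the point estimate
    }


# Hypothetical large trial: 2-point difference, 5000 patients per group.
large_trial = mean_difference_summary(diff=2, sd=15, n_per_group=5000, mcid=8)
# Hypothetical small trial: 10-point difference, 15 patients per group.
small_trial = mean_difference_summary(diff=10, sd=15, n_per_group=15, mcid=8)
```

In this sketch the large trial is statistically significant but falls below the MCID, while the small trial exceeds the MCID but is not significant and has an interval that spans zero. Neither result should be read from the p value alone.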

The Two Error Types

Error	Definition	Governed by	Orthopaedic consequence
Type I (alpha)	False positive - concluding a difference exists when it does not	The significance level (conventionally 0.05)	Adopting an implant or technique that is in truth no better
Type II (beta)	False negative - missing a real difference	Sample size and effect size (beta conventionally 0.10 to 0.20)	Discarding a genuinely better treatment because the trial was too small
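
The link between alpha, beta and sample size can be illustrated with the common normal-approximation formula for comparing two proportions, n per group = (z for alpha + z for beta) squared x [p1(1 - p1) + p2(1 - p2)] / (p1 - p2) squared. This is a rough planning sketch only; the function name `n_per_group` and the event rates of 10% and 5% are assumptions for illustration, and real trials also adjust for dropout and design features.

```python
import math
from statistics import NormalDist


def n_per_group(p_control, p_treat, alpha=0.05, power=0.80):
    """Approximate patients per group for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # Type I error, two-sided
    z_beta = NormalDist().inv_cdf(power)            # power = 1 - beta
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_treat) ** 2
    return math.ceil(n)


# Hypothetical reoperation rates: 10% with control, 5% with the new treatment.
n_power_80 = n_per_group(0.10, 0.05, power=0.80)   # beta = 0.20
n_power_90 = n_per_group(0.10, 0.05, power=0.90)   # beta = 0.10
```

Under these assumptions the requirement is roughly 430 patients per group at 80% power and roughly 580 at 90% power, which shows why a small negative trial may simply be underpowered.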

Evidence-Grading Approaches Across Bodies

Body	Approach	What To Note
AAOS (US)	Clinical practice guidelines using a structured strength-of-recommendation system from systematic reviews.	Recommendation strength reflects evidence quality plus consistency; many orthopaedic topics rest on limited or moderate evidence.
NICE and BOA/BOAST (UK)	NICE uses GRADE-based methods; BOAST standards translate evidence into auditable care standards.	BOAST documents convert evidence into concise, measurable practice points.
AO Foundation	Education and consensus around fracture management with evidence summaries.	Strong on technique and classification; consensus may outpace high-level trial data.
EFORT and European societies	Consensus statements and instructional reviews across Europe.	Useful where regional practice and implant availability differ.
Cochrane / GRADE working group	Systematic reviews with explicit GRADE certainty ratings.	Often rate orthopaedic surgical evidence as low or moderate certainty.

Applying Evidence Across Resource Settings

Dimension	Higher-resource setting	Limited-resource setting
Access to evidence	Subscription journals, guideline databases and registries.	Reliance on open-access sources, WHO guidance and society summaries.
Applicability of trials	Populations often match trial cohorts.	Implant availability, follow-up capacity and case mix may differ from trial settings.
Outcome priorities	Revision, patient-reported outcomes and survivorship.	Limb salvage, infection control and return to function may dominate.

Evidence

Evidence-based medicine definition

Foundational article

Key Findings:

Evidence-based medicine integrates best evidence with clinical expertise and patient values.
It is not cookbook medicine.
External evidence can inform but not replace clinical judgement.

Clinical implication: Treat evidence as decision support, not automatic rule-following.

Limitation: Conceptual article rather than empirical trial.

Source: Sackett et al., BMJ, 1996

Verify on PubMed (PMID 8555924)

Evidence

GRADE approach

Methodology consensus

Key Findings:

GRADE separates certainty of evidence from strength of recommendation.
Evidence can be downgraded for risk of bias, inconsistency, indirectness, imprecision and publication bias.
Recommendations also depend on values, harms and resource use.

Clinical implication: A strong recommendation does not follow from a high-level study alone; it is a judgement across benefits, harms and certainty.

Limitation: Methodology framework requiring judgement.

Source: Guyatt et al., BMJ, 2008

Verify on PubMed (PMID 18436948)

Evidence

Reporting guidelines

Reporting standards

Key Findings:

Different study types require different reporting checklists.
Transparent reporting helps readers judge bias and applicability.
Poor reporting does not always mean poor methods, but it prevents confident appraisal.

Clinical implication: Use the appropriate checklist when reading a trial, observational study, diagnostic study or systematic review.

Limitation: Reporting standards improve transparency but do not guarantee methodological quality.

Source: CONSORT 2010, STROBE 2007, STARD 2015, PRISMA 2020

Evidence

AMSTAR 2

Critical appraisal tool

Key Findings:

AMSTAR 2 provides a structured method for appraising systematic reviews of healthcare interventions.
It distinguishes critical from non-critical weaknesses.
A meta-analysis can be misleading if the review question, search, bias assessment or synthesis is weak.

Clinical implication: Do not trust a forest plot before appraising the review methods.

Limitation: Tool requires informed judgement and does not produce a simple numerical quality score.

Source: Shea et al., BMJ, 2017

Verify on PubMed (PMID 28935701)

Prognosis Questions

Best Designs	What To Check	Orthopaedic Trap
Inception cohort	Consecutive patients at a similar disease stage, complete follow-up and meaningful outcomes.	Late referral cohorts can exaggerate poor prognosis.
Risk model	Calibration, discrimination and external validation.	A score may work in its derivation cohort and fail in your population.
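
Discrimination and calibration, the two checks named for a risk model above, can be sketched in a few lines. Discrimination is shown here as the c-statistic (the proportion of event/non-event pairs in which the patient with the event had the higher predicted risk), and calibration as the simplest "calibration-in-the-large" comparison of observed and mean predicted risk. The function names and the four-patient data set are hypothetical and far too small for real validation.

```python
def c_statistic(predicted, observed):
    """Proportion of event/non-event pairs ranked correctly (ties count half)."""
    events = [p for p, y in zip(predicted, observed) if y == 1]
    non_events = [p for p, y in zip(predicted, observed) if y == 0]
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


def calibration_in_the_large(predicted, observed):
    """Observed event rate minus mean predicted risk (zero is ideal)."""
    return sum(observed) / len(observed) - sum(predicted) / len(predicted)


# Hypothetical predicted risks and observed outcomes (1 = event occurred).
predicted_risk = [0.10, 0.40, 0.35, 0.80]
outcome = [0, 0, 1, 1]

auc = c_statistic(predicted_risk, outcome)
calibration_gap = calibration_in_the_large(predicted_risk, outcome)
```

A score can rank patients well (high c-statistic) and still over- or under-predict absolute risk in a new population, which is why external validation should report both.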

Diagnosis Questions

Best Designs	What To Check	Orthopaedic Trap
Cross-sectional diagnostic accuracy study	Independent blind comparison with a valid reference standard.	MRI accuracy depends on disease spectrum and reader expertise.
Clinical test study	Spectrum of patients, reproducibility, likelihood ratios and reference standard.	A special test in a sports clinic may not perform the same in acute trauma.
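
Likelihood ratios link a test's sensitivity and specificity to the individual patient through the pre-test probability. The sketch below uses the standard relationships (LR+ = sensitivity / (1 - specificity), LR- = (1 - sensitivity) / specificity, post-test odds = pre-test odds x LR). The function names, the test characteristics and the two pre-test probabilities are hypothetical values chosen for illustration.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a test with the given sensitivity and specificity."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Hypothetical clinical test: sensitivity 0.90, specificity 0.80.
lr_pos, lr_neg = likelihood_ratios(0.90, 0.80)

# The same positive result in two settings with different pre-test probability.
high_prevalence = post_test_probability(0.50, lr_pos)   # e.g. a specialist clinic
low_prevalence = post_test_probability(0.20, lr_pos)    # e.g. an unselected setting
after_negative = post_test_probability(0.20, lr_neg)

print(f"LR+ {lr_pos:.1f}, LR- {lr_neg:.3f}")
print(f"Positive test: {high_prevalence:.2f} vs {low_prevalence:.2f}")
print(f"Negative test at 20% pre-test probability: {after_negative:.2f}")
```

With these assumed values, the same positive result gives a post-test probability of about 0.82 at a 50% pre-test probability but only about 0.53 at 20%, which is one reason a test validated in one clinic population may mislead in another.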

Harm Questions

Best Designs	What To Check	Orthopaedic Trap
Large cohort, registry or case-control study	Exposure definition, confounding, event capture and duration of follow-up.	Rare complications may be invisible in small RCTs.
Case series or alerts	Signal detection, biological plausibility and denominator uncertainty.	A dramatic complication report can identify danger but not incidence.

Study Type	Useful Standard	Key Questions
Randomised trial	CONSORT	Randomisation, allocation concealment, blinding, flow, analysis and harms.
Observational study	STROBE	Selection, exposure, confounding, missing data and outcome measurement.
Diagnostic accuracy study	STARD	Patient spectrum, index test, reference standard, blinding and timing.
Systematic review	PRISMA and AMSTAR 2	Question, protocol, search, bias, heterogeneity and synthesis method.
Recommendation	GRADE	Evidence certainty, benefit-harm balance, values and resources.
