High-yield overview

Evidence Hierarchy | GRADE System | Clinical Application

Level I: High-quality RCT or SR

Level II: Lesser RCT or Cohort

Level III: Case-Control Studies

Level IV-V: Case Series and Opinion

Evidence Levels for Therapeutic Questions

Level I

Pattern: High-quality RCT or Systematic Review

Treatment: Strong recommendation possible

Level II

Pattern: Lesser RCT or Prospective Cohort

Treatment: Moderate recommendation

Level III

Pattern: Case-Control or Retrospective Cohort

Treatment: Weak recommendation

Level IV-V

Pattern: Case Series or Expert Opinion

Treatment: Very weak recommendation

Critical Must-Knows

Level I Evidence: High-quality RCT with randomization, blinding, adequate power, low loss to follow-up
GRADE System: Assesses quality of evidence (High/Moderate/Low/Very Low) AND strength of recommendations (Strong/Weak)
Evidence Levels Vary by Question Type: Therapeutic, Prognostic, Diagnostic questions have different hierarchies
Study Design ≠ Evidence Quality: A poorly conducted RCT can be downgraded; a well-done cohort can provide strong evidence
Recommendation Strength: Depends on evidence quality, benefit-harm balance, values, and resource use

Clinical Pearls

RCT is not always Level I - it must meet quality criteria including blinding, adequate power, and low attrition
Systematic review quality depends on the included studies - an SR of poor RCTs is not Level I
For rare outcomes, a well-designed case-control study may be the best available evidence
GRADE separates evidence quality from recommendation strength - a strong recommendation can rest on low-quality evidence if there is a large effect and an ethical imperative

Critical Evidence Concepts

Study Design vs Evidence Quality

Not the same! RCT design does NOT automatically mean Level I. Must assess: randomization quality, blinding, power, attrition, bias. A flawed RCT can be Level II or III.

Question Type Matters

Therapeutic: RCT is gold standard. Prognostic: Cohort is best. Diagnostic: Cross-sectional with reference standard. Evidence hierarchy differs by question.

GRADE: Quality vs Strength

Evidence Quality: How confident are we in effect estimate? Recommendation Strength: Should we do this? Can have strong recommendation from low quality if large effect.

Upgrade and Downgrade Factors

Downgrade: Risk of bias, inconsistency, indirectness, imprecision, publication bias. Upgrade: Large effect, dose-response gradient, plausible residual confounding that would have reduced the observed effect.

At a Glance

The Levels of Evidence framework ranks study designs to guide clinical decision-making. Level I represents high-quality randomized controlled trials (adequate randomization, blinding, power, and low attrition) or systematic reviews of such trials. Study design alone does not determine evidence level: a poorly conducted RCT may be downgraded to Level II or III. The hierarchy descends through Level II (lesser RCTs, prospective cohorts) and Level III (case-control, retrospective cohorts) to Level IV-V (case series, expert opinion). The GRADE system adds nuance by separating evidence quality (confidence in the effect estimate) from recommendation strength (should we act), acknowledging that strong recommendations can arise from lower-quality evidence when effects are large and harms minimal. Evidence can be downgraded by the "RIIIP" factors (Risk of bias, Inconsistency, Indirectness, Imprecision, Publication bias) or upgraded by a large effect size, a dose-response relationship, or plausible residual confounding that would have reduced the observed effect.

Mnemonic

RCCCCE: Levels of Evidence (Therapeutic)

R	RCT (High Quality)	Level I - Randomized, blinded, powered, ITT
C	Cohort (Prospective)	Level II - Observational, forward in time
C	Case-Control	Level III - Retrospective comparison
C	Case Series	Level IV - Descriptive, no control
C	Committee Opinion	Level V - Expert consensus
E	Editorial/Expert	Lowest evidence level

Hook: Remember Chronic Cases Can Create Excellent evidence - from highest to lowest quality!

Mnemonic

RIIIP: GRADE Factors that Downgrade Evidence

R	Risk of Bias	Flawed study design, inadequate blinding, high attrition
I	Inconsistency	Conflicting results across studies (heterogeneity)
I	Indirectness	Study population/intervention differs from question (PICO mismatch)
I	Imprecision	Wide confidence intervals, small sample, few events
P	Publication Bias	Negative studies not published (funnel plot asymmetry)

Hook: RIIIP evidence apart - five factors that lower your confidence in the evidence!
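
To make the Imprecision factor concrete, the sketch below computes a risk ratio with a 95% confidence interval on the log scale for two hypothetical trials that observe the same risks but differ in size. The function name `risk_ratio_ci` and the trial counts are illustrative assumptions, not data from any cited study, and an interval crossing 1 is only one crude signal of imprecision, not a full GRADE judgement.

```python
import math
from statistics import NormalDist


def risk_ratio_ci(events_tx, n_tx, events_ctrl, n_ctrl, level=0.95):
    """Risk ratio with a Wald confidence interval computed on the log scale."""
    rr = (events_tx / n_tx) / (events_ctrl / n_ctrl)
    # Standard error of ln(RR) for two independent binomial samples.
    se = math.sqrt(1 / events_tx - 1 / n_tx + 1 / events_ctrl - 1 / n_ctrl)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    low = math.exp(math.log(rr) - z * se)
    high = math.exp(math.log(rr) + z * se)
    return rr, low, high


# Hypothetical trials: identical observed risks (20% vs 40%), different sizes.
small = risk_ratio_ci(4, 20, 8, 20)
large = risk_ratio_ci(400, 2000, 800, 2000)

for label, (rr, low, high) in (("small", small), ("large", large)):
    crosses_null = low < 1 < high
    print(f"{label}: RR={rr:.2f}, CI {low:.2f} to {high:.2f}, crosses 1: {crosses_null}")
```

Both trials give the same point estimate, but only the larger one yields an interval narrow enough to exclude no effect, which is why small samples and few events lower confidence.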

Overview and Introduction

Understanding Levels of Evidence

Levels of evidence provide a hierarchical framework for evaluating the quality of research studies. This system helps clinicians appraise the strength of evidence supporting clinical decisions.

Key Principles:

Higher evidence levels indicate greater confidence in study findings
Study design alone does not determine evidence level - quality matters
Different question types have different evidence hierarchies
Context determines appropriate evidence level for clinical decisions

Concepts and Methodology Principles

Core Concepts in Evidence Appraisal

The Evidence Pyramid:

Top: Systematic reviews and meta-analyses
High: Randomized controlled trials (RCTs)
Medium: Cohort and case-control studies
Low: Case series, case reports, expert opinion

Why Study Design Matters:

Randomization controls for known and unknown confounders
Blinding prevents performance and detection bias
Control groups allow comparison of intervention effects
Prospective design reduces recall and selection bias

GRADE Framework:

Separates evidence quality (confidence) from recommendation strength
RCTs start as high quality, observational studies as low
Quality can be upgraded or downgraded based on specific criteria

Study Hierarchies for Different Question Types

Therapeutic Questions (Treatment Effectiveness)

Question Format: In [population], does [intervention] compared to [control] improve [outcome]?

Levels of Evidence - Therapeutic

Level	Study Design	Quality Criteria	Example
Level I	High-quality RCT or SR of Level I RCTs	Randomization, allocation concealment, blinding, greater than 80% follow-up, ITT analysis	HEALTH trial: THA vs Hemi for femoral neck fracture
Level II	Lesser-quality RCT, Prospective Cohort, SR of Level II	RCT with methodological flaws OR well-designed cohort	Registry study comparing surgical approaches
Level III	Case-Control, Retrospective Cohort	Observational with comparison, prone to confounding	Case-control of AVN risk factors
Level IV	Case Series	No comparison group, descriptive only	Series of 50 arthroscopic rotator cuff repairs
Level V	Expert Opinion	Lowest level, based on experience	Editorial on surgical technique preferences

For therapeutic questions, randomization is critical because it balances known and unknown confounders across treatment arms and, combined with allocation concealment, prevents selection bias.
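
The therapeutic table above can be read as a simple decision rule, sketched below. The class `RctAppraisal`, the function `therapeutic_level`, and the design labels are hypothetical names chosen for illustration; the thresholds come from the table (for example, greater than 80% follow-up), and real appraisal involves judgement that a checklist cannot capture.

```python
from dataclasses import dataclass


@dataclass
class RctAppraisal:
    adequate_randomization: bool
    allocation_concealment: bool
    blinding: bool
    follow_up_fraction: float  # e.g. 0.92 means 92% of patients followed up
    intention_to_treat: bool

    def is_high_quality(self) -> bool:
        """True only if every Level I quality criterion in the table is met."""
        return (
            self.adequate_randomization
            and self.allocation_concealment
            and self.blinding
            and self.follow_up_fraction > 0.80
            and self.intention_to_treat
        )


# Design-based levels for non-randomised designs, as listed in the table.
OBSERVATIONAL_LEVELS = {
    "prospective cohort": "II",
    "retrospective cohort": "III",
    "case-control": "III",
    "case series": "IV",
    "expert opinion": "V",
}


def therapeutic_level(design, appraisal=None):
    """Return the therapeutic evidence level for a design label."""
    design = design.lower()
    if design == "rct":
        # An RCT without a favourable appraisal is treated as lesser quality.
        if appraisal is not None and appraisal.is_high_quality():
            return "I"
        return "II"
    return OBSERVATIONAL_LEVELS[design]


flawed = RctAppraisal(True, True, False, 0.70, True)
print(therapeutic_level("RCT", flawed))      # lesser-quality RCT
print(therapeutic_level("case series"))      # no comparison group
```

Keeping the quality appraisal separate from the design lookup mirrors the point that design alone does not set the level.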

GRADE System

What is GRADE?

GRADE (Grading of Recommendations Assessment, Development and Evaluation) is the most widely used system for assessing evidence quality and recommendation strength.

Two Key Outputs:

Quality of Evidence: High / Moderate / Low / Very Low
Strength of Recommendation: Strong / Weak (for or against)

Assessing Evidence Quality

Start with Study Design, then apply modifiers:

GRADE Evidence Quality Assessment

Starting Point	Downgrade For	Upgrade For	Final Quality
RCT = HIGH	Risk of bias, Inconsistency, Indirectness, Imprecision, Publication bias (each -1 or -2)	Large effect (+1, or +2 if very large), Dose-response (+1), Residual confounding (+1)	High / Moderate / Low / Very Low
Observational = LOW	Same downgrade factors as above	Same upgrade factors, often applied to cohort studies	Can upgrade to Moderate, or to High with a very large effect

Example: RCT with high risk of bias (-1) and wide confidence intervals (-1) = Low quality evidence (two steps down from High).

Example: Cohort study with a large effect (+1) = Moderate quality evidence (upgraded from Low); a very large effect (+2) could raise it to High.
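
The start-then-modify arithmetic can be sketched as a small function. This is a minimal illustration assuming each downgrade or upgrade moves the rating one step on the four-level scale; the name `grade_quality` and its parameters are hypothetical, and in practice a GRADE panel reaches these judgements by discussion, not by formula.

```python
# GRADE quality levels, ordered from lowest to highest confidence.
LEVELS = ["Very Low", "Low", "Moderate", "High"]

# Starting point depends on study design.
START = {"rct": "High", "observational": "Low"}


def grade_quality(design, downgrades=0, upgrades=0):
    """Apply total downgrade and upgrade steps to the design's starting level."""
    index = LEVELS.index(START[design])
    index = index - downgrades + upgrades
    # The scale is bounded: nothing rates below Very Low or above High.
    index = max(0, min(index, len(LEVELS) - 1))
    return LEVELS[index]


# RCT with risk of bias (-1) and imprecision (-1).
print(grade_quality("rct", downgrades=2))
# Cohort study with a large effect (+1).
print(grade_quality("observational", upgrades=1))
# Cohort study with a very large effect (+2).
print(grade_quality("observational", upgrades=2))
```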

Understanding GRADE is essential for guideline development and evidence interpretation.

Distinguishing Study Designs (Differential)

A common exam task is to be handed a study description and asked to name the design, its level, and its dominant bias. Use the structured features below to tell designs apart quickly.

Study Design Differential - Features and Dominant Bias

Design	Direction	Comparison group	Best for	Dominant bias / limitation
Randomised controlled trial	Prospective, allocation by chance	Yes - randomised arms	Therapeutic (treatment effect)	Attrition and lack of blinding can downgrade; may lack external validity
Prospective cohort	Forward in time from exposure	Yes - exposed vs unexposed	Prognosis, harm, natural history	Confounding; loss to follow-up
Retrospective cohort	Backward using existing records	Yes - exposed vs unexposed	Harm with long latency	Confounding and data-quality / measurement bias
Case-control	Backward from outcome to exposure	Yes - cases vs controls	Rare outcomes, multiple exposures	Recall and selection bias; gives odds ratio not risk
Cross-sectional	Single time point	Sometimes	Prevalence, diagnostic accuracy	Cannot establish temporality
Case series / case report	Descriptive, no comparator	No	Hypothesis generation, rare conditions	No control - cannot infer causation; selection bias

Quick Discriminators

Case-control yields an odds ratio and starts from the outcome; cohort yields relative risk and starts from the exposure. If there is no comparison group at all, it is a case series (Level IV) no matter how many patients are included.
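
The difference between the two effect measures is easiest to see on a 2x2 table. The sketch below uses hypothetical counts and illustrative function names (`relative_risk`, `odds_ratio`); it also shows the familiar property that the odds ratio approximates the relative risk when the outcome is rare.

```python
def relative_risk(a, b, c, d):
    """Cohort measure: risk in exposed divided by risk in unexposed.

    a = exposed with outcome, b = exposed without outcome,
    c = unexposed with outcome, d = unexposed without outcome.
    """
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Case-control measure: cross-product ratio of the same 2x2 table."""
    return (a * d) / (b * c)


# Hypothetical common outcome: 20% vs 10% risk.
common = (20, 80, 10, 90)
# Hypothetical rare outcome: 0.2% vs 0.1% risk.
rare = (2, 998, 1, 999)

for label, table in (("common", common), ("rare", rare)):
    rr = relative_risk(*table)
    oratio = odds_ratio(*table)
    print(f"{label} outcome: RR={rr:.3f}, OR={oratio:.3f}")
```

A case-control study fixes the numbers of cases and controls by design, so the risks in the first function cannot be estimated from it; only the odds ratio is valid.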

Controversies and Areas of Uncertainty

Is the design hierarchy too rigid?

Concato and colleagues (NEJM 2000) showed well-designed observational studies did not systematically overestimate effects versus RCTs. GRADE responded by allowing observational data to be upgraded - but how large an effect justifies upgrading remains a judgement call.

Does a Level I label mean high quality?

Poolman and Bhandari (2006) found Level I and Level II orthopaedic RCTs had similar, often low, reporting-quality scores. The label is a starting point - individual methodological safeguards must still be appraised.

RCT external validity

Strict inclusion criteria, expert centres, and protocolised follow-up can make trial populations unrepresentative. Efficacy in a trial is not always effectiveness in routine practice, which is where pragmatic trials and registries add value.

Surgical RCT feasibility

Blinding surgeons is impossible, sham surgery is ethically fraught, learning curves bias early results, and equipoise is often lacking. This is why much high-quality orthopaedic evidence is necessarily observational.

Clinical Relevance and Applications

Applying Evidence to Patients

Level I evidence is ideal but not always applicable. Consider:

Does patient match RCT inclusion criteria?
Were exclusion criteria too strict?
Do patient values align with outcomes studied?

When Lower Evidence is Acceptable

Situations where Level III-IV may suffice:

Rare diseases (no RCTs feasible)
Urgent clinical need (cannot wait for RCT)
Ethical constraints prevent randomization
Consistent observational data with large effects

Reading Guidelines Critically

Check the evidence grade: Guidelines should cite evidence level for each recommendation. Strong recommendation based on weak evidence? Question the rationale.

Communicating Uncertainty

Be honest with patients: If evidence is Level IV, explain uncertainty. Shared decision-making is crucial when evidence is weak.

Evidence Base

Introducing Levels of Evidence to the Journal (JBJS framework)

Guideline

Wright JG, Swiontkowski MF, Heckman JD • J Bone Joint Surg Am (2003)

Key Findings:

Editorial that formally introduced the levels-of-evidence rating system to JBJS (vol 85-A, p1-3)
Adapted the system to provide separate hierarchies for therapeutic, prognostic, diagnostic, and economic/decision-analysis questions
Defined Level I as high-quality RCT or systematic review of Level I RCTs, descending to Level V (expert opinion)
Adopted as a journal policy requiring an evidence level to accompany each clinical article

Clinical Implication: Established the labelling convention now ubiquitous across orthopaedic journals, letting readers gauge study design at a glance.

Limitation: A simplified design-based hierarchy that, unlike GRADE, does not separately score imprecision, inconsistency, or publication bias.

Verify on PubMed (PMID 12533564)

GRADE: An Emerging Consensus on Rating Quality of Evidence and Strength of Recommendations

Guideline

Guyatt GH, Oxman AD, Vist GE, et al • BMJ (2008)

Key Findings:

Landmark consensus article describing the GRADE approach to rating evidence and recommendations
Separates quality of evidence (High/Moderate/Low/Very Low) from strength of recommendation (Strong/Weak)
RCTs start as high quality and observational studies as low, then move up or down on explicit criteria
Now adopted by the WHO, Cochrane, NICE, and over 100 organisations worldwide

Clinical Implication: Provides the transparent, reproducible framework underpinning most modern orthopaedic and general clinical guidelines.

Limitation: Resource-intensive: requires a systematic review and a trained panel to apply rigorously.

Verify on PubMed (PMID 18436948)

Randomized, Controlled Trials, Observational Studies, and the Hierarchy of Research Designs

Concato J, Shah N, Horwitz RI • N Engl J Med (2000)

Key Findings:

Compared meta-analyses of RCTs with meta-analyses of cohort/case-control studies on the same five clinical topics (99 reports)
Well-designed observational studies did NOT systematically overestimate treatment effects versus RCTs
Point estimates were similar (e.g. BCG vaccine: RCT relative risk 0.49 vs case-control odds ratio 0.50)
The range of estimates was actually wider for the RCTs than the observational studies

Clinical Implication: Challenges a rigid design-only hierarchy and supports GRADE's principle that high-quality observational data can be upgraded.

Limitation: Restricted to topics where both RCT and observational meta-analyses existed; does not negate confounding risk in individual poor-quality observational studies.

Verify on PubMed (PMID 10861325)

Does a Level I Evidence Rating Imply High Quality of Reporting in Orthopaedic RCTs?

Poolman RW, Struijs PA, Krips R, Sierevelt IN, Lutz KH, Bhandari M • BMC Med Res Methodol (2006)

Key Findings:

Assessed 32 RCTs in JBJS-Am (2003-2004, 3543 patients) using the Cochrane reporting-quality tool
Studies labelled Level I and Level II had low and statistically indistinguishable reporting-quality scores
Item-level correlations between evidence level and reporting quality ranged from only 0.0 to 0.2
Concluded a Level I label does NOT guarantee high methodological reporting quality

Clinical Implication: Reinforces the core exam point: the assigned level is a label, not a substitute for appraising individual methodological safeguards.

Limitation: Single-journal sample over a two-year window; reporting quality is a proxy for, not identical to, internal validity.

Verify on PubMed (PMID 16965628)

CONSORT 2010 Statement: Updated Guidelines for Reporting Parallel Group Randomised Trials

Guideline

Schulz KF, Altman DG, Moher D • Ann Intern Med (2010)

Key Findings:

Provides the internationally endorsed 25-item checklist and flow diagram for reporting RCTs
Specifies reporting of randomisation, allocation concealment, blinding, and participant flow
Used by journals worldwide as a condition of publication for randomised trials
Directly maps to the quality criteria distinguishing a true Level I RCT from a downgraded one

Clinical Implication: A CONSORT-compliant report lets the appraiser verify the randomisation, blinding, and attrition that define Level I evidence.

Limitation: Improves transparency of reporting, not the underlying conduct of the trial; a well-reported trial can still be biased.

Verify on PubMed (PMID 20335313)

Evidence Based Medicine: What It Is and What It Isn't

Guideline

Sackett DL, Rosenberg WM, Gray JA, Haynes RB, Richardson WS • BMJ (1996)

Key Findings:

Seminal editorial defining evidence-based medicine as integrating best external evidence with clinical expertise and patient values
Clarified that EBM is not 'cookbook' medicine and does not ignore individual clinical judgement
Emphasised that the best external evidence may come from designs other than RCTs depending on the question
Established the conceptual foundation on which evidence hierarchies and GRADE were later built

Clinical Implication: Frames evidence levels as one input into clinical decisions, not a mechanical rule that overrides patient context.

Limitation: A conceptual editorial; it predates and does not provide the structured grading later supplied by GRADE.

Verify on PubMed (PMID 8555924)

Exam Viva Scenarios

Use these scenarios to practise clinical reasoning and management decisions

CLINICAL SCENARIO: Standard

Scenario 1: Interpreting Evidence Levels

CLINICAL PROMPT

"A colleague shows you a case series of 30 patients who underwent a new surgical technique for rotator cuff repair, with 90 percent good outcomes at 2 years. He says this is Level I evidence. How would you respond?"

PRACTICAL APPROACH

I would respectfully disagree that this is Level I evidence. This study is a case series, which is Level IV evidence. Level I evidence for a therapeutic question requires a high-quality randomized controlled trial or systematic review of such trials. This case series lacks several critical features: First, there is no comparison group - we do not know how these patients would have done with standard treatment. Second, there is no randomization, so we cannot account for selection bias - the surgeon may have chosen patients with favorable characteristics. Third, case series cannot establish causality - the good outcomes may be due to patient selection, natural history, or placebo effect rather than the technique itself. While this series is hypothesis-generating and suggests the technique may be promising, we would need an RCT comparing this new technique to standard repair before drawing conclusions about superiority. The 90 percent good outcome also lacks context without a control group - standard repair might also achieve 90 percent good outcomes.

KEY CLINICAL POINTS

Case series is Level IV, not Level I

Level I requires RCT with randomization and control group

Case series has selection bias and no comparison

Cannot establish causality without control group

COMMON PITFALLS

Accepting that case series can be Level I evidence

Not mentioning the critical role of randomization and control groups

Not explaining why 90 percent success is meaningless without comparison

FURTHER QUESTIONS

"What would be required to make this Level I evidence?"

"Can observational studies ever provide high-quality evidence?"

"What is the difference between efficacy and effectiveness?"

CLINICAL SCENARIO: Challenging

Scenario 2: GRADE System Application

CLINICAL PROMPT

"You are reviewing a guideline that gives a Strong recommendation for surgical fixation of ankle fractures based on Moderate quality evidence from observational studies. Is this appropriate?"

PRACTICAL APPROACH

Yes, this can be appropriate under GRADE methodology. GRADE separates evidence quality from recommendation strength. Evidence quality reflects our confidence in the effect estimate, while recommendation strength reflects whether we should do the intervention considering benefits, harms, values, and resources. A strong recommendation can be made from moderate quality evidence if several conditions are met: First, there is a large and consistent treatment effect across studies. Second, the benefit-harm balance strongly favors intervention - for example, unstable ankle fractures have high risk of arthritis without surgery, while surgical risks are manageable. Third, patient values are aligned - most patients would choose surgery given the consequences of non-treatment. Fourth, the intervention is feasible and cost-effective. In contrast, we might make a weak recommendation even with high-quality evidence if the benefit-harm balance is close, patient values vary widely, or costs are prohibitive. For ankle fracture fixation, the strong recommendation likely reflects that failure to fix an unstable fracture leads to predictable poor outcomes, even though we lack RCTs comparing operative vs non-operative treatment. Ethically, an RCT would be difficult to justify when the natural history of untreated unstable fractures is so poor.

KEY CLINICAL POINTS

GRADE separates evidence quality from recommendation strength

Strong recommendation requires large effect, favorable benefit-harm balance, aligned values, feasibility

Can make strong recommendation from moderate evidence if effect is large and harms are low

Ankle fracture fixation example shows ethical constraints preventing RCTs

COMMON PITFALLS

Saying strong recommendations require high-quality evidence always

Not explaining the difference between evidence quality and recommendation strength

Not mentioning benefit-harm balance and patient values

FURTHER QUESTIONS

"When would you make a weak recommendation from high-quality evidence?"

"What are the five GRADE factors that downgrade evidence quality?"

"How does patient preference influence recommendation strength?"

CLINICAL SCENARIO: Challenging

Scenario 3: Choosing the Right Design and the Limits of the Hierarchy

CLINICAL PROMPT

"An examiner says: 'A registry of 200,000 hip replacements shows one cemented stem has a much higher revision rate than its competitors. A trainee argues this should be ignored because it is only Level II observational data and we have no RCT. How do you respond, and how would you design the ideal study?'"

PRACTICAL APPROACH

I would not ignore the registry signal. Although a registry is observational and nominally lower in the design hierarchy, GRADE explicitly allows observational evidence to be upgraded when there is a very large effect, a dose-response gradient, or when plausible residual confounding would only weaken the observed effect. A registry of 200,000 implants has enormous statistical power, captures real-world performance across many surgeons, and detects rare late failures that a typical RCT would be far too small and too short to find. For implant survival, a randomised trial powered to detect a difference in revision at ten years would need many thousands of patients followed for a decade, which is rarely feasible and may be unethical once a registry signal of harm exists. I would still appraise the registry critically: is the difference adjusted for confounders such as age, sex, diagnosis, fixation, and surgeon volume? Is there confounding by indication? Is the comparison consistent across other national registries such as the NJR, Swedish, and Norwegian registries? Consistent signals across independent registries strongly support causality. The ideal complementary study is not necessarily an RCT but a well-adjusted multi-registry cohort with competing-risks analysis for revision, supplemented by patient-reported outcomes. The teaching point is that the level of evidence is a starting point, not a verdict: design must match the question, and for long-term implant survival a large prospective cohort or registry is often the most valid feasible evidence.

KEY CLINICAL POINTS

Registry data is observational but can be upgraded under GRADE for very large effect

Registries detect rare, late failures that underpowered short RCTs miss

Critically appraise for confounding by indication and check consistency across multiple registries

Match study design to the question: for implant survival a large cohort/registry can be the best feasible evidence

Level of evidence is a starting point, not a final verdict on validity

COMMON PITFALLS

Dismissing high-powered registry data purely because it is observational

Insisting an RCT is always the correct answer regardless of feasibility or ethics

Forgetting confounding by indication when interpreting registry comparisons

Not mentioning competing-risks (mortality) when analysing revision rates

FURTHER QUESTIONS

"What is confounding by indication and how does it threaten registry comparisons?"

"Why is competing-risks analysis important when reporting implant revision rates?"

"Name the three GRADE criteria that can upgrade observational evidence."

MCQ Practice Points

Level I Evidence Question

Q: Which of the following is required for an RCT to be considered Level I evidence?
A: All of the following: Adequate randomization and allocation concealment, blinding of participants and assessors, intention-to-treat analysis, less than 20 percent loss to follow-up, and adequate sample size with power calculation. A poorly conducted RCT with high attrition or lack of blinding is downgraded to Level II.
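
As an illustration of the power calculation mentioned in this answer, the sketch below uses the standard normal-approximation formula for comparing two proportions and then inflates the result for expected loss to follow-up. The function names, the event rates, and the default alpha and power are assumptions for illustration, not values taken from any trial cited here.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Patients per arm to detect a difference between two proportions.

    Uses the normal approximation with a two-sided significance level.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2
    return math.ceil(n)


def inflate_for_attrition(n, expected_loss=0.20):
    """Increase enrolment so the target n remains after loss to follow-up."""
    return math.ceil(n / (1 - expected_loss))


# Hypothetical trial: complication rate expected to fall from 20% to 10%.
n = sample_size_per_group(0.20, 0.10)
print("Per-arm sample size:", n)
print("Per-arm enrolment allowing 20% loss:", inflate_for_attrition(n))
```

Halving the expected difference roughly quadruples the required sample, which is one reason many surgical RCTs are underpowered.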

GRADE Downgrade Factors

Q: What are the five factors that downgrade evidence quality in the GRADE system?
A: RIIIP: Risk of bias, Inconsistency (heterogeneity across studies), Indirectness (PICO mismatch), Imprecision (wide confidence intervals), and Publication bias. Each factor can downgrade by 1 or 2 levels.

Question Type and Design

Q: What is the best study design for answering a prognostic question about fracture healing?
A: Prospective cohort study. For prognostic questions, cohort studies are superior to RCTs because you follow natural history without intervention. RCTs are best for therapeutic questions, not prognosis.

Guidelines, Registries & Global Practice

Evidence-Grading Systems Used Worldwide

Different bodies grade evidence and recommendations differently. Knowing which system a guideline uses is essential to interpret its recommendations correctly across exam jurisdictions (FRCS, FRACS, EBOT/FEBOT, ABOS, DNB/MS, MRCS, SICOT).

Major Evidence and Recommendation Frameworks

System / Body	Region	What it grades	Key feature
GRADE (GRADE Working Group)	Global (WHO, Cochrane, NICE, BOA)	Evidence quality + recommendation strength	Separates confidence in estimate from should-we-act; most widely adopted
Oxford CEBM Levels (2011)	UK / international	Design-based level by question type	Separate tables for treatment, diagnosis, prognosis, screening
JBJS / OrthoEvidence Levels I-V	Orthopaedic journals globally	Study design level (therapeutic/prognostic/diagnostic/economic)	Article-label convention; level shown in abstract
AAOS Clinical Practice Guidelines	USA	Strength of recommendation (Strong/Moderate/Limited/Consensus)	Built on systematic review with explicit appraisal
NICE / SIGN methodology	UK	GRADE-based evidence and recommendation grading	Health-economic modelling integrated into recommendations

Side-by-Side Society Approaches

AAOS (US) publishes CPGs and Appropriate Use Criteria, rating each recommendation Strong, Moderate, Limited, or Consensus based on the quality and consistency of the underlying evidence.
BOA / BOAST (UK) standards are pragmatic, consensus-and-evidence based, and increasingly cite GRADE-rated NICE guidance where available.
AO Foundation education and guidance are largely expert-consensus and principle-based, explicitly acknowledging limited Level I evidence for many fracture-fixation decisions.
EFORT / European national societies generally follow GRADE methodology for formal guidelines while recognising registry data as key observational evidence.

Registry Evidence as High-Quality Observational Data

Large arthroplasty registries are the prime real-world example of observational evidence that can be upgraded under GRADE when effects are large and consistent; their very large samples also limit imprecision:

AOANJRR (Australia), NJR (England, Wales, NI and IoM), AJRR (US), Swedish (SHAR) and Norwegian registries provide implant-survival and revision-rate data that no RCT could feasibly generate.
Registry signals (for example, early failure of specific implant designs) have changed practice faster than trials could, illustrating when robust observational data legitimately drives strong recommendations.
Limitations remain: confounding by indication, surgeon and patient selection, and outcome restricted largely to revision rather than patient-reported outcomes.

High- vs Limited-Resource Practice Variation

In high-resource settings, guideline-concordant care can rely on RCTs, meta-analyses, and registry feedback loops.
In limited-resource settings, Level I evidence may be unavailable or non-applicable (different implants, case-mix, and follow-up capacity); well-conducted local cohorts and pragmatic adaptation of global guidelines are appropriate.
The principle is constant worldwide: integrate the best available external evidence with clinical expertise and patient values rather than apply a single hierarchy mechanically.

Why This Matters in the Exam

Levels of evidence and GRADE are core research-methodology topics across all major fellowship exams.
Vivas commonly test the ability to assign a level to a described study, identify its dominant bias, and apply the RIIIP downgrade factors.
Examiners expect candidates to translate an evidence level into a defensible treatment recommendation, acknowledging uncertainty when evidence is weak.

Management Algorithm

LEVELS OF EVIDENCE

Clinical summary

Evidence Levels (Therapeutic)

• Level I = High-quality RCT or SR of RCTs
• Level II = Lesser RCT or Prospective Cohort
• Level III = Case-Control or Retrospective Cohort
• Level IV = Case Series (no control)
• Level V = Expert Opinion (lowest)

Question-Specific Best Evidence

• Therapeutic question = RCT gold standard
• Prognostic question = Cohort study best
• Diagnostic question = Cross-sectional with reference standard
• Economic question = Cost-effectiveness analysis
• Hierarchy differs by question type

GRADE System

• GRADE assesses quality (High/Moderate/Low/Very Low) AND strength (Strong/Weak)
• Start with RCT = High quality; Observational = Low quality
• Downgrade for: RIIIP (Risk of bias, Inconsistency, Indirectness, Imprecision, Publication bias)
• Upgrade for: Large effect, Dose-response, plausible residual confounding that would have reduced the observed effect
• Strong recommendation can come from moderate evidence if large effect

Level I Criteria (RCT)

• Adequate randomization and allocation concealment
• Blinding of participants and assessors
• Intention-to-treat analysis
• Less than 20% loss to follow-up
• Adequate power (sample size calculation)

Common Pitfalls

• RCT design does NOT automatically equal Level I (must meet quality criteria)
• SR quality depends on included studies (SR of poor RCTs is not Level I)
• Case-control overestimates diagnostic test accuracy (spectrum bias)
• Cannot establish causality from case series (no comparison group)
• Observational studies CAN provide high-quality evidence if very large effect

Level	Study Design	Quality Criteria	Example
Level I	High-quality RCT or SR of Level I RCTs	Randomization, allocation concealment, blinding, greater than 80% follow-up, ITT analysis	HEALTH trial: THA vs Hemi for femoral neck fracture
Level II	Lesser-quality RCT, Prospective Cohort, SR of Level II	RCT with methodological flaws OR well-designed cohort	Registry study comparing surgical approaches
Level III	Case-Control, Retrospective Cohort	Observational with comparison, prone to confounding	Case-control of AVN risk factors
Level IV	Case Series	No comparison group, descriptive only	Series of 50 arthroscopic rotator cuff repairs
Level V	Expert Opinion	Lowest level, based on experience	Editorial on surgical technique preferences

Starting Point	Downgrade For	Upgrade For	Final Quality
RCT = HIGH	Risk of bias, Inconsistency, Indirectness, Imprecision, Publication bias (each -1 or -2)	Large effect (+1, or +2 if very large), Dose-response gradient (+1), Plausible residual confounding that would have reduced the observed effect (+1)	High / Moderate / Low / Very Low
Observational = LOW	Same downgrade factors as above	Same upgrade factors, often applied to cohort studies	Can upgrade to Moderate or even High with large effect
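
The start-then-adjust logic in the table can be sketched as a small tally. This is an illustrative simplification, not an official GRADE tool: the function name, the dictionary keys, and the example inputs are assumptions made for the sketch, and real GRADE ratings rest on panel judgement rather than arithmetic alone.

```python
# Ordered from lowest to highest certainty; the list index acts as the score.
LEVELS = ["Very Low", "Low", "Moderate", "High"]

# Starting points from the table: RCT = High, observational = Low.
START = {"rct": 3, "observational": 1}


def grade_certainty(design, downgrades=None, upgrades=None):
    """Return the final certainty label for one body of evidence.

    design     -- "rct" or "observational"
    downgrades -- dict of downgrade domain -> levels removed (1 or 2)
    upgrades   -- dict of upgrade factor -> levels added
    """
    down = sum((downgrades or {}).values())
    up = sum((upgrades or {}).values())
    # Clamp so the result stays on the four-level scale.
    score = max(0, min(len(LEVELS) - 1, START[design] - down + up))
    return LEVELS[score]


# Hypothetical inputs, for illustration only.
print(grade_certainty("rct", downgrades={"risk_of_bias": 1, "imprecision": 1}))
print(grade_certainty("observational", upgrades={"large_effect": 1}))
```

With these made-up inputs, an RCT body of evidence downgraded once for risk of bias and once for imprecision lands two levels below High, and an observational body upgraded once for a large effect lands one level above Low.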

Design	Direction	Comparison group	Best for	Dominant bias / limitation
Randomised controlled trial	Prospective, allocation by chance	Yes - randomised arms	Therapeutic (treatment effect)	Attrition and lack of blinding can downgrade; may lack external validity
Prospective cohort	Forward in time from exposure	Yes - exposed vs unexposed	Prognosis, harm, natural history	Confounding; loss to follow-up
Retrospective cohort	Backward using existing records	Yes - exposed vs unexposed	Harm with long latency	Confounding and data-quality / measurement bias
Case-control	Backward from outcome to exposure	Yes - cases vs controls	Rare outcomes, multiple exposures	Recall and selection bias; gives odds ratio not risk
Cross-sectional	Single time point	Sometimes	Prevalence, diagnostic accuracy	Cannot establish temporality
Case series / case report	Descriptive	No comparator	Hypothesis generation, rare conditions	No control - cannot infer causation; selection bias
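
The note that a case-control study "gives odds ratio not risk" can be illustrated with a worked 2x2 table. The sketch below is a hedged example with entirely hypothetical counts. It assumes the standard definitions: the odds ratio is the cross-product ad/bc, while relative risk needs true incidence in each exposure group, which only a cohort or trial supplies.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table.

    a -- exposed, with outcome      b -- exposed, without outcome
    c -- unexposed, with outcome    d -- unexposed, without outcome
    """
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Relative risk; only meaningful when rows are followed-up exposure groups
    (cohort or RCT), because it needs the incidence in each group."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


# Hypothetical cohort: 20 of 100 exposed and 10 of 100 unexposed develop the outcome.
print(f"OR = {odds_ratio(20, 80, 10, 90):.2f}")
print(f"RR = {relative_risk(20, 80, 10, 90):.2f}")
```

In a case-control study the investigator fixes how many cases and controls are sampled, so the row totals do not reflect incidence and only the odds ratio can be estimated. The odds ratio approximates the relative risk when the outcome is rare.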

Introducing Levels of Evidence to the Journal (JBJS framework)

Guideline

Wright JG, Swiontkowski MF, Heckman JD • J Bone Joint Surg Am (2003)

Key Findings:

Editorial that formally introduced the levels-of-evidence rating system to JBJS (vol 85-A, p1-3)
Adapted the system to provide separate hierarchies for therapeutic, prognostic, diagnostic, and economic/decision-analysis questions
Defined Level I as high-quality RCT or systematic review of Level I RCTs, descending to Level V (expert opinion)
Adopted as a journal policy requiring an evidence level to accompany each clinical article

Clinical Implication: Established the labelling convention now ubiquitous across orthopaedic journals, letting readers gauge study design at a glance.

Limitation: A simplified design-based hierarchy that, unlike GRADE, does not separately score imprecision, inconsistency, or publication bias.

Verify on PubMed (PMID 12533564)

GRADE: An Emerging Consensus on Rating Quality of Evidence and Strength of Recommendations

Guideline

Guyatt GH, Oxman AD, Vist GE, et al • BMJ (2008)

Key Findings:

Landmark consensus article describing the GRADE approach to rating evidence and recommendations
Separates quality of evidence (High/Moderate/Low/Very Low) from strength of recommendation (Strong/Weak)
RCTs start as high quality and observational studies as low, then move up or down on explicit criteria
Now adopted by the WHO, Cochrane, NICE, and over 100 organisations worldwide

Clinical Implication: Provides the transparent, reproducible framework underpinning most modern orthopaedic and general clinical guidelines.

Limitation: Resource-intensive: requires a systematic review and a trained panel to apply rigorously.

Verify on PubMed (PMID 18436948)

Randomized, Controlled Trials, Observational Studies, and the Hierarchy of Research Designs

Concato J, Shah N, Horwitz RI • N Engl J Med (2000)

Key Findings:

Compared meta-analyses of RCTs with meta-analyses of cohort/case-control studies on the same five clinical topics (99 reports)
Well-designed observational studies did NOT systematically overestimate treatment effects versus RCTs
Point estimates were similar (e.g. BCG vaccine: RCT relative risk 0.49 vs case-control odds ratio 0.50)
The range of estimates was actually wider for the RCTs than the observational studies

Clinical Implication: Challenges a rigid design-only hierarchy and supports GRADE's principle that high-quality observational data can be upgraded.

Limitation: Restricted to topics where both RCT and observational meta-analyses existed; does not negate confounding risk in individual poor-quality observational studies.

Verify on PubMed (PMID 10861325)

Does a Level I Evidence Rating Imply High Quality of Reporting in Orthopaedic RCTs?

Poolman RW, Struijs PA, Krips R, Sierevelt IN, Lutz KH, Bhandari M • BMC Med Res Methodol (2006)

Key Findings:

Assessed 32 RCTs in JBJS-Am (2003-2004, 3543 patients) using the Cochrane reporting-quality tool
Studies labelled Level I and Level II had low and statistically indistinguishable reporting-quality scores
Item-level correlations between evidence level and reporting quality ranged from only 0.0 to 0.2
Concluded a Level I label does NOT guarantee high methodological reporting quality

Clinical Implication: Reinforces the core exam point: the assigned level is a label, not a substitute for appraising individual methodological safeguards.

Limitation: Single-journal sample over a two-year window; reporting quality is a proxy for, not identical to, internal validity.

Verify on PubMed (PMID 16965628)

CONSORT 2010 Statement: Updated Guidelines for Reporting Parallel Group Randomised Trials

Guideline

Schulz KF, Altman DG, Moher D • Ann Intern Med (2010)

Key Findings:

Provides the internationally endorsed 25-item checklist and flow diagram for reporting RCTs
Specifies reporting of randomisation, allocation concealment, blinding, and participant flow
Used by journals worldwide as a condition of publication for randomised trials
Directly maps to the quality criteria distinguishing a true Level I RCT from a downgraded one

Clinical Implication: A CONSORT-compliant report lets the appraiser verify the randomisation, blinding, and attrition that define Level I evidence.

Limitation: Improves transparency of reporting, not the underlying conduct of the trial; a well-reported trial can still be biased.

Verify on PubMed (PMID 20335313)

Evidence Based Medicine: What It Is and What It Isn't

Guideline

Sackett DL, Rosenberg WM, Gray JA, Haynes RB, Richardson WS • BMJ (1996)

Key Findings:

Seminal editorial defining evidence-based medicine as integrating best external evidence with clinical expertise and patient values
Clarified that EBM is not 'cookbook' medicine and does not ignore individual clinical judgement
Emphasised that the best external evidence may come from designs other than RCTs depending on the question
Established the conceptual foundation on which evidence hierarchies and GRADE were later built

Clinical Implication: Frames evidence levels as one input into clinical decisions, not a mechanical rule that overrides patient context.

Limitation: A conceptual editorial; it predates and does not provide the structured grading later supplied by GRADE.

Verify on PubMed (PMID 8555924)

System / Body	Region	What it grades	Key feature
GRADE (GRADE Working Group)	Global (WHO, Cochrane, NICE, BOA)	Evidence quality + recommendation strength	Separates confidence in estimate from should-we-act; most widely adopted
Oxford CEBM Levels (2011)	UK / international	Design-based level by question type	Separate tables for treatment, diagnosis, prognosis, screening
JBJS / OrthoEvidence Levels I-V	Orthopaedic journals globally	Study design level (therapeutic/prognostic/diagnostic/economic)	Article-label convention; level shown in abstract
AAOS Clinical Practice Guidelines	USA	Strength of recommendation (Strong/Moderate/Limited/Consensus)	Built on systematic review with explicit appraisal
NICE / SIGN methodology	UK	GRADE-based evidence and recommendation grading	Health-economic modelling integrated into recommendations

Level	Study Design	Quality Criteria	Example
Level I	Prospective Cohort with independent, blinded reference standard	Consecutive patients, index test blinded to reference, reference blinded to index	MRI vs arthroscopy (gold standard) for meniscal tears
Level II	Retrospective Cohort, or cohort with minor flaws	Non-consecutive patients OR lack of blinding	Retrospective chart review of X-ray accuracy
Level III	Case-Control	Cases with disease vs healthy controls (inflates accuracy)	MRI in known ACL tears vs normal knees
Level IV	Case Series or poor reference standard	No independent reference, verification bias	Series of positive tests without verification

Level	Study Design	Quality Criteria	Example
Level I	High-quality Prospective Cohort, SR of Level I studies	Inception cohort, greater than 80% follow-up, uniform outcome assessment	AOANJRR: Long-term implant survival cohort
Level II	Lesser-quality Cohort, Retrospective Cohort, Untreated controls from RCT	Less complete follow-up OR retrospective design	Hospital registry of fracture healing rates
Level III	Case-Control	Retrospective comparison, recall bias	Cases with nonunion vs controls
Level IV	Case Series	No comparison, descriptive outcomes	Series reporting complications after surgery
