Reliability and Validity (Kappa, ICC, Bland-Altman)

Fellowship-level guide to reliability and validity in orthopaedic research: the distinction (precision vs accuracy), intra- and inter-observer reliability, Cohen's/Fleiss' kappa for categorical agreement (with the Landis-Koch interpretation), the intraclass correlation coefficient for continuous data, the Bland-Altman plot for method comparison, why correlation is not agreement, and the validity types - applied to fracture classifications and outcome measures.

High-yield overview

Kappa, ICC and Bland-Altman

Reliability: Precision / reproducibility

Validity: Accuracy (measures the truth)

Kappa / ICC: Categorical / continuous agreement

Bland-Altman: Method comparison (not correlation)

Agreement statistics by data type

Cohen's kappa

Pattern: Agreement for CATEGORICAL/nominal data between TWO raters, correcting for chance (Fleiss' kappa for more than 2 raters; weighted kappa for ORDINAL).

Intraclass correlation coefficient (ICC)

Pattern: Agreement/reliability for CONTINUOUS data (e.g. angle measurements); 0-1 (above 0.75 good, above 0.9 excellent).

Bland-Altman plot

Pattern: Compares two CONTINUOUS measurement methods - plots difference vs mean, showing BIAS (mean difference) and LIMITS OF AGREEMENT (mean difference +/- 1.96 SD of the differences).

Sensitivity/specificity, etc.

Pattern: Criterion VALIDITY against a gold standard (see our Diagnostic Test Statistics topic).

Critical Must-Knows

RELIABILITY is reproducibility or PRECISION - how consistently a measurement gives the same result; VALIDITY is ACCURACY - whether it measures what it is supposed to. They are distinct properties: a measurement can be RELIABLE WITHOUT being VALID (consistently wrong, a systematic bias), so reliability is necessary but NOT sufficient for validity (the dartboard analogy - tight grouping vs hitting the bullseye).
Reliability has three forms: INTRA-OBSERVER (the same rater repeating a measurement), INTER-OBSERVER (different raters), and TEST-RETEST (the same instrument over time); orthopaedic classification systems and outcome measures must be tested for these.
For CATEGORICAL data (e.g. a fracture classification), agreement is measured by COHEN'S KAPPA between two raters - which corrects for the agreement expected by CHANCE - with FLEISS' kappa for more than two raters and WEIGHTED kappa for ORDINAL categories; the conventional LANDIS-KOCH interpretation is: below 0 poor, 0.00-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1.00 almost perfect agreement.
For CONTINUOUS data (e.g. an angle or a length measurement), reliability is measured by the INTRACLASS CORRELATION COEFFICIENT (ICC), which is usually reported on a 0 to 1 scale (broadly above 0.75 is good and above 0.9 excellent), and method comparison is best shown with a BLAND-ALTMAN PLOT, which plots the DIFFERENCE between two methods against their MEAN to reveal systematic BIAS and the LIMITS OF AGREEMENT (mean difference +/- 1.96 standard deviations of the differences).
A key trap is that CORRELATION (e.g. Pearson's r) is NOT agreement: two methods can correlate almost perfectly yet disagree systematically (one always reads higher), so r should NOT be used to claim two methods/raters agree - use the ICC or a Bland-Altman plot instead.
Applied to orthopaedics: classification systems are valuable as a SHARED LANGUAGE but have LIMITED stand-alone reliability - increasing the number of categories/subcategories consistently REDUCES kappa/ICC, while a brief rater CALIBRATION session improves agreement; report agreement statistics with confidence intervals and choose the statistic that matches the data type.

Clinical Pearls

Reliability = precision (reproducible); Validity = accuracy (true). Reliable can be invalid (systematic bias) - dartboard analogy.

Categorical agreement = kappa (Cohen's 2 raters, Fleiss' for more than 2, weighted for ordinal); Landis-Koch: 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.0 almost perfect.

Continuous: ICC for reliability; Bland-Altman (difference vs mean; bias + limits of agreement) for method comparison. Correlation is NOT agreement. More categories -> lower kappa.

Reliable Does Not Mean Valid

Reliability (precision)

Consistent, reproducible results. A reliable but invalid measure is consistently wrong (tight cluster, off the bullseye - systematic bias).

Validity (accuracy)

Measures the true value. Reliability is necessary but not sufficient for validity - you need both to hit the bullseye.

Reliability vs Validity

Precision and Accuracy Are Different

Reliability (reproducibility, precision) asks whether repeated measurements agree with each other; validity (accuracy) asks whether the measurement reflects the true value. The classic dartboard analogy makes the relationship clear: tight grouping = reliable, hitting the centre = valid. A measurement can be reliable but not valid - tightly clustered but systematically off-target (a bias) - so reliability is necessary but not sufficient for validity. Reliability comes in three forms - intra-observer (same rater repeated), inter-observer (different raters) and test-retest (over time) - all of which matter when validating a classification system or an outcome measure.
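As a rough numerical sketch of the dartboard analogy, repeated readings can be summarised by their bias (mean minus the true value, which speaks to validity) and their spread (standard deviation of the repeats, which speaks to reliability). The true angle and both sets of readings below are invented for illustration and are not taken from any study.

```python
from statistics import mean, stdev

def bias_and_spread(readings, true_value):
    """Return (bias, spread) for repeated readings of one quantity."""
    bias = mean(readings) - true_value  # systematic error: accuracy / validity
    spread = stdev(readings)            # sample SD of repeats: precision / reliability
    return bias, spread

TRUE_ANGLE = 30.0  # hypothetical true angle in degrees

# Hypothetical repeated readings of the same angle
tight_but_off = [35.1, 34.9, 35.0, 35.2, 34.8]          # reliable but not valid
scattered_but_centred = [25.0, 35.0, 28.0, 32.0, 30.0]  # centred on average, not reliable

for label, readings in [("tight but off-centre", tight_but_off),
                        ("scattered but centred", scattered_but_centred)]:
    bias, spread = bias_and_spread(readings, TRUE_ANGLE)
    print(f"{label}: bias = {bias:+.2f} deg, SD = {spread:.2f} deg")
```

The first set has a small SD but a bias of about 5 degrees (the tight cluster off the bullseye); the second has almost no bias but a large SD.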

Figure: The dartboard analogy of reliability vs validity, shown as four targets. Reliable AND valid: shots tight and centred. Reliable but NOT valid: tight but off-centre (a systematic bias). Valid but NOT reliable: scattered but centred on average. Neither: scattered and off-centre. Reliability is necessary but not sufficient for validity.
Credit: OrthoVellum (AI-generated schematic)

Agreement Statistics

Choose the Statistic for the Data Type

Categorical data (e.g. a fracture classification): Cohen's KAPPA for two raters (corrects for chance agreement), Fleiss' kappa for more than two raters, and weighted kappa for ordinal categories. Interpret with Landis-Koch: below 0 poor, 0.00-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect.
Continuous data (e.g. angle/length measurements): the INTRACLASS CORRELATION COEFFICIENT (ICC) (0-1; broadly above 0.75 good, above 0.9 excellent) for reliability.
Method comparison: the BLAND-ALTMAN plot - plot the difference between two methods against their mean, revealing systematic bias (the mean difference) and the limits of agreement (mean difference plus or minus 1.96 standard deviations of the differences).
Do NOT use correlation (Pearson's r) to claim agreement - two methods can correlate strongly yet disagree systematically; use ICC or Bland-Altman.
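The chance correction behind Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's category frequencies, can be sketched in a few lines. This is a minimal illustration, not a validated implementation: the function names and the ten ratings are hypothetical, and published analyses would normally use a statistics package and report a confidence interval.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal categories to the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must rate the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: proportion of cases given the same category
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over categories
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)

def landis_koch(kappa):
    """Landis-Koch verbal grade for a kappa value."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

# Hypothetical ratings of ten fractures into types A, B and C
rater_1 = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"]
rater_2 = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "A"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f} ({landis_koch(kappa)})")
```

Here the raters agree on 7 of 10 cases (p_o = 0.70), but 0.34 of that is expected by chance alone, so kappa is about 0.55 (moderate) rather than 0.70.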

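Which agreement statistic?
Data type / question	Statistic	Notes
Categorical (nominal), two raters	Cohen's kappa	Corrects for chance agreement; interpret with Landis-Koch grades
Categorical, more than two raters	Fleiss' kappa	Chance-corrected agreement across multiple raters
Ordinal categories	Weighted kappa	Penalises disagreements by how far apart the categories are
Continuous, reliability	ICC	0-1; above 0.75 good, above 0.9 excellent
Continuous, method comparison	Bland-Altman plot	Difference vs mean; bias and limits of agreement; not correlation
Criterion validity against a gold standard	Sensitivity/specificity	See our Diagnostic Test Statistics topic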
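For method comparison, the Bland-Altman quantities and the correlation trap can be shown side by side. The sketch below uses invented angle readings in which a hypothetical method B reads about 5 degrees higher than method A; the differences are taken as B minus A, which is a convention chosen here. Plotting is left out: the points to plot would be each pair's mean on the x-axis against its difference on the y-axis.

```python
from math import sqrt
from statistics import mean, stdev

def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) using differences B minus A."""
    diffs = [b - a for a, b in zip(method_a, method_b)]
    bias = mean(diffs)   # systematic difference between methods
    sd = stdev(diffs)    # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def pearson_r(x, y):
    """Pearson correlation coefficient, written out for transparency."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired angle measurements (degrees) on six patients
method_a = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
method_b = [15.5, 16.5, 19.5, 20.5, 23.5, 24.5]

bias, lower, upper = bland_altman(method_a, method_b)
r = pearson_r(method_a, method_b)
print(f"Pearson r = {r:.3f}")
print(f"bias = {bias:.2f} deg, limits of agreement = {lower:.2f} to {upper:.2f} deg")
```

With these numbers r is about 0.99, yet method B reads 5 degrees higher on average with limits of agreement of roughly 3.9 to 6.1 degrees, so the two methods correlate almost perfectly without agreeing.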

Validity Types & Classification Reliability

Validity Types and the Reliability of Classifications

Validity has several forms: face (does it look reasonable), content (does it cover the construct), construct (does it behave as theory predicts), and criterion validity - concurrent (agrees with a gold standard now) and predictive (predicts a future outcome); sensitivity/specificity against a gold standard are criterion validity (see our Diagnostic Test Statistics topic). In orthopaedics, classification systems (e.g. fracture classifications) are tested for inter- and intra-observer reliability using kappa/ICC; the evidence shows their stand-alone reliability is often only fair-to-moderate, that increasing granularity (more categories) lowers kappa/ICC, and that rater calibration improves agreement - which is why classifications are best used as a shared language and research scaffold alongside specific radiographic thresholds and patient factors, rather than as the sole basis for decisions.

Evidence & Key Studies

Evidence

Distal radius fracture classifications in real life: reliability and how they change treatment

Level of evidence: III

Nguyen SA, Dang AH, Tran DQ • Journal of Hand Surgery Global Online (2025)

Key Findings:

Interobserver agreement for distal radius fracture classifications was typically fair-to-moderate on radiographs, with only modest improvement on CT.
Increasing granularity (more categories/subcategories) consistently REDUCED kappa/ICC, whereas a brief rater calibration session improved agreement.
Classifications remain valuable as a shared language but have limited stand-alone reliability and prognostic power, best combined with instability thresholds and patient factors.

Verify on PubMed (PMID 41542021)

Evidence

Zero echo time MRI vs CT in intra-articular distal radius fractures: inter/intraobserver agreement

Kaymakoglu M, Kolac UC, Bahsi A, et al. • Injury (2025)

Key Findings:

Inter- and intraobserver agreement were quantified with Cohen's and Fleiss' kappa and intraclass correlation coefficients.
Classification agreement was 'good' (kappa about 0.68-0.78), with surgeons agreeing more than radiologists; continuous measures showed good ICC for angulation (about 0.76-0.86) but lower for inclination.
Illustrates the practical use and interpretation of kappa (categorical) and ICC (continuous) for reliability.

Verify on PubMed (PMID 41187521)

Evidence Attribution

According to PubMed, the fair-to-moderate reliability of fracture classifications, the reduction of kappa/ICC with greater granularity and the benefit of rater calibration come from the cited Nguyen review, and the worked use of Cohen's/Fleiss' kappa and ICC (with interpretive values) from the cited Kaymakoglu study. The reliability-versus-validity distinction, the Landis-Koch kappa grades, the ICC and the Bland-Altman method-comparison approach are standard, well-established statistical teaching. (See also our Diagnostic Test Statistics, Measures of Effect and Study Design topics.)

Clinical Decision Scenarios

Practise clinical reasoning and management decisions out loud

Viva scenario: Standard

Clinical prompt

“What is the difference between reliability and validity, and which statistics would you use to assess the reliability of a fracture classification and of an angle measurement?”

Practical approach

Reliability is reproducibility or precision - how consistently a measurement gives the same result on repetition - whereas validity is accuracy, whether the measurement reflects the true value. They are distinct properties: a measurement can be reliable without being valid if it is consistently wrong, that is systematically biased, which the dartboard analogy captures - tight grouping is reliable, hitting the bullseye is valid, and you need both. Reliability is therefore necessary but not sufficient for validity.

To assess the reliability of a fracture classification, which is categorical data, I would use Cohen's kappa for agreement between two raters, which corrects for the agreement expected by chance, with Fleiss' kappa if there are more than two raters and weighted kappa if the categories are ordinal; I would interpret the kappa using the Landis-Koch grades, where 0.21 to 0.40 is fair, 0.41 to 0.60 moderate, 0.61 to 0.80 substantial, and 0.81 to 1.00 almost perfect, and I would test both inter-observer and intra-observer reliability.

For a continuous measurement such as an angle, I would use the intraclass correlation coefficient, where above 0.75 is good and above 0.9 excellent, and to compare two measurement methods I would use a Bland-Altman plot, which plots the difference between the methods against their mean to show systematic bias and the limits of agreement, rather than a correlation coefficient, because correlation is not agreement.
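For the angle measurement, a single-measure ICC can be sketched from a two-way analysis of variance. The text above does not say which ICC form is meant, so this example assumes the Shrout-Fleiss forms ICC(2,1) (absolute agreement) and ICC(3,1) (consistency); the function name and the ratings are hypothetical, and a real analysis would use a statistics package and report confidence intervals.

```python
def icc_single_measure(ratings):
    """Return (ICC(2,1) absolute agreement, ICC(3,1) consistency).

    ratings: one row per subject, one column per rater (no missing values).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Mean squares for subjects (rows), raters (columns) and residual error
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum((ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
                 for i in range(n) for j in range(k))
    ms_err = ss_err / ((n - 1) * (k - 1))

    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    return agreement, consistency

# Hypothetical angles (degrees): rater 2 always reads 4 degrees higher than rater 1
angles = [[10, 14], [12, 16], [14, 18], [16, 20], [18, 22]]

agreement, consistency = icc_single_measure(angles)
print(f"ICC(2,1) absolute agreement = {agreement:.2f}")
print(f"ICC(3,1) consistency        = {consistency:.2f}")
```

In this constructed case the raters rank every patient identically, so the consistency form is 1.00, while the absolute-agreement form drops to about 0.56 because it penalises the systematic 4-degree offset. This is one reason to state which ICC form was used when reporting reliability.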

Key clinical points

Reliability = precision/reproducibility; validity = accuracy; reliable can be invalid (bias) - dartboard analogy

Categorical (classification): Cohen's kappa (Fleiss for more than 2, weighted for ordinal); Landis-Koch grades

Continuous (angle): ICC (above 0.75 good, above 0.9 excellent)

Method comparison: Bland-Altman (bias + limits of agreement), NOT correlation

Common pitfalls

Using correlation (Pearson r) to claim two methods/raters agree

Forgetting kappa corrects for chance agreement

Confusing reliability with validity

Further questions

“What does kappa correct for? - Agreement expected by chance”

“Why not use correlation for method comparison? - Methods can correlate yet disagree systematically; use Bland-Altman/ICC”

Viva scenario: Standard

Clinical prompt

“Why do more detailed (granular) classification systems tend to be less reliable, and how can agreement be improved?”

Practical approach

More detailed classification systems tend to be less reliable because adding categories and subcategories increases the number of choices a rater must discriminate between, and many of those distinctions are subtle and subjective, so raters disagree more often and the agreement statistic (kappa or the intraclass correlation coefficient) falls. The evidence on fracture classifications shows this consistently: the more granular the system, the lower the kappa, and agreement is only fair to moderate for many radiograph-based classifications, with computed tomography giving only modest improvement.

Agreement can be improved in several ways: simplifying the classification or grouping it into fewer, more clinically meaningful categories; providing clear definitions and decision rules; using better imaging where it helps; and, importantly, a brief rater calibration or training session before assessment, which has been shown to improve agreement. In practice this is why classifications are best treated as a shared language and a research scaffold rather than the sole basis for treatment, and why decisions should also rest on specific measurable thresholds, such as articular step-off or angulation, and on patient factors. When reporting reliability I would quote kappa or the intraclass correlation coefficient with confidence intervals and state whether it is inter- or intra-observer.
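The effect of granularity can be illustrated with a small constructed example: two raters use a hypothetical four-subtype system (A1, A2, B1, B2), and the same ratings are then collapsed to the two main types (A, B). The labels and ratings are invented to show the mechanism and are not data from the cited studies.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical subtype ratings of twelve fractures
rater_1 = ["A1", "A1", "A1", "A2", "A2", "A2", "B1", "B1", "B1", "B2", "B2", "B2"]
rater_2 = ["A1", "A1", "A2", "A2", "A2", "A1", "B1", "B1", "B2", "B2", "B2", "A2"]

# Collapse each subtype to its main type by keeping the first character
coarse_1 = [label[0] for label in rater_1]
coarse_2 = [label[0] for label in rater_2]

kappa_fine = cohens_kappa(rater_1, rater_2)
kappa_coarse = cohens_kappa(coarse_1, coarse_2)
print(f"four subtypes:  kappa = {kappa_fine:.2f}")
print(f"two main types: kappa = {kappa_coarse:.2f}")
```

Three of the four disagreements are between subtypes of the same main type, so they disappear when the categories are grouped: kappa rises from about 0.56 with four subtypes to about 0.83 with two main types.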

Key clinical points

More categories/subcategories -> more subtle distinctions -> raters disagree -> lower kappa/ICC

Many fracture classifications only fair-moderate; CT modest improvement

Improve agreement: simplify/group categories, clear definitions, better imaging, rater calibration/training

Use classifications as shared language + objective thresholds + patient factors; report kappa/ICC with CIs

Common pitfalls

Assuming more detail = better classification

Using a classification as the sole basis for treatment

Reporting kappa without stating inter- vs intra-observer or confidence intervals

Further questions

“What single intervention improves rater agreement? - A calibration/training session”

“Effect of granularity on kappa? - More categories lowers kappa/ICC”

Mnemonics & Memory Aids

Mnemonic

PRECISE vs TRUE

Precision = Reliability (reproducible)

Reliable can still be wrong (systematic bias)

True value = Validity (accuracy)

Both needed to hit the bullseye (dartboard)

Hook: Reliability = PRECISE; Validity = TRUE; you need both.

Mnemonic

KIB

Kappa - categorical agreement (Landis-Koch grades; corrects for chance)

ICC - continuous reliability (0-1; above 0.75 good)

Bland-Altman - method comparison (bias + limits of agreement; not correlation)

Hook: KIB: Kappa (categorical), ICC (continuous), Bland-Altman (method comparison).

Exam day cheat sheet

Reliability & Validity - Exam Day Cheat Sheet

Concepts

Reliability = precision/reproducibility; validity = accuracy
Reliable can be invalid (systematic bias) - dartboard analogy
Reliability forms: intra-observer, inter-observer, test-retest

Categorical agreement

Cohen's kappa (2 raters), Fleiss' kappa (more than 2), weighted kappa (ordinal)
Corrects for chance agreement
Landis-Koch: 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.0 almost perfect

Continuous agreement

ICC for reliability (0-1; above 0.75 good, above 0.9 excellent)
Bland-Altman for method comparison (difference vs mean; bias + limits of agreement)
Correlation (Pearson r) is NOT agreement

Validity & classifications

Validity: face, content, construct, criterion (concurrent/predictive)
More categories -> lower kappa; rater calibration improves agreement
Classifications = shared language; report kappa/ICC with CIs
