Assessing Validity and Reliability in Measurement
1. Validity and Reliability in Assessment
This work is a summarization of previous efforts by great educators.
A humble presentation by Dr Tarek Tawfik Amin
2. Measurement experts (and many educators) believe
that every measurement device should possess certain
qualities.
The two most common technical concepts in
measurement are reliability and validity.
3. Reliability Definition (Consistency)
The degree of consistency between two measures of the same thing. (Mehrens and Lehmann, 1987)
The measure of how stable, dependable, trustworthy, and consistent a test is in measuring the same thing each time. (Worthen et al., 1993)
4. Validity Definition (Accuracy)
Truthfulness: does the test measure what it purports to measure? The extent to which certain inferences can be made from test scores or other measurements. (Mehrens and Lehmann, 1987)
The degree to which tests accomplish the purpose for which they are being used. (Worthen et al., 1993)
5. The term “validity” refers to the degree to which the conclusions (interpretations) derived from the results of any assessment are “well-grounded or justifiable; being at once relevant and meaningful.” (Messick S. 1995)
The usual concepts of validity:
“Content”: related to objectives and their sampling.
“Construct”: referring to the theory underlying the target.
“Criterion”: related to concrete criteria in the real world. It can be concurrent or predictive.
“Concurrent”: correlating highly with another measure already validated.
“Predictive”: capable of anticipating some later measure.
“Face”: related to the test’s overall appearance.
8. All assessments in medical education require
evidence of validity to be interpreted meaningfully.
In contemporary usage, all validity is construct
validity, which requires multiple sources of evidence;
construct validity is the whole of validity, but has
multiple facets. (Downing S 2003)
9. Construct (concepts, ideas and notions)
- Nearly all assessments in medical education deal with constructs: intangible collections of abstract concepts and principles which are inferred from behavior and explained by educational or psychological theory.
- Educational achievement is a construct, inferred from performance on assessments: written tests over a domain of knowledge, oral examinations over specific problems or cases in medicine, or OSCE stations assessing history-taking or communication skills.
- Educational ability or aptitude is another example of a construct, one that may be even more intangible and abstract than achievement. (Downing 2003)
10. Sources of validity in assessment
Content: do instrument items completely represent the
construct?
Response process: the relationship between the intended
construct and the thought processes of subjects or observers
Internal structure: acceptable reliability and factor structure
Relations to other variables: correlation with scores from
another instrument assessing the same construct
Consequences: do scores really make a difference?
(Downing 2003; Cook 2007)
11. Sources of validity in assessment
Content:
- Examination blueprint
- Test specifications
- Match of item content to test specifications
- Representativeness of the test blueprint to the achievement domain
- Representativeness of items to the domain
- Logical/empirical relationship of the content tested to the achievement domain
- Quality of test questions
- Item writer qualifications
- Sensitivity review
Response process:
- Student format familiarity
- Quality control of electronic scanning/scoring
- Key validation of preliminary scores
- Accuracy in combining scores from different formats
- Quality control/accuracy of final scores/marks/grades
- Subscore/subscale analyses
- Accuracy of applying pass-fail decision rules to scores
- Quality control of score reporting
Internal structure:
- Item analysis data: item difficulty/discrimination, item/test characteristic curves, inter-item correlations, item-total correlations (point-biserial)
- Score scale reliability
- Standard errors of measurement (SEM; a worked sketch follows this slide)
- Generalizability
- Item factor analysis
- Differential Item Functioning (DIF)
Relationship to other variables:
- Correlation with other relevant variables (exams)
- Convergent correlations (internal/external: similar tests)
- Divergent correlations (internal/external: dissimilar measures)
- Test-criterion correlations
- Generalizability of evidence
Consequences:
- Impact of test scores/results on students/society
- Consequences on learners/future learning
- Reasonableness of the method of establishing the pass-fail (cut) score
- Pass-fail consequences: P/F decision reliability-accuracy; conditional standard error of measurement
- False positives/negatives
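The internal-structure column above lists the standard error of measurement (SEM). As a quick illustration of the classical test theory formula SEM = SD x sqrt(1 - reliability), here is a minimal Python sketch; the score SD and reliability values are invented for the example.

```python
# A worked sketch of the standard error of measurement (SEM) listed under
# "Internal structure" above; the SD and reliability values are invented.
import math

sd = 8.0            # standard deviation of observed test scores
reliability = 0.85  # e.g., Cronbach's alpha for the same scores

sem = sd * math.sqrt(1 - reliability)          # classical test theory formula
print(f"SEM = {sem:.2f}")                      # ~3.10 score points

# An observed score of 70 carries a ~95% band of roughly 70 +/- 1.96 * SEM
print(f"95% band: {70 - 1.96*sem:.1f} to {70 + 1.96*sem:.1f}")
```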
12. Sources of validity
1- Internal structure
Statistical evidence of the hypothesized relationship between test item scores and the construct:
1. Reliability (internal consistency): test scale reliability, rater reliability, generalizability
2. Item analysis data: item difficulty and discrimination, MCQ option function analysis, inter-item correlations (see the item-analysis sketch after this slide)
3. Scale factor structure
4. Dimensionality studies
5. Differential item functioning (DIF) studies
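To make the item-analysis entries above concrete, here is a minimal Python sketch computing item difficulty, an upper-lower discrimination index, and the item-total (point-biserial) correlation. The 0/1 response matrix is invented illustrative data, and the uncorrected total score is used for simplicity.

```python
# A minimal item-analysis sketch: difficulty, discrimination (upper vs.
# lower groups), and the point-biserial (item-total) correlation.
import numpy as np

responses = np.array([  # rows = examinees, columns = items (1 = correct)
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])

total = responses.sum(axis=1)        # each examinee's total score
difficulty = responses.mean(axis=0)  # p-value: proportion correct per item

# Discrimination index: proportion correct in the top 27% of examinees
# minus the proportion correct in the bottom 27%
order = np.argsort(total)
k = max(1, round(0.27 * len(total)))
low, high = responses[order[:k]], responses[order[-k:]]
discrimination = high.mean(axis=0) - low.mean(axis=0)

# Point-biserial: correlation of each 0/1 item with the total score
pbs = np.array([np.corrcoef(responses[:, j], total)[0, 1]
                for j in range(responses.shape[1])])

print("difficulty:", difficulty)
print("discrimination:", discrimination)
print("point-biserial:", pbs)
```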
13. Sources of validity
2- Relationship to other variables
Statistical evidence of the hypothesized relationship between test scores and the construct:
- Criterion-related validity studies (see the sketch after this slide)
- Correlations between test scores/subscores and other measures
- Convergent-divergent studies
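As a concrete illustration of the criterion-related evidence above, the following Python sketch correlates scores from a new instrument with scores from an already-validated measure of the same construct; both score lists are invented.

```python
# A minimal sketch of criterion-related evidence: correlate scores on a
# new instrument with scores on an already-validated measure.
import numpy as np

new_test = [62, 71, 55, 80, 90, 67, 74, 59]   # scores on the new instrument
criterion = [60, 75, 50, 78, 88, 70, 72, 61]  # scores on the validated measure

r = np.corrcoef(new_test, criterion)[0, 1]        # Pearson's r
print(f"test-criterion correlation r = {r:.2f}")  # high r supports convergence
```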
14. Keys of reliability assessment
“Stability”: related to time consistency.
“Internal”: related to the instrument.
“Inter-rater”: related to the examiners’ criteria.
“Intra-rater”: related to the examiner’s criterion.
Validity and reliability are closely related. A test cannot be considered valid unless the measurements resulting from it are reliable. Likewise, results from a test can be reliable and not necessarily valid.
16. Sources of reliability in assessment
Internal consistency: do all the items* on an instrument measure the same construct? (If an instrument measures more than one construct, a single score will not measure either construct very well.) We would expect high correlation between item scores measuring a single construct.
- Split-half reliability: correlation between scores on the first and second halves of a given instrument. Rarely used, because the “effective” instrument is only half as long as the actual instrument; the Spearman-Brown† formula can adjust.
- Kuder-Richardson 20: similar concept to split-half, but accounts for all items. Assumes all items are equivalent, measure a single construct, and have dichotomous responses.
- Cronbach’s alpha: a generalized form of the Kuder-Richardson formulas. Assumes all items are equivalent and measure a single construct; can be used with dichotomous or continuous data. (A short alpha computation follows this slide.)
Comments: internal consistency is probably the most commonly reported reliability statistic, in part because it can be calculated after a single administration of a single instrument. Because instrument halves can be considered “alternate forms,” internal consistency can be viewed as an estimate of parallel forms reliability.
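The definitions above translate directly into a computation. Below is a minimal Python sketch of Cronbach's alpha on an invented 0/1 response matrix; because the items are dichotomous, the result is identical to KR-20, matching the slide's note that alpha generalizes the Kuder-Richardson formulas.

```python
# Cronbach's alpha: alpha = k/(k-1) * (1 - sum(item variances) / var(total)).
# With 0/1 items this equals KR-20. Illustrative data only.
import numpy as np

scores = np.array([  # rows = examinees, columns = items
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
])

k = scores.shape[1]
item_vars = scores.var(axis=0, ddof=1)      # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores

alpha = k / (k - 1) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```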
17. Sources of reliability in assessment
Temporal stability: does the instrument produce similar results when administered a second time?
- Test-retest reliability: administer the instrument to the same person at different times. Usually quantified using correlation (eg, Pearson’s r).
Parallel forms: do different versions of the “same” instrument produce similar results?
- Alternate forms reliability: administer different versions of the instrument to the same individual at the same or different times. Usually quantified using correlation (eg, Pearson’s r).
Agreement (inter-rater reliability): when using raters, does it matter who does the rating? Is one rater’s score similar to another’s? (See the sketch after this slide.)
- Percent agreement: % identical responses. Does not account for agreement that would occur by chance.
- Kappa: agreement corrected for chance.
- Phi: simple correlation. Does not account for chance.
- Kendall’s tau: agreement on ranked data.
- Intraclass correlation coefficient: ANOVA to estimate how well ratings from different raters coincide.
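To illustrate the agreement measures in the table, here is a minimal Python sketch of percent agreement and Cohen's kappa (agreement corrected for chance) for two raters' pass/fail judgments; the ratings are invented.

```python
# Percent agreement vs. Cohen's kappa for two raters; invented data.
from collections import Counter

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "fail"]
n = len(rater_a)

# Percent agreement: share of identical responses (ignores chance agreement)
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Expected chance agreement from each rater's marginal frequencies
count_a, count_b = Counter(rater_a), Counter(rater_b)
p_expected = sum(count_a[c] * count_b[c] for c in count_a) / n**2

kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"percent agreement = {p_observed:.2f}, kappa = {kappa:.2f}")
```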
18. Sources of reliability in assessment
Generalizability theory: how much of the error in measurement is the result of each factor (eg, item, item grouping, subject, rater, day of administration) involved in the measurement process?
- Generalizability coefficient: a complex model that allows estimation of multiple sources of error.
- As the name implies, this elegant method is “generalizable” to virtually any setting in which reliability is assessed; for example, it can determine the relative contribution of internal consistency and inter-rater reliability to the overall reliability of a given instrument.
* “Items” are the individual questions on the instrument. The “construct” is what is being measured, such as knowledge, attitude, skill, or symptom in a specific area.
† The Spearman-Brown “prophecy” formula allows one to calculate the reliability of an instrument’s scores when the number of items is increased (or decreased); see the sketch below.
(Cook and Beckman, Validity and Reliability of Psychometric Instruments, 2007)
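The Spearman-Brown "prophecy" formula in the footnote is simple enough to show directly. A minimal Python sketch, with invented reliabilities and test lengths:

```python
# Spearman-Brown "prophecy" formula: predicted reliability when a test is
# lengthened (or shortened) by a factor n. Example values are invented.
def spearman_brown(reliability: float, n: float) -> float:
    """Predicted reliability after changing test length by factor n."""
    return n * reliability / (1 + (n - 1) * reliability)

# E.g., a 20-item quiz with reliability 0.60, doubled to 40 items:
print(f"{spearman_brown(0.60, 2.0):.2f}")   # -> 0.75
# The same formula (n = 2) adjusts a split-half correlation upward:
print(f"{spearman_brown(0.50, 2.0):.2f}")   # -> 0.67
```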
20. Keys of reliability assessment
Different types of assessments require different kinds of reliability:
- Written MCQs: scale reliability, internal consistency
- Oral exams: rater reliability, Generalizability Theory
- Written essays: inter-rater agreement, rater reliability, Generalizability Theory
- Observational assessments: inter-rater agreement, Generalizability Theory
- Performance exams (OSCEs): rater reliability, Generalizability Theory
21. Keys of reliability assessment
Reliability: how high?
- Very high stakes: > 0.90 (licensure tests)
- Moderate stakes: at least ~0.75 (OSCE)
- Low stakes: > 0.60 (quiz)
22. Keys of reliability assessment
How to increase reliability?
For written tests:
- Use objectively scored formats
- At least 35-40 MCQs
- MCQs that differentiate high- from low-performing students
For performance exams:
- At least 7-12 cases
- Well-trained SPs
- Monitoring, QC
For observational exams:
- Lots of independent raters (7-11)
- Standard checklists/rating scales
- Timely ratings
23. Conclusion
Validity = meaning:
- Evidence to aid interpretation of assessment data
- The higher the test stakes, the more evidence needed
- Multiple sources or methods
- Ongoing research studies
Reliability:
- Consistency of the measurement
- One aspect of validity evidence
- Higher reliability is always better than lower
24. References
National Board of Medical Examiners. United States Medical Licensing
Exam Bulletin. Produced by Federation of State Medical Boards of
the United States and the National Board of Medical Examiners.
Available at: http://www.usmle.org/bulletin/2005/testing.htm.
Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method
for assessing clinical skills. Ann Intern Med. 2003;138:476-481.
Litzelman DK, Stratos GA, Marriott DJ, Skeff KM. Factorial validation
of a widely disseminated educational framework for evaluating
clinical teachers. Acad Med. 1998;73:688-695.
Merriam-Webster Online. Available at: http://www.m-w.com/.
Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-
Based Medicine: How to Practice and Teach EBM. Edinburgh: Churchill Livingstone; 1998.
Wallach J. Interpretation of Diagnostic Tests. 7th ed. Philadelphia:
Lippincott Williams & Wilkins; 2000.
Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical
teaching? A review of the published instruments. J Gen Intern Med. 2004;19:971-977.
Shanafelt TD, Bradley KA, Wipf JE, Back AL. Burnout and self-reported
patient care in an internal medicine residency program. Ann Intern Med. 2002;136:358-367.
Alexander GC, Casalino LP, Meltzer DO. Patient-physician communication about out-of-pocket costs.
JAMA. 2003;290:953-958.
25. References
- Pittet D, Simon A, Hugonnet S, Pessoa-Silva CL, Sauvan V, Perneger TV. Hand hygiene among
physicians: performance, beliefs, and perceptions. Ann Intern Med. 2004;141:1-8.
- Messick S. Validity. In: Linn RL, editor. Educational Measurement, 3rd Ed. New York: American
Council on Education and Macmillan; 1989.
- Foster SL, Cone JD. Validity issues in clinical assessment. Psychol Assess. 1995;7:248-260.
- American Educational Research Association, American Psychological Association, National Council
on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC:
American Educational Research Association; 1999.
- Bland JM, Altman DG. Statistics notes: validating scales and indexes. BMJ. 2002;324:606-607.
- Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ. 2003;37:830-837.
- 2005 Certification Examination in Internal Medicine Information Booklet. Produced by American
Board of Internal Medicine. Available at: http://www.abim.org/resources/publications/IMRegistrationBook.pdf.
- Kane MT. An argument-based approach to validity. Psychol Bull. 1992;112:527-535.
- Messick S. Validation of inferences from persons’ responses and performances as scientific
inquiry into score meaning. Am Psychol. 1995;50:741-749.
- Kane MT. Current concerns in validity theory. J Educ Meas. 2001;38:319-342.
- American Psychological Association. Standards for Educational and Psychological Tests and Manuals.
Washington, DC: American Psychological Association; 1966.
- Downing SM, Haladyna TM. Validity threats: overcoming interference in the proposed
interpretations of assessment data. Med Educ. 2004;38:327-333.
- Haynes SN, Richard DC, Kubany ES. Content validity in psychological assessment: a functional
approach to concepts and methods. Psychol Assess. 1995;7:238-247.
- Feldt LS, Brennan RL. Reliability. In: Linn RL, editor. Educational Measurement, 3rd Ed. New
York: American Council on Education and Macmillan; 1989.
- Downing SM. Reliability: on the reproducibility of assessment data. Med Educ. 2004;38:1006-1012.
- Clark LA, Watson D. Constructing validity: basic issues in objective scale development. Psychol Assess. 1995;7:309-319.
26. Resources
For an excellent resource on item analysis:
http://www.utexas.edu/academic/ctl/assessment/iar/students/report/itemanalysis.php
For a more extensive list of item-writing tips:
http://testing.byu.edu/info/handbooks/Multiple-Choice%20Item%20Writing%20Guidelines%20-%20Haladyna%20and%20Downing.pdf
http://homes.chass.utoronto.ca/~murdockj/teaching/MCQ_basic_tips.pdf
For a discussion about writing higher-level multiple choice items:
http://www.ascilite.org.au/conferences/perth04/procs/pdf/woodford.pdf