Selecting and Simplifying: Rater Performance and Behavior When Considering Multiple Competencies

Tavares W, Ginsburg S, Eva KW



Assessment of clinical competence is a complex cognitive task, and many of its mental demands are imposed on raters unintentionally. We were interested in whether this burden might contribute to well-described limitations in assessment judgments. In this study, we examined how asking raters to (a) consider multiple competencies and (b) attend to multiple issues affects indicators of rating quality. We also explored the cognitive strategies raters use when asked to consider multiple competencies simultaneously.


We hypothesized that indications of rating quality (e.g., interrater reliability) would decline as the number of dimensions raters are expected to consider increases.


Experienced faculty examiners rated prerecorded clinical performances within a 2 (number of dimensions) × 2 (presence of distracting task) × 3 (number of videos) factorial design. Half of the participants were asked to rate 7 dimensions of performance (7D), and half were asked to rate only 2 (2D). The second factor was whether participants were also required to rate the performance of actors participating in the simulation (the distracting task). First, we calculated the interrater reliability of the scores assigned and counted the number of relevant behaviors participants identified as informing their ratings. Second, we analyzed data from semistructured posttask interviews to explore the rater strategies associated with rating under conditions designed to broaden raters' focus.


Generalizability analyses revealed that the 2D group achieved higher interrater reliability than the 7D group (G = .56 and .42, respectively, for the average of 10 raters). The requirement to complete an additional rating task had no effect. For the 2 dimensions common to both groups, an analysis of variance revealed that participants asked to rate only 2 dimensions identified more behaviors relevant to the focal dimensions than those asked to rate 7 dimensions: procedural skill = 36.2%, 95% confidence interval (CI) [32.5, 40.0] versus 23.5%, 95% CI [20.8, 26.3], respectively; history gathering = 38.6%, 95% CI [33.5, 42.9] versus 24.0%, 95% CI [21.1, 26.9], respectively; ps < .05. During posttask interviews, raters identified many sources of cognitive load and described idiosyncratic strategies they used to reduce that load during the rating task.
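
To give a sense of what a G coefficient reported for the average of 10 raters implies, the sketch below applies the Spearman-Brown relation to the two reported values. It assumes a simple one-facet design in which rater-related error variance shrinks in proportion to the number of raters; the study's actual generalizability model may include other facets, so the figures are illustrative only. The function names are hypothetical, and the 0.80 target in the usage note is an arbitrary example threshold, not a value from the study.

```python
import math


def single_rater_reliability(g_avg: float, n_raters: int) -> float:
    """Invert Spearman-Brown: reliability of one rater implied by the
    reliability of the mean of n_raters (one-facet assumption)."""
    return g_avg / (n_raters - (n_raters - 1) * g_avg)


def projected_reliability(g_single: float, n_raters: int) -> float:
    """Spearman-Brown: reliability of the mean of n_raters raters."""
    return n_raters * g_single / (1 + (n_raters - 1) * g_single)


def raters_needed(g_single: float, target: float) -> int:
    """Smallest number of raters whose averaged score reaches `target`."""
    n = target * (1 - g_single) / (g_single * (1 - target))
    return math.ceil(n)


# Reported coefficients for the average of 10 raters.
REPORTED = {"2D": 0.56, "7D": 0.42}

if __name__ == "__main__":
    for group, g10 in REPORTED.items():
        g1 = single_rater_reliability(g10, 10)
        print(f"{group}: single-rater G ~ {g1:.3f}, "
              f"raters for G >= 0.80: {raters_needed(g1, 0.80)}")
```

Under these assumptions, the implied single-rater reliability is roughly .11 for the 2D group and .07 for the 7D group, so the 7D condition would need substantially more raters to reach any given reliability target.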


As intrinsic rating demands increase, indicators of rating quality decline. The strategies that raters use when asked to rate many dimensions simultaneously are varied and appear to reflect idiosyncratic attempts to reduce cognitive effort, which may affect the degree to which raters base their judgments on comparable information.