Impact of rating demands on rater-based assessments of clinical competence.
Tavares W, Eva KW
Many assessment practices used in primary care rely upon judgements provided by individuals observing trainees or colleagues. Despite there being many reasons to view these observations as cognitively complex, the extent to which fallibility in judgement reflects mental workload has not been examined experimentally. The objective of this study was to evaluate the impact of increasing rating demands on rater-based assessments of clinical competence.
Participants were randomly assigned to one of four conditions in a 2×2 factorial design and asked to rate three pre-recorded, unscripted clinical encounters illustrating three levels of performance (high, medium, low). The two manipulated factors were the number of dimensions to be rated (seven or two) and the presence or absence of extraneous tasks (attending to patient status and to the activity of additional individuals observable on video). Outcome measures included the number of dimension-relevant behaviours identified, the ability to differentiate between levels of performance, and inter-rater reliability.
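A 2×2 factorial assignment of this kind can be sketched in a few lines of Python. This is only an illustration: the condition labels, the participant count of 40 and the balanced round-robin allocation are assumptions made for the example, not details reported for the study.

```python
import random
from collections import Counter
from itertools import product

# Hypothetical labels for the two manipulated factors.
DIMENSIONS = ("2D", "7D")                 # two or seven dimensions to rate
DISTRACTION = ("none", "extraneous")      # extraneous tasks absent or present


def assign_conditions(participant_ids, seed=None):
    """Randomly assign participants to the four cells of a 2x2 design.

    Assumption: cell sizes are kept as equal as possible (they differ by
    at most one). The source only states that assignment was random.
    """
    rng = random.Random(seed)
    cells = list(product(DIMENSIONS, DISTRACTION))  # the four conditions
    ids = list(participant_ids)
    rng.shuffle(ids)
    # Deal the shuffled participants round-robin across the four cells.
    return {pid: cells[i % len(cells)] for i, pid in enumerate(ids)}


# Usage with 40 hypothetical participant IDs.
assignment = assign_conditions(range(1, 41), seed=0)
cell_sizes = Counter(assignment.values())  # 10 participants in each cell
```

Crossing the two factors in this way is what allows the effect of the number of dimensions to be estimated separately from the effect of the extraneous tasks.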
Using the two dimensions common to both groups, ANOVA revealed a significant effect of the number of dimensions included in the scale on the number of relevant behaviours identified: participants rating two dimensions (the 2D group) identified more features than those rating seven dimensions (the 7D group). Both groups were able to differentiate between levels of performance, but post hoc analyses revealed significant differences on all pairwise comparisons in the 2D group but not in the 7D group. Inter-rater reliability increased from 0.45 in the 7D group to 0.70 when participants were required to consider only two dimensions. By contrast, the distractions had little effect.
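To make the reliability figures more concrete, the following Python sketch computes one common index of inter-rater reliability, the one-way random-effects intraclass correlation, ICC(1,1). The abstract does not name the coefficient that was used, so this choice is an assumption made purely for illustration, and the ratings in the usage example are invented toy values rather than study data.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    `ratings` is a list of rows: one row per rated performance,
    one column per rater. Assumes every row has the same length.
    """
    n = len(ratings)        # number of rated performances
    k = len(ratings[0])     # number of raters per performance
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-performance and within-performance mean squares.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Toy example: three performances, three raters, perfect agreement.
perfect = [[1, 1, 1], [4, 4, 4], [7, 7, 7]]
icc_perfect = icc_oneway(perfect)  # → 1.0
```

Values closer to 1 indicate that raters order the performances more consistently, which is the sense in which 0.70 represents better agreement than 0.45.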
The results of this study provide preliminary evidence that requiring raters to consider a greater number of dimensions can decrease (a) the number of dimension-relevant behaviours identified, (b) the capacity to differentiate between levels of performance, and (c) inter-rater reliability.