Guidebook for Clerkship Directors
Printed version of Guidebook is available.
William C. McGaghie, PhD
The term evaluation frequently has a negative connotation, especially for medical learners engaged in a program of study. Medical students and residents rarely view their evaluations as opportunities for improvement even though better performance and public accountability are the principal aims of medical education and the evaluation of its outcomes. Instead, evaluations are seen by learners as hurdles grounded in threat. Evaluations are barriers that channel learner thinking and behavior, frequently motivated by fear of failure, with adverse consequences for those who fall short. Such learner perceptions contrast with faculty intent where evaluation is considered a tool needed to boost student competence and protect the public. Nonetheless, learners perceive the stakes to be high and so is their anxiety. Evaluation is a process to which most medical learners grudgingly submit. It is rarely a process they seek and enjoy.
But evaluation in medical education has an upside, especially as learners and teachers acknowledge the goal is to produce superb clinicians. When educational evaluation data are seen and used as a tool, not as a weapon, the outlook becomes improvement and mastery rather than enforcement. This outlook also changes the psychological climate toward constructive progress instead of apprehension. An illustration is the enthusiasm internal medicine residents express about acquiring ACLS skills and demonstrating their mastery in an educational program featuring deliberate practice and rigorous outcome evaluation.1,2
This section provides an overview about evaluation in medical education and sets a point of departure for 14 sections that follow. The section has four parts that lay a foundation for subsequent chapter writings: (a) purposes of learner evaluation, (b) evaluation goals, (c) matching evaluation goals and tools, (d) evaluation and learner motivation. Much of this contextual writing amplifies work published elsewhere nearly two decades ago.3 There are many similarities with the earlier work although the material has been updated to capture new developments.
There are at least eight purposes for learner evaluation in medical education. Each of these purposes is addressed in many ways throughout the remaining chapter sections. They are all important but for different reasons. So except for the first, all of the other purposes for learner evaluation should be assigned equal weight.
A program of undergraduate or postgraduate medical education simply cannot operate, or stay in operation, without being accredited. In the U.S., undergraduate medical accreditation is managed by the Liaison Committee on Medical Education (LCME), jointly sponsored by the American Medical Association (AMA) and the Association of American Medical Colleges (AAMC). U.S. graduate medical education is accredited by the Accreditation Council for Graduate Medical Education (ACGME). (See also Chapter 15: The Clerkship Director and the Accreditation Process.) Each of the medical accreditation agencies imposes detailed requirements for learner evaluation that medical education programs must fulfill just to stay in business. Cyclic accreditation reviews assure that once met, a medical education program’s learner evaluation criteria and standards do not erode.
Assessment of medical student competence is a basic responsibility for all programs of clinical medical education. Such assessments represent accomplishment benchmarks, tangible signs of medical student progress along the educational continuum. They depend, of course, on a priori statements of cognitive, procedural, or affective learning goals; high performance standards; and measurement methods that yield reliable data about student achievement. Clerkship directors realize that sound competence assessments provide focused feedback to students and feedback about educational program effectiveness for faculty and administration (see Chapter 6, Section 3). Competence assessment is a cornerstone of quality medical clerkship education.
Competence assessments external to a clerkship are also imposed in the form of board examinations. With very few exceptions, clerkship medical students have successfully passed USMLE Step 1 and are beginning preparation for Step 2, especially its clinical skills component (2 CS) involving standardized patients (SPs). Students need to be aware that the best way to prepare for these high stakes competence assessments is active engagement with the clinical curriculum.4
Most clerkship directors struggle to exercise control over the type or variety of cases seen by medical students in the clinic or hospital. Patients arrive for clinic visits or are admitted to an inpatient service due to concerns about their health, not because the patients want to advance medical education. Individual cases, and the health problems they represent, often present on an uncontrolled, seemingly random basis. Unless patients having different problems are selectively distributed among clinical learners in a controlled way, clinical medical education can be an uneven experience.
Documenting and managing student exposure to a variety of clinical problems is a difficult task. Handheld computers and wireless data entry and manipulation may simplify it. The growing use of standardized patients (see Chapter 6, Section 8) and other forms of medical simulation (see Chapter 6, Section 7) can complement contact with real patients. This increases the odds that the clinical curriculum can be uniform.
Similar to competence assessment at important clinical milestones, educational evaluations are also used to gauge and monitor student academic progress more frequently. Medical students are expected to advance through the clinical curriculum on a “critical path” achieving successive program goals both within individual clerkships and across the clerkship year. Wide deviations from that path are a source of concern for clerkship directors. Similar to monitoring infant development using the Denver II development chart, medical student academic progress should be gauged frequently to ensure it is within normal limits.
Today’s educational evaluations are often used to forecast performance on future assessments. The success of educational forecasts usually stems from the similarity of the skills being assessed, congruence of measurement methods, and the time span between the measurements (shorter is better). The conventional wisdom that “the best way to predict future behavior is to rely on one’s current and past overt behavior” is correct.5 Rigorous evaluations that produce reliable data give teachers and medical learners a snapshot of each student’s performance status and a window to future student performance.
A common complaint among medical students is that they rarely receive concrete information about “how they are doing” clinically or educationally. Medical learners are usually eager to discuss their experiences and are anxious to discuss ways in which they can boost their fund of knowledge or improve their clinical skill. Performance feedback is a term that is widely used to describe information that gives learners knowledge of the results of their study and clinical work. Given specific feedback about their progress or deficits, medical students can either move to new areas of clinical practice or take steps to improve marginal performance (see Chapter 6, Section 15).
An educational program needs to have three basic features before useful feedback can be given to learners. First, the program needs to have clear goals that represent a graduated set of milestones for medical students. Second, the program needs to have a means to collect, store, and routinely retrieve data that learners and their teachers can use for educational feedback. Third, the program needs faculty who are willing to take time to candidly review the evaluative data with students, tied to clerkship goals. Effective feedback about educational progress cannot occur unless a plan is in place that identifies goals to be accomplished, routinely collects data about student progress, and provides frequent opportunities for trainees and faculty to discuss clinical learning.
Clerkships operate inside a clinical department, within an undergraduate medical curriculum, usually wrapped in a university environment. Clerkships are one of many threads in a broad academic fabric. Academic tradition holds that variation in student achievement is acknowledged by the assignment of low and high grades. One of the toughest everyday responsibilities that clerkship directors face is translating data about student performance into medical school grades (see Chapter 6, Section 13). This is a medical school and university requirement, a practical reality that comes with the clerkship director’s job, which cannot be avoided.
Grades assign value to medical student work. Grades can be given in a normative way (“on the curve”) to compare students against their peers or in ways that compare all students against a fixed achievement standard. The bottom line is that assigning grades to medical students as a sign of their achievement is part of every clerkship director’s job. Fair and impartial grade assignment is a necessary condition of clerkship education.
Learner evaluation data including board examination scores, results of OSCEs and SP-based clinical exams, conative measures, and tests using medical simulations can be employed in a variety of ways to judge the effectiveness of a medical education program. The clerkship works to the degree that medical students meet or exceed a priori expectations about their acquisition of the knowledge, skill, and affective outcomes stated in the program plan. Achievement of clerkship goals is documented by medical student performance data. A clerkship is successful if a high proportion of its medical students measure up to expectations based on tough but fair assessments of their learning. [See Chapter 7: Evaluation of the Clerkship: Clinical Teachers and Program]
Quality Improvement (QI) is another outcome when medical student performance data are used to judge clerkship effectiveness. The clerkship matures and prospers as student performance data accumulate, are studied, and used for program improvement. Medical student performance data not only tell a story about individual learners but also about the quality of clerkships and curricula that shape student learning.
Medical student evaluation has at least five goals to amplify the eight purposes already stated. The five goals are evaluation of: (a) professional knowledge, (b) technical and procedural skills, (c) professionalism, (d) professional relationships, and (e) physician-patient relationships. Each evaluation goal is addressed by different measures of medical student achievement.
Evaluation of professional knowledge has been the mainstay of medical competence evaluation since the formation of the National Board of Medical Examiners in 1915.6 Today, medical knowledge assessment is done via internal (e.g., course, clerkship) and external (e.g., USMLE Step 1 and 2) evaluations that rely mainly on multiple-choice questions (see Chapter 6: Section 10 [Clerkship Examinations], Section 11 [Writing MCQ's] and Section 12 [Setting Standards in Clerkship Examinations]). These evaluations, by intent and format, propel the idea that the acquisition and maintenance of a broad and deep fund of knowledge is essential for medical practice. The primacy of these tests asserts that knowledge acquisition is a basic goal of medical education.
Technical and procedural skill
Assessment of medical student technical and procedural proficiency has grown in frequency and sophistication over the past decade. Measures are now available that permit objective evaluation of such skills as cardiac auscultation,7 ACLS maneuvers,1,2 and the female pelvic examination.8,9 Most of these measures rely on simulation technology embodied in SPs or medical simulators that vary in human fidelity.10 These technologies are covered elsewhere in this chapter (see Chapter 6: Section 6 [Procedural skills], Section 7 [Simulators], and Section 8 [Standardized patients]).
Professionalism is expressed in each young physician’s character, reliability, honesty, ability to keep confidences, and other nonacademic qualities that embody “the good doctor.” Professionalism is more than maturity and less than sainthood; it connotes promises of expertise and duty. In medical circles professionalism is usually conspicuous by its absence and taken for granted when present. Measurement and evaluation of medical professionalism has recently been expressed as a key outcome of medical education witnessed by the Medical School Objectives Project of the AAMC11 and the subsequent Outcomes Project of the ACGME.12
Teaching and evaluating student professionalism has become one of the highest priorities of U.S. medical schools. Teaching is done via customary methods including reading, case discussions, study of professional codes of conduct, and especially by faculty example in clinical settings. Assessing professionalism is difficult to do with precision.13 However, such assessments are essential because, “Unprofessional behavior in medical school is associated with subsequent disciplinary action by a state medical board.”14 Chapter 6, Section 9 [Evaluating Professionalism] presents a detailed discussion about evaluating professionalism.
A fourth evaluation goal is professional relationships. This goal goes beyond personal integrity to embrace respect for other members of the health care team, administrative staff, and other colleagues. Professional relationships are addressed infrequently in undergraduate medical education while their profile is rising in graduate and continuing medical education (GME, CME). Received clinical wisdom and recent writing about patient safety15,16 teach that clinical patient care is rarely a solitary activity. Instead, nearly all patient care is now delivered by teams of individual clinicians having different credentials and skills. The emerging educational goal is [how to] turn a team of experts into an expert team.
Professional relationships, individual and team skill acquisition, team member interchangeability, and team cognition are several of the many variables involved in the preparation of expert teams. Another key variable in team effectiveness is dissolution of traditional professional hierarchies that have existed in clinical medical settings. A growing literature on team training and new professional relationships in medical practice is now beginning to affect medical curricula.17,18
The doctor-patient relationship has been a hallmark of effective clinical practice from antiquity through Osler and Halsted to the present day. Fostering these interpersonal skills and sentiments has been a key feature of medical education and sound clinical practice though always threatened by lapses in honesty by either doctor or patient. More recent threats to doctor-patient relationships include time pressures due to the managed care environment, social class differences, ethnic differences, and many others. Holmboe (Chapter 6, Section 4) addresses direct observation of physician-patient relationships in depth.
A persistent problem in evaluation and grading of students on medical clerkships is matching evaluation goals with the right evaluation tools. Many different tools are available, ranging from long aptitude tests such as the MCAT to simulations, OSCEs, and short bedside encounters. Some evaluative tools such as board certification examinations like USMLE Steps 1 to 3 are highly quantitative and objective whereas others such as letters of recommendation are qualitative and subjective. Each type of measure has a place in medical learner evaluation. However, the decision to use one of the tools should be based on a clear understanding of one’s evaluative purpose and context.
Table 6.1.1. describes 16 common evaluation methods in medical education. The table also contains a short comment about the advantages of each method and a statement about potential problems associated with using each procedure. At least one citation is given for each method to encourage further reading by those who seek more detailed information.
No single evaluation method is valid for all purposes. Academic physicians need to think hard about their reason for wanting to assess a student’s knowledge, procedural skill, self-confidence, dependability, honesty, or any other clinically relevant characteristic. Only after identifying the purpose of the evaluation (e.g., educational diagnosis, technical proficiency, overall performance on a rotation) should the clerkship director select a measurement tool that will produce meaningful data to inform the needed decision.
Seasoned medical educators know that examinations shape and drive student behavior. Today’s medical students live from test to test, usually viewing each evaluation experience as a sentence rather than an opportunity. For medical students, the evaluations they encounter are an operational definition of the curriculum because no matter what is presented, read, practiced, or discussed, passing tests defines life in medical school. This issue was raised 44 years ago in 1961 by George Miller in his famous book, Teaching and Learning in Medical School.48 Although Miller did not cite it, the identical point had been made about British medical education in the 19th century, including much faculty grousing about “test driven” students.49
Recent research confirms that even small changes in the emphasis or format of evaluation procedures prompt revisions in the way that students prepare for and approach examinations. This holds for learners in general50 and medical students in particular.51
The origins of this behavior are not hard to detect as discussed by Good.52 She astutely describes the widespread and high-level culture of “evaluation apprehension” in the medical profession. Left unchecked this apprehension can have bad effects like needless competition; reduced student cooperation; defensiveness; attempts at one-upmanship; and reliance on expensive, extracurricular commercial test preparation courses that have no tangible benefits.4 The challenge to medical educators is to craft and use evaluation and grading methods that truly are tools for student improvement not weapons that intimidate. The following sections of this chapter provide blueprints to fulfill that goal.
This section provides a lexicon for key issues in evaluation and assessment through a series of definitions and distinctions. The purpose is to provide clerkship directors with a quick reference to key terms that guide the practical decisions to be made in clerkships. Since terms are sometimes used differently in different contexts, and by different authors, etymologies are provided to root meaning in the embryology of the term. (Etymologies are based principally on The Compact Edition of the Oxford English Dictionary, Oxford University Press, 1971.)
The following definitions and distinctions are included:
Evaluation, rooted in “value” and derived from the Latin valeo (to be strong), indicates a judgment of how well a student’s strengths correspond with the “values” of the concerned communities, including the department, school, and the profession. Grading implies assignment of a label to the level of performance achieved, and derives from the Latin word gradus, or step. Grading within a medical school is, effectively, an administrative action classifying the level of performance achieved. While evaluation implies a description in words of how a student is performing, grade implies a concise label of the level achieved that can be expressed with letters, labels or even numbers (A, B, C, D, etc.; Honors, High Pass, Pass, Low Pass, Fail, Incomplete, Withdrawal; 96%, 76%). Assessment is sometimes used to embrace the entire process of evaluation and grading. It comes from a Latin term meaning to set a tax. (The term assessor would mean someone who “sat at” a judge’s bench.) However, it can also be used to refer to the process of measuring something (a radio-immuno-“assay”), or of acquiring direct observations about a learner (“sitting next to” the student). The term assessment, then, combines something of the quantitative and qualitative aspects of gathering data for evaluation. While there is some flexibility, perhaps even disagreement, on which terms are used for which part of the process, it can be useful to construct a sequence in which, together, the terms establish a rhythm (assessment-evaluation-grading) and constitute a three-phase process that corresponds to the familiar rhythm of clinical medicine that, in turn, reflects the classical sequence of observation-reflection-action. In this sequence, grading would be an administrative action, and feedback would be an educational action. (Cf. Table 6.2.1)
Educational process | Aristotle | Clinical process | (S.O.A.P.)
Assessment = making observations about learners | Observation | History and Physical | (S.O.)
Evaluation = determining learner’s | Reflection | Diagnosis | (A.)
Grading/Feedback = taking an action | Action | Therapy | (P.)
Practically, decisions about who is asked to evaluate a student, and who gets to “grade”, have to be made in each setting, and teachers’ responses often depend on how they see the consequences of their role in this process. Grades are often submitted to the registrar’s office as terse summative letters (A, B, C, etc.) or steps (Honors, High Pass, Pass, etc.), and these reductions of performance into a single letter can be seen by teachers and students as categorical judgments on the student as a person. Hence, the grading framework used dictates a choice of terms that can affect what teachers are willing to contribute to grading.53
Formative evaluation is done to “form” or shape the subsequent performance of a learner, specifically by generating and providing feedback. It is done during an experience, and can be done by teachers as frequently as time will allow, but it should also be done formally at specified times, for instance, halfway through an experience. Summative evaluation is done at the end of a unit of time, typically at the end of the clerkship, and “sums” up the student’s performance. Whereas formative evaluation is done primarily for the sake of the student, summative evaluation fulfills our responsibility to society, pronouncing the student ready for the next level of training. Summative evaluation often includes a grade as well as narrative description of performance and recommendations for improvement. A grade without comment would provide only minimal guidance to a student and would not help the student improve subsequent performance. Therefore, it is recommended that a grade (label) always be accompanied by an evaluation (description in words).
This distinction is meant to capture the difference between the curriculum that students experience (process) and their achievements (product, outcomes). The concept is often described as the process-product paradigm.54 Process measurements could include documentation that students have actually completed clerkship tasks (number of patients seen, number of procedures done), while product measurements include typical, end-of-clerkship assessments (e.g., NBME subject exams). Often, our research tries to document the relationships between what we "do" to students, and how they are changed by the experience. Since research shows that what individual students actually achieve depends as much on their personal characteristics as on the formal curriculum, it is useful to document their "baseline" status, that is, what they bring to the clerkship, by having pre-clerkship measurements such as pre-clerkship GPA or USMLE Step 1 scores.55
Dichotomous grading (etymologically from Greek, “cuts into two”) divides a group of students into those who pass and those who fail. Polytomous (“cutting into many parts”) or Scalar (scala = steps in Italian) grading recognizes a broader spectrum of student performance by providing for a series of steps for assigning grades, such as Honors, High Pass, Pass, Low Pass, Fail, or the equivalent letter grades, A, B, C, D, F. Continuous grading would refer to a series of numbers which have small intervals, such as 88%, 87%, 86%, etc. Generally speaking, dichotomous grading fulfills our responsibility to society by determining whether a learner is competent or not. Scalar and continuous grading help faculty and students compare performances among students, and may also help graduate program directors rank their applicants. For quantitative assessments (such as multiple choice examinations or OSCEs) the conversion from an exam score to a final grade can be straightforward, even if the cut points are arbitrary. However, students and teachers have had an ongoing concern about the lack of clarity in how descriptive assessments from teachers are converted into a step-wise grading system (such as Honors, High Pass, etc.). One simple method of addressing this problem is to categorize teachers’ observations about a student's performance into a step-wise framework, such as second-year level, third-year level, fourth-year level, internship level; or reporter, interpreter, manager/educator.20, 53 (see Chapter 6, Section 13 [Converting Evaluation into Grades])
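The conversion from a quantitative exam score to a dichotomous, scalar, or continuous grade described above can be illustrated with a minimal sketch. The cut points, labels, and function names below are hypothetical and chosen only for illustration; each clerkship would set its own.

```python
from bisect import bisect_right

# Hypothetical cut points on a 0-100 percent scale (assumptions, not from the text).
PASS_CUT = 60.0
SCALAR_CUTS = [60.0, 70.0, 80.0, 90.0]  # lower bound of each step above Fail
SCALAR_LABELS = ["Fail", "Low Pass", "Pass", "High Pass", "Honors"]


def dichotomous_grade(score: float) -> str:
    """Cut the group in two: those who pass and those who fail."""
    return "Pass" if score >= PASS_CUT else "Fail"


def scalar_grade(score: float) -> str:
    """Map a score onto a series of steps using ordered cut points."""
    # bisect_right counts how many cut points the score has reached or passed.
    return SCALAR_LABELS[bisect_right(SCALAR_CUTS, score)]


def continuous_grade(score: float) -> str:
    """Report the score itself, rounded to a whole percent."""
    return f"{round(score)}%"
```

The same score yields three different reports: a score of 87.4 is a Pass on the dichotomous scale, a High Pass on the scalar one, and 87% on the continuous one. Moving any cut point changes the grades without any change in student performance, which is the sense in which cut points are arbitrary.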
Normative grading is “relative” and it assigns grades to students’ performance by comparing them with another group, the “norm”, such as a contemporary peer group. This comparison group could be a national reference such as all students taking a certifying examination, or a local group of students taking a clerkship at the same time of year. Normative grading can be done in a mathematical way, generating a “curve”, with grade rankings based on distance above or below the mean score. Normative grading is often done less formally, with half of the students in the middle (for example, a grade of High Pass), a quarter receiving Pass, and a quarter receiving Honors. In any case, the essence of normative grading is to compare students to each other.
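Both normative variants described above, the mathematical “curve” and the informal quarter-half-quarter split, can be sketched briefly. This is an illustration under stated assumptions: the function names are hypothetical, the population standard deviation is used as the unit of distance from the mean, and ties are broken by input order.

```python
from statistics import mean, pstdev


def z_scores(scores: dict[str, float]) -> dict[str, float]:
    """Distance of each score above or below the group mean, in SD units."""
    mu = mean(scores.values())
    sd = pstdev(scores.values())
    if sd == 0:  # every student has the same score
        return {name: 0.0 for name in scores}
    return {name: (s - mu) / sd for name, s in scores.items()}


def quartile_grades(scores: dict[str, float]) -> dict[str, str]:
    """Top quarter Honors, middle half High Pass, bottom quarter Pass."""
    # sorted() is stable, so tied students keep their input order.
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    quarter = n // 4  # size of the top and bottom groups
    grades = {}
    for position, name in enumerate(ranked):
        if position < quarter:
            grades[name] = "Honors"
        elif position >= n - quarter:
            grades[name] = "Pass"
        else:
            grades[name] = "High Pass"
    return grades
```

In both functions a student's result depends on who else is in the comparison group: adding or removing peers can change a grade even though that student's own score is unchanged.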
What is often called criterion-based grading sets mastery standards for each grading level (pass, high pass, etc) and is more “absolute”, less relative, than norm-referenced methods. Basically, “criterion-based” grading is really fixed-standard grading, in which experts first decide what the tested domain will be (the criterion, the “what”) and then what will be expected standards of proficiency (fixed standards, the “how much?”). This approach depends upon a prior judgment of what has “content validity” (see below). For example, in the domain of manual skill in suturing, the fixed-standard is the degree of proficiency that must be achieved – adequacy of wound closure, the number of sutures used, and the time taken to close the wound. The examiner then decides whether the standard has been met, and how well (crites means a “judge” in Greek).
Choice of a criterion-based or fixed-standard system is one of the most difficult choices made in a clerkship, and has powerful consequences upon grading decisions. In a fixed or absolute standard system, a group of three students working with a single teacher could all receive grades of Pass or all grades of Honors, depending on the criteria they met. In a normative system, they are competing against each other.
Another consequence of a fixed-standard grading system is that it would typically yield more grades at the upper end of the grading spectrum at the end of an academic year, when students would typically perform better; whereas, a normative grading system would try to assign the same number of Honors grades at the start as at the end of the year. This highlights the difference between evaluation and grading. At the start of the year, performance as a strong “interpreter” might lead to a grade of Honors, but at the end of the year only to a grade of High Pass.
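The contrast between the two systems for a small group, such as three students working with a single teacher, can be sketched as follows. The standards, scores, and names are hypothetical, and the normative function assumes exactly one label is supplied per student.

```python
# Hypothetical fixed standards: minimum score required for each grade.
FIXED_STANDARDS = [("Honors", 90.0), ("High Pass", 80.0), ("Pass", 70.0)]


def fixed_standard_grade(score: float) -> str:
    """The grade depends only on which standard the student met."""
    for label, minimum in FIXED_STANDARDS:
        if score >= minimum:
            return label
    return "Fail"


def normative_grades(scores: dict[str, float], labels: list[str]) -> dict[str, str]:
    """Rank students and hand out a preset list of labels, best score first."""
    # Assumes len(labels) == len(scores); zip would silently drop any extras.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return dict(zip(ranked, labels))


# Three strong students whose scores differ by only a point or two.
group = {"student_1": 93.0, "student_2": 92.0, "student_3": 91.0}

fixed = {name: fixed_standard_grade(s) for name, s in group.items()}
ranked = normative_grades(group, ["Honors", "High Pass", "Pass"])
```

Under the fixed standard all three students receive Honors because each met the same criterion. Under the normative scheme the same scores are spread across Honors, High Pass, and Pass because the students are compared with one another.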
In practice, most clerkship directors agree that the dichotomous pass-fail decision should be based on criteria, rather than an arbitrary failing of a certain percentage of students in each clerkship for each year. It is the distinction between Honors, High Pass, Pass, etc. that is more problematic. Each institution, or perhaps each clerkship, has to decide whether to be fairer to patients and society (ranking students based on mastery of certain criteria) or fairer to students (assuring equal distribution of grades, irrespective of the time of year a student takes the clerkship).
A compensatory grading system averages aspects of a student’s performance using various parameters to yield a final grade. For instance, a high score on a multiple-choice final examination plus a failing clinical evaluation might calculate to a grade of Pass. A non-compensatory (“weakest link”) approach would conclude that the student is not better than his/her lowest level of competence in a core area of evaluation. For instance, an excellent examination score would not compensate for poor professionalism, or vice versa. Therefore, a student with unacceptable performance in any domain of evaluation could not receive a passing final grade. Generally speaking, clerkships must determine which aspects of performance are so important that deficiencies in any cannot be compensated for by proficiency in others.
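The difference between the two systems can be made concrete with a small sketch. The components, equal weights, and passing threshold below are hypothetical; a real clerkship would choose its own domains and weights.

```python
# Hypothetical components, weights, and passing threshold (0-100 scale).
WEIGHTS = {"exam": 0.5, "clinical": 0.5}
PASSING = 70.0


def compensatory_score(components: dict[str, float]) -> float:
    """Weighted average: strength in one area can offset weakness in another."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def compensatory_pass(components: dict[str, float]) -> bool:
    """Pass if the averaged score meets the threshold."""
    return compensatory_score(components) >= PASSING


def non_compensatory_pass(components: dict[str, float]) -> bool:
    """Weakest link: every component must meet the threshold on its own."""
    return all(components[name] >= PASSING for name in WEIGHTS)


# A high examination score combined with a failing clinical evaluation.
student = {"exam": 95.0, "clinical": 55.0}
```

For this student the averaged score is 75.0, so the compensatory rule yields a pass, while the non-compensatory rule fails the student because the clinical evaluation alone falls below the threshold.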
Descriptive methods of evaluation describe a student’s performance using words. Quantitative methods try to measure performance and yield a numerical score. Most summative grades are a combination of the two methods, and clerkships fairly consistently weight descriptive methods more heavily than quantitative ones. A survey of internal medicine clerkship directors reported that, on average, 25% of the clerkship grade was derived from the NBME subject examination;56 this figure was 33% for surgical clerkships57 and 31% for psychiatry clerkships.58
There is a tendency to refer to quantified examinations as “objective” and narrative evaluations as “subjective”. However, these terms can be misleading. In comparison to descriptive evaluations, a multiple-choice examination is dispassionate (not caring, for instance, about how confidently a student speaks), has a single “grader” (the scoring device) and its precision and reliability are more easily calculated. However, we should not confuse objectivity with reliability; and “objectification”59 may be a better term for MCQs or OSCEs. In any case, objectivity (or objectification) does not mean that an assessment itself has validity. Each step in creating a multiple choice question, from decisions about what to test to the wording of the item, involves judgments that reflect the opinions of teachers.25
Unspoken assumptions in the process of converting teachers’ evaluations into grades often lead students to regard teachers’ evaluations as subjective and arbitrary. Many students protest a lower-than-desired grade by arguing that a high score on a multiple choice test is “objective” (and therefore valid) and that the narrative evaluation describing unprofessional behavior is “subjective” (and therefore not valid). Yet, descriptive methods can achieve a level of reliability (see below) and validity that is sufficient for high stakes decisions.21, 60 Both assessment methods have a role in determining summative grades and one is not inherently more valuable than the other, so the terms “subjective” and “objective” (which undervalue descriptive evaluation) should be avoided if possible.
Traditional evaluation theory “analyzes”, or “breaks up”, a student’s performance (to analyze in Greek is to “loosen up” or “take apart”) into several components: knowledge, skills and attitudes (or attitudes, skills, and knowledge, “ASK”). Each component can be assessed by tools appropriate for that domain. For instance, multiple-choice tests might be used to assess knowledge, and standardized patients can assess history-taking skills.
A “synthetic” approach “puts things together”, and asks how the student’s abilities in several domains come together to achieve a level of proficiency. The RIME Scheme20 introduces a vocabulary for synthetic evaluation of students’ clinical skills. This describes development in clinical skills from “Reporter” to “Interpreter” to “Manager/Educator” (RIME) in which each task requires all three facets of the analytic model. For instance, a reliable “reporter” must combine skill in physical examination technique with the knowledge of what to look for in the patient at hand, and also with respect for the patient’s privacy; the ability to honestly and accurately communicate findings must be combined with a sense of duty to fulfill responsibilities each day.
The rhythm of RIME corresponds to the same sequence as observation-reflection-action and S.O.A.P. While there is a developmental aspect to this, it does not imply that all students go sequentially through stages of development. Rather, the RIME scheme is intended as a "razor" defining a level of performance below which the learner should not fall.
Recently, there has been an initiative to apply the ACGME approach of the "six competencies" to medical students. Three of the "competencies" fit the analytic model (professionalism, interpersonal skill, knowledge) and three are synthetic (patient care, systems-based practice, and practice-based learning and improvement).
Analytic and synthetic approaches are complementary. For instance, the RIME synthetic vocabulary offers an initial assessment framework for organizing observations about a learner’s development toward independence. A teacher who recognizes that a student is an effective reporter, but not yet an interpreter, should switch to an analytic approach in order to determine what will help the student take the “next step”. For example, if there is a problem moving from reporter to interpreter, does the student need to acquire more knowledge, to practice the skill of differential diagnosis, or to become more confident? Analytic and synthetic approaches reinforce each other.
The ACGME approach is intended to reach a dichotomous decision about competence at the point when a resident leaves training, and moves into unsupervised practice; therefore, it minimizes the developmental approach. Clerkship students are in the transition from pre-clinical status to internship, and some developmental aspect is usually required in framing the evaluation system.
Progressive refinement of cognitive skills has an ancient pedigree. Plato described the progress from observing facts to observing and identifying the abstractions below them; in other words, the progress from reporter to interpreter. Aristotle was even more explicit in defining the fundamental rhythm of cognitive processes: observation-reflection-action, with further reflection based upon action. This developmental approach has been captured educationally in Bloom's taxonomy61 for cognitive progress in which, simply, there is progress from the possession of facts, to being able to explain the facts, to apply them to new situations, to synthesize intermediate conclusions, and to reach value judgments. The Dreyfus brothers described six stages of progress from novice, to advanced beginner, to competent, to proficient, to intuitive expert and finally to mastery.62 While these are generalizations, and difficult for everyday teachers to apply to specific students, they do capture the expectation that a student will be able to accept progressively higher levels of responsibility. We have to recognize that students can be more advanced in their level of performance on some patients than on others. This is the principle of content-based expertise. Nevertheless, clerkships often have to decide what is acceptable performance at the end of each clerkship rotation, and whether it should be different at different times of the year, or if a student is returning to the clerkship in the fourth year in remediation for prior substandard performance.
These terms have complementary meanings, but they are sometimes used interchangeably, and educators should pay careful attention to how the terms are being used in a specific context. In the more common use of the terms, “competence” is what a student has the ability to do at certain times or under test conditions (in this sense, related to the etymology of the word, to strive with, or to “compete”) and “performance” is what a student does consistently on a daily basis, even when not being watched. This distinction is best reflected in the “Know-Can-Do” description of the levels of accomplishment described in Miller's triangle; that is, the student “knows what to do”, “can apply it”, “can do it successfully under test conditions”, and “does do it” regularly. Alternatively phrased, the student “knows how”, “shows how” and “does”. So, the distinction between competence and performance also highlights two differences, one in the setting - in vitro (a simulation center) and in vivo (actual practice), and another in process (whether the person is being observed, or is aware of being observed).
However, these terms can also be used in exactly the reverse senses, in which “performance” refers to a display while being observed (i.e., performing for an audience), as in being “on-stage”, in test conditions, and “competence” denotes all the attributes to function independently. In this less conventional use of the terms, competence can actually never be demonstrated until it is actually achieved in a sustained, independent way in practice.
In practice, competence is defined in many ways and embodies many frameworks. In the analytic model, competence is proficiency in tasks in each of the contributing domains (knowledge, skills and attitudes). In a developmental model, competence can be described in relation to the steps above it (intuitive expertise), and below it (proficiency).
In the synthetic model, competence is putting all the necessary characteristics and qualities together for each patient in a sustained way. The definition of competence in a profession, in this model, is the ability to give to every situation that a professional might face all that properly belongs to that situation, and no more.63 This means that a competent person first has to make the decision about what a situation requires. Since the efficiency and judgment needed to exclude unnecessary effort implies a level that is beyond most students, it may not be appropriate to use the term “competence” for students at all. Practically, our concrete expectations for students or interns should require that they consistently do all the important things for their patients (for instance, accurately report all important findings) but reward, with a higher grade, their ability to leave out the less important ones.
Do clerkship directors judge that a learner is “competent” (or has “competence”) when proficiency is achieved in each of several “competencies”, or must they all be brought to bear, consistently, in the care of individual patients? Actual practice situations are truly in vivo, and have the complexity of authentic decision-making. In vitro tests, such as clinical skills examinations, focus on clinical “performance” and have often narrowed down the task for the learner. While use of the analytic method to create an assessment method for some single aspect of competence is quite useful at the undergraduate level, it can never be entirely successful for a resident about to begin unsupervised practice.
Clerkship directors therefore will typically use a variety of quantitative methods to assess aspects of competence (written examinations, direct observations of interviewing skill, etc.; see sections 6 through 12 of this chapter) and rely on summary observations of teachers to see whether students can put things together (see Chapter 6, Section 3 [Descriptive Methods]).
This term has become popular since the introduction by the ACGME of the six “general competencies” which are to guide the teaching and assessment of those in graduate education.12 The six items do not together equal “competence”, but all are part of the characteristics and detailed skill sets expected to be present in a resident ready for independent practice. In a sense the “competencies” do not describe competence, but are a framework with which program directors can assess competence, competency by competency, with a toolbox of methods for each.12 This fits quite well with the intention to facilitate the ACGME’s Outcomes Project, which will link process in training to product (outcomes) at the end of training or in subsequent practice. This is a very exciting development which should foster educational measurement and research. The framework of competencies will be seen as a combination of the “analytic” model noted above, in the first three items, and three “synthetic” items that describe tasks to be mastered.
The “competencies” are intended to benchmark the final level of proficiency achieved by each resident, so they do not contain an explicitly developmental aspect. Clerkship directors have therefore debated their utility for medical students. The question has largely been rendered moot by the strong regulatory influence of the ACGME and the endorsement of the AAMC (see Chapter 13: Understanding, Navigating and Leveraging American Medicine). Therefore, clerkship directors must articulate what would be expected of a starting and finishing third-year student, and of finishing fourth-year students. Similarly, program directors must make expectations clear for interns and PGY2 residents.
There are assessment methods appropriate for each of the competencies (please see detailed Table 6.1.1. in Chapter 6, Section 1). Although this chapter is not organized by the “competencies”, there are discussions appropriate to each in the following sections in this chapter:
Reliability is the consistency, replicability, stability, or reproducibility of results (in Latin, to rely on - religare - is to trust). Reliability is the amount of the observed variance that is due to the student (true score variance) rather than the test and everything else (error variance), and is usually expressed as a decimal figure between zero and 1.0. High reliability suggests that the “signal” (what we want to measure) is sufficiently greater than the “noise” (problems inherent in the assessment method), so that we can consider the results reproducible, or at least representative. For high stakes decisions, at least 80% of the variance should be true score variance (a reliability figure of 0.8).64 (for discussion of reliability statistics see Chapter 6, Section 12.)
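The variance definition above can be made concrete with a small calculation. The following is a minimal Python sketch, not a method taken from this chapter: the function names and the variance figures are hypothetical, and the 0.8 cutoff simply mirrors the high-stakes threshold cited in the text.

```python
def reliability(true_score_variance: float, error_variance: float) -> float:
    """Share of observed variance that is attributable to the student.

    Observed variance is assumed to be the sum of true score variance
    (the "signal") and error variance (the "noise").
    """
    if true_score_variance < 0 or error_variance < 0:
        raise ValueError("variances must be non-negative")
    observed = true_score_variance + error_variance
    if observed == 0:
        raise ValueError("observed variance must be greater than zero")
    return true_score_variance / observed


def meets_high_stakes_threshold(true_score_variance: float,
                                error_variance: float,
                                threshold: float = 0.8) -> bool:
    """True when the reliability coefficient reaches the chosen threshold."""
    # A tiny tolerance guards against floating-point rounding at the boundary.
    return reliability(true_score_variance, error_variance) >= threshold - 1e-12


# Hypothetical figures: 80 units of true score variance, 20 units of error.
coefficient = reliability(80.0, 20.0)
print(f"reliability = {coefficient:.2f}")
```

In practice the true score and error variances are not observed directly; they are estimated with the reliability statistics discussed elsewhere in this chapter.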
Validity is confidence that you are measuring what you want to measure, what you “value” (similar in etymology to “evaluation”). There are several terms dealing with validity with which clerkship directors should be familiar.65 Content validity reflects whether the assessment covers enough of the domain you want to assess, and this can be made as a judgment of experts, or by comparison with some external standard, such as from the core curricula available from clerkship groups (CDIM, STFM, etc.). Face validity judges whether the assessment method seems to experts to be appropriate for the competency in question. For instance, use of a multiple-choice test to assess interpersonal skills would not have face validity. Construct validity means that results are consistent with reasonable theory (e.g., experts perform better than novices). Criterion/concurrent validity is more numerical, and determines whether the results of your assessment method agree with other appropriate measures of students’ performance. Predictive validity refers to whether results of one assessment measure are verified by subsequent performance, and this, too, is best demonstrated with mathematical methods, such as correlations and linear regression. Consequential validity is the term applied to a judgment about whether the effects of an evaluation system, typically social effects, are desirable. For students, and perhaps for clerkship directors, one consequence of grades might be a student’s choice of what GME specialty to apply to. Clerkship directors are referred to the excellent articles by Downing66-68 on these subjects.
Feasibility deals with whether an evaluation can actually be conducted in your own clerkship setting (from the French, faire, “to do”). Time to prepare and conduct the assessment, money to support the development, and space all contribute to feasibility. Feasibility is often the rate-limiting step in deciding how we evaluate our clerkship students. To some extent, acceptability to students and faculty is another aspect of feasibility. For students, their acceptance may be contingent upon perceived fairness, or upon cost in time and money. For faculty, simplicity of use and perhaps being distanced from legal implications would be the priorities.69 Nonetheless, it is preferable to develop reliable and valid tools; then try to make them work. Another factor of assessment has been called the “educational impact” on students: how they change their strategies of studying to match not only the content but the format of assessment.87
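As an illustration of the numerical side of criterion and predictive validity, the sketch below computes a Pearson correlation between two sets of scores. It is a minimal Python example under stated assumptions: the function name is hypothetical, and the two score lists are invented purely for demonstration and do not come from any study cited here.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / math.sqrt(sxx * syy)


# Invented example data: clerkship ratings and a later performance measure
# for the same five (hypothetical) students.
clerkship_ratings = [3.0, 4.0, 2.5, 4.5, 3.5]
later_performance = [70.0, 82.0, 65.0, 88.0, 74.0]
print(f"r = {pearson_r(clerkship_ratings, later_performance):.2f}")
```

A high coefficient would be one piece of evidence for concurrent or predictive validity, depending on whether the second measure is taken at the same time or later; it says nothing by itself about content, face, or consequential validity.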
To some extent, what we measure and reward will determine what students learn; in other words, “assessment drives the curriculum”. The list of topics or skills that we wish students to master is the syllabus (the term, etymologically, means "list"), and the methods we use to help students master the list, collectively, are "curriculum" (that is, the "horse race" we put students through, from “currere”, "to run", as in the word "current"). This distinction has implications for evaluation. If each of a school’s third year clerkships has a different list of topics to master, these are typically knowledge-based, and will require an emphasis on multiple-choice tests to establish content mastery. On the other hand, if schools wish to have common goals across clerkships, then these must be process-based, such as skills in interviewing and physical examination, in differential diagnosis, and in rapid mastery of the necessary knowledge to go beyond collecting facts to interpret them. In this approach, “curriculum” for third-year students might be seen as an expectation to move from reporter to interpreter; the basic strategy for clinical teachers would be to have a clear expectation that a student will offer a reasonable opinion.
Most clerkships accept a responsibility to be both discipline-specific (proficiency in the unique syllabus of subjects not taught elsewhere) and interdisciplinary (emphasizing common expectations which will lead to a successful performance in residency). As a consequence, the clerkship’s blueprint for evaluation might identify, explicitly, the methods to assess both the discipline-specific and the inter-departmental goals.
David Carnahan, MD and Paul A. Hemmer, MD, MPH
The focus of this section will be the descriptive evaluation of medical students by teachers during clinical clerkships. We will discuss the purpose of descriptive evaluation, its characteristics, strengths and potential deficiencies, as well as offer suggestions on how to improve the quality and credibility of descriptive evaluation. We will complete our discussion with a look at a synthetic framework for evaluating the performance of students using descriptive evaluation known as R-I-M-E.
What is descriptive evaluation? Descriptive evaluation is the term applied to the words instructors use in their assessment of students’ demonstrated competency across the domains of knowledge, skills and attitudes, and it is usually based on their observations of students over a given period of time. (see also Chapter 6, Section 2) Their words should provide evidence of students’ strengths and weaknesses, give examples of achievement or deficiencies, and serve as the basis for direct, meaningful feedback to the student and for recommending advancement or remediation. Some have described this as “clinical performance appraisal.”70
Unfortunately, descriptive evaluation is often referred to as “subjective” evaluation.71, 72 This may have been “encouraged by psychometricians and behavioral scientists who have labeled narrative judgments as unreliable and ‘soft’, and have urged faculty to focus on methods that yield ‘objective’ assessments”73 and reflects the bias toward believing that which is expressed in numbers rather than in words.74 However, Eisner has asserted that expert judgment is likely the superior approach to evaluating competence in fields in which science and art are mixed.75 We believe that use of the term “subjective” is detrimental to the evaluation process in that students and faculty often infer that a “subjective” assessment method is inferior to an “objective” method, such as multiple choice examinations. One counterpoint made to this notion by Norman et al. states that “objectivity does not necessarily result from the strategies of objectification (a set of strategies to reduce measurement error), and the application of these strategies may have undesirable consequences.”59 “Descriptive” more accurately defines this type of evaluation—conveying one’s ideas, thoughts, observations, and a synthesized judgment with words.
Descriptive evaluation is a component of an overall system of evaluation that also frequently incorporates quantifiable examinations of knowledge and/or skills evaluation.76-77 Descriptive evaluation is unique because it involves all aspects of the evaluation system, including evaluators, students, content of evaluation, and learning environment.78 Additionally, it assesses competencies not easily measured by knowledge or skills examinations, such as responsibility, integrity, compassion, maturity, and the application of knowledge in the clinical problem-solving of direct patient care.78
Clerkship directors place great emphasis on instructors’ comments in determining grades.79 Studies of required clerkships in the United States and Canada demonstrated that clinical instructors’ evaluations account for 40 to 60% (range, 0-100%) of students’ final clerkship grade.80-84 Given the reliance on descriptive evaluations in the grading process, clerkship directors must strive for reliable and valid descriptions. The evaluations should be based on as many direct clinical observations of the students as feasible, describe students’ performance based on uniform criteria established by the clerkship faculty, and cite specific examples of behavior and performance.33,54,85-86 Evaluators should make specific, behaviorally based comments that cite strengths and weaknesses, thereby providing meaningful feedback to the students. As a result, the evaluations would help clerkship directors and faculty teaching in the clerkship discern and tailor interventions for those students who are superior, average or marginal, as well as those who are failing.33, 70, 87
Studies of instructors’ ratings of medical students have shown a remarkable similarity in the elements instructors emphasize. Typically, instructors have emphasized the students’ interpersonal skills in dealing with colleagues and patients, their professional attitudes and behaviors, as well as their ability to apply knowledge and solve clinical problems.33, 77, 88-89 However, instructors at various levels of training and experience may place greater emphasis on different factors. Residents are likely to value a student’s procedural skills, work ethic, and motivation to help the team, while attending physicians are likely to place greater value on a student’s knowledge and reasoning skills.90-91
These studies are based primarily on instructors’ annotations on an evaluation form rating scale and not on their narrative comments. These rating scales, which usually address a student’s knowledge, skills, and professionalism, are used by most clerkships, although some clerkships may now be adapting their rating forms to reflect the ACGME core competencies.92-93 Regardless of the domains assessed, instructors mark or circle the point on the scale they believe corresponds to the observed level of student performance. The scales are usually numerical (from three to nine options per rated domain) and may be either simply numbers or may contain more detailed written descriptors of student performance (Appendix 1). Ideally, instructors should be trained in the proper use of the forms, understand how their evaluation contributes to grading, and understand the criteria for specific levels of student achievement for each rated category, as well as overall performance (e.g., failing, marginal, satisfactory, outstanding). Further issues concerning rating scales will be discussed in the next sections, on problems with descriptive evaluation and on improving descriptive evaluation.
We believe that the most important role that evaluation forms can play is to clearly and concisely communicate goals to teachers. The forms can be one way to communicate expectations for what teachers should assess, and provide guidance and a common language to create a frame of reference from which to evaluate students.
Despite their limitations, these studies demonstrate that instructors’ evaluations of students assess the breadth of competency: knowledge and its application, problem-solving skills, and professional qualities. Many faculty believe that assessing qualities of professionalism may be the most important aspect of evaluating medical students.94 There may be no better evaluation method to assess professional qualities than faculty and residents who observe performance on a daily basis. In fact, recent studies demonstrate that faculty ratings and comments form the centerpiece of an evaluation process focusing on professionalism,95 and that such comments made by teachers about students may identify those individuals at risk of future unprofessional conduct.14
Instructors’ unwillingness to record negative comments in evaluations does not necessarily mean that instructors are not able or willing to identify “marginal” or failing students.73, 87 Instructors are often willing to verbally discuss their concerns, but are reluctant to document, on either a rating scale or in written comments, these same concerns.77, 97-99 Reasons for reluctance include fear of legal action, lack of administrative support for unpopular decisions, an unwillingness to be involved in follow-through on difficult cases, or “passing the buck” to other evaluators.78 Also, instructors may feel their role as teacher and mentor may be in conflict with that as an evaluator, or they may have difficulty with delivering “bad news”. A national survey regarding grade inflation showed that 82% of respondents believed that faculty were reluctant to give low grades because of students’ expectations of higher grades, fear of legal action or student “hassle”, a belief that students with a strong work ethic should not fail, and a belief that assigning higher grades may entice students to their specialty. Of further concern, forty-three percent of the clerkship directors surveyed felt that we are unable to identify incompetent students.100 These findings are disappointing for several reasons. First, the courts have consistently upheld the judgment of faculty in cases in which students have not met academic or professional standards95,101 (See Chapter 6, Section 14 [Legal Aspects of Failing Grades]). Second, it would also appear that the “halo effect” continues to strongly influence an instructor’s evaluation,102 and finally, students’ expectations, sense of entitlement, or tenacity in challenging grades appear to have undue influence on instructors.103 Even if only one instructor states or records a negative comment, it likely has substantial merit.14, 89, 98, 104-105
Studies of instructors’ ratings of students’ written case reports, as well as ratings of videotaped encounters of trainees interviewing, examining, or presenting a patient, have shown low intra-rater and inter-rater reliability.106-108 Although some of the low reliability may be due to instructors focusing on different aspects of student performance, standardized rating scales only modestly improved reliability.90-91, 107
Other studies suggest that instructors’ clinical evaluations can achieve sufficient reliability for “high stakes” academic decisions (usually considered to be a reliability coefficient > 0.8). Carline and colleagues109 analyzed individual instructors’ ratings from a standardized, descriptive clerkship evaluation form and achieved a reliability of 0.8 for assigning clerkship grades when at least 7 observations of student performance were available. More recently, Williams et al. found a reliability of 0.8 was possible when evaluating surgical residents at least 8 times, with no improvement in the reliability when more rating scales were added to the evaluation form.93 Time during the academic year, clerkship site, and academic level of the rater had little effect on the ratings. A study of reliability yielded slightly lower coefficients across clerkships when 8 raters evaluated each student.110 In this study, the student’s score did seem to depend on the instructor to whom they were assigned and the clinical context in which the rating was performed. Use of global rating scales yielded inter-rater reliability of 0.83-0.91 in one study.111 The authors attributed this high inter-rater reliability to definition of the parameters rated, instructors who had direct, prolonged and close observation of relatively few students, ratings which were assigned after consensus among all supervisors, and training the raters to use the evaluation forms. Another study, by MacRae et al.,112 compared physician ratings of 120 videotaped medical student encounters using four cases and noted similar inter-rater reliability, with an average reliability coefficient of 0.85. They also attributed the high level of agreement to collaboration on the rating scales that were used in the study.
While high reliability coefficients are desirable, lack of agreement among instructors’ evaluations is not necessarily undesirable. Different instructors may focus on different aspects of a student’s performance, but in aggregate, the ratings may provide a more comprehensive picture of a student’s performance.90-91, 93 Limited variability in instructors’ ratings may be detrimental if it leads to overemphasis on other measures of student performance, such as written examinations.113 Ultimately, the clerkship director must decide whether areas of disagreement among instructors are desirable or undesirable.
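One way to see why reliability rises with the number of observations is the Spearman-Brown prophecy formula, a standard psychometric relation. The sketch below is offered only as an illustration: the studies cited above do not necessarily use this formula, the function names are hypothetical, and the formula assumes that each rating is an equally reliable, interchangeable ("parallel") measurement.

```python
import math


def spearman_brown(single_rating_reliability: float, n_ratings: int) -> float:
    """Predicted reliability of the average of n parallel ratings."""
    r = single_rating_reliability
    if not 0.0 < r < 1.0 or n_ratings < 1:
        raise ValueError("need 0 < reliability < 1 and at least one rating")
    return n_ratings * r / (1 + (n_ratings - 1) * r)


def implied_single_rating_reliability(composite_reliability: float,
                                      n_ratings: int) -> float:
    """Invert Spearman-Brown: reliability of one rating given the composite."""
    big_r = composite_reliability
    if not 0.0 < big_r < 1.0 or n_ratings < 1:
        raise ValueError("need 0 < reliability < 1 and at least one rating")
    return big_r / (n_ratings - (n_ratings - 1) * big_r)


def ratings_needed(single_rating_reliability: float, target: float) -> int:
    """Smallest number of parallel ratings predicted to reach the target."""
    r = single_rating_reliability
    if not 0.0 < r < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    exact = target * (1 - r) / (r * (1 - target))
    # Subtract a tiny tolerance so floating-point noise does not round
    # an exact whole number (e.g. 4.000000000000001) up to the next integer.
    return max(1, math.ceil(exact - 1e-9))


# Under these assumptions, a composite reliability of 0.8 from 7 observations
# would correspond to a single-observation reliability of 0.8 / 2.2.
single = implied_single_rating_reliability(0.8, 7)
print(f"implied single-rating reliability = {single:.2f}")
```

Under these assumptions, a composite reliability of 0.8 from 7 observations corresponds to a single-observation reliability of roughly 0.36, which is consistent with the low inter-rater agreement reported for individual ratings. Real ratings are rarely perfectly parallel, so such figures are best treated as rough planning estimates.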
The validity of descriptive evaluations has been questioned in studies that have centered on the predictive, concurrent, content, and face validity of descriptive evaluations. (See definitions earlier in Chapter 6, Section 2). One study examining the predictive validity of clerkship evaluations found that overall competence could be predicted better than professional behavior during residency. Students with good communication skills were more likely to receive higher overall competence ratings.114 Students who had either cognitive or non-cognitive deficiencies identified during an internal medicine clerkship were 13 times more likely to receive low ratings or comments from internship directors than those without deficiencies.87 As previously noted, a case control study suggests that comments and ratings that identify unprofessional behavior of medical students likely highlight individuals that are at risk of continued unprofessional behavior.14
Studies have also raised concern about the concurrent validity of instructors’ evaluations, as evidenced by low correlation between instructors’ end-of-clerkship evaluations and students’ performance on end-of-clerkship knowledge and/or skills examinations and licensing exams.71, 115-118 However, this low correlation may not be unexpected. In addition to assessing students’ knowledge, instructors’ evaluations assess clinical skills and attitudes, thereby assessing characteristics beyond the scope of knowledge or skills examinations. The different types of evaluations may be measuring different characteristics, reinforcing the need for multiple methods of evaluation.71, 117-122
Content and face validity of instructor evaluations have also been questioned. For example, instructors’ ratings of videotaped case presentations seemed to depend on the “likeability” of the student, and judgments about competency reflected students’ communication skills.108, 119 Assessment of one trait (e.g., knowledge) on an evaluation form correlated with assessment of other traits (clinical skills, personal characteristics) in another study.121 Several studies have also shown that residents tend to give higher ratings to students than faculty do.115-116, 123 This may be due to a greater amount of time spent with the students, leniency in grading, or the “halo effect”.115 Resident evaluations have shown better internal consistency than faculty evaluations of students, and adding resident evaluations to those of faculty improves the dependability of the evaluations.115-116, 123 Nevertheless, Holmboe demonstrated that rating forms tailored to a specific task, such as the mini-CEX for observation of residents’ clinical skills, do have content validity and that faculty can be trained to observe and record accurately their observations of a trainee.124-125
Many other factors may affect the reliability and validity of instructors’ evaluations. These include a sense of personal failure if a student does not improve; a desire to be liked; evaluations that lack specific, behaviorally based comments; substitution of a grade for comments; differing expectations among instructors; limited student-instructor encounter time; lack of a trusting relationship between teacher and learner; failure to directly observe student performance; the interest of raters in the process of evaluation; the types of interactions (such as attending rounds vs. work rounds); and differences in the training environment.104, 109, 120, 126 Reliability and validity of evaluations may also be affected by instructors’ failure to use the full rating scale and impatience with completing evaluation forms.115, 127 Perceived lack of rewards for teaching may also impact instructors’ willingness to participate effectively in the evaluation process.115
Most studies of reliability and validity have focused on the evaluation form rating scale or instructors’ final ratings, not on the instructors’ comments. However, the most important aspect of descriptive evaluation is the narrative comment, not the box checked on an evaluation form. It is not clear whether the findings of the studies that used rating scales would apply to the comments which instructors make.
General interventions to improve instructors’ evaluations include developing and reinforcing clear performance guidelines, improving communication among faculty members, and faculty development regarding evaluation skills.87, 97, 116 Reliability may be improved by using additional raters or investigating the sources of disagreement among evaluators.115, 117 Using a computerized evaluation form may improve the timeliness of evaluations as well as the number and quality of comments.126, 128-129 Relying on instructors for evaluation but not grading may improve the quality of the instructors’ comments.17, 40 Feedback to instructors on their evaluation or grading patterns may help improve future evaluations.86, 130, 132
Many efforts have focused on the evaluation form itself. Adding items to the evaluation form does not improve reliability.110 However, adding behaviorally based descriptors in each evaluation category for each level of performance enhances the reliability of instructors’ evaluations, a benefit that was lost when the descriptors were subsequently withdrawn.89, 123 Behavioral descriptors on an evaluation form may contribute to instructors making more detailed written comments.131
A recent review by Williams et al.33 explores the sources of bias and limitations in clinical performance ratings, which we have referred to as descriptive evaluation. This is an excellent review, one that clerkship directors should read. The authors propose sixteen recommendations to improve clinical performance ratings, which are summarized in Table 6.3.2. These recommendations highlight that descriptive evaluation should not be viewed as simply the distribution and collection of rating forms. To be effective, descriptive evaluation takes time and needs to be an interactive process between the clerkship director (or site director) and the teachers, both housestaff and faculty. In the final sections, we will discuss an evaluation process that gives all teachers a common frame of reference from which to evaluate clerkship student performance, combined with regular face-to-face meetings with teachers that serve as protected time for evaluation, feedback, and faculty development.
The third-year internal medicine clerkship at the Uniformed Services University of the Health Sciences (USUHS) uses an evaluation framework designed to assess and foster a student’s progression from “Reporter” to “Interpreter” to “Manager/Educator” (RIME).21,55
Reporter: Students must: (1) accurately gather information about their patients, through an independent history and physical examination, chart review, and from other sources such as family or referring physicians; (2) use appropriate terminology to clearly communicate their findings, both orally and in writing; (3) interact professionally with patients and staff, and (4) consistently and reliably carry out their responsibilities. This stage requires that students have an adequate knowledge base, the basic skills to perform fundamental tasks, and core attributes of honesty, reliability, and commitment. Students who are Reporters can answer the "What" questions about their patients.
Interpreter: Students must: (1) demonstrate ability to identify and prioritize problems independently, (2) offer three reasonable explanations for new problems, and (3) generate and defend a differential diagnosis. This step requires a greater knowledge base, increased confidence and skill in selecting and applying clinical facts to a specific patient, and the ability to begin to pose clinical questions. Interpreters organize, prioritize, synthesize, and interpret problems. Students who are Interpreters can answer the "Why" questions about their patients.
Manager: Students must be more “proactive”, suggesting diagnostic and therapeutic plans that include reasonable diagnostic options and possible therapies. This step takes even greater knowledge, more confidence, and the skill to select interventions for an individual patient. Managers understand their patients' needs and desires and can enter into relationship-centered care.
Educator: Becoming a Manager is intricately tied to being an Educator. Students must identify questions related to their patients that cannot be answered from textbooks, cite evidence that new or alternative therapies or tests are worthwhile, and share their acquired knowledge with other members of the health care team. Desire and ability to educate oneself and others is intrinsic to being a “manager” and reflects a desire not only to teach colleagues but also, and most importantly, to help the patient. A Manager/Educator answers the "How" questions for themselves and their patients. It is not simply a matter of "bringing in articles to the team."
In our third-year clerkship, “passing” requires mastery of “reporter” skills and evidence of some transition toward “interpreter”. Acquisition of skills as a consistent, reasonable “interpreter” constitutes a higher level of performance. Consistently demonstrating skills at the “Manager/Educator” level reflects performance beyond expectations for a third-year clerk (what might be expected of a fourth-year student). RIME is "synthetic": each level encompasses the traditional analytic framework of knowledge, skills and attitudes. It is a criterion-based framework for evaluating the performance of students.
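The criterion-based rule described above can be sketched in a few lines of code. This is only an illustration: the names `RimeLevel` and `describe_performance`, and the wording of the returned labels, are hypothetical and are not part of the clerkship's actual grading system. The sketch assumes that the evaluators have already agreed on the highest level a student demonstrates consistently.

```python
from enum import IntEnum
from typing import Optional


class RimeLevel(IntEnum):
    """Ordered RIME levels (hypothetical encoding for illustration)."""
    REPORTER = 1
    INTERPRETER = 2
    MANAGER_EDUCATOR = 3


def describe_performance(consistent_level: Optional[RimeLevel],
                         moving_toward_interpreter: bool = False) -> str:
    """Map the highest consistently demonstrated level to a descriptor.

    consistent_level is None when Reporter skills are not yet mastered.
    The returned strings paraphrase the text; they are not official grades.
    """
    if consistent_level is None:
        return "below passing"
    if consistent_level >= RimeLevel.MANAGER_EDUCATOR:
        return "beyond expectations for a third-year clerk"
    if consistent_level == RimeLevel.INTERPRETER:
        return "higher level of performance"
    # Reporter mastery alone is not enough: passing also requires
    # evidence of some transition toward Interpreter.
    return "passing" if moving_toward_interpreter else "below passing"


if __name__ == "__main__":
    print(describe_performance(RimeLevel.REPORTER, moving_toward_interpreter=True))
    print(describe_performance(RimeLevel.INTERPRETER))
```

Because the levels are ordered, a student cannot be placed at Interpreter without first satisfying Reporter, which mirrors the "minimal level of performance" idea in the framework.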
Importantly, there is a rhythm to RIME that cuts across medical specialties (see also Chapter 6, Section 2). It is a readily understood frame of reference from which all teachers can evaluate student performance.20, 86 RIME captures what clinicians do when they interact with patients: Observation (Reporter), Reflection (Interpreter), Action (Manager/Educator) and what they write: "Subjective/Symptoms" and "Objective/Observations" (Reporter), "Assessment" (Interpreter), "Plan" (Manager/Educator). Furthermore, RIME also helps teachers understand the minimal level of performance below which a trainee cannot fall. For example, it would be unacceptable for a student to be able to Interpret data that they are given if they cannot demonstrate that they are able to reliably obtain the information themselves from the patient.
R-I-M-E is readily “portable” and applicable in ambulatory care or inpatient ward settings. It can be incorporated into the student's clerkship evaluation form (see Appendix 1), into the student's clerkship handbook, onto "encounter cards" used in ambulatory or ward settings, during orientations to ward teams and ambulatory attendings, and readily becomes part of the terminology that teachers use. In a study of the feasibility and acceptability of R-I-M-E, Battistone et al.53 found that residents and faculty believed that the new descriptive system was “more valid” than the prior evaluation method and that 80% of students found RIME to be “helpful” to “very helpful” with overall student satisfaction. Battistone also reported that, within the first year of implementation, more than half of the students heard the RIME terminology in the feedback from their teachers.
Importantly, we evaluate clerkship students using the RIME framework during formal evaluation sessions.21, 53, 98, 107 The evaluation sessions are formal, planned meetings that are held every 3 to 4 weeks at each clerkship site. The clerkship director, or the on-site coordinator for the clerkship, moderates each session during which 15 minutes is devoted to discussing each medical student currently on the clerkship. All instructors, including residents and faculty, who are working with the student are asked to attend. Each evaluator is asked to describe and assess the student’s strengths and/or weaknesses and is allowed to speak uninterrupted. The moderator may ask for clarifications about, or specific examples of, demonstrated knowledge, skills, and attitudes. The most junior evaluator speaks first, with the attending physician adding comments last in an effort to encourage the house staff to voice their observations uninfluenced by the comments of the attending physician. At the end of the evaluators’ comments, the moderator asks for a recommended grade based on the student’s performance and the “next steps” for the student to progress along the R-I-M-E framework. The clerkship director or site directors can also provide feedback to the teachers on their comments. The clerkship director or on-site coordinator meets with each student the following day to provide feedback.
In addition to serving as a forum to evaluate students, the evaluation sessions fulfill other needs including: (1) defining clerkship objectives and how they can be assessed; (2) defining expectations of instructors; (3) facilitating communication among faculty members; and (4) providing faculty development to improve the evaluation of students.78, 97, 99, 100, 121 Faculty development is accomplished in a non-threatening, interactive, “workshop” format. The evaluation sessions are "protected time" for these activities. They provide a regular, recurring time to provide frame of reference training and performance dimension training to teachers, one that is immediately applicable because the teachers are still working with the students. In addition, the evaluation sessions not only meet clerkship directors’ need for timely summative evaluation, but also the students’ need for formative evaluation and feedback by identifying and discussing strengths and weaknesses during the clerkship.
Perhaps most significantly, the evaluation sessions facilitate the identification of marginally performing students by capitalizing on instructors’ willingness to verbally discuss concerns regarding students that they may not be willing to document in writing.77, 98, 105 We have demonstrated the enhanced predictive validity of the evaluation sessions over traditional evaluation methods for identifying students with marginal funds of knowledge, as well as identifying those students who are likely to have problems during their first postgraduate year of training.21, 98 Evaluation sessions enhance the quality of behavior-based description of a student’s professional demeanor135 and significantly improve the detection and description of unprofessional behavior.105 Finally, the use of the R-I-M-E framework in conjunction with the formal evaluation sessions has achieved an internal consistency of descriptive evaluation of student performance similar to that of quantifiable examinations.136
Evaluation sessions (or similar activities) have been implemented at other institutions, and on clerkships other than medicine.53, 127-139 Residency program directors and local leadership have supported these sessions by providing residency lecture time for the sessions, a clear signal as to the importance of trainee evaluation and also teachers' professional development.53, 137 Teachers will come to the meetings: in the first year of implementation, Battistone et al.53 found that 79% of residents and 72% of faculty attended the sessions; Ogburn noted near 100% attendance.137 We recognize that some clerkships may have students at such a large number of teaching sites that face-to-face meetings may not be feasible. We believe that what is most important is the interaction that takes place between the clerkship director and the teachers. There is a terrific opportunity for research to address other ways of interacting (email, phone, video teleconferencing) that may prove valuable.
The 45 to 60 minutes invested per student during a 12-week clerkship to complete this evaluation process is similar to the time invested in evaluation by clerkship directors who use other evaluation and grading methods. The time and resources to administer the evaluation sessions are commensurate with expectations of clerkship directors.140-141 Finally, the time requirement for the evaluation sessions pales in comparison to our educational and societal obligations to evaluate the competency of medical students.
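The 45 to 60 minute figure follows from the session format described earlier: 15 minutes of discussion per student at each session, with sessions held every 3 to 4 weeks over a 12-week clerkship. A minimal sketch of that arithmetic is shown here; the function name is hypothetical, and it assumes one session per complete interval, which the text does not state explicitly.

```python
def evaluation_minutes_per_student(clerkship_weeks: int,
                                   weeks_between_sessions: int,
                                   minutes_per_session: int = 15) -> int:
    """Total discussion time per student over one clerkship.

    Assumes one evaluation session per complete interval
    (an illustrative simplification, not a documented rule).
    """
    if weeks_between_sessions <= 0:
        raise ValueError("weeks_between_sessions must be positive")
    sessions = clerkship_weeks // weeks_between_sessions
    return sessions * minutes_per_session


if __name__ == "__main__":
    for interval in (3, 4):
        total = evaluation_minutes_per_student(12, interval)
        print(f"Sessions every {interval} weeks: {total} minutes per student")
```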
Two broad conclusions are apparent. First, credible descriptive evaluation of medical students takes time, both for the clerkship director and for the teachers. Second, improving descriptive evaluation also means clerkship directors need to talk to teachers on a regular basis. Both of these can be addressed but certainly require the support of the medical school department and local teaching site leadership. While it is important to convey clerkship goals and expectations in a variety of ways, including using a concise evaluation form with behavioral descriptors, it is unreasonable to assume that, without training, instructors will be able to improve their evaluation skills or feel the needed support to identify concerns regarding student performance. RIME is a readily understood and applicable common frame of reference. RIME encourages formal evaluation sessions using a planned, longitudinal format for student evaluation and feedback that addresses many of the recommendations for improving this type of evaluation.33
Despite tremendous advances in medical technology, the basic clinical skills of interviewing, physical examination, and counseling remain essential to the successful care of patients. The Association of American Medical Colleges (AAMC) strongly endorses the evaluation of students in these clinical skills.142 The Institute of Medicine has placed patient-centered care at the heart of its five core competencies for all physicians.143 Faculty observation of students performing a medical interview, physical examination, or counseling is still essential for the reliable and valid assessment of these skills. The development of standardized patients to evaluate clinical skills has been a major advance in the assessment of students.144-148 However, standardized patients are optimally applied in clinical skills teaching and assessment as a supplement to similar activities in the real clinical setting; they cannot replace the observation of students by physicians on an ongoing basis with actual patients.149-152
Therefore, despite the growing availability and acceptance of standardized patients and other simulation technologies, teaching faculty will continue to shoulder the primary responsibility for evaluating student skills through direct observation in real clinical settings. Unfortunately, many faculty are not sufficiently prepared to accurately observe and provide effective corrective feedback about these clinical skills. In this chapter, we will first explore problems in students’ clinical skills and the challenges faced by faculty performing direct observation. We will then outline some practical methods to improve faculty observation skills along with useful tools faculty can use when performing observations.
Numerous studies have documented serious deficiencies in medical interviewing and counseling that have persisted over time; in the view of some, history-taking skills may actually have declined.153-157 More importantly, research has demonstrated positive associations between good communication skills and improved patient outcomes.158 Errors are also common in physical examination skills.159-164 For example, deficiencies in auscultatory skills among trainees were noted over forty years ago161-162 and poor cardiac and pulmonary physical exam skills continue to plague U.S. students and residents today.163-164
These findings are relevant because we know that despite advances in technology, accurate data collection during the medical interview and the physical exam remains the most potent diagnostic tool available to physicians.165-167 Two important studies showed that the medical interview alone produced the correct diagnosis in nearly 80% of patients presenting to an ambulatory care clinic with a previously undiagnosed condition.165, 167 Bordage recently noted that errors in data collection are one of the principal factors in diagnostic errors committed by physicians.168 As a result, there has been a significant push to re-emphasize both the training and evaluation of clinical skills.169-171 Without accurate evaluation of clinical skills, which must be accomplished by direct observation, improvement in the clinical skills of physicians is unlikely.
Perhaps the biggest problem in the evaluation of clinical skills is simply getting faculty to observe students. One of the most prominent physician-scientists and educators of the twentieth century, the late George Engel, strongly advocated direct observation of the history and physical examination skills of trainees over 30 years ago.172-173 Dr. George Engel commented in a 1976 editorial,
"Evidently it is not deemed necessary to assay students' (and residents) clinical performance once they have entered the clinical years. Nor do clinical instructors more than occasionally show how they themselves elicit and check the reliability of clinical data. To a degree that is often at variance with their own professed scientific standards, attending staff all too often accept and use as the basis for discussion, if not recommendations, findings reported by students and housestaff without ever evaluating the reporter's mastery of the clinical methods utilized or the reliability of the data obtained."173
The AAMC found that among 97 medical schools it visited between 1993 and 1998, faculty rarely observed student interactions with patients, noting that the majority of a student’s evaluation was based on faculty and resident recollections of student presentation skills and knowledge.174
Although several studies show that four to seven observations produce sufficient reliability in the evaluation of clinical skills for "pass-fail" determinations, little is known about the validity and accuracy of faculty ratings. Noel and Herbers, in two important studies of the American Board of Internal Medicine’s (ABIM) traditional “long case” clinical evaluation exercise (CEX), found substantial deficiencies in the accuracy of faculty ratings.175-176 They demonstrated that faculty failed to detect up to 68% of errors committed by a resident scripted to depict marginal performance on a training videotape. Use of specific checklists prompting faculty to look for certain skills increased accuracy of error detection nearly twofold, but the checklist did not produce more accurate overall ratings of competence. Nearly 70% of faculty still rated a resident depicting marginal performance as satisfactory or superior overall.
Kalet examined the reliability and validity of faculty observation skills using videotapes of student performance on an objective structured clinical examination (OSCE) designed to evaluate interviewing skills.177 She found that faculty were inconsistent in identifying the use of open-ended questions and empathy, and that the positive predictive value of faculty ratings for "adequate" interviewing skills was only 12%. Another study found that faculty could not reliably evaluate 32% of the physical exam skills assessed, and had the most difficulty with examination of the head, neck and abdomen.178
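For readers less familiar with the statistic, positive predictive value is the proportion of "positive" ratings that are confirmed by a reference standard. Read in the usual way, a value of 12% means that only about one in eight students whom faculty rated as "adequate" would be judged adequate by the reference standard. The sketch below illustrates the calculation; the counts are hypothetical and chosen only to reproduce 12%, and they are not data from the study cited above.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = true positives / all positive ratings."""
    total_positive = true_positives + false_positives
    if total_positive == 0:
        raise ValueError("no positive ratings; PPV is undefined")
    return true_positives / total_positive


if __name__ == "__main__":
    # Hypothetical counts: 25 students rated "adequate" by faculty,
    # of whom 3 are adequate according to the reference standard.
    ppv = positive_predictive_value(true_positives=3, false_positives=22)
    print(f"PPV = {ppv:.0%}")
```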
Given the essential role of faculty observation in the evaluation of basic clinical skills, medical schools and residency programs must better prepare faculty for this important task. Recent research in medical education has demonstrated that effective training approaches can improve observation skills. Each approach, and how it applies to faculty development for competency evaluation of medical students, is briefly described below.
Behavioral Observation Training (BOT)
Behavioral observation training is focused on improving the detection, perception, and recall of actual performance.179 There are two main strategies emphasized in BOT. The first is simply to increase the number of observations, that is, to increase the sampling of actual performance. This helps to improve recall of performance and provides multiple opportunities for the rater to practice observation skills (the “practice makes perfect” principle). The second strategy is to provide some form of observational aid that raters can use to record observations, sometimes referred to as “behavioral diaries.” Studies show that even something as simple as a 3 × 5 inch index card used to record observation notes improves the quality of information provided on evaluation forms. As described below, the mini-CEX form and checklists can serve as an immediate “behavioral diary” to record a rating of an observation.180
Observation of clinical skills also requires that faculty “prepare” for the observation. First, faculty should determine the objectives and goals of the observation before entering the patient’s room with the student. For example, if you plan to observe a student’s physical examination skills, what would be the appropriate components of a physical exam for the patient’s chief complaint or medical condition? Positioning is also very important because, as faculty, you want to minimize interference with the student-patient interaction whenever possible. Figure 1 demonstrates the principle of triangulation, which maximizes the ability of the faculty to observe while minimizing interference. Table 6.4.1 lists some important yet simple rules for performing student observation.

Figure 1: Principle of Triangulation
Performance Dimension Training (PDT)
This type of training is designed to teach and familiarize the faculty with the appropriate performance dimensions used in their own evaluation system.181-183 PDT simply starts with a review of the definitions and criteria for each dimension of performance or competency. The goal should be to define all those criteria and student behaviors that constitute a superior performance from the perspective of patient outcomes. The next step in PDT is to give faculty the opportunity to "interact" with the definitions using videotapes or actual evaluation examples to improve their understanding of the definitions and criteria. The overarching goal of PDT is to ensure that faculty, as a group, first understand the definitions and criteria for the competency of interest, so that some degree of consensus is shared among faculty. Appendix II provides a very straightforward and useful proactive PDT exercise that can be done with faculty to facilitate interaction with competency in clinical skills. We recommend performing PDT exercises in small groups and then having the small groups share their results. Inevitably, differences occur between the groups. These differences, however, lead to productive discussions on what constitutes the core elements and criteria of competency in counseling or other clinical skills. This type of PDT exercise can be done for two clinical skills in approximately one hour. Another approach to PDT is reactive: using actual evaluations or videotapes of clinical skills that faculty can react to when performing the PDT exercise.
Frame of Reference Training (FoRT)
This type of training specifically targets accuracy in rating. Table 6.4.2 describes the complete FoRT process. As you can see, FoRT is really an extension of PDT; the main goal of FoRT is establishing the different performance criteria that distinguish levels of performance. The main focus of FoRT should be to define four levels of performance: unsatisfactory, marginal, satisfactory and superior. The PDT exercise should first define the criteria and definitions for a superior performance from the perspective of optimal patient outcomes. The second step of the exercise, as shown in Appendix II, is to define the minimal criteria for a satisfactory performance. These criteria for a satisfactory performance serve as an important anchoring point to define marginal and unsatisfactory performance in step 3. Once the group defines marginal criteria, by default any other type of performance is unsatisfactory.
Direct observation of competence (DOC) training uses the methods of BOT, PDT, and FoRT, together with standardized patient training methods, to train faculty in observation. There are two versions of DOC training. The “short course” version involves BOT, PDT, and FoRT exercises using small group discussion and videotaped encounters. The “long course” version includes a half day of skill practice with standardized residents and patients.184 An evaluation course that includes DOC training is available through the American Board of Internal Medicine (www.abim.org).
The Mini Clinical Evaluation Exercise (mini-CEX)
The mini-CEX was originally designed to evaluate residents in a setting reflective of day-to-day practice. Faculty observe a resident performing a focused history, physical, or counseling session during routine care experiences on the inpatient wards, intensive care units, outpatient clinics and the emergency department. However, the mini-CEX has also been used in student clerkships.185 The mini-CEX facilitates multiple observations over time by different faculty members. This improves both the reliability and validity of the evaluations. This longitudinal nature of the mini-CEX is one of its most important strengths as an evaluation tool and method.
In the first large study of the mini-CEX, Norcini et al.186 reported on the results of 388 mini-CEX evaluations for 88 residents at 5 different residency programs. Over half of the encounters occurred in the inpatient setting. In this initial study, most of the participating residents were in the PGY-1 year, and each resident underwent a mean of 4.4 observations (range 2-10). The authors noted that the standard error for just 4 mini-CEXs per resident was acceptable enough for pass-fail determinations. Trainees reported high satisfaction ratings for the mini-CEX format, and interestingly there was a modest correlation between faculty satisfaction ratings and resident performance. In a study of the mini-CEX with students, Kogan and colleagues found that nearly 90% of students on a 12-week medicine clerkship were able to obtain at least 9 mini-CEX observations.185 The reliability coefficient for 8 mini-CEXs was 0.77 and the mini-CEX was used in both the inpatient and outpatient clerkship settings.185 Holmboe and colleagues, using scripted videotapes, found that the mini-CEX evaluation form does possess construct validity.187
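To give a feel for how reliability depends on the number of observations, the sketch below applies the Spearman-Brown prophecy formula, a standard psychometric relationship. Its use here is an assumption: the cited studies may have estimated reliability by other methods (for example, generalizability analysis), and the formula treats every observation as equally informative. Only the reported coefficient of 0.77 for 8 observations is taken from the text; everything derived from it is illustrative.

```python
def spearman_brown(single_reliability: float, n_observations: int) -> float:
    """Projected reliability of the mean of n parallel observations."""
    r = single_reliability
    n = n_observations
    return n * r / (1 + (n - 1) * r)


def implied_single_reliability(composite_reliability: float,
                               n_observations: int) -> float:
    """Invert Spearman-Brown: reliability of one observation implied by
    the reliability of an n-observation composite."""
    rk = composite_reliability
    n = n_observations
    return rk / (n - (n - 1) * rk)


if __name__ == "__main__":
    # Reported value from the text: 0.77 for 8 mini-CEX observations.
    r1 = implied_single_reliability(0.77, 8)
    print(f"Implied single-observation reliability: {r1:.2f}")
    for n in (4, 8, 12):
        print(f"{n:2d} observations -> projected reliability "
              f"{spearman_brown(r1, n):.2f}")
```

Under these assumptions a single observation is far from sufficient on its own, which is consistent with the emphasis in the text on multiple observations by different faculty members over time.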
An essential component of the mini-CEX, as with any evaluation, is feedback. A recent study investigated the feedback generated from the mini-CEX observation by audiotaping the attending-resident feedback session, with a particular focus on interactive feedback.188 Interactive feedback was defined as any feedback that provided a recommendation plus self-assessment, allowed the learner to react to the feedback, and included development of an action plan. The study showed that 80% of the feedback sessions included at least one recommendation for improvement for the resident, and on average each feedback session contained 2 recommendations. The majority of recommendations, as might be expected, involved the clinical skills of medical interviewing, physical examination, and counseling. However, despite the large number of recommendations, only 8 sessions concluded with a specific action plan from the faculty member on how to carry out the recommendation or improve.187 Including an action plan that enables the learner to act on the recommendations provided is a very important aspect of feedback.
Checklists targeting specific skills are another tool that can improve the quality of faculty observation. However, since the purpose of faculty direct observation is to assess performance of actual clinical practice, it is not feasible to develop highly detailed checklists for every patient encounter. Some degree of faculty interpretation of behavior and skills will be required when working in actual clinical settings. A number of checklists for assessment of interviewing skills have been developed and tested for reliability. Both the SEGUE and Calgary-Cambridge checklists are useful tools to guide the evaluation of process and general content of medical interviewing.189-190 Structured clinical observation is another observation technique that uses guidelines and observation sheets to systematically assess skills in history-taking, physical examination, and information-giving.191
Creating an Observation System
There are three simple steps in creating a faculty observation system. First, determine what your faculty are doing with regard to observation. If no observation is occurring, you will probably have to create a “need” for observation. Highlighting the substantial deficiencies in clinical skills among students provides ample evidence of the need to perform observation. Second, start small and get the faculty to perform some form of observation. Usually, faculty will then observe these deficiencies themselves. Once that happens, it becomes very difficult for your faculty to argue they no longer need to observe students, especially from a patient-centered perspective.
The next step is to improve faculty skill in observation, which, depending on your educational climate, can be done concurrently with creating the need for observation. We recommend you start with performance dimension and behavioral observation training. This can be done in a series of brief workshops, evaluation sessions, or at faculty meetings. Once your group feels comfortable with the definitions and criteria for the clinical skills competencies, you can then move on to frame of reference training and direct observation of competence training to improve faculty accuracy and ability to distinguish between levels of competence.
The successful practice of medicine requires the effective application of medical interviewing, physical examination, and counseling skills. Studies continue to document significant deficiencies in all three of these clinical skills areas among students. Direct observation by medical faculty remains an essential method to assess core basic clinical skills with actual patients. Furthermore, faculty are in the best position to assess students’ acquisition and refinement of clinical skills longitudinally.
Pre-clerkship information that may be used to identify students who are at risk for poor clerkship outcome includes pre-matriculation data, basic science performance, incidents of unprofessional behavior, standardized test results (e.g., USMLE Step 1 and NBME subject examination scores), and in-house clerkship pre-tests. This section will briefly review the predictive power of some commonly available measurements, and then discuss the ways this information might be used to identify and help students. It is a truism that “test performance predicts test performance”, and prior measures of knowledge may readily predict a student’s ability to acquire factual knowledge during clerkships.192-193 However, competency in clinical skills, professionalism, data analysis and problem solving is also critical to successful clerkship outcomes, and there are far fewer data supporting the ability of pre-clerkship variables to predict deficits in the skill and attitudinal domains. Therefore, while most of the advice in this chapter is informed by published educational literature, some recommendations rely on the experience and judgment of the authors.
Undergraduate Grade Point Average
A strong positive correlation between undergraduate GPA and subsequent measures of knowledge has consistently been reported by multiple authors.194-195 Undergraduate GPA, as a measure of knowledge and test taking skills, predicts performance on knowledge assessments in medical school. However, the relationship between undergraduate GPA and clinical skills or professional attitudes has not been well described. Clinical skills deficits and unprofessional behavior do not always track consistently with knowledge deficits. A strong knowledge base does not necessarily “protect” a student from deficits in other areas. Further research exploring this association is needed.
MCAT
Much like undergraduate GPA, performance on the MCAT correlates well with subsequent standardized tests, including licensure examinations and NBME subject examinations.194 In 1991, the MCAT introduced a writing sample component, as a measure of a student’s ability to synthesize and communicate information.196 The writing sample does not add value to other measurements in predicting USMLE Step 1 or Step 2 performance,197 but may help predict other clerkship outcomes, such as global clinical competence, data gathering and communication skills, and this correlation persists into residency.198
Admissions committee interviews
Narrative comments from medical school admissions interviews are another pre-clerkship source of information. Although the process is hardly standardized199 and is of low yield,200 admission committee narratives may predict clinical performance – perhaps even better than undergraduate GPA.201 Thus, as the earliest form of observation-based evaluation done by a school’s own faculty, these narrative remarks have a potential role in identifying students who may subsequently have difficulty in clinical skills and non-cognitive domains. Of course, candidates with adverse comments are less likely to be admitted to a school, and observations about successful applicants are not regularly provided to or used by clerkship directors. It is not at all clear that interventions based o