Some thoughts about generalizing the role that confidence plays in the accuracy of eyewitness memory

Ebbe B. Ebbesen

University of California, San Diego

November 3, 2000

 

In a relatively recent survey (Kassin, Ellsworth, & Smith, 1989), researchers in the area of eyewitness memory indicated that they believed that the relationship between witnesses' confidence in their identifications and the accuracy of those identifications is weak, at best. In fact, in a recent paper Wells, et al. (1998) concluded:

 

Jurors appear to overestimate the accuracy of identifications, fail to differentiate accurate from inaccurate eyewitnesses -- because they rely so heavily on witness confidence, which is relatively nondiagnostic -- and are generally insensitive to other factors that influence identification accuracy. (p 642)

 

This shared expert opinion (see Penrod & Cutler, 1995, for a similar argument, and Leippe, 2000, for a similar statement in a written report to court in an actual car hijacking case) seems to be based on three sources of evidence. First, a large number of studies in the experimental literature report low and frequently non-significant correlations between rated confidence in identifications and the accuracy of those identifications (e.g., Bothwell, Deffenbacher, & Brigham, 1987). Second, an extensive series of studies show that eyewitness accuracy varies as a function of factors other than confidence (e.g., stress, duration of exposure, instructions, the nature of the lineup procedures, post-event information, and so on).[1] Third, a number of studies have shown that it is possible to produce changes in the confidence that witnesses express in their memories independent of changes in the accuracy of those memories. All of these studies are based on several different types of experimental tests and therefore appear to offer a form of convergent validation to the conclusion that the relationship between confidence and accuracy is relatively weak to non-existent, especially when compared to other predictors of accuracy. A weak relationship between confidence and accuracy would be consistent with the hypothesis that people do not have direct or (more properly) valid access to the strength of their memories.

 

The conclusion reached by Wells, et al. (1998) that jurors make mistakes because they emphasize witness confidence over other factors is based on two very strong applied assumptions. The first is the non-intuitive assumption that witness self-confidence is not a good predictor of the accuracy of witness testimony. The second is the assumption that other factors, as they are typically available to jurors, are more diagnostic (than confidence) of eyewitness accuracy. This paper argues that neither of these assumptions can be reasonably applied to the real world given the nature of the theory, methodology, and research results that underlie them and the nature of the decisions faced by decision-makers in the legal system.

Optimality Hypothesis

Deffenbacher (1980) proposed the "optimality hypothesis" to explain the wide range of confidence-accuracy correlations that he noted in his review of the literature conducted prior to 1980. He argued that correlations between confidence and accuracy would tend to be low when the conditions of learning and memory are less than optimal, e.g., when it is difficult for witnesses to encode and/or retrieve the information to which they have been exposed. The correlations would be high only under conditions of optimal learning and memory. Finally, he argued on intuitive grounds that most crimes consist of less than optimal learning and memory conditions (e.g., they tend to involve a great deal of stress, the exposures are generally brief, there is usually a long delay between observing the criminal and being asked to identify him, the test procedures tend not to emphasize effective retrieval strategies).[2] As a result, he concluded that the correlation between accuracy and confidence would be low for witnesses and victims to actual crimes. He further concluded that jurors and other key decision-makers should be made aware that confidence is not an indicator of the accuracy of witness memories.[3]

 

One reason that laboratory studies might support the optimality hypothesis, at least in the extreme, is the fact that subjects who have no memory for an event but are forced to respond might guess. If some subjects in experiments guess, clearly, by chance they will be correct some of the time and incorrect other times. However, we would also expect that whatever confidence they express in these guesses would average to about the same levels for correct guesses as for the incorrect ones. After all, the subjects would not know which guesses are correct and which are not. Thus, when the learning conditions are so bad that observers can do no better than to guess randomly, the relationship between confidence and accuracy should be zero, just as Deffenbacher's hypothesis argues.

                                                                                                                                          

On the other hand, once the subjects' average strengths of memory increase above zero, the relationship between confidence and accuracy can grow stronger because self-knowledge about whether a response is correct can now be based, at least some of the time, on veridical memories. That is, a subject might have a very strong, and accurate, recollection. As a result, this subject might be very confident. Consistent with this analysis, Ebbesen and Wixted (1996) used signal detection theory (Macmillan & Creelman, 1991) and Monte-Carlo simulation methods to demonstrate how the size of confidence-accuracy correlations will tend to increase with increasing d'. Interestingly, if the present reasoning is correct, it suggests that Deffenbacher's claim that the confidence-accuracy relationship will be weak to non-existent in actual crime situations is equivalent to the claim that witnesses to actual crimes have no memory for the events and are just guessing.[4] Clearly this conclusion is a much stronger one than the claim that there is a weak relationship between confidence and accuracy. It is also a conclusion for which there is no empirical support.[5]
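The reasoning in the two preceding paragraphs can be illustrated with a small Monte-Carlo sketch. It is not the simulation reported by Ebbesen and Wixted (1996); it assumes an equal-variance Gaussian signal detection model, an unbiased criterion placed at d'/2, and confidence defined as the distance of the memory strength from that criterion. The function names and all parameter values are assumptions chosen for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_ca_correlation(d_prime, n_trials=20000, seed=0):
    """Confidence-accuracy correlation under an assumed signal detection model."""
    rng = random.Random(seed)
    criterion = d_prime / 2.0  # assumed unbiased decision criterion
    confidence, accuracy = [], []
    for _ in range(n_trials):
        is_old = rng.random() < 0.5                  # half the test items were seen
        strength = rng.gauss(d_prime if is_old else 0.0, 1.0)
        says_old = strength > criterion              # "yes" if strength exceeds criterion
        confidence.append(abs(strength - criterion)) # assumed confidence rule
        accuracy.append(1.0 if says_old == is_old else 0.0)
    return pearson(confidence, accuracy)

# Correlations for pure guessing (d' = 0) and for increasingly good memory.
results = {d: simulate_ca_correlation(d) for d in (0.0, 0.5, 1.0, 2.0)}
for d, r in results.items():
    print(f"d' = {d:.1f}  r = {r:.3f}")
```

Under these assumptions the correlation is close to zero when d' is zero (pure guessing) and grows as d' increases over the range shown. Different criterion placements or confidence rules would change the numerical values.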

 

Several reviews conducted after Deffenbacher's concluded on a slightly, albeit very slightly, more positive note than Deffenbacher. For example, Fleet, Brigham, & Bothwell (1987) concluded:

 

The claims of previous reviewers of the confidence-accuracy literature (Deffenbacher, 1980; Leippe, 1980; Wells & Murray, 1984) that confidence is an unreliable predictor of accuracy are perhaps premature. In addition to the unresolved issues of how to subdivide the research samples, there are the issues concerning ecological validity.  For example, several recent field studies have found a significant correlation between confidence and accuracy (Brigham, Maass, Snyder, & Spaulding, 1982; Hosch & Platz, 1984; Krafka & Penrod, 1985; Pigott & Brigham, 1985). (p 183)

 

Other Factors

More recently, some have argued that the size of the correlation between confidence and accuracy may depend on other factors besides the optimality of initial learning and memory conditions (e.g., Clark, 1997; Cutler & Penrod, 1989; Ebbesen & Wixted, 1996; Libuser & Ebbesen, 1999; Lindsay, Read, & Sharma, 1998; Robinson & Johnson, 1998; Wells, et al., 1998). For example, whether the confidence estimate is obtained prior to or after the identification response is one factor that seems to moderate the size of confidence-accuracy correlations (Cutler & Penrod, 1989). Another is the difference between choosers, i.e., those who pick someone, and non-choosers, i.e., those who fail to pick anyone (Sporer, Penrod, Read, & Cutler, 1995). Still another is the possibility that feedback about the accuracy of an identification might affect confidence in that identification (Wells & Bradfield, 1998). Robinson & Johnson (1996) suggested still another moderating variable. They reported evidence that the testing procedure (recall compared to recognition) affects the degree of relationship between confidence and accuracy of memory. Kebbell, Wagstaff, & Covey (1996) suggested that the low correlations might be due to relatively small variation in the difficulty of the items used in the memory tests. Clark (1997) has presented some data suggesting that the similarity among the items that people are attempting to recognize might play a role not only in the size but the direction of the confidence-accuracy relationship. Approaching the problem more generally, Ebbesen & Wixted (1996) used signal detection theory to describe the ways in which confidence and accuracy might be related. In the typical signal detection view, confidence estimates are simply additional decision (judgment) criteria placed on the same subjective strength-of-memory dimension used to identify someone as the culprit. This signal detection view provides an explanation for the "optimality" findings as well as chooser v. non-chooser differences (e.g., Sporer, 1992; Sporer, Penrod, Read, & Cutler, 1995). It also raises a number of issues that have been all but ignored by those concluding that jurors should ignore confidence because there is no relationship between confidence and accuracy.

 

One of the issues raised by the signal detection analysis concerns the specific method of aggregation used in the computation of the relationship between confidence and accuracy. In particular, correlations between confidence and accuracy can be computed in a number of different ways. To fully appreciate these different methods requires that we examine how researchers have studied eyewitness identification accuracy, confidence, and their relationship. Event memory, face memory, and fact memory are the most commonly used procedures to acquire information about accuracy and confidence.

Methods of collecting evidence about eyewitness accuracy and confidence

Event Memory

In event memory research, study participants are presented with a single event in which they observe one individual (or a very small number of individuals) do something. For example, participants might watch a slide presentation or a videotape of a simulated robbery or they might be present in a room when someone enters and does something unusual or unexpected. Afterwards, in the large majority of these studies the participants are asked to look at a photographic array of individuals (usually but not always consisting of six people) and attempt to identify the person that they saw in the event. In some studies the participants are asked how confident they are in their ability to identify the person(s) that they saw in the event prior to being shown a lineup as well as how confident they are in a particular response made to the lineup. For example, a "witness" might be asked, "How confident are you that you would be able to identify the person you saw were you to see him in a lineup?" Then after being shown the lineup and picking someone, the "witness" might be asked, "How confident are you that the person you picked is the person you saw?" or after declining to pick someone, "How confident are you that the person you saw is not in the lineup?" In other studies, the post-lineup confidence question might be, "How confident are you in your response?" It is important, as we shall soon see, to note that event memory research generally produces one post-lineup confidence response and one "identification" response per participant. That is, each participant sees one event, attempts to identify someone from one lineup, and then indicates how confident he or she is in that response. In general, the event and the criminal are held constant across all witnesses within a particular study. In addition, in the majority of studies the participants know that their choices have little or no real consequences. That is, no one will be accused of committing a crime on the basis of the participants' choices.

 

In more recent research, studies such as these present half of the participants with target-present lineups that contain the "culprit" who was in the videotape or who did the unusual act and half with target-absent or blank lineups that do not contain the "culprit". In the latter, the target is frequently replaced with someone who looks similar to but is not the target.[6] Note that the participant's choices can be coded as correct or incorrect. However, there are several different types of correct and incorrect responses. A participant might correctly choose the culprit from the target-present lineup or correctly not pick anyone from a target-absent lineup. Alternatively, the participant might make several different types of incorrect responses. She might pick a "foil" from either the target-present or the target-absent lineup or she might not pick anyone when presented with the target-present lineup.

 

It is important to note that these different errors would have different implications for the legal system were they produced by actual eyewitnesses. For example, when a witness fails to pick the actual culprit from a target-present lineup, the culprit will probably not be charged with the crime (assuming other strong evidence against him does not exist) and a guilty person will be set free. Similarly, when a witness picks a "foil" from the target-present lineup, the guilty culprit will again be set free and the "foil" will, in all likelihood, not be charged with the crime because the police who constructed the lineup generally know that the foils are innocent. If, on the other hand, the witness picks the "suspect" in a target-absent lineup (that is, a person who the police believe committed the crime but is actually innocent), then, in a miscarriage of justice, the wrong person will be charged with the crime (assuming other evidence is not sufficient to exonerate the innocent individual) and the guilty person will go free.[7]
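The response categories just described can be summarized as a simple lookup. The sketch below is illustrative only: the function name `classify_lineup_response`, the three choice labels, and the outcome labels are assumptions introduced here, not terminology taken from the studies cited.

```python
def classify_lineup_response(target_present, choice):
    """Classify one lineup response.

    target_present: True if the actual culprit is in the lineup.
    choice: "suspect", "foil", or "none" (hypothetical labels).
    Returns a tuple (outcome_label, is_correct).
    """
    if choice not in ("suspect", "foil", "none"):
        raise ValueError("choice must be 'suspect', 'foil', or 'none'")
    if target_present:
        # The suspect is the actual culprit.
        if choice == "suspect":
            return ("correct identification", True)
        if choice == "foil":
            return ("foil identification", False)  # culprit goes free
        return ("miss", False)                      # culprit goes free
    # Target-absent lineup: the suspect is innocent.
    if choice == "suspect":
        return ("false identification of innocent suspect", False)
    if choice == "foil":
        return ("foil identification", False)       # known error, foil not charged
    return ("correct rejection", True)

print(classify_lineup_response(False, "suspect"))
```

Only one of the six cells, the choice of an innocent suspect from a target-absent lineup, corresponds to the error that a misidentification defense alleges.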

Face Memory

In face memory research, participants are shown a large number of faces one at a time (via slides or pictures) and frequently asked to make some sort of judgment about each one (generally, to ensure that the participants are paying attention). After looking at all of the faces, typically for no more than a few seconds each (Shapiro & Penrod, 1986), they are tested for their memory of the faces. Often the test consists of a "yes/no" task but sometimes a "two-alternative forced-choice" procedure is used. In the "yes/no" task, the participants are shown another large set of faces. They are told that they saw some of the test faces before but did not see others. They are also told that their job is to indicate which they had seen before. They are to say, "yes," if they believe that they saw the face in the first set and, "no," if they believe they did not. After each "yes/no" response, they might be asked to indicate how confident they are in their response. In the two-alternative forced-choice procedure the participants are presented with two faces at once and asked to indicate which of the two they saw in the first set and which they did not see (generally one of each pair was seen before). Again, the subject might be asked to indicate how confident they are that each response is correct. In both procedures each memory response can be coded as correct or incorrect. In the "yes/no" task, the participant can be correct by either picking a person they saw before (called a hit) or saying, "no," to a person they did not see before (called a correct rejection). Similarly, they can be incorrect by picking someone that they did not see before (called a false alarm) or not picking someone whom they did see before (called a miss). In the two-alternative forced-choice procedure, the participant either picks the face they did see before or picks the one they did not see before. Unlike in the event memory procedure, it is possible to aggregate the results from all of the responses to all of the faces together for each subject. Researchers can compute an overall, "percent correct" score for each subject.
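As a concrete illustration of this coding scheme, the following sketch tallies the four "yes/no" response types for one hypothetical participant and computes the overall percent correct. It also computes d' from the hit and false alarm rates using the standard equal-variance formula, z(hit rate) minus z(false alarm rate). The trial data and function names are made up for the example.

```python
from statistics import NormalDist

def score_yes_no(trials):
    """Tally responses; trials is a list of (was_studied, said_yes) booleans."""
    counts = {"hit": 0, "miss": 0, "false_alarm": 0, "correct_rejection": 0}
    for was_studied, said_yes in trials:
        if was_studied:
            counts["hit" if said_yes else "miss"] += 1
        else:
            counts["false_alarm" if said_yes else "correct_rejection"] += 1
    return counts

def percent_correct(counts):
    """Proportion of all responses that are hits or correct rejections."""
    return (counts["hit"] + counts["correct_rejection"]) / sum(counts.values())

def d_prime(counts):
    """Equal-variance d'; undefined when either rate is exactly 0 or 1."""
    hit_rate = counts["hit"] / (counts["hit"] + counts["miss"])
    fa_rate = counts["false_alarm"] / (
        counts["false_alarm"] + counts["correct_rejection"])
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Hypothetical participant: 4 studied faces and 4 new faces at test.
example_trials = ([(True, True)] * 3 + [(True, False)]
                  + [(False, True)] + [(False, False)] * 3)
example_counts = score_yes_no(example_trials)
print(example_counts, percent_correct(example_counts),
      round(d_prime(example_counts), 2))
```

In practice a correction is needed when a participant has a hit or false alarm rate of exactly 0 or 1, since the z transformation is undefined there; this sketch does not attempt one.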

Fact Memory

In fact memory research, participants are asked a series of questions about an event, frequently the same event for which they are shown a lineup. The researcher establishes a set of correct answers to the questions depending on the match between what happened in the event and the answer given. For example, the experimenter might ask a participant whether the "culprit" held a pen in his hand. Responses are coded as correct or incorrect depending on the degree of match. After each response, the experimenter might ask the participant how confident she is in her answer.

Methods of measuring the association between confidence and accuracy

Relationship between confidence and accuracy in event memory research

Researchers generally estimate the size of the relationship between confidence and accuracy by computing correlations between the two measures. Bothwell, Deffenbacher, & Brigham (1987) reviewed many of the event memory studies and concluded that the average correlation between confidence and accuracy was, although probably greater than zero, unlikely to be much larger than .25. In these correlations, each participant contributes two observations, a memory response that is either correct or not (coded 1 or 0) and a confidence rating (coded 1 through n, depending on the number of steps on the confidence scale, e.g., not at all confident, slightly confident, moderately confident, etc.). The data will look something similar to the graph in Figure 1. Each participant is either right or wrong and gives a confidence rating that goes along with his or her response. Each of the dots in Figure 1 represents the results for a group of participants, all of whom indicated a particular confidence level and were correct (coded 1) or not (coded 0). Although we can't see it in the graph, the number of participants whose responses put them at a particular point (1 or 0) varies. Thus, when the relationship between confidence and accuracy is high, we would expect most of the participants who indicated that they were very confident (5) would be correct and most of those indicating a low confidence (1) would be incorrect. Stated differently, the proportion of highly confident people who are correct should be higher than the proportion of unconfident people who are correct.

 

 


Figure 1. Linear fit to dot plot of correct and incorrect responses as a function of rated confidence. Each dot represents a number of observations at the conjoint accuracy and confidence values. Of particular importance is the fact that the best fitting linear function will be unable to provide a very good fit of the data even when a large majority of the high confidence responses are correct and a large majority of the low confidence responses are incorrect.

 


Interestingly, this method of computing correlations between confidence and accuracy is constrained to produce generally low correlation coefficients, even when the proportion of correct highly confident people is greater than the proportion of correct unconfident witnesses. This is because correlations fit a continuous linear function to data points. A perfect correlation is obtained when the linear function runs through all of the data points, as in Figure 2. However, as can be seen in Figure 1, any attempt to fit a straight line to the data will fall in between all of the data points because the line must move through values that are between 0 and 1 but the data points are constrained to be either 0 (incorrect) or 1 (correct) and can be nothing in between. As a consequence, the resulting correlation coefficients will tend to be numerically low. The fact that the proportion of correct responses at the very lowest (e.g., guessing) confidence level will generally be greater than 0 (by chance, some subjects who guess will be correct) also tends to reduce the upper limit of the size of the correlations that one might expect from this research.[8] In short, the fact that many event memory studies report low correlations potentially tells us more about the inappropriate use of the correlation coefficient to measure the size of the relationship between confidence and accuracy in event memory studies than about whether there is a strong relationship between confidence and accuracy in event memory.
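A small worked example makes the constraint concrete. The data below are made up: 100 hypothetical participants at each of five confidence levels, with the proportion correct rising in equal steps from 40% at the lowest level to 80% at the highest. The sketch computes the correlation both ways, over individual 0/1 responses (as in Figure 1) and over the proportion correct at each confidence level (as in Figure 2).

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

levels = [1, 2, 3, 4, 5]                    # rated confidence
n_per_level = 100                           # hypothetical participants per level
n_correct_per_level = [40, 50, 60, 70, 80]  # made-up counts of correct responses

# One (confidence, accuracy) pair per participant, accuracy coded 1 or 0.
confidence, accuracy = [], []
for level, n_correct in zip(levels, n_correct_per_level):
    confidence += [level] * n_per_level
    accuracy += [1] * n_correct + [0] * (n_per_level - n_correct)

r_individual = pearson(confidence, accuracy)

# One (confidence, proportion correct) pair per confidence level.
proportions = [c / n_per_level for c in n_correct_per_level]
r_grouped = pearson(levels, proportions)

print(round(r_individual, 3), round(r_grouped, 3))
```

With these made-up numbers the highly confident participants are twice as likely to be correct as the least confident ones, and the proportion correct is a perfectly linear function of confidence, yet the correlation computed over individual 0/1 responses is only about .29.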

 

Ignoring the problems with using correlation coefficients in event memory studies for the moment, it is of some interest to note that researchers have reported that the correlations (though still small in absolute size) for witnesses who choose someone from a lineup are higher than for witnesses who choose no one (e.g., Sporer, 1992; Sporer, Penrod, Read, & Cutler, 1995). If this is a general result, it has applied significance. Defense attorneys sometimes argue that their client is innocent of the charges, either because the charges describe behavior in which they claim their client did not engage (e.g., the drugs belonged to the other person) or because their client is the wrong person (e.g., he was home with his mother at the time the store was robbed). When defense attorneys argue that their client has been misidentified, they are claiming that a witness who positively identified their client is wrong. In other words, the defense must be that the witness was presented with a target-absent lineup and chose an innocent suspect, not that the witness was presented with a target-present lineup and didn't choose the guilty culprit. Thus, when attempting to generalize research results to the real world, experts who argue that witness confidence is an unreliable indicator (overall) are potentially misapplying the research because current research seems to suggest that confidence is a more reliable indicator for subject-witnesses who have identified someone than for those who chose not to identify anyone. To be consistent with the domain to which the results are being generalized, experts who worry that too many innocent people are being falsely identified should be basing their conclusions on the confidence-accuracy relationship for choosers and not on the relationship for both choosers and non-choosers. In my experience, I have never heard experts testify this way when speaking about the confidence-accuracy relationship.

                                                                   


 


Figure 2. Linear fit of average degree of confidence to the proportion of correct responses.

 

The use of correlations to measure the relationship between confidence and accuracy in event memory studies raises another important issue, namely, the fact that the data points represent the behavior of different witnesses who have witnessed the identical (or nearly identical) event. Thus, the variation from data point to data point in Figure 1 is variation from one witness to the next. Because each witness saw the identical videotape with the same culprit, the event memory correlation represents differences in confidence and accuracy that must be due to "pre-existing" psychological differences between the witnesses and not to the fact that different witnesses saw the culprit under different learning conditions. The only way these correlations could be high is if people who have better face memory (e.g., Benton, Sivan, Hamsher, Varney, & Spreen, 1983) or who attend more closely or who process faces more deeply, etc., are also people who tend to provide higher confidence ratings.[9]  The typical analyses of event memory studies do not allow for the possibility that the reason some witnesses in the real world are more confident in their identifications than others is because they saw the culprit under better learning conditions and therefore had better memories of the culprit (see Lindsay, Read, & Sharma, 1998, for a similar argument).

 

Use of the single-event, multiple-witness memory procedure opens the door to the possibility that different participants in the research will use the measurement scale differently to express their confidence. Not only might different people be more or less likely to remember what the "culprit" looked like (for whatever unique and unknown individual differences) but people whose memories might be equally good (or bad) may also be more or less likely to label their confidence as very high or very low on the rating scales. Thus, individual differences in how people use the confidence scale that are uncorrelated with individual differences that cause differences in strength of memory for the event will tend to attenuate single-event, multiple-witness confidence-accuracy correlations.

 

The fact that the generalization is over individuals who witnessed the identical event is important because we can ask what an outside observer might infer from an in-court confidence statement made by a single witness. In the real world, individuals who have to decide whether a witness is correct generally do not have the luxury of hearing from multiple witnesses, all of whom observed the identical crime from the same visual angle for the same amount of time. Instead, most decision-makers (e.g., detectives, jurors or prosecutors) hear one witness make an identification response and then provide a confidence estimate. When a witness tells us that he is confident in his memory of a particular person or event, do we infer from this expression of confidence that this witness is more likely to remember correctly than someone else might be -- a someone else whom neither the witness nor we have seen or heard? Alternatively, do we infer that this witness is telling us something about this particular memory compared to other memories this witness has? If it is the latter, then the between-subjects correlations that make up a major part of the database (see Bothwell et al., 1987; Sporer, et al., 1995) for the experts' opinions may employ an inappropriate method of aggregation to study the relationship between confidence and accuracy. I shall return to this point later.

Relationship between confidence and accuracy in face memory research

Individual Differences in Face Memory and General Confidence: In face memory research correlations between each participant's overall accuracy (e.g., the percentage of all of their responses to all of the test faces that are correct) and each participant's average confidence (for all of their responses, both correct and incorrect ones) are generally computed. The results for a zero correlation might look something like the data represented in Figure 3. Note that a correlation computed on the basis of the confidence and memory data presented in Figure 3 represents something different than that previously described. In this case, we are asking whether people who are generally more confident in all of their attempts to identify faces that they have seen before are also getting a higher percentage of their identifications correct. Like the event memory case, these correlations represent individual differences among people who studied the identical set of faces. However, these are based on averages over many identifications rather than just one identification of one face. As such, they represent individual differences in tendencies to be confident and tendencies to be correct. Are people who are more likely to use the higher ends of the confidence scale also more likely to identify correctly faces that they have and have not seen before?

 

 


 


Figure 3. Scatter plot of average degree of confidence to proportion of correct responses.

 

Differences in the Memorability of Faces and General Confidence: Although not typically reported, another correlation can generally be computed from data obtained from face memory research. In this case, averages are computed for each face rather than for each participant. That is, rather than average all the data from all of the faces that each participant saw, it is possible to average all of the participants' responses together for a particular face. In this case, one is asking whether people are more likely to identify correctly some faces than others. In addition, are the participants more confident, as a group, in their responses to those faces that they are more likely to identify correctly? In this case, the generalization is across "criminals" (actually faces) rather than witnesses. Are the more memorable faces those that witnesses (as a group) tend to be most confident about?

 

One potentially important difference between the individual-difference and face-based correlations is that in the latter individual differences in how people use the confidence scale will average out whereas in the former they will be exaggerated. If different people tend to use confidence scales differently but each person is generally more confident in those faces that they correctly recognize, then we would expect the face-based correlations to be higher than the individual-difference correlations. This is exactly the result that Ebbesen & Wixted (1996) reported. The fact that face-based compared to individual-difference confidence-accuracy correlations may be higher suggests that one of the reasons single-event, multiple-subject correlations are generally low is because they are sensitive to individual differences in how people use confidence scales.
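The contrast between the two methods of aggregation can be sketched with a toy simulation. It is not a reconstruction of the Ebbesen & Wixted (1996) analysis. It assumes that each face has its own memorability, that each subject has a small ability difference and a personal confidence-scale bias unrelated to ability, that a response is correct when a noisy evidence value is positive, and that rated confidence is the absolute evidence plus the subject's bias. All names and parameter values are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def simulate(bias_sd, n_subjects=200, n_faces=40, seed=1):
    """Return (individual-difference r, face-based r) for one simulated study."""
    rng = random.Random(seed)
    memorability = [rng.uniform(0.0, 2.0) for _ in range(n_faces)]
    acc, conf = [], []
    for _ in range(n_subjects):
        ability = rng.gauss(0.0, 0.5)            # individual memory difference
        bias = bias_sd * rng.gauss(0.0, 1.0)     # personal use of the rating scale
        acc_row, conf_row = [], []
        for m in memorability:
            evidence = rng.gauss(m + ability, 1.0)
            acc_row.append(1.0 if evidence > 0.0 else 0.0)
            conf_row.append(abs(evidence) + bias)
        acc.append(acc_row)
        conf.append(conf_row)
    # Average over faces within each subject (individual-difference correlation).
    subject_r = pearson([mean(r) for r in acc], [mean(r) for r in conf])
    # Average over subjects within each face (face-based correlation).
    face_r = pearson([mean(c) for c in zip(*acc)], [mean(c) for c in zip(*conf)])
    return subject_r, face_r

subject_r_no_bias, face_r_no_bias = simulate(bias_sd=0.0)
subject_r_bias, face_r_bias = simulate(bias_sd=1.0)
print(subject_r_no_bias, subject_r_bias, face_r_no_bias, face_r_bias)
```

In this sketch, adding scale-use differences lowers the individual-difference correlation but leaves the face-based correlation unchanged, because every face is rated by the same subjects and their biases shift every face's average confidence by the same amount.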

Relationship between confidence and accuracy in fact memory research

A correlation can be computed for each witness in fact-memory research. Like the data in Figure 1, witnesses get each of the facts correct or incorrect. However, since each witness is tested on multiple, rather than just one, fact, it is possible to compute a correlation for each witness (e.g., Smith, Kassin, & Ellsworth, 1989). In this case, one is asking whether the fact responses that a witness is most confident in are more likely to be correct than the witness's less confident fact responses. Here a separate generalization can be made for each witness over each witness's many different attempts to remember things (Gruneberg & Sykes, 1993; Smith, Kassin, & Ellsworth, 1989).

 

It might be worth noting that the same correlations can be computed for each participant in face memory research (Ebbesen & Wixted, 1996). That is, we can ask whether for a particular witness those faces in which he expresses the most confidence are those faces to which he is most likely to respond correctly.

 

This method of measuring the association between confidence and accuracy has the same statistical limitations as that described for event memory (e.g., correct and incorrect responses are coded 1 and 0 and the resulting correlations will generally be small). However, these correlations represent variation over memories for different events/people within each witness and not individual differences in memory for the same event. That is, they represent multiple-event, single-subject correlations.
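A multiple-event, single-subject correlation is computed separately for each witness, over that witness's own responses. The sketch below uses made-up data for three hypothetical witnesses answering six fact questions each. It also shows one practical limit of the method: the correlation is undefined for a witness whose responses are all correct (or all incorrect), so the helper returns `None` in that case.

```python
import math

def pearson_or_none(xs, ys):
    """Pearson correlation, or None when either variable has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

# Made-up data: accuracy (1 = correct, 0 = incorrect) and confidence (1 to 5)
# for each of six facts, listed separately for each hypothetical witness.
witnesses = {
    "A": {"correct": [1, 1, 0, 1, 0, 0], "confidence": [5, 4, 2, 4, 1, 3]},
    "B": {"correct": [1, 1, 1, 1, 1, 1], "confidence": [5, 3, 4, 2, 5, 4]},
    "C": {"correct": [1, 0, 1, 0, 1, 0], "confidence": [2, 4, 1, 5, 2, 4]},
}

within_witness_r = {
    name: pearson_or_none(data["confidence"], data["correct"])
    for name, data in witnesses.items()
}
print(within_witness_r)
```

Witness A is more confident in correct answers than in incorrect ones, witness C shows the reverse pattern, and witness B contributes no estimate at all because every answer is correct.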

 

It is also possible to compute individual difference and fact-based (like face-based) correlations. That is, the total percentage of correct facts that each witness remembers can be compared to each witness's average confidence. Alternatively, the percentage of witnesses who respond correctly to each fact can be compared to the average confidence that all witnesses report for each fact.

Generalization over criminals, crimes, and witnesses

Very few studies have examined the relationship between confidence and accuracy over a wide array of different events/criminals (Lindsay, et al., 1998). In such an analysis, variation in the event as well as the witness and the criminal contribute to differences in both accuracy and confidence. That is, different data points represent different witnesses, criminals, and events (events that might differ in terms of key factors, e.g., other v. same race, stress, retention interval, disguise, distinctive features, and so on). Lindsay, et al. (1998) and Read, et al. (1998) reported that this method of constructing correlations between confidence and accuracy resulted in higher correlations than have been typically reported. In other words, when the variation that different learning conditions produce in different people's memory is not held constant, i.e., multiple-event, multiple-witness memory procedures are used, the confidence-accuracy correlations appear to be larger. This is unsurprising for two reasons.

 

First, when we allow stimulus variations to influence accuracy, the range of memory strength and therefore accuracy values may well be larger and more evenly distributed over subjects than when stimulus variations are held constant. If subjects do have access to relatively reliable information about the strength of their memories, an increase in the range of strengths of memory should increase the size of confidence-accuracy correlations. Second, the multiple-event, multiple-witness procedure can increase the size of confidence-accuracy correlations even if subjects do not have direct access to the strength of their memories. If subjects generate confidence estimates, at least in part, on observations of the learning and test conditions (they know they saw the culprit for five minutes) and the meta-theory that they use to infer confidence is sufficiently accurate, then their confidence estimates will tend to co-vary with the event differences that control accuracy. Again, this would tend to increase the size of confidence-accuracy correlations.
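
The second reason can be illustrated with a deliberately artificial data set. In the sketch below (all numbers are hypothetical), confidence is completely unrelated to accuracy among witnesses to the same event, but witnesses to the longer exposure report higher confidence and are more often correct. Pooling over events then produces a clearly positive correlation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical (confidence, correct) pairs for eight witnesses per event.
# Short exposure: low confidence, 25% correct at each confidence level.
short_event = [(1, 1), (2, 1), (1, 0), (2, 0), (1, 0), (2, 0), (1, 0), (2, 0)]
# Long exposure: high confidence, 75% correct at each confidence level.
long_event = [(4, 1), (5, 1), (4, 1), (5, 1), (4, 1), (5, 1), (4, 0), (5, 0)]

def confidence_accuracy_r(records):
    conf = [c for c, _ in records]
    acc = [a for _, a in records]
    return pearson(conf, acc)

r_short = confidence_accuracy_r(short_event)               # single event
r_long = confidence_accuracy_r(long_event)                 # single event
r_pooled = confidence_accuracy_r(short_event + long_event) # multiple events

print(r_short, r_long, r_pooled)
```

With these invented numbers both single-event correlations are zero while the pooled correlation is roughly .47, even though no witness in the example has any privileged access to the strength of his or her own memory.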

 

If the multiple-event, multiple-witness procedure for producing variation in accuracy and confidence data does produce higher confidence-accuracy correlations, the conclusions reached by Bothwell, Wells, Penrod and others regarding the relatively weak relationship between confidence and accuracy might be premature and apply only to single-event, multiple-witness data. Since in the real world of crime, different witnesses do not see the same crime from the same visual angle for the same amount of time, one could easily argue that generalizations made from the single-event, multiple-witness paradigm are inappropriate.

What is the appropriate method of aggregation?

Ebbesen and Wixted (1996) report that confidence-accuracy correlations are much higher (between .35 and .7) when they are based on differences between faces averaged over witnesses (even holding learning and test conditions constant). They also found that for over 90% of their subjects, the within-subject, between-face, confidence-accuracy correlations were positive (i.e., for the large majority of subjects, higher confidence was consistently associated with a greater probability that identification responses were correct) although the absolute average size of the correlations was between .2 and .25. Because different methods of aggregating the same raw data generate different results regarding the size of the correlation between confidence and accuracy, it is important that we ask which method(s) supply the most appropriate estimates when generalizing to the real world. Should we focus on variation produced exclusively by individual differences in reactions to a constant criminal event? Alternatively, should we focus on individual differences based on averages over different culprits and events? Would it make more sense to focus exclusively on culprit-face differences (averaged over many different witnesses and/or criminal events)? Should we focus instead on culprit-face differences within each witness? Alternatively, we might focus on differences in learning and/or test conditions (averaged over witnesses but for only one culprit) or combinations of several of these (and others) all at once.

 

Do we want to know whether more confident witnesses to the same crime who saw the culprit at the same visual angle for the same period of time are also more accurate? Or, do we want to know whether the confidence that a witness has in her memory for one thing predicts the odds that the recollection of that thing compared to other things will be accurate? Or, do we want to know whether confidence estimates supplied by different witnesses who saw different culprits under varying conditions are predictive of their accuracy? Or, do we want to know whether some criminals for whom typical witnesses feel more confident are the criminals that typical witnesses will tend to remember correctly? Clearly, these are different generalizations. Unfortunately, the differences have not been adequately discussed in the context of the decision problems faced by people in the legal system.

 

Deciding on the appropriate source of variation to recommend is complicated by the fact that law and members of the legal system generally see every case as different (Konecni & Ebbesen, 1982). Nevertheless, the legal system speaks in terms of "odds." For example, terms such as, "more probably than not," are used when discussing jury decision standards. Prosecutors ask witnesses about, and witnesses are willing to speak about, percentages; as in, "I am 90% certain." If every case is truly different, then such statements are meaningless because odds and percentages depend on multiple examples of similar events. When a witness says that she is 90% certain, what events make up the numerator and the denominator of the percentage? Which of the following is she saying, a) 90 out of 100 times when my memory is this strong I will be correct, b) 90 people out of 100 who saw what I saw would be correct, c) I would be able to identify correctly 90 out of 100 criminals who looked as distinctive as the criminal that I saw, d) 90 people out of 100 would correctly identify a criminal with his features, e) 90 out of 100 times that I see people under the conditions that I saw this person, I would be able to recognize them, and so on? Clearly, if witnesses are telling us the odds that equally strong, but different, memories of past people and events will be correct and researchers are attempting to tell us the meaning of these claims by examining the size of correlations based on individual-differences in single-event, multiple-witness studies, the researchers' conclusions are based on the wrong kind of evidence. Stated differently, we should base our conclusions about relationships on data that match the kind of information that jurors and the rest of the legal system really want to know.

Odds, probabilities, and certainties are not the same as correlations

What information do actors in the legal system want to know when deciding whether to file charges or reach a guilty verdict? Do these decision-makers care about the strength of the relationship between confidence and accuracy or do they care about the odds that a suspect is the guilty culprit? Although probably not articulated in the same language as a statistician, it seems reasonable that most actors would focus on odds and not the relationship. The reasoning of key actors might be something like: if witnesses are very confident in their identifications, then the odds that their identifications are accurate should be high, and therefore the odds that the suspect is the guilty culprit should be high. This reasoning says nothing about how accuracy should change as the level of confidence changes. In fact, it is possible for the conditional probability that suspects are guilty given that witnesses express high confidence to be high even though changes in accuracy are weakly related or even unrelated to changes in confidence. If all levels of confidence were associated with high accuracy, then no relationship between confidence and accuracy would exist but the probability that identifications were accurate given high (as well as low) confidence could be close to one.[10]
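
A minimal numerical sketch of this last point follows. The data are hypothetical: 20 witnesses at each of five confidence levels, 19 of whom are correct at every level. The correlation between confidence and accuracy is then zero, yet the conditional probability of a correct identification given the highest confidence is .95.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical records: (confidence level 1-5, correct 0/1).
# Accuracy is 19/20 at every confidence level.
records = []
for level in range(1, 6):
    records += [(level, 1)] * 19 + [(level, 0)] * 1

confidence = [c for c, _ in records]
accuracy = [a for _, a in records]

# Strength of the relationship between confidence and accuracy.
r = pearson(confidence, accuracy)

# Conditional probability of a correct identification given top confidence.
high = [a for c, a in records if c == 5]
p_correct_given_high = sum(high) / len(high)

print(r, p_correct_given_high)
```

The correlation answers the question "does accuracy change with confidence?" while the conditional probability answers the question "how often is a confident witness right?" The example shows that the two can come apart completely.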

 

One aspect of evidence that decision-makers might be expected to use in estimating the pool of suitable matching suspects is identification by a witness. How many innocent people look similar enough to the culprit (who also match whatever other evidence is available, if any is available) for the witness to identify them as the person they saw? That is, what are the odds that the police have arrested an innocent individual who looks enough like the culprit that a witness would be willing to identify him?

 

Lineup diagnosticity is one measure that some (e.g., Wells & Lindsay, 1980) have suggested should be used to assess the ability of witnesses to indicate accurately who the culprit is when shown a lineup. This measure compares the rate at which subjects falsely identify "innocent" suspects in target absent lineups to the rate of correct choices of the "guilty" target in target present lineups (e.g., Wells & Lindsay, 1980, Wells & Luus, 1990). The higher this ratio, the more diagnostic the lineup is thought to be. Of course, this measure can only be computed in experimental studies (with a known culprit) that use single-event/culprit, multiple witness paradigms in which different subjects are shown the same target present or target absent lineup. This is because lineup diagnosticity would be expected to be different for different culprits, foils, and suspects.

 

As Navon (1990) correctly noted, given the decision problem facing police, prosecutors, and jurors, lineup diagnosticity is not the measure of diagnosticity on which the real world should focus its attention. This is because lineup diagnosticity depends so much on how the experimenter selects the innocent suspect for the target absent lineup as well as the match between what the target looked like during the event and what he or she looked like in the lineup (photo). It seems obvious that the more the innocent suspect looks like the culprit, the higher the false alarm rate will be (assuming that the witnesses remember something about the culprit's looks). In addition, the more a culprit's appearance changes from the event to the lineup, the lower the correct identification rate will be. Likewise, the more the lineup is constructed in a manner so that the innocent suspect "stands out," the higher the false alarm rate will be. Thus, it should be possible in a laboratory experiment to control the relative rates of correct to false identifications -- lineup diagnosticity -- by varying the similarity relationships between the actual culprit and pictures used for the target, suspect, and foils (e.g., Luus & Wells, 1991). This raises the possibility that every real world lineup will have a different diagnosticity depending on such details. Unfortunately, such details cannot be measured in any particular lineup because they depend, in part, on the match between the suspect's and culprit's looks (assuming that they are not one and the same). Obviously, the guilty culprit's looks are generally not known if an innocent suspect is being charged.

 

On the other hand, as the ecological likelihood increases, the odds that the suspect is the guilty culprit increase. As a result, the odds that the lineup shown to witnesses is a target absent lineup go down. As the odds that the lineup is a target absent lineup go down, the likelihood that suspect choices are correct goes up. This is true even if lineup diagnosticity is low.

 

Lineup diagnosticity is measured in terms of the ratio of two ratios:

(# "guilty" target choices)/(# of target present lineups)

divided by

(# "innocent" suspect choices)/(# of target absent lineups)

 

Compare the following two situations. In the first, 100 witnesses are shown a target present lineup and 50 witnesses pick the target, while another 100 witnesses are shown a target absent lineup and 50 pick the suspect. In the second, 100 witnesses are shown a target present lineup and 50 witnesses pick the target, while another 10 witnesses are shown a target absent lineup and 5 pick the suspect. In each case the diagnosticity ratio is .5/.5 or 1. However, if we ask about the odds that witnesses who choose someone are correct, the odds are 50/50 or 1 to 1 in the first case and 50/5 or 10 to 1 in the second case. In short, in actual cases, the odds that witness identifications are correct depend heavily on the ecological likelihood that the lineup contains the guilty culprit as opposed to an innocent suspect.
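
The arithmetic of the two situations can be checked with a short sketch. The function and variable names are hypothetical labels introduced here for illustration. The last step treats the ratio of target present to target absent lineups as the prior odds, which is one way (an assumption of this sketch) of expressing the ecological likelihood numerically.

```python
def diagnosticity(hits, n_present, false_ids, n_absent):
    """Ratio of the correct-identification rate to the false-identification rate."""
    return (hits / n_present) / (false_ids / n_absent)

def odds_correct(hits, false_ids):
    """Odds that a witness who picked the suspect picked the guilty culprit."""
    return hits / false_ids

# The two situations described in the text.
case_1 = dict(hits=50, n_present=100, false_ids=50, n_absent=100)
case_2 = dict(hits=50, n_present=100, false_ids=5, n_absent=10)

for name, case in (("first", case_1), ("second", case_2)):
    d = diagnosticity(**case)
    o = odds_correct(case["hits"], case["false_ids"])
    # Prior odds that a lineup contains the culprit (target present : absent).
    prior = case["n_present"] / case["n_absent"]
    # Odds form of Bayes' rule: posterior odds = prior odds x diagnosticity.
    print(name, "diagnosticity:", d, "odds correct:", o, "prior x diag:", prior * d)
```

Diagnosticity is identical in the two situations, so the tenfold difference in the odds of a correct identification comes entirely from the difference in the prior odds.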

Features of a case linked to the suspect that will increase ecological likelihood are, by definition, "distinctive" or unlikely to be associated with a large percentage of the population of potential suspects. Thus, the facts that the culprit drove off in a car and that the suspect owns a car do not add to the ecological likelihood that the suspect is the culprit because these facts can apply to so many potential suspects, i.e., owning a car is not distinctive. On the other hand, the fact that the "get-away" car had a pink lightning bolt painted on its hood and that the suspect owns a car with a pink lightning bolt on its hood does add to the ecological likelihood that this is the guilty suspect (although how much might depend on other factors, e.g., did the suspect report that the car was stolen before the crime was committed, did the suspect lend the car to someone on the day the crime was committed, and so on).

One set of features that might affect ecological likelihood is that associated with the "looks" of the suspect/culprit. For example, suppose a witness recalls that the culprit had an unusual tattoo on his neck or a very prominent scar on his cheek or that he was cross-eyed. Since such features would reduce the set of possible suspects to a very small number, they should add to the ecological likelihood that a suspect who has such a feature is the culprit. On the other hand, when witnesses are asked to identify whether the suspect is the culprit from a lineup, an issue arises about how best to deal with such distinctive features. Many researchers (e.g., NIJ, 1999; Wells, et al., 1998) argue that creating a lineup in which the suspect is the only member with the distinctive feature decreases the diagnosticity of the lineup (because a target absent lineup in which the innocent suspect is the only person with the distinctive feature will produce a high rate of false alarms). After all, if a witness recalls that the culprit had crossed eyes and the only individual in the lineup with crossed eyes is the suspect, it seems reasonable that the witness would not even consider choosing any of the foils. As a result, the witness would be looking at a lineup with a functional size (Lindsay, Smith, & Pryke, 1999; Wells, Leippe, & Ostrom, 1979) of one instead of near six. If witnesses tend to use a "relative decision" strategy when picking from a simultaneous lineup (e.g., Lindsay, Pozzulo, Craig, & Lee, 1997; Lindsay & Wells, 1985; Pozzulo & Lindsay, 1999), the one picture they will be most likely to pick should be the innocent suspect. Two strategies have been suggested to correct for this problem (NIJ, 1999). Either the lineup should be constructed in a manner in which all of its members have the distinctive feature, e.g., crossed eyes, or the distinctive feature should be hidden from the witness, say by having all of the members of the lineup wear a patch over one of their eyes. In this way, none of the members of the lineup will "stand out" from the remaining members.

We can look at this problem from a different point of view, however. Consider the example of the car with a pink lightning bolt. Imagine that the police find a car with a pink lightning bolt painted on its hood and ask the witness to identify the car. Would we require that the witness pick from a lineup of six cars in which the distinctive feature (pink lightning bolt) was hidden from view, say by repainting the hoods of all of the cars with black paint or by painting pink lightning bolts on the hoods of the five known "innocent" cars? We know of no researchers who have suggested that this is the way in which witness testimony about "objects" should be collected. One reason might be that such procedures seem unnecessary.

But why would such procedures seem unnecessary in the case of objects but not in the case of people? After all, witnesses might be identifying the wrong car because they are recalling the pink lightning bolt and not the entire car. Surely we would want the witness's identification of the car to be based on recognition of the "entire" car. On the other hand, how can we expect a witness to recognize every aspect of the car? Can we expect the witness to recall the pattern of the scratch marks on the passenger's front door or the tiny crack in the plastic cover of the left rear blinker? Isn't it enough that the witness identifies the car? In part the answer might have something to do with real and perceived ecological likelihoods and assumptions about how witnesses make identifications.

 

With regard to the former, it might seem very unlikely that the police found the wrong car with a pink lightning bolt on it (because it seems obvious that very few such cars exist) and as a result we infer that the odds that the car being shown to a witness is "innocent" are extremely low. As a result, the odds that a positive identification is correct are very high. In the case of an identification of a cross-eyed suspect, however, one might feel that the likelihood that the police found the wrong cross-eyed suspect (because there are so many cross-eyed individuals -- at least many more than cars with pink lightning bolts painted on their hoods) is considerably higher. As a result, the odds that a witness's identification of a cross-eyed individual is correct seem much lower.

 

Of course, the feeling that special procedures are required to protect the accused from false identifications might be based not on prior expectations about the odds that lineups contain innocent suspects but rather on the belief that eyewitnesses are less likely to reject innocent suspects than "innocent" objects because face recognition depends more heavily on distinctive features than "object" recognition. That is, one might assume that witnesses fail to consider other features of faces besides the distinctive one(s) when deciding whether a face is the culprit's. Such a process could reflect the way in which faces are initially encoded (Main, Leland, & Bartlett, 1998) or the way in which decisions are made during the identification task (e.g., the presence of a remembered distinctive feature is sufficient evidence to identify). While there is considerable evidence that distinctive faces (those rated as more distinctive) are better recognized than those that are less distinctive (Shapiro & Penrod, 1986; Webster, Leland, & Bartlett, 1997; Wickham, Morris, & Fritz, 2000), the role that particular distinctive features, e.g., crossed eyes, play in identification accuracy in lineups has not been well studied. It is not known, for example, whether the presence of such features will increase false alarm rates faster than hit rates. In addition, we do not currently know the effect on the relative rate of hits compared to false alarms of hiding the feature or of adding foils with similar distinctive features.

 

The problem for the real world decision-maker is estimating the number of people who match the evidence (e.g., suspect seen driving a similar car, gun found in suspect's home, etc.) and who look enough like the culprit that a witness would be willing to say, "That's him." Whether a suspect is similar enough for a witness to identify him as the culprit depends on several different mechanisms. The first is the distribution of facial and other characteristics over the population. How many people look similar enough to a randomly sampled individual that other people might confuse them? A second mechanism consists of the process by which the suspect was selected by the police. If the process leading to a suspect's arrest depends on how much the suspect looks like the culprit, then the odds, based on random sampling, that an innocent suspect would look like the culprit will be too small. For example, suppose that a witness is asked to help a police artist draw a sketch of the culprit. Suppose further that after this picture is made widely available to the public, someone calls the police and tells them where they can find a person who this informant believes looks a lot like the sketch. The police then construct a lineup with this person as the suspect. Clearly, the odds are a lot higher that the witness would be willing to say that this suspect is the culprit than if a suspect had been selected simply because he or she owned a car fitting a description of that used in the crime. A third set of mechanisms determines the strength of the witness's memory for the culprit. Although the relevant research has yet to be done, it is possible that the stronger witness memory is for culprits, the less likely witnesses will be to choose innocent suspects who look very similar to the culprit. A fourth set of mechanisms consists of the variables that control where a witness places his or her decision criterion. 
How good does the match between the witness's memory and the suspect (or suspect's picture) have to be before the witness is willing to say, "That's him"? The fewer people a witness would be willing to identify as the culprit, the greater the odds the suspect is the culprit.

 

We can restate the issue of the witness's criterion in terms of resemblance. The odds that the suspect is the guilty culprit should be higher the greater the resemblance of the suspect to the witness's memory of the culprit. The stricter the resemblance criterion, the fewer people in the world one would expect to satisfy that degree of resemblance. If one views confidence as a statement of the degree of resemblance between the contents of memory and the looks of the suspect/culprit, then given the above reasoning, it would make sense to assume that the odds that a witness's identification is correct increase with increasing confidence. This is similar to the view that Ebbesen and Wixted (1996) proposed in their signal detection analysis of face memory. However, the fact that the odds that identifications are correct increase with increasing confidence is not identical to saying that confidence and accuracy will be highly correlated.

 

The fact that multiple and different mechanisms determine the evidentiary value of a witness's identification requires that information other than correlation coefficients and diagnosticity be obtained to evaluate the odds that a witness's identification might be correct. In particular, decision makers should want to know whether the odds that the suspect's looks match the culprit's are higher than would be expected by random sampling. This is quite a different issue than the degree to which the witness's memory of the culprit matches the culprit's looks. Whether the correlation between confidence and accuracy is big enough provides no information about these issues.

 

What are the odds that the police have arrested and charged an innocent suspect who "stands out" enough because of the method of lineup construction and who looks enough like what a witness remembers that the witness would be willing to pick this suspect and then indicate that she was very confident in that choice? When one phrases the issue this way, it seems clear that the data and analyses from which conclusions about the relationship between confidence and accuracy have been drawn are simply irrelevant.

Confidence v. Other factors: Results in most eyewitness memory studies are not reported correctly

It should be obvious that the legal system either will choose not to pursue cases in which the witnesses express low confidence in their identifications or (according to critics) will tend to cause witnesses with low confidence to raise their estimates before they testify. One need simply imagine a trial in which an ID witness testifies that the defendant is the person who raped her but then says that she is just guessing about the identification to realize that the legal system is going to select cases, at least in part, on the basis of witness confidence. Thus, in the huge majority of real world cases, jurors will be faced with witnesses who are confident. However, researchers who study the effects of various factors (e.g., stress, duration, lineup procedure, race, instructions) on eyewitness identification continually fail to report their findings conditional on the confidence of their subjects.

 

Consider, for example, a study of the effects that duration of exposure might have on eyewitness accuracy. Surely, with some diminishing returns, longer durations of exposure will produce more accurate identifications than very short durations. As already discussed, many eyewitness memory researchers would also ask the subjects how confident they are in their identifications. The results of duration (or stress, retention interval, instructions, and so on) are then presented in terms of some measure of average accuracy for each duration of exposure. The results for the confidence-accuracy relation are then presented. Thus, researchers tend to include all of the responses, including those for which the subjects might have said they were just guessing, in the results for the factor effects. For example, suppose the design compared the accuracy for a short duration of exposure, say, .5 seconds, to a long one, say, 30 seconds. The researcher would compare all of the responses for the .5-second exposure with all of the responses for the 30-second exposure even though the subjects in the .5-second condition might have said they were just guessing 80% of the time while those in the 30-second condition might have guessed only 10% of the time. Thus, the differences in average performance between the short and long duration conditions would consist of different proportions of "guessers." This is a problem because the legal system almost surely doesn't use identifications of witnesses who say they were just guessing. To generalize the effect of the different durations, researchers would have to select out those witnesses whose confidence was high enough and then examine the difference between short and long durations just for them. It might well be that the size of the duration effect will be a lot smaller when the memories of only the most confident witnesses are examined.[11]
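
The following sketch works through the .5-second v. 30-second example under two strong assumptions that are not taken from any study: witnesses who say they are guessing are correct at the chance rate for a six-person lineup (1/6), and witnesses who are not guessing are correct 90% of the time regardless of duration. The second assumption is the extreme case in which duration has no effect at all among confident witnesses.

```python
from fractions import Fraction

# Assumptions made only for this illustration.
P_CORRECT_IF_GUESSING = Fraction(1, 6)    # chance in a six-person lineup
P_CORRECT_IF_CONFIDENT = Fraction(9, 10)  # same in both duration conditions

def overall_accuracy(p_guessing):
    """Average accuracy when guessers and non-guessers are pooled."""
    return (p_guessing * P_CORRECT_IF_GUESSING
            + (1 - p_guessing) * P_CORRECT_IF_CONFIDENT)

# Proportions of "just guessing" responses given in the text.
short_accuracy = overall_accuracy(Fraction(8, 10))  # .5-second exposure
long_accuracy = overall_accuracy(Fraction(1, 10))   # 30-second exposure

# Duration effect as it would usually be reported (all responses pooled).
pooled_effect = long_accuracy - short_accuracy

# Duration effect among non-guessers only, under the assumptions above.
confident_effect = P_CORRECT_IF_CONFIDENT - P_CORRECT_IF_CONFIDENT

print(float(short_accuracy), float(long_accuracy))
print(float(pooled_effect), float(confident_effect))
```

Under these assumptions the pooled comparison shows a difference of about 51 percentage points between durations, all of which is produced by the different proportions of guessers. Real data would presumably fall somewhere between this extreme and the pooled result, which is exactly why the conditional results need to be reported.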

 

When researchers (e.g., Wells, et al., 1998) claim that jurors might be better off focusing on factors other than confidence, they are basing this suggestion on their belief that other factors predict witness accuracy "better" than confidence. However, researchers have never actually examined the relative predictive accuracy of confidence compared to other factors, especially in relevant applied settings. Research results are not presented in terms of the ability of these different measures (e.g., confidence and duration of exposure) to predict eyewitness accuracy. One reason might be that it is not obvious how this should be done. Although we can use standard statistical procedures (e.g., general linear modeling), non-statistical features of the experimental procedure and design complicate the interpretation of results. For example, suppose one researcher designed our hypothetical duration of exposure study with durations of 1 and 2 seconds and another designed it with durations of 1 second and 10 minutes. We would intuitively expect that the effect of study duration would be much smaller in the former than in the latter study. Suppose both researchers also measured confidence. The first researcher might discover that individual differences in confidence predicted witness accuracy "better" than the small difference in duration that he created while the second might discover the opposite. Thus, the relative predictive accuracy of confidence v. other factors will depend heavily on the range of levels of the factor being varied (or observed).

 

This problem is similar to that encountered when single-event, multiple witness confidence-accuracy correlations are used to generalize to the multiple-event, single-witness real world of crimes. In order to generalize "effect sizes" from the laboratory to the field, one has to be sure that the range and source of variation in the laboratory of the variables of interest are the same as the range and source of variation that occurs in the settings to which one hopes to generalize. In addition, it is not clear how to assess "better" when one predictor depends on individual differences (e.g., confidence) that might be more or less reliably measured and the other depends on differences between situations (e.g., duration) that are held at fixed levels with perfect reliability in laboratory studies. But then when the general "better predictor" principle is applied to the real world, duration is no longer measured with perfect reliability and depends completely on estimated values whose relationship to actual values is completely unknown.

 

The problem is exacerbated by the possibility that confidence and the diagnostically better factors that jurors are supposed to use are not independent of each other. For example, we can be pretty sure that average confidence in memory would be higher in the 10-minute condition than the 1-second condition described earlier. This means that it is quite possible that some, if not all, of the effect of a factor on memory, might be mirrored by an effect on confidence. If so, confidence might actually account for some of the effect of factors on memory. This is exactly what one might expect from a signal detection analysis (Ebbesen and Wixted, 1996).

 

Assuming that the prior reasoning is correct, it raises the possibility that confidence may well be an "excellent" predictor of eyewitness accuracy when it is allowed to capture sources of variance that would be typical in the real world, e.g., differences in criminals' faces, differences in duration of exposure, differences in attention paid, and so on.

 

Looking at this same issue from the side of the factor rather than confidence casts the problem in a slightly different light. Researchers almost never present their results in a way that shows the effect of duration (or any other factor) on those witnesses who expressed the highest confidence compared to those who expressed the least. The reason that this is so important is that it is possible that the effect might be much bigger for those with the lowest confidence and substantially reduced or eliminated for those who are very confident. For example, Kebbell, Wagstaff, and Covey (1996) reported that witnesses who were absolutely confident about a recollection were almost invariably accurate. It is conceivable, therefore, that other factors might account for very little variation in the accuracy of those responses in which witnesses are absolutely confident. If this were the case, the conclusion that long durations might produce many fewer errors than short ones would primarily apply to witnesses who expressed lower confidence levels. Since the legal system will probably be more likely to eliminate witnesses as their confidence decreases, the effect that other factors, such as duration, have on accuracy in laboratory studies would tell us little about the more confident witnesses who play a role in the real world.

 

Why might the effects of situational variables disappear for highly confident witnesses? If confidence does reflect the strength of witnesses' memories for events that they have experienced, then as contextual events increase witness accuracy, the same events might also increase witness confidence. Thus, the effect of the factor on accuracy might be mirrored by an equivalent effect on confidence. If this were to happen, we wouldn't need to know the conditions that existed when a witness was exposed to the criminal event. Knowing confidence would tell us the end result, namely the strength of the witness's memory, regardless of how that memory strength was produced.

 

In any case, as noted earlier, the issue is not whether duration of exposure has an effect on accuracy. The issue facing those who have to make the guilty/not guilty decisions is about the probability of guilt that is associated with the particular pattern of evidence presented in the case. A given witness saw the culprit for a given duration of exposure. The jury's job is to estimate the guilt of the accused given that the witness confidently identified the defendant after observing him for a particular duration of exposure. The jury's job is not to determine the nature of the relationship between duration and accuracy. In this context, researchers should be reporting their results in terms of the conditional probability of an accurate identification, given that the witness is highly confident and the duration of exposure is at a particular value. The applied issue should be the stability of such conditional probabilities.
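
One way such a report might be tabulated is sketched below. The records are invented for illustration, the two-level "high"/"low" confidence coding is an assumption of the sketch, and `conditional_accuracy` is a hypothetical helper name.

```python
from collections import defaultdict

def conditional_accuracy(records):
    """Return P(correct | duration, confidence) for every observed cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [number correct, total]
    for duration, confidence, correct in records:
        cell = counts[(duration, confidence)]
        cell[0] += correct
        cell[1] += 1
    return {key: n_correct / n for key, (n_correct, n) in counts.items()}

# Hypothetical records: (duration in seconds, confidence label, correct 0/1).
records = (
    [(0.5, "high", 1)] * 9 + [(0.5, "high", 0)] * 1
    + [(0.5, "low", 1)] * 8 + [(0.5, "low", 0)] * 32
    + [(30, "high", 1)] * 36 + [(30, "high", 0)] * 4
    + [(30, "low", 1)] * 4 + [(30, "low", 0)] * 6
)

table = conditional_accuracy(records)
for (duration, confidence), p in sorted(table.items()):
    print(f"duration={duration}s confidence={confidence}: P(correct)={p:.2f}")
```

A table of this kind speaks directly to the juror's question (how often is a highly confident witness right after an exposure of this length?), and the stability of its cells across studies is the applied issue raised above.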

But research shows that mock jurors make poor decisions because they rely on confidence

A number of studies using slightly different methods have attempted to test, directly, whether mock jurors who are presented with witnesses who made correct or incorrect identifications can accurately determine which are which. During an initial phase in these studies, witnesses are shown a simulated crime. All of the witnesses are then asked to "testify." During their testimony they are shown a lineup and asked to choose the culprit as well as to supply an estimate of their confidence in the identification. A second group of subjects, mock jurors, are then presented with evidence and shown the testimony of one of the witnesses. Half of the subjects see the testimony of a witness who accurately chose the culprit in the lineup and half see a witness who chose incorrectly. The subjects are then asked to indicate whether the culprit whom the witness identified is guilty and/or to rate the witness's accuracy.

 

Studies such as these (e.g., Lindsay, Wells, & O'Connor, 1989) have reported that mock jurors tend to respond to the witnesses' confidence estimates and not to whether the witnesses had correctly chosen the culprit. That is, jurors tend to believe the accurate witness at about the same rate as the inaccurate witness. Furthermore, because they tended to use confidence, the mock jurors tended to ignore the other information presented in the simulated trial, information that other studies have shown significantly affects witness accuracy, e.g., stress or retention interval.

 

Unfortunately, the logic of this research is seriously flawed and therefore the conclusions frequently drawn from it are not justified. The logic is as follows: Research has shown that confidence is not, or only very weakly, related to accuracy. Research has also shown that factor x (insert your favorite set here, e.g., stress, other-race, weapon focus, retention interval, post-event memory influences, etc.) consistently affects the accuracy of eyewitness identification (i.e., variations in one or more of these produce significant mean differences in one or more measures of witness accuracy). When different jurors see witnesses who express different levels of confidence and hear case evidence that includes a description of the level of one or more of these factors present during the "crime," jurors judge the witnesses' accuracy on the basis of their confidence rather than on the basis of the level of the factor(s). As a result, they are frequently wrong.

 

The logic is faulty for several reasons. First, the logic assumes that confidence is not diagnostic of accuracy. We have shown above how this conclusion does not accurately reflect the nature of the different relationships between confidence and accuracy. Therefore, the initial premise is wrong.

 

Second, the researcher is comparing apples and oranges when comparing the relationship between the level of confidence and accuracy and that between the level of one or more eyewitness factors and accuracy. The former is almost always based on individual differences and the latter almost always on situational differences. We can attempt to compare apples with apples by asking whether the same situational variables that produce differences in average accuracy also produce similar differences in average confidence. In those studies that have reported average confidence, the answer is generally, "yes." For example, we know that as duration of exposure goes from a few seconds to 60 seconds, the average accuracy of subjects increases. In addition, the average confidence that subjects express in their identifications also increases (e.g., Ebbesen & Wixted, 1996). Thus, there is a natural tendency for average accuracy and average confidence to be correlated over learning conditions. As a result, using confidence to estimate the "strength" of the largely hidden learning conditions in these studies might be a very rational and generally accurate strategy in the real world.

 

Third, if we look at the task from the point of view of the mock jurors, we discover that their task is particularly difficult and different from that of most real jurors. The mock jurors are given evidence in the mock trial that we know, because of the experimental design, is unrelated to witness accuracy. That is, all witnesses saw the same "criminal" event and the same culprit for the same amount of time under the same learning conditions. Some witnesses then picked the wrong person from a lineup and others picked the right person. What information could the jurors possibly use to detect which witnesses picked the culprit and which picked someone else? Surely detailed information about the learning conditions that both correct and incorrect witnesses experienced cannot provide jurors with information about which witness will be correct. After all, the evidence is identical in all cases because all witnesses saw the same crime under the same conditions. The only information that would be available would be individual differences in how the witnesses behaved during their testimony. Is this representative of actual cases? A moment's reflection will indicate that jurors are generally presented with evidence that co-occurs (over trials) with witness testimony. That is, jurors hear about alibis, about how the culprit was arrested, about other physical evidence, as well as from other witnesses who might present corroborating information. In this context, the jurors also hear the witness identify the culprit with whatever confidence they express. Finally, jurors also hear witnesses report on the conditions of observation that they experienced, e.g., where they looked, how far away they were, what they were feeling at the time, and so on. In the typical experiment, the "other" case evidence is held constant. Thus, the only potentially predictive information available to the mock jurors is witness behavior. The opportunity for jurors to rely on what most researchers believe is diagnostically better situational information is never made available.

 

An adequate test of whether jurors appropriately weight confidence compared to other information requires that we compare the ability of witness confidence versus situational factors to predict actual guilt with the subjective "weight" that jurors give to confidence versus situational factors when they decide guilt. These comparisons have never been performed because we do not know how well witnesses' confidence estimates versus other factors predict actual guilt.

 

Finally, some of the research on this topic uses a slightly different methodology (e.g., Cutler, Penrod, & Stuve, 1988) that is common to most of the research that attempts to determine the relative weights that decision-makers (e.g., mock jurors) give to different sources of information. Unfortunately, results from this methodology are also extremely difficult to interpret. This research methodology typically varies the different sources of information (in a factorial design) and then examines the effect of those variations on decision-making.[12] In general, relative weights are then inferred from the relative sizes of the effects of the two factors. Researchers assume that the jurors give greater relative weight to the factor that produced the bigger effect (or accounted for more of the variance). This kind of research is so difficult to interpret because relative effect size is so dependent on the range of variation in the factors. Assume, for the moment, that jurors weight two factors equally and that an experiment varies both of them in a factorial design. If the range of variation in one factor is small and the range in the other is large (by whatever measures one chooses), then it is likely that the factor with the greater range of variation will produce the bigger effect. For example, imagine that mock jurors either see a witness who is very confident or one who admits that she just guessed. Imagine further that half of the jurors in each of these groups are told that the witness saw the culprit for 5 seconds and the other half are told 10 seconds. We might expect the confidence manipulation to have a bigger effect. But suppose that another experiment is done in which half of the jurors are told that the witness saw the culprit for 1 second and the other half are told 10 minutes. Now we might expect the effect of duration to be much bigger. Thus, the weight inference requires that the subjective differences between the levels within one factor be equivalent to those for the other factors to which it is being compared. Unless this equality in size of the manipulations is demonstrated, the issue of weight cannot be unambiguously determined.
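The range argument can be illustrated with a small sketch. It assumes a hypothetical juror who weights confidence and duration exactly equally in an additive fashion, and it assumes (purely for illustration) that the subjective value of duration is the base-10 logarithm of seconds and that "guessed" and "very confident" sit at 0 and 1 on the same subjective scale. None of these scale values come from the studies discussed here.

```python
import math
from statistics import mean

def main_effects(confidence_levels, duration_seconds, w_conf=1.0, w_dur=1.0):
    """Main effects of each factor in a 2 x 2 design for an additive juror.

    The hypothetical juror's rating is w_conf * confidence + w_dur * duration,
    where duration enters as log10(seconds) (an assumed subjective scale).
    """
    low_c, high_c = confidence_levels
    short, long_ = (math.log10(s) for s in duration_seconds)

    def rating(c, d):
        return w_conf * c + w_dur * d

    # Difference between marginal means for each factor.
    conf_effect = (mean([rating(high_c, short), rating(high_c, long_)])
                   - mean([rating(low_c, short), rating(low_c, long_)]))
    dur_effect = (mean([rating(low_c, long_), rating(high_c, long_)])
                  - mean([rating(low_c, short), rating(high_c, short)]))
    return conf_effect, dur_effect

# Same equal weights in both experiments; only the duration range differs.
narrow = main_effects((0.0, 1.0), (5, 10))    # 5 seconds versus 10 seconds
wide = main_effects((0.0, 1.0), (1, 600))     # 1 second versus 10 minutes

print("narrow range: confidence effect %.2f, duration effect %.2f" % narrow)
print("wide range:   confidence effect %.2f, duration effect %.2f" % wide)
```

In this sketch the weights never change, yet the ordering of the two effect sizes reverses between the two experiments, which is why relative effect size alone cannot reveal relative weight.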

 

Unfortunately, the studies that have examined the relative weight that mock jurors give to confidence compared to other factors suffer from one or more of the interpretational and design problems discussed here.

Confidence malleability and independence of confidence and accuracy

Several researchers (e.g., Wells & Bradfield, 1998) have suggested that confidence and accuracy are independent because confidence can be changed without accuracy also changing. For example, some researchers have concluded that repeated questioning, learning that others agree with one's recollections, and learning that one's recollections are "correct," tend to increase confidence without also increasing accuracy (Luus & Wells, 1994; Shaw, 1996; Wells & Bradfield, 1998; Wells et al., 1998). Unfortunately, to claim that the mechanisms that control accuracy and confidence are independent requires additional evidence. To understand why requires that we consider the various ways in which two response production systems (confidence ratings and recognition responses) might be related. Figure 4 shows a representation of the simplest model of complete dependence between confidence and accuracy. It suggests that all of the variables that control accuracy have their effects on the same processes that control confidence. In this way, whenever a variable changes accuracy, it will also produce changes in confidence.


 


Figure 4

 

It is not necessary to assume that the same single mechanism controls both confidence and accuracy to have a system in which the two responses will show dependences, however. Consider Figure 5. This figure shows a system in which confidence is controlled by one mechanism and accuracy by another but because all of the variables that affect one mechanism also simultaneously affect the other, variations in the two response systems will always co-occur, although the exact form of the covariation would depend on the nature of the two mechanisms.


Figure 5

 


It is also possible to construct models in which the confidence and accuracy mechanisms are related in time. Figure 6 shows such an example. In this case all input first affects the accuracy mechanism and then output from that process affects the confidence process. Clearly this system would cause accuracy and confidence to co-vary as inputs of different types were varied.


Figure 6

 


To break the co-variation between two response systems requires that not all variables affect both response systems as depicted in Figures 4 through 6. However, the fact that one set of variables affects one response system but not the other does not necessarily mean that the two systems are independent. Consider the model in Figure 7. In this model, variation in input A will cause accuracy and confidence to co-vary in the same manner as in the model depicted in Figure 6 because the output of the accuracy process serves as an input to the confidence process. Of course, variation in input C will only affect the level of confidence and will have no effect on accuracy. Thus, it is possible for accuracy to be related to confidence in that variables that affect accuracy (input A) also affect confidence, but for the level of confidence to be controlled by other variables (input C) as well.       


Figure 7
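The arrangement described for Figure 7 can be sketched as a toy simulation. The functional forms below (a saturating strength function, confidence on a 0 to 100 scale, an additive contribution from input C) are assumptions chosen only to make the structure visible; they are not estimates of any real process.

```python
def memory_strength(input_a):
    """Accuracy mechanism: strength grows with input A (e.g., exposure in seconds)."""
    return input_a / (input_a + 10.0)  # hypothetical saturating function

def witness_responses(input_a, input_c):
    """Return (accuracy, confidence) for a Figure 7 style arrangement."""
    strength = memory_strength(input_a)
    accuracy = strength                       # driven by input A only
    confidence = 100.0 * strength + input_c   # accuracy output plus input C
    return accuracy, confidence

short_neutral = witness_responses(input_a=5.0, input_c=0.0)
long_neutral = witness_responses(input_a=60.0, input_c=0.0)
long_feedback = witness_responses(input_a=60.0, input_c=10.0)

# Varying input A moves accuracy and confidence together.
print("short exposure:", short_neutral)
print("long exposure: ", long_neutral)
# Varying input C moves confidence while accuracy stays where it was.
print("long exposure plus input C:", long_feedback)
```

In this sketch a manipulation of input C changes confidence without changing accuracy, yet confidence still tracks accuracy whenever input A varies, so the two responses are not independent.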

 


After all, confidence is simply a verbal self-rating. As a self-rating it should be subject to all of the same instructional and motivational factors that affect all self-ratings (Eiser, 1990). For example, because confidence is a self-report, instructions in how to use the confidence scale will almost surely affect witnesses' average measured confidence (e.g., mean confidence ratings) without affecting the accuracy of their memories. Imagine that we told some subjects that they were to use the "absolute confidence" self-description whenever they felt that there was a 65% chance that their identification response was correct. Imagine that we told others to use the same self-description only when they felt that there was a 99% chance that their identification response was correct. It seems likely that the former would have a much higher average (over several identification attempts) confidence score than the latter, all other things equal. It also seems likely that these instructions would have no effect on the accuracy of memories.

 

The fact that confidence ratings can be manipulated independently of accuracy does not mean that accuracy and confidence will not tend to be related nor that confidence is not likely to be a very good predictor of the accuracy of a witness's different memories. For example, while it is true that instructions might raise the average confidence for all witnesses who hear them, it can still be true that for each witness, the things that they are most confident in will be the things that they are most likely to identify correctly. Figure 8 shows how this might work.

 

 

 


Figure 8

 


Instructions might increase the confidence that a witness expresses for each memory about which they are asked. However, the items in which they are most confident might still be the items for which they have the strongest memories. Thus, higher confidence would still be associated with higher accuracy.[13]
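A minimal sketch of this point follows. It assumes a hypothetical witness with four memories of differing strength and an instruction that simply adds a constant to every rating on a 0 to 100 scale; the strengths and the size of the shift are invented.

```python
from statistics import mean

def confidence_ratings(strengths, instruction_shift):
    """Map memory strengths (0 to 1) to ratings, shifted by an instruction effect."""
    return [100.0 * s + instruction_shift for s in strengths]

def rank_order(values):
    """Indices of the items ordered from lowest to highest value."""
    return sorted(range(len(values)), key=lambda i: values[i])

strengths = [0.20, 0.45, 0.60, 0.80]   # invented strengths for four memories
baseline = confidence_ratings(strengths, instruction_shift=0.0)
lenient = confidence_ratings(strengths, instruction_shift=15.0)

print("mean confidence, baseline instructions:", mean(baseline))
print("mean confidence, lenient instructions: ", mean(lenient))
print("item ordering unchanged:", rank_order(baseline) == rank_order(lenient))
```

The instruction raises every rating, and so the average, but the items in which the witness is most confident remain the items with the strongest memories.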

 

True independence of accuracy and confidence would only occur if the variables that control accuracy and those that control confidence affected different mechanisms and the two mechanisms were unconnected. Figure 9 shows an example of this model. In this case, variation of input A (while holding input C constant) would produce an effect on accuracy and no effect on confidence and, similarly, variation in input C (while holding input A constant) would produce an effect on confidence but no effect on accuracy. There is no evidence that confidence and accuracy are related in this manner. Although the relevant research has yet to be done, most studies show that variables that produce changes in accuracy (e.g., duration of exposure, retention interval, attention, and so on) tend to produce changes in average confidence, as well (e.g., Ebbesen & Wixted, 1996).


Figure 9

 


Does the fact that confidence can be changed by some manipulations that do not affect accuracy mean that decision-makers in the legal system should ignore witness confidence and focus on other variables, e.g., the retention interval, the type of lineup procedure used, or whether the person who conducted the lineup was blind as to the suspect? To answer this question in the affirmative implies that experts and jurors would be more accurate in their judgments about guilt (and witness accuracy) were they to rely on other factors and ignore confidence rather than base their judgments on both types of information (or even just confidence alone). Unfortunately, currently we have no data that directly tests this hypothesis. Still, if confidence is malleable and other factors that might predict accuracy, e.g., retention interval, are not, then it seems only logical to ignore the "unreliable" or variable predictor in favor of the more reliable ones.

 

However, when attempting to generalize these findings to the real world, at least three issues are important.

 

As suggested above, confidence ratings are likely to be affected by all of the "judgment" variables that affect other types of judgments, e.g., attitudes and perceptions. Two broad categories of variables that will affect how people use judgment scales to express subjective experiences are information about how to map the scale onto their feelings (e.g., instructions, anchors, social comparison, training) and motivational variables (e.g., rewards and punishments for using different parts of the scale under various conditions). The studies showing that confidence is malleable have used specific procedures to show that college student subjects who know that they are participating in an experiment will alter their confidence estimates. A key to being able to generalize the results of these procedures is the nature of the informational and motivational variables that control the behavior of actual victims and witnesses when they provide evidence to the legal system. Suppose, for example, that the large majority of actual witnesses and victims are extremely motivated to avoid making highly confident false alarms. They might do so because they fear innocent people will be convicted and/or because they do not want the system to arrest the wrong person, leaving the guilty person to go free. Would the variables in those studies that have demonstrated confidence malleability be strong enough to have effects on such people? That is, how do the motivational and informational variables that have been studied in the laboratory compare to the motivational and informational forces that exist in the real world? Laboratory studies have not examined this issue because researchers do not know how actual witnesses and victims are motivated. Nevertheless, it does seem reasonable to assume that actual victims and witnesses may establish their confidence criteria differently than subjects who view videotapes in laboratories.

 

A second issue concerns the fact that very few police and investigative agencies use rating scales to measure confidence. In cases with which I am familiar, some officers will ask, after a witness has selected someone from a photographic lineup, "How do you know this person?" If the witness says, "That's the person who robbed me," the officer concludes that the witness is very confident. But if the witness says, "He looks like the person who robbed me," the officer will state that the witness was less than completely confident. In other words, some officers consider verbal responses that "identify" the suspect as the culprit as indicating high confidence. In other cases, witnesses will point to one or more features and say something like, "That's him. I'm sure that's him. I will never forget those eyes." Although it is possible that the same variables that affect how subjects use confidence ratings will affect this kind of verbal behavior, relevant research has not yet been done.

 

The third and possibly most important issue concerns the possibility that the initial strength of the witness's memory for the event/culprit will affect the malleability of the witness's confidence. For example, in most laboratory studies of eyewitness memory, researchers arrange the learning conditions such that accuracy scores will be far from perfect. Clearly, if subjects are 100% correct because the learning conditions are so good, then it will be impossible to observe the effects of other variables on accuracy. As a consequence, in the huge majority of laboratory studies, memories for events and people will not be strong (e.g., the proportion of subjects who correctly recognize the culprit will be much less than 100%, the average proportion of faces that are correctly recognized will generally be somewhere between chance and 85%, and so on). It is likely that the average confidence of subjects in these studies will also be somewhat less than absolute. If this description accurately portrays the current research, two problems arise. First, as we know from research in other judgment domains (Eiser, 1990), it may well be harder to influence extreme ratings than middle-valued ones. Second, as noted earlier, the legal system is likely to ignore witnesses whose confidence is weak. Although we have no research that assesses the average cutoff point used by the legal system, it might well be higher than the average confidence created by researchers. If so, it could be that most witnesses for cases that are pursued by police and prosecutors are already initially so confident that the fear that other events might have caused their confidence to increase is unnecessary.

 

Still, it can be argued that unless one knows the precise conditions under which confidence estimates were obtained, it would be very difficult to derive an ex