Ebbe B. Ebbesen
University of California,
San Diego
November 3, 2000
In
a relatively recent survey (Kassin, Ellsworth, & Smith, 1989), researchers
in the area of eyewitness memory indicated that they believed that the relationship
between witnesses' confidence in their identifications and the accuracy of
those identifications is weak, at best. In fact, in a recent paper Wells, et
al. (1998) concluded:
Jurors appear to overestimate the accuracy of
identifications, fail to differentiate accurate from inaccurate eyewitnesses --
because they rely so heavily on witness confidence, which is relatively
nondiagnostic -- and are generally insensitive to other factors that influence
identification accuracy. (p. 642)
This
shared expert opinion (see Penrod & Cutler, 1995, for a similar argument,
and Leippe, 2000, for a similar statement in a written report to court in an
actual car hijacking case) seems to be based on three sources of evidence.
First, a large number of studies in the experimental literature report low and
frequently non-significant correlations between rated confidence in
identifications and the accuracy of those identifications (e.g., Bothwell,
Deffenbacher, & Brigham, 1987). Second, an extensive series of studies shows
that eyewitness accuracy varies as a function of factors other than confidence
(e.g., stress, duration of exposure, instructions, the nature of the lineup
procedures, post-event information, and so on).[1]
Third, a number of studies have shown that it is possible to produce changes in
the confidence that witnesses express in their memories independent of changes
in the accuracy of those memories. All of these studies are based on several
different types of experimental tests and therefore appear to offer a form of
convergent validation to the conclusion that the relationship between
confidence and accuracy is relatively weak to non-existent, especially when
compared to other predictors of accuracy. A weak relationship between
confidence and accuracy would be consistent with the hypothesis that people do
not have direct or (more properly) valid access to the strength of their
memories.
The
conclusion reached by Wells, et al. (1998) that jurors make mistakes because
they emphasize witness confidence over other factors is based on two very
strong applied assumptions. The first is the non-intuitive assumption that
witness self-confidence is not a good predictor of the accuracy of witness
testimony. The second is the assumption that other factors, as they are typically
available to jurors, are more diagnostic (than confidence) of eyewitness
accuracy. This paper argues that neither of these conclusions can be reasonably
applied to the real world given the nature of the theory, methodology, and
research results that underlie them and the nature of the decisions faced by
decision-makers in the legal system.
Deffenbacher
(1980) proposed the "optimality hypothesis" to explain the wide range
of confidence-accuracy correlations that he noted in his review of the
literature conducted prior to 1980. He argued that correlations between
confidence and accuracy would tend to be low when the conditions of learning
and memory are less than optimal, e.g., when it is difficult for witnesses to
encode and/or retrieve the information to which they have been exposed. The
correlations would be high only under conditions of optimal learning and
memory. Finally, he argued on intuitive grounds that most crimes consist of
less than optimal learning and memory conditions (e.g., they tend to involve a
great deal of stress, the exposures are generally brief, there is usually a
long delay between observing the criminal and being asked to identify him, the
test procedures tend not to emphasize effective retrieval strategies).[2]
As a result, he concluded that the correlation between accuracy and confidence
would be low for witnesses and victims to actual crimes. He further concluded
that jurors and other key decision-makers should be made aware that confidence
is not an indicator of the accuracy of witness memories.[3]
One
reason that laboratory studies might support the optimality hypothesis, at
least in the extreme, is the fact that subjects who have no memory for an event
but are forced to respond might guess. If some subjects in experiments guess,
clearly, by chance they will be correct some of the time and incorrect other
times. However, we would also expect that whatever confidence they express in
these guesses would average to about the same levels for correct guesses as for
the incorrect ones. After all, the subjects would not know which guesses are
correct and which are not. Thus, when the learning conditions are so bad that
observers can do no better than to guess randomly, the relationship between
confidence and accuracy should be zero, just as Deffenbacher's hypothesis
argues.
On
the other hand, once the subjects' average strengths of memory increase above
zero, the relationship between confidence and accuracy can grow stronger
because self-knowledge about whether a response is correct can now be based, at
least some of the time, on veridical memories. That is, a subject might have a
very strong, and accurate, recollection. As a result, this subject might be
very confident. Consistent with this analysis, Ebbesen and Wixted (1996) used
signal detection theory (Macmillan & Creelman, 1991) and Monte-Carlo
simulation methods to demonstrate how the size of confidence-accuracy
correlations will tend to increase with increasing d'. Interestingly, if the
present reasoning is correct, it suggests that Deffenbacher's claim that the
confidence-accuracy relationship will be weak to non-existent in actual crime
situations is equivalent to the claim that witnesses to actual crimes have no
memory for the events and are just guessing.[4]
Clearly this conclusion is a much stronger one than the claim that there is a weak
relationship between confidence and accuracy. It is also a conclusion for which
there is no empirical support.[5]
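The guessing argument and the signal detection argument above can be illustrated with a small Monte-Carlo sketch. It is not the simulation reported by Ebbesen and Wixted (1996); it is a minimal illustration under assumptions that are mine: an equal-variance Gaussian model, a yes/no decision criterion placed midway between the two distributions, confidence defined as the binned distance of the memory strength from that criterion, and arbitrary cutpoints and trial counts.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate_ca_correlation(d_prime, n_trials=20000, seed=0):
    """Confidence-accuracy correlation for simulated yes/no responses.

    Assumed model (illustrative only): previously seen items have strength
    N(d', 1), new items N(0, 1); the response is "seen" when strength exceeds
    d'/2; confidence (1-5) is the binned distance from that criterion.
    """
    rng = random.Random(seed)
    criterion = d_prime / 2.0
    cutpoints = (0.5, 1.0, 1.5, 2.0)  # hypothetical confidence criteria
    confidence, accuracy = [], []
    for _ in range(n_trials):
        is_old = rng.random() < 0.5
        strength = rng.gauss(d_prime if is_old else 0.0, 1.0)
        says_old = strength > criterion
        distance = abs(strength - criterion)
        confidence.append(1 + sum(distance > c for c in cutpoints))
        accuracy.append(1 if says_old == is_old else 0)
    return pearson(confidence, accuracy)


correlations = {d: simulate_ca_correlation(d) for d in (0.0, 0.5, 1.0, 2.0)}
for d, r in correlations.items():
    print(f"d' = {d:.1f}  confidence-accuracy r = {r:.3f}")
```

Under these assumptions the correlation should be close to zero when d' = 0 (pure guessing) and should grow as d' increases over the range shown. The sketch makes no claim about the size of the correlation at larger values of d'.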
Several
reviews conducted after Deffenbacher's concluded on a slightly, albeit very
slightly, more positive note than his. For example, Fleet, Brigham,
& Bothwell (1987) concluded:
The claims of previous reviewers of the
confidence-accuracy literature (Deffenbacher, 1980; Leippe, 1980; Wells &
Murray, 1984) that confidence is an unreliable predictor of accuracy are
perhaps premature. In addition to the unresolved issues of how to subdivide the
research samples, there are the issues concerning ecological validity. For example, several recent field studies
have found a significant correlation between confidence and accuracy (Brigham,
Maass, Snyder, & Spaulding, 1982; Hosch & Platz, 1984; Krafka &
Penrod, 1985; Pigott & Brigham, 1985). (p. 183)
More
recently, some have argued that the size of the correlation between confidence
and accuracy may depend on other factors besides the optimality of initial
learning and memory conditions (e.g., Clark, 1997; Cutler & Penrod, 1989;
Ebbesen & Wixted, 1996; Libuser & Ebbesen, 1999; Lindsay, Read, &
Sharma, 1998; Robinson & Johnson, 1998; Wells, et al., 1998). For example,
whether the confidence estimate is obtained prior to or after the
identification response is one factor that seems to moderate the size of
confidence-accuracy correlations (Cutler & Penrod, 1989). Another is the
difference between choosers, i.e., those who pick someone, and non-choosers,
i.e., those who fail to pick anyone (Sporer, Penrod, Read, & Cutler, 1995).
Still another is the possibility that feedback about the accuracy of an
identification might affect confidence in that identification (Wells &
Bradfield, 1998). Robinson & Johnson (1996) suggested still another
moderating variable. They reported evidence that the testing procedure (recall
compared to recognition) affects the degree of relationship between confidence
and accuracy of memory. Kebbell, Wagstaff, & Covey (1996) suggested that
the low correlations might be due to relatively small variation in the difficulty
of the items used in the memory tests. Clark (1997) has presented some data suggesting
that the similarity among the items that people are attempting to recognize
might play a role not only in the size but also in the direction of the
confidence-accuracy relationship. Approaching the problem more generally,
Ebbesen & Wixted (1996) used signal detection theory to describe the ways
in which confidence and accuracy might be related. In the typical signal
detection view, confidence estimates are simply additional decision (judgment)
criteria placed on the same subjective strength-of-memory dimension used to
identify someone as the culprit. This signal detection view provides an
explanation for the "optimality" findings as well as chooser v.
non-chooser differences (e.g., Sporer, 1992; Sporer, Penrod, Read, &
Cutler, 1995). It also raises a number of issues that have been all but
ignored by those concluding that jurors should ignore confidence because there
is no relationship between confidence and accuracy.
One
of the issues raised by the signal detection analysis concerns the specific
method of aggregation used in the computation of the relationship between
confidence and accuracy. In particular, correlations between confidence and
accuracy can be computed in a number of different ways. To fully appreciate
these different methods requires that we examine how researchers have studied
eyewitness identification accuracy, confidence, and their relationship. Event
memory, face memory, and fact memory are the most commonly used procedures to
acquire information about accuracy and confidence.
In
event memory research, study participants are presented with a single event in
which they observe one (or a very small number) of individuals do something.
For example, participants might watch a slide presentation or a videotape of a
simulated robbery or they might be present in a room when someone enters and
does something unusual or unexpected. Afterwards, in the large majority of
these studies the participants are asked to look at a photographic array of
individuals (usually but not always consisting of six people) and attempt to
identify the person that they saw in the event. In some studies the
participants are asked how confident they are in their ability to identify the
person(s) that they saw in the event prior to being shown a lineup as well as
how confident they are in a particular response made to the lineup. For
example, a "witness" might be asked, "How confident are you that
you would be able to identify the person you saw were you to see him in a
lineup?" Then after being shown the lineup and picking someone, the
"witness" might be asked, "How confident are you that the person
you picked is the person you saw?" or after declining to pick someone,
"How confident are you that the person you saw is not in the lineup?"
In other studies, the post-lineup confidence question might be, "How
confident are you in your response?" It is important, as we shall soon
see, to note that event memory research generally produces one post-lineup
confidence response and one "identification" response per
participant. That is, each participant sees one event, attempts to identify
someone from one lineup, and then indicates how confident he or she is in that
response. In general, the event and the criminal are held constant across all
witnesses within a particular study. In addition, in the majority of studies
the participants know that their choices have little or no real consequences.
That is, no one will be accused of committing a crime on the basis of the
participants' choices.
In
more recent research, studies such as these present half of the participants
with target-present lineups that contain the "culprit" who was in the
videotape or who did the unusual act and half with target-absent or blank
lineups that do not contain the "culprit". In the latter, the target
is frequently replaced with someone who looks similar to but is not the target.[6]
Note that the participant's choices can be coded as correct or incorrect.
However, there are several different types of correct and incorrect responses.
A participant might correctly choose the culprit from the target-present lineup
or correctly not pick anyone from a target-absent lineup. Alternatively, the
participant might make several different types of incorrect responses. She
might pick a "foil" from either the target-present or the
target-absent lineup or she might not pick anyone when presented with the
target-present lineup.
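The response types just described can be written down as a small coding rule. The sketch below is only an illustration: the function name, the three choice labels ('target', 'foil', 'none'), and the outcome labels are hypothetical conveniences, not terms taken from any particular study.

```python
from typing import Tuple


def code_lineup_response(target_present: bool, choice: str) -> Tuple[str, bool]:
    """Return (outcome label, is_correct) for one lineup response.

    choice is 'target', 'foil', or 'none' (hypothetical labels).
    """
    if choice not in ("target", "foil", "none"):
        raise ValueError("choice must be 'target', 'foil', or 'none'")
    if target_present:
        if choice == "target":
            return ("correct identification", True)
        if choice == "foil":
            return ("foil identification", False)
        return ("miss", False)  # culprit was present but nobody was picked
    if choice == "target":
        raise ValueError("a target-absent lineup has no target to choose")
    if choice == "foil":
        return ("foil identification", False)
    return ("correct rejection", True)  # nobody picked, culprit absent


for present in (True, False):
    for choice in ("target", "foil", "none"):
        if not present and choice == "target":
            continue  # impossible combination
        print(present, choice, code_lineup_response(present, choice))
```

Two of the five possible outcomes are coded as correct and three as incorrect, which is why a single "correct/incorrect" score hides distinctions that matter in applied settings.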
It is important to note that these different errors
would have different implications for the legal system were they produced by
actual eyewitnesses. For example, when a witness fails to pick the actual
culprit from a target-present lineup, the culprit will probably not be charged
with the crime (assuming other strong evidence against him does not exist) and
a guilty person will be set free. Similarly, when a witness picks a
"foil" from the target-present lineup, the guilty culprit will again
be set free and the "foil" will, in all likelihood, not be charged
with the crime because the police who constructed the lineup generally know
that the foils are innocent. If, on the other hand, the witness picks the
"suspect" in a target-absent lineup (that is, a person who the police
believe committed the crime but is actually innocent), then, in a miscarriage
of justice, the wrong person will be charged with the crime (assuming other
evidence is not sufficient to exonerate the innocent individual) and the guilty
person will go free.[7]
In
face memory research, participants are shown a large number of faces one at a
time (via slides or pictures) and frequently asked to make some sort of
judgment about each one (generally, to ensure that the participants are paying
attention). After looking at all of the faces, typically for no more than a few
seconds each (Shapiro & Penrod, 1986), they are tested for their memory of
the faces. Often the test consists of a "yes/no" task but sometimes a
"two alternative forced-choice" procedure is used. In the "yes/no"
task, the participants are shown another large set of faces. They are told that
they saw some of the test faces before but did not see others. They are also
told that their job is to indicate which they had seen before. They are to say,
"yes," if they believe that they saw the face in the first set and,
"no," if they believe they did not. After each "yes/no"
response, they might be asked to indicate how confident they are in their
response. In the two-alternative forced-choice procedure the participants are
presented with two faces at once and asked to indicate which of the two they
saw in the first set and which they did not see (generally one of each pair was
seen before). Again, the subject might be asked to indicate how confident they
are that each response is correct. In both procedures each memory response can
be coded as correct or incorrect. In the "yes/no" task, the
participant can be correct by either picking a person they saw before (called a
hit) or saying, "no," to a person they did not see before (called a
correct rejection). Similarly, they can be incorrect by picking someone they
did not see before (called a false alarm) or not picking someone whom they did
see before (called a miss). In the two-alternative forced-choice procedure, the
participant either picks the face they did see before or picks the one they did
not see before. Unlike in the event memory procedure, it is possible to
aggregate the results from all of the responses to all of the faces together
for each subject. Researchers can compute an overall, "percent
correct" score for each subject.
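As an illustration of the scoring just described, the following sketch tallies the four "yes/no" outcomes for one participant and computes the overall percent correct. The function name and the eight test trials are made up for the example.

```python
from collections import Counter


def score_yes_no(trials):
    """Tally outcomes for a list of (was_studied, said_yes) pairs.

    Returns the outcome counts and the overall percent correct.
    """
    counts = Counter()
    for was_studied, said_yes in trials:
        if was_studied and said_yes:
            counts["hit"] += 1
        elif was_studied:
            counts["miss"] += 1
        elif said_yes:
            counts["false alarm"] += 1
        else:
            counts["correct rejection"] += 1
    n_correct = counts["hit"] + counts["correct rejection"]
    return counts, 100.0 * n_correct / len(trials)


# Hypothetical responses from one participant to eight test faces.
trials = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, True), (False, False), (False, False),
]
counts, percent_correct = score_yes_no(trials)
print(dict(counts))
print(percent_correct)  # 75.0 (3 hits + 3 correct rejections out of 8)
```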
In
fact memory research, participants are asked a series of questions about an
event, frequently the same event for which they are shown a lineup. The
researcher establishes a set of correct answers to the questions depending on
the match between what happened in the event and the answer given. For example,
the experimenter might ask a participant whether the "culprit" held a
pen in his hand. Responses are coded as correct or incorrect depending on the degree
of match. After each response, the experimenter might ask the participant how
confident she is in her answer.
Researchers
generally estimate the size of the relationship between confidence and accuracy
by computing correlations between the two measures. Bothwell, Deffenbacher,
& Brigham (1987) reviewed many of the event memory studies and concluded
that the average correlation between confidence and accuracy was, although
probably greater than zero, unlikely to be much larger than .25. In these
correlations, each participant contributes two observations, a memory response
that is either correct or not (coded 1 or 0) and a confidence rating (coded 1
through n, depending on the number of steps on the confidence scale, e.g., not
at all confident, slightly confident, moderately confident, etc.). The data
will look something similar to the graph in Figure 1. Each participant is
either right or wrong and gives a confidence rating that goes along with his or
her response. Each of the dots in Figure 1 represents the results for a group of
participants, all of whom indicated a particular confidence level and were
correct (coded 1) or not (coded 0). Although we can't see it in the graph, the
number of participants whose responses put them at a particular point (1 or 0)
varies. Thus, when the relationship between confidence and accuracy is high, we
would expect most of the participants who indicated that they were very
confident (5) would be correct and most of those indicating a low confidence
(1) would be incorrect. Stated differently, the proportion of highly confident
people who are correct should be higher than the proportion of unconfident
people who are correct.

Figure 1. Linear fit to dot plot of
correct and incorrect responses as a function of rated confidence. Each dot
represents a number of observations at the conjoint accuracy and confidence
values. Of particular importance is the fact that the best fitting linear
function will be unable to provide a very good fit of the data even when a large
majority of the high confidence responses are correct and a large majority of the
low confidence responses are incorrect.
Interestingly,
this method of computing correlations between confidence and accuracy is
constrained to produce generally low correlation coefficients, even when the
proportion of correct highly confident people is greater than the proportion of
correct unconfident witnesses. This is because correlations fit a continuous
linear function to data points. A perfect correlation is obtained when the
linear function runs through all of the data points, as in Figure 2. However,
as can be seen in Figure 1, any attempt to fit a straight line to the data will
fall in between all of the data points because the line must move through
values that are between 0 and 1 but the data points are constrained to be
either 0 (incorrect) or 1 (correct) and can be nothing in between. As a
consequence, the resulting correlation coefficients will tend to be numerically
low. The fact that the proportion of correct responses at the very lowest (e.g.,
guessing) confidence level will generally be greater than 0 (by chance, some
subjects who guess will be correct) also tends to reduce the upper limit of the
size of the correlations that one might expect from this research.[8]
In short, the fact that many event memory studies report low correlations
potentially tells us more about the inappropriate use of the correlation
coefficient to measure the size of the relationship between confidence and
accuracy in event memory studies than about whether there is a strong
relationship between confidence and accuracy in event memory.
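The statistical point can be made concrete with a small worked example. The numbers are invented for illustration: ten participants at each of five confidence levels, with 1, 3, 5, 7, and 9 of them correct, so that the proportion correct rises in a perfectly straight line from .1 to .9. The sketch compares the correlation computed over individual 0/1 responses (as in Figure 1) with the correlation computed over the proportion correct at each confidence level (as in Figure 2).

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


levels = [1, 2, 3, 4, 5]     # confidence scale points
n_correct = [1, 3, 5, 7, 9]  # hypothetical number correct at each level
n_per_level = 10             # hypothetical participants per level

# One (confidence, accuracy) pair per participant, accuracy coded 1 or 0.
confidence, accuracy = [], []
for level, k in zip(levels, n_correct):
    confidence += [level] * n_per_level
    accuracy += [1] * k + [0] * (n_per_level - k)

r_individual = pearson(confidence, accuracy)

# One point per confidence level: the proportion of participants correct.
proportions = [k / n_per_level for k in n_correct]
r_grouped = pearson(levels, proportions)

print(round(r_individual, 3))  # 0.566
print(round(r_grouped, 3))     # 1.0
```

In this invented data set 90% of the most confident participants are correct and 90% of the least confident are wrong, yet the correlation over individual responses is only about .57, while the same data aggregated by confidence level fall exactly on a straight line.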
Ignoring
the problems with using correlation coefficients in event memory studies for
the moment, it is of some interest to note that researchers have reported that
the correlations (though still small in absolute size) for witnesses who choose
someone from a lineup are higher than for witnesses who choose no one (e.g.,
Sporer, 1992; Sporer, Penrod, Read, & Cutler, 1995). If this is a general
result, it has applied significance. Defense attorneys sometimes argue that
their client is innocent of the charges, either because the charges describe
behavior in which they claim their client did not engage (e.g., the drugs
belonged to the other person) or because their client is the wrong person (e.g.,
he was home with his mother at the time the store was robbed). When defense
attorneys argue that their client has been misidentified, they are claiming
that a witness who positively identified their client is wrong. In other words,
the defense must be that the witness was presented with a target-absent lineup
and chose an innocent suspect, not that the witness was presented with a
target-present lineup and didn't choose the guilty culprit. Thus, when
attempting to generalize research results to the real world, experts who argue
that witness confidence is an unreliable indicator (overall) are potentially
misapplying the research because current research seems to suggest that
confidence is a more reliable indicator for subject-witnesses who have
identified someone than for those who chose not to identify anyone. To be
consistent with the domain to which the results are being generalized, experts
who worry that too many innocent people are being falsely identified should be
basing their conclusions on the confidence-accuracy relationship for choosers
and not on the relationship for both choosers and non-choosers. In my
experience, I have never heard experts testify this way when speaking about the
confidence-accuracy relationship.

Figure 2. Linear fit of average degree of confidence to the proportion of correct
responses.
The
use of correlations to measure the relationship between confidence and accuracy
in event memory studies raises another important issue, namely, the fact that
the data points represent the behavior of different witnesses who have
witnessed the identical (or nearly identical) event. Thus, the variation from
data point to data point in Figure 1 is variation from one witness to the next.
Because each witness saw the identical videotape with the same culprit, the
event memory correlation represents differences in confidence and accuracy that
must be due to "pre-existing" psychological differences between the
witnesses and not to the fact that different witnesses saw the culprit under
different learning conditions. The only way these correlations could be high is
if people who have better face memory (e.g., Benton, Sivan, Hamsher, Varney,
& Spreen, 1983) or who attend more closely or who process faces more
deeply, etc., are also people who tend to provide higher confidence ratings.[9] The typical analyses of event memory studies
do not allow for the possibility that the reason some witnesses in the real
world are more confident in their identifications than others is because they
saw the culprit under better learning conditions and therefore had better
memories of the culprit (see Lindsay, Read, & Sharma, 1998, for a similar
argument).
Use
of the single-event, multiple-witness memory procedure opens the door to the
possibility that different participants in the research will use the
measurement scale differently to express their confidence. Not only might
different people be more or less likely to remember what the
"culprit" looked like (for whatever unique and unknown individual
differences) but people whose memories might be equally good (or bad) may also
be more or less likely to label their confidence as very high or very low on
the rating scales. Thus, individual differences in how people use the
confidence scale that are uncorrelated with individual differences that cause
differences in strength of memory for the event will tend to attenuate
single-event, multiple-witness confidence-accuracy correlations.
The
fact that the generalization is over individuals who witnessed the identical
event is important because we can ask what an outside observer might infer from
an in-court confidence statement made by a single witness. In the real world,
individuals who have to decide whether a witness is correct generally do not
have the luxury of hearing from multiple witnesses, all of whom observed the identical
crime from the same visual angle for the same amount of time. Instead, most
decision-makers (e.g., detectives, jurors or prosecutors) hear one witness make
an identification response and then provide a confidence estimate. When a
witness tells us that he is confident in his memory of a particular person or
event, do we infer from this expression of confidence that this witness is more
likely to remember correctly than someone else might be -- a someone else whom
neither the witness nor we have seen or heard? Alternatively, do we infer that
this witness is telling us something about this particular memory compared to
other memories this witness has? If it is the latter, then the between-subjects
correlations that make up a major part of the database (see Bothwell et al.,
1987; Sporer, et al., 1995) for the experts' opinions may employ an
inappropriate method of aggregation to study the relationship between
confidence and accuracy. I shall return to this point later.
Individual
Differences in Face Memory and General Confidence: In face memory research
correlations between each participant's overall accuracy (e.g., the percentage
of all of their responses to all of the test faces that are correct) and each
participant's average confidence (for all of their responses, both correct and
incorrect ones) are generally computed. The results for a zero correlation
might look something like the data represented in Figure 3. Note that a
correlation computed on the basis of the confidence and memory data presented
in Figure 3 represents something different than that previously described. In
this case, we are asking whether people who are generally more confident in all
of their attempts to identify faces that they have seen before are also getting
a higher percentage of their identifications correct. Like the event memory
case, these correlations represent individual differences among people who
studied the identical set of faces. However, these are based on averages over
many identifications rather than just one identification of one face. As such,
they represent individual differences in tendencies to be confident and
tendencies to be correct. Are people who are more likely to use the higher ends
of the confidence scale also more likely to identify correctly faces that they
have and have not seen before?

Figure 3. Scatter plot of average degree of confidence to proportion of correct
responses.
Differences
in the Memorability of Faces and General Confidence: Although not typically
reported, another correlation can generally be computed from data obtained from
face memory research. In this case, averages are computed for each face rather
than for each participant. That is, rather than average all the data from all
of the faces that each participant saw, it is possible to average all of the
participants' responses together for a particular face. In this case, one is asking
whether people are more likely to identify correctly some faces than others. In
addition, are the participants more confident, as a group, in their responses
to those faces that they are more likely to identify correctly? In this case,
the generalization is across "criminals" (actually faces) rather than
witnesses. Are the more memorable faces those that witnesses (as a group) tend
to be most confident about?
One
potentially important difference between the individual-difference and
face-based correlations is that in the latter individual differences in how
people use the confidence scale will average out whereas in the former they
will be exaggerated. If different people tend to use confidence scales
differently but each person is generally more confident in those faces that
they correctly recognize, then we would expect the face-based correlations to
be higher than the individual-difference correlations. This is exactly the
result that Ebbesen & Wixted (1996) reported. The fact that face-based
compared to individual-difference confidence-accuracy correlations may be
higher suggests that one of the reasons single-event, multiple-subject correlations
are generally low is because they are sensitive to individual differences in
how people use confidence scales.
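The contrast between the two methods of aggregation can be sketched with simulated data. This is not a reanalysis of the Ebbesen and Wixted (1996) data; it is a toy model whose every parameter is an assumption: participants differ a little in memory ability and a great deal in how high they set their confidence ratings, faces differ in memorability, a response is correct when memory strength is above zero, and reported confidence is memory strength plus the participant's personal scale offset.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def mean(xs):
    return sum(xs) / len(xs)


rng = random.Random(1)
N_PEOPLE, N_FACES = 200, 40  # arbitrary sizes for the sketch

# Hypothetical sources of variation (all standard deviations are assumptions).
ability = [rng.gauss(0.0, 0.3) for _ in range(N_PEOPLE)]     # memory skill
scale_bias = [rng.gauss(0.0, 2.0) for _ in range(N_PEOPLE)]  # scale use
memorability = [rng.gauss(0.0, 1.0) for _ in range(N_FACES)]

correct = [[0] * N_FACES for _ in range(N_PEOPLE)]
confidence = [[0.0] * N_FACES for _ in range(N_PEOPLE)]
for p in range(N_PEOPLE):
    for f in range(N_FACES):
        strength = ability[p] + memorability[f] + rng.gauss(0.0, 1.0)
        correct[p][f] = 1 if strength > 0 else 0
        confidence[p][f] = strength + scale_bias[p]

# Individual-difference correlation: one averaged point per participant.
r_individual = pearson([mean(row) for row in confidence],
                       [mean(row) for row in correct])

# Face-based correlation: one averaged point per face.
face_conf = [mean([confidence[p][f] for p in range(N_PEOPLE)])
             for f in range(N_FACES)]
face_acc = [mean([correct[p][f] for p in range(N_PEOPLE)])
            for f in range(N_FACES)]
r_face = pearson(face_conf, face_acc)

print(f"individual-difference r = {r_individual:.2f}")
print(f"face-based r            = {r_face:.2f}")
```

Because each face's average is taken over all participants, the personal scale offsets cancel in the face-based averages but remain in full in the per-participant averages. Under these assumptions the face-based correlation should therefore come out far higher than the individual-difference correlation.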
A correlation can be computed for each witness in
fact-memory research. Like the data in Figure 1, witnesses get each of the
facts correct or incorrect. However, since each witness is tested on multiple,
rather than just one, fact, it is possible to compute a correlation for each
witness (e.g., Smith, Kassin, & Ellsworth, 1989). In this case, one is
asking whether the fact responses that a witness is most confident in are more
likely to be correct than the witness's less confident fact responses. Here a
separate generalization can be made for each witness over each witness's many
different attempts to remember things (Gruneberg
& Sykes, 1993; Smith, Kassin, & Ellsworth, 1989).
It
might be worth noting that the same correlations can be computed for each
participant in face memory research (Ebbesen & Wixted, 1996). That is, we can
ask whether for a particular witness those faces in which he expresses the most
confidence are those faces to which he is most likely to respond correctly.
This
method of measuring the association between confidence and accuracy has the
same statistical limitations as that described for event memory (e.g., correct
and incorrect responses are coded 1 and 0 and the resulting correlations will
generally be small). However, these correlations represent variation over
memories for different events/people within each witness and not individual
differences in memory for the same event. That is, they represent
multiple-event, single-subject correlations.
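A tiny invented example may help to show what a within-witness correlation captures. Two hypothetical witnesses each answer six items and get the same items right and wrong; one uses only the bottom of a five-point confidence scale and the other only the top. The sketch computes the correlation separately for each witness and then for the two sets of responses pooled together, as a rough stand-in for comparisons made across witnesses.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: same pattern of correct (1) and incorrect (0) answers,
# but witness A rates confidence 1-2 and witness B rates it 4-5.
witness_a = {"confidence": [1, 1, 1, 2, 2, 2], "correct": [0, 0, 1, 1, 1, 0]}
witness_b = {"confidence": [4, 4, 4, 5, 5, 5], "correct": [0, 0, 1, 1, 1, 0]}

r_a = pearson(witness_a["confidence"], witness_a["correct"])
r_b = pearson(witness_b["confidence"], witness_b["correct"])
r_pooled = pearson(witness_a["confidence"] + witness_b["confidence"],
                   witness_a["correct"] + witness_b["correct"])

print(round(r_a, 3))       # 0.333
print(round(r_b, 3))       # 0.333
print(round(r_pooled, 3))  # 0.105
```

Each witness is more often right on the items rated higher, and shifting a witness's ratings up or down the scale leaves that witness's own correlation unchanged. Pooling the two witnesses mixes in the difference in scale use and the correlation drops to about .11.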
It
is also possible to compute individual difference and fact-based (like
face-based) correlations. That is, the total percentage of correct facts that
each witness remembers can be compared to each witness's average confidence.
Alternatively, the percentage of witnesses who respond correctly to each fact
can be compared to the average confidence that all witnesses report for each
fact.
Very
few studies have examined the relationship between confidence and accuracy over
a wide array of different events/criminals (Lindsay, et al., 1998). In such an analysis,
variation in the event as well as the witness and the criminal contribute to
differences in both accuracy and confidence. That is, different data points
represent different witnesses, criminals, and events (events that might differ
in terms of key factors, e.g., other- v. same-race, stress, retention interval,
disguise, distinctive features, and so on). Lindsay, et al. (1998) and Read,
et al. (1998) reported that this method of constructing correlations between
confidence and accuracy resulted in higher correlations than have been
typically reported. In other words, when the variation that different learning
conditions produce in different people's memory is not held constant, i.e.,
multiple-event, multiple-witness memory procedures are used, the confidence-accuracy
correlations appear to be larger. This is unsurprising for two reasons.
First,
when we allow stimulus variations to influence accuracy, the range of memory
strength and therefore accuracy values may well be larger and more evenly
distributed over subjects than when stimulus variations are held constant. If
subjects do have access to relatively reliable information about the strength
of their memories, an increase in the range of strengths of memory should
increase the size of confidence-accuracy correlations. Second, the
multiple-event, multiple-witness procedure can increase the size of
confidence-accuracy correlations even if subjects do not have direct access to
the strength of their memories. If subjects base their confidence estimates, at
least in part, on observations of the learning and test conditions (they know
they saw the culprit for five minutes) and the meta-theory that they use to
infer confidence is sufficiently accurate, then their confidence estimates will
tend to co-vary with the event differences that control accuracy. Again, this
would tend to increase the size of confidence-accuracy correlations.
If
the multiple-event, multiple-witness procedure for producing variation in
accuracy and confidence data does produce higher confidence-accuracy
correlations, the conclusions reached by Bothwell, Wells, Penrod and others
regarding the relatively weak relationship between confidence and accuracy
might be premature and apply only to single-event, multiple-witness data. Since
in the real world of crime, different witnesses do not see the same crime from
the same visual angle for the same amount of time, one could easily argue that
generalizations made from the single-event, multiple-witness paradigm are
inappropriate.
Ebbesen
and Wixted (1996) report that confidence-accuracy correlations are much higher
(between .35 and .7) when they are based on differences between faces averaged
over witnesses (even holding learning and test conditions constant). They also
found that for over 90% of their subjects, the within-subject, between-face,
confidence-accuracy correlations were positive (i.e., for the large majority of
subjects, higher confidence was consistently associated with a greater probability
that identification responses were correct) although the absolute average size
of the correlations was between .2 and .25. Because different methods of
aggregating the same raw data generate different results regarding the size of
the correlation between confidence and accuracy, it is important that we ask
which method(s) supply the most appropriate estimates when generalizing to the
real world. Should we focus on variation produced exclusively by individual
differences in reactions to a constant criminal event? Alternatively, should we
focus on individual differences based on averages over different culprits and
events? Would it make more sense to focus exclusively on culprit-face
differences (averaged over many different witnesses and/or criminal events)?
Should we focus instead on culprit-face differences within each witness?
Alternatively, we might focus on differences in learning and/or test conditions
(averaged over witnesses but for only one culprit) or combinations of several
of these (and others) all at once.
Do
we want to know whether more confident witnesses to the same crime who saw the
culprit at the same visual angle for the same period of time are also more
accurate? Or, do we want to know whether the confidence that a witness has in
her memory for one thing predicts the odds that the recollection of that thing
compared to other things will be accurate? Or, do we want to know whether
confidence estimates supplied by different witnesses who saw different culprits
under varying conditions are predictive of their accuracy? Or do we want to know
whether some criminals for whom typical witnesses feel more confident are the
criminals that typical witnesses will tend to remember correctly? Clearly,
these are different generalizations. Unfortunately, the differences have not
been adequately discussed in the context of the decision problems faced by
people in the legal system.
Deciding
on the appropriate source of variation to recommend is complicated by the fact
that law and members of the legal system generally see every case as different
(Konecni & Ebbesen, 1982). Nevertheless, the legal system speaks in terms
of "odds." For example, terms such as, "more probably than
not," are used when discussing jury decision standards. Prosecutors ask
witnesses about, and witnesses are willing to speak about, percentages, as in, "I am 90% certain." If every case
is truly different, then such statements are meaningless because odds and
percentages depend on multiple examples of similar events. When a witness says
that she is 90% certain, what events make up the numerator and the denominator
of the percentage? Which of the following is she saying, a) 90 out of 100 times
when my memory is this strong I will be correct, b) 90 people out of 100 who
saw what I saw would be correct, c) I would be able to identify correctly 90
out of 100 criminals who looked as distinctive as the criminal that I saw, d)
90 people out of 100 would correctly identify a criminal with his features, e)
90 out of 100 times that I see people under the conditions that I saw this
person, I would be able to recognize them, and so on? Clearly, if witnesses are
telling us the odds that equally strong, but different, memories of past people
and events will be correct and researchers are attempting to tell us the
meaning of these claims by examining the size of correlations based on
individual-differences in single-event, multiple-witness studies, the
researchers' conclusions are based on the wrong kind of evidence. Stated
differently, we should base our conclusions about relationships on data that
match the kind of information that jurors and the rest of the legal system
really want to know.
What
information do actors in the legal system want to know when deciding whether to
file charges or reach a guilty verdict? Do these decision-makers care about the
strength of the relationship between confidence and accuracy or do they care
about the odds that a suspect is the guilty culprit? Although probably not
articulated in the same language as a statistician, it seems reasonable that
most actors would focus on odds and not the relationship. The reasoning of key
actors might be something like: if witnesses are very confident in their
identifications, then the odds that their identifications are accurate should
be high, and therefore the odds that the suspect is the guilty culprit should
be high. This reasoning says nothing about how accuracy should change as
the level of confidence changes. In fact, it is possible for the conditional
probability that suspects are guilty given that witnesses express high
confidence to be high even though changes in accuracy are weakly or even
unrelated to changes in confidence. If all levels of confidence were associated
with high accuracy, then no relationship between confidence and accuracy would
exist but the probability that identifications were accurate given high (as
well as low) confidence could be close to one.[10]
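A small numerical sketch makes the point concrete. The counts below are hypothetical (100 identifications at each of three confidence levels, 95 of them correct at every level), and the function names are assumptions made for this illustration only.

```python
from math import sqrt

# Hypothetical counts: confidence level -> (number correct, number of identifications).
# Level 1 = low, 2 = medium, 3 = high confidence.
counts = {1: (95, 100), 2: (95, 100), 3: (95, 100)}

def accuracy_given_confidence(counts):
    """P(correct | confidence level) for each level."""
    return {level: c / n for level, (c, n) in counts.items()}

def confidence_accuracy_r(counts):
    """Pearson correlation between confidence level and correctness (0/1),
    computed over the individual identifications implied by the counts."""
    xs, ys = [], []
    for level, (c, n) in counts.items():
        xs += [level] * n
        ys += [1] * c + [0] * (n - c)
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

print(accuracy_given_confidence(counts))
print(confidence_accuracy_r(counts))
```

In this made-up case the probability of a correct identification given high confidence is .95, yet the correlation between confidence and accuracy is zero (up to floating-point rounding), because accuracy does not change across confidence levels.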
One
aspect of evidence that decision-makers might be expected to use in estimating
the pool of suitable matching suspects is identification by a witness. How many
innocent people look similar enough to the culprit (who also match whatever
other evidence is available, if any is available) for the witness to identify
them as the person they saw? That is, what are the odds that the police have
arrested an innocent individual who looks enough like the culprit that a
witness would be willing to identify him?
Lineup
diagnosticity is one measure that some (e.g., Wells & Lindsay, 1980) have suggested
should be used to assess the ability of witnesses to indicate accurately who
the culprit is when shown a lineup. This measure compares the rate at which
subjects falsely identify "innocent" suspects in target absent
lineups to the rate of correct choices of the "guilty" target in
target present lineups (e.g., Wells & Lindsay, 1980; Wells & Luus,
1990). The higher this ratio, the more diagnostic the lineup is thought to be.
Of course, this measure can only be computed in experimental studies (with a
known culprit) that use single-event/culprit, multiple witness paradigms in
which different subjects are shown the same target present or target absent
lineup. This is because lineup diagnosticity would be expected to be different
for different culprits, foils, and suspects.
As
Navon (1990) correctly noted, given the decision problem facing police,
prosecutors, and jurors, lineup diagnosticity is not the measure of
diagnosticity on which the real world should focus its attention. This is
because lineup diagnosticity depends so much on how the experimenter selects
the innocent suspect for the target absent lineup as well as the match between
what the target looked like during the event and what he or she looked like in
the lineup (photo). It seems obvious that the more the innocent suspect looks
like the culprit, the higher the false alarm rate will be (assuming that the
witnesses remember something about the culprit's looks). In addition, the more
a culprit's appearance changes from the event to the lineup, the lower the
correct identification rate will be. Finally, the more the lineup is
constructed in a manner so that the innocent suspect "stands out,"
the higher the false alarm rate will be. Thus, it should be possible in a
laboratory experiment to control the relative rates of correct to false
identifications -- lineup diagnosticity -- by varying the similarity
relationships between the actual culprit and pictures used for the target,
suspect, and foils (e.g., Luus & Wells, 1991). This raises the possibility
that every real world lineup will have a different diagnosticity depending on
such details. Unfortunately, such details cannot be measured in any particular
lineup because they depend, in part, on the match between the suspect's and
culprit's looks (assuming that they are not one and the same). Obviously, the
guilty culprit's looks are generally not known if an innocent suspect is being
charged.
On the other hand, as the ecological likelihood increases, the odds that the suspect is the guilty culprit increase. As a result, the odds that the lineup shown to witnesses is a target absent lineup go down. As the odds that the lineup is a target absent lineup go down, the likelihood that suspect choices are correct goes up. This is true even if lineup diagnosticity is low.
Lineup diagnosticity is measured in
terms of the ratio of two ratios:
[(# of "guilty" target choices) / (# of target present lineups)]
divided by
[(# of "innocent" suspect choices) / (# of target absent lineups)]
Compare
the following two situations. In the first, 100 witnesses are shown a target
present lineup and 50 witnesses pick the target while another 100 witnesses are
shown a target absent lineup and 50 pick the suspect. In the second, 100
witnesses are shown a target present lineup and 50 witnesses pick the target
while another 10 witnesses are shown a target absent lineup and 5 pick the
suspect. In each case the diagnosticity ratio is .5/.5 or 1. However, if we ask
about the odds that witnesses who choose someone are correct, the odds are
50/50 or 1 to 1 in the first case and 50/5 or 10 to 1 in the second case. In
short, in actual cases, the odds that witness identifications are correct
depend heavily on the ecological likelihood that the lineup contains the guilty
culprit as opposed to an innocent suspect.
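The arithmetic of the two situations can be reproduced with a short Python sketch. The function names are assumptions chosen for this illustration; the counts are those given in the text. The last function simply restates the same calculation as the diagnosticity ratio multiplied by the ratio of target present to target absent lineups.

```python
def diagnosticity_ratio(hits, n_present, false_ids, n_absent):
    """(hits / target present lineups) / (false IDs / target absent lineups)."""
    return (hits / n_present) / (false_ids / n_absent)

def odds_choice_is_correct(hits, false_ids):
    """Odds that a witness who chose the suspect/target chose the guilty person."""
    return hits / false_ids

def odds_from_ratio(ratio, n_present, n_absent):
    """Same odds, written as diagnosticity times the prior odds that a
    lineup contains the culprit (target present : target absent)."""
    return ratio * (n_present / n_absent)

# Situation 1: 100 target present lineups (50 hits), 100 target absent (50 false IDs).
# Situation 2: 100 target present lineups (50 hits), 10 target absent (5 false IDs).
situations = {
    "first": (50, 100, 50, 100),
    "second": (50, 100, 5, 10),
}

for name, (hits, n_present, false_ids, n_absent) in situations.items():
    ratio = diagnosticity_ratio(hits, n_present, false_ids, n_absent)
    odds = odds_choice_is_correct(hits, false_ids)
    print(name, "diagnosticity:", ratio, "odds correct:", odds)
    # first  -> diagnosticity 1.0, odds 1.0  (1 to 1)
    # second -> diagnosticity 1.0, odds 10.0 (10 to 1)
```

The diagnosticity ratio is identical in the two situations; only the proportion of lineups that contain the culprit differs, and that proportion alone moves the odds from 1 to 1 to 10 to 1.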
Features of a case linked to the
suspect that will increase ecological likelihood are, by definition,
"distinctive" or unlikely to be associated with a large percentage of
the population of potential suspects. Thus, the facts that the culprit drove
off in a car and that the suspect owns a car do not add to the ecological
likelihood that the suspect is the culprit because these facts can apply to so
many potential suspects, i.e., owning a car is not distinctive. On the other
hand, the fact that the "get-away" car had a pink lightning bolt
painted on its hood and that the suspect owns a car with a pink lightning bolt
on its hood does add to the ecological likelihood that this is the guilty
suspect (although how much might depend on other factors, e.g., did the suspect
report that the car was stolen before the crime was committed, did the suspect
lend the car to someone on the day the crime was committed, and so on).
One set of features that might
affect ecological likelihood is that associated with the "looks" of
the suspect/culprit. For example, suppose a witness recalls that the culprit
had an unusual tattoo on his neck or a very prominent scar on his cheek or that
he was cross-eyed. Since such features would reduce the set of possible
suspects to a very small number, they should add to the ecological likelihood
that a suspect who has such a feature is the culprit. On the other hand, when
witnesses are asked to identify whether the suspect is the culprit from a lineup,
an issue arises about how best to deal with such distinctive features. Many
researchers (e.g., NIJ, 1999; Wells, et al., 1998) argue that creating a
lineup in which the suspect is the only member with the distinctive feature
decreases the diagnosticity of the lineup (because a target absent lineup in
which the innocent suspect is the only person with the distinctive feature will
produce a high rate of false alarms). After all, if a witness recalls that the
culprit had crossed eyes and the only individual in the lineup with
crossed eyes is the suspect, it seems reasonable that the witness would not
even consider choosing any of the foils. As a result, the witness would be
looking at a lineup with a functional size (Lindsay, Smith, & Pryke, 1999;
Wells, Leippe, & Ostrom, 1979) of one instead of near six. If witnesses
tend to use a "relative decision" strategy when picking from a
simultaneous lineup (e.g., Lindsay, Pozzulo, Craig, & Lee, 1997; Lindsay
& Wells, 1985; Pozzulo & Lindsay, 1999), the one picture they will be
most likely to pick should be the innocent suspect. Two strategies have been
suggested to correct for this problem (NIJ, 1999). Either the lineup should be
constructed in a manner in which all of its members have the distinctive
feature, e.g., crossed eyes, or the distinctive feature should be hidden from
the witness, say by having all of the members of the lineup wear a patch over
one of their eyes. In this way, none of the members of the lineup will
"stand out" from the remaining members.
We can look at this problem from a
different point of view, however. Consider the example of the car with a pink
lightning bolt. Imagine that the police find a car with a pink lightning bolt
painted on its hood and ask the witness to identify the car. Would we require
that the witness pick from a lineup of six cars in which the distinctive
feature (pink lightning bolt) was hidden from view, say by repainting the
hoods of all of the cars with black paint or by painting pink lightning bolts
on the hoods of the five known "innocent" cars? We know of no
researchers who have suggested that this is the way in which witness testimony
about "objects" should be collected. One reason might be that such
procedures seem unnecessary.
But
why would such procedures seem unnecessary in the case of objects but not in
the case of people? After all, witnesses might be identifying the wrong car
because they are recalling the pink lightning bolt and not the entire car.
Surely we would want the witness's identification of the car to be based on
recognition of the "entire" car. On the other hand, how can we expect
a witness to recognize every aspect of the car? Can we expect the witness to
recall the pattern of the scratch marks on the passenger's front door or the tiny
crack in the plastic cover of the left rear blinker? Isn't it enough that the
witness identifies the car? In part the answer might have something to do with
real and perceived ecological likelihoods and assumptions about how witnesses
make identifications.
With
regard to the former, it might seem very unlikely that the police found the
wrong car with a pink lightning bolt on it (because it seems obvious that very
few such cars exist) and as a result we infer that the odds that the car being
shown to a witness is "innocent" are extremely low. As a result, the
odds that a positive identification is correct are very high. In the case of an
identification of a cross-eyed suspect, however, one might feel that the
likelihood that the police found the wrong cross-eyed suspect (because there
are so many cross-eyed individuals -- at least many more than cars with pink
lightning bolts painted on their hoods) is considerably higher. As a
result, the odds that a witness's identification of a cross-eyed individual is
correct seem much lower.
Of
course, the feeling that special procedures are required to protect the accused
from false identifications might be based not on prior expectations about the
odds that lineups contain innocent suspects but rather on the belief that eyewitnesses
are less likely to reject innocent suspects than "innocent"
objects because face recognition depends more heavily on distinctive features
than "object" recognition. That is, one might assume that witnesses
fail to consider other features of faces besides the distinctive one(s) when
deciding whether a face is the culprit's. Such a process could reflect the way
in which faces are initially encoded (Main, Leland, & Bartlett, 1998) or
the way in which decisions are made during the identification task (e.g., the
presence of a remembered distinctive feature is sufficient evidence to
identify). While there is considerable evidence that distinctive faces (those
rated as more distinctive) are better recognized than those that are less
distinctive (Shapiro & Penrod, 1986; Webster, Leland, & Bartlett, 1997;
Wickham, Morris, & Fritz, 2000), the role that particular distinctive
features, e.g., crossed eyes, play in identification accuracy in lineups has
not been well studied. It is not known, for example, whether the presence of
such features will increase false alarm rates faster than hit rates. In
addition, we do not currently know the effect on the relative rate of hits
compared to false alarms of hiding the feature or of adding foils with similar
distinctive features.
The
problem for the real world decision-maker is estimating the number of people
who match the evidence (e.g., suspect seen driving a similar car, gun found in
suspect's home, etc.) and who look enough like the culprit that a witness would
be willing to say, "That's him." Whether a suspect is similar enough
for a witness to identify him as the culprit depends on several different
mechanisms. The first is the distribution of facial and other characteristics
over the population. How many people look similar enough to a randomly sampled
individual that other people might confuse them? A second mechanism consists of
the process by which the suspect was selected by the police. If the process
leading to a suspect's arrest depends on how much the suspect looks like the
culprit, then the odds, based on random sampling, that an innocent suspect
would look like the culprit will be too small. For example, suppose that a
witness is asked to help a police artist draw a sketch of the culprit. Suppose
further that after this picture is made widely available to the public, someone
calls the police and tells them where they can find a person who this informant
believes looks a lot like the sketch. The police then construct a lineup with
this person as the suspect. Clearly, the odds are a lot higher that the witness
would be willing to say that this suspect is the culprit than if a suspect had
been selected simply because he or she owned a car fitting a description of
that used in the crime. A third set of mechanisms determines the strength of
the witness's memory for the culprit. Although the relevant research has yet to
be done, it is possible that the stronger witness memory is for culprits, the
less likely witnesses will be to choose innocent suspects who look very similar
to the culprit. A fourth set of mechanisms consists of the variables that
control where a witness places his or her decision criterion. How good does the
match between the witness's memory and the suspect (or suspect's picture) have
to be before the witness is willing to say, "That's him"? The fewer
people a witness would be willing to identify as the culprit, the greater the
odds the suspect is the culprit.
We
can restate the issue of the witness's criterion in terms of resemblance. The
odds that the suspect is the guilty culprit should be higher the greater the
resemblance of the suspect to the witness's memory of the culprit. The stricter
the resemblance criteria, the fewer people in the world one would expect to
satisfy that degree of resemblance. If one views confidence as a statement of
the degree of resemblance between the contents of memory and the looks of the
suspect/culprit, then given the above reasoning, it would make sense to assume
that the odds that a witness's identification is correct increase with
increasing confidence. This is similar to the view that Ebbesen and Wixted
(1996) proposed in their signal detection analysis of face memory. However, the
fact that the odds that identifications are correct increase with increasing
confidence is not identical to saying that confidence and accuracy will be
highly correlated.
The
fact that multiple and different mechanisms determine the evidentiary value of
a witness's identification requires that information other than correlation
coefficients and diagnosticity be obtained to evaluate the odds that a
witness's identification might be correct. In particular, decision makers
should want to know whether the odds that the suspect's looks match the
culprit's are higher than would be expected by random sampling. This is quite a
different issue than the degree to which the witness's memory of the culprit
matches the culprit's looks. Whether the correlation between confidence and
accuracy is big enough provides no information about these issues.
What
are the odds that the police have arrested and charged an innocent suspect who
"stands out" enough because of the method of lineup construction and
who looks enough like what a witness remembers that the witness would be
willing to pick this suspect and then indicate that she was very confident in
that choice? When one phrases the issue this way, it seems clear that the data
and analyses from which conclusions about the relationship between confidence
and accuracy have been drawn are simply irrelevant.
It
should be obvious that the legal system either will choose not to pursue cases
in which the witnesses express low confidence in their identifications or
(according to critics) will tend to cause witnesses with low confidence to
raise their estimates before they testify. One need simply imagine a trial in
which an ID witness testifies that the defendant is the person who raped her
but then says that she is just guessing about the identification to realize
that the legal system is going to select cases, at least in part, on the basis
of witness confidence. Thus, in the huge majority of real world cases, jurors
will be faced with witnesses who are confident. However, researchers who study
the effects of various factors (e.g., stress, duration, lineup procedure, race,
instructions) on eyewitness identification continually fail to report their
findings conditional on the confidence of their subjects.
Consider
for example a study of the effects that duration of exposure might have on
eyewitness accuracy. Surely, with some diminishing returns, longer durations of
exposure will produce more accurate identifications than very short durations.
As already discussed, many eyewitness memory researchers would also ask the
subjects how confident they are in their identifications. The results of
duration (or stress, retention interval, instructions, and so on) are then
presented in terms of some measure of average accuracy for each duration of
exposure. The results for the confidence-accuracy relation are then presented.
Thus, researchers tend to include all of the responses, including those for
which the subjects might have said they were just guessing, in the results for
the factor effects. For example, suppose the design compared the accuracy for a
short duration of exposure, say, .5 seconds, to a long one, say, 30 seconds. The
researcher would compare all of the responses for the .5-second exposure with
all of the responses for the 30-second exposure even though the subjects in
the .5-second condition might have said they were just guessing 80% of the time
while those in the 30-second condition might have guessed only 10% of the time.
Thus, the differences in average performance between the short and long
duration conditions would consist of different proportions of
"guessers." This is a problem because the legal system almost surely
doesn't use identifications of witnesses who say they were just guessing. To
generalize the effect of the different durations, researchers would have to
select out those witnesses whose confidence was high enough and then examine
the difference between short and long durations just for them. It might well be
that the size of the duration effect will be a lot smaller when the memories of
only the most confident witnesses are examined.[11]
When
researchers (e.g., Wells, et al., 1998) claim that jurors might be better off
focusing on factors other than confidence, they are basing this suggestion on
their belief that other factors predict witness accuracy "better"
than confidence. However, researchers have never actually examined the relative
predictive accuracy of confidence compared to other factors, especially in
relevant applied settings. Research results are not presented in terms of the
ability of these different measures (e.g., confidence and duration of exposure)
to predict eyewitness accuracy. One reason might be that it is not obvious
how this should be done. Although we can use standard statistical procedures
(e.g., general linear modeling), non-statistical features of the experimental
procedure and design complicate the interpretation of results. For example,
suppose one researcher designed our hypothetical duration of exposure study
with durations of 1 and 2 seconds and another designed it with durations of 1
second and 10 minutes. We would intuitively expect that the effect of study
duration would be much smaller in the former than the latter study. Suppose
both researchers also measured confidence. The first researcher might discover
that individual differences in confidence predicted witness accuracy
"better" than the small difference in duration that he created while
the latter might discover the opposite. Thus, the relative predictive accuracy
of confidence v. other factors will depend heavily on the range of levels of
the factor being varied (or observed).
This
problem is similar to that encountered when single-event, multiple-witness
confidence-accuracy correlations are used to generalize to the multiple-event,
single-witness real world of crimes. In order to generalize "effect
sizes" from the laboratory to the field, one has to be sure that the range
and source of variation in the laboratory of the variables of interest are the
same as the range and source of variation that occurs in the settings to which
one hopes to generalize. In addition, it is not clear how to assess
"better" when one predictor depends on individual differences (e.g.,
confidence) that might be more or less reliably measured and the other depends
on differences between situations (e.g., duration) that are held at fixed
levels with perfect reliability in laboratory studies. But then when the
general "better predictor" principle is applied to the real world,
duration is no longer measured with perfect reliability and depends completely
on estimated values whose relationship to actual values is completely unknown.
The
problem is exacerbated by the possibility that confidence and the
diagnostically better factors that jurors are supposed to use are not
independent of each other. For example, we can be pretty sure that average
confidence in memory would be higher in the 10-minute condition than the
1-second condition described earlier. This means that it is quite possible that
some, if not all, of the effect of a factor on memory, might be mirrored by an
effect on confidence. If so, confidence might actually account for some of the
effect of factors on memory. This is exactly what one might expect from a
signal detection analysis (Ebbesen and Wixted, 1996).
Assuming
that the prior reasoning is correct, it raises the possibility that confidence
may well be an "excellent" predictor of eyewitness accuracy when it is
allowed to capture sources of variance that would be typical in the real world,
e.g., differences in criminals' faces, differences in duration of exposure,
differences in attention paid, and so on.
Looking
at this same issue from the side of the factor rather than confidence casts the
problem in a slightly different light. Researchers almost never present their
results in a way that shows the effect of duration (or any other factor) on
those witnesses who expressed the highest confidence compared to those who
expressed the least. The reason that this is so important is that it is
possible that the effect might be much bigger for those with the lowest
confidence and substantially reduced or eliminated for those who are very
confident. For example, Kebbell, Wagstaff, and Covey (1996) reported that
witnesses who were absolutely confident about a recollection were almost
invariably accurate. It is conceivable, therefore, that other factors might
account for very little variation in the accuracy of those responses in which
witnesses are absolutely confident. If this were the case, the conclusion that
long durations might produce many fewer errors than short ones would primarily
apply to witnesses who expressed lower confidence levels and since the legal
system will probably be more likely to eliminate witnesses as their confidence
decreases, the effect that other factors, such as duration, have on accuracy in
laboratory studies would tell us little about the more confident witnesses who
play a role in the real world.
Why
might the effects of situational variables disappear for highly confident
witnesses? If confidence does reflect the strength of witnesses' memories for
events that they have experienced, then as contextual events increase witness
accuracy, the same events might also increase witness confidence. Thus, the
effect of the factor on accuracy might be mirrored by an equivalent effect on
confidence. If this were to happen, we wouldn't need to know the conditions
that existed when a witness was exposed to the criminal event. Knowing confidence
would tell us the end result, namely the strength of the witness's memory,
regardless of how that memory strength was produced.
In
any case, as noted earlier, the issue is not whether duration of exposure has
an effect on accuracy. The issue facing those who have to make the guilty/not
guilty decisions is about the probability of guilt that is associated with the particular
pattern of evidence presented in the case. A given witness saw the culprit for
a given duration of exposure. The jury's job is to estimate the guilt of the
accused given that the witness confidently identified the defendant after
observing him for a particular duration of exposure. The jury's job is not to
determine the nature of the relationship between duration and accuracy. In this
context, researchers should be reporting their results in terms of the
conditional probability of an accurate identification, given that the witness
is highly confident and the duration of exposure is at a particular value. The
applied issue should be the stability of such conditional probabilities.
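One way such a conditional probability might be tabulated is sketched below in Python. The record layout (duration_s, confidence, correct), the confidence cutoff of 80, and every data value are hypothetical and serve only to show the form of the calculation, not any actual result.

```python
def conditional_accuracy(records, duration_s, min_confidence):
    """Estimate P(correct | confidence >= min_confidence, duration == duration_s).

    Returns None when no record satisfies both conditions, since the
    proportion is undefined in that case.
    """
    selected = [
        r for r in records
        if r["duration_s"] == duration_s and r["confidence"] >= min_confidence
    ]
    if not selected:
        return None
    return sum(1 for r in selected if r["correct"]) / len(selected)

# Made-up identification records, one per simulated witness.
records = [
    {"duration_s": 0.5, "confidence": 90, "correct": True},
    {"duration_s": 0.5, "confidence": 85, "correct": False},
    {"duration_s": 0.5, "confidence": 20, "correct": False},
    {"duration_s": 0.5, "confidence": 10, "correct": False},
    {"duration_s": 30.0, "confidence": 95, "correct": True},
    {"duration_s": 30.0, "confidence": 90, "correct": True},
    {"duration_s": 30.0, "confidence": 85, "correct": True},
    {"duration_s": 30.0, "confidence": 40, "correct": False},
]

for duration in (0.5, 30.0):
    print(duration, conditional_accuracy(records, duration, min_confidence=80))
```

Reporting one such proportion for each combination of confidence level and factor level, along with the number of responses it is based on, would let a reader judge how stable these conditional probabilities are.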
A
number of studies using slightly different methods have attempted to test,
directly, whether mock jurors who are presented with witnesses who made correct
or incorrect identifications can accurately determine which are which. During
an initial phase in these studies, witnesses are shown a simulated crime. All
of the witnesses are then asked to "testify." During their testimony
they are shown a lineup and asked to choose the culprit as well as to supply an
estimate of their confidence in the identification. A second group of subjects,
mock jurors, are then presented with evidence and shown the testimony of one of
the witnesses. Half of the subjects see the testimony of a witness who
accurately chose the culprit in the lineup and half see a witness who chose
incorrectly. The subjects are then asked to indicate whether the culprit whom
the witness identified is guilty and/or to rate the witness's accuracy.
Studies
such as these (e.g., Lindsay, Wells, & O'Connor, 1989) have reported that
mock jurors tend to respond to the witnesses' confidence estimates and not
to whether the witnesses had correctly chosen the culprit. That is, jurors tend to
believe the accurate witness at about the same rate as the inaccurate witness.
Furthermore, because they tended to use confidence, the mock jurors tended to
ignore the other information presented in the simulated trial, information that
other studies have shown significantly affects witness accuracy, e.g.,
stress or retention interval.
Unfortunately,
the logic of this research is seriously flawed and therefore the conclusions
frequently drawn from it are not justified. The logic is as follows: Research
has shown that confidence is not, or only very weakly, related to accuracy.
Research has also shown that factor x (insert your favorite set here, e.g.,
stress, other-race, weapon focus, retention interval, post-event memory
influences, etc.) consistently affects the accuracy of eyewitness
identification (i.e., variations in one or more of these produces significant
mean differences in one or more measures of witness accuracy). When different
jurors see witnesses who express different levels of confidence and hear case
evidence that includes a description of the level of one or more of these
factors present during the "crime," jurors judge the witnesses'
accuracy on the basis of their confidence rather than on the basis of the level
of the factor(s). As a result, they are frequently wrong.
The
logic is faulty for several reasons. First, the logic assumes that confidence
is not diagnostic of accuracy. We have shown above how this conclusion does not
accurately reflect the nature of the different relationships between confidence
and accuracy. Therefore, the initial premise is wrong.
Second,
the researcher is comparing apples and oranges when comparing the relationship
between the level of confidence and accuracy and that between the level of one
or more eyewitness factors and accuracy. The former is almost always based on
individual differences and the latter almost always on situational differences.
We can attempt to compare apples with apples by asking whether the same
situational variables that produce differences in average accuracy also produce
similar differences in average confidence. In those studies that have reported
average confidence, the answer is generally, "yes." For example, we
know that as duration of exposure goes from a few seconds to 60 seconds, the
average accuracy of subjects increases. In addition, the average confidence
that subjects express in their identifications also increases (e.g., Ebbesen
& Wixted, 1996). Thus, there is a natural tendency for average accuracy and
average confidence to be correlated over learning conditions. As a result,
using confidence to estimate the "strength" of the largely hidden
learning conditions in these studies might be a very rational and generally
accurate strategy in the real world.
Third,
if we look at the task from the point of view of the mock jurors, we discover
that their task is particularly difficult and different from that of most real
jurors. The mock jurors are given evidence in the mock trial that we know,
because of the experimental design, is unrelated to witness accuracy. That is,
all witnesses saw the same "criminal" event and the same culprit for
the same amount of time under the same learning conditions. Some witnesses then
picked the wrong person from a lineup and others picked the right person. What
information could the jurors possibly use to detect which witnesses picked the
culprit and which picked someone else? Surely detailed information about the
learning conditions that both correct and incorrect witnesses experienced
cannot provide jurors with information about which witness will be correct.
After all, the evidence is identical in all cases because all witnesses saw the
same crime under the same conditions. The only information that would be
available would be individual differences in how the witnesses behaved during
their testimony. Is this representative of actual cases? A moment's reflection
will indicate that jurors are generally presented with evidence that co-occurs
(over trials) with witness testimony. That is, jurors hear about alibis, about
how the culprit was arrested, about other physical evidence, as well as from
other witnesses who might present corroborating information. In this context,
the jurors also hear the witness identify the culprit with whatever confidence
they express. Finally, jurors also hear witnesses report on the conditions of
observation that they experienced, e.g., where they looked, how far away they
were, what they were feeling at the time, and so on. In the typical experiment,
the "other" case evidence is held constant. Thus, the only
potentially predictive information available to the mock jurors is witness
behavior. The opportunity for jurors to rely on what most researchers believe
is diagnostically better situational information is never made available.
An
adequate test of whether jurors appropriately weight confidence compared to
other information requires that we compare the ability of witness confidence
versus situational factors to predict actual guilt with the subjective
"weight" that jurors give to confidence versus situational factors
when they decide guilt. These comparisons have never been performed because we
do not know how well witnesses' confidence estimates versus other factors
predict actual guilt.
Finally,
some of the research on this topic uses a slightly different methodology (e.g.,
Cutler, Penrod, & Stuve, 1988) that is common to most of the research that
attempts to determine the relative weights that decision-makers (e.g., mock
jurors) give to different sources of information. Unfortunately, results from
this methodology are also extremely difficult to interpret. This research
methodology typically varies the different sources of information (in a factorial
design) and then examines the effect of those variations on decision-making.[12]
In general, relative weights are then inferred from the relative sizes of the
effects of the two factors. Researchers assume that the jurors give greater
relative weight to the factor that produced the bigger effect (or accounted for
more of the variance). This kind of research is so difficult to
interpret because relative effect size is so dependent on the range of
variation in the factors. Assume, for the moment, that jurors weight two
factors equally and that an experiment varies both of them in a factorial
design. If the range of variation in one factor is small and the range in the
other is large (by whatever measures one chooses), then it is likely that the
factor with the greater range of variation will produce the bigger effect. For
example, imagine that mock jurors either see a witness who is very confident or
one who admits that she just guessed. Imagine further that half of the jurors
in each group are told that the witness saw the culprit for 5 seconds and the other
half are told 10 seconds. We might expect the confidence manipulation to have a
bigger effect. But suppose that another experiment is done in which
half of the jurors are told that the witness saw the culprit for 1 second and
the other half are told 10 minutes. Now we might expect the effect of duration
to be much bigger. Thus, the weight inference requires that the subjective
differences between the levels within one factor be equivalent to those for the
other factors to which it is being compared. Unless this equality in size of
the manipulations is demonstrated, the issue of weight cannot be unambiguously
determined.
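The range argument can be made concrete with a small sketch. It assumes a hypothetical juror who combines confidence and duration with exactly equal weights, and it assumes particular subjective scale values (between 0 and 1) for each manipulated level. Both the decision rule and the scale values are illustrative assumptions, not estimates from any study.

```python
def judged_guilt(confidence, duration, w_conf=0.5, w_dur=0.5):
    """Hypothetical juror rule: a weighted sum of two cues on a 0-1 subjective scale."""
    return w_conf * confidence + w_dur * duration


def main_effects(conf_levels, dur_levels):
    """Main effect of each factor in a 2 x 2 design (difference of marginal means)."""
    lo_c, hi_c = conf_levels
    lo_d, hi_d = dur_levels
    cell = {(c, d): judged_guilt(c, d) for c in conf_levels for d in dur_levels}
    conf_effect = ((cell[hi_c, lo_d] + cell[hi_c, hi_d]) / 2
                   - (cell[lo_c, lo_d] + cell[lo_c, hi_d]) / 2)
    dur_effect = ((cell[lo_c, hi_d] + cell[hi_c, hi_d]) / 2
                  - (cell[lo_c, lo_d] + cell[hi_c, lo_d]) / 2)
    return conf_effect, dur_effect


# Assumed subjective values: "just guessed" = 0.1, "very confident" = 0.9.
confidence_levels = (0.1, 0.9)
# Assumed subjective values for 5 s vs. 10 s, and for 1 s vs. 10 min.
narrow_duration = (0.45, 0.55)
wide_duration = (0.05, 0.95)

print("narrow duration range:", main_effects(confidence_levels, narrow_duration))
print("wide duration range:  ", main_effects(confidence_levels, wide_duration))
```

Under these assumptions the weights never change, yet the factor with the larger effect switches from confidence to duration as soon as the range of the duration manipulation is widened.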
Unfortunately,
the studies that have examined the relative weight that mock jurors give to
confidence compared to other factors suffer from one or more of the
interpretational and design problems discussed here.
Several
researchers (e.g., Wells & Bradfield, 1998) have suggested that confidence
and accuracy are independent because confidence can be changed without accuracy
also changing. For example, some researchers have concluded that repeated
questioning, learning that others agree with one's recollections, and learning
that one's recollections are "correct," tend to increase confidence
without also increasing accuracy (Luus & Wells, 1994; Shaw, 1996; Wells
& Bradfield, 1998; Wells et al., 1998). Unfortunately, to claim that the
mechanisms that control accuracy and confidence are independent requires additional
evidence. To understand why requires that we consider the various ways in which
two response production systems (confidence ratings and recognition responses)
might be related. Figure 4 shows a representation of the simplest model of
complete dependence between confidence and accuracy. It suggests that all of
the variables that control accuracy have their effects on the same processes
that control confidence. In this way, whenever a variable changes accuracy, it
will also produce changes in confidence.

Figure 4
It
is not necessary to assume that the same single mechanism controls both
confidence and accuracy to have a system in which the two responses will show dependences,
however. Consider Figure 5. This figure shows a system in which confidence is
controlled by one mechanism and accuracy by another but because all of the
variables that affect one mechanism also simultaneously affect the other,
variations in the two response systems will always co-occur, although the exact
form of the covariation would depend on the nature of the two mechanisms.

Figure 5
It
is also possible to construct models in which the confidence and accuracy
mechanisms are related in time. Figure 6 shows such an example. In this case
all input first affects the accuracy mechanism and then output from that
process affects the confidence process. Clearly this system would cause
accuracy and confidence to co-vary as inputs of different types were varied.

Figure 6
To
break the co-variation between two response systems requires that not all
variables affect both response systems as depicted in Figures 4 through 6.
However, the fact that one set of variables affects one response system but not
the other does not necessarily mean that the two systems are independent.
Consider the model in Figure 7. In this model, variation in input A will cause
accuracy and confidence to co-vary in the same manner as in the model depicted
in Figure 6 because the output of the accuracy process serves as an input to
the confidence process. Of course, variation in input C will only affect the
level of confidence and will have no effect on accuracy. Thus, it is possible
for accuracy to be related to confidence in that variables that affect accuracy
(input A) also affect confidence, but for the level of confidence to be
controlled by other variables (input C) as well.

Figure 7
After
all, confidence is simply a verbal self-rating. As a self-rating it should be
subject to all of the same instructional and motivational factors that affect
all self-ratings (Eiser, 1990). For example, because confidence is a
self-report, instructions in how to use the confidence scale will almost surely
affect witnesses' average measured confidence (e.g., mean confidence ratings)
without affecting the accuracy of their memories. Imagine that we told some
subjects that they were to use the "absolute confidence"
self-description whenever they felt that there was a 65% chance that their
identification response was correct. Imagine that we told others to use the
same self-description only when they felt that there was a 99% chance that
their identification response was correct. It seems likely that the former
would have a much higher average (over several identification attempts)
confidence score than the latter, all other things equal. It also seems likely
that these instructions would have no effect on the accuracy of memories.
The
fact that confidence ratings can be manipulated independently of accuracy does
not mean that accuracy and confidence will not tend to be related nor that
confidence is not likely to be a very good predictor of the accuracy of a
witness's different memories. For example, while it is true that instructions
might raise the average confidence for all witnesses who hear them, it can
still be true that for each witness, the things that they are most confident in
will be the things that they are most likely to identify correctly. Figure 8
shows how this might work.

Figure 8
Instructions
might increase the confidence that a witness expresses for each memory about
which they are asked. However, the items in which they are most confident might
still be the items for which they have the strongest memories. Thus, higher
confidence would still be associated with higher accuracy.[13]
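A minimal sketch of this idea follows. It assumes that reported confidence is the sum of a memory-driven component and a constant shift produced by instructions, and that items with stronger memories are more likely to be identified correctly. The strength values and the size of the shift are hypothetical.

```python
# Hypothetical memory strengths (0 to 1) for five items recalled by one witness.
memory_strength = [0.15, 0.35, 0.50, 0.70, 0.85]


def reported_confidence(strength, instruction_shift):
    """Confidence (0-100) = memory-driven component + instruction-driven shift."""
    return max(0.0, min(100.0, 100.0 * strength + instruction_shift))


def rank_order(values):
    """Indices of the items sorted from lowest to highest value."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Assumed effect of instructions: a strict criterion lowers every rating,
# a lenient criterion raises every rating.
strict = [reported_confidence(s, -10.0) for s in memory_strength]
lenient = [reported_confidence(s, +10.0) for s in memory_strength]

print("mean confidence (strict): ", sum(strict) / len(strict))
print("mean confidence (lenient):", sum(lenient) / len(lenient))
print("same ordering of items:   ", rank_order(strict) == rank_order(lenient))
```

In this sketch the instructions change average confidence, but the items in which the witness is most confident remain the items with the strongest memories.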
True
independence of accuracy and confidence would occur only if the variables that
control accuracy and those that control confidence affected different
mechanisms and the two mechanisms were unconnected. Figure 9 shows an
example of this model. In this case, variation of input A (while holding input
C constant) would produce an effect on accuracy and no effect on confidence and
similarly variation in input C (while holding input A constant) would produce
an effect on confidence but no effect on accuracy. There is no evidence that
confidence and accuracy are related in this manner. Although the relevant
research has yet to be done, most studies show that variables that produce
changes in accuracy (e.g., duration of exposure, retention interval, attention,
and so on) tend to produce changes in average confidence, as well (e.g.,
Ebbesen & Wixted, 1996).

Figure 9
Does
the fact that confidence can be changed by some manipulations that do not
affect accuracy mean that decision-makers in the legal system should ignore
witness confidence and focus on other variables, e.g., the retention interval,
the type of lineup procedure used, or whether the person who conducted the
lineup was blind as to the suspect? To answer this question in the affirmative
implies that experts and jurors would be more accurate in their judgments about
guilt (and witness accuracy) were they to rely on other factors and ignore
confidence rather than base their judgments on both types of information (or
even just confidence alone). Unfortunately, currently we have no data that
directly tests this hypothesis. Still, if confidence is malleable and other
factors that might predict accuracy, e.g., retention interval, are not, then it
seems only logical to ignore the "unreliable" or variable predictor
in favor of the more reliable ones.
However,
when attempting to generalize these findings to the real world, at least three
issues are important.
As
suggested above, confidence ratings are likely to be affected by all of the
"judgment" variables that affect other types of judgments, e.g.,
attitudes and perceptions. Two broad categories of variables that will affect
how people use judgment scales to express subjective experiences are
information about how to map the scale onto their feelings (e.g., instructions,
anchors, social comparison, training) and motivational variables (e.g., rewards
and punishments for using different parts of the scale under various
conditions). The studies showing that confidence is malleable have used
specific procedures to show that college student subjects who know that they
are participating in an experiment will alter their confidence estimates. A key
to being able to generalize the results of these procedures is the nature of
the informational and motivational variables that control the behavior of
actual victims and witnesses when they provide evidence to the legal system.
Suppose, for example, that the large majority of actual witnesses and victims
are extremely motivated to avoid making highly confident false alarms. They
might do so because they fear innocent people will be convicted and/or because
they do not want the system to arrest the wrong person leaving the guilty
person to go free. Would the variables in those studies that have demonstrated
confidence malleability be strong enough to have effects on such people? That
is, how do the motivational and information variables that have been studied in
the laboratory compare to the motivational and informational forces that exist
in the real world? Laboratory studies have not examined this issue because
researchers do not know how actual witnesses and victims are motivated.
Nevertheless, it does seem reasonable to assume that actual victims and
witnesses may establish their confidence criteria differently than subjects who
view videotapes in laboratories.
A
second issue concerns the fact that very few police and investigative agencies
use rating scales to measure confidence. In cases with which I am familiar,
some officers will ask, after a witness has selected someone from a
photographic lineup, "How do you know this person?" If the witness
says, "That's the person who robbed me," the officer concludes that
the witness is very confident. But if the witness says, "He looks like the
person who robbed me," the officer will state that the witness was less
than completely confident. In other words, some officers consider verbal
responses that "identify" the suspect as the culprit as indicating
high confidence. In other cases, witnesses will point to one or more features
and say something like, "That's him. I'm sure that's him. I will never
forget those eyes." Although it is possible that the same variables that
affect how subjects use confidence ratings will affect this kind of verbal behavior,
relevant research has not yet been done.
The
third and possibly most important issue concerns the possibility that the
initial strength of the witness's memory for the event/culprit will affect the
malleability of the witness's confidence. For example, in most laboratory
studies of eyewitness memory, researchers arrange the learning conditions such
that accuracy scores will be far from perfect. Clearly, if subjects are 100%
correct because the learning conditions are so good, then it will be impossible
to observe the effects of other variables on accuracy. As a consequence, in the
huge majority of laboratory studies, memories for events and people will not be
strong (e.g., the proportion of subjects who correctly recognize the culprit
will be much less than 100%, the average proportion of faces that are correctly
recognized will generally be somewhere between chance and 85%, and so on). It
is likely that the average confidence of subjects in these studies will also be
somewhat less than absolute. If this description accurately portrays the
current research, two problems arise. First, as we know from research in other
judgment domains (Eiser, 1990), it may well be harder to influence extreme
ratings than middle-valued ones. Second, as noted earlier, the legal system is
likely to ignore witnesses whose confidence is weak. Although we have no
research that assesses the average cutoff point used by the legal system, it
might well be higher than the average confidence created by researchers. If so,
it could be that most witnesses for cases that are pursued by police and
prosecutors are already initially so confident that the fear that other events
might have caused their confidence to increase is unnecessary.
Still,
it can be argued that unless one knows the precise conditions under which
confidence estimates were obtained, it would be very difficult to derive an
exact estimate of the odds that identifications are correct for a given
confidence level of a given witness to a given crime. This means that
prosecutors and police should spend considerable effort documenting the
procedures by which they obtain information from witnesses. Without precise
information about the way the confidence estimates are obtained, there is
always the possibility that the confidence estimates were influenced by factors
other than strength of memory. But the same problem exists for all
"factors" that might affect accuracy. Unless precise information is
obtained about stress level, unconscious transfer, exact instructions and
procedures used in the lineup (even if it is sequential), and so on, the
possibility exists that the other factors might be affecting witness
performance. That is, our field is simply not capable of providing
"point" predictions for measures such as eyewitness accuracy. Too many
variables act together to affect the outcome.
When
one thinks about the details of the kinds of evidence that might face
decision-makers who are attempting to construct an estimate of the odds that a
particular defendant is or is not guilty, it becomes very difficult to imagine
exactly how research on confidence and accuracy should properly affect such
estimates. For example, let us assume that a decision maker is confronted with
the following evidence concerning an armed robbery of a liquor store:
What
are the odds that this suspect is the culprit and how might we figure it out?
From the perspective of the accuracy of eyewitness identification, how much should
the eyewitness's identification affect our odds estimate? What effect, if any,
should the 95% confidence have on our estimate? If confidence and accuracy are
as poorly related as suggested by Wells and others, then this confidence
estimate should carry near zero weight. Should the fact that the suspect was
the only person with crossed eyes mean that we should completely ignore the lineup
and base our estimate only on the remaining evidence? What if the local police
had selected the suspect not because another police officer recognized the
culprit's description and MO as being similar to the suspect but because after
searching DMV records and doing some legwork they interviewed the girlfriend and
discovered that her boyfriend happens to be cross-eyed and that she sometimes
lends him her car? What if the victim expresses less confidence, say only 50%
certainty? More importantly, should the role that this estimate plays depend on
an analysis of the likelihood that the victim could have a clear memory of what
the culprit looked like? For example, suppose the victim said that he only
looked at the culprit's face once for a very brief moment. Should the fact that
he was only 50% certain carry less weight? Suppose the lineup occurs a year
after the crime and the witness is only 50% certain but all of the other
evidence remains the same? Might it make sense to assume that his 50% certainty
reflects the fact that his memory for the culprit has faded with time or should we
ignore confidence estimates because they are so poorly related to accuracy?
Another way of thinking about these issues is in terms of whether the role that
witness confidence should play in odds estimates should depend on the nature of
the other evidence that exists in the case.
It should be clear that the specialized knowledge about eyewitness memory and guilt that psychologists have provides no advantage when attempting to answer the real-world questions that jurors and others within the legal system must answer. We simply do not know how to apply the research and conclusions from that research to the kinds of questions that must be answered in the real world.
Several
researchers (e.g., Clark, 1997; Read, Vokey, & Hammersley, 1990) have
reported negative correlations between confidence and accuracy. Careful
examination of the procedures used in these studies suggests that these results
arise when the experimenters define as incorrect items that are highly similar
to those that subjects studied and are trying to detect. For example, suppose
someone saw a man commit a robbery and was then presented with a lineup
containing the man's identical twin brother instead of the robber. We might
expect witnesses to choose the brother with very high confidence because the
brother looks so similar to the robber. Obviously this would be an incorrect
choice in which a witness would be very confident. We might expect the
likelihood that witnesses would choose the brother, incorrectly, and with high
confidence to increase the better their memory of the robber (at least up to
the point that they were able to distinguish the two reliably). Thus, the more
closely an innocent suspect resembles the guilty culprit, the more likely
witnesses might be to identify that innocent suspect with high confidence.
Furthermore, the better the witnesses' memories for the original culprit, the
more likely they will be to choose the look-a-like with high confidence (again,
up to a point).
The
above reasoning is completely consistent with a signal detection analysis. In fact,
the signal detection analysis formalizes the odds that people will find the not
seen faces to appear similar enough to seen faces that subjects will be willing
to pick them as having been seen before. In signal detection language, if the
not seen test items produce higher subjective feelings of having been seen
before than the items that were actually seen, then the correlation between
confidence and accuracy will be negative.
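The following simulation is a rough sketch of that claim, not a reproduction of any published analysis. It assumes an equal-variance Gaussian signal detection model in which each test item evokes a familiarity value, the subject responds "seen" when familiarity exceeds a fixed criterion, and confidence is the distance of familiarity from that criterion. The means, the criterion, and the function names are assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def confidence_accuracy_r(mu_seen, mu_unseen, criterion=0.5, n=20000, seed=0):
    """Simulated confidence-accuracy correlation under the assumed model."""
    rng = random.Random(seed)
    confidence, correct = [], []
    for _ in range(n):
        seen = rng.random() < 0.5  # half of the test items were actually seen
        familiarity = rng.gauss(mu_seen if seen else mu_unseen, 1.0)
        says_seen = familiarity > criterion
        confidence.append(abs(familiarity - criterion))
        correct.append(1.0 if says_seen == seen else 0.0)
    return pearson(confidence, correct)


# Typical case: unseen items feel less familiar than seen items.
r_typical = confidence_accuracy_r(mu_seen=1.0, mu_unseen=0.0)
# Look-alike case: unseen items feel MORE familiar than seen items.
r_lookalike = confidence_accuracy_r(mu_seen=1.0, mu_unseen=2.0)
print("typical lures:   ", round(r_typical, 2))
print("look-alike lures:", round(r_lookalike, 2))
```

Under these assumptions the sign of the correlation depends only on whether the unseen items evoke more or less familiarity than the seen ones, which is the pattern the twin example describes.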
This
brings us back to an issue discussed earlier, namely, what are the odds that
innocent suspects will look enough like the guilty individual that witnesses
will be willing to identify him with high confidence? Clearly, none of the
research conducted thus far can tell us what these probabilities are. On the
other hand, it does suggest that the suspects who are most likely to be
wrongly identified are those who look a lot like the actual culprit.
Unfortunately, whether a given suspect looks a lot like the actual culprit or
barely resembles him is not something that can be measured in real world cases.
The
above review implies that eyewitness memory researchers are far from being able
to quantify the odds that witnesses are correct given various initial
conditions, including particular expressions of confidence by witnesses.
Regardless, the effect of telling or training jurors (and prosecutors, and even
the police) about the complexity of the relationships between confidence and
accuracy on the accuracy of jury decisions is an empirical question that has
only been explored in a few studies. Unfortunately, the external validity of
these studies is potentially flawed because they are based on the same
rationale as those that claim to show that mock jurors depend too heavily on
confidence and too little on other factors. Nevertheless, we can still hope
those who wish to argue that providing jurors with information about the
relationship between confidence and accuracy will improve the accuracy of jury
decisions will conduct additional research that examines this issue using
designs and procedures that allow for generalization. After all, external
validity is an empirical issue.
In
sum, it is a mistake to believe that the results of research that is currently
being done on eyewitness memory and confidence can help jurors or experts
improve their ability to tell accurate from inaccurate witnesses. Fortunately,
this is not a major problem because for the huge majority of cases, decisions
about guilt are made on the basis of the totality of the evidence against the
defendant and not on the size of the correlation between confidence and
accuracy. Unfortunately, in their zeal to "make their mark" (not to
mention money) too many experts have attempted to direct jurors away from
important evidence by testifying that there is little or no relationship
between confidence and accuracy and/or that confidence should not be used as a cue
to judge whether a witness's identification is accurate.[14]
Behrman, B. W., & Davey, S. L. (1999). Eyewitness memory for
actual crimes: An archival analysis. Denver: American Psychological Society,
11th meeting.
Bothwell, R. K., Deffenbacher, K. A., &
Brigham, J. C. (1987). Correlation of eyewitness accuracy and confidence:
Optimality hypothesis revisited. Journal of Applied Psychology, 72(4),
691-695.
Brigham, J. C., Maass, A., Snyder, L. D.,
& Spaulding, K. (1982). Accuracy of eyewitness identification in a field
setting. Journal of Personality & Social Psychology, 42(4), 673-681.
Clark, S. E. (1997). A familiarity-based account of confidence-accuracy
inversions in recognition memory. Journal of Experimental Psychology:
Learning, Memory, & Cognition, 23(1), 232-238.
Cutler, B. L., & Penrod, S. D. (1989). Forensically relevant
moderators of the relation between eyewitness identification accuracy and
confidence. Journal of Applied Psychology, 74(4), 650-652.
Cutler, B. L., Penrod, S. D., & Stuve, T. E. (1988). Juror decision
making in eyewitness identification cases. Law & Human Behavior, 12(1),
41-55.
Deffenbacher, K. A. (1980). Eyewitness and confidence: Can we infer anything about
their relationship? Law and Human Behavior, 4, 243-260.
Ebbesen, E. B., & Wixted, J. (1996). A
signal detection analysis of the relationship between confidence and accuracy
in face recognition memory. Unpublished paper, University of California, San
Diego.
Eiser, J. R. (1990). Social judgment.
Pacific Grove, CA: Brooks/Cole.
Fleet, M. L., Brigham, J. C., & Bothwell,
R. K. (1987). The confidence-accuracy relationship: The effects of confidence
assessment and choosing. Journal of Applied Social Psychology, 17(2),
171-187.
Gruneberg, M. M., & Sykes, R. N. (1993). The generalisability of
confidence--accuracy studies in eyewitnessing. Memory, 1(3).
Hastie, R., Penrod, S., & Pennington, N. (1983). Inside the jury.
Cambridge, MA: Harvard University Press.
Hosch, H. M., & Platz, S. J. (1984). Self-monitoring
and eyewitness accuracy. Personality and Social Psychology Bulletin, 10(2),
289-292.
Kassin, S. M., Ellsworth, P. C., & Smith,
V. L. (1989). The "general acceptance" of psychological research on
eyewitness testimony: A survey of the experts. American Psychologist, 44(8),
1089-1098.
Kebbell, M. R., Wagstaff, G. F., & Covey, J. A. (1996). The
influence of item difficulty on the relationship between eyewitness confidence
and accuracy. British Journal of Psychology, 87(4), 653-662.
Konecni, V. J., & Ebbesen, E. B. (1982).
An analysis of the sentencing system. In V. J. Konecni & E. B. Ebbesen
(Eds.), The criminal justice system: A social-psychological analysis.
San Francisco: W. H. Freeman.
Krafka, C., & Penrod, S. (1985).
Reinstatement of context in a field experiment on eyewitness identification. Journal
of Personality & Social Psychology, 49(1), 58-69.
Leippe, M. R. (1980). Effects of integrative
and memorial and cognitive processes on the correspondence of eyewitness
accuracy and confidence. Law and Human Behavior, 4, 261-274.
Leippe, M. R. (2000). Eyewitness expert
report concerning State of New Jersey v. Wifredo Gonzalez. Report
submitted to the court as a defense expert, Camden County, NJ.
Libuser, M., & Ebbesen, E. B. (1999).
Confidence and accuracy: How are they related? Unpublished technical paper.
University of California, San Diego.
Lindsay, R. C. L., Pozzulo, J. D., Craig, W., & Lee, K. (1997).
Simultaneous lineups, sequential lineups, and showups: Eyewitness
identification decisions of adults and children. Law & Human Behavior,
21(4), 391-404.
Lindsay, D. S., Read, J. D., & Sharma, K.
(1998). Accuracy and confidence in person identification: The relationship is
strong when witnessing conditions vary widely. Psychological Science, 9(3),
215-218.
Lindsay, R. C. L., Smith, S. M., & Pryke, S. (1999). Measures of
lineup fairness: Do they postdict identification accuracy? Applied Cognitive
Psychology, 13(Spec Issue), S93-S107.
Lindsay, R. C., & Wells, G. L. (1985). Improving eyewitness identifications
from lineups: Simultaneous versus sequential lineup presentation. Journal of
Applied Psychology, 70(3), 556-564.
Lindsay, R. C., Wells, G. L., & O'Connor, F. J. (1989). Mock-juror
belief of accurate and inaccurate eyewitnesses: A replication and extension. Law
and Human Behavior, 13(3), 333-339.
Luus, C. E., & Wells, G. L. (1991). Eyewitness identification and
the selection of distracters for lineups. Law & Human Behavior, 15(1),
43-57.
Luus, C. A. E., & Wells, G. L. (1994). The malleability of eyewitness
confidence: Co-witness and perseverance effects. Journal of Applied
Psychology, 79(5), 714-723.
Macmillan, N. A., & Creelman, C. D. (1991). Detection Theory: A
user's guide. New York: Cambridge University Press.
Main, K. M., Leland, L. S., Jr., & Bartlett, G. C. (1998). The
properties of one: Facial memory and the isolation effect. Journal of
General Psychology, 125(2), 192-206.
Navon, D. (1990). Ecological parameters in
nonlineup evidence: A reply to Wells and Luus. Journal of Applied
Psychology, 75(5), 517-520.
Navon, D. (1991). "Ecological parameters
in nonlineup evidence: A reply to Wells and Luus": Correction. Journal
of Applied Psychology, 76(3), 407.
Navon, D. (1992). Selection of lineup foils
by similarity to the suspect is likely to misfire. Law & Human Behavior,
16(5), 575-593.
National Institute of Justice (1999). Eyewitness evidence: A guide for
law enforcement. Report of U. S. Department of Justice, National Criminal
Justice Reference Service, 1-44.
Pennington, N., & Hastie, R. (1990). Practical implications of
psychological research on juror and jury decision making. Special Issue:
Illustrating the value of basic research. Personality and Social Psychology
Bulletin, 16(1), 90-105.
Penrod, S., & Cutler, B. (1995). Witness confidence and witness
accuracy: Assessing their forensic relation. Special Issue: Witness memory and
law. Psychology, Public Policy, & Law, 1(4), 817-845.
Pigott, M., & Brigham, J. C. (1985).
Relationship between accuracy of prior description and facial recognition. Journal
of Applied Psychology, 70(3), 547-555.
Pozzulo, J. D., & Lindsay, R. C. L. (1999). Elimination lineups: An
improved identification procedure for child eyewitnesses. Journal of Applied
Psychology, 84(2).
Read, J. D., Vokey, J. R., & Hammersley, R. (1990). Changing photos
of faces: Effects of exposure duration and photo similarity on recognition and
the accuracy-confidence relationship. Journal of Experimental Psychology:
Learning, Memory, & Cognition, 16(5), 870-882.
Read, J. D., Lindsay, D. S., & Nicholls, T. (1998). The relation
between confidence and accuracy in eyewitness identification studies: Is the
conclusion changing? In C. P. Thompson, D. J. Herrmann, et al. (Eds.),
Eyewitness memory: Theoretical and applied perspectives
(pp. 107-130). Mahwah, NJ, USA.
Robinson, M. D., & Johnson, J. T. (1996). Recall memory,
recognition memory, and the eyewitness confidence-accuracy correlation. Journal
of Applied Psychology, 81(5), 587-594.
Robinson, M. D., & Johnson, J. T. (1998). How not to enhance the
confidence-accuracy relation: The detrimental effects of attention to the
identification process. Law & Human Behavior, 22(4), 409-428.
Shapiro, P. N., & Penrod, S. (1986). Meta-analysis of facial
identification studies. Psychological Bulletin, 100(2), 139-156.
Shaw, J. S., III. (1996). Increases in eyewitness confidence resulting
from postevent questioning. Journal of Experimental Psychology: Applied, 2(2),
126-146.
Smith, V. L., Kassin, S. M., & Ellsworth, P. C. (1989). Eyewitness
accuracy and confidence: Within- versus between-subjects correlations. Journal
of Applied Psychology, 74(2), 356-359.
Sporer, S. L. (1992). Post-dicting eyewitness accuracy: Confidence,
decision-times and person descriptions of choosers and non-choosers. European
Journal of Social Psychology, 22(2), 157-180.
Sporer, S. L., Penrod, S., Read, D., & Cutler, B. (1995). Choosing,
confidence, and accuracy: A meta-analysis of the confidence-accuracy relation
in eyewitness identification studies. Psychological Bulletin, 118(3),
315-327.
Tollerstrup, P. A., Turtle, J. W., & Yuille, J. C.
(1994). Actual victims and witnesses to robbery and fraud: An archival
analysis. In D. F. Ross, J. D. Read, & M. P. Toglia (Eds.), Adult eyewitness testimony:
Current trends and developments (pp. 144-160). New York, NY: Cambridge
University Press.
Webster, R. A., Leland, L. S., Jr., & Bartlett, G.
C. (1997). The properties of one: Single distinctive stimuli and their effects.
Journal of General Psychology, 124(4), 391-409.
Wells, G. L., & Bradfield, A. L. (1998).
"Good, you identified the suspect": Feedback to eyewitnesses distorts
their reports of the witnessing experience. Journal of Applied Psychology,
83(3), 360-376.
Wells, G. L., Leippe, M. R., & Ostrom, T. M.
(1979). Guidelines for empirically assessing the fairness of a lineup. Law
& Human Behavior, 3(4), 285-293.
Wells, G. L., & Lindsay, R. C. (1980). On
estimating the diagnosticity of eyewitness nonidentifications. Psychological
Bulletin, 88(3), 776-784.
Wells, G. L., & Luus, C.
E. (1990). The diagnosticity of a lineup should not be confused with the
diagnostic value of nonlineup evidence. Journal of Applied Psychology, 75(5),
511-516.
Wells, G., & Murray, D.
(1984). Eyewitness confidence. In G. Wells & E. F. Loftus (Eds.), Eyewitness
testimony: Psychological perspectives. Cambridge: Cambridge University
Press.
Wells, G. L., Small, M., Penrod, S., Malpass, R. S.,
Fulero, S. M., & Brimacombe, C. A. E. (1998). Eyewitness identification
procedures: Recommendations for lineups and photospreads. Law & Human
Behavior, 22(6).
Wickham, L. H. V., Morris, P. E., & Fritz, C. O.
(2000). Facial distinctiveness: Its measurement, distribution and influence on
immediate and delayed recognition. British Journal of Psychology, 91(1),
99-123.
[1] A third series of studies has examined the factors that mock jurors take into account when judging either the accuracy of witnesses' testimony or the guilt of defendants. Experts (e.g., Penrod & Cutler, 1995) have concluded that these studies show that mock jurors "overweight" confidence and "underweight" other factors when making these decisions.
[2] Deffenbacher presented no evidence to support this empirical claim about the features of actual crimes. Thus, to this day, the field does not know what the distributions of such variables as retention interval, duration of exposure, and stress level are for actual crimes.
[3] Partly as a consequence of this research, experts who tend to testify in court for the defense frequently tell juries that eyewitness confidence is not to be trusted (e.g., testimony by R. Bjork, S.C. Fraser, E. Loftus, and K. Pezdek).
[4] It is worth noting that Deffenbacher presented no evidence for this implication of his conclusion, nor does it seem likely that this implication will prove correct when real-world data become available.
[5] The primary reason that there is no empirical support for this conclusion is that the relevant research simply has not been performed. No researcher has attempted to measure the accuracy rate of victim and witness identifications in actual crime situations. Two studies (Berhman & Davey, 1999; Tollerstrup, Turtle, & Yuille, 1994) have reported on the rate of positive identifications, but neither attempted to assess the accuracy of those identifications. Clearly, research from laboratory simulations cannot provide information about the accuracy of actual witnesses, because the conditions in laboratories may be very different from those in actual crimes. We know, for example, that we can arrange conditions in the laboratory so that accuracy will be 100% or 0%, although most of the time, for sound research design reasons, it is set somewhere in between. Much additional research is needed before the Deffenbacher conclusion about actual crime situations can be accepted as correct.
[6] Recently, the NIJ (1999) has made proposals, based on interpretations of research findings, about two methods of choosing foils. In one, foils are matched to the descriptions of the culprit given by witnesses (e.g., a white male with short black hair); in the other, foils are matched in terms of the similarity of each face to that of the suspect.
[7] This analysis points out that a major factor determining the rate at which innocent people are charged with crimes that they did not commit is the rate at which innocent suspects are arrested and put in lineups by the police. If the huge majority of suspects are guilty, then even witnesses who choose randomly will be choosing guilty suspects. Unfortunately, we have no idea what proportion of lineups in the real world is target-present and what proportion is blank. On the other hand, we do know that in experimental studies of eyewitness memory, the proportion is almost always set at 50%. It follows that if the proportion of blank lineups in the real world is much lower than 50%, then experimental studies cannot be used to provide information about the rate at which actually innocent people are falsely identified.
[8] The size of these correlations will tend to grow smaller as the percentage of middle-valued confidence estimates grows larger. This is because the best-fitting linear function will tend to be midway between middle-valued confidence estimates. As a result, the more middle-valued data points subjects produce, the greater the proportion of data points that will be farthest away from the best-fitting function.
[9] At the heart of the claim that other stimulus factors are better predictors of witness accuracy than confidence is the idea that variation in stimulus conditions produces variation in witness accuracy. This is the mantra of experimental social psychology. If it is true, then because most experimenters hold constant almost everything they can except the independent variable(s), different subject-witnesses will experience the "crime" in almost identical ways; e.g., they will see the same videotape. Thus, researchers should expect that, with the exception of individual differences (in ability to remember faces or in how they pay attention, for example), the accuracy of the subjects should be about the same. Stated differently, the only source of variability in accuracy after holding the crime, criminal, and observation conditions constant is individual differences.
[10] One of the reasons that we have so little evidence about the accuracy of such estimates is that it is unclear how such estimates should be constructed.
[11] The same argument applies to different memories of a particular witness. If a witness says that he or she is just guessing about particular remembered facts, this information might never come out in court.
[12] The following is the abstract from the Cutler et al. (1988) study: "Examined how 321 undergraduate mock jurors integrated eyewitness evidence to draw inferences about defendant culpability and the likelihood that an identification was correct. Subjects viewed a videotaped trial within which 10 witness and identification factors were manipulated between trials. Subjects demonstrated superior memory for the evidence, and the manipulated variables had their intended impact on appropriate rating scales. Only one variable, witness confidence, had reliable effects on Subjects' perceptions of culpability and on the perceived likelihood that the identification was correct. Eight variables (e.g., retention interval, stressfulness) shown to affect identification accuracy in the literature had trivial effects on inferences. It is concluded that lay-people are insensitive to the factors that influence eyewitness memory."
[13] The same logic can be applied to differences between people. That is, a manipulation might increase the average confidence that all witnesses express, but the rank order among the witnesses might remain the same. The least confident remain least confident relative to the other witnesses.
[14] As an example, in a recent case, State of New Jersey v. Wifredo Gonzalez, Leippe (2000), hired by the defense, wrote to the court, "Yet, as noted above, confidence-in-memory is, at best, only weakly correlated with memory accuracy at the time of identification, and commonly uncorrelated with accuracy once eyewitnesses reach the courtroom, as they become publicly committed to their identification, are aware that the police agree that they 'have the right guy,' are aware of additional evidence that the police and prosecution believe corroborate their testimony, and have probably been coached to be confident." Of course, there are no published studies on what happens to witnesses' confidence estimates when they appear in court in actual cases, nor on the odds that witnesses in such cases are accurate.