
Protecting victim and witness statement: examining the effectiveness of a chatbot that uses artificial intelligence and a cognitive interview

  • Open Forum
  • Published in AI & SOCIETY

Abstract

Information of high evidentiary quality plays a crucial role in forensic investigations. Research shows that information provided by witnesses and victims often provides major leads to an inquiry. As such, statements should be obtained in the shortest possible time following an incident; however, in many incidents this is not achieved due to demands on resources. This intersectional study examined the effectiveness of a chatbot (the AI CI) that uses artificial intelligence (AI) and a cognitive interview (CI) to help record statements following an incident. After participants viewed a sexual harassment video, the present study tested their recall accuracy using the AI CI compared to other tools (i.e., Free Recall, CI Questionnaire, and CI Basic Chatbot). Measuring correct items (including descriptive items) and incorrect items (errors and confabulations), we found that the AI CI elicited more accurate information than the other tools. The implications for society include that the AI CI provides an alternative means of effectively and efficiently recording high-quality evidential statements from victims and witnesses.



Notes

  1. AI CI for research: The AI CI has tremendous potential for studying the effectiveness of the CI in different contexts, and it is widely accessible. For the purposes of this research, we created a research version of the AI CI, and the data presented in the present study were collected using this research version. It is available to anyone who wants to use it for research purposes.

References


Acknowledgements

Thank you to software engineer Dylan Marriot for programming the AI CI used in this research.

Funding

Financial support for this project was obtained from All Turtles, which made the development of the tool and all research on it possible. The researchers have not been paid for any specific results and preregistered the study. Still, the team recognises this financial support as a potential source of bias, which is part of the motivation for making the tool widely accessible to all researchers, including those who are not affiliated with All Turtles.

Author information


Corresponding author

Correspondence to Rashid Minhas.

Ethics declarations

Ethics statement

The present study was approved by the author’s home university and conducted in accordance with the British Psychological Society code of ethical conduct. A potential conflict of interest was declared throughout the ethics process because this study was funded by a San Francisco-based company called All Turtles on behalf of Spot, and one of the three authors of this paper is the co-creator of Spot. Spot is an AI chatbot that was based in part on the results of the present research but has since been modified for broader purposes. The most recent version of Spot can be accessed for free by individuals via https://app.talktospot.com/. The AI CI used in this study was specifically designed for research purposes; if you would like to conduct research using this version, it is recommended that you contact one of the authors of the present paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

1.1 Description of AI NLP training

1.1.1 Training the AI

To help the AI learn which words are important and in what contexts, we manually create tables of examples and feed them into the AI; it then infers the relationships between words, and the more examples we provide, the more concrete the links become. For example, we manually indicate to the AI that “boss” is a “job role”, so that it learns to ask follow-up questions about “boss”. For some categories of words, training models already exist; for example, we use a standard library of names. However, no standard library exists for words related to workplace harassment and discrimination, so we created three training datasets of words and phrases to train our AI.

Group 1 relates to times and dates. For this, we manually filtered a pre-existing database, removing words that were too general for our context, like “a few” and other broad numerical descriptions that were not appropriate. Group 2 relates to locations. Here, we created a completely custom library based on workplace-related terms, like “office” or “boardroom”. Group 3 relates to people, including roles, job titles, and names. We created a bespoke library of workplace-related descriptions, like “she is my boss” or “colleague”; the names library is a standard database that has been applied unmodified.
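For illustration, the following is a minimal sketch of how such hand-built tables could drive entity tagging and follow-up questions. The category names, word lists, prompt templates, and helper functions are hypothetical illustrations (the prompt wording echoes the example transcript in Appendix 3), not the production system:

```python
# Hypothetical lexicon tables mirroring the three training groups above.
LEXICONS = {
    "time": ["yesterday", "last week", "this morning", "monday"],
    "location": ["office", "boardroom", "meeting room", "kitchen"],
    "person": ["boss", "colleague", "manager", "supervisor"],
}

# CI-style follow-up templates, one per category (illustrative wording).
FOLLOW_UPS = {
    "time": "Please provide specifics about the month, week, day, or time this happened.",
    "location": "You mentioned {word}. Please describe.",
    "person": "You referred to {word}. Please tell me more about them.",
}

def tag_entities(report: str) -> list[tuple[str, str]]:
    """Return (category, word) pairs found in a free-text report."""
    text = report.lower()
    return [
        (category, word)
        for category, words in LEXICONS.items()
        for word in words
        if word in text
    ]

def follow_up_questions(report: str) -> list[str]:
    """Generate one follow-up question per tagged entity."""
    return [FOLLOW_UPS[cat].format(word=word) for cat, word in tag_entities(report)]

if __name__ == "__main__":
    statement = "Yesterday my boss cornered me in the boardroom."
    for question in follow_up_questions(statement):
        print(question)
```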

Our own training dataset of about 1000 sentences was created in four main stages. The first stage consisted of brainstorming what we expected to be asked, resulting in about 100 sentences. In the second stage, we harvested words and phrases from news articles describing accounts of workplace harassment and discrimination; this provided different syntax and word choices and added to our database. In stage three, we used about 200 reports submitted to the team explicitly for research purposes (from talktospot.com) to add to our database. (Note that although many reports have been created using talktospot.com, we do not have access to them unless they are explicitly sent to the research team. This means we cannot assess the quality of the AI in those interactions.)

Currently, in stage four, we are developing industry-specific words and phrases based on the industries that are using our tool. Ultimately, the database will be continuously evolving, and the AI should become increasingly attuned to the relevant words and their contexts to improve the follow-up questions and the user experience.

Appendix 2

2.1 Analyses using the Bayes factor

2.1.1 Introduction

Bayes factors are useful for assessing the strength of evidence for a theory, and they allow researchers to draw conclusions that cannot be inferred from orthodox statistical methods alone. Orthodox statistics model the null hypothesis (H0), generally testing whether there is no difference between means; they reveal whether there is a statistical difference between means, but nothing else. Bayes factors can be used to make a three-way distinction, testing whether the data support the null hypothesis (H0), whether they support the alternative hypothesis (H1), or whether there is no evidence either way. Bayes factors also challenge perceptions of the importance of power in statistics: a high-powered non-significant result is not always evidence supporting H0, but a low-powered non-significant result might be; similarly, a high-powered significant result might not be substantial evidence for H1. Finally, using Bayes factors, one can specify the hypothesis in a way that is not possible with a p value (Dienes and Mclatchie 2018).

To calculate a Bayes factor, one needs a model of H0 (usually that there will be no difference between means), a model of H1 (which needs to be specified, usually from the mean difference in a previous study), and a model of the data. The Bayes factor then provides a continuous measure of the strength of evidence for H1 over H0, rather than a sharp significance boundary. However, as a Bayes factor of 3 often aligns with p < 0.05, a Bayes factor of 3 or more is usually understood as substantial evidence in support of H1; for symmetry, substantial support for H0 is usually understood as a Bayes factor of less than 1/3 (Dienes and Mclatchie 2018).

Therefore, in the present research, as well as examining the main effects with statistics, we evaluated the theories in terms of strength of evidence, using Bayesian hypothesis testing. Bayes factors seemed appropriate as the difference between the conditions was designed to be subtle, and the video stimuli were short, so we were expecting non-significant results in some comparisons. The Bayes Factors also allowed us to make more nuanced inferences about the data that did not depend on power calculations.
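To make the calculation concrete, here is a minimal sketch of a Bayes factor with a point null and a half-normal model of H1, following the approach described above (Dienes and Mclatchie 2018). The function names and the numerical-integration approach are our own illustration, not the authors' code:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def bayes_factor_halfnormal(mean_diff: float, se: float, h1_sd: float) -> float:
    """B_H(0, x): point-null H0 versus a half-normal model of H1.

    mean_diff : observed difference between condition means
    se        : standard error of that difference
    h1_sd     : x, the SD of the half-normal prior modelling H1
    """
    # Likelihood of the observed difference under H0 (true effect = 0).
    likelihood_h0 = stats.norm.pdf(mean_diff, loc=0.0, scale=se)

    # Likelihood under H1: average the normal likelihood over a
    # half-normal prior that allows positive effects only.
    def integrand(effect: float) -> float:
        prior = 2.0 * stats.norm.pdf(effect, loc=0.0, scale=h1_sd)
        return stats.norm.pdf(mean_diff, loc=effect, scale=se) * prior

    likelihood_h1, _ = quad(integrand, 0.0, np.inf)
    return likelihood_h1 / likelihood_h0

def interpret(b: float) -> str:
    """Conventional three-way reading: B >= 3 supports H1, B <= 1/3 supports H0."""
    if b >= 3:
        return "substantial evidence for H1"
    if b <= 1 / 3:
        return "substantial evidence for H0"
    return "insensitive; more data needed"
```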

2.1.2 Methods

For our analyses, Bayes factors (B) were used to determine how strong the evidence for the alternative hypothesis was over the null (Singh n.d.). BH(0, x) indicates that the predictions of H1 were modelled as a half-normal distribution with a standard deviation (SD) of x (Dienes and Mclatchie 2018). We used previous research into “cognitive” versus “standard” interviews to specify our hypothesis: cognitive interviews were found to elicit a median of 34% more information than standard interviews (Köhnken et al. 1999). Therefore, the SD was set to x = 34% of the highest score in the present experiment, calculated separately for each set of comparisons (according to the highest score for that set). For correct responses, we predicted that the number of correct responses would increase with the sophistication of the reporting tool, so we used this SD (34%) to test the prediction. For the first analyses (overall correct responses), the SD was set to 6.08.
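As a worked example, the prior SD and a single comparison could be computed with the sketch above. The highest score, mean difference, and standard error below are hypothetical placeholders, not the study's data:

```python
# SD of the H1 prior: 34% of the highest score in the comparison set.
highest_score = 17.9               # hypothetical value; 0.34 * 17.9 ≈ 6.08
h1_sd = 0.34 * highest_score

# Illustrative comparison with made-up summary statistics.
b = bayes_factor_halfnormal(mean_diff=5.0, se=1.2, h1_sd=h1_sd)
print(f"B_H(0, {h1_sd:.2f}) = {b:.2f}: {interpret(b)}")
```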

2.2 Results

2.2.1 Overall correct responses

The Bayes Factors between AICI and Free Recall and between Questionnaire CI and Free Recall indicated that the evidence substantially supported the alternative hypothesis, BH = 182.55 and BH = 16.65, respectively; those between AICI and Basic Chat CI, between AICI and Questionnaire CI, and between Basic Chat CI and Free Recall were insensitive, BH = 2.54, BH = 0.59, and BH = 1.54, respectively; and that between Questionnaire CI and Basic Chat CI substantially supported the null hypothesis, BH = 0.07.

The Bayes Factors thus indicated that there was substantial evidence that Questionnaire CI and AICI elicited more correct items overall than Free Recall, even though only AICI did so significantly. They also indicated that there was substantial evidence to support the null (that there was no difference in the number of correct items) when it came to comparisons between Questionnaire CI and Basic Chat CI. Finally, more data were needed to explore the other comparisons. Therefore, while the significance testing indicated that there was no difference between AICI and Basic Chat CI, between AICI and Questionnaire CI, and between Basic Chat CI and Free Recall, the Bayes Factors indicated that the data did not support this conclusion.

2.2.2 Dialogue

For the dialogue items, we focused only on comparisons between the AICI and the other conditions, and the SD was set to 2.49 (34% of the highest score) for these comparisons. The Bayes Factors supported the null hypothesis when comparing AICI and Free Recall, BH = 0.24, and AICI and Questionnaire CI, BH = 0.20, but were insensitive when comparing AICI and Basic Chat CI, BH = 0.69 (inspection of Fig. 2 shows a mean score of 6.73 for Basic Chat CI users and 7.33 for AICI users). Thus, participants generally performed similarly in all conditions (compared to AICI), but more data were needed to compare the scores between AICI and Basic Chat CI. Therefore, while statistical analysis suggested that there was no difference between conditions, the Bayes Factors suggested that, when comparing the two chatbots, the data did not support this conclusion.

2.2.3 Action

Again, we focused only on comparisons between the AICI and the other conditions, and the SD was set to 1.15 (34% of the highest score) for these comparisons. The Bayes Factors supported the null hypothesis when comparing AICI and Free Recall, BH = 0.24, and AICI and Questionnaire CI, BH = 0.24. However, it supported the alternative hypothesis when comparing Basic Chat CI and AICI (inspection of Fig. 2 shows a mean score of 2.47 for Basic Chat CI users and 3.2 for AICI users), BH = 3.98. The strength of evidence thus indicated that participants performed similarly for these comparisons, apart from when comparing the chatbots, as the evidence suggested that the Basic Chat CI elicited fewer action items than the AICI. Therefore, again the lack of significance when comparing chatbots cannot be interpreted as support for the null, as the Bayes Factor indicates that there was evidence that the AICI performed substantially better than the Basic Chat CI.

2.2.4 Facts

Again, we focused only on comparisons between the AICI and the other conditions, and the SD was set to 1.54 (34% of the highest score) for these comparisons. Inspection of Fig. 3 revealed that the mean score for users of the AICI was lower than those for the Questionnaire CI and the Basic Chat CI, so rather than testing the hypothesis that AICI users would perform better than these conditions against the null (that there would be no difference between conditions), we tested the strength of evidence for the size of the differences. The Bayes Factors indicated that there was substantial evidence that AICI elicited fewer factual items than Basic Chat CI and Questionnaire CI, BH = 814.10 and BH = 115.47, respectively. For the comparison between Free Recall and AICI, we re-set H1 to the original prediction; the Bayes Factor indicated that the results between Free Recall and AICI were insensitive, BH = 1.34.

Basic Chat CI was therefore significantly better at eliciting factual items than AICI, but the Bayes Factor indicated that Questionnaire CI also elicited substantially more items than AICI. However, to evaluate the performance of AICI against Free Recall, more data were needed (inspection of Fig. 3 shows a mean score of 2.5 for Free Recall users and 3.1 for AICI users). Thus, it was not possible to conclude that there was no difference between these conditions.

2.2.5 Description

Again, we focused only on comparisons between the AICI and the other conditions, and the SD was set to 1.46 (34% of the highest score) for these comparisons.

The Bayes Factors supported the alternative hypothesis when comparing AICI and Free Recall, BH = 50,214.61; AICI and Questionnaire CI, BH = 3059.80; and AICI and Basic Chat CI, BH = 57.14. Therefore, in this case, the Bayes Factor supported the significant results for these comparisons.

2.2.6 Overall incorrect responses

For incorrect responses, we expected the number of mistakes to decrease as the sophistication of the reporting tool increased (the SD was set to x = 0.83).

The Bayes Factors indicated that the comparisons between Questionnaire CI and Basic Chat CI, BH = 11.99, and between Questionnaire CI and AICI, BH = 296.03, supported the alternative hypothesis: the chatbots elicited substantially fewer mistakes than the Questionnaire CI. Moreover, compared to Free Recall, participants using the Questionnaire CI made substantially more mistakes, BH = 51.17. Comparisons between AICI and Basic Chat CI and between Free Recall and Basic Chat CI were insensitive, BH = 1.99 and BH = 1.54, respectively, while that between AICI and Free Recall supported the null hypothesis, BH = 0.29.

Thus, while only participants in the Questionnaire CI condition produced significantly more incorrect responses overall than those in Free Recall, the Bayes Factors indicated substantial evidence that the Questionnaire CI also encouraged more incorrect responses than both chatbots. The Bayes Factors allowed us to conclude that there was no difference in the number of incorrect responses between AICI and Free Recall, indicating that these two tools encouraged accuracy more than the other two. Finally, the Bayes Factors indicated that we could not conclude that there were no differences between the Basic Chat CI and Free Recall, or between the Basic Chat CI and AICI.

2.2.7 Errors

For these analyses, we focused again on comparisons between the AICI and the other conditions, and the SD was set to 0.28 for these comparisons.

The Bayes Factors indicated that the comparisons between Basic Chat CI and AICI and between Questionnaire CI and AICI supported the alternative hypothesis, BH = 11.30 and BH = 8.61, respectively, while that between Free Recall and AICI supported the null hypothesis, BH = 0.26.

Therefore, while significance testing suggested that it made no difference which reporting tool participants used, Bayesian analysis indicated that participants using AICI made fewer errors than those using Questionnaire CI or Basic Chat CI, and that there was no difference in the number of errors made between AICI and Free Recall.

2.2.8 Confabulations

For the final analyses, we also focused on comparisons between the AICI and the other conditions, and the SD was set to 0.57 for these comparisons.

The Bayes Factors indicated that the comparisons between Free Recall and AICI, BH = 4.30, and between Questionnaire CI and AICI, BH = 106.08, supported the alternative hypothesis (performance improved as the sophistication of the tool increased). However, the comparison between Basic Chat CI and AICI, BH = 0.48, was insensitive.

Therefore, while only the Questionnaire CI encouraged participants to confabulate significantly more than Free Recall, Bayesian hypothesis testing indicated that it also encouraged participants to confabulate more than those using AICI. The results also suggested that, rather than there being no difference between Basic Chat CI and AICI (inspection of Fig. 2 shows a mean score of 1.07 for Basic Chat CI users and 0.97 for AICI users), there were not enough data to draw a conclusion either way.

2.2.9 Discussion

Statistical analyses indicated that the AICI elicited more correct responses without compromising accuracy and that this chatbot was particularly good at eliciting descriptive details, though it could improve on fact gathering. However, these analyses failed to reveal nuances in the data that the Bayes Factors did.

We considered Bayesian hypothesis testing appropriate for this type of research, as the differences between conditions were chosen to be subtle, and the stimulus was a short video (1 min 45 s) that could not elicit dramatic differences in the number of recalled items in actual terms, so we anticipated that Bayes Factors might clarify the results. We also wanted to test the minimum number of participants possible; although we made power calculations to arrive at this number, Bayes Factors do not rely on power calculations, so we considered them suitable for clarifying the results. They also confirmed in many instances that the number of participants we had tested was sufficient.

The Bayes Factors allowed us to draw conclusions that were not possible using orthodox statistics alone, and in some instances they supported the statistics, adding weight to the implications. For instance, when it came to recalling correct information, significance testing indicated that the AICI helped people to recall more items overall than Free Recall, that the AICI was better than the other conditions at eliciting description, and that the Basic Chat CI was better than the AICI at fact gathering, and the Bayes Factors supported these results.

However, when it came to non-significance, statistical analysis fell short in three ways. First, the Bayes Factor indicated that while there was no significant difference in the number of action items between AICI and Basic Chat CI, the Bayesian evidence supported the notion that the AICI elicited more items. Second, a non-significant result is often taken as support for H0, but in several cases the Bayes Factors indicated that this conclusion could not be made. For instance, when comparing the number of dialogue items recalled by AICI and Basic Chat CI users, significance testing indicated that there was no difference between conditions, but the Bayes Factors did not support this conclusion; non-significance is therefore not evidence in support of the null. Third, Bayes Factors were helpful when comparisons did show support for the null. For example, the comparison between Questionnaire CI and Basic Chat CI indicated substantial support for the null in the number of correct items recalled overall, but orthodox statistics failed to demonstrate this.

Thus, using Bayes Factors, we made more nuanced conclusions when it came to describing the effects, indicating that Bayesian hypothesis testing is a useful way of interpreting the results in studies of this kind.

Appendix 3

3.1 Example of AI chatbot

[Figure a: example interaction with the AI chatbot]

3.2 Mild harassment scenario link

https://youtu.be/C0LMPW46EQs

3.3 Mild harassment scenario screenshot

[Figure b: screenshot from the mild harassment scenario]

3.4 Example transcript

  • 1. Please tell me everything you can remember about what happened. Try not to leave anything out, even if it seems trivial. I have as much time as you need.

  • 2. Thank you for telling me about that. You mentioned… Please elaborate.

  • 3. Thank you for telling me about that. Please provide specifics about the month, week, day, or time this happened.

  • 4. You mentioned an office. Please describe.

  • 5. You referred to Mike. Please tell me more about him or her.

  • 6. The way this situation made you feel is important for understanding the impact on your wellbeing. Please describe in more detail how you felt as you were experiencing the situation.

  • 7. How did it affect your wellbeing after it happened?

  • 8. Did you tell anyone about the event?

  • 9. We’re almost done. Before we finish, is there any other evidence of what happened? For example, screenshots, emails, meeting notes, text messages, or recordings?


About this article


Cite this article

Minhas, R., Elphick, C. & Shaw, J. Protecting victim and witness statement: examining the effectiveness of a chatbot that uses artificial intelligence and a cognitive interview. AI & Soc 37, 265–281 (2022). https://doi.org/10.1007/s00146-021-01165-5
