
Protecting victim and witness statement: examining the effectiveness of a chatbot that uses artificial intelligence and a cognitive interview

  • Open Forum
  • Published in AI & SOCIETY

Abstract

Information of high evidentiary quality plays a crucial role in forensic investigations. Research shows that information provided by witnesses and victims often provides major leads to an inquiry. As such, statements should be obtained in the shortest possible time following an incident; however, in many incidents this is not achieved due to demands on resources. This intersectional study examined the effectiveness of a chatbot (the AI CI) that uses artificial intelligence (AI) and a cognitive interview (CI) to help record statements following an incident. After participants viewed a sexual harassment video, the present study tested their recall accuracy using the AI CI compared to other tools (i.e., Free Recall, CI Questionnaire, and CI Basic Chatbot). Measuring correct items (including descriptive items) and incorrect items (errors and confabulations), we found that the AI CI elicited more accurate information than the other tools. The implications for society include that the AI CI provides an alternative means of effectively and efficiently recording high-quality evidential statements from victims and witnesses.



Notes

  1. AI CI for research: The AI CI has tremendous potential for studying the effectiveness of the CI in different contexts, and it is widely accessible. For the purposes of this research, we created a research version of the AI CI, and the data presented in the present study were collected using this research version. It is available to anyone who wants to use it for research purposes.

References


Acknowledgements

Thank you to software engineer Dylan Marriot for programming the AI CI used in this research.

Funding

Financial support for this project was obtained from All Turtles, which made the development of the tool and all research on it possible. The researchers have not been paid for any specific results and preregistered the study. Still, the team recognises this financial support as a potential source of bias, which is part of the motivation for making the tool widely accessible to all researchers, including those who are not affiliated with All Turtles.

Author information


Corresponding author

Correspondence to Rashid Minhas.

Ethics declarations

Ethics statement

The present study was approved by the author’s home university and conducted in accordance with the British Psychological Society code of ethical conduct. A potential conflict of interest was declared throughout the ethics process because this study was funded by a San Francisco-based company called All Turtles on behalf of Spot, and one of the three authors of this paper is the co-creator of Spot. Spot is an AI chatbot that was based in part on the results of the present research but has since been modified for broader purposes. The most recent version of Spot can be accessed for free by individuals via https://app.talktospot.com/. The AI CI used in this study was specifically designed for research purposes; if you would like to conduct research using this version, it is recommended that you contact one of the authors of the present paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

1.1 Description of AI NLP training

1.1.1 Training the AI

To help the AI learn which words are important and in what contexts, we manually create tables of examples and feed them into the AI; it then infers the relationships between words, and the more examples we provide, the more concrete the links become. For example, we manually indicate to the AI that “boss” is a “job role”, so that it learns to ask follow-up questions about “boss”. For some categories of words, training models already exist; for example, we use a standard library of names. However, no standard library exists for words related to workplace harassment and discrimination, so we created three training datasets of words and phrases to train our AI.

Group 1 relates to times and dates. For this, we manually filtered a pre-existing database, removing words that were too general for our context, like “a few” and other broad numerical descriptions that were not appropriate. Group 2 relates to locations. Here, we created a completely custom library based on workplace-related terms, like “office” or “boardroom”. Group 3 relates to people, including roles, job titles, and names. We created a bespoke library of workplace-related descriptions, like “she is my boss” or “colleague”; the names library is a standard database that has been applied unmodified.
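For illustration, the following is a minimal sketch of how such hand-built tables could drive entity tagging and follow-up questions. The category names, word lists, prompt templates, and helper functions are hypothetical illustrations (the prompt wording echoes the example transcript in Appendix 3), not the production system:

```python
# Hypothetical lexicon tables mirroring the three training groups above.
LEXICONS = {
    "time": ["yesterday", "last week", "this morning", "monday"],
    "location": ["office", "boardroom", "meeting room", "kitchen"],
    "person": ["boss", "colleague", "manager", "supervisor"],
}

# CI-style follow-up templates, one per category (illustrative wording).
FOLLOW_UPS = {
    "time": "Please provide specifics about the month, week, day, or time this happened.",
    "location": "You mentioned {word}. Please describe.",
    "person": "You referred to {word}. Please tell me more about them.",
}

def tag_entities(report: str) -> list[tuple[str, str]]:
    """Return (category, word) pairs found in a free-text report."""
    text = report.lower()
    return [
        (category, word)
        for category, words in LEXICONS.items()
        for word in words
        if word in text
    ]

def follow_up_questions(report: str) -> list[str]:
    """Generate one follow-up question per tagged entity."""
    return [FOLLOW_UPS[cat].format(word=word) for cat, word in tag_entities(report)]

if __name__ == "__main__":
    statement = "Yesterday my boss cornered me in the boardroom."
    for question in follow_up_questions(statement):
        print(question)
```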

Our own training dataset of about 1000 sentences was created in four main stages. The first stage consisted of brainstorming what we expected to be asked, resulting in about 100 sentences. In the second stage, we harvested words and phrases from news articles describing accounts of workplace harassment and discrimination; this provided different syntax and word choices and added to our database. In stage three, we used about 200 reports submitted to the team explicitly for research purposes (from talktospot.com) to add to our database. (Note that although many reports have been created using talktospot.com, we do not have access to them unless they are explicitly sent to the research team. This means we cannot assess the quality of the AI in those interactions.)

Currently, in stage four, we are developing industry-specific words and phrases based on the industries that are using our tool. Ultimately, the database will be continuously evolving, and the AI should become increasingly attuned to the relevant words and their contexts to improve the follow-up questions and the user experience.

Appendix 2

2.1 Analyses using the Bayes factor

2.1.1 Introduction

Bayes factors are useful for assessing the strength of evidence for a theory, and they allow researchers to draw conclusions that cannot be inferred from orthodox statistical methods alone. Orthodox statistics model the null hypothesis (H0), generally testing whether there is no difference between means; they reveal whether there is a statistical difference between means, but nothing else. Bayes factors can be used to make a three-way distinction, testing whether the data support the null hypothesis (H0), whether they support the alternative hypothesis (H1), or whether there is no evidence either way. Bayes factors also challenge perceptions of the importance of power in statistics: a high-powered non-significant result is not always evidence supporting H0, but a low-powered non-significant result might be; similarly, a high-powered significant result might not be substantial evidence for H1. Finally, using Bayes factors, one can specify the hypothesis in a way that is not possible with a p value (Dienes and Mclatchie 2018).

To calculate a Bayes factor, one needs a model of H0 (usually that there will be no difference between means), a model of H1 (which needs to be specified, usually from the mean difference in a previous study), and a model of the data. The Bayes factor then provides a continuous measure of the strength of evidence for H1 over H0, rather than a sharp significance boundary. However, as a Bayes factor of 3 often aligns with p < 0.05, a Bayes factor of 3 or more is usually understood as substantial evidence in support of H1; for symmetry, substantial support for H0 is usually understood as a Bayes factor of less than 1/3 (Dienes and Mclatchie 2018).

Therefore, in the present research, as well as examining the main effects with statistics, we evaluated the theories in terms of strength of evidence, using Bayesian hypothesis testing. Bayes factors seemed appropriate as the difference between the conditions was designed to be subtle, and the video stimuli were short, so we were expecting non-significant results in some comparisons. The Bayes Factors also allowed us to make more nuanced inferences about the data that did not depend on power calculations.
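To make the calculation concrete, here is a minimal sketch of a Bayes factor with a point null and a half-normal model of H1, following the approach described above (Dienes and Mclatchie 2018). The function names and the numerical-integration approach are our own illustration, not the authors' code:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def bayes_factor_halfnormal(mean_diff: float, se: float, h1_sd: float) -> float:
    """B_H(0, x): point-null H0 versus a half-normal model of H1.

    mean_diff : observed difference between condition means
    se        : standard error of that difference
    h1_sd     : x, the SD of the half-normal prior modelling H1
    """
    # Likelihood of the observed difference under H0 (true effect = 0).
    likelihood_h0 = stats.norm.pdf(mean_diff, loc=0.0, scale=se)

    # Likelihood under H1: average the normal likelihood over a
    # half-normal prior that allows positive effects only.
    def integrand(effect: float) -> float:
        prior = 2.0 * stats.norm.pdf(effect, loc=0.0, scale=h1_sd)
        return stats.norm.pdf(mean_diff, loc=effect, scale=se) * prior

    likelihood_h1, _ = quad(integrand, 0.0, np.inf)
    return likelihood_h1 / likelihood_h0

def interpret(b: float) -> str:
    """Conventional three-way reading: B >= 3 supports H1, B <= 1/3 supports H0."""
    if b >= 3:
        return "substantial evidence for H1"
    if b <= 1 / 3:
        return "substantial evidence for H0"
    return "insensitive; more data needed"
```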

2.1.2 Methods

For our analyses, Bayes factors (B) were used to determine how strong the evidence for the alternative hypothesis was over the null (Singh n.d.). BH(0, x) indicates that the predictions of H1 were modelled as a half-normal distribution with a standard deviation (SD) of x (Dienes and Mclatchie 2018). We used previous research into “cognitive” versus “standard” interviews to specify our hypothesis: cognitive interviews were found to elicit a median of 34% more information than standard interviews (Köhnken et al. 1999). Therefore, the SD was set to x = 34% of the highest score in the present experiment, calculated separately for each set of comparisons (according to the highest score for that set). For correct responses, we predicted that the number of correct responses would increase with the sophistication of the reporting tool, so we used this SD (34%) to test the prediction. For the first analyses (overall correct responses), the SD was set to 6.08.
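As a worked example, the prior SD and a single comparison could be computed with the sketch above. The highest score, mean difference, and standard error below are hypothetical placeholders, not the study's data:

```python
# SD of the H1 prior: 34% of the highest score in the comparison set.
highest_score = 17.9               # hypothetical value; 0.34 * 17.9 ≈ 6.08
h1_sd = 0.34 * highest_score

# Illustrative comparison with made-up summary statistics.
b = bayes_factor_halfnormal(mean_diff=5.0, se=1.2, h1_sd=h1_sd)
print(f"B_H(0, {h1_sd:.2f}) = {b:.2f}: {interpret(b)}")
```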

2.2 Results

2.2.1 Overall correct responses

The Bayes Factors between AICI and Free Recall and between Questionnaire CI and Free Recall indicated that the evidence substantially supported the alternative hypothesis, BH = 182.55 and BH = 16.65, respectively; those between AICI and Basic Chat CI, between AICI and Questionnaire CI, and between Basic Chat CI and Free Recall were insensitive, BH = 2.54, BH = 0.59, and BH = 1.54, respectively; and that between Questionnaire CI and Basic Chat CI substantially supported the null hypothesis, BH = 0.07.

The Bayes Factors thus indicated that there was substantial evidence that Questionnaire CI and AICI elicited more correct items overall than Free Recall, even though only AICI did so significantly. They also indicated that there was substantial evidence to support the null (that there was no difference in the number of correct items) when it came to comparisons between Questionnaire CI and Basic Chat CI. Finally, more data were needed to explore the other comparisons. Therefore, while the significance testing indicated that there was no difference between AICI and Basic Chat CI, between AICI and Questionnaire CI, and between Basic Chat CI and Free Recall, the Bayes Factors indicated that the data did not support this conclusion.

2.2.2 Dialogue

For the dialogue items, we focused only on comparisons between the AICI and the other conditions, and the SD was set to 2.49 (34% of the highest score) for these comparisons. The Bayes Factors supported the null hypothesis when comparing AICI and Free Recall, BH = 0.24, and AICI and Questionnaire CI, BH = 0.20, but were insensitive when comparing AICI and Basic Chat CI, BH = 0.69 (inspection of Fig. 2 shows a mean score of 6.73 for Basic Chat CI users and 7.33 for AICI users). Thus, participants generally performed similarly in all conditions (compared to AICI), but more data were needed to compare the scores between AICI and Basic Chat CI. Therefore, while statistical analysis suggested that there was no difference between conditions, the Bayes Factors suggested that, when comparing the two chatbots, the data did not support this conclusion.

2.2.3 Action

Again, we focused only on comparisons between the AICI and the other conditions, and the SD was set to 1.15 (34% of the highest score) for these comparisons. The Bayes Factors supported the null hypothesis when comparing AICI and Free Recall, BH = 0.24, and AICI and Questionnaire CI, BH = 0.24. However, it supported the alternative hypothesis when comparing Basic Chat CI and AICI (inspection of Fig. 2 shows a mean score of 2.47 for Basic Chat CI users and 3.2 for AICI users), BH = 3.98. The strength of evidence thus indicated that participants performed similarly for these comparisons, apart from when comparing the chatbots, as the evidence suggested that the Basic Chat CI elicited fewer action items than the AICI. Therefore, again the lack of significance when comparing chatbots cannot be interpreted as support for the null, as the Bayes Factor indicates that there was evidence that the AICI performed substantially better than the Basic Chat CI.

2.2.4 Facts

Again, we focused only on comparisons between the AICI and the other conditions, and the SD was set to 1.54 (34% of the highest score) for these comparisons. Inspection of Fig. 3 revealed that the mean score for users of the AICI was lower than those for the Questionnaire CI and the Basic Chat CI, so rather than testing the hypothesis that AICI users would perform better than these conditions against the null (that there would be no difference between conditions), we tested the strength of evidence for the size of the differences. The Bayes Factors indicated that there was substantial evidence that AICI elicited fewer factual items than Basic Chat CI and Questionnaire CI, BH = 814.10 and BH = 115.47, respectively. For the comparison between Free Recall and AICI, we re-set H1 to the original prediction; the Bayes Factor indicated that the results between Free Recall and AICI were insensitive, BH = 1.34.

Basic Chat CI was therefore significantly better at eliciting factual items than AICI, but the Bayes Factor indicated that Questionnaire CI also elicited substantially more items than AICI. However, to evaluate the performance of AICI against Free Recall, more data were needed (inspection of Fig. 3 shows a mean score of 2.5 for Free Recall users and 3.1 for AICI users). Thus, it was not possible to conclude that there was no difference between these conditions.

2.2.5 Description

Again, we focused only on comparisons between the AICI and the other conditions, and the SD was set to 1.46 (34% of the highest score) for these comparisons.

The Bayes Factors supported the alternative hypothesis when comparing AICI and Free Recall, BH = 50,214.61; AICI and Questionnaire CI, BH = 3059.80; and AICI and Basic Chat CI, BH = 57.14. Therefore, in this case, the Bayes Factor supported the significant results for these comparisons.

2.2.6 Overall incorrect responses

For incorrect responses, we expected the number of mistakes to decrease as the sophistication of the reporting tool increased (the SD was set to x = 0.83).

The Bayes Factors indicated that the comparisons between Questionnaire CI and Basic Chat CI, BH = 11.99, and between Questionnaire CI and AICI, BH = 296.03, supported the alternative hypothesis: the chatbots elicited substantially fewer mistakes than the Questionnaire CI. Moreover, compared to Free Recall, participants using the Questionnaire CI made substantially more mistakes, BH = 51.17. Comparisons between AICI and Basic Chat CI and between Free Recall and Basic Chat CI were insensitive, BH = 1.99 and BH = 1.54, respectively, while that between AICI and Free Recall supported the null hypothesis, BH = 0.29.

Thus, while only participants in the Questionnaire CI condition produced significantly more incorrect responses overall than those in Free Recall, the Bayes Factors indicated substantial evidence that the Questionnaire CI also encouraged more incorrect responses than both chatbots. The Bayes Factors allowed us to conclude that there was no difference in the number of incorrect responses between AICI and Free Recall, indicating that these two tools encouraged accuracy more than the other two. Finally, the Bayes Factors indicated that we could not conclude that there were no differences between the Basic Chat CI and Free Recall, or between the Basic Chat CI and AICI.

2.2.7 Errors

For these analyses, we focused again on comparisons between the AICI and the other conditions, and the SD was set to 0.28 for these comparisons.

The Bayes Factors indicated that the comparisons between Basic Chat CI and AICI and between Questionnaire CI and AICI supported the alternative hypothesis, BH = 11.30 and BH = 8.61, respectively, while that between Free Recall and AICI supported the null hypothesis, BH = 0.26.

Therefore, while significance testing suggested that it made no difference which reporting tool participants used, Bayesian analysis indicated that participants using AICI made fewer errors than those using Questionnaire CI or Basic Chat CI, and that there was no difference in the number of errors made between AICI and Free Recall.

2.2.8 Confabulations

For the final analyses, we also focused on comparisons between the AICI and the other conditions, and the SD was set to 0.57 for these comparisons.

The Bayes Factors indicated that the comparisons between Free Recall and AICI, BH = 4.30, and between Questionnaire CI and AICI, BH = 106.08, supported the alternative hypothesis (performance improved as the sophistication of the tool increased). However, the comparison between Basic Chat CI and AICI, BH = 0.48, was insensitive.

Therefore, while only the Questionnaire CI encouraged participants to confabulate significantly more than Free Recall, Bayesian hypothesis testing indicated that it also encouraged participants to confabulate more than those using AICI. The results also suggested that, rather than there being no difference between Basic Chat CI and AICI (inspection of Fig. 2 shows a mean score of 1.07 for Basic Chat CI users and 0.97 for AICI users), there were not enough data to draw a conclusion either way.

2.2.9 Discussion

Statistical analyses indicated that the AICI elicited more correct responses without compromising accuracy and that this chatbot was particularly good at eliciting descriptive details, though it could improve on fact gathering. However, these analyses failed to reveal nuances in the data that the Bayes Factors did.

We considered Bayesian hypothesis testing appropriate for this type of research, as the differences between conditions were chosen to be subtle, and the stimulus was a short video (1 min 45 s) that could not elicit dramatic differences in the number of recalled items in actual terms, so we anticipated that Bayes Factors might clarify the results. We also wanted to test the minimum number of participants possible; although we made power calculations to arrive at this number, Bayes Factors do not rely on power calculations, so we considered them suitable for clarifying the results. They also confirmed in many instances that the number of participants we had tested was sufficient.

The Bayes Factors allowed us to draw conclusions that were not possible using orthodox statistics alone, and in some instances they supported the statistics, adding weight to the implications. For instance, when it came to recalling correct information, significance testing indicated that the AICI helped people to recall more items overall than Free Recall, that the AICI was better than the other conditions at eliciting description, and that the Basic Chat CI was better than the AICI at fact gathering, and the Bayes Factors supported these results.

However, when it came to non-significance, statistical analysis fell short in three ways. First, the Bayes Factor indicated that while there was no significant difference in the number of action items between AICI and Basic Chat CI, the Bayesian evidence supported the notion that the AICI elicited more items. Second, a non-significant result is often taken as support for H0, but in several cases the Bayes Factors indicated that this conclusion could not be made. For instance, when comparing the number of dialogue items recalled by AICI and Basic Chat CI users, significance testing indicated that there was no difference between conditions, but the Bayes Factors did not support this conclusion; non-significance is therefore not evidence in support of the null. Third, Bayes Factors were helpful when comparisons did show support for the null. For example, the comparison between Questionnaire CI and Basic Chat CI indicated substantial support for the null in the number of correct items recalled overall, but orthodox statistics failed to demonstrate this.

Thus, using Bayes Factors, we made more nuanced conclusions when it came to describing the effects, indicating that Bayesian hypothesis testing is a useful way of interpreting the results in studies of this kind.

Appendix 3

3.1 Example of AI chatbot

[Figure a: example interaction with the AI chatbot]

3.2 Mild harassment scenario link

https://youtu.be/C0LMPW46EQs

3.3 Mild harassment scenario screenshot

[Figure b: screenshot from the mild harassment scenario]

3.4 Example transcript

  • 1. Please tell me everything you can remember about what happened. Try not to leave anything out, even if it seems trivial. I have as much time as you need.

  • 2. Thank you for telling me about that. You mentioned… Please elaborate.

  • 3. Thank you for telling me about that. Please provide specifics about the month, week, day, or time this happened.

  • 4. You mentioned an office. Please describe.

  • 5. You referred to Mike. Please tell me more about him or her.

  • 6. The way this situation made you feel is important for understanding the impact on your wellbeing. Please describe in more detail how you felt as you were experiencing the situation.

  • 7. How did it affect your wellbeing after it happened?

  • 8. Did you tell anyone about the event?

  • 9. We’re almost done. Before we finish, is there any other evidence of what happened? For example, screenshots, emails, meeting notes, text messages, or recordings?


About this article


Cite this article

Minhas, R., Elphick, C. & Shaw, J. Protecting victim and witness statement: examining the effectiveness of a chatbot that uses artificial intelligence and a cognitive interview. AI & Soc 37, 265–281 (2022). https://doi.org/10.1007/s00146-021-01165-5
