Archives

Computational text analysis on unstructured police data: A scoping review

Police reports made following attendance at events such as crashes, domestic violence and theft often contain rich contextual details including indicators of mental health issues or abuse types, and persons/entities involved and their relationships, which are not typically captured in structured administrative data, interviews or official statistics. However, the sheer volume of information along with strict data access protocols render manual analysis impractical. Computational text analysis methods offer a feasible and effective approach to automatically process this underutilized data source.

The research team led by Dr Wilson Lukmanjaya (University of New South Wales) included VISION Research Fellow Dr Darren Cook. The team conducted an overview of studies using computational text analysis (e.g., text mining, natural language processing (NLP)), on unstructured police data, serving as a guide for researchers interested in employing similar methodologies.

Their article, Computational text analysis on unstructured police data: A scoping review, was conducted in accordance with the PRISMA-SCR guidelines, following the two screening processes (title/abstract and full text screening) and the development of a pre-defined protocol. A search was conducted across seven electronic databases covering the past 20 years.

After removing duplicate entries and screening titles/abstracts and full-text publications, 61 studies met the inclusion criteria. Included studies were published between 2004 and 2024, with most from the United States, Australia and the Netherlands.

The scoping review indicates applications of computational text analysis on unstructured police data have moderate to high performance. Common limitations included variable data quality, with reliability depending on the level of detail provided by the police report’s author, and failure to report ethical implications or methodological limitations.

Computational text analysis can extract key information from unstructured police data. However, future research should clearly report ethics approvals and implications, and methodological limitations.

Recommendation

Establishing a structured data-sharing framework between law enforcement and researchers is crucial to facilitate access and support high quality, impactful research in this field.

To download the paper: Computational text analysis on unstructured police data: A scoping review

To cite: Lukmanjaya, W., Halmich, C., Butler, T., Cook, D., Karystianis, G. Computational text analysis on unstructured police data: a scoping review. Crime Sci (2026). https://doi.org/10.1186/s40163-026-00272-2

Photograph from Adobe Photo Stock subscription

Improving police recorded crime data with natural language processing

Understanding and preventing Domestic Violence and Abuse (DVA) is compounded by long-standing data quality issues in police records. Accurate police-recorded crime data is vital for responding to DVA, yet it often contains missing values and inaccuracies.

Across all crime types, the quality of police data in England and Wales has been a concern. While there have been improvements in overall crime data recording since 2014, individual police forces still encounter difficulties adequately recording instances of DVA in police-recorded crime datasets.

Correcting poorly recorded or missing data at this scale is non-trivial and beyond the capabilities of manual intervention alone. Fortunately, the increasing availability of computational solutions and machine learning algorithms such as text mining and natural language processing (NLP) can augment, and to a degree, offset much of this processing. NLP is supported by a growing body of interdisciplinary research, which shows that valuable information can be automatically extracted from unstructured data such as crime reports and case summaries through technology.

However, automated prediction systems are not without risk, particularly when applied in sensitive domains such as policing. Data inherently reflects societal biases that poorly designed AI solutions can amplify, and in the context of DVA, these biases may stem from underreporting of marginalized demographic groups or inconsistencies in police recording practices.

In their recent study, Improving police recorded crime data for domestic violence and abuse through natural language processing, VISION researchers Dr Darren Cook and Dr Ruth Weir (City St George’s University of London) and Dr Leslie Humphries (University of Lancashire), evaluated the capability of supervised machine learning models to automatically extract victim–offender relationship information from free-text crime notes in DVA cases.

Both models demonstrated that such tools could serve as cost-effective and efficient alternatives to manual coding, accurately classifying relationship type in around four out of five cases. The incorporation of a selective classification function improved precision for the most challenging cases by abstaining from low-confidence predictions, though at the cost of reduced coverage. This research represents a meaningful step toward addressing concerns about the completeness and reliability of police-recorded crime data.

Recommendation

Given that police-recorded crime lost its status as an accredited official statistic in 2014 due in part to weaknesses in data collection and processing, the application of data science methods to reliably impute missing values offers a promising route to restoring confidence in these records. Police constabularies are encouraged to use the available technology and implement text mining and NLP solutions to extract valuable information from unstructured data such as crime reports and case summaries.

For further information: Please contact Darren at darren.cook@citystgeorges.ac.uk

To cite: Cook D, Weir R, Humphries, L. Improving police recorded crime data for domestic violence and abuse through natural language processing. Front. Sociol., 24 November 2025, Sec. Medical Sociology Volume 10 – 2025 https://doi.org/10.3389/fsoc.2025.1686632

Photograph from Adobe Photo Stock subscription

Webinar: Exploring Natural Language Processing in violence prevention data

This event is in the past.

Do you work with text data in the field of violence prevention?

Are you interested in exploring how Natural Language Processing (NLP) can be used as an analytical tool?

Research Fellow Darren Cook from the UKPRP VISION Consortium and the Violence & Society Centre at City, St George’s, University of London will demonstrate how NLP techniques can be applied to domestic violence and abuse data in an upcoming webinar on 13 November 2025 from 10 – 10:50 am.

What is NLP?

Natural Language Processing (NLP) focuses on the interaction between computers and human language, such as interpreting and categorising free text from police or medical notes. This approach enables machines to understand, interpret, and generate human language in meaningful and useful ways. It allows computers to analyse and process text data, capturing not only the content but also the intent and emotion behind the words.

13 November 2025, 10 – 10:50 am, online only

To register for the event and receive the Teams link, please email: VISION_Management_Team@city.ac.uk

Introducing VISION research on Domestic Homicide Reviews at the Vulnerability and Policing Futures Conference

Blog by Dr Darren Cook, VISION Research Fellow in Natural Language Processing

Earlier this month, Dr Elizabeth (Lizzie) Cook and I had the opportunity to introduce our developing project on Domestic Homicide Reviews (DHRs) at the Vulnerability and Policing Futures Research Centre’s second annual conference in Leeds. The two-day event brought together academics, practitioners, and policymakers to explore the themes of reducing harm and strengthening justice.

In a session on Measuring Vulnerability: Harnessing Routinely Collected Data, we outlined how natural language processing (NLP) could be used to improve access to and analysis of the Home Office’s growing library of DHR reports. We highlighted both the opportunities and challenges of applying advanced computational methods to such sensitive material and set out our vision for building a tool to make the full corpus of DHRs more searchable. By improving searchability, researchers and policymakers can more effectively explore recurring themes and insights within the reports.

Our talk prompted thoughtful questions and constructive feedback from an interdisciplinary audience of around 30 participants, which will help shape the next stages of the project.

What are Domestic Homicide Reviews?

Domestic Homicide Reviews (DHRs) are reports that examine the circumstances surrounding a death resulting from suspected domestic violence or abuse. Introduced in the Domestic Violence, Crime and Victims Act (2004) and implemented in 2011, these reviews provide a detailed, chronological account of the victim and perpetrator. They are written in narrative form and aim to identify lessons that can be learned from a domestic homicide.

Since June 2023, over 600 reports have been made publicly available through an online repository with a view to improving transparency and to encourage greater opportunity for learning.

Why does access remain a challenge?

Despite the progress made with the repository, researchers and practitioners still face barriers that limit how effectively the reports can be used. In our talk, we focused on two key challenges: (1) At present, each report must be extracted individually, which is impractical for projects working across hundreds of documents, and (2) The repository’s search functions rely on a fixed set of tags added by the Home Office. Users cannot create new categories or adjust existing ones, which restricts the kinds of questions researchers can ask. As a result, while the reports are technically public, their full potential as a resource remains difficult to unlock.

Building better access

Our presentation built on earlier consultation work with public and third sector organisations, and we shared some of the next steps we are planning. These will be set out in more detail in an upcoming research protocol paper co-authored with Sumanta Roy, a member of our VISION Advisory Board and Head of Research, Evaluation & Development at Imkaan, and Ravi Thiara, VISION Co-Investigator and Professor at University of Warwick.

Our central idea is to explore the feasibility of creating a structured dataset that summarises the key features of the DHR library. This would capture information such as victim and perpetrator demographics, the commissioning body, safeguarding risks, recommendations, and missed opportunities.

To do this, we are developing a tool that applies text-mining and natural language processing (NLP) techniques to extract information directly from the reports. The resulting dataset will be both searchable and filterable, allowing users to focus on increasingly specific subsets of the material. We also want to build functionality that enables users to download customised sets of reports rather than relying on individual downloads.

How will this help researchers and policymakers?

By improving access, searchability, and flexibility, our project will make it possible to work with DHRs at a scale that has not previously been possible. Instead of relying on small samples or manual searches, researchers will be able to look across hundreds of cases, identify recurring patterns, and ask new kinds of questions. The creation of a structured dataset will also support more consistent and comparable findings, helping to strengthen collaboration between academics, practitioners, and policymakers.

In the longer term, we hope this work will not only make DHR research more efficient but also ensure that the lessons within these reports are more easily applied to safeguarding practice.

Looking back to the conference

Presenting at the Vulnerability and Policing Futures Conference gave us the chance not only to share our ideas but also to test them with an audience of experts. The thoughtful discussions and questions we received will help guide how we take this project forward.

For further information, please contact Darren at darren.cook@citystgeorges.ac.uk

Image from Adobe Stock subscription.

Knowledge Transfer Partnership award for City St George’s UoL and the National Centre for Domestic Violence

Blog by Dr Darren Cook, VISION Research Fellow in Natural Language Processing

Introduction

I’m pleased to share that a cross-university collaboration between City St George’s School of Social Policy and Global Affairs (SPGA) and School of Science and Technology (SST) has been successful in a recent Knowledge Transfer Partnership (KTP) competition.

Working alongside our industry partner, the National Centre for Domestic Violence (NCDV), our project will develop data science capabilities that enhance automation, scalability, and efficiency at NCDV. This will enable improved support, faster response times, and better outcomes for victims of domestic abuse. The project is due to begin in early 2026 and will run for approximately two years.

Importantly, this is the first KTP involving SPGA. As such, it marks a significant milestone for the school, creating new pathways for impactful collaboration with industry and laying the groundwork for future projects that can translate academic expertise into measurable social change.

What is a KTP?

A Knowledge Transfer Partnership (KTP) is a collaborative programme between UK businesses and universities, supported and part funded by Innovate UK [1].

Each KTP addresses a specific business challenge, enabling the transfer of knowledge and expertise from academia into industry through a KTP Associate. The Associate is employed by the business but supported by an academic supervisor, delivering a structured package of work designed to drive innovation and growth.

KTPs are proven to have a significant impact, generating more than £2 billion for the UK economy between 2010 and 2020 [2]. They also support the professional development of Associates, who gain unique experience at the intersection of academia and industry.

What is the focus of this KTP?

Having been successful in our funding application, I will serve as the Academic Supervisor, working alongside Dr Chris Childs (Academic Lead) and the appointed KTP Associate. Together, we will design and implement advanced data science tools to automate key data processes within NCDV.

By streamlining and scaling these processes, the project will:

Enable NCDV to support more victims in need.
Reduce response times, ensuring urgent legal protections can be accessed more quickly.
Provide a replicable model of innovation that could benefit other organisations in the domestic abuse sector.

This partnership also has substantial potential impact for REF, by generating demonstrable evidence of social and economic benefit from research-led activity.

Who are the NCDV?

The National Centre for Domestic Violence (NCDV) [3] is a Community Interest Company (CIC) that helps victims of domestic abuse obtain urgent legal protection through the courts.

Domestic abuse is a widespread and pressing issue:

An estimated 2.1 million people in England and Wales experienced domestic abuse per year (1.4 million women and 751,000 men) [4].
Police in England and Wales receive over 100 calls relating to domestic abuse every hour [5].

Against this backdrop, the work of NCDV is vital. This KTP will strengthen their capacity to respond to high demand and ensure more victims can access the protection they need.

References

[1] Innovate UK – Knowledge Transfer Partnerships
[2] Innovate UK, KTP Impact Report
[3] National Centre for Domestic Violence
[4] Office for National Statistics, “Domestic abuse prevalence and trends, England and Wales: year ending March 2023.”
[5] Refuge, “Domestic abuse statistics.”

For further information, please contact Darren at darren.cook@citystgeorges.ac.uk

Natural Language Processing: Interrogating free text in mental healthcare records to capture experiences of violence

Violence can be categorised in a variety of ways for example physical, sexual, emotional, and domestic but all cause significant physical and mental morbidity within general populations. Individuals with a severe mental illness have been found to be significantly more likely to experience domestic, physical, and sexual violence compared to the general population. For these individuals, experiences of violence are important risk factors however, this is not routinely collected by mental health services.

In general data on all forms of violence has been inadequately available from healthcare records. This is partly due to the lack of routine enquiry by professionals at points of clinical contact, and partly because instances of violence are difficult to identify in healthcare data in the absence of specific coding systems.

A general challenge for using health records data for research is that the most valuable and granular information is frequently contained in text fields (e.g., routine case notes, clinical correspondence) rather than in pre-structured fields; this includes mentions of violence whether experienced as a victim or perpetrated. Capturing violence experiences across mental healthcare settings can be challenging because most instances are likely to be recorded as unstructured text data. Therefore, natural language processing (NLP), is increasingly in use to extract information automatically from unstructured text in electronic health records, particularly in mental healthcare, on clinical entities.

Dr Ava Mason from Kings College London and VISION researchers Professor Robert Stewart, Dr Angus Roberts, Dr Lifang Li, and Dr Vishal Bhavsar worked with colleagues to apply NLP across different clinical samples to investigate mentions of violence. They ascertained recorded violence victimisation from the records of 60,021 patients receiving care from a large south London NHS mental healthcare provider during 2019. Descriptive and regression analyses were conducted to investigate variation by age, sex, ethnic group, and diagnostic category.

Results showed that patients with a mood disorder, personality disorder, schizophrenia spectrum disorder or PTSD had a significantly increased likelihood of victimisation compared to those with other mental health diagnoses. Additionally, patients from minority ethnic groups for Black and Asian had significantly higher likelihood of recorded violence victimisation compared to White groups. Males were significantly less likely to have reported recorded violence victimisation than females.

The researchers demonstrated the successful deployment of machine learning based NLP algorithms to ascertain important entities for outcome prediction in mental healthcare. The observed distributions highlight which sex, ethnicity and diagnostic groups had more records of violence victimisation. Further development of these algorithms could usefully capture broader experiences, such as differentiating more efficiently between witnessed, perpetrated and experienced violence and broader violence experiences like emotional abuse.

To download the paper: Frontiers | Applying neural network algorithms to ascertain reported experiences of violence in routine mental healthcare records and distributions of reports by diagnosis

To cite: Mason AJC, Bhavsar V, Botelle R, Chandran D, Li L, Mascio A, Sanyal J, Kadra-Scalzo G, Roberts A, Williams M, Stewart R. Applying neural network algorithms to ascertain reported experiences of violence in routine mental healthcare records and distributions of reports by diagnosis. Frontiers in Psychiatry 2024 Sep 10. doi:103389/fpsyt.2024.1181739

Illustration from Adobe Photo Stock subscription

Mental health service responses to violence: VISION symposia at the European Psychiatric Association

An aim of the VISION programme is to examine the nature and extent of contact that people with experience of violence have with various health and justice services.

Findings on mental health services were presented in a series of symposia at the European Psychiatric Association’s Section on Epidemiology and Social Psychiatry this year.

The first brought together six studies on experiences of violence and adversity and implications for mental health service use. These included King’s College London’s Anjuli Kaul presenting on Sexual Violence in Mental Health Service Users and Sian Oram on Mental Health Treatment Experiences of Minoritised Sexual Violence Survivors, with further contributions from Emma Soneson (Oxford), Maryam Ghasemi (Auckland), and Ladan Hashemi and Sally McManus (both City St George’s).

A second symposium highlighted the value of the Adult Psychiatric Morbidity Survey to violence research, with Sally McManus presenting on Threatening or Obscene Messages from a Partner and Mental Health, Self-harm and Suicidality.

Finally, a third symposium featuring VISION researchers Angus Roberts, Rob Stewart and others and highlighted how natural language processing can be used with information collected in mental health settings. Sharon Sondh (South London and Maudsley NHS Foundation Trust) presented on classifying experiences of violence in mental healthcare records.

Natural Language Processing: Improving Data Integrity of Police Recorded Crime

By Darren Cook, Research Fellow in Natural Language Processing at City, University of London

Did you know that police recorded crime data for England and Wales are not accredited by the UK’s Office for Statistics Regulation (OSR)? This decision, made by the OSR after an audit in 2014, was due to concerns about the reliability of the underlying data.

Various factors affect the quality of police-recorded data. Differences in IT systems, personnel decision-making, and a lack of knowledge-sharing all contribute to reduced quality and consistency. Poor data integrity leads to a lack of standardisation across police forces and an increase in inaccurate or missing entries. I recently spoke about this issue at the Behavioural and Social Sciences in Security (BASS) conference at the University of St. Andrews, Scotland.

Correcting missing values is no small feat. In a dataset of 18,000 police recorded domestic violence incidents, we found over 4,500 (25%) missing entries for a single variable. Let’s assume it takes 30 seconds to find the correct value for this variable – that’s 38 hours of effort – almost a full working week. Given that there could be as many as twenty additional variables, it would take over four months to populate all the missing values in our dataset. Expanding such effort across multiple police forces and for multiple types of crime highlights the inefficiency of human-effort in this endeavour.

In my talk, I outlined an automated solution to this problem using Natural Language Processing (NLP) and supervised machine learning (ML). NLP describes the processes and techniques used by machines to understand human language, and supervised ML describes how machines learn to predict an outcome based on previously seen examples. In this case, we sought to predict the relationship between the victim and offender – an important piece of demographic information vital to ensuring victim safety.

The proposed system would use a text-based crime ‘note’ completed by a police officer to classify the victim offender relationship as either ‘Ex-Partner”, “Partner”, or “Family” – in keeping with the distinction made by Women’s Aid. Crime notes are an often-overlooked source of information in police data, yet we found they consistently referenced the victim-offender relationship. The goal of our system, therefore, was to extract the salient information from the free-form crime notes and populate the corresponding missing value in our structured data fields.

Existing solutions based on keywords and syntax parsing are used by multiple UK police forces. While effective, they require manual effort to create, update, and maintain the dictionaries, and they don’t generalise well. Our supervised ML system, however, can be automatically updated and monitored to maintain accuracy.

When tested, our system achieved 80% accuracy, correctly labelling the relationship type in four out of five cases. In comparison, humans performed this task with approximately 82% accuracy – an arguably negligible difference. Moreover, once trained, our system could classify the entire test set (over 1,000 crime notes) in just sixteen seconds.

However, we noted some limitations, the biggest of which was a high linguistic overlap in crime notes between ‘Ex-Partner’ and ‘Partner’ that caused several misclassifications. We believe more advanced language models (i.e., word embeddings) will improve discrimination between these relationships.

We also discovered a potential prediction bias against minorities. Although victim ethnicity wasn’t included in our training setup, we observed reduced accuracy for Black or Asian victims. The source and extent of this bias are subjects of ongoing research.

Our findings highlight the promise of automated solutions but serve as a cautionary tale against assuming these systems can be applied carte blanche without careful consideration of their limitations. Several outstanding questions remain. Is a system with 80% accuracy good enough? Is it better to leave missing values rather than predict incorrect ones? Incorrectly identifying a perpetrator as a current partner rather than an ex-partner could significantly impact the victim’s safety. Additionally, a model biased against certain ethnicities risks overlooking the specific needs of minority groups.

The conference sparked lively and engaging conversation about many of these issues, as well as the role that automation can be play within the social sciences more broadly. A research article describing these results in full is the focus of ongoing work, and the presentation slides are available below as a download.

For further information please contact Darren at darren.cook@city.ac.uk or via LinkedIn @darrencook1986

Dr Darren Cook, An application of Natural Language Processing (NLP) to free-form Police crime notes – 1 download

D Cook_NLP and Police Download

Photo by Markus Spiske on Unsplash

Calling all crime analysts: Share your experiences of using text data in analysis

Are you a crime analyst or researcher? If so VISION would really like to hear about your experiences of using text data in your analysis.

We developed a short survey that will take approximately 5 minutes to complete. Qualtrics Survey | Crime Analyst Survey

This survey is designed to explore your experiences working with free-text data. Your feedback will enable us to evaluate the need for software designed to assist analysts working with large amounts of free text data.

Participation is voluntary and all responses will be anonymous. Information will be confidential and will not be shared with any other parties, and will be deleted once it is no longer needed.

The deadline to provide feedback using the link above is 30 June 2024.

Illustration from licensed Adobe Stock library