Improving police recorded crime data with natural language processing

Efforts to understand and prevent Domestic Violence and Abuse (DVA) are hampered by long-standing data quality issues in police records. Accurate police-recorded crime data is vital for responding to DVA, yet it often contains missing values and inaccuracies.

Across all crime types, the quality of police data in England and Wales has been a concern. While overall crime data recording has improved since 2014, individual police forces still struggle to record instances of DVA adequately in police-recorded crime datasets.

Correcting poorly recorded or missing data at this scale is non-trivial and beyond the capabilities of manual intervention alone. Fortunately, increasingly available computational techniques such as text mining and natural language processing (NLP) can augment, and to a degree automate, much of this processing. A growing body of interdisciplinary research shows that valuable information can be automatically extracted from unstructured data such as crime reports and case summaries.

However, automated prediction systems are not without risk, particularly when applied in sensitive domains such as policing. Data inherently reflects societal biases that poorly designed AI solutions can amplify, and in the context of DVA, these biases may stem from underreporting among marginalized demographic groups or inconsistencies in police recording practices.

In their recent study, Improving police recorded crime data for domestic violence and abuse through natural language processing, VISION researchers Dr Darren Cook and Dr Ruth Weir (City St George's, University of London) and Dr Leslie Humphries (University of Lancashire) evaluated how well two supervised machine learning models can automatically extract victim–offender relationship information from free-text crime notes in DVA cases.

Both models demonstrated that such tools could serve as cost-effective and efficient alternatives to manual coding, accurately classifying relationship type in around four out of five cases. The incorporation of a selective classification function improved precision for the most challenging cases by abstaining from low-confidence predictions, though at the cost of reduced coverage. This research represents a meaningful step toward addressing concerns about the completeness and reliability of police-recorded crime data.
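The paper does not prescribe a particular implementation, but the idea of selective classification can be illustrated with a minimal sketch: a text classifier predicts the victim–offender relationship only when its confidence clears a threshold, and otherwise abstains so the case can be routed to a human coder. The model choice, labels, example notes, and threshold below are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch of confidence-based selective classification for
# victim-offender relationship labels extracted from free-text crime notes.
# All data, labels, and the 0.7 threshold are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training examples: crime notes paired with relationship labels.
notes = [
    "victim reports ex-partner attended address and made threats",
    "suspect is the victim's adult son, argument over money",
    "current partner assaulted victim during argument at home",
]
labels = ["ex-partner", "family member", "partner"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(notes, labels)

def predict_or_abstain(text, threshold=0.7):
    """Return a relationship label only when the model is confident enough."""
    probs = clf.predict_proba([text])[0]
    best = probs.argmax()
    if probs[best] < threshold:
        return None  # abstain: refer this case for manual coding
    return clf.classes_[best]

print(predict_or_abstain("victim's former partner sent abusive messages"))
```

Raising the threshold trades coverage for precision in the same way the study describes: fewer cases receive an automatic label, but those that do are more likely to be correct.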

Recommendation

Given that police-recorded crime lost its status as an accredited official statistic in 2014 due in part to weaknesses in data collection and processing, the application of data science methods to reliably impute missing values offers a promising route to restoring confidence in these records. Police constabularies are encouraged to use the available technology and implement text mining and NLP solutions to extract valuable information from unstructured data such as crime reports and case summaries.

For further information: Please contact Darren at darren.cook@citystgeorges.ac.uk

To cite: Cook D, Weir R, Humphries L. Improving police recorded crime data for domestic violence and abuse through natural language processing. Front. Sociol., 24 November 2025, Sec. Medical Sociology, Volume 10, 2025. https://doi.org/10.3389/fsoc.2025.1686632

Photograph from Adobe Photo Stock subscription
