Unification
In a large police force, data may be spread across multiple data management systems. It is important to identify all the key pieces of information that will contribute to the best possible view of the problem, and to ensure that this data can be accessed reliably and accurately as often as the model will need to run once implemented.
Responsible team members: Data Engineer and Data Scientist to lead
Define what data is needed, which system(s) it is stored in, and whether you have the software to analyse it
1 / Identify what data features are needed. These should be clearly justified by specific aims and requirements defined beforehand using Rationale and a Data Protection Impact Assessment (DPIA).
2 / Ensure features can be accessed. This may involve discussions with third party system owners (e.g., Connect, Niche) about being able to extract data.
3 / Ensure you have correct software for analysis. This may involve discussions with your IT department.
If using external agencies: personally identifiable information is subject to regulation governing its collection and use; therefore, pseudonymise data before sharing.
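The pseudonymisation step above can be sketched with a keyed hash, so that the same identifier always maps to the same pseudonym (preserving linkage) while the external agency cannot reverse it. This is a minimal illustration using the Python standard library; the field names and the key-handling are assumptions, and in practice the key must be stored securely and never shared with the receiving agency.

```python
import hashlib
import hmac

# Hypothetical secret key held by the force; never shared with the external agency.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier (e.g. a suspect ID) with a keyed hash.
    The same input always yields the same pseudonym, so records can still
    be linked, but the original value cannot be recovered without the key."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# Illustrative record: replace the identifying field before sharing.
record = {"suspect_id": "S-10234", "dob": "1990-04-12"}
shared = {**record, "suspect_id": pseudonymise(record["suspect_id"])}
```

Note that dates of birth and addresses can themselves be identifying; a DPIA should determine which fields need the same treatment.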
Build an integrated system using a ‘match and merge’ process
Build tables and link them together using standardised keys to reduce data anomalies and make the data easier to analyse
1 / Extract relevant features from across the data
2 / Merge records under one identifying variable (e.g., suspect ID) by using overlapping data features (e.g., same DOB, same address)
3 / Measure how close near-matches are using string comparison measures
4 / Expect manual effort: working out the links across different records often requires human review, especially for near-matches
Pulling data out of back-end databases can slow the operational use of those systems, so extraction must be balanced against other needs.
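Steps 2 and 3 of the match-and-merge process can be sketched as follows. This is a minimal illustration using Python's standard-library `difflib` for string similarity; the record fields and the 0.85 threshold are assumptions, and a production system would typically use dedicated record-linkage measures (e.g. Levenshtein or Jaro-Winkler distance).

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Score between 0 and 1; higher means the strings are closer."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Illustrative records from two different systems.
system_a = {"name": "Jon Smith", "dob": "1985-03-02", "address": "12 High St"}
system_b = {"name": "John Smith", "dob": "1985-03-02", "address": "12 High Street"}

def is_match(r1: dict, r2: dict, threshold: float = 0.85) -> bool:
    """Merge rule: exact overlap on DOB plus a high name similarity.
    Records below the threshold should be queued for manual review
    rather than merged automatically."""
    return r1["dob"] == r2["dob"] and similarity(r1["name"], r2["name"]) >= threshold

print(is_match(system_a, system_b))  # True: same DOB, names score ~0.95
```

The threshold controls the trade-off between false merges (mistaken identity) and missed links; it should be tuned against manually verified examples.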
Quality assess the system
1 / Conduct cross-checks to ensure the accuracy and validity of the data; inaccuracies can lead to mistaken identity or to inaccurate intelligence influencing downstream decisions.
2 / Cross-checks can feed back into further match and merge processes, e.g., overlaps on one identifying variable may help merge records elsewhere
3 / Conduct unit testing to ensure the code components work as expected
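A cross-check from step 1 can be written as a small, unit-testable validity function, as sketched below. The field names and rules are illustrative, not a fixed schema; the point is that each merged record either passes or is flagged for review with a specific reason.

```python
from datetime import datetime

def check_record(record: dict) -> list:
    """Return a list of validity problems for one merged record.
    An empty list means the record passed all cross-checks.
    Fields and rules here are illustrative assumptions."""
    problems = []
    if not record.get("suspect_id"):
        problems.append("missing suspect_id")
    dob = record.get("dob", "")
    try:
        datetime.strptime(dob, "%Y-%m-%d")  # expect ISO dates after merging
    except ValueError:
        problems.append(f"invalid dob: {dob!r}")
    return problems

# Unit-test style assertions: a clean record passes, a bad one is flagged.
assert check_record({"suspect_id": "S-1", "dob": "1990-04-12"}) == []
assert check_record({"suspect_id": "", "dob": "12/04/1990"}) == [
    "missing suspect_id", "invalid dob: '12/04/1990'"
]
```

Flagged records should be routed back into the match-and-merge process or to manual review rather than silently dropped.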
Set up Extract Transform Load (ETL) code to run on a regular basis
1 / Ensure the ETL code includes quality checks so that new data and changed data formats are caught
2 / Choose a reasonable frequency to run the checks so that the system is up to date but not overloaded
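An ETL pass with a built-in quality gate, as described in step 1, might look like the sketch below. The expected column set and the CSV source are assumptions for illustration; the key idea is that a schema change in the source system fails loudly instead of silently loading bad data.

```python
import csv
import io

# Illustrative schema; in practice this mirrors the agreed extract format.
EXPECTED_COLUMNS = {"suspect_id", "dob", "offence_code"}

def extract_transform_load(raw_csv: str) -> list:
    """One ETL pass with a quality gate: reject the whole batch if the
    source schema has drifted, so new or renamed columns are caught,
    and drop rows with missing values rather than loading them."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    if set(reader.fieldnames or []) != EXPECTED_COLUMNS:
        raise ValueError(f"schema drift: got columns {reader.fieldnames}")
    return [row for row in reader if all(row.values())]

batch = "suspect_id,dob,offence_code\nS-1,1990-04-12,TH68\nS-2,,TH68\n"
rows = extract_transform_load(batch)  # keeps S-1; drops the row with a blank dob
```

A script like this would then be scheduled (e.g. via cron or an orchestration tool) at whatever frequency balances freshness against load on the source systems, per step 2.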
Run business reports
1 / Business reports can be used to summarise, track and analyse patterns in the data, including trends and irregularities
2 / At this point there may be enough available insight that further modelling is not needed.
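The kind of summary a business report starts from can be as simple as counts over the unified data, as in this minimal sketch. The incident records and fields are invented for illustration; in practice these come from the integrated system built above.

```python
from collections import Counter

# Illustrative incident records drawn from the unified system.
incidents = [
    {"month": "2024-01", "type": "burglary"},
    {"month": "2024-01", "type": "burglary"},
    {"month": "2024-02", "type": "burglary"},
    {"month": "2024-02", "type": "vehicle"},
]

# Simple trend summaries: incident counts per month and per offence type.
# Comparing months surfaces trends; a sudden spike flags an irregularity.
by_month = Counter(i["month"] for i in incidents)
by_type = Counter(i["type"] for i in incidents)
```

If summaries like these already answer the operational question, that supports the point in step 2: further modelling may not be needed.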