Development
This section is indicative of the steps involved in analysis, not an exhaustive list of the analysis to conduct. Further information for Data Scientists is available here.
Responsible team members: Data Scientist to lead
Project preparation
It is important to take time at the beginning of the project to set up data and code versioning. As you explore the solution space, including various iterations of features and models, versioning lets you trace the steps that led to the best-performing solutions.
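A minimal sketch of what this can look like in practice, assuming a git-versioned repository; the log file name and metadata fields are illustrative assumptions, not a prescribed format:

```python
# Append the code commit, data version, and metrics for each experiment
# run to a log, so the best-performing solutions can be traced back to
# their exact inputs. File name and schema are illustrative.
import json
import subprocess
from datetime import datetime, timezone

def record_run(metrics: dict, data_version: str, out_path: str = "runs.jsonl") -> None:
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "code_commit": commit,
        "data_version": data_version,  # e.g. a dataset snapshot tag
        "metrics": metrics,
    }
    with open(out_path, "a") as f:
        f.write(json.dumps(record) + "\n")

# record_run({"f1": 0.82}, data_version="v1")  # hypothetical usage
```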
Data
Data preparation
The following steps will have to be replicated to run the model in production, so ensure that all processes are automated and clearly documented. Since data preprocessing can be an iterative process throughout the project lifecycle, use versioning to link each preprocessing pipeline to its models and results, preserving replicability and keeping track of the best models.
Importing / Write code that gathers dataset features from the available unified data and verifies data integrity against any assumptions made about formats, values, or quality.
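A brief sketch of such integrity checks, assuming tabular data loaded with pandas; the column names and expected ranges are hypothetical:

```python
# Illustrative integrity checks on import; columns and ranges are hypothetical.
import pandas as pd

def load_and_verify(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    # Format assumption: these columns must be present
    required = {"case_id", "harm_score", "n_convictions"}
    missing = required - set(df.columns)
    assert not missing, f"missing expected columns: {missing}"
    # Value assumptions: scores within a documented range, counts non-negative
    assert df["harm_score"].between(0, 100).all(), "harm_score outside 0-100"
    assert (df["n_convictions"] >= 0).all(), "negative conviction counts"
    # Quality assumption: case identifiers are unique
    assert df["case_id"].is_unique, "duplicate case_id values"
    return df
```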
Labelling / Accurate data labelling improves model predictions. Labels should be applied consistently, using pre-determined criteria that reflect the aim of the model as closely as possible (e.g., cases are labelled high risk only if they meet set criteria, such as having a particular harm score or a certain number of convictions).
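A sketch of criterion-based labelling, reusing the harm-score and conviction-count example above; both threshold values are hypothetical:

```python
# Criterion-based labelling sketch; thresholds are hypothetical cut-offs.
import pandas as pd

HARM_THRESHOLD = 50        # hypothetical cut-off
CONVICTION_THRESHOLD = 3   # hypothetical cut-off

def label_high_risk(df: pd.DataFrame) -> pd.Series:
    """Apply the same pre-determined criteria to every case so the labels
    stay consistent with the model's aim."""
    return (
        (df["harm_score"] >= HARM_THRESHOLD)
        | (df["n_convictions"] >= CONVICTION_THRESHOLD)
    ).astype(int)
```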
Cleaning and feature manipulation / Remove or correct entries so the data is valid and reliable. This includes removing cases that are inappropriate, as outlined in the rationale, as well as assessing missing data. You may also end up scaling or combining features; pay special attention to how scaling factors learned from the training data translate to unseen examples.
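One way to keep scaling consistent between training and unseen data, sketched here with scikit-learn's StandardScaler; the feature arrays are synthetic stand-ins:

```python
# Carry scaling factors learned on training data over to unseen examples.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))  # stand-in training features
X_new = rng.normal(size=(10, 3))     # stand-in unseen examples

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_new_scaled = scaler.transform(X_new)          # reuse those exact factors on unseen data

# Persist the fitted scaler alongside the model so production inputs are
# transformed identically, never refit on new data.
```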
Splitting the data
1 / Split your data into training, validation and test data
- Training data: used to develop the model
- Validation data: used to compare candidate models and tune their parameters
- Test data: held back for a final, unbiased evaluation of the chosen model
2 / If there is not enough data available, cross-validation can be conducted for training and validation instead of a fixed split (see the sketch below)
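A sketch of both approaches, using scikit-learn on synthetic data; the split ratios and the choice of logistic regression are illustrative, not requirements:

```python
# Train/validation/test split, and cross-validation where data is scarce.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # stand-in dataset

# Carve off the held-out test set first, then split the remainder:
# 60% train / 20% validation / 20% test.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval
)

# With limited data, cross-validate on the train+validation pool instead
# of holding out a fixed validation set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_trainval, y_trainval, cv=5)
print(scores.mean())
```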
Modelling
Train the models on the training data and conduct error analysis on the validation data
1 / Evaluate how well each model achieves the aim, based on pre-set criteria
2 / Choose the best model and investigate its error rates and sources of error, assess for bias, and identify limitations within the model
3 / Consider the different misclassification costs and their asymmetry (e.g., in risk assessment, incarceration versus wasted police time); a cost-sensitive evaluation sketch follows this list
4 / If you uncover patterns in the sources of error or unfairness, go back through feature engineering, data splitting, and modelling to improve the process
5 / Any trade-offs between accuracy and interpretability should be clearly thought through and documented.
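As flagged in step 3, a minimal sketch of cost-sensitive evaluation on the validation data might look like the following; the cost values are hypothetical placeholders for the real-world costs involved:

```python
# Compare candidate models by expected misclassification cost on the
# validation set. Costs are hypothetical: a missed high-risk case (false
# negative) is weighted more heavily than a false alarm (false positive).
from sklearn.metrics import confusion_matrix

COST_FP = 1.0  # e.g. wasted police time on a false alarm
COST_FN = 5.0  # e.g. harm from a missed high-risk case

def expected_cost(y_true, y_pred) -> float:
    """Average per-case cost, weighting the two error types asymmetrically."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return (fp * COST_FP + fn * COST_FN) / len(y_true)

# Hypothetical usage, given fitted candidate models and a validation split:
# for name, model in candidates.items():
#     print(name, expected_cost(y_val, model.predict(X_val)))
```

Choosing the model that minimises expected cost, rather than raw accuracy, makes the asymmetry explicit and easy to document.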
Do final checks on the held-out test data
1 / Check the chosen model for bias and document the trade-offs (a subgroup error-rate sketch follows this list)
2 / Conduct interpretability/explicability tests on the chosen model, record any known patterns
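One simple form such a bias check can take is comparing error rates across subgroups on the held-out test data; the group column here is a hypothetical example of a protected or otherwise relevant attribute:

```python
# Subgroup bias check: compare false positive/negative rates across groups.
import pandas as pd

def error_rates_by_group(y_true: pd.Series, y_pred: pd.Series,
                         group: pd.Series) -> pd.DataFrame:
    df = pd.DataFrame({"y": y_true, "pred": y_pred, "group": group})
    def rates(g: pd.DataFrame) -> pd.Series:
        fpr = ((g["pred"] == 1) & (g["y"] == 0)).sum() / max((g["y"] == 0).sum(), 1)
        fnr = ((g["pred"] == 0) & (g["y"] == 1)).sum() / max((g["y"] == 1).sum(), 1)
        return pd.Series({"false_positive_rate": fpr, "false_negative_rate": fnr})
    return df.groupby("group").apply(rates)

# Large gaps between groups' error rates are a signal to revisit the
# feature engineering, labelling, or modelling steps above.
```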
Documentation
1 / Data cards: supply a quick look at the properties of the dataset
2 / Model cards: document processes so models can be compared and evaluated (an illustrative skeleton follows)
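An illustrative model-card skeleton serialised to JSON; the field names follow common model-card practice rather than a mandated schema, and all values are placeholders:

```python
# Model-card skeleton; fields and values are illustrative placeholders.
import json

model_card = {
    "model_name": "high_risk_classifier",  # hypothetical
    "version": "0.1.0",
    "intended_use": "Prioritise cases for human review, not automated decisions",
    "training_data": {"data_card": "data_card_v1.json", "date_range": "placeholder"},
    "metrics": {"validation": "placeholder", "test": "placeholder"},
    "known_limitations": ["placeholder, e.g. subgroups with higher error rates"],
    "accuracy_interpretability_tradeoffs": "placeholder",
}

with open("model_card.json", "w") as f:
    json.dump(model_card, f, indent=2)
```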