Updated: Jan 18
In Part 1 we outlined, broadly, an approach to reducing false positives.
In Part 2, we explained that we were looking for potential product linkages that may not have been captured, and that we had too many false positives to investigate manually.
Let's finish the story here by looking at why traditional approaches wouldn't work in our case, and then explaining the alternative approach that we used.
The traditional approaches would typically look like one of these:
a) select a random sample
b) profile the sample (e.g. within a spreadsheet) to find common characteristics that allow false positives to be identified and eliminated manually.
Why did we not opt for these?
a) random sampling may have its merits, but with techniques available today that target anomalies more precisely, we decided that relying on it alone would not be defensible (and wouldn't add real value)
b) traditional profiling can work with structured data that has a small set of features (columns) - but our free-text data translated to hundreds of columns, so this was not feasible.
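To illustrate why free text turns into so many columns, here is a minimal sketch of a bag-of-words conversion, where every distinct word becomes its own column. The notes below are made up for illustration (our actual data is not shown), and this is only one common way of converting text to features, not necessarily what our analytics software did internally.

```python
import re
from collections import Counter

# Hypothetical free-text notes, invented purely for illustration.
notes = [
    "Customer asked to link savings account to home loan",
    "Home loan offset requested, no linked account on file",
    "Card replaced after customer reported it lost",
]


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


# Every distinct word across all notes becomes one column.
vocabulary = sorted({word for note in notes for word in tokenize(note)})

# Each note becomes one row of word counts, one cell per column.
rows = []
for note in notes:
    counts = Counter(tokenize(note))
    rows.append([counts[word] for word in vocabulary])

print(len(notes), "notes ->", len(vocabulary), "columns")
```

Even three short notes produce more columns than anyone would want to profile by eye, and the column count keeps growing as more records and more varied wording are added.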
We worked through a number of scenarios and options, including the above, and had to discard most of them. We knew that the analytics software that we were using had strong predictive capability - so we decided to make use of it. The process was:
1. Select a subset of the population (a random sample)
2. Manually review the sample, marking false positives as such
3. Pass the reviewed sample into a few different learning algorithms (e.g. Random Forest, Gradient Boosted Trees), within a feature selection loop, to create candidate predictive models
4. Score the models to find the best fit
5. Pass the remainder of the population through the selected model to "predict" the outcome
The result was a significant reduction in false positives. Such a process is not 100% accurate, but it is certainly better than random sampling alone.
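For readers who want to try something similar, the five-step process above can be sketched roughly as follows. This is not the software we used: it assumes Python with scikit-learn, replaces our population with synthetic data, simulates the manual review with known labels, and leaves out the feature selection loop for brevity. All variable names and sizes are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the flagged population (assumption: 1 = false positive).
X, y_true = make_classification(
    n_samples=2000, n_features=20, n_informative=5, random_state=0
)

# 1. Select a random sample of the population.
rng = np.random.default_rng(0)
sample_idx = rng.choice(len(X), size=300, replace=False)
rest_mask = np.ones(len(X), dtype=bool)
rest_mask[sample_idx] = False

# 2. "Manually review" the sample. Here the known labels stand in
#    for the reviewers' false positive markings.
X_sample, y_sample = X[sample_idx], y_true[sample_idx]

# 3. Try a few different learners on the reviewed sample.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# 4. Score each learner with 5-fold cross-validation and keep the best.
scores = {
    name: cross_val_score(model, X_sample, y_sample, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X_sample, y_sample)

# 5. Pass the remainder of the population through the selected model.
predicted = best_model.predict(X[rest_mask])
print(best_name, "flagged", int(predicted.sum()), "of", len(predicted), "as false positives")
```

Cross-validation is used in step 4 so that each model is scored on records it was not trained on. With a small reviewed sample, that gives a more honest comparison than scoring against the training data.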
The tools and techniques to improve IA's use of analytics are now readily available - let's use them.