Standard machine learning metrics like F1 score are excellent for static classification, but they often fail to capture the reality of time-based user experiences. A system with "99% accuracy" might be flawless, or it might be infuriating; it all depends on the timing and duration of the 1% of failures. A 100-millisecond glitch is trivial, but a feature that fails for one full day out of 100 is a disaster. That mismatch revealed a fundamental disconnect between an algorithm's abstract performance and a user's real-world perception.
To bridge this gap, I created a new methodology: Experiential Metrics, a system for measuring performance based on tangible aspects of the user journey. Instead of relying on abstract scores, we began answering dozens of questions a user would actually care about, such as the following (a sketch of how these checks can be computed appears after the list):
Did the sensor detect them within 2 seconds of their approach?
How often did detection drop out while they were still present?
Were there any failure periods longer than 5 seconds?
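To make that concrete, here is a minimal sketch, not the original framework, of how checks like these could be computed from a timestamped detection log. The `Sample` format, function names, and thresholds are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    t: float          # seconds since the user entered the sensor's range
    detected: bool    # whether the system reported a detection at this instant


def time_to_first_detection(samples: List[Sample]) -> float:
    """Seconds from the user's approach until the first positive detection."""
    return next((s.t for s in samples if s.detected), float("inf"))


def dropout_gaps(samples: List[Sample]) -> List[float]:
    """Durations of contiguous non-detection gaps after the first detection."""
    gaps: List[float] = []
    gap_start = None
    seen_first = False
    for prev, cur in zip(samples, samples[1:]):
        seen_first = seen_first or prev.detected
        if not seen_first:
            continue                      # user not detected yet; no dropout possible
        if prev.detected and not cur.detected:
            gap_start = cur.t             # detection just dropped out
        elif gap_start is not None and cur.detected:
            gaps.append(cur.t - gap_start)  # detection recovered; record the gap
            gap_start = None
    return gaps


def experiential_report(samples: List[Sample]) -> dict:
    """Summarize one session the way a user would describe it."""
    gaps = dropout_gaps(samples)
    return {
        "time_to_first_detection": time_to_first_detection(samples),
        "dropout_count": len(gaps),
        "longest_gap": max(gaps, default=0.0),
    }
```

Each session's report then answers the questions above directly: was `time_to_first_detection` under 2 seconds, how many dropouts occurred, and did `longest_gap` exceed 5 seconds.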
This approach allowed stakeholders to debate and prioritize trade-offs based on concrete, real-life scenarios. To make the methodology broadly accessible, I developed a Python framework that let anyone on the team, including field engineers and project managers, build these custom metrics using simple pseudo-code.
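The actual framework is not reproduced here, but its spirit was that a metric reads almost like the requirement it encodes. The `Metric` class, its `evaluate` method, and the session-report keys below are hypothetical stand-ins for illustration:

```python
# Hypothetical sketch of the kind of near-pseudo-code interface such a
# framework might expose; the Metric class and session keys are invented
# for illustration and are not the actual API.
class Metric:
    def __init__(self, name, rule):
        self.name = name
        self.rule = rule  # callable: one session report (dict) -> pass/fail

    def evaluate(self, sessions):
        """Fraction of sessions that satisfy the rule."""
        results = [bool(self.rule(s)) for s in sessions]
        return sum(results) / max(len(results), 1)


# A field engineer can state a requirement almost verbatim:
fast_first_detection = Metric(
    "detects the user within 2 seconds",
    rule=lambda session: session["time_to_first_detection"] <= 2.0,
)
no_long_outages = Metric(
    "no failure period longer than 5 seconds",
    rule=lambda session: session["longest_gap"] <= 5.0,
)
```

Because each rule is a one-line predicate over a plain session report (such as the one produced by `experiential_report` above), non-programmers could read, review, and propose metrics without touching the detection code.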
This framework empowered the entire team to conduct A/B tests and make decisions based on how the end user would perceive the system's behavior. We shifted the focus from optimizing abstract algorithms to delivering a predictably excellent user experience, and in the process created a powerful tool for validating that our simulations matched reality and for catching issues like model overfitting.
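As a rough illustration of how this supported A/B testing and simulation validation (again using the hypothetical `Metric` objects sketched above), comparing two builds, or simulation against field recordings, reduces to comparing the same user-facing numbers side by side:

```python
# Illustrative comparison of experiential metrics between two sources, e.g.
# model A vs. model B, or simulated sessions vs. field-recorded sessions.
# Reuses the hypothetical Metric objects from the sketch above.
def compare(metrics, sessions_a, sessions_b, label_a="A", label_b="B"):
    for m in metrics:
        a, b = m.evaluate(sessions_a), m.evaluate(sessions_b)
        flag = "  <-- investigate" if abs(a - b) > 0.05 else ""
        print(f"{m.name}: {label_a}={a:.1%}  {label_b}={b:.1%}{flag}")


# Example call (the session lists would come from real logs or the simulator):
# compare([fast_first_detection, no_long_outages],
#         simulated_sessions, field_sessions,
#         label_a="simulation", label_b="field")
```

A model that looks strong in simulation but slips on these per-session checks in the field is an early warning sign of the kind of overfitting mentioned above.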