Standard machine learning metrics like F1 score are excellent for static classification, but they often fail to capture the reality of time-based user experiences. A system with "99% accuracy" might be flawless, or it might be infuriating; it all depends on the timing and duration of the 1% of failures. A 100-millisecond glitch is trivial, but a feature that fails for one full day out of 100 is a disaster. That mismatch revealed a fundamental disconnect between an algorithm's abstract performance and a user's real-world perception.
To bridge this gap, I created a new methodology: Experiential Metrics, a system for measuring performance based on tangible aspects of the user journey. Instead of relying on abstract scores, we began answering dozens of questions a user would actually care about, such as the following (a sketch of how these checks can be computed appears after the list):
Did the sensor detect them within 2 seconds of their approach?
How often did detection drop out while they were still present?
Were there any failure periods longer than 5 seconds?
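To make that concrete, here is a minimal sketch, not the original framework, of how checks like these could be computed from a timestamped detection log. The `Sample` format, function names, and thresholds are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    t: float          # seconds since the user entered the sensor's range
    detected: bool    # whether the system reported a detection at this instant


def time_to_first_detection(samples: List[Sample]) -> float:
    """Seconds from the user's approach until the first positive detection."""
    return next((s.t for s in samples if s.detected), float("inf"))


def dropout_gaps(samples: List[Sample]) -> List[float]:
    """Durations of contiguous non-detection gaps after the first detection."""
    gaps: List[float] = []
    gap_start = None
    seen_first = False
    for prev, cur in zip(samples, samples[1:]):
        seen_first = seen_first or prev.detected
        if not seen_first:
            continue                      # user not detected yet; no dropout possible
        if prev.detected and not cur.detected:
            gap_start = cur.t             # detection just dropped out
        elif gap_start is not None and cur.detected:
            gaps.append(cur.t - gap_start)  # detection recovered; record the gap
            gap_start = None
    return gaps


def experiential_report(samples: List[Sample]) -> dict:
    """Summarize one session the way a user would describe it."""
    gaps = dropout_gaps(samples)
    return {
        "time_to_first_detection": time_to_first_detection(samples),
        "dropout_count": len(gaps),
        "longest_gap": max(gaps, default=0.0),
    }
```

Each session's report then answers the questions above directly: was `time_to_first_detection` under 2 seconds, how many dropouts occurred, and did `longest_gap` exceed 5 seconds.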
This approach allowed stakeholders to debate and prioritize trade-offs based on concrete, real-life scenarios. To make the methodology broadly accessible, I developed a Python framework that let anyone on the team, including field engineers and project managers, build these custom metrics using simple pseudo-code.
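The actual framework is not reproduced here, but its spirit was that a metric reads almost like the requirement it encodes. The `Metric` class, its `evaluate` method, and the session-report keys below are hypothetical stand-ins for illustration:

```python
# Hypothetical sketch of the kind of near-pseudo-code interface such a
# framework might expose; the Metric class and session keys are invented
# for illustration and are not the actual API.
class Metric:
    def __init__(self, name, rule):
        self.name = name
        self.rule = rule  # callable: one session report (dict) -> pass/fail

    def evaluate(self, sessions):
        """Fraction of sessions that satisfy the rule."""
        results = [bool(self.rule(s)) for s in sessions]
        return sum(results) / max(len(results), 1)


# A field engineer can state a requirement almost verbatim:
fast_first_detection = Metric(
    "detects the user within 2 seconds",
    rule=lambda session: session["time_to_first_detection"] <= 2.0,
)
no_long_outages = Metric(
    "no failure period longer than 5 seconds",
    rule=lambda session: session["longest_gap"] <= 5.0,
)
```

Because each rule is a one-line predicate over a plain session report (such as the one produced by `experiential_report` above), non-programmers could read, review, and propose metrics without touching the detection code.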
This framework empowered the entire team to conduct A/B tests and make decisions based on how the end user would perceive the system's behavior. We shifted the focus from optimizing abstract algorithms to delivering a predictably excellent user experience, and in the process created a powerful tool for validating that our simulations matched reality and for catching issues like model overfitting.
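As a rough illustration of how this supported A/B testing and simulation validation (again using the hypothetical `Metric` objects sketched above), comparing two builds, or simulation against field recordings, reduces to comparing the same user-facing numbers side by side:

```python
# Illustrative comparison of experiential metrics between two sources, e.g.
# model A vs. model B, or simulated sessions vs. field-recorded sessions.
# Reuses the hypothetical Metric objects from the sketch above.
def compare(metrics, sessions_a, sessions_b, label_a="A", label_b="B"):
    for m in metrics:
        a, b = m.evaluate(sessions_a), m.evaluate(sessions_b)
        flag = "  <-- investigate" if abs(a - b) > 0.05 else ""
        print(f"{m.name}: {label_a}={a:.1%}  {label_b}={b:.1%}{flag}")


# Example call (the session lists would come from real logs or the simulator):
# compare([fast_first_detection, no_long_outages],
#         simulated_sessions, field_sessions,
#         label_a="simulation", label_b="field")
```

A model that looks strong in simulation but slips on these per-session checks in the field is an early warning sign of the kind of overfitting mentioned above.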