Beyond Online Experimentation: Generative Software That Optimizes Itself

Microsoft
Martin Tingley is Head of Windows Experimentation at Microsoft and former Head of the Experimentation Platform Analysis Team at Netflix.

Delphina
Hugo Bowne-Anderson is an independent data and AI consultant with extensive experience in the tech industry. He hosts Vanishing Gradients, an industry podcast exploring developments in data science and AI. Previously, Hugo served as Head of Developer Relations at Outerbounds and held roles at Coiled and DataCamp, where his work in data science education reached over 3 million learners. He has taught at Yale University, Cold Spring Harbor Laboratory, and conferences like SciPy and PyCon, and is a passionate advocate for democratizing data skills and open-source tools.
Key Takeaways
Experimentation capability is no longer a competitive edge.
With the proliferation of vendor solutions, the ability to run an A/B test has become a commodity. True competitive advantage now comes from how an organization climbs what Tingley describes as a five-level experimentation maturity ladder: moving beyond basic hypothesis testing into automated generative optimization.
Success is the biggest trap for experimentation teams.
Most organizations are stuck at the second level of that ladder: shipping high-investment features based on individual hypotheses. Because these experiments work and get celebrated, teams don't notice that everything is just "okay" and that there's a better way.
Shift from testing variants to optimizing parameter spaces.
Level 3 requires a mental leap: stop viewing A/B testing as a scientific lab report and start viewing it as hill-climbing. Build optionality into every decision point and use iterative testing to optimize over that space, as in the sketch below.
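To make the hill-climbing framing concrete, here is a minimal Python sketch. The parameter space and `run_ab_test` are illustrative stand-ins (not anything discussed on the show): each round perturbs one decision point and promotes the challenger only if it beats the champion.

```python
import random

# Illustrative parameter space: every decision point becomes a tunable option.
PARAM_SPACE = {
    "cta_text": ["Start free trial", "Try it now", "Get started"],
    "rows_on_homepage": [8, 10, 12],
    "artwork_style": ["character", "scene", "logo"],
}

def run_ab_test(challenger):
    """Placeholder for launching a real challenger-vs-champion experiment;
    simulated here with a noisy lift so the loop runs end to end."""
    bonus = 0.02 if challenger["cta_text"] == "Get started" else 0.0
    return bonus + random.gauss(0.0, 0.01)

def hill_climb(start, rounds=20):
    champion = dict(start)
    for _ in range(rounds):
        # Propose a neighbor: change one decision point at a time.
        challenger = dict(champion)
        key = random.choice(list(PARAM_SPACE))
        challenger[key] = random.choice(PARAM_SPACE[key])
        # In practice: promote only on a statistically significant win.
        if run_ab_test(challenger) > 0:
            champion = challenger
    return champion

best = hill_climb({k: v[0] for k, v in PARAM_SPACE.items()})
```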
Humans are too expensive for micro-optimization.
At Level 4, organizations cede decision-making to machines via contextual bandits. Human product managers are a bottleneck for high-frequency, low-stakes decisions like artwork selection or email subject lines: those decisions only create business value when automated at scale.
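A toy epsilon-greedy bandit illustrates the mechanic. The arms, contexts, and reward signal here are assumptions; production systems typically use Thompson sampling or LinUCB with richer context features.

```python
import random
from collections import defaultdict

class EpsilonGreedyBandit:
    """Toy contextual bandit: tracks a running mean reward for each
    (context, arm) pair; explores with probability epsilon, otherwise
    exploits the best-known arm for the given context."""

    def __init__(self, arms, epsilon=0.1):
        self.arms = arms
        self.epsilon = epsilon
        self.counts = defaultdict(int)    # (context, arm) -> pulls
        self.values = defaultdict(float)  # (context, arm) -> mean reward

    def choose(self, context):
        if random.random() < self.epsilon:
            return random.choice(self.arms)  # explore
        return max(self.arms, key=lambda a: self.values[(context, a)])

    def update(self, context, arm, reward):
        key = (context, arm)
        self.counts[key] += 1
        # Incremental running-mean update.
        self.values[key] += (reward - self.values[key]) / self.counts[key]

# Pick a subject line per user segment; an email open counts as reward 1.
bandit = EpsilonGreedyBandit(["subject_a", "subject_b", "subject_c"])
arm = bandit.choose(context="new_user")
bandit.update(context="new_user", arm=arm, reward=1.0)
```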
Generative AI turns software into a self-optimizing system.
The Level 5 frontier is a closed loop: GenAI generates production-level variants, an experimentation platform evaluates them, and results feed back to generate better versions. Coframe is already doing this for Fortune 500 e-commerce companies, producing production-ready landing page variants in hours instead of weeks.
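The shape of that loop, sketched with stand-in functions (this is not Coframe's actual API; `generate_variants` and `evaluate` are placeholders for an LLM call and an experimentation platform):

```python
import random

def generate_variants(prompt, history, n=5):
    """Stand-in for an LLM call that drafts n production-ready variants,
    conditioned on which past variants won and lost."""
    return [f"{prompt} v{len(history) + i}" for i in range(n)]

def evaluate(variants):
    """Stand-in for the experimentation platform: run each variant
    against control and return its measured lift (simulated here)."""
    return {v: random.gauss(0.0, 0.01) for v in variants}

history = []
for generation in range(10):
    variants = generate_variants("landing page hero copy", history)
    results = evaluate(variants)
    history.extend(results.items())           # results feed the next generation
    champion = max(results, key=results.get)  # promote the winner
```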
Map experiments to product areas to inform strategy.
An "experimentation programs" concept Tingley and colleagues developed at Netflix: plot the distribution of treatment effects by product area. One team runs many small experiments with occasional wins: they should automate. Another team runs few experiments but finds high customer sensitivity: they need more throughput. This turns experiment-level data into a capital allocation tool.
Looking at the mean is not enough.
A/B tests can have wildly different results across user segments: power users, geographic regions, cost-conscious vs. premium customers. Only looking at the average hides what matters: always examine heterogeneous treatment effects.
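A minimal per-segment cut, with made-up data; a real analysis would add confidence intervals and correct for multiple comparisons:

```python
import pandas as pd

# Made-up experiment results: one row per user.
df = pd.DataFrame({
    "segment": ["power", "power", "power", "casual", "casual", "casual"],
    "group":   ["treat", "control", "control", "treat", "treat", "control"],
    "metric":  [1.00, 0.80, 0.85, 0.40, 0.45, 0.50],
})

means = df.groupby(["segment", "group"])["metric"].mean().unstack()
means["lift"] = means["treat"] - means["control"]
print(means)  # the overall average hides the opposite-signed lifts here
```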
Every failed experiment is a learning opportunity.
An experiment that "didn't work" overall may have worked for a subset of customers. It may have confused users in a way that reveals a new customer need. The culture of humility required for experimentation isn't just about accepting losses: it's about mining them for signal.
Respect your product's "permission to play."
The amount of change users will tolerate varies by product. Netflix users might find a new UI exciting. Windows users open their machine to get a task done fast: radical UI changes break mental models and trust. Experimentation velocity must match the product's core utility.
Incentivize shots on goal over perfect wins.
To democratize experimentation, shift incentives from rewarding "successful ships" to rewarding throughput and learning. Even if teams game the system by running more tests, the institutional capacity built by high-volume experimentation eventually surfaces non-obvious, high-impact winners that no one would have hypothesized.
You can read the full transcript here.
LINKS
- Martin on LinkedIn
- Want Your Company to Get Better at Experimentation? by Iavor Bojinov, David Holtz, Ramesh Johari, Sven Schmit and Martin Tingley (Harvard Business Review)
- Avoid the Pitfalls of A/B Testing by Iavor Bojinov, Guillaume Saint-Jacques and Martin Tingley (Harvard Business Review)
- Martin & Co.'s Seven Part Blog Series on Experimentation at Netflix
- Roberto Medri (Meta) on High Signal: The Incentive Problem in Shipping AI Products — and How to Change It
- Tim O’Reilly on High Signal: The End of Programming As We Know It
- Watch the podcast episode on YouTube
- Delphina's Newsletter