Episode 28

From Context Engineering to AI Agent Harnesses: The New Software Discipline

Lance Martin of LangChain joins High Signal to outline a new playbook for engineering in the AI era, where the ground is constantly shifting under builders' feet. He explains how the exponential improvement of foundation models is forcing a rethink of how software is built, and why top products from Claude Code to Manus are in a constant state of re-architecture simply to keep up. We dig into why the old rules of ML engineering no longer apply, and how Rich Sutton's "bitter lesson" dictates that simple, adaptable systems are the only ones that will survive. The conversation gives leaders a clear framework for the new discipline of context engineering to manage cost and reliability, the architectural power of the "agent harness" to expand capabilities without adding complexity, and why the most effective evaluation of these systems is shifting away from static benchmarks and toward dynamic, in-app user feedback.
November 12, 2025
Guest
Lance Martin

LangChain


Lance Martin is a software engineer with deep technical expertise in applied machine learning. One of the early hires at LangChain, he has focused primarily on the Python open-source library. Before joining LangChain, he served as a tech lead and manager working on computer vision for self-driving cars and trucks at UberATG, Ike, and Nuro. Lance holds a PhD from Stanford University, where his research centered on applied ML.

HOST
Hugo Bowne-Anderson

Delphina

Hugo Bowne-Anderson is an independent data and AI consultant with extensive experience in the tech industry. He is the host of the industry podcast Vanishing Gradients, a podcast exploring developments in data science and AI. Previously, Hugo served as Head of Developer Relations at Outerbounds and held roles at Coiled and DataCamp, where his work in data science education reached over 3 million learners. He has taught at Yale University, Cold Spring Harbor Laboratory, and conferences like SciPy and PyCon, and is a passionate advocate for democratizing data skills and open-source tools.


Key Takeaways

AI Engineering Works at a New Abstraction Layer.
The ML landscape has fundamentally shifted from a world where every organization trained its own specialized models (like in the self-driving era) to one defined by a few large foundation model providers. Most users now operate at a higher level of abstraction, focusing on prompt engineering, context management, and building agents rather than model architecture and training.

The Bitter Lesson Demands Constant Re-Architecture.
In the age of LLMs, applications are built on an exponentially improving primitive. This dictates that structures and assumptions baked into an architecture today will be made obsolete by tomorrow’s models, forcing continuous, aggressive re-architecture (e.g., one major agent product rebuilt five times in eight months) to avoid bottlenecking future performance.

Start Simple and Build for Verifiable Evaluation.
Lessons from traditional ML still apply: start with a simple prompt, move to a workflow, and only reach for an agent if the problem is truly open-ended. Evaluation remains critical, and systems should be designed around "Verifier's Law": tasks are easier to solve when their successful completion is easy to verify.
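
As a rough illustration (not code from the episode), here is a minimal Python sketch of designing around verifiability: accept a model's output only when a cheap, automatic check passes. The `call_llm` helper and the required keys are hypothetical stand-ins.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical model call; swap in your provider's client here."""
    raise NotImplementedError

def generate_config(task: str, max_attempts: int = 3) -> dict:
    """Start with a plain prompt and accept output only when a cheap, automatic check passes."""
    for _ in range(max_attempts):
        draft = call_llm(f"Return a JSON object with keys 'steps' and 'tests' for: {task}")
        try:
            parsed = json.loads(draft)
        except json.JSONDecodeError:
            continue  # failure is easy to detect, so it is easy to retry
        if {"steps", "tests"} <= parsed.keys():
            return parsed  # verifiable success: valid JSON with the required keys
    raise RuntimeError("No verifiable output after retries")
```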

Match the Problem: Workflows for Predictability, Agents for Autonomy.
System design should be intentional: Workflows are best for predefined, predictable steps (like running a test suite or a migration), ensuring consistency and repeatability. Agents, which allow the LLM to dynamically direct its own processes and tool usage, are reserved for open-ended, adaptive tasks like complex research or debugging.
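
As a rough illustration of the distinction (not code from the episode), here is a minimal Python sketch; `call_llm`, the tool functions, and the text protocol between steps are hypothetical stand-ins.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical model call; swap in your provider's client here."""
    raise NotImplementedError

# Workflow: the steps are fixed in code; the LLM only fills in content at each step.
def migration_workflow(repo_summary: str) -> str:
    plan = call_llm(f"List the modules using the deprecated API:\n{repo_summary}")
    patch = call_llm(f"Rewrite those call sites against the new API:\n{plan}")
    return call_llm(f"Summarize this patch for the changelog:\n{patch}")

# Agent: the LLM chooses the next tool on each turn until it decides it is done.
def research_agent(question: str, tools: dict, max_steps: int = 10) -> str:
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        decision = call_llm(
            "\n".join(history) + "\nReply 'tool_name: input' or 'FINISH: answer'."
        )
        name, _, arg = decision.partition(":")
        if name.strip() == "FINISH":
            return arg.strip()
        history.append(f"{name.strip()}({arg.strip()}) -> {tools[name.strip()](arg.strip())}")
    return "Stopped after reaching the step limit"
```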

Model Improvement Drives Agent Autonomy and Reliability.
Agents have become significantly more viable because frontier models are much better at instruction following, tool calling, and crucially, self-correction. This increase in LLM capacity means the length of tasks an agent can reliably accomplish is doubling approximately every seven months, making longer-horizon tasks possible.

Context Engineering: Reduce, Offload, and Isolate.
Managing the LLM's context window is vital for controlling costs, improving latency, and maintaining output quality, as performance can degrade (context rot) even with very large context windows. Strategies include Reduction (pruning/summarizing old messages), Offloading (saving data to a file system, or using Bash/CLI tools to expand the action space instead of binding numerous tools), and Isolation (using sub-agents for token-heavy tasks).
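
A minimal sketch of the first two strategies, reduction (summarizing older turns) and offloading (parking large tool output on disk and keeping only a pointer in context); `call_llm`, the thresholds, and the file path are hypothetical.

```python
from pathlib import Path

def call_llm(prompt: str) -> str:
    """Hypothetical model call; swap in your provider's client here."""
    raise NotImplementedError

def reduce_context(messages: list[str], keep_last: int = 6) -> list[str]:
    """Reduction: summarize older turns into one message; keep recent turns verbatim."""
    if len(messages) <= keep_last:
        return messages
    summary = call_llm("Summarize this conversation so far:\n" + "\n".join(messages[:-keep_last]))
    return [f"Summary of earlier turns: {summary}"] + messages[-keep_last:]

def offload_tool_result(result: str, workdir: Path, threshold: int = 4_000) -> str:
    """Offloading: park token-heavy output on disk and keep only a pointer in context."""
    if len(result) < threshold:
        return result
    path = workdir / "tool_output.txt"
    path.write_text(result)
    return f"(large output saved to {path}; open it with a file tool if needed)"
```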

Ambient Agents Require Thoughtful Human-in-the-Loop Design.
Asynchronous, or ambient, agents (like an email triage system running in the background) are an emerging form factor, but their higher autonomy introduces risk. They must be designed with careful human-in-the-loop checkpoints to prevent them from getting stuck in long, off-track sequences, and should incorporate a memory system to learn user preferences from ongoing feedback.
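
A minimal sketch of one such checkpoint, pausing before consequential actions; the action names, `tools` dict, and `ask_user` callback are hypothetical, not a specific framework's API.

```python
RISKY_ACTIONS = {"send_email", "delete_message"}

def ambient_step(action: str, arg: str, tools: dict, ask_user) -> str:
    """Run one agent-chosen action, but pause for human approval on consequential ones."""
    if action in RISKY_ACTIONS:
        verdict = ask_user(f"Agent wants to run {action}({arg!r}). approve / edit / reject?")
        if verdict != "approve":
            # A memory system could store this as a preference to avoid repeating the mistake.
            return f"User intervened: {verdict}"
    return tools[action](arg)
```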

Protocols Drive Standardization in the LLM Ecosystem.
The rapid proliferation of custom tools and endpoints has led to the emergence of unifying standards like the Model Context Protocol (MCP). The adoption of such protocols and robust frameworks (like LangGraph) is crucial in large organizations to provide a common, well-supported standard for connecting tools, context, and prompts, improving security and developer efficiency.
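
For a sense of what the standard looks like in practice, here is a minimal tool-server sketch using the MCP Python SDK's FastMCP helper; the server name and the `search_docs` tool body are placeholders.

```python
from mcp.server.fastmcp import FastMCP

# A single MCP server can expose internal tools to any MCP-compatible client or agent.
mcp = FastMCP("internal-docs")

@mcp.tool()
def search_docs(query: str) -> str:
    """Search internal documentation and return the top matching snippet."""
    # Placeholder: wire this to a real search backend.
    return f"No backend configured; received query: {query}"

if __name__ == "__main__":
    mcp.run()
```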

Evaluation Must Be Dynamic and Component-Driven.
Static benchmarks are quickly saturated by rapidly improving models. Effective evaluation now relies on aggressive "dogfooding," capturing in-app user feedback, inspecting raw execution traces, and rolling new failure cases into dynamic eval sets. Additionally, system quality is improved by setting up separate evaluations for sub-components, such as the retrieval step in a RAG system.
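
A minimal sketch of both ideas, rolling in-app failures into an eval set and scoring the retrieval step on its own; the file path, feedback hook, and retriever interface (returning document ids) are hypothetical.

```python
import json
from pathlib import Path

EVAL_SET = Path("evals/cases.jsonl")

def log_feedback(query: str, answer: str, thumbs_up: bool) -> None:
    """Roll in-app failures into the eval set so new models are tested on yesterday's misses."""
    if not thumbs_up:
        EVAL_SET.parent.mkdir(parents=True, exist_ok=True)
        with EVAL_SET.open("a") as f:
            f.write(json.dumps({"input": query, "bad_output": answer}) + "\n")

def eval_retrieval(retriever, cases: list[dict]) -> float:
    """Component-level eval: score retrieval separately from the final generated answer."""
    hits = sum(
        any(doc_id == case["expected_doc"] for doc_id in retriever(case["input"]))
        for case in cases
    )
    return hits / len(cases)
```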

Avoid the Rush to Fine-Tune; Frontier Models Catch Up.
Leaders should be wary of rushing straight into model training or fine-tuning. Frontier models advance so quickly that capabilities which required custom fine-tuning yesterday (like generating high-quality structured output) are often built into the general models today, so the custom effort risks being wasted.

You can read the full transcript here.
