From Context Engineering to AI Agent Harnesses: The New Software Discipline

LangChain
Lance Martin is a software engineer with deep technical expertise in applied machine learning. One of the early hires at LangChain, he has focused primarily on the Python open-source library. Before joining LangChain, he served as a tech lead and manager working on computer vision for self-driving cars and trucks at UberATG, Ike, and Nuro. Lance holds a PhD from Stanford University, where his research centered on applied ML.

Delphina
Hugo Bowne-Anderson is an independent data and AI consultant with extensive experience in the tech industry. He hosts Vanishing Gradients, an industry podcast exploring developments in data science and AI. Previously, Hugo served as Head of Developer Relations at Outerbounds and held roles at Coiled and DataCamp, where his work in data science education reached over 3 million learners. He has taught at Yale University, Cold Spring Harbor Laboratory, and conferences such as SciPy and PyCon, and is a passionate advocate for democratizing data skills and open-source tools.
Key Takeaways
AI Engineering Works at a New Abstraction Layer.
The ML landscape has fundamentally shifted from a world where every organization trained its own specialized models (like in the self-driving era) to one defined by a few large foundation model providers. Most users now operate at a higher level of abstraction, focusing on prompt engineering, context management, and building agents rather than model architecture and training.
The Bitter Lesson Demands Constant Re-Architecture.
In the age of LLMs, applications are built on an exponentially improving primitive. This dictates that structures and assumptions baked into an architecture today will be made obsolete by tomorrow’s models, forcing continuous, aggressive re-architecture (e.g., one major agent product rebuilt five times in eight months) to avoid bottlenecking future performance.
Start Simple and Build for Verifiable Evaluation.
Lessons from traditional ML still apply, emphasizing that simplicity is essential—use a simple prompt, then a workflow, and only move to an agent if the problem is truly open-ended. Evaluation remains critical, and systems should be designed around "Verifier's Law," meaning tasks are easier to solve if their successful completion is easily verifiable.
Match the Problem: Workflows for Predictability, Agents for Autonomy.
System design should be intentional: Workflows are best for predefined, predictable steps (like running a test suite or a migration), ensuring consistency and repeatability. Agents, which allow the LLM to dynamically direct its own processes and tool usage, are reserved for open-ended, adaptive tasks like complex research or debugging.
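
To make the distinction concrete, here is a minimal Python sketch. The call_llm helper and the tools are hypothetical stand-ins for whatever model SDK and tooling you actually use; the point is that the workflow fixes the steps in code, while the agent lets the model choose its next action in a loop.

# Sketch only: call_llm and the tools below are hypothetical stand-ins
# for whatever model SDK and tools you actually use.

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-model call; swap in your provider's SDK."""
    return "FINISH: placeholder model response"

# Workflow: predefined, predictable steps (e.g., a migration or test run).
def migration_workflow(repo_diff: str) -> str:
    plan = call_llm(f"Summarize the changes needed for this diff:\n{repo_diff}")
    patch = call_llm(f"Write the code changes for this plan:\n{plan}")
    review = call_llm(f"Review this patch for errors:\n{patch}")
    return review  # each step and its order are fixed by the developer

# Agent: the LLM directs its own next action, for open-ended work like debugging.
TOOLS = {
    "search_code": lambda query: f"results for {query}",  # hypothetical tools
    "run_tests": lambda _: "3 failures",
}

def debugging_agent(task: str, max_steps: int = 10) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        decision = call_llm(
            "Pick the next tool (search_code, run_tests) as 'name: arg', "
            "or reply FINISH with your answer.\n" + "\n".join(history)
        )
        if decision.startswith("FINISH"):
            return decision
        name, _, arg = decision.partition(":")
        result = TOOLS.get(name.strip(), lambda a: "unknown tool")(arg.strip())
        history.append(f"{decision} -> {result}")
    return "Stopped after max_steps without finishing"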
Model Improvement Drives Agent Autonomy and Reliability.
Agents have become significantly more viable because frontier models are much better at instruction following, tool calling, and crucially, self-correction. This increase in LLM capacity means the length of tasks an agent can reliably accomplish is doubling approximately every seven months, making longer-horizon tasks possible.
Context Engineering: Reduce, Offload, and Isolate.
Managing the LLM's context window is vital for controlling costs, improving latency, and maintaining output quality, as performance can degrade (context rot) even with very large context windows. Strategies include Reduction (pruning/summarizing old messages), Offloading (saving data to a file system, or using Bash/CLI tools to expand the action space instead of binding numerous tools), and Isolation (using sub-agents for token-heavy tasks).
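
A rough sketch of the three strategies in plain Python; summarize, run_subagent, and the file layout are hypothetical placeholders rather than any particular framework's API.

import json
from pathlib import Path

# Sketch only: these helpers are hypothetical stand-ins, not a framework API.

def summarize(text: str) -> str:
    """Placeholder for an LLM summarization call."""
    return "summary: " + text[:100]

def run_subagent(question: str) -> str:
    """Placeholder for a separate agent run with its own isolated context."""
    return "many thousands of tokens of raw research findings..."

# 1. Reduce: prune or summarize old turns once the message history grows long.
def reduce_context(messages: list[str], keep_last: int = 10) -> list[str]:
    if len(messages) <= keep_last:
        return messages
    return [summarize("\n".join(messages[:-keep_last]))] + messages[-keep_last:]

# 2. Offload: write bulky tool output to the file system and keep only a
#    pointer in context; the agent can re-read it later with a file/Bash tool.
def offload(tool_output: str, workdir: Path = Path("scratch")) -> str:
    workdir.mkdir(exist_ok=True)
    path = workdir / "tool_output.json"
    path.write_text(json.dumps({"data": tool_output}))
    return f"Full output saved to {path}"

# 3. Isolate: run a token-heavy subtask in a sub-agent and return only the
#    distilled result to the main agent's context window.
def isolated_research(question: str) -> str:
    return summarize(run_subagent(question))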
Ambient Agents Require Thoughtful Human-in-the-Loop Design.
Asynchronous, or ambient, agents (like an email triage system running in the background) are an emerging form factor, but their higher autonomy introduces risk. They must be designed with careful human-in-the-loop checkpoints to prevent them from getting stuck in long, off-track sequences, and should incorporate a memory system to learn user preferences from ongoing feedback.
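
As an illustration, here is a minimal, hypothetical email-triage loop: the agent pauses at a checkpoint before the irreversible action and folds rejections into a simple preference memory. Names like draft_reply and send are placeholders, not a specific product's API.

from pathlib import Path

MEMORY_FILE = Path("user_preferences.txt")  # simple stand-in for a memory store

def draft_reply(email: str) -> str:
    """Placeholder for the LLM call that drafts a triage action or reply."""
    return f"Draft reply to: {email[:40]}..."

def send(message: str) -> None:
    """Placeholder for the irreversible action (actually sending the email)."""
    print("sent:", message)

def triage(email: str) -> None:
    preferences = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""
    draft = draft_reply(preferences + "\n" + email)

    # Human-in-the-loop checkpoint: pause before the irreversible action
    # rather than letting the ambient agent act fully autonomously.
    answer = input(f"{draft}\nSend this reply? [y/n/edit]: ")
    if answer == "y":
        send(draft)
    else:
        # Treat the correction as feedback and fold it into memory so the
        # agent learns the user's preferences over time.
        with MEMORY_FILE.open("a") as f:
            f.write(f"Feedback on draft for '{email[:40]}': {answer}\n")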
Protocols Drive Standardization in the LLM Ecosystem.
The rapid proliferation of custom tools and endpoints has led to the emergence of unifying standards like the Model Context Protocol (MCP). The adoption of such protocols and robust frameworks (like LangGraph) is crucial in large organizations to provide a common, well-supported standard for connecting tools, context, and prompts, improving security and developer efficiency.
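
For a flavor of what standardizing on MCP looks like, here is a minimal tool server in the shape of the official MCP Python SDK's FastMCP quickstart; the search_docs tool is a hypothetical example, and import paths may shift as the SDK evolves, so check the current docs.

# Minimal MCP server sketch, assuming the official `mcp` Python SDK's
# FastMCP helper; verify names against the current SDK documentation.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-docs")

@mcp.tool()
def search_docs(query: str) -> str:
    """Search internal documentation and return matching snippets."""
    # Hypothetical backend call; replace with your real search index.
    return f"Top results for: {query}"

if __name__ == "__main__":
    mcp.run()  # exposes the tool over MCP so any compliant client can call it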
Evaluation Must Be Dynamic and Component-Driven.
Static benchmarks are quickly saturated by rapidly improving models. Effective evaluation now relies on aggressive "dogfooding," capturing in-app user feedback, inspecting raw execution traces, and rolling new failure cases into dynamic eval sets. Additionally, system quality is improved by setting up separate evaluations for sub-components, such as the retrieval step in a RAG system.
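
As a small sketch of component-level evaluation, here is a recall@k check over the retrieval step of a RAG system; EVAL_CASES and retrieve are hypothetical stand-ins, and each new failure surfaced in traces or user feedback can simply be appended to the eval set.

# Sketch of a component-level eval for the retrieval step of a RAG system.
# `retrieve` and the labeled cases are hypothetical stand-ins.

EVAL_CASES = [
    # Each new failure found in traces or user feedback gets appended here.
    {"query": "How do I rotate API keys?", "relevant_doc_id": "security-017"},
    {"query": "What is our refund policy?", "relevant_doc_id": "billing-002"},
]

def retrieve(query: str, k: int = 5) -> list[str]:
    """Placeholder for the real retriever (vector store, BM25, etc.)."""
    return ["security-017", "onboarding-001"][:k]

def recall_at_k(k: int = 5) -> float:
    hits = sum(
        case["relevant_doc_id"] in retrieve(case["query"], k)
        for case in EVAL_CASES
    )
    return hits / len(EVAL_CASES)

if __name__ == "__main__":
    print(f"recall@5 = {recall_at_k(5):.2f}")  # track this number per change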
Avoid the Rush to Fine-Tune; Frontier Models Catch Up.
Leaders should be wary of immediately rushing into model training or fine-tuning. The rapid advancements in frontier models mean that capabilities that required custom fine-tuning yesterday (like generating high-quality structured output) are often integrated into the general models today, risking wasted time and effort.
You can read the full transcript here.