Welcome modal showing purpose-built AI workspace for freelancers

AI UX

A Chat Bubble Is Not a Workflow

Designing the conversation layer for an AI teammate doing real work

Role

Head of Product Design

Duration

Ongoing - pattern model through design partner v1

Team

Founding team (~10), Product, Design, Data, Engineering

Status (Apr 2026): CUX model deployed across freeform & curated workflow paths - in active iteration with 25 design partners

“This case study is the deeper companion to The Manager Seat, which covers the broader product strategy. Here, I want to walk through one specific design decision: what the conversation surface should look like when AI stops only answering questions and starts doing real work.”

The moment chat broke

Early on, the AI teammate lived inside a conversational interface - a familiar chat bubble, similar to ChatGPT or Claude. It answered questions well. Users liked it.

Then a design partner asked it to run a content gap analysis on their site against three competitors.

The chat bubble tried.

It produced a long stream of messages. It scrolled and scrolled. Halfway through, the user lost track of which competitor they were on. The output came back as a single wall of text in a chat history. The user couldn't find it again the next morning. They had to ask for it again, and the AI started over.

We watched this happen in three separate partner sessions in the same week.

The chat bubble was not the wrong shape for answering. It was the wrong shape for doing.

The AI teammate workspace showing structured interface — The AI teammate workspace: a structured interface for planning, executing, and tracking workflow progress.

What I owned

I led design for the CUX model - the interaction layer that sits between a user's request and a useful result.

Concretely:

•The CUX state model (intent → clarification → plan → approval → execution → output → next step)

•The three-path classification (freeform discussion vs. freeform execution vs. curated workflow)

•The plan-before-action pattern and its UI

•The progress disclosure model (live status, execution stages, step reasoning)

•The human-in-the-loop checkpoint pattern

•The output-as-artifact model

•The failure-state design system

I worked closely with the founding PM, the engineering lead, and a small data team. The hardest debates were with engineering - every checkpoint I proposed added latency or complexity. Some of those debates I lost on the first pass and we revisited later.

The design challenge in one sentence

How much of the AI's work should the interface reveal?

Too little, and the product feels like a black box. Too much, and the user becomes an unpaid debugger.

The CUX model is a series of design decisions about exactly how much to show, when to pause, and what shape the work takes on screen.

The CUX state model

The first design move was to stop treating every prompt the same way.

In a normal chatbot, every message goes through the same pipe: user asks, AI answers. That works for a conversation. It breaks for a workflow that needs scope, plan, approval, execution, and output.

The progression we built around:

Intent

User

Free text or workflow trigger

Clarification

Targeted questions, only when needed

Plan

What it understood, what it will do

Approval

User

Continue, modify, or stop

Execution

Stages, step reasoning, live status

Output

Artifact in side panel, handoff in chat

Next step

User or AI

Refine, save, share, schedule

Seven stages. Not every prompt walks through all of them - but the model is the same.

Three paths feed into this model:

Freeform discussion

When it triggers

Casual question, no execution required

What the UI does

Stays as chat. No plan, no heavy UI.

Freeform execution

When it triggers

User-typed request that needs structured work

What the UI does

AI proposes a plan; user approves before execution starts.

Curated workflow

When it triggers

User starts a known workflow from the library

What the UI does

Context is pre-framed; execution still needs confirmation.

Three paths into the same model. Not every message becomes a workflow.

The point of the classification was to stop overdoing it. Not every message needs to become a workflow. Sometimes the best AI experience is knowing when not to make a big deal of things.

Five design decisions

Five decisions where the obvious answer was wrong, or where we revised after testing. These shaped the surface.

1. Plan before action

The strongest trust pattern was also the simplest.

When a request required execution, the AI teammate didn't immediately start. It proposed a plan first.

The plan explained:

•What it understood

•What it was going to do

•What areas it would cover

•What output to expect

Then the user could continue, modify, or stop. Execution never starts in the same message as the plan.

This sounds obvious in retrospect. In the v1 build it wasn't - the engineering instinct was to start work immediately and stream results, because latency mattered. I argued for the pause and lost the argument the first time. We shipped without it. Within two weeks, design partner feedback was unambiguous: "I want to know what it's about to do before it does it." We added the plan checkpoint and the perception of speed actually improved, because users stopped feeling like the AI was racing ahead of them.

Clarifying questions shape the task before execution, reducing guesswork while keeping the flow lightweight and conversational.

2. Progress is not a spinner

Long-running work needs visibility. But visibility doesn't mean showing everything.

We used three layers of progress disclosure:

•Live status - a short signal: "Identifying competitors..."

•Execution stages - high-level workflow phases: Technical Audit, Competitor Identification, Keyword Gap Analysis, Report Generation

•Step reasoning - short, teammate-style explanations inside each stage

The first version showed everything: every model call, every tool invocation, every intermediate result. Partners told us it felt like reading server logs. The product was transparent - and useless.

The revision: progress is a layered surface. The default view is a clean stage list with the current one highlighted. Users can expand a stage to see step reasoning. They can expand further to see raw tool calls if they need to debug. Three layers, opt-in deeper, never forced.

“Logs are transparent, technically. So is a glass wall. Neither is automatically useful.”

The AI teammate proposes a plan, waits for approval, then executes in visible stages so users can follow, guide, and trust the work.

3. Outputs are artifacts, not chat messages

A conversation is a poor final destination for professional work.

Users need something they can inspect, copy, refine, share, and act on. The chat thread is not that thing - it's an event log.

So the output doesn't live as the last message in a chat thread. It appears as an artifact in a right-side panel, with the final chat message acting as a handoff: what was produced, why it matters, what to do next.

“The chat explains the work. The artifact is the work.”

This was the design call with the most downstream effect. Once outputs became artifacts, they could be:

•Re-opened from the workspace later (not buried in scrollback)

•Edited, exported, or shared independently

•Fed as input into another workflow

•Versioned

Conversation and output artifact showing the final AI teammate message and structured output — The conversation explains the work, while the structured output surfaces the result as a usable artifact alongside it.

4. Checkpoints are designed, not error states

AI workflows often need human input mid-execution: a tool needs authorisation, a dataset needs to be picked, a missing input needs to be supplied.

The first version surfaced these as errors. "Workflow paused. Authentication required." It worked, technically. It also made every workflow feel fragile.

We rebuilt these as checkpoints - first-class moments in the UI, not error states. The current stage pauses. A sticky checkpoint card appears with a clear action: Connect, Authorize, Select, Upload, Continue. The workflow resumes when the user acts.

The framing matters. An error feels like the system broke. A checkpoint feels like a natural pause point in real work.

5. Failure needs design too

This is the section I'd point to first if I were showing this case study to a senior designer.

AI workflows fail in ordinary ways. A data source is missing. A scraper returns partial results. A tool times out. A user didn't supply enough context. The model produces an output the Lead doesn't trust.

Most AI products treat failure as infrastructure noise - a generic error message, a retry button, an apology. We tried that first. It was the single most damaging pattern we shipped, in terms of design partner trust.

The redesign separated failure into two axes:

Failure type:

•Non-blocking - the workflow can continue, with reduced coverage or a modified approach

•Blocking - the workflow cannot continue without user action or required data

Response strategy (chosen by the AI, surfaced to the user):

•Continue as planned

•Continue with partial coverage (state explicitly what's missing)

•Replan and continue (show the new plan, get re-approval)

•Stop execution (explain what was learned, what to try next)

Each combination has a designed UI state. The user sees what happened, why it matters, and what comes next - not "something went wrong."

Partial coverage

Missing data source

Workflow continues with the partner data unavailable. Output explicitly notes the gap so the reader knows what's missing.

Non-blocking

Blocking checkpoint

Auth failure

Workflow pauses on a sticky checkpoint. User reconnects the integration; workflow resumes inline without restart.

Blocking

Replan and continue

Tool timeout

AI proposes an alternative tool, shows the revised plan, and requests re-approval before continuing.

Non-blocking

Stop execution

Insufficient context

Workflow halts before wasted execution. AI explains what's missing and how to retry with more context.

Blocking

Partial coverage

Partial scraper result

Output ships with an inline note: which competitors were covered, which were skipped, and why.

Non-blocking

Stop execution

Hallucination caught

Lead-layer review rejects an output before it ships. User sees the rejection reason and a path forward.

Blocking

Six failure shapes, each with a designed response. Every card shows whether execution blocks or continues.

“A good AI experience does not pretend failure will never happen. It shows what happened, why it matters, and what comes next.”

What's measurable today

Live numbers - Apr 2026

·25 design partners onboarded
·18 weekly-active partners (~72% of cohort)
·5 heavy users averaging ~2 hours/day
·~45 minutes average daily usage among weekly-active partners
·7 low-usage partners being evaluated against ICP fit

Q2 / Q3 2026 Targets

·78% accuracy for prompt-based responses (model + Lead-reviewed)
·≤12% of responses classified as hallucinations after Lead review
·65% of free-text queries receiving appropriate routing (chat vs. workflow)
·80% of failure states recovered without abandonment
·60%+ of weekly-active partners crossing into heavy-user threshold

Hallucination-rate audits, accuracy by workflow type, and longer-cycle retention land in Q2 / Q3 2026.

Outcome

The CUX model is now a reusable interaction layer for any AI execution work the product takes on. Adding a new workflow doesn't require rethinking how plans, progress, checkpoints, outputs, or failures should look - those are settled.

It also gave the team a clearer product language:

•When the AI should answer directly

•When it should ask for clarification

•When it should propose a plan

•When execution should pause

•How progress should be shown

•How tools should be referenced

•Where outputs should live

•How failures should be handled

•How next steps should be suggested

That language is now the spine of how new product surfaces are designed inside the workspace.

What I'd do differently

The plan-before-action pattern should have shipped from day one. I lost that argument in the first build because of latency concerns, and we paid for it in two weeks of confused user feedback. I should have insisted harder, or shipped a thinner version that proved the value before backing down.

The failure-state library should be design-system-level, not product-level. Right now failure states live in the product codebase. They should be tokenised, named, and reusable across any AI feature anywhere in the company. That's the project for Q2.

I'd test the three-path classification with users earlier. We named the paths internally before we tested whether users perceived the distinction at all. They mostly didn't - and the v1 routing logic over-classified things as workflows. The classification is now ~85% right, but it took two iterations.

Reflection

Designing AI products changes the meaning of simplicity.

Sometimes simplicity means fewer steps. Sometimes it means one more checkpoint. Sometimes it means showing the plan before hiding the complexity. Sometimes it means letting the system stop politely before doing something stupid.

That last one is underrated.

The interface for AI work cannot be only conversational. It has to be conversational, structured, inspectable, and interruptible.

A chat bubble can answer a question. A real AI teammate needs a way to clarify intent, propose a plan, show progress, ask for help, recover from failure, and hand over a useful result.

That's the difference between a conversation and a workflow.

Previous case study

Next case study

Back Home

AI UX

A Chat Bubble Is Not a Workflow

Designing the conversation layer for an AI teammate doing real work

Role

Head of Product Design

Duration

Ongoing - pattern model through design partner v1

Team

Founding team (~10), Product, Design, Data, Engineering

Status (Apr 2026): CUX model deployed across freeform & curated workflow paths - in active iteration with 25 design partners

“This case study is the deeper companion to The Manager Seat, which covers the broader product strategy. Here, I want to walk through one specific design decision: what the conversation surface should look like when AI stops only answering questions and starts doing real work.”