Kirtan Soni
← All writing

The family jewelry ERP learned to chat in one night

My family's jewelry store runs on an old ERP backed by SQL Server. Every bill, every quotation, every tag of stock is in there — and getting any of it out means clicking through fixed reports or finding someone who writes T-SQL. One night in May I built the third option: a chat window. The first commit landed at 1:48am; by 3:55am the agent had a pipeline for grading its own conversations. Here's what got built that night, and what real use broke afterward.

Two fixed reports and a guy who knows T-SQL

The ERP is JSoft Extreme, and the version the agent talks to is a restored backup running in Docker on my machine — which matters, because it means the agent works against a real, messy, production-shaped dataset, not a toy schema. The users are my family, who ask questions like "who sold the most this month?" and "why didn't last week's quotations convert?" — in a store that runs in Gujarati, which is why there's a toggle that tells the agent to answer in it.

The shape is two processes and a proxy. A Next.js frontend and a LangGraph JS server run side by side; the browser never talks to LangGraph directly — everything goes through a Next.js edge proxy, so the frontend only ever sees /api/*.

request Browser Next.js /api proxy LangGraph :2024
agent loop StateGraph 5 tools MSSQL (Docker)
observability OpenTelemetry Phoenix :6006 daily eval

The agent itself is deliberately boring: a two-node StateGraphagent → tools → agent — that loops until the model stops emitting tool calls. That's the part that took the least time all night. Everything interesting happened in the tools, the prompt, and the plumbing around them — starting with the part where the model was confidently wrong about my own store.

It answered "yesterday" with year-old data

By the time the initial commit landed at 1:48am, the system prompt already carried scar tissue — the commit message lists a "corrected salesman join pattern, HSN code mapping, and dynamic date injection." Each of those was a wrong answer first. The model assumed it was 2025 and cheerfully answered "yesterday" with year-old data. The salesperson leaderboard came out wrong because the obvious column is a lie. The traps, as the prompt now records them:

TrapWhat the prompt says
Wrong yearToday's date is injected at call time. Without it the model happily assumed it was 2025.
Financial yearApril 1 – March 31, with explicit FY ranges spelled out. "Compare FY 2024-25 vs 2025-26" is the most common question in a family business.
NULL SalesmanIDtrnSales.SalesmanID is NULL for most rows. The authoritative field is on the line items — the prompt includes the exact three-table join so salesperson leaderboards come out right the first time.
HSN code liesIn this store's data, HSN 711311 is used for silver, even though the code nominally covers gold alloys. An agent reasoning from general HSN knowledge would mislabel it as 18K gold.
Quotation remarksThe reason a quotation didn't convert ("rate issue", "will come later") hides in free-text Remarks columns. The prompt tells the agent to always fetch them for conversion analysis.

None of this is discoverable from the schema. It's tribal knowledge extracted from sessions that went wrong, written down where the model can't miss it — the same job a senior analyst's onboarding notes would do for a human. The prompt also documents key columns for the core tables, which is why the agent rarely needs a schema tool at all. But it does have tools.

Five tools, one gate

query_database

SELECT-only T-SQL. Default 10 rows, hard cap 200. Injects TOP N into the query so SQL Server never sends more than the cap over the wire.

get_schema_snapshot

Column names for the 12 core tables, cached in-process. The tool description itself warns the model: "use sparingly — the result enters conversation context."

describe_table

Columns + types for a single table, via a parameterized INFORMATION_SCHEMA query. The cheap option when one unknown table comes up.

list_tables

Every table in the database. Mostly a fallback — the prompt documents the tables that matter.

render_chart

Doesn't render anything on the server. Validates payload.type and returns the payload as JSON — the frontend does the rest.

Honesty about the write-protection: it's three string checks — must start with SELECT, no second statement after a semicolon, no SELECT … INTO — not a SQL parser. That's fine for its actual threat model (a well-meaning model writing bad SQL, used by my family on my hardware), but the proper fix is a read-only database login, which is still on the list. Right now the connection uses the admin account, so the tool gate is the only gate.

Charts appear between the sentences

The part people react to: ask for a dashboard and one appears in the chat, between the agent's sentences. The trick is that render_chart is a tool whose "result" is just JSON.stringify(payload). The chat UI (built on assistant-ui) routes render_chart tool calls to a ChartRenderer component instead of the generic tool-call fallback; it parses the JSON, checks payload.type against five known types, and mounts the matching React component.

query_database agent aggregates render_chart(payload) JSON in message ChartRenderer React + D3 SVG

Five payload types: bar, grouped-bar, hourly (a dual line chart of quotes vs bills by hour), kpis (a tile grid with change percentages), and dashboard (KPIs plus a chart grid in one card). The components use D3 strictly for math — scaleBand, scaleLinear, max — and React renders the SVG declaratively. No useEffect DOM mutation, no D3 selections fighting React for ownership of the tree.

Decision — the model formats the numbers. KPI values are pre-formatted strings ("₹11.4L", "46.2%"), not raw numbers. Indian number formatting — lakhs and crores — is genuinely annoying to do generically in the component, and the model already knows the context ("is this rupees or a count?"). The schema pushes that judgment to the caller and keeps the chart components dumb.

2:26am: I needed to watch it think

Forty minutes after the first commit, the second one added tracing — because debugging an agent by reading its chat output is like debugging a program by reading its stdout. The entire setup is one 21-line file. backend/instrumentation.ts registers an OTLP exporter pointed at a self-hosted Phoenix instance (running in the same Docker Compose as the database) and patches LangChain's callback manager via OpenInference's LangChainInstrumentation. It's imported as the first line of the agent module, so it's active before any LangGraph code loads. Every agent invocation, tool call, and LLM request shows up as a linked trace in Phoenix within seconds — full SQL text, full tool results, token counts per call.

By 3:00am Phoenix was more than a trace viewer: the system prompt itself moved into Phoenix's prompt registry.The 3:00am version kept a graceful local fallback if Phoenix was unreachable. The 3:15am version deleted it — a fallback that silently serves a stale prompt is worse than a crash. Phoenix is infra now, same as the database. The backend fetches the named prompt once per process and fills in {{today}}-style variables at call time, which means I can edit and version the prompt — the highest-leverage file in the project — in the Phoenix UI without redeploying the agent.

3:55am: it grades its own homework

The last thing built that night was the thing that watches everything else. Every conversation produces traces; a script run daily turns them into grades. It pulls yesterday's LLM spans from Phoenix, groups them by trace, and scores each conversation two ways.

The first score is pure code — a weighted composite of four signals:

SignalWeightHow it's scored
Token efficiency0.401.0 under an 8K-token budget, linear decay to 0.0 at 30K
LLM-call efficiency0.251.0 for a single call, −0.2 per extra loop iteration
SQL correctness0.20−0.25 per tool result that came back Error:
Schema sufficiency0.15−0.3 per mid-session schema lookup — the prompt should have been enough

The second score is an LLM judge — Haiku, cheap, with structured output (score, label, reason) — that reads only the user's question and the final answer and judges whether the answer actually addresses the question.The judge never sees the SQL or the tool results. It grades the answer, not the work — which is the point, and also why it can be fooled by a confident wrong answer the code score happens to like. Both scores are written back to the trace's root span as Phoenix annotations, so the trace list becomes a sortable quality dashboard. Traces with composite ≥ 0.65 that the judge didn't mark incorrect get appended to a Phoenix dataset — a steadily growing regression set built from real usage, for free.

npm run eval:daily
Evaluating traces from 2026-06-09T18:30Z → 2026-06-10T18:30Z
Found 41 LLM spans across 12 traces.
 
  trace a3f81c02 | efficient 0.91 | tokens=5,840 | calls=2 | errors=0 | quality=correct
  trace 7c20e9b4 | degraded 0.62 | tokens=14,210 | calls=4 | errors=1 | quality=correct
  trace f0d2417a | expensive 0.38 | tokens=27,950 | calls=6 | errors=2 | quality=partial
  ...
 
Done. 12 traces annotated, 7 added to dataset.

(Illustrative output — the shape and thresholds are real, the trace IDs and counts above are not from the live database.)

Decision — efficiency and quality are separate annotations. A cheap wrong answer and an expensive right one fail differently. Folding the judge's quality score into the composite would have hidden exactly the cases I care about — so efficiency is a CODE annotation, quality is an LLM annotation, and the dataset filter requires both.

3pm: the bill for the night before

Then I slept, used it, and looked at the traces. Agent loops over a database are token-hungry: a long system prompt plus growing tool results, re-sent on every iteration. The afternoon commit of the same day was entirely about cost, and two changes did most of the work.

First, prompt-cache breakpoints. The system prompt block is marked with cache_control: ephemeral, and before each model call a helper walks the message history backwards, finds the last substantial tool result (over 100 characters), and marks that block too. On the next turn, everything up to the breakpoint is served from the Anthropic prompt cache — cache reads are billed at $0.30/M tokens versus $3/M for fresh input, so the bulk of a long thread costs a tenth of what it would otherwise.

Decision — make waste impossible, not discouraged. The first version asked the model nicely to limit rows. The afternoon version rewrites SELECT … into SELECT TOP N … before execution, and the result comes back in a compact {cols, rows} format — column names once, value arrays per row — about 3–4× smaller than the usual array-of-objects. Every row a tool returns sits in context for the rest of the thread, so the format is the cost control. The schema snapshot got the same treatment: column types dropped, and one 80-column table evicted from it entirely.

Second, the cost is visible. The meter had been there since the first commit; the afternoon made its number honest. Every AI message carries usage_metadata; a React context sums input, output, cache-write and cache-read tokens across the thread and prices them with the real per-bucket rates. A small pill in the composer shows the running cost in rupees.At a hardcoded 85 rupees to the dollar. The meter is a behavioral nudge, not an accounting system — if the exchange rate drifts 5%, the amber threshold drifts with it and nobody is harmed. It's grey while the thread is cheap, turns amber past about ₹25, and at about ₹64 it goes red and grows a "New chat →" button — because by that point the context is bloated and starting fresh is genuinely cheaper than continuing.

Decision — nudge, don't block. The meter never stops you from sending another message. It just makes the marginal cost of a bloated thread visible at the exact moment you're about to pay it, and puts the escape hatch one click away. That turned out to be enough — long threads stopped happening once the number was on screen.

Three days later, real use found the race

The meter is also where reality landed the first punch. On May 15 I noticed that switching threads while a slow response was still in flight would let the old thread's usage data overwrite the meter of the thread now on screen — and a stream that finished after a switch did the same thing. Classic stale-response race, and it shipped in v1 because a single user testing on localhost never switches threads mid-request. The fix is a staleness guard keyed on the active thread id at every async boundary, plus unmounting the thread component for one frame on switch so the new thread mounts clean.

Two days after that came the unglamorous pass a "works on my machine" project needs before anyone else touches it: four commits of hardening. The database password had been hardcoded in docker-compose.yml; the connection pool had no limits; the frontend would happily start before the backend was listening. All fixed — env-driven config with fail-fast guards, pool limits, healthchecks with startup ordering.Almost all fixed: a literal SA password survived in the setup script and docs for weeks after this pass. The last hardcoded credential didn't leave the repo until mid-June. None of it changed what the agent does — just whether I'd be comfortable leaving it running while I'm not watching.

The graph was never the product

The two-node StateGraph — the thing the project is nominally "about" — is the least interesting file in the repo, and that's the lesson. What makes the agent useful is everything wrapped around it: the traps written down where the model can't miss them, the row cap it can't talk its way past, the cost pill that quietly ended long threads, and the script that grades yesterday every morning. I thought I was building an agent; I was mostly writing an onboarding document with a chat UI attached.

It's not done. There's no test suite — the daily eval is the only automated quality signal. The SQL guard should be backed by a read-only database login instead of string checks. The schema snapshot cache is per-process with no invalidation (fine for a database restored from backups, wrong for a live one). And the eval's call-efficiency weight quietly punishes legitimately multi-step questions; it needs a per-question-complexity baseline rather than a flat "one call is ideal." But the third option exists now: anyone in the family can ask the database a question directly, in English or Gujarati, and watch the answer arrive with a chart.

Stack: Next.js · LangGraph JS · Claude Sonnet · @assistant-ui/react · D3 (scales only) · mssql · SQL Server 2022 + Phoenix in Docker Compose · OpenTelemetry/OpenInference.
Layout: backend/agent.ts (StateGraph + cache breakpoints) · backend/tools.ts (the five tools) · backend/prompt.ts (Phoenix-versioned system prompt) · backend/instrumentation.ts (tracing, 21 lines) · components/charts/ (in-chat React+D3) · lib/token-context.tsx + token-meter.tsx (the cost pill) · scripts/eval-daily.ts (code scores + LLM judge).

Tags: TypeScript, LangGraph, Agents, Claude, Observability