The family jewelry ERP learned to chat in one night
My family's jewelry store runs on an old ERP backed by SQL Server. Every bill, every quotation, every tag of stock is in there — and getting any of it out means clicking through fixed reports or finding someone who writes T-SQL. One night in May I built the third option: a chat window. The first commit landed at 1:48am; by 3:55am the agent had a pipeline for grading its own conversations. Here's what got built that night, and what real use broke afterward.
Two fixed reports and a guy who knows T-SQL
The ERP is JSoft Extreme, and the version the agent talks to is a restored backup running in Docker on my machine — which matters, because it means the agent works against a real, messy, production-shaped dataset, not a toy schema. The users are my family, who ask questions like "who sold the most this month?" and "why didn't last week's quotations convert?" — in a store that runs in Gujarati, which is why there's a toggle that tells the agent to answer in it.
The shape is two processes and a proxy. A Next.js frontend and a LangGraph JS server run side
by side; the browser never talks to LangGraph directly — everything goes through a Next.js edge
proxy, so the frontend only ever sees /api/*.
The agent itself is deliberately boring: a two-node StateGraph —
agent → tools → agent — that loops until the model stops emitting tool calls.
That's the part that took the least time all night. Everything interesting happened in the
tools, the prompt, and the plumbing around them — starting with the part where the model was
confidently wrong about my own store.
It answered "yesterday" with year-old data
By the time the initial commit landed at 1:48am, the system prompt already carried scar tissue — the commit message lists a "corrected salesman join pattern, HSN code mapping, and dynamic date injection." Each of those was a wrong answer first. The model assumed it was 2025 and cheerfully answered "yesterday" with year-old data. The salesperson leaderboard came out wrong because the obvious column is a lie. The traps, as the prompt now records them:
| Trap | What the prompt says |
|---|---|
| Wrong year | Today's date is injected at call time. Without it the model happily assumed it was 2025. |
| Financial year | April 1 – March 31, with explicit FY ranges spelled out. "Compare FY 2024-25 vs 2025-26" is the most common question in a family business. |
| NULL SalesmanID | trnSales.SalesmanID is NULL for most rows. The authoritative field is on the line items — the prompt includes the exact three-table join so salesperson leaderboards come out right the first time. |
| HSN code lies | In this store's data, HSN 711311 is used for silver, even though the code nominally covers gold alloys. An agent reasoning from general HSN knowledge would mislabel it as 18K gold. |
| Quotation remarks | The reason a quotation didn't convert ("rate issue", "will come later") hides in free-text Remarks columns. The prompt tells the agent to always fetch them for conversion analysis. |
None of this is discoverable from the schema. It's tribal knowledge extracted from sessions that went wrong, written down where the model can't miss it — the same job a senior analyst's onboarding notes would do for a human. The prompt also documents key columns for the core tables, which is why the agent rarely needs a schema tool at all. But it does have tools.
Five tools, one gate
SELECT-only T-SQL. Default 10 rows, hard cap 200. Injects TOP N into the query so SQL Server never sends more than the cap over the wire.
Column names for the 12 core tables, cached in-process. The tool description itself warns the model: "use sparingly — the result enters conversation context."
Columns + types for a single table, via a parameterized INFORMATION_SCHEMA query. The cheap option when one unknown table comes up.
Every table in the database. Mostly a fallback — the prompt documents the tables that matter.
Doesn't render anything on the server. Validates payload.type and returns the payload as JSON — the frontend does the rest.
Honesty about the write-protection: it's three string checks — must start with
SELECT, no second statement after a semicolon, no SELECT … INTO —
not a SQL parser. That's fine for its actual threat model (a well-meaning model writing bad SQL,
used by my family on my hardware), but the proper fix is a read-only database login, which is
still on the list. Right now the connection uses the admin account, so the tool gate is the
only gate.
Charts appear between the sentences
The part people react to: ask for a dashboard and one appears in the chat, between the
agent's sentences. The trick is that render_chart is a tool whose "result" is just
JSON.stringify(payload). The chat UI (built on assistant-ui) routes
render_chart tool calls to a ChartRenderer component instead of the
generic tool-call fallback; it parses the JSON, checks payload.type against five
known types, and mounts the matching React component.
Five payload types: bar, grouped-bar, hourly (a dual line
chart of quotes vs bills by hour), kpis (a tile grid with change percentages), and
dashboard (KPIs plus a chart grid in one card). The components use D3 strictly for
math — scaleBand, scaleLinear, max — and React renders the
SVG declaratively. No useEffect DOM mutation, no D3 selections fighting React for
ownership of the tree.
"₹11.4L", "46.2%"), not raw numbers. Indian number formatting —
lakhs and crores — is genuinely annoying to do generically in the component, and the model
already knows the context ("is this rupees or a count?"). The schema pushes that judgment to
the caller and keeps the chart components dumb.
2:26am: I needed to watch it think
Forty minutes after the first commit, the second one added tracing — because debugging an agent
by reading its chat output is like debugging a program by reading its stdout. The entire setup
is one 21-line file. backend/instrumentation.ts registers an OTLP exporter pointed
at a self-hosted Phoenix instance (running in the same Docker Compose as the database) and
patches LangChain's callback manager via OpenInference's LangChainInstrumentation.
It's imported as the first line of the agent module, so it's active before any LangGraph code
loads. Every agent invocation, tool call, and LLM request shows up as a linked trace in Phoenix
within seconds — full SQL text, full tool results, token counts per call.
By 3:00am Phoenix was more than a trace viewer: the system prompt itself moved into Phoenix's
prompt registry.The 3:00am version kept a graceful local fallback if Phoenix was unreachable. The 3:15am version deleted it — a fallback that silently serves a stale prompt is worse than a crash. Phoenix is infra now, same as the database.
The backend fetches the named prompt once per process and fills in {{today}}-style
variables at call time, which means I can edit and version the prompt — the highest-leverage
file in the project — in the Phoenix UI without redeploying the agent.
3:55am: it grades its own homework
The last thing built that night was the thing that watches everything else. Every conversation produces traces; a script run daily turns them into grades. It pulls yesterday's LLM spans from Phoenix, groups them by trace, and scores each conversation two ways.
The first score is pure code — a weighted composite of four signals:
| Signal | Weight | How it's scored |
|---|---|---|
| Token efficiency | 0.40 | 1.0 under an 8K-token budget, linear decay to 0.0 at 30K |
| LLM-call efficiency | 0.25 | 1.0 for a single call, −0.2 per extra loop iteration |
| SQL correctness | 0.20 | −0.25 per tool result that came back Error: |
| Schema sufficiency | 0.15 | −0.3 per mid-session schema lookup — the prompt should have been enough |
The second score is an LLM judge — Haiku, cheap, with structured output
(score, label, reason) — that reads only the user's
question and the final answer and judges whether the answer actually addresses the
question.The judge never sees the SQL or the tool results. It grades the answer, not the work — which is the point, and also why it can be fooled by a confident wrong answer the code score happens to like.
Both scores are written back to the trace's root span as Phoenix annotations, so the trace list
becomes a sortable quality dashboard. Traces with composite ≥ 0.65 that the judge didn't mark
incorrect get appended to a Phoenix dataset — a steadily growing regression set built from real
usage, for free.
Found 41 LLM spans across 12 traces.
trace a3f81c02 | efficient 0.91 | tokens=5,840 | calls=2 | errors=0 | quality=correct
trace 7c20e9b4 | degraded 0.62 | tokens=14,210 | calls=4 | errors=1 | quality=correct
trace f0d2417a | expensive 0.38 | tokens=27,950 | calls=6 | errors=2 | quality=partial
...
Done. 12 traces annotated, 7 added to dataset.
(Illustrative output — the shape and thresholds are real, the trace IDs and counts above are not from the live database.)
3pm: the bill for the night before
Then I slept, used it, and looked at the traces. Agent loops over a database are token-hungry: a long system prompt plus growing tool results, re-sent on every iteration. The afternoon commit of the same day was entirely about cost, and two changes did most of the work.
First, prompt-cache breakpoints. The system prompt block is marked with
cache_control: ephemeral, and before each model call a helper walks the message
history backwards, finds the last substantial tool result (over 100 characters), and marks that
block too. On the next turn, everything up to the breakpoint is served from the Anthropic prompt
cache — cache reads are billed at $0.30/M tokens versus $3/M for fresh input, so the bulk of a
long thread costs a tenth of what it would otherwise.
SELECT … into
SELECT TOP N … before execution, and the result comes back in a compact
{cols, rows} format — column names once, value arrays per row — about 3–4×
smaller than the usual array-of-objects. Every row a tool returns sits in context for the rest
of the thread, so the format is the cost control. The schema snapshot got the same
treatment: column types dropped, and one 80-column table evicted from it entirely.
Second, the cost is visible. The meter had been there since the first commit; the
afternoon made its number honest. Every AI message carries usage_metadata; a React
context sums input, output, cache-write and cache-read tokens across the thread and prices them
with the real per-bucket rates. A small pill in the composer shows the running cost in
rupees.At a hardcoded 85 rupees to the dollar. The meter is a behavioral nudge, not an accounting system — if the exchange rate drifts 5%, the amber threshold drifts with it and nobody is harmed.
It's grey while the thread is cheap, turns amber past about ₹25, and at about ₹64 it goes red
and grows a "New chat →" button — because by that point the context is bloated and starting
fresh is genuinely cheaper than continuing.
Three days later, real use found the race
The meter is also where reality landed the first punch. On May 15 I noticed that switching threads while a slow response was still in flight would let the old thread's usage data overwrite the meter of the thread now on screen — and a stream that finished after a switch did the same thing. Classic stale-response race, and it shipped in v1 because a single user testing on localhost never switches threads mid-request. The fix is a staleness guard keyed on the active thread id at every async boundary, plus unmounting the thread component for one frame on switch so the new thread mounts clean.
Two days after that came the unglamorous pass a "works on my machine" project needs before
anyone else touches it: four commits of hardening. The database password had been hardcoded in
docker-compose.yml; the connection pool had no limits; the frontend would happily
start before the backend was listening. All fixed — env-driven config with fail-fast guards,
pool limits, healthchecks with startup
ordering.Almost all fixed: a literal SA password survived in the setup script and docs for weeks after this pass. The last hardcoded credential didn't leave the repo until mid-June.
None of it changed what the agent does — just whether I'd be comfortable leaving it running
while I'm not watching.
The graph was never the product
The two-node StateGraph — the thing the project is nominally "about" — is the least interesting file in the repo, and that's the lesson. What makes the agent useful is everything wrapped around it: the traps written down where the model can't miss them, the row cap it can't talk its way past, the cost pill that quietly ended long threads, and the script that grades yesterday every morning. I thought I was building an agent; I was mostly writing an onboarding document with a chat UI attached.
It's not done. There's no test suite — the daily eval is the only automated quality signal. The SQL guard should be backed by a read-only database login instead of string checks. The schema snapshot cache is per-process with no invalidation (fine for a database restored from backups, wrong for a live one). And the eval's call-efficiency weight quietly punishes legitimately multi-step questions; it needs a per-question-complexity baseline rather than a flat "one call is ideal." But the third option exists now: anyone in the family can ask the database a question directly, in English or Gujarati, and watch the answer arrive with a chart.
Layout:
backend/agent.ts (StateGraph + cache breakpoints) ·
backend/tools.ts (the five tools) ·
backend/prompt.ts (Phoenix-versioned system prompt) ·
backend/instrumentation.ts (tracing, 21 lines) ·
components/charts/ (in-chat React+D3) ·
lib/token-context.tsx + token-meter.tsx (the cost pill) ·
scripts/eval-daily.ts (code scores + LLM judge).