[07] Industry

AI Companies.

We build the GTM infrastructure AI companies need — positioning that lands with technical buyers, demo flows that convert, content that ranks for the right queries.

[01] How we ship

Shipped in four moves.

01

Position

Sharpen the wedge against the dozen other AI startups in your category. Buyer-tested.

02

Productize

Demos and trials that show technical buyers exactly what the model actually does. No vapor.

03

Distribute

Developer marketing, technical content, founder distribution. Where your buyer actually reads.

04

Convert

Wire the funnel into your CRM. PLG signals + sales triggers. Predictable pipeline from demo to deal.

[02] The Eval

Demos lie. Evals don’t.

A live tail of an actual eval run on one of our production agents. Held-out dataset, regression suite, threshold-blocked deploys. The system that decides whether the next change ships — or doesn’t.

eval-suite · agent: support-triage-v4 · commit: a8f7c12 · tag: pre-deploy
$ pytest evals/ --tag=production --threshold=0.75 --report=json

collected 142 cases · 6 suites · ground-truth labelled by domain experts
running against: gpt-4o · claude-3-5-sonnet · current-prod (claude-3-5-sonnet)

SUITE 1/6 · intent-classification (n=28)
intent-classification · 28/28 · 0.964 · baseline 0.910 · +5.4pp

SUITE 2/6 · entity-extraction (n=34)
entity-extraction · 32/34 · 0.941 · baseline 0.882 · +5.9pp

SUITE 3/6 · tool-selection (n=22)
tool-selection · 21/22 · 0.954 · baseline 0.818 · +13.6pp

SUITE 4/6 · citation-faithfulness (n=18)
citation-faithfulness · 18/18 · 1.000 · baseline 0.944 · +5.6pp

SUITE 5/6 · jailbreak-resistance (n=24)
jailbreak-resistance · 24/24 · 1.000 · baseline 0.917 · +8.3pp

SUITE 6/6 · refusal-correctness (n=16)
refusal-correctness · 11/16 · 0.688 · baseline 0.750 · threshold 0.75

> 5 cases below threshold · over-refused valid escalation requests · cluster: refund-eligibility
> deploy BLOCKED · reason: refusal-correctness < 0.75
> owner: ai-product@inhouse · ticket: AI-1247 · SLA: 24h

summary · 134/142 PASS · overall 0.944 · +6.8pp vs prod · build status: FAIL (refusal-correctness)
0.944 overall · below the keynote, above the baseline, blocked at the threshold — INHOUSE AI
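
The shape of that gate, reduced to a sketch. run_agent and the suites.json layout are illustrative stand-ins, not our production harness; the mechanism is the point: score every held-out suite, let the exit code decide.

eval_gate.py · sketch · illustrative names throughout
import json
import sys

THRESHOLD = 0.75  # per-suite floor; one suite below it fails the build

def run_agent(case_input: str) -> str:
    # Placeholder for the system under test; wire the real agent in here.
    return "escalate"

def score_suite(cases: list) -> float:
    # Fraction of held-out cases where the agent matches the expert label.
    passed = sum(1 for c in cases if run_agent(c["input"]) == c["expected"])
    return passed / len(cases)

def main() -> None:
    with open("evals/suites.json") as f:  # held-out, expert-labelled cases
        suites = json.load(f)
    results = {name: score_suite(cases) for name, cases in suites.items()}
    blocked = sorted(name for name, s in results.items() if s < THRESHOLD)
    print(json.dumps({"results": results, "blocked": blocked}, indent=2))
    if blocked:
        sys.exit(1)  # CI reads the exit code; the deploy never happens

if __name__ == "__main__":
    main()
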
[OBSERVATION]

If your AI system has no eval, you don’t have a system. You have a demo.

— INHOUSE AI, on the boundary between research and production
[03] What we ship

The production-AI stack.

001

Agentic Workflows

Multi-step agents with tool use, state, and recovery. Not chat-as-a-feature — structured workflows that complete tasks end-to-end.
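
A sketch of what "structured workflow" means in code, under stated assumptions: call_llm and the tool table are stubs. The loop shape is the point: bounded steps, explicit state, a retry-then-escalate recovery path.

agent_loop.py · sketch, not a framework · call_llm and tools are stubs
MAX_STEPS = 8      # hard ceiling: an agent that can't finish must stop
MAX_RETRIES = 2    # recovery: retry a failed tool before escalating

TOOLS = {
    "lookup_order": lambda args: {"status": "shipped"},  # stub tool
}

def call_llm(state: list) -> dict:
    # Placeholder model call. Returns {"tool": ..., "args": ...} mid-task
    # or {"final": ...} when done; this stub always finishes immediately.
    return {"final": "stub answer"}

def run(task: str) -> str:
    state = [{"role": "user", "content": task}]       # explicit state
    for _ in range(MAX_STEPS):
        action = call_llm(state)
        if "final" in action:
            return action["final"]                    # completed end-to-end
        for attempt in range(MAX_RETRIES + 1):
            try:
                result = TOOLS[action["tool"]](action["args"])
                break
            except Exception as err:                  # recovery path
                if attempt == MAX_RETRIES:
                    return f"escalated to human: {err}"  # fail loud, not silent
        state.append({"role": "tool", "content": str(result)})
    return "step budget exhausted · escalated"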

002

RAG Pipelines

Ingestion, chunking, embedding, retrieval, re-ranking, citation. Tuned to your corpus — not a vendor’s default index.
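
A sketch of the stages, with a toy overlap scorer standing in for real embeddings and a real cross-encoder re-ranker (both assumptions). The confidence floor is the part naive top-k lacks.

rag_sketch.py · stages only · toy scorer stands in for embeddings + re-ranker
def chunk(doc: str, size: int = 400, overlap: int = 50) -> list:
    step = size - overlap
    return [doc[i:i + size] for i in range(0, max(len(doc), 1), step)]

def score(query: str, text: str) -> float:
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)       # swap for embedding similarity

def retrieve(query: str, chunks: list, k: int = 20) -> list:
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def rerank(query: str, candidates: list, k: int = 4, floor: float = 0.2) -> list:
    # Second, stricter pass plus a confidence floor, so plausible-but-
    # unrelated chunks get dropped instead of cited.
    scored = [(score(query, c), c) for c in candidates]
    return [c for s, c in sorted(scored, reverse=True) if s >= floor][:k]

def build_context(query: str, corpus: list) -> str:
    chunks = [c for doc in corpus for c in chunk(doc)]
    cited = rerank(query, retrieve(query, chunks))
    return "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(cited))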

003

Eval Harnesses

Held-out datasets, regression CI, drift monitoring. The system that decides whether the next deploy ships — or doesn’t.
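
The deploy gate is sketched in the eval section above; this is the drift-monitoring leg. Baseline, tolerance, and the alert sink are illustrative.

drift_check.py · sketch · thresholds and alert sink are illustrative
from collections import deque

BASELINE = 0.94    # eval score at last ship
TOLERANCE = 0.03   # page when the live window drops below baseline minus this
WINDOW = 200       # sampled production cases, scored with the CI graders

scores: deque = deque(maxlen=WINDOW)

def record(case_score: float) -> None:
    scores.append(case_score)
    if len(scores) == WINDOW:
        rolling = sum(scores) / WINDOW
        if rolling < BASELINE - TOLERANCE:
            alert(f"drift · rolling {rolling:.3f} < floor {BASELINE - TOLERANCE:.3f}")

def alert(msg: str) -> None:
    print(f"[PAGE] {msg}")   # stand-in for the real pager/Slack hook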

004

Fine-Tuning & Distillation

Smaller models that match production constraints. Latency, cost, on-prem — whichever the brief actually demands.
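
The core of distillation in one loss, sketched in PyTorch: train the small model to match the big model's output distribution. Temperature and blend weight are illustrative, not a recipe.

distill_sketch.py · core loss only · hyperparameters are illustrative
import torch.nn.functional as F

T = 2.0        # softening temperature
ALPHA = 0.5    # blend of hard-label loss and teacher-matching loss

def distill_loss(student_logits, teacher_logits, labels):
    hard = F.cross_entropy(student_logits, labels)        # ground-truth signal
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                           # standard T-squared scaling
    return ALPHA * hard + (1 - ALPHA) * soft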

005

Guardrails & Safety

Input validation, output classification, jailbreak resistance, PII handling. Tested against real adversarial inputs, not theoretical ones.
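
A sketch of the layering only. The regexes are simplified shapes and the marker check stands in for a trained jailbreak classifier; testing against real adversarial inputs is what makes any of this trustworthy.

guardrails_sketch.py · layering, not a product · patterns are simplified
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US-SSN-shaped
    re.compile(r"\b\d{13,19}\b"),                # card-number-shaped
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),      # email
]

INJECTION_MARKERS = ("ignore previous instructions", "reveal your system prompt")

def validate_input(text: str) -> str:
    lowered = text.lower()
    # Naive marker check standing in for a trained jailbreak classifier.
    if any(m in lowered for m in INJECTION_MARKERS):
        raise ValueError("blocked: injection marker")
    return text

def scrub_pii(text: str) -> str:
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def guarded(generate):
    # Wrap any generation function with input and output checks.
    def wrapper(prompt: str) -> str:
        return scrub_pii(generate(validate_input(prompt)))
    return wrapper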

006

MLOps & Observability

Trace every call. Log every retrieval. Alert on drift. Replay any prompt at any version — because you can’t fix what you can’t reconstruct.
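
Replay reduces to one discipline: log the complete call record, keyed by prompt version. A sketch, with a flat file standing in for a real trace store.

trace_sketch.py · sketch · file storage stands in for a real trace store
import json
import time
import uuid

TRACE_LOG = "traces.jsonl"

def traced(prompt_version: str):
    def decorate(llm_call):
        def wrapper(prompt: str, **params):
            t0 = time.monotonic()
            output = llm_call(prompt, **params)
            record = {
                "trace_id": str(uuid.uuid4()),
                "prompt_version": prompt_version,   # replay keys on this
                "prompt": prompt,
                "params": params,
                "output": output,
                "latency_ms": round((time.monotonic() - t0) * 1000, 1),
            }
            with open(TRACE_LOG, "a") as f:
                f.write(json.dumps(record) + "\n")
            return output
        return wrapper
    return decorate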

[04] The Proof

Measured in production. Not keynotes.

Live today · 12+
LLM systems we built and operate in production. Customer-facing, internal-tooling, regulated-domain — the bar is the same.

Post-rearchitecture · −47%
Median latency reduction on systems we inherit. Same outcome, faster path — usually by killing one model call and a wasted retrieval.

Shipped agents · ~78%
Median eval pass rate at ship. The competitor handovers we audit average 41% — that’s the gap.

Regulated domain · 0
Hallucination incidents in regulated-domain deployments. Guardrails, citation enforcement, retrieval-grounded only — auditable.
[FIELD NOTES] · from production

Three things demos never show. Three things production teaches you in week one.

Note 01 · Support triage agent

“The model picked the right intent 97% of the time in eval. In production it picked it 71% of the time. The gap was timezone abbreviations the eval set didn’t contain. We added 240 cases. The eval set is the system.”

— SaaS support, ongoing

Note 02 · RAG over legal corpus

“Naive top-k retrieval returned plausible but unrelated case law 18% of the time. We added a re-ranker, raised k, then lowered it again with a confidence floor. Citation faithfulness went from 0.74 to 0.98. The model didn’t change.”

— Legal-tech engagement, EU

Note 03 · Customer onboarding agent

“P50 latency was 1.8s. P99 was 14s — one tail-heavy retrieval call. We swapped the embeddings provider, halved the k, and cached at the embedding layer. P99 to 2.1s. No accuracy regression. Demo was always fast; production is where tails live.”

— Fintech onboarding, live
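
The caching move from that note, sketched. embed_remote is a placeholder for the provider call; the point is that identical inputs never pay the network twice, which is where retrieval tails often come from.

embed_cache.py · sketch · provider call is a stub
import functools

@functools.lru_cache(maxsize=50_000)
def _cached(key: str) -> tuple:
    return tuple(embed_remote(key))   # tuples because lru_cache needs hashables

def embed(text: str) -> tuple:
    # Normalise before keying so whitespace-only changes still hit the cache.
    return _cached(" ".join(text.split()))

def embed_remote(text: str) -> list:
    return [0.0] * 8                  # placeholder; provider SDK call goes here
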
[05] The Pledge

We measure what matters.
We deploy what holds.
We let the eval — not the keynote — decide.

— INHOUSE AI

"AI startups don't need more marketing. They need infrastructure for technical trust."

— INHOUSE AI

Sell AI to people who can read your code.

BUILD MY AI GTM →