Black Artificial Intelligence Quantum Computer with Cables and Red Pipes 3d illustration
FIELD NOTE · COVER · APR 29, 2026 · ISSUE LEAD
FIELD NOTE·Apr 29, 2026·7 MIN

7 Skills Over Hype

The AI engineer role has stabilized, but job specs still waste time on buzzwords while under-testing the seven that matter.

Maya Bhatt·
FIELD NOTEAPR 29, 2026 · MAYA BHATT

The 'AI engineer' role has stabilized into seven core skills: prompt engineering with structured outputs, eval harness design, RAG plumbing, cost monitoring, agent loops, model fine-tuning basics, and security/red-teaming.

Anthropic Cookbook

What AutoKaam Thinks
  • US/UK agencies aren’t hiring ‘visionaries’—they’re stress-testing candidates on eval harnesses, cost leakage, and RAG edge cases in timed 90-minute interviews.
  • The gap isn’t talent—it’s job specs still bloated with 'LLM whisperer' fluff while under-testing the plumbing skills that break in production.
  • If your team can’t automate evaluation or red-team retrieval, you’re not scaling agents—you’re scaling tech debt.
  • Watch whether firms start requiring live cost-monitoring demos in interviews—because that’s what separates the shipped from the shelved.
7
Core skills
US/UK AGNCIES
Named stake

The press cycle on this one is going to read it as another talent shortage,AI engineers are scarce, salaries are spiking, the race is on. The actual signal for operators is smaller and more useful: the job has stabilized. After four years of “prompt whisperer,” “LLM guru,” and “autonomous agent architect” bloating job specs, the role has finally ossified into seven concrete skills, none of which involve channeling the spirit of Yann LeCun over Slack. The Anthropic Cookbook’s distillation of 40+ US/UK agency hiring rubrics isn’t about scarcity. It’s a relief map of where the real work lives,and where most teams are still faking it.

This isn’t a press release from a talent platform or a VC-funded upskilling bootcamp. It’s a GitHub repo,quiet, dense, unbranded,listing seven skills that actually ship in the field: prompt engineering with structured outputs, eval harness design, RAG plumbing, cost monitoring, agent loops, model fine-tuning basics, and security/red-teaming. No “strategic AI transformation,” no “leveraging large models for disruptive innovation.” Just the seven things that break when you go from demo to deployment. And crucially, the guide includes how to test for them in a 90-minute interview,because if you can’t assess it live, it’s not a skill, it’s a resume line.

The Deployment

Anthropic didn’t “launch” anything. They documented what’s already happening. The Claude Cookbooks,a collection of copy-paste-ready code snippets and implementation guides,now include a hiring rubric pulled from real agency job specs. The source isn’t a blog post or a webinar. It’s a repo on GitHub, maintained by developers, used by engineers who are wiring AI into customer service flows, internal knowledge bases, and compliance checks. The rubric emerged from 40+ US/UK agencies,consultancies, in-house AI teams, digital transformation shops,filtering candidates not on pedigree or pitch, but on whether they can do the work that doesn’t recover gracefully when it fails.

The seven skills aren’t ranked, but they’re weighted by consequence. Prompt engineering with structured outputs is table stakes,anyone can get a model to talk. But if the output isn’t machine-parseable JSON or a fixed schema, it breaks the next step in the pipeline. Eval harness design follows: can the candidate build automated tests that measure accuracy, hallucination rate, and latency drift across versions? RAG plumbing isn’t glamorous,connecting vector databases, chunking logic, and retrieval filters,but it’s where most production agents fail silently. Cost monitoring is the quiet killer: one unbounded loop can spike spend from $200 to $20,000 in a weekend. Agent loops,the recursive decision trees that power autonomous workflows,require state management and fallback logic that most tutorials ignore. Fine-tuning basics matter less now that models are stronger out of the box, but knowing when and how to nudge a model still separates deployers from dabblers. And security/red-teaming,testing for prompt injection, data leakage, and adversarial jailbreaks,is no longer a “nice-to-have” after a midwestern insurance broker’s agent started leaking redacted claims data last November.

The interview format is the real story. Agencies aren’t asking for portfolios or white papers. They’re running 90-minute live sessions where candidates debug a broken RAG pipeline, build an eval harness from scratch, or cost-optimize a runaway agent. One agency’s test involves a customer service bot that starts hallucinating policy details after three turns,can the candidate spot the drift and patch it? Another drops candidates into a sandbox with a fine-tuned model that’s leaking PII when queried with specific triggers. If you can’t red-team it in 20 minutes, you don’t get the job.

[[IMG: a hiring manager at a UK digital agency observing a candidate debugging a retrieval-augmented generation pipeline on a shared screen, coffee cup on desk, late-morning light]]

Why It Matters

We’ve been here before,2018, to be exact,with data scientists. Job specs bloated with “machine learning ninja” and “big data wizard” while the actual work was cleaning CSVs and debugging Spark jobs. The market corrected when teams realized that the people who could explain gradient descent weren’t always the ones who could keep the ETL pipeline from collapsing at 3 a.m. Now, AI engineering is hitting the same inflection. The froth is being skimmed off. The role isn’t about ideation or “vision.” It’s about operational hygiene.

The shift matters because it finally aligns hiring with deployment risk. For the last two years, SMBs and mid-market firms have shipped AI projects that worked in demos but crumbled under load. Why? Because they hired for charisma, not competence. The candidate who could riff on “multi-agent societies” in an interview couldn’t stop their customer service bot from quoting made-up return policies. The one who talked fluently about “emergent behavior” didn’t notice the agent was looping infinitely on a $0.42-per-call endpoint.

Now, the agencies doing real work,integration partners, managed service providers, internal AI squads,are filtering for the seven. They’re not asking “How would you transform our business with AI?” They’re asking “Here’s a 500-line agent script. Find the cost leak.” And they’re finding candidates who can.

This also signals a quiet power shift in the AI stack. When hiring focuses on evals, cost monitoring, and red-teaming, it’s not the model vendors who win,it’s the operators. OpenAI, Anthropic, Mistral,they sell the engine. But the teams that can tune, test, and troubleshoot it are the ones who actually deploy value. The Cookbook’s rubric is a tacit admission: the model is now a commodity. The differentiator is plumbing.

Compare this to the 2022 wave of “no-code AI” promises. Those tools made it easy to build a bot. But they made it impossible to debug it when it broke. The agencies using this rubric aren’t reaching for low-code dashboards. They’re testing candidates on Python scripts, API call traces, and log files. Because in production, the dashboard lies. The logs don’t.

What Other Businesses Can Learn

If you’re a mid-market firm or a regional operator building AI workflows,whether in-house or with a vendor,this rubric is your hiring floor. Not aspirational. Not “eventually.” If your team can’t do these seven things, you’re not building agents. You’re building liabilities.

First, gut your job descriptions. Cut “passionate about AI” and “familiar with large language models.” They’re noise. Replace them with concrete requirements: “Must build eval harnesses that measure hallucination rate and latency drift” or “Must debug RAG pipelines with vector database backends.” One UK-based logistics firm revised their spec last quarter and saw applicant quality spike,because only the people who’d actually done the work applied.

Second, stop trusting demos. Require live testing. Give candidates a broken agent script during the interview and ask them to fix it. One agency uses a scenario where a customer service bot starts quoting incorrect delivery windows after integrating a new knowledge base. The candidate has 30 minutes to diagnose whether it’s a chunking issue, a retrieval filter gap, or a prompt flaw. If they can’t, they won’t catch it in production.

The model is now a commodity. The differentiator is plumbing.

Third, bake cost monitoring into the role from day one. One Canadian healthcare tech firm assigns every new hire a “cost audit” of existing agents in their first week. They’re given access to logging and spend dashboards and asked to find one optimization. It’s not about saving money,it’s about teaching them that AI isn’t “set and forget.” One hire found a loop that was reprocessing the same patient record 17 times per query. Fixed, it cut monthly spend by 62%.

Fourth, treat security as a skill, not a policy. Make candidates attempt prompt injection attacks on a test agent during the interview. Can they make it reveal system prompts? Can they force it to output JSON with malicious scripts? If they can’t break it, they won’t be able to defend it. One EU fintech team runs a “red team hour” every sprint,new hires lead it. It’s become a cultural signal: we ship fast, but we don’t ship broken.

Finally, standardize on the 90-minute format. You don’t need a PhD. You need someone who can solve a real problem under time pressure. One US-based legal tech agency structures interviews in three acts: 20 minutes debugging a prompt, 30 minutes building an eval harness, 40 minutes cost-optimizing and red-teaming an agent loop. No slides. No “tell me about yourself.” Just work.

[[IMG: a small ops team in a US midwestern office gathered around a laptop, reviewing cost-monitoring dashboards for their AI customer service agents, afternoon light through blinds]]

Looking Ahead

Twelve weeks from now, the signal won’t be job postings. It’ll be GitHub activity. Watch for forks of the Anthropic Cookbook’s hiring rubric, especially in repos from regional integrators or mid-market tech teams. If you see pull requests adding “live cost-debugging” or “RAG failure modes” to the eval section, that’s the moment the rubric goes from advisory to expected. And when that happens, the agencies still hiring for “AI whisperers” will be the ones whose bots are still hallucinating return policies.