Agentic Design & Building Production Tooling
AI looks like magic because everything underneath it is a hack to make it seem smarter than it is. For every message, the system rebuilds the entire conversation history, system prompt, uploaded files, tool results, and your new message into one massive prompt. Every token competes for attention: stuff at the top and bottom gets noticed, while in large prompts, stuff in the middle gets lost.
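To make the cold-start point concrete, here's a minimal sketch of that per-message rebuild. The helper name and the bracketed section markers are our own illustration, not any particular vendor's format; the point is that every turn flattens everything into one string, with the middle of the prompt carrying the weakest attention.

```python
def build_prompt(system_prompt, files, history, tool_results, new_message):
    """Reassemble the entire context into one flat prompt, every turn.

    Nothing persists between calls. Tokens near the start and end of the
    assembled string get the most attention; long middles get lost.
    """
    parts = [system_prompt]                          # top: high attention
    parts.extend(f"[file] {f}" for f in files)
    for turn in history:                             # middle: weakest attention
        parts.append(f"{turn['role']}: {turn['content']}")
    parts.extend(f"[tool] {r}" for r in tool_results)
    parts.append(f"user: {new_message}")             # bottom: high attention
    return "\n".join(parts)

prompt = build_prompt(
    system_prompt="You are a helpful assistant.",
    files=["report.pdf (extracted text)"],
    history=[{"role": "user", "content": "Summarize the report."},
             {"role": "assistant", "content": "It covers Q3 revenue."}],
    tool_results=["search: 3 matches"],
    new_message="Now compare it to Q2.",
)
```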
Google ran 180 experiments across GPT, Gemini, and Claude. On sequential reasoning, every multi-agent variant made things worse by 39–70%: even when a single agent completes a sequential task correctly only 45% of the time, adding more agents hurts performance. Meanwhile, GPT-3.5 in a reflection loop scored 95.1% on HumanEval versus 67% for GPT-4 zero-shot. A weaker model with the right workflow crushes a stronger model without one.
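A reflection loop of the kind behind that HumanEval number can be sketched in a few lines. This is a generic draft-critique-revise skeleton, not the paper's actual harness; `generate` and `critique` stand in for LLM calls, and the toy functions at the bottom exist only to show the control flow.

```python
def reflection_loop(task, generate, critique, max_rounds=3):
    """Draft, self-critique, and revise until the critic passes the draft.

    `generate(task, feedback)` produces a draft (revised if feedback given);
    `critique(task, draft)` returns None when satisfied, else feedback text.
    """
    draft = generate(task, feedback=None)
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if feedback is None:              # critic found no problems: done
            return draft
        draft = generate(task, feedback=feedback)
    return draft                          # best effort after max_rounds

# Toy stand-ins so the loop is runnable without a model behind it:
def toy_generate(task, feedback=None):
    return task.upper() if feedback else task

def toy_critique(task, draft):
    return None if draft == task.upper() else "make it upper-case"

result = reflection_loop("fix the bug", toy_generate, toy_critique)
```

The weak "model" here fails on its first draft, gets one round of feedback, and converges: the workflow, not the generator, does the heavy lifting.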
Route easy queries to cheap models and hard ones to expensive ones. Clearly define a pass/fail case for your agent and evaluate it often. Every interaction with an LLM is a cold start — that constraint is the foundation of healthy architecture.
https://arxiv.org/abs/2512.08296v2
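Difficulty-based routing can be as simple as a classifier in front of two model names. This is a hedged sketch: the model names are placeholders, and the keyword heuristic is a toy stand-in for whatever learned or rules-based router you'd actually deploy.

```python
CHEAP_MODEL = "small-fast"        # placeholder names, not a vendor API
EXPENSIVE_MODEL = "large-slow"

def classify_difficulty(query: str) -> str:
    """Toy heuristic: long or multi-step queries count as hard."""
    hard_markers = ("prove", "step by step", "debug", "refactor")
    if len(query) > 500 or any(m in query.lower() for m in hard_markers):
        return "hard"
    return "easy"

def route(query: str) -> str:
    """Send easy queries to the cheap model, hard ones to the expensive one."""
    if classify_difficulty(query) == "hard":
        return EXPENSIVE_MODEL
    return CHEAP_MODEL
```

The same pass/fail evals the post recommends are what tell you whether the cheap model's error rate on "easy" traffic is actually acceptable.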
Tooling
MITRE ATLAS catalogs the techniques attackers use against AI systems. The problem is that knowing the taxonomy doesn't get you any closer to testing anything, and the tooling landscape is a minefield of overpromises and leapfrogging projects. We picked what we judged to be the industry-leading tools in this rapidly changing field, knowing full well we'll need to adapt as the landscape does.
We built 51 agentic skills that map directly to ATLAS techniques. The mapping covers LLMs, computer vision, recommendation systems, autonomous agents, and ML supply chain. Instead of picking through five different ecosystems to figure out which tool applies to which attack, the decision is already made. These skills are pre-audited both manually and via the auditing skill we built ourselves.
The skills were generated by an LLM. We edited every one of them by hand before anything shipped, then audited them manually, fixing the obvious problems as we went. Only after that did we write the automated auditor.
The auditor found more issues than we did — not because it was smarter, but because it was more thorough than any person is going to be across 51 skills. It caught unpinned commits, missing venv enforcement, and subtle environment pollution that we had either missed or considered low priority. The volume of nitpicky findings was higher than our manual pass, but nothing it flagged was wrong. That gave us confidence that it could reproduce our own judgment reliably.
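One of those checks, unpinned git clones, is easy to illustrate. This is a simplified sketch of the idea, not our actual auditor (which is itself a skill): flag any `git clone` that isn't followed by a checkout of a full 40-character commit hash.

```python
import re

CLONE_RE = re.compile(r"git\s+clone\s+(\S+)")
PIN_RE = re.compile(r"git\s+(?:checkout|reset\s+--hard)\s+[0-9a-f]{40}\b")

def find_unpinned_clones(script_text: str) -> list:
    """Return URLs of repositories cloned without a pinned commit hash."""
    findings = []
    for match in CLONE_RE.finditer(script_text):
        # A clone only counts as pinned if a full-hash checkout follows it.
        rest = script_text[match.end():]
        if not PIN_RE.search(rest):
            findings.append(match.group(1))
    return findings
```

A human reviewer gets bored of this check by skill 12; a script runs it identically on all 51.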

Audit
We wrote an auditor and pointed it at our own code before shipping. This was the most useful part of the entire project.
Across 51 skills we found 2 critical risks, 18 high risks, and a long tail of medium and low issues. Most of them were exactly the kind of thing that works fine in a demo and breaks in production: unpinned dependencies everywhere, tools installing packages into the global environment instead of isolated virtualenvs, git clones without pinned commits. One skill had a flag that auto-remediates vulnerabilities by modifying packages in place — which in a red team context means you are accidentally changing the target environment. Another made outbound requests during execution with no warning, which is a problem if you are running in an air-gapped network and didn't expect your scanner to phone home.
None of these are exotic vulnerabilities. They are standard software hygiene issues that the AI security community has not prioritized because most of this tooling started as research code that was never meant to run in production. We fixed all of them: pinned dependencies with hash verification, enforced venv isolation, explicit warnings on anything that touches live infrastructure. 22 of the 51 skills came out certified safe for immediate deployment. The other 29 work, but require you to understand what they are doing to the environment around them.
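"Pinned dependencies with hash verification" reduces to a mechanical rule: every requirement line must carry a `--hash=` so pip's `--require-hashes` mode can reject tampered archives. A minimal checker for that rule, sketched under the assumption of standard pip-compile-style requirements files:

```python
def unhashed_requirements(requirements_text: str) -> list:
    """Return requirement names that are missing --hash verification."""
    # Join backslash continuations so a requirement and its hashes
    # become one logical line, as pip-compile emits them.
    joined = requirements_text.replace("\\\n", " ")
    findings = []
    for raw in joined.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "--")):
            continue  # comments and global options aren't requirements
        if "--hash=" not in line:
            findings.append(line.split()[0])
    return findings
```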
Lessons
The audit was more valuable than the attack code. Building a prompt injection wrapper is straightforward. Knowing that your prompt injection wrapper silently installs an unpinned package from a git repository with no commit hash — that's the kind of thing that only shows up when you look for it. Most red team toolkits don't look for it.
We also learned that the mapping itself solves a real problem. Security teams waste time evaluating which of four overlapping LLM scanners to use for a given technique. That evaluation is now a table: Garak and PyRIT for LLMs, IBM ART and BackdoorBench for computer vision, ModelScan and Fickling for supply chain, AgentDojo and InjecAgent for agents. Pick the row that matches your system type and go.
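That table is literally expressible as a lookup. The tool names come straight from the mapping above; the dict structure and function are our illustration of how flat the decision becomes.

```python
TOOLS_BY_SYSTEM = {
    "llm": ["Garak", "PyRIT"],
    "computer_vision": ["IBM ART", "BackdoorBench"],
    "supply_chain": ["ModelScan", "Fickling"],
    "agents": ["AgentDojo", "InjecAgent"],
}

def tools_for(system_type: str) -> list:
    """Pick the row that matches your system type and go."""
    return TOOLS_BY_SYSTEM[system_type.lower()]
```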
The Future
The library is open and available for all OpenClaw instances. The audit report and the full ATLAS mapping are published alongside it so you can see exactly what we found, what we fixed, and what still requires caution.
The lesson that carries forward is the same one from the top of this post: the architecture matters more than the capability. A well-structured workflow around a simple tool will outperform a powerful tool with no structure around it. That applies to building agents, and it applies to testing them.