There’s a moment that happens in every organization racing to adopt AI. Someone discovers that if you add “think step-by-step” to your ChatGPT prompt, you get better results. Or that asking the AI to “be confident” or “be objective” somehow improves its accuracy. Before long, you have people spending hours crafting the perfect prompt, treating language models like temperamental oracles that need the right incantation.
This is the prompt engineering trap. And if your AI strategy relies on it, your engineers might be building on quicksand.
When AI Wrote Better Prompts Than Humans
In early 2024, researchers at VMware (now part of Broadcom) had a straightforward question: do all those prompt engineering techniques actually work consistently?
They tested 60 different prompt combinations across multiple language models, including popular techniques like chain-of-thought reasoning. The results were surprising: even chain-of-thought prompting sometimes helped and other times hurt performance, with “the only real trend” appearing to be “no trend”. What worked for one model failed for another. What succeeded on one dataset crashed on the next.
Then they tried something different. Instead of having humans craft prompts, they let the AI optimize its own prompts algorithmically. The automatically generated prompts outperformed the best human-engineered prompts in almost every case.
But here’s what really caught attention: the prompts the AI created were bizarre. One optimal prompt was just an extended Star Trek reference: “Command, we need you to plot a course through this turbulence and locate the source of the anomaly”. Apparently, thinking it was Captain Kirk helped that particular model solve grade-school math problems.
No human would have discovered that. No amount of “prompt engineering best practices” would have led there.
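The mechanics behind that result are not magic. Roughly, an automatic optimizer treats the prompt itself as a parameter: propose a variant, score it against a small evaluation set, keep the winner, and repeat. The sketch below shows the idea in its most naive form, as a hedged illustration rather than the researchers' actual method; `call_llm` is a hypothetical wrapper around whatever model you use, and the evaluation questions are made up for the example.

```python
# A minimal sketch of automatic prompt optimization (naive hill-climbing).
# `call_llm(prompt)` is a hypothetical wrapper that returns the model's text output.

EVAL_SET = [
    ("If Anna has 3 apples and buys 2 more, how many does she have?", "5"),
    ("A train travels 60 km in 2 hours. What is its speed in km/h?", "30"),
]

MUTATE = (
    "Rewrite the following system prompt so a language model solves grade-school "
    "math problems more accurately. Return only the new prompt.\n\n"
)

def score(system_prompt: str) -> float:
    """Fraction of evaluation questions answered correctly under this prompt."""
    hits = 0
    for question, expected in EVAL_SET:
        answer = call_llm(f"{system_prompt}\n\nQuestion: {question}\nAnswer:")
        hits += expected in answer
    return hits / len(EVAL_SET)

def optimize(seed_prompt: str, rounds: int = 20) -> str:
    """Hill-climb: let the model mutate its own prompt, keep whatever scores best."""
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        candidate = call_llm(MUTATE + best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```

Nothing in that loop cares about "best practices" or human intuition. It only cares about the score, which is exactly how you end up with a Star Trek monologue as the winning prompt.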
The Fundamental Misunderstanding
To put it bluntly: A lot of people anthropomorphize these things because they ‘speak English.’ No, they don’t. It doesn’t speak English. It does a lot of math.
This is the core issue. When we treat language models as if they understand language the way humans do, we are making a category error. We’re trying to communicate with a statistical pattern-matching engine as if it were a colleague.
Prompt engineering is essentially trial-and-error optimization of a black box. Throw different phrases at the model, see what sticks, document it in a “prompt library,” and hope it keeps working.
This isn’t just an academic point. It has real implications for how to build production AI systems.
From Prototypes to Production: The Reality Gap
It’s very easy to make a prototype, and it’s very hard to productionize it.
Getting ChatGPT to give a good answer with the right prompt is one thing. Building a reliable, scalable, governable AI system that your business depends on is something else entirely.
Consider what production actually requires:
Reliability. Systems need to handle edge cases gracefully, fail safely when things go wrong, and work consistently, not just when the stars happen to align around a well-written prompt (a minimal sketch of what this looks like in code follows this list).
Consistency. Traditional software testing strategies are poorly suited to nondeterministic LLMs, which makes it hard to ensure the AI behaves predictably across different inputs and contexts.
Governance. Know what your AI is doing and why. Maintain audit trails. Trace decisions back to the data behind them. Demonstrate compliance.
Scalability. What works for 10 queries a day needs to work for 10,000. What runs acceptably in a demo needs to run economically at scale. You will run into request limits, and LLMs are not cheap to run: they demand substantial GPU memory.
Security and privacy. Enforce careful data-handling routines, scrub sensitive information before it reaches the model, and host your own models in private networks with well-known cloud providers, whose terms you have reviewed and with whom you have built a trusted relationship.
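To make the reliability point concrete, here is a hedged sketch of what even a single production LLM call tends to need: a timeout, retries with backoff, validation of the output contract, and a safe failure path. The `call_llm` wrapper, the field names, and the `extract_invoice` task are hypothetical examples, not a prescribed design.

```python
import json
import logging
import time

logger = logging.getLogger("llm_pipeline")

def extract_invoice(text: str, max_retries: int = 3) -> dict:
    """Ask the model for structured JSON and fail safely instead of crashing."""
    prompt = (
        "Extract 'vendor', 'amount', and 'currency' from the invoice below. "
        "Respond with JSON only.\n\n" + text
    )
    for attempt in range(1, max_retries + 1):
        try:
            raw = call_llm(prompt, timeout=30)          # hypothetical model wrapper
            data = json.loads(raw)                      # reject non-JSON output
            if not {"vendor", "amount", "currency"} <= data.keys():
                raise ValueError(f"missing fields: {data}")
            return data
        except Exception as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, max_retries, exc)
            time.sleep(2 ** attempt)                    # back off before retrying
    # Fail safely: route to a human queue rather than returning garbage downstream.
    return {"status": "needs_manual_review", "input": text}
```

None of that logic lives in the prompt. It lives in the system around the prompt.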
None of this comes from better prompts. All of it requires engineering.
The Rise of LLMOps: Engineering Over Incantation
The industry is already evolving past prompt engineering as a standalone discipline. LLMOps encompasses language model training, fine-tuning, monitoring, and deployment, as well as data preparation—a far cry from just crafting clever prompts.
Many large companies are pioneering LLMOps, which includes prompt engineering in its lifecycle but also entails all the other tasks needed to deploy a product. This includes model versioning, performance monitoring, data pipeline management, and infrastructure orchestration.
The shift is telling. Machine learning operations (MLOps) engineers are best positioned to take on LLMOps roles—not because they’re better at writing prompts, but because they understand how to build and maintain production ML systems.
What Real AI Engineering Looks Like
When we talk about AI sovereignty at Thisworkz, we’re not talking about finding better prompts. We’re talking about engineering control into your AI systems from the ground up.
Real AI engineering means:
Building systems, not crafting queries. Instead of treating an LLM as a black box you prompt, you build pipelines that process data reliably, generate structured outputs, and integrate with your business logic.
Creating reproducible results. Your AI’s behavior should be deterministic and traceable, not dependent on finding the right magic words to whisper to an API.
Implementing governance by design. From data lineage to model versioning to output validation, governance isn’t bolted on—it’s architected in (a small sketch follows below).
Optimizing for production, not demos. Performance, cost, reliability, and security aren’t afterthoughts. They’re requirements from day one.
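As one illustration of governance by design, the sketch below wraps every model call in an audit record: which pinned model version ran, which prompt template, and hashes that let you trace an output back to its input later. The version strings, field names, and `call_llm` wrapper are assumptions for the example; the point is that traceability is part of the architecture, not an afterthought.

```python
import hashlib
import json
from datetime import datetime, timezone

MODEL_VERSION = "local-llama-3.1-8b"            # pinned explicitly, never "latest"
PROMPT_TEMPLATE_VERSION = "invoice-extractor-v4"

def audited_call(prompt: str, audit_log_path: str = "audit.jsonl") -> str:
    """Call the model and append a record that lets you trace the decision later."""
    output = call_llm(prompt)                    # hypothetical model wrapper
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": MODEL_VERSION,
        "prompt_template": PROMPT_TEMPLATE_VERSION,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
    }
    with open(audit_log_path, "a") as f:
        f.write(json.dumps(record) + "\n")       # one traceable line per call
    return output
```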
This is ML engineering. It’s why we say we’re ML engineers, not prompt engineers.
The Question Your Organization Needs to Answer
Here’s the critical question: Are you building AI capabilities within your teams, or are you renting them through clever API usage?
If someone spent three days finding the perfect prompt that gets GPT-4 to format your data just right, what happens when OpenAI deprecates that model? What happens when they change their pricing? What happens when a competitor needs that same capability but your “secret sauce” is just a prompt that stops working with the next model update?
You haven’t built a capability. You’ve discovered a workaround.
There’s nothing wrong with using external AI services. But if your entire AI strategy consists of engineering better ways to use someone else’s black-box model, you’re not building sovereignty. You’re building dependency.
Moving Forward: From Dependency to Engineering
The path to AI sovereignty doesn’t start with better prompts. It starts with better questions:
- What do we actually need our AI to do reliably?
- What control do we need over our AI’s behavior and outputs?
- How do we validate that our AI is working correctly?
- What happens when we need to change or scale our approach?
These are engineering questions, not prompt questions. And they require engineering answers.
As one researcher noted, “the nature of that interaction will just keep on changing as AI models also keep changing”. The models will evolve. The APIs will change. The best prompts today will be obsolete tomorrow.
But solid engineering principles? Those persist.
In the next part of this series, we’ll explore one of the most hyped “solutions” in AI right now: Retrieval Augmented Generation (RAG). Vendors promise it will solve the hallucination problem and deliver “zero errors.”
Spoiler: They’re wrong. And understanding why requires moving past marketing to engineering reality.
About AI Sovereignty
At Thisworkz, AI Sovereignty means taking full control of your AI solutions through proper ML engineering. We help organizations move from API dependency to engineered systems with reliable outputs, complete data governance, and traceable decision-making. Because your AI strategy deserves better than clever prompts and crossed fingers.
Want to explore what AI sovereignty looks like for your organization? Let’s talk about the engineering, not the prompts.
