As I mentioned in the last entry, when you build psychology tests with AI, quality tends to be all over the place. At first I thought it was just a limitation of AI. But the more tests I built, the clearer it became: it wasn't that AI couldn't do it well. I just hadn't given it proper standards. Asking from scratch each time meant three answer options per question one day, two the next, five the day after. The number of result types varied too, with four in some tests and six in others. No consistent criteria meant no consistent output.
I Decided to Build a Guide
I had a feeling things would spin out of control if left alone. It was obvious that once there were 10 or 20 tests, fixing them one by one later would be a nightmare. If everything was built to a consistent standard from the start, there would be far less to correct later. So I decided to build a test creation guide. If I gave this guide to AI upfront, it could create tests to a consistent standard without me needing to re-explain everything each time.
I Asked AI How to Build the Guide — AI Fired Back with Questions
I asked AI how to build the guide itself. And AI responded by firing back a volley of questions. How many questions should there be? What's the minimum number of options? How should the test concept be defined? How do you design the emotional hooks for users? What's the file structure? Are there existing documents to reference? When I actually tried to answer, I realized quite a few things were still undefined. I was trying to build a guide, but I hadn't even properly organized the standards for myself.
So I reversed the order. First, I defined the standards myself. Minimum 12 questions, minimum 3 options with 4 recommended, minimum 5 result types. Then I handed those standards to AI, and only then did it produce a proper guide. It added examples of good and bad cases, and built a checklist at the end. The guide document ended up pretty substantial.
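Standards this concrete can also be checked mechanically. The following is a minimal sketch, not the actual guide or site code: the `Question` and `PsychTest` shapes and the `check_against_guide` name are hypothetical, and only the numeric thresholds come from the standards above.

```python
from __future__ import annotations

from dataclasses import dataclass

# Thresholds taken from the standards described above.
MIN_QUESTIONS = 12
MIN_OPTIONS = 3
RECOMMENDED_OPTIONS = 4
MIN_RESULT_TYPES = 5


@dataclass
class Question:
    # Hypothetical shape: one question and its answer options.
    text: str
    options: list[str]


@dataclass
class PsychTest:
    # Hypothetical shape: a generated test and the names of its result types.
    title: str
    questions: list[Question]
    result_types: list[str]


def check_against_guide(test: PsychTest) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for one generated test."""
    errors: list[str] = []
    warnings: list[str] = []

    if len(test.questions) < MIN_QUESTIONS:
        errors.append(
            f"{len(test.questions)} questions, minimum is {MIN_QUESTIONS}"
        )
    if len(test.result_types) < MIN_RESULT_TYPES:
        errors.append(
            f"{len(test.result_types)} result types, minimum is {MIN_RESULT_TYPES}"
        )

    for number, question in enumerate(test.questions, start=1):
        count = len(question.options)
        if count < MIN_OPTIONS:
            # Below the hard minimum: this breaks the guide.
            errors.append(
                f"question {number} has {count} options, minimum is {MIN_OPTIONS}"
            )
        elif count < RECOMMENDED_OPTIONS:
            # Allowed, but short of the recommended count.
            warnings.append(
                f"question {number} has {count} options, "
                f"{RECOMMENDED_OPTIONS} recommended"
            )

    return errors, warnings
```

Keeping errors (below the minimum) separate from warnings (below the recommendation) mirrors how the standard itself is phrased: "minimum 3, recommend 4".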
Quality Improved with the Guide — Except for One Thing
Once I handed over the guide, something changed. Result types came out more balanced than before, and the writing tone stayed somewhat consistent. It wasn't a sudden leap in quality — more like the floor got higher. The worst-case outputs became less frequent.
Whenever I spotted something that needed improvement, I'd tell AI 'this rule isn't being followed' and refine the document bit by bit. But there was one item I revised dozens of times: the number of options. Even with 'minimum 3, recommend 4' clearly written in the guide, tests with only two options kept appearing. I tried emphasizing the item more, adding examples, explicitly calling out the wrong case — nothing fully fixed it. Some days it followed the rule fine, other days it'd go back to two. That was when I first realized: some rules just won't stick no matter what the guide says.
Switching Models Made the Document Suddenly Work
While working in Antigravity, I happened to switch models, from Gemini 3.0 Pro to Gemini 2.5 Flash. And something strange happened. Rules that had never been followed suddenly were. Every question consistently came out with four options, and the result type structure matched the guide. At first I thought it was a coincidence, but the same thing happened time after time.
Gemini 3.0 Pro is the more powerful model, so why does 2.5 Flash follow the guide better? I don't know the exact reason, but here's my guess: 3.0 Pro is a relatively new model that's still receiving ongoing patches. My suspicion is that those updates shift its behavior in small ways, and that affects guide adherence too. In practice, 3.0 Pro followed the guide well on some days and not at all on others; that inconsistency was definitely there. Maybe it's just a model that hasn't fully stabilized yet.
Current Approach: Split Models by Purpose, Check Every Time
So now I split models by purpose. For tasks that require complex reasoning, like planning or coding, I use Gemini 3.0 Pro, which handles that kind of work clearly better. For repetitive tasks that require following a guide, like generating test data, 2.5 Flash feels more stable at the moment, so I hand those over to it. That might change once 3.0 Pro stabilizes further, but for now this is the setup that works.
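Written down as a lookup, the split looks roughly like this. It is only a sketch: the task labels and the `pick_model` helper are hypothetical, the fallback choice is an assumption, and the model names are the labels used in this post rather than real API model identifiers.

```python
# Hypothetical task labels mapped to the models named in this post.
MODEL_BY_TASK = {
    "planning": "Gemini 3.0 Pro",      # complex reasoning
    "coding": "Gemini 3.0 Pro",        # complex reasoning
    "test_data": "Gemini 2.5 Flash",   # repetitive, guide-following work
}

# Assumption: unknown task kinds fall back to the stronger model.
DEFAULT_MODEL = "Gemini 3.0 Pro"


def pick_model(task_kind: str) -> str:
    """Return the model label to use for a given kind of task."""
    return MODEL_BY_TASK.get(task_kind, DEFAULT_MODEL)
```

Keeping the mapping in one place means that if 3.0 Pro stabilizes later, the switch is a one-line change.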
And regardless of which model I use, I always review the output. If something doesn't follow the guide, I re-link the guide document and point out 'this part wasn't handled properly.' At first I found this tedious, but it's actually a natural part of the process. We review human-written work — of course there will be problems if we just use AI output without checking. I've come to accept that putting AI inside a review workflow, rather than trying to fully automate it, is the right approach.
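The "re-link the guide and point out what was missed" step can be made repeatable with a small helper. This is a sketch under assumptions: the `build_review_feedback` name, the message wording, and the guide path in the usage example are all hypothetical, and the list of violations is whatever the review turned up, by hand or by a script.

```python
from typing import List, Optional


def build_review_feedback(guide_path: str, violations: List[str]) -> Optional[str]:
    """Build a follow-up message for the model, or None if the review found nothing."""
    if not violations:
        return None

    lines = [
        f"Please re-read the guide at {guide_path}.",
        "These parts weren't handled properly:",
    ]
    # One bullet per problem found during review.
    lines.extend(f"- {violation}" for violation in violations)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical guide path and violation text, for illustration only.
    message = build_review_feedback(
        "docs/test-guide.md",
        ["question 3 has 2 options, minimum is 3"],
    )
    print(message)
```

The helper only drafts the message; a person still reads the output and decides what counts as a violation, which keeps the review step in human hands.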
The Guide Extended to Games Too
Not long after building the test guide, I decided to add games. Seeing casual games like reaction tests and the apple-catching game become popular at work, I figured games could help bring more people to the site. Tests and psychology content alone felt like a limited pipeline.
Having learned from building the test guide, I wanted to build the game guide properly from the start too. I defined standards first, then built the guide around them. Things like requiring common hooks across all games, standardizing the result screen component — all of that went in. It was the same process as building the test guide, but this time I knew the order from the start, so it came together much faster.
In the next entry, I'll write about what happened when I started adding games — from seeing popular games at work and thinking 'I should build one of these,' to what I actually ran into while implementing them.
Documentation Is Not a Tool — It's Infrastructure
One thing I've come to feel through all of this: using AI well and doing documentation well aren't separate skills. Even with the same AI model, the gap between having a guide and not having one is significant. And even with a guide, how it's written makes a difference. In the end, delegating well to AI also starts with concretely defining what you want beforehand.
I did the documentation early because I sensed things would spiral out of control later — and that judgment was right. As tests and games pile up, the cost of going back to fix things made without a guide compounds. Documentation isn't a tool for improving efficiency — it's the infrastructure that lets a project keep running. I learned that by building the guide itself.