Pitting AI Against AI by Putting AI in AI


Five years ago, I wanted to help the engineers on my team stretch their coding skills by presenting them with an entirely different set of challenges from the typical tedium that can come with building business applications day in and day out.

Having built so many games over my own career, one unique element I've always enjoyed is building out the game's AI. After building the core engine, collision detection, and all that, building the AI was always a fun, and very different, challenge. I'd have to think it through: if I were a player playing this character, what would my own in-game approach be? To keep it fair, what elements could my AI player see and interact with, and what would stay hidden from it? And how could I scale my strategy to different difficulty levels?

Expressing your strategy in code requires a different line of thinking. It may not directly translate to writing better business or consumer applications, but it stretches your computational thinking and problem solving. Plus it's fun. 

Now, I couldn't quite dedicate our time to building a game - but it did make me think about building a game engine where the team could each come up with their own AI strategies, we could pit them head-to-head in a fun and friendly competition, and then talk about what we'd learned.

Fast forward some months, and I'd built Code Collision - an engine with multiple games like soccer and capture the flag. Code Collision handled all the game elements - players just had to build their strategies with the limited set of data given to them. I've run multiple tournaments and it's been a blast. Then, the other day, a question popped into my head:

What if I had AI build the AI? Claude, GPT, Gemini... What if I pitted all these LLMs against each other? Not to see who would win (though that would be a fun part of it), but to get insights about how they handle the more ambiguous problem space.

I wouldn't be providing them the strategy and asking them to code it - instead, I was going to give them the parameters of the challenge and let them come up with their own strategies. AI can write text, build PowerPoints, create images and generate video - but how well could it play soccer?

AI is really good at solving clear, well-defined problem spaces. But when you're defining a strategy, there's just so much opportunity to get stuck in analysis paralysis - and that applies as much to people as it does to an LLM.

I wanted to tackle this in a straightforward and consistent way without over-instructing. I wrote a pretty simple instruction file - not too different from what I'd write if I were introducing the whole engine to a team of engineers for the first time. I explained the basics of the engine without explaining soccer itself. I explained what data it would receive without suggesting what it should do with that data. I explained that it would be responsible for multiple soccer players without suggesting what it could do with them.

There's one other favor I had to do for the models. If you've worked with any of these models, you know they can write good code and build impressive UI - but they don't "see" their own results. There's not much of a feedback loop where they can see how things rendered and then make improvements; you have to feed that back to them. So I gave them one additional note: I wouldn't change my engine, but they could log whatever they needed to the console in a game against a basic strategy I call "Just Go, Bro!", where the players always attack the ball with full brute force. Just Go, Bro! isn't the worst strategy - and it does require the algorithms to think both offensively and defensively.
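
For context, a brute-force strategy like that boils down to a few lines. This is just a sketch - the data shapes below are simplified stand-ins for illustration, not the engine's actual interface:

```python
# Simplified sketch of a "Just Go, Bro!" style strategy.
# The game_state shape here is illustrative, not the engine's exact interface.
import math

def just_go_bro(game_state):
    """Every player charges the ball at full force, every turn."""
    ball_x, ball_y = game_state["ball"]["x"], game_state["ball"]["y"]
    moves = []
    for player in game_state["my_players"]:
        dx, dy = ball_x - player["x"], ball_y - player["y"]
        distance = math.hypot(dx, dy) or 1.0  # avoid divide-by-zero when on top of the ball
        moves.append({
            "player_id": player["id"],
            "direction": (dx / distance, dy / distance),  # straight at the ball
            "power": 1.0,                                  # always full force
        })
    return moves
```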

But, as an additional stretch, my instructions stated that each model would have to come up with its own strategy for determining what to output and how, so that it could parse and interpret its own performance. It wouldn't see the game in action, but like a chess master reviewing each move, it would be able to analyze how it performed and then get one shot at improving its algorithm before the tournament. The only additional feedback I'd provide was the final score against Just Go, Bro! and some notes, like whether the goals were lucky or whether there were own goals.
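
As a rough illustration of that loop (again, the field names and shapes here are simplified, not any model's real output), a strategy could emit one structured line per turn and then summarize its own behavior afterward:

```python
# Sketch of the self-review loop: log one JSON line per turn to the console,
# then parse the log afterward to summarize how the strategy behaved.
# Field names are illustrative, not the engine's actual data.
import json

def log_turn(turn, game_state, moves):
    print(json.dumps({
        "turn": turn,
        "ball": game_state["ball"],
        "my_players": [(p["x"], p["y"]) for p in game_state["my_players"]],
        "moves": moves,
    }))

def summarize(log_lines):
    turns = [json.loads(line) for line in log_lines]
    # e.g. how many of my players were near the ball, on average
    avg_players_near_ball = sum(
        sum(1 for (x, y) in t["my_players"]
            if abs(x - t["ball"]["x"]) + abs(y - t["ball"]["y"]) < 10)
        for t in turns
    ) / max(len(turns), 1)
    return {"turns": len(turns), "avg_players_near_ball": avg_players_near_ball}
```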

And that's it! You can check out the results below... 

To break down what I observed...

Claude: Had the best overall strategy, despite losing to Gemini. It assigned positions and respected those positions (sometimes to its own detriment - for example, its goalie never attacking the ball, just staying put, or its midfielder backing off the attack to return to position). It demonstrated the most cohesive behavior, which was evident in the amount of control it had over its opponents. It also added some small randomization to avoid falling into a pattern loop.

Gemini: Gemini suffered from a bad assumption that it never figured out how to correct. It couldn't figure out the angles and orientation, so it kept sending players into the corner rather than positioning them in front of the net. Despite this, it did switch between attacking and defensive modes, so it had a few lucky breakaways and benefited from opponents scoring on themselves.

GPT: Somewhere between Claude and Gemini. It would get confused about the game's orientation and angles, frequently drifting off to the wings. Interestingly, GPT coded its strategy to adjust based on the game score, and I liked how it approached bouncing the ball: rather than hitting it straight on, it would attack at a slight angle to direct it forward.
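
That angled approach is a neat bit of geometry: instead of targeting the ball itself, target a point slightly behind the ball relative to the opponent's goal, so contact pushes the ball forward. A rough sketch, with made-up coordinates:

```python
# Sketch of the angled-approach idea: aim for a spot just behind the ball
# (relative to the opponent's goal) so the hit redirects the ball goal-ward.
# Coordinates and field layout are illustrative.
import math

def approach_point(ball, goal, offset=2.0):
    """Return a point 'offset' units behind the ball, on the line from the goal through the ball."""
    dx, dy = ball[0] - goal[0], ball[1] - goal[1]
    dist = math.hypot(dx, dy) or 1.0
    return (ball[0] + offset * dx / dist, ball[1] + offset * dy / dist)

# Example: ball at (50, 30), opponent's goal at (100, 30)
print(approach_point((50, 30), (100, 30)))  # -> (48.0, 30.0): approach from behind, push toward goal
```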

What they all failed to do:

I was surprised that none of them looked much at their opponents' positioning. They all had information about where the other players were with respect to the ball, the net, etc. - and yet they all ignored it.

Even when I suggested considering the opponents' positioning, they still ignored this feedback.

This meant they didn't bounce the opponents away from the ball, or try to predict where the ball would go next.

As for the log files they all generated: I was pretty impressed with how they parsed the output and refined the strategy with each pass - but again, these logs only focused on their own players. They could have tracked the opponents, and more details about the game state, but they were overly biased toward their own players.

They also failed to attempt more complex approaches like bouncing the ball off the wall or putting the ball into open space. Granted, those approaches are far more complicated - but I was curious to see if anything would emerge.
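
For the curious, "predicting where the ball would go next" doesn't have to be fancy. Assuming a strategy keeps the ball's position from the previous turn (the shapes here are my simplification), even a linear guess would have been a start - something like:

```python
# Sketch of a simple next-turn ball prediction none of the strategies attempted.
# Assumes the strategy remembers the ball's position from the previous turn.

def predict_ball(previous_pos, current_pos, turns_ahead=1):
    """Linear extrapolation: assume the ball keeps its current per-turn velocity."""
    vx = current_pos[0] - previous_pos[0]
    vy = current_pos[1] - previous_pos[1]
    return (current_pos[0] + vx * turns_ahead, current_pos[1] + vy * turns_ahead)

# Example: ball moved from (40, 30) to (44, 32) last turn -> expect roughly (48, 34) next turn
print(predict_ball((40, 30), (44, 32)))
```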

What the video doesn't show: 

I skipped showing the v1 and v2 iterations against Just Go, Bro! because I wanted the models to have some time to train and adjust. v3 was going to be the final strategy... until I lied, and gave them a new challenge.

For v4, to enrich their knowledge bases, I gave each model access to the opponents' strategies. Claude got access to GPT's and Gemini's strategies, Gemini had GPT's and Claude's, and so on. They were also given the log files from the tournament so they could parse the game information and further refine their strategies.

The results were a little disappointing. Strategies were adjusted, but for the worse. Each of the LLMs over-calibrated to its opponents' strategies, regardless of their effectiveness. The results were confusing. Gemini decided that, on each turn, it would only move two players, to simplify things. GPT and Claude made minor adjustments but seemed to lose aggression on the ball - games were slower going.

Then I took it a step further: this time, I gave all of the models full access to the entire engine. I explained how the engine comes together and instructed the LLMs to look through it to further refine their strategies.

v5 was no better than v4, and perhaps in some ways worse. I think by this point the LLMs were too stuck on certain principles within their context windows, so they couldn't "reset" and fix core issues.

Bonus Round:

Thinking through v5, I began wondering if my own instructions and iterative approach introduced some bias.

The strategies weren't great - but they also weren't bad. So, channeling my inner Swiftie, I wondered: Is it me? Am I the problem?


Creating new context windows, I re-introduced the problem to each of them. This time, my instructions were sparse. I gave full access to the entire engine. I gave no background, no goal. I loaded up a blank strategy file, and said "Complete the code here."

Running this, I'll admit, I grew nervous. What would it mean if the LLMs came back with impressive new strategies? Or even strong, cohesive strategies? What would it mean if my own prompts had led them astray? Had we reached the level of sophistication where "helpful" prompts mostly just get in the way? Was this the equivalent of being a junior coder, sliding over to the senior to ask for help and being dismissed with an annoyed "Just send me the code file and the bug, I'll figure it out..."?

v6 - or, more appropriately, NextGen-v1 - was a giant step backwards. I'm relieved to say it was worse than v1. Was it functional? Yes. Was there some semblance of a strategy? Somewhat. But was it actually any good? No. No, it was not.

So, how do the LLMs actually build their soccer strategy?

This whole approach made me want to get a deeper sense of how they were approaching this problem space.

I've used coding agents to build some pretty impressive things but, in each case, the space was predictable: there was a clear outcome, the possibilities were finite, there was an optimal path.

This was asking AI to do something quite different.

Different, but not unheard of. This entire exercise was the equivalent of asking the LLMs to advise on a good real-world soccer strategy. And LLMs are pretty good at tackling that (barring the increasingly infrequent hallucinations).

Ask GPT to help you coach your soccer team, and it will build its context: you are interested in soccer, you want to focus on strategy, you have multiple players, and so on. Let's rewind for a brief moment: before your prompt even gets to GPT, it will have built a statistical model based on all the information fed to it - about soccer, chocolate chip cookies, and everything else in its knowledge base. That gives it a weighted model: given a subject matter, which word is most likely to follow the word before it, which follows the word before that, and so on. So, when you ask about soccer strategy, GPT has already formulated a statistical view of what a strong soccer strategy looks like. It has absorbed correlations from how humans talk about well-regarded approaches (which nudge it toward a strong strategy), adds additional weights that match the specifics of your prompt, and then it... starts... writing.
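
If you want a toy picture of that weighted-model idea (wildly simplified, with made-up numbers), it's essentially this:

```python
# Toy illustration of next-word prediction: given the context so far, sample the
# next word from a learned probability distribution. Probabilities are made up.
import random

next_word_probs = {  # hypothetical weights for the word after "a strong soccer strategy..."
    "presses": 0.30,
    "controls": 0.25,
    "defends": 0.20,
    "spreads": 0.15,
    "counterattacks": 0.10,
}

def next_word(probs):
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights, k=1)[0]

print(next_word(next_word_probs))  # e.g. "presses"
```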

At this point, the LLM doesn't explicitly care if it's "right" or "wrong." It tries to align its output with the intended goal by stringing together words to form a clear narrative based on the information it has.

If you come back to the LLM with further information, it constructs a narrative around the information you give it and performs the same analysis with this new information while remaining grounded in those initial principles (unless you explicitly instruct otherwise).

When it comes to the LLMs building strategies for Code Collision, it's the same approach with one added step: rather than writing sentences, they take that same narrative and start building code that matches it. Feed the logs back into the LLM and it will parse them, build a causal narrative, compare that narrative to its overall goal, and make refinements.

Given that Code Collision's soccer game isn't pure soccer - it's a turn-based soccer simulation - there's an added layer of complexity here. The LLM has to adapt what it believes to be a strong soccer strategy to the simulation - it has to collapse a very large domain (all the world's soccer knowledge across its 100+ years of history) down to a simplified simulator that is far from a 1:1 mapping. As a result, a strong strategy in soccer may not directly translate to a strong Code Collision strategy.

Writing Better Prompts

Given this, how can we write better prompts when coding with LLMs? 

First: Context, context, context. There's no substitute for structured, relevant context. The more detail you provide about the problem space upfront, the better. In fact, introducing critical context late in the process can be detrimental to the quality of the output. If you know it early, share it early.

Second: Order matters. This one is hugely important. LLMs work backward from format. Specify the exact structure of the response before describing the task. This helps the LLM understand its goal, so that when it reviews your resources it knows what it's trying to do with that information. (There's a rough skeleton of this structure after this list.)

Third: Enrich the model's working context as much as you can. This sounds like the first point, but it goes beyond the context in your prompt to the reference material you provide. Separate facts from preferences.

Fourth: Use AI to help with your AI. Taking a note from this whole exercise of using AI to write AI, use AI to help you write better prompts. I took my original instructions file along with the entire game engine, and ran it through an LLM asking it to refine my instructions and tailor them to the model while leaving placeholders for me to complete for anything it felt was missing. This was a dramatic improvement to the quality of the strategies.

Fifth: Reduce ambiguity as much as possible. This exercise was intentionally about navigating ambiguity - but the more directional guidance you can provide, the better. If needed, you can reduce bias by stating that your suggestions are just for reference. This isn't perfect - anything you include will shape the response, so be selective about what you tell the LLM.

Sixth: Ask for reasoning artifacts, not just end results. Ask it to make its assumptions visible. Ask it to stop and ask you questions about key areas. If you plan on iterating with the LLM, it's crucial to have it externalize its assumptions so you can help refine them.

Seventh: Write prompts that assume multiple rounds of refinement rather than a single perfect answer. Suggest prioritized areas of focus so the LLM can be more targeted as it iterates.

Eighth: Think in terms of testable constraints. LLMs are better when you invite falsifiable outputs. Ask the LLM to express results in ways that you can test, validate, and verify so you can provide feedback.

Ninth: Be explicit when you want the LLM to not do something. Negatives guide the models as much as the positives.

Tenth: This one may be obvious, but it's worth stating so that I don't end a list with nine items. Set the LLM's persona and constraints so you can guide its approach. Do you want it to take a pragmatic approach? Write beginner-friendly code? Take a mathematical approach?
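
Pulling a few of these together, here's a rough prompt skeleton that illustrates the ideas above - the bracketed pieces and the wording are placeholders, not a magic formula:

```text
ROLE (Tenth): You are a pragmatic senior engineer; keep the code simple and readable.

OUTPUT FORMAT (Second): Return a single strategy file plus a short list of the
assumptions you made.

CONTEXT (First, Third): [engine overview, the data the strategy receives each turn,
relevant reference files - with facts separated from preferences]

TASK (Seventh): [what to build, with prioritized areas of focus for iteration]

CONSTRAINTS (Eighth, Ninth): Express results in a way I can verify against the logs.
Do NOT modify the engine. Do NOT assume data you aren't given.

QUESTIONS (Sixth): Before writing code, ask me about anything ambiguous.
```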

What's Next?

As my video suggests, Code Collision has a lot of other games: Capture the Flag, Floor is Lava, some other soccer variants, and a "Sumo"-style game.

I may try out some of the other soccer variants next to see if there's a style of game where the LLMs can more readily build a strong strategy.

All in all, I came into this thinking the LLMs would really struggle but, having learned more about how they approach such a space, it seems ambiguity is less of a problem than I expected. That said, it also means their ability to solve such problems is less magical. But that's okay with me.
