Short pieces on how AI recommendation measurement works, why it was built, and what the data is showing.
There is a question sitting at the end of every GEO and AEO engagement that most agencies have not answered yet, and it is not a complicated question, but it is the one that will matter most when a client eventually asks it. The question is simply this: did it work?
Not did the site improve, and not did the score go up, and not did the citations increase in the tool dashboard. Did AI actually start recommending my business more often than it did before, and can you show me proof of that with real data from the models themselves?
The reason most agencies cannot answer that question today is not that they chose the wrong tools. It is that they are measuring the right things in the wrong sequence, treating prediction, monitoring, and observation as the same activity when each one answers a different question.
The tools that scan a site before any work begins are doing something genuinely useful, which is predicting what is likely to happen based on how well a site is structured for AI readability, and the better ones are thorough about it, walking through schema and entity clarity and topical coverage and making educated guesses about what the models are likely to do with what they find. That prediction has real value. It tells an agency where to focus. It justifies the engagement. It gives a client something concrete to look at before a single dollar of work gets spent.
But a prediction is not an observation. A scan that says a site is well-structured for AI recommendation is not the same as evidence that AI is recommending it, and those two things cannot be used interchangeably when a client is asking whether the work produced a result.
The monitoring tools are doing something different again, which is watching whether a business gets named when a specific prompt gets run, and tracking that over time so changes in citation behavior become visible. That is closer to observation. It answers the question of whether a business appeared in a response, and for many use cases that is exactly what an agency needs to know, especially when the goal is maintaining presence in AI-generated answers and catching drops before they become problems.
The distinction that matters is between appearing and being chosen. A business can be named in an AI response as a point of comparison, as a runner-up, as an example of a type of business in a geography, without ever being the answer to a buying-intent query. Monitoring that a name appeared is not the same as monitoring that a name was selected, and for clients whose goal is to win the recommendation when a buyer is ready to act, that gap is where the story lives.
The ARO Index is doing something else entirely, which is observing selection behavior across real buying-intent queries, run simultaneously across all four major AI platforms, and recording not just whether a business appeared but whether each model chose it, how confidently, how consistently, and how that stacks up against every other audited business in the same category and market.
That is not a better version of scanning or monitoring. It is a different measurement answering a different question. The scan answers: is this site ready? The monitor answers: is this business appearing? The Index answers: is this business being selected, by which models, and how does that compare to who else is getting selected instead?
All three questions are worth asking. The sequence matters.
For an agency running a GEO or AEO engagement, the cleanest proof-of-work looks something like this. You audit the client at the start so you have a selection baseline. You do the work. You audit again at the end so you have a selection comparison. What changed is no longer a guess based on site structure or a tally of citation events. It is a before and after on the actual behavior of the models when a buyer asks.
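As a minimal sketch of that before-and-after comparison, the snippet below stores two audit snapshots for the same query and reports what changed. The `Audit` class, its field names, and the numbers are all assumptions made up for illustration; they are not the ARO Index data format or real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Audit:
    """One audit snapshot for one query (hypothetical field names)."""
    query: str
    score: int             # score reported for this query
    models_selecting: int  # how many of the four platforms selected the business


def compare(baseline: Audit, followup: Audit) -> dict:
    """Return the before/after change for the same query."""
    if baseline.query != followup.query:
        raise ValueError("compare audits of the same query only")
    return {
        "query": baseline.query,
        "score_change": followup.score - baseline.score,
        "models_change": followup.models_selecting - baseline.models_selecting,
    }


# Made-up numbers, for illustration only.
before = Audit("best hvac company in mount pleasant", score=40, models_selecting=1)
after = Audit("best hvac company in mount pleasant", score=70, models_selecting=3)
print(compare(before, after))
```

The guard on `query` is the design choice that matters here: a baseline and a follow-up are only comparable when they asked the same question.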
That is what a client is really asking for when they ask whether it worked. They want to know if the answer changed. The ARO Index is the layer that shows whether it did.
The tools are not competing with each other. They are measuring different things at different points in the same engagement. The only question is whether the right measurement is sitting at the end of the process, where the proof actually has to live.
ARO Index is the AI recommendation research layer of TaG Makes. The Index tracks which businesses AI platforms select across real buying-intent queries, by market and by category. Access the data at aroindex.com.
For this analysis I used a site I control: tagmakessc.com, my own Charleston, SC marketing agency. That choice is deliberate. When the test subject is your own site, you know exactly what is on it and exactly what did not change between runs. That removes the biggest source of doubt when a score moves. This analysis, conducted by Therese Grittner for the ARO Index, ran tagmakessc.com against six distinct buyer queries between April and June 2026 to isolate why the same website returns different ARO Scores from one audit to the next. The findings are relevant to business owners and agency partners who see a score move and want to know whether something is wrong.
Here is the most common question the ARO Index gets after someone runs an audit twice: why did my score change? Nothing on the site changed. The score still moved. Sometimes by a point. Sometimes by fifty.
That movement is not a flaw in the measurement. It is the measurement. An ARO Score reflects what AI platforms actually did when asked a buyer's question, and AI platforms do not return a fixed answer. Understanding why the number moves is the difference between reacting to noise and reading a real signal.
Most people treat an ARO Score like a credit score: one number, attached to them, that only changes when they do something. Improve the site, the number goes up. Leave it alone, the number sits still.
The data does not behave that way. An ARO Score is not a property of your website. It is a record of how AI platforms responded to a specific question at a specific moment. Change the question, or simply ask the same question again, and the answer can shift.
Below are real audit results for tagmakessc.com, all from the ARO Index. Because I own the site, I can confirm it was identical across every run. Nothing was edited, republished, or touched between audits. The only variables were the query asked and the moment it ran.
| Query asked | ARO Score | Models selecting | Date |
|---|---|---|---|
| best agency to get my business recommended on ai | 95 | 4 of 4 | Jun 4 |
| how do I see if chatgpt recommends my business in charleston sc | 61 | 2 of 4 | Jun 4 |
| how can I prove AI is recommending a business in atlanta GA | 44 | 1 of 4 | Jun 4 |
| how can I prove AI is recommending a business in atlanta GA | 23 | 0 of 4 | Jun 6 |
Two separate forces are visible in that table, and telling them apart is the whole point.
The gap between 95 and 44 was not about quality. It was about framing. When the query read like a buyer searching for a service provider, "best agency to get my business recommended on ai," all 4 models selected the site. When the query named a city the business is not based in, Atlanta, GA, the score collapsed. That is correct behavior. AI platforms should not recommend a Charleston, SC business as a local Atlanta, GA option.
This is the larger of the two forces. Query framing routinely moved the score by 40 points or more in this dataset. A score is always a score for a question. There is no single ARO Score for a website detached from what someone asked.
Now look at the two Atlanta, GA rows. Same exact query. Two days apart. The score dropped from 44 to 23, and model selection fell from 1 of 4 to 0 of 4. The site did not change. The question did not change. The models simply returned a different result on a different run.
This is model non-determinism. ChatGPT, Claude, Gemini, and Perplexity do not return identical answers to identical prompts. Ask twice, get two shortlists. This force is smaller than query framing, but it is always present, and it is why a score can drift a few points between back-to-back runs, and sometimes more across days as in the two Atlanta, GA rows, with no other explanation.
The inconsistency is not interference with the signal. The inconsistency is the signal.
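The two forces can be separated with simple arithmetic on the table above. The sketch below copies the four rows as they appear and computes one spread across different queries on the same day and one spread across repeated runs of the same query. The variable names are illustrative, and this is only a way to read the table, not how the ARO Index computes anything.

```python
# Rows copied from the table above: (query, score, models selecting, date).
rows = [
    ("best agency to get my business recommended on ai", 95, 4, "Jun 4"),
    ("how do I see if chatgpt recommends my business in charleston sc", 61, 2, "Jun 4"),
    ("how can I prove AI is recommending a business in atlanta GA", 44, 1, "Jun 4"),
    ("how can I prove AI is recommending a business in atlanta GA", 23, 0, "Jun 6"),
]

# Group scores by query so repeated runs of one query sit together.
by_query: dict[str, list[int]] = {}
for query, score, _models, _date in rows:
    by_query.setdefault(query, []).append(score)

# Run-to-run spread: max minus min within each query that ran more than once.
run_spread = {q: max(s) - min(s) for q, s in by_query.items() if len(s) > 1}

# Framing spread: highest vs lowest score across different queries on Jun 4.
same_day = [score for _q, score, _m, date in rows if date == "Jun 4"]
framing_spread = max(same_day) - min(same_day)

print(framing_spread)  # 95 - 44 = 51
print(run_spread)      # the Atlanta, GA query: 44 - 23 = 21
```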
The practical read comes down to the size and direction of the movement.
A small move, a few points, between runs of the same query is model noise. It means the platforms are slightly uncertain about where you belong, and that uncertainty is normal. It is not a reason to change anything.
A large move tied to a different query is the more useful finding. It tells you which questions you are selected for and which you are invisible for. That is a map, not a malfunction.
A sustained move in one direction across many runs of the same query is the one to watch. That is not noise. That is the competitive ground actually shifting under you, and it is the kind of change the ARO Index is built to surface over time.
A business selected by 4 of 4 models on repeated runs sits in a different position than one selected by 2 of 4, even when a single snapshot makes them look alike. Consistency across models and across runs is the thing worth measuring, because that consistency is what a buyer actually experiences when they ask an AI platform for a recommendation. One reading is a dice roll. The pattern across many readings is the truth.
So when an ARO Score moves, the first question is not "what broke." It is "what moved, by how much, and in which direction." The answer tells you whether you are looking at noise, a framing difference, or a real change in how AI platforms see you.
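One way to make that triage concrete is a small function that labels a movement using the three cases described above. This is a sketch under stated assumptions: the function name, the `noise_band` of 5 points, and the `min_runs` of 3 are hypothetical thresholds chosen for illustration, not values the ARO Index publishes.

```python
def classify_move(same_query: bool, deltas: list[int],
                  noise_band: int = 5, min_runs: int = 3) -> str:
    """Label a score movement as noise, framing, or a sustained shift.

    deltas: successive score changes between runs, oldest first.
    noise_band and min_runs are illustrative thresholds (assumptions).
    """
    if not deltas:
        raise ValueError("need at least one score change")
    if not same_query:
        return "framing difference"
    # Every change pointing the same way, over enough runs, reads as a trend.
    all_down = all(d < 0 for d in deltas)
    all_up = all(d > 0 for d in deltas)
    if len(deltas) >= min_runs and (all_up or all_down):
        return "sustained shift"
    if all(abs(d) <= noise_band for d in deltas):
        return "model noise"
    return "inconclusive, run again"


print(classify_move(True, [2, -3]))        # small, mixed direction
print(classify_move(False, [-51]))         # different query
print(classify_move(True, [-4, -6, -5]))   # same direction across several runs
```

Under these assumed thresholds, a single large change on the same query, such as `classify_move(True, [-21])`, comes back as "inconclusive, run again", which matches the idea that one reading is a dice roll and the pattern needs more runs.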
Most tools will tell a business whether it showed up in an AI answer, and that sounds useful until you realize showing up and getting picked are two completely different things. The ARO Index asks the harder question: when a real buyer asks ChatGPT, Claude, Gemini, or Perplexity for a recommendation, who does the AI choose?
The mechanic is simple to explain. It's just genuinely hard to do well.
You take a business and a real buying-intent query, the kind a customer actually types when they're ready to spend money: "best HVAC company in Mount Pleasant," not just "HVAC." You put that exact query to all four of the major AI models and record what each one does: whether it recommended the business or skipped right past it, how high it placed it, and how consistent that answer stayed across all four. That consistency is the part that matters most, because one AI saying yes is noise, four AIs agreeing is a signal, and the gap between them is where the real story lives. The ARO Score is the single number that falls out of all of it, one read on how consistently AI selects a business when a buyer is ready to act.
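To picture what gets recorded for one query, here is a minimal sketch of a per-model result and an agreement summary. The `ModelResult` fields and the `summarize` function are hypothetical names for illustration, the example outcomes are made up, and the sketch deliberately stops at counting agreement: it does not attempt to reproduce how the ARO Score itself is calculated.

```python
from dataclasses import dataclass
from typing import Optional

PLATFORMS = ("ChatGPT", "Claude", "Gemini", "Perplexity")


@dataclass(frozen=True)
class ModelResult:
    platform: str
    recommended: bool        # did the model recommend the business?
    position: Optional[int]  # 1 = first in the shortlist, None if skipped


def summarize(results: list[ModelResult]) -> dict:
    """Count agreement across platforms for a single query."""
    seen = {r.platform for r in results}
    if seen != set(PLATFORMS) or len(results) != len(PLATFORMS):
        raise ValueError("expected exactly one result per platform")
    selecting = [r for r in results if r.recommended]
    positions = [r.position for r in selecting if r.position is not None]
    return {
        "models_selecting": f"{len(selecting)} of {len(PLATFORMS)}",
        "unanimous": len(selecting) in (0, len(PLATFORMS)),
        "best_position": min(positions) if positions else None,
    }


# Made-up outcomes for one query, for illustration only.
example = [
    ModelResult("ChatGPT", True, 1),
    ModelResult("Claude", True, 3),
    ModelResult("Gemini", False, None),
    ModelResult("Perplexity", False, None),
]
print(summarize(example))
```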
Here's what the Index is not. It is not a mention counter, because being named in a paragraph is not the same as being recommended, and it is not a site-readiness checker that predicts how you might do someday, because it only watches what the models actually did. Predictive tools guess. The Index observes.
That distinction is the entire point, because everyone else is measuring whether you appeared, and the Index is measuring whether you were chosen.
This whole thing started as a question I could not let go of, which was whether AI would actually recommend a small business over a big one, not in theory and not as a nice idea, but specifically, like if someone asked ChatGPT for the best option in their town, would the local shop doing the better work actually show up, or would the model just hand the answer straight to whoever already had the biggest footprint and call it a day.
So I started checking by hand, one business and one query and four models at a time, writing down whatever came back. It was ridiculous. I was doing it manually like some kind of spreadsheet hermit, and it did not scale past about the third afternoon before I knew I had to build the thing properly. But the answers were interesting enough that stopping was never really on the table.
The obsession had a reason sitting underneath it, which is that small businesses are about to compete in a place most of them don't even know exists yet, because when a customer asks an AI instead of scrolling through Google, a brand new gatekeeper quietly decides who gets seen and who doesn't, and if nobody is measuring that, then small businesses are flying straight into it blind while the big players already have teams in a room somewhere figuring it all out, and the local shop has nobody.
I wanted everyone to be able to see it: not a guess, not a sales pitch dressed up as a score, but an actual measurement of what AI does when a real buyer asks. That's what the Index turned into. It outgrew the by-hand stage fast, but the reason behind it never changed: I built it because I wanted to know, and then I wanted everyone else to be able to know too.
Honest status, because honestly that's the only kind worth publishing.
The Index is early and it is seeded, and here's exactly what that means, no spin on it. To get businesses into the Index in the beginning, the field got built from existing search presence, meaning businesses were surfaced from where they already ranked and then run through the audit, which was the fast way to get a real foundation in place, and it works, and it got the whole thing off the ground when it needed to get off the ground.
It also means the current picture skews favorable, because a field built from businesses that already have a web footprint is, by definition, a field of likely performers, so the recommendation rates are running high right now because of how that field got assembled and not because AI actually recommends everyone, and that number is going to move as the Index fills up with the real mix, the businesses running their own audits, the client work, the owners who genuinely want to know where they stand instead of where they hope they stand. The seeds were the starting line. They were never the baseline.
So why publish any of this before the baseline is clean? Because a research source earns trust by showing the work, including the parts that aren't finished yet, and anyone can sit on their data until the numbers flatter them, but this is being built out in the open where you can watch it happen.
One finding already holds even on a seeded field, which is that the four AI models do not agree on who to recommend, because looking at the very same businesses they keep reaching different conclusions, and whatever shaped that field shaped it equally for all four of them, so the disagreement between the models survives the seeding completely intact. That part is real, and it's the exact thread the next volume picks up.
This is the Seed Edition. The real numbers are coming as the Index fills with the real mix. Run your audit now to be part of the data.