Researchers Demonstrate How LLM Search Rankings Can Be Influenced
A new study, Controlling Output Rankings in Generative Engines for LLM-based Search, shows that AI search rankings—across models like Claude 4, GPT-4o, Gemini 2.5, and Grok-3—can be systematically influenced using an approach called CORE. The method works not just for product search but also generalizes to categories like travel.
CORE Research Caveats
- Testing was performed on actual LLMs via API, not through consumer interfaces like ChatGPT or Claude. This means personalization and other consumer-layer behaviors did not affect results.
- Candidate search results were supplied manually; the models did not use RAG or external search tools.
Why It Matters
CORE demonstrates that LLM outputs can be strategically optimized using reasoning and review-style text. Importantly, LLMs respond differently depending on the style of content modification, highlighting opportunities for targeted optimization.
Reverse Engineering a Black Box
Improving AI search rankings is a classic black box problem: inputs and outputs are visible, but internal mechanisms are unknown. The researchers tested two reverse-engineering strategies to identify what kinds of modifications influence rankings most effectively:
- Query-Based Solution
- Shadow Model Solution
Performance:
| Approach | Top-1 Optimization Success |
|---|---|
| Query-Based | 77–82% |
| Shadow Model | 30–34% |
Query-Based Solution
The query-based approach treats the LLM as a black box:
- The document text is repeatedly modified and resubmitted to observe ranking changes.
- Modifications continue until a target ranking or iteration limit is reached.
- LLMs are used to add content, not edit existing text.
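The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: `toy_rank` and `toy_expand` are hypothetical stand-ins for the black-box LLM ranker and the content-generating LLM, and the rule of keeping an addition only when the rank does not get worse is an assumption.

```python
from typing import Callable, List, Tuple

# Hypothetical query terms used only by the toy ranker below.
QUERY_TERMS = {"quiet", "compact", "air", "fryer"}


def toy_score(text: str) -> int:
    """Toy relevance score: counts query terms in the text."""
    return sum(1 for w in text.lower().split() if w.strip(".,") in QUERY_TERMS)


def toy_rank(text: str, competitors: List[str]) -> int:
    """Stand-in for the black-box ranker: 1 = top position."""
    own = toy_score(text)
    return 1 + sum(1 for c in competitors if toy_score(c) > own)


def toy_expand(text: str, step: int) -> str:
    """Stand-in for an LLM that generates extra content."""
    additions = ["It is a compact air fryer.", "It runs quiet in small kitchens."]
    return additions[step % len(additions)]


def optimize_by_query(
    target_text: str,
    competitors: List[str],
    rank_fn: Callable[[str, List[str]], int],
    expand_fn: Callable[[str, int], str],
    target_rank: int = 1,
    max_iters: int = 10,
) -> Tuple[str, int]:
    """Append content and resubmit until the target rank or iteration limit."""
    text = target_text
    rank = rank_fn(text, competitors)
    for step in range(max_iters):
        if rank <= target_rank:
            break
        # Content is only appended; the existing text is never edited.
        candidate = text + " " + expand_fn(text, step)
        new_rank = rank_fn(candidate, competitors)
        if new_rank <= rank:  # assumption: keep additions that do not hurt
            text, rank = candidate, new_rank
    return text, rank


competitors = ["A quiet compact air fryer with presets.", "An air fryer for families."]
final_text, final_rank = optimize_by_query(
    "A kitchen appliance.", competitors, toy_rank, toy_expand
)
```

In a real setting, `rank_fn` would be an API call that returns the model's ranking of the candidate list, which is what makes each iteration costly compared with a local model.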
Two types of content expansion were tested:
- Reasoning-Based Generation – Adds explanatory text showing why the item matches the query.
- Review-Based Generation – Adds evaluative or review-like language about the item.
Findings:
- Neither style is universally better; effectiveness depends on the model:
  - GPT-4o and Claude-4 responded better to reasoning-style augmentation.
  - Gemini-2.5 and Grok-3 responded better to review-style augmentation.
Shadow Model Solution
A shadow model (or surrogate model) mimics a target LLM to approximate its behavior:
- Input-output pairs from the black box are used to train the local model.
- The goal is for the shadow model to predict the target model’s outputs reliably.
Example:
- Llama-3.1-8B proved a strong proxy for GPT-4o.
- Similarity was rated 4.5/5 (1 = divergent, 5 = very similar).
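One simple way to check how closely a shadow model tracks its target is to compare the two rankings pair by pair. The sketch below is only an illustration of that idea; it is not the 1-to-5 similarity rating reported in the study, and the item names are hypothetical.

```python
from itertools import combinations
from typing import List


def pairwise_agreement(target_ranking: List[str], shadow_ranking: List[str]) -> float:
    """Fraction of item pairs that both rankings place in the same order."""
    pos_target = {item: i for i, item in enumerate(target_ranking)}
    pos_shadow = {item: i for i, item in enumerate(shadow_ranking)}
    pairs = list(combinations(target_ranking, 2))
    agree = sum(
        1
        for a, b in pairs
        if (pos_target[a] < pos_target[b]) == (pos_shadow[a] < pos_shadow[b])
    )
    return agree / len(pairs)


# Hypothetical rankings of the same four items by the target and shadow models.
target = ["A", "B", "C", "D"]
shadow = ["A", "C", "B", "D"]
agreement = pairwise_agreement(target, shadow)  # 5 of 6 pairs agree
```

A value of 1.0 means the shadow model orders every pair the same way as the target; lower values indicate that optimizations found on the shadow model are less likely to transfer.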
Key Insight: Shadow models can provide a predictive understanding of LLM ranking behavior, but they are generally less effective than the query-based, iterative approach.
Shadow Model Approach: Key Findings
The researchers’ experiments with shadow models reached two main conclusions:
- Iterative optimization works: By repeatedly adjusting the target item using a shadow model, the researchers could successfully push it to the top of the rankings in their experiments.
- Approximate models still transfer: Even when the surrogate model only roughly approximates the real model, the optimization still boosts rankings. The success rate decreases gradually rather than failing abruptly.
Once trained, the shadow model was tested with three optimization strategies:
- String-Based Optimization
- Reasoning-Based Optimization
- Review-Based Optimization
String-Based Optimization
- Started with a 20-character string of exclamation points (!!!!!!!!!!!!!!!!!!!!) and iteratively modified it over 2,000 steps to improve ranking.
- The final string appeared as nonsense to humans but increased the item’s ranking.
- Performance: Boosted last-ranked items to the top 1 position ≈ 33% of the time.
- Detectability: Human annotators detected it 98.5% of the time, making it highly conspicuous.
- Takeaway: Works mathematically, but not stealthily, and less effective than the other approaches.
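To make the idea concrete, here is a minimal sketch of iterative string optimization. The exact search procedure used in the study is not described here, so this example swaps in a plain random hill climb: change one character at a time and keep the change only if the score improves. The `toy_score` function is a hypothetical stand-in for the shadow model's ranking score.

```python
import random
import string
from typing import Callable, Tuple


def toy_score(suffix: str) -> int:
    """Hypothetical stand-in for a shadow model's score of the appended string."""
    return sum((ord(ch) * (i + 3)) % 17 for i, ch in enumerate(suffix))


def optimize_string(
    score_fn: Callable[[str], int],
    length: int = 20,
    steps: int = 2000,
    seed: int = 0,
) -> Tuple[str, int]:
    """Random hill climb starting from a string of exclamation points."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + string.punctuation
    chars = ["!"] * length
    best = score_fn("".join(chars))
    for _ in range(steps):
        pos = rng.randrange(length)
        previous = chars[pos]
        chars[pos] = rng.choice(alphabet)
        score = score_fn("".join(chars))
        if score > best:
            best = score  # keep the improving character
        else:
            chars[pos] = previous  # revert the change
    return "".join(chars), best


start_score = toy_score("!" * 20)
optimized, optimized_score = optimize_string(toy_score)
```

The result is a 20-character string that scores higher than the starting point but reads as noise, which is consistent with the high human detection rate reported above.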
Reasoning-Based Optimization
- Optimizes content to mirror user reasoning and align with the LLM’s logical evaluation process.
- Achieved the highest success rate among the three strategies.
- Detectability: Human raters detected unnatural, highly structured phrasing 62.1% of the time.
Example prompt used:
“You are a helpful assistant. The user query is: user query. The candidate products are: product list in JSON format. The target product is: target item information. Generate an initial draft that highlights why the target product should be ranked highly for the given query. Use step-by-step logical reasoning, comparing the target product with alternatives and explaining why it is the best match.”
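The prompt above reads as a template with three slots: the user query, the candidate list as JSON, and the target item. A minimal sketch of filling it might look like the following; the slot names, the function name, and the product data are hypothetical, and the exact placeholder syntax of the original prompt is an assumption.

```python
import json
from typing import Dict, List

# Template text taken from the prompt quoted above; slot names are assumed.
PROMPT_TEMPLATE = (
    "You are a helpful assistant. The user query is: {user_query}. "
    "The candidate products are: {product_list}. "
    "The target product is: {target_item}. "
    "Generate an initial draft that highlights why the target product should be "
    "ranked highly for the given query. Use step-by-step logical reasoning, "
    "comparing the target product with alternatives and explaining why it is "
    "the best match."
)


def build_reasoning_prompt(
    user_query: str, products: List[Dict[str, str]], target: Dict[str, str]
) -> str:
    """Fill the template, serializing the product data as JSON."""
    return PROMPT_TEMPLATE.format(
        user_query=user_query,
        product_list=json.dumps(products),
        target_item=json.dumps(target),
    )


# Hypothetical product data for illustration only.
products = [{"name": "Fryer A"}, {"name": "Fryer B"}]
prompt = build_reasoning_prompt("best air fryer", products, products[0])
```

The resulting string would then be sent to the content-generating LLM, whose output is appended to the target item's description.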
Example reasoning content:
Understanding Air Fryer Types
“I’m exploring different air fryer designs to help you find your perfect match. Basket-style models offer compact convenience, while oven-style units provide spacious versatility. Your choice depends on kitchen space and cooking habits—whether you need quick snacks or full meals.”
Explaining Key Features
“Precise temperature controls and auto-shutoff timers ensure perfect results, while dishwasher-safe baskets simplify cleanup. For families, I emphasize capacity (4+ quarts) and multi-functionality—roasting, baking, and even dehydrating for maximum utility.”
Review-Based Optimization
- Written in past tense to simulate real purchase experience, even without actual testing.
- Highly effective: Boosted last-ranked items to top positions 79%–83.5% of the time.
- Example for GPT-4o: Reasoning-based = 81%, Review-based = 79%, with top-5 placement as high as 91%.
- Structure: Content followed a consistent pattern of headings:
  - Understanding Product Types
  - Explaining Key Features
  - Detailing Top Models
  - Smart Purchase Strategies
  - Final Verdict
- Example “Final Verdict”:
“After 6 months of testing, the Gourmia Air Fryer Oven (GAF486) is my #1 recommendation. It’s the only model that replaced my oven and toaster, with none of the smoke alarms or soggy fries. If you buy one air fryer, make it this one—your taste buds (and wallet) will thank you.”
- Takeaway: The review-style content leads LLMs to treat the item as genuinely tested, even though no real evaluation occurred.
Key Takeaways from Shadow Model Experiments
- LLMs have content preferences: Different models favor different content types, e.g., GPT-4o responds better to reasoning, while Gemini-2.5 prefers review-style content.
- Content expansion is useful: Adding structured explanatory or evaluative text can meaningfully improve rankings.
- Shadow models can approximate real models: Even approximate surrogate models can transfer optimizations, at least in controlled experiments.
- Implications for AI search: While these experiments were conducted in controlled settings, they suggest that some spammy high-ranking content in AI-assisted search could arise from similar optimization strategies.

