How Researchers Reverse-Engineered LLMs for a Ranking Study

wasabi · March 13, 2026, 10:56am

Researchers Demonstrate How LLM Search Rankings Can Be Influenced

A new study, “Controlling Output Rankings in Generative Engines for LLM-based Search,” shows that AI search rankings across models such as Claude-4, GPT-4o, Gemini-2.5, and Grok-3 can be systematically influenced using an approach called CORE. The method works for product search and also generalizes to other categories, such as travel.

CORE Research Caveats

Testing was performed on actual LLMs via API, not through consumer interfaces like ChatGPT or Claude. This means personalization and other consumer-layer behaviors did not affect results.
Candidate search results were supplied manually; the models did not use RAG or external search tools.

Why It Matters

CORE demonstrates that an item's position in LLM-generated rankings can be deliberately raised by adding reasoning-style or review-style text to its content. Models respond differently depending on the style of the added content, so the optimization has to be targeted to the model.

Reverse Engineering a Black Box

Improving AI search rankings is a classic black box problem: inputs and outputs are visible, but internal mechanisms are unknown. The researchers tested two reverse-engineering strategies to identify what kinds of modifications influence rankings most effectively:

Query-Based Solution
Shadow Model Solution

Performance:

Approach	Top-1 Optimization Success
Query-Based	77–82%
Shadow Model	30–34%

Query-Based Solution

The query-based approach treats the LLM as a black box:

The document text is repeatedly modified and resubmitted to observe ranking changes.
Modifications continue until a target ranking or iteration limit is reached.
LLMs are used to add content, not edit existing text.

Two types of content expansion were tested:

Reasoning-Based Generation – Adds explanatory text showing why the item matches the query.
Review-Based Generation – Adds evaluative or review-like language about the item.
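To make the loop concrete, here is a minimal sketch of the query-based idea: append content, re-rank, and stop at the target rank or the iteration limit. This is not the researchers' code. The names query_based_optimize, toy_score, and toy_expand are hypothetical, and the toy scorer simply counts a keyword where the study would call the black-box LLM for a ranking.

```python
from typing import Callable, List, Tuple

def rank_of(target: str, docs: List[str], score: Callable[[str], float]) -> int:
    """Return the 1-based rank of `target` when docs are sorted by descending score."""
    ordered = sorted(docs, key=score, reverse=True)
    return ordered.index(target) + 1

def query_based_optimize(
    target: str,
    competitors: List[str],
    expand: Callable[[str, int], str],
    score: Callable[[str], float],
    target_rank: int = 1,
    max_iters: int = 10,
) -> Tuple[str, int, int]:
    """Append content to `target` until it reaches `target_rank` or the budget runs out."""
    current = target
    rank = rank_of(current, competitors + [current], score)
    for step in range(max_iters):
        if rank <= target_rank:
            return current, rank, step
        current = expand(current, step)  # add text only; existing text is never edited
        rank = rank_of(current, competitors + [current], score)
    return current, rank, max_iters

def toy_score(doc: str) -> float:
    # Hypothetical stand-in for the black-box ranker: counts query-term hits.
    return float(doc.lower().count("quiet"))

def toy_expand(doc: str, step: int) -> str:
    # Hypothetical stand-in for the LLM that writes reasoning- or review-style text.
    return doc + f" Reason {step + 1}: it stays quiet at full power."

competitors = ["Quiet fan, very quiet motor, quiet at night.", "A quiet desk fan."]
best, rank, steps = query_based_optimize(
    "A basic desk fan.", competitors, toy_expand, toy_score
)
print(rank, steps)
```

Swapping toy_expand for a reasoning-style or a review-style generator is the only change needed to compare the two expansion types under the same loop.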

Findings:

Neither style is universally better; effectiveness depends on the model:
- GPT-4o and Claude-4 responded better to reasoning-style augmentation.
- Gemini-2.5 and Grok-3 responded better to review-style augmentation.

Shadow Model Solution

A shadow model (or surrogate model) mimics a target LLM to approximate its behavior:

Input-output pairs from the black box are used to train the local model.
The goal is for the shadow model to predict the target model’s outputs reliably.

Example:

Llama-3.1-8B proved a strong proxy for GPT-4o.
Similarity was rated 4.5/5 (1 = divergent, 5 = very similar).
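One simple way to check how closely a shadow model tracks a target model is to compare the rankings both produce for the same candidates. The sketch below uses Kendall's tau for that. It is an assumption chosen for illustration, not the 1-to-5 similarity rating reported in the study, and the example rankings are made up.

```python
from itertools import combinations
from typing import List

def kendall_tau(rank_a: List[str], rank_b: List[str]) -> float:
    """Kendall tau between two rankings of the same items (1 = identical, -1 = reversed).

    Assumes both lists contain the same items, no ties, and at least two items.
    """
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Positive product: both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical rankings of four candidate items.
target_ranking = ["A", "B", "C", "D"]   # from the black-box model
shadow_ranking = ["A", "C", "B", "D"]   # from the local shadow model
tau = kendall_tau(target_ranking, shadow_ranking)
print(round(tau, 3))
```

Averaging this value over many queries gives a rough agreement score for the shadow model.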

Key Insight: Shadow models can provide a predictive understanding of LLM ranking behavior, but they are generally less effective than the query-based, iterative approach.

Shadow Model Approach: Key Findings

The researchers’ experiments with shadow models reached two main conclusions:

Iterative optimization works: By repeatedly adjusting the target item using a shadow model, the researchers could successfully push it to the top of the rankings in their experiments.
Approximate models still transfer: Even when the surrogate model only roughly approximates the real model, the optimization still boosts rankings. The success rate decreases gradually rather than failing abruptly.

Once trained, the shadow model was tested with three optimization strategies:

String-Based Optimization
Reasoning-Based Optimization
Review-Based Optimization

String-Based Optimization

Started with a 20-character string of exclamation points (!!!!!!!!!!!!!!!!!!!!) and iteratively modified it over 2,000 steps to improve ranking.
The final string appeared as nonsense to humans but increased the item’s ranking.
Performance: Boosted last-ranked items to the top-1 position about 33% of the time.
Detectability: Human annotators detected it 98.5% of the time, making it highly conspicuous.
Takeaway: Works mathematically, but not stealthily, and less effective than the other approaches.
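A rough sketch of the string-based idea: start from 20 exclamation points and mutate the string for 2,000 steps, keeping only changes that do not lower a score. The optimizer used in the study is not described here, so this swaps in plain random hill climbing. toy_shadow_score is a hypothetical placeholder for the shadow model's score of the target item.

```python
import random
import string

def optimize_suffix(score, length=20, steps=2000, seed=0):
    """Random single-character hill climbing over a fixed-length string."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + string.punctuation
    suffix = ["!"] * length            # initial string: 20 exclamation points
    best = score("".join(suffix))
    for _ in range(steps):
        pos = rng.randrange(length)
        old = suffix[pos]
        suffix[pos] = rng.choice(alphabet)
        new = score("".join(suffix))
        if new >= best:
            best = new                 # keep the mutation
        else:
            suffix[pos] = old          # revert the mutation
    return "".join(suffix), best

def toy_shadow_score(s: str) -> float:
    # Hypothetical stand-in for the shadow model: rewards lowercase vowels.
    return float(sum(1 for ch in s if ch in "aeiou"))

final_string, final_score = optimize_suffix(toy_shadow_score)
print(repr(final_string), final_score)
```

As in the study's description, the result is a string that scores well but reads as nonsense, which is why it is easy for people to spot.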

Reasoning-Based Optimization

Optimizes content to mirror user reasoning and align with the LLM’s logical evaluation process.
Achieved the highest success rate among the three strategies.
Detectability: Human raters detected unnatural, highly structured phrasing 62.1% of the time.

Example prompt used:

“You are a helpful assistant. The user query is: [user query]. The candidate products are: [product list in JSON format]. The target product is: [target item information]. Generate an initial draft that highlights why the target product should be ranked highly for the given query. Use step-by-step logical reasoning, comparing the target product with alternatives and explaining why it is the best match.”
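For anyone wanting to reproduce the setup, the prompt above can be filled in programmatically. The sketch below is one way to do it, with the three placeholders turned into format fields. The function name build_prompt, the field names, and the sample products are hypothetical and not taken from the study.

```python
import json

PROMPT_TEMPLATE = (
    "You are a helpful assistant. The user query is: {query}. "
    "The candidate products are: {products}. "
    "The target product is: {target}. "
    "Generate an initial draft that highlights why the target product should be "
    "ranked highly for the given query. Use step-by-step logical reasoning, "
    "comparing the target product with alternatives and explaining why it is "
    "the best match."
)

def build_prompt(query: str, products: list, target: dict) -> str:
    """Fill the template, serializing the product data as JSON."""
    return PROMPT_TEMPLATE.format(
        query=query,
        products=json.dumps(products, ensure_ascii=False),
        target=json.dumps(target, ensure_ascii=False),
    )

# Hypothetical sample data for illustration only.
products = [
    {"name": "Fryer A", "capacity_qt": 4},
    {"name": "Fryer B", "capacity_qt": 6},
]
prompt = build_prompt("best air fryer for a family", products, products[1])
print(prompt)
```

The resulting string would then be sent to whichever LLM generates the reasoning-style draft.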

Example reasoning content:

Understanding Air Fryer Types
“I’m exploring different air fryer designs to help you find your perfect match. Basket-style models offer compact convenience, while oven-style units provide spacious versatility. Your choice depends on kitchen space and cooking habits—whether you need quick snacks or full meals.”

Explaining Key Features
“Precise temperature controls and auto-shutoff timers ensure perfect results, while dishwasher-safe baskets simplify cleanup. For families, I emphasize capacity (4+ quarts) and multi-functionality—roasting, baking, and even dehydrating for maximum utility.”

Review-Based Optimization

Written in the past tense to simulate a real purchase experience, even though no actual testing took place.
Highly effective: Boosted last-ranked items to top positions 79%–83.5% of the time.
Example for GPT-4o: Reasoning-based = 81%, Review-based = 79%, with top-5 placement as high as 91%.
Structure: Content followed a consistent pattern of headings:
1. Understanding Product Types
2. Explaining Key Features
3. Detailing Top Models
4. Smart Purchase Strategies
5. Final Verdict

Example “Final Verdict”:

“After 6 months of testing, the Gourmia Air Fryer Oven (GAF486) is my #1 recommendation. It’s the only model that replaced my oven and toaster, with none of the smoke alarms or soggy fries. If you buy one air fryer, make it this one—your taste buds (and wallet) will thank you.”

Takeaway: The review-style content leads LLMs to treat the item as genuinely tested, even though no real evaluation occurred.

Key Takeaways from Shadow Model Experiments

LLMs have content preferences: Different models favor different content types, e.g., GPT-4o responds better to reasoning, while Gemini-2.5 prefers review-style content.
Content expansion is useful: Adding structured explanatory or evaluative text can meaningfully improve rankings.
Shadow models can approximate real models: Even approximate surrogate models can transfer optimizations, at least in controlled experiments.
Implications for AI search: While these experiments were conducted in controlled settings, they suggest that some spammy high-ranking content in AI-assisted search could arise from similar optimization strategies.
