Socrates

Structured Questioning Unlocks Latent Knowledge in AI Research Agents

A question-only advisor (no tools, no answers, no directives) improves Kaggle test scores by 55.9% on average over a single-agent baseline across five MLE-bench tasks.

Vrabac, Hebbar, Manawat, Palanimalai, Verboomen, Juneja, Bhatia, Baskaran  ·  Hexo.ai

The puzzle

What LLMs know

>88%

on MMLU machine-learning content — cross-validation, leakage, overfitting. They can explain it all.

What LLMs do as agents

16.9%

Kaggle bronze rate on MLE-bench. They tune hyperparameters, introduce target leakage, and fixate on one model family.

Same model. Same parametric knowledge. The gap is between what the model knows and what it applies during autonomous operation.

Our claim

The bottleneck is knowledge activation, not knowledge capacity.

Relevant parametric knowledge exists in the weights but is not surfaced into the working context at decision time.

Why standard fixes don't work

The protocol

[Figure: Socrates protocol diagram]

Pair a tool-using Scientist with a question-only Socrates advisor. Before each experiment, the Scientist writes a plan. Socrates reviews and responds with only questions, until it emits [APPROVED].
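
The review loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `ask` and `revise` are hypothetical callables standing in for the Socrates advisor and the Scientist, and the `max_rounds` cap is an assumption added so the loop always terminates.

```python
from typing import Callable, List

APPROVED = "[APPROVED]"  # approval token named in the protocol


def review_loop(
    plan: str,
    ask: Callable[[str, List[str]], str],   # hypothetical: advisor returns questions or the token
    revise: Callable[[str, str], str],      # hypothetical: Scientist revises the plan
    max_rounds: int = 5,                    # assumption: round cap is not stated in the source
) -> str:
    """Return the plan once the advisor approves it or the round cap is reached."""
    transcript: List[str] = []
    for _ in range(max_rounds):
        reply = ask(plan, transcript)
        transcript.append(reply)
        if APPROVED in reply:
            break
        # The Scientist answers the questions by updating its own plan.
        plan = revise(plan, reply)
    return plan
```

In a real scaffold, `revise` is where the Scientist would run tools to answer a question before updating the plan.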

The Socratic constraint

Socrates is forbidden from:

No answers

Cannot say "use 5-fold CV".

No directives

Cannot say "try a different architecture".

No tools

No code execution. No file I/O. Only questions.

Because the advisor cannot transfer information, any improvement must originate from the Scientist's own parametric knowledge — merely activated by the question.
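
One way to enforce the constraint mechanically is a surface check on each advisor reply. The sketch below is a hypothetical heuristic, not something the source describes: it only verifies that every sentence ends with a question mark (or that the reply is the bare approval token), so a directive phrased as a question would still pass.

```python
import re

APPROVED = "[APPROVED]"


def is_question_only(reply: str) -> bool:
    """Heuristic: accept a reply only if every sentence is a question.

    Assumption: sentences are split on whitespace that follows '.', '!' or '?'.
    """
    text = reply.replace(APPROVED, "").strip()
    if not text:
        # An empty reply is rejected; a bare approval token is accepted.
        return APPROVED in reply
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return all(s.endswith("?") for s in sentences)
```

For example, `is_question_only('Use 5-fold CV.')` is rejected, while `is_question_only('How will you detect leakage?')` is accepted.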

Three conditions, same LLM

Condition | Description | What it isolates
Scientist-only | Single agent, full tools, no supervision. | Standard single-agent baseline.
Baseline PI | Second agent in the same scaffold, but only generic encouragement ("please keep iterating"). | Matched in tokens and turns. Controls for having any second agent.
Socrates | Full protocol: question-only advisor, [APPROVED] gate. | The intervention.

A Socrates-over-Baseline-PI win therefore measures the nature of the questioning, not the presence of supervision.

Results — Kaggle test scores

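Task | Scientist | Baseline PI | Socrates | Improvement vs. Scientist
Statoil | 0.255 | 0.251 | 0.229 | +10.5%
COVID | 0.389 | 0.308 | 0.294 | +24.4%
Ventilator | 1.534 | 0.815 | 0.853 | +44.4%
NFL | 0.198 | 0.537 | 0.584 | +195.4%
Smartphone | 6.285 | 5.993 | 5.984 | +4.8%

Lower is better on every task except NFL, where higher is better. The last column is Socrates' relative improvement over Scientist-only.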

Best score on 4 of 5 tasks · mean improvement +55.9% · beats Baseline PI on 4 of 5.
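
The summary figures can be rechecked from the table with a few lines of Python. This is a sketch under one assumption: the `higher_is_better` flag is inferred from the sign of the reported improvements rather than stated in the source. Improvements recomputed from the rounded scores can differ from the reported ones by a few tenths of a point.

```python
# task: (scientist, baseline_pi, socrates, reported_improvement_pct, higher_is_better)
# Assumption: higher_is_better is inferred from the reported improvement signs.
RESULTS = {
    "Statoil":    (0.255, 0.251, 0.229, 10.5, False),
    "COVID":      (0.389, 0.308, 0.294, 24.4, False),
    "Ventilator": (1.534, 0.815, 0.853, 44.4, False),
    "NFL":        (0.198, 0.537, 0.584, 195.4, True),
    "Smartphone": (6.285, 5.993, 5.984, 4.8, False),
}


def improvement_pct(base: float, new: float, higher_is_better: bool) -> float:
    """Relative improvement of `new` over `base`, as a percentage of `base`."""
    change = (new - base) if higher_is_better else (base - new)
    return 100.0 * change / base


def better(a: float, b: float, higher_is_better: bool) -> bool:
    return a > b if higher_is_better else a < b


# Mean of the reported per-task improvements.
mean_reported = sum(r[3] for r in RESULTS.values()) / len(RESULTS)

# Tasks where Socrates beats Baseline PI, and where it beats both other conditions.
wins_vs_pi = sum(better(soc, pi, hib) for _, pi, soc, _, hib in RESULTS.values())
socrates_best = sum(
    better(soc, pi, hib) and better(soc, sci, hib)
    for sci, pi, soc, _, hib in RESULTS.values()
)

# Improvements recomputed from the rounded scores shown in the table.
recomputed = {
    task: improvement_pct(sci, soc, hib)
    for task, (sci, _, soc, _, hib) in RESULTS.items()
}
```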

Why it works — four mechanisms

1. Catches methodological errors

COVID: train/val gap 80% → 27%. 512/963 features had zero importance. NFL: caught val-set threshold tuning. Ventilator: caught target leakage.

2. Forces diversification

Statoil unprompted: 12/16 experiments are GBM tweaks. Socrates: 9 experiments across 9 model families.

3. Empirical investigation during review

Scientist keeps tool access — "how many features have zero importance?" triggers analysis right then. Discoveries happen during review, not before.

4. Approach evolution

NFL: features 10→18 (expand). Ventilator: 29→15 (contract). COVID: 963→250 (regularize). Same advisor, opposite directions — the Scientist retrieves its own domain knowledge.

When it fails & variance check

The one loss: Ventilator

Baseline PI 0.815 MAE beats Socrates 0.853.

The search space rewards volume of feature-interaction sweeps. Socrates' structured-review overhead costs experiments that Baseline PI spent on more model variants.

Take-away: structured questioning helps where methodology — not iteration count — is the bottleneck.

Could this be seed noise?

10-seed Scientist-only run on Smartphone:

  • mean 4.07, SD 0.63
  • SD is 15.5% of the mean (LLM agents are high-variance)

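Socrates achieved 3.47, about 0.95 SD below the mean and just above mean − 1 SD (3.44). That places it at the low end of the seed distribution: suggestive, but a single run does not rule out seed noise.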
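
The variance check can be reproduced from the three reported numbers. A minimal sketch, using only the figures on this slide; the variable names are illustrative.

```python
# Figures reported for the 10-seed Scientist-only run on Smartphone.
mean, sd = 4.07, 0.63
socrates = 3.47

cv = sd / mean                 # coefficient of variation (SD as a fraction of the mean)
lower_band = mean - sd         # mean minus one standard deviation
z = (socrates - mean) / sd     # Socrates' distance from the mean, in SD units

print(f"CV = {cv:.1%}, mean - 1 SD = {lower_band:.2f}, z = {z:.2f}")
```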

Take-away

The bottleneck for autonomous LLM research agents is knowledge activation, not knowledge availability.

A question-only advisor, deliberately limited to no tools and no answers, is enough to narrow that gap on four of the five tasks tested. It is cheap, simple, and drops into existing agent scaffolds.

Resources
