Skill metrics, not vibes

Methodology

Qelly's public proof page reports four kinds of metric (closing-line value, Brier score, calibration buckets, and two shadow strategies) so a reviewer can separate skill from variance. ROI alone is too noisy to be diagnostic at small sample sizes. Together they span the range from "is the probability model well-calibrated" to "is the LLM filter actually earning its keep."

Closing-line value (CLV)

The percentage by which the market's closing price on an outcome exceeds our entry price on the same outcome. A positive value means we entered at a better price than the close.

CLV_i = (close_i − entry_i) / entry_i × 100
CLV = mean(CLV_i) across executed decisions where both prices exist

Positive CLV is predictive of long-run profitability even when realized P&L is too small to be diagnostic. If we consistently enter before the price moves toward our estimate, we have edge. CLV survives the variance that ROI cannot.
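The CLV formula can be sketched in a few lines of Python. This is an illustrative sketch, not Qelly's implementation: the function names and the sample prices are hypothetical, and it assumes each decision is a purchase of the outcome, so a higher closing price counts as a beat.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def clv_percent(entry: float, close: float) -> float:
    """Per-decision CLV in percent: (close - entry) / entry * 100."""
    if entry <= 0:
        raise ValueError("entry price must be positive")
    return (close - entry) / entry * 100.0


def mean_clv(
    prices: Sequence[Tuple[Optional[float], Optional[float]]]
) -> Optional[float]:
    """Mean CLV over (entry, close) pairs, skipping pairs with a missing price."""
    values = [
        clv_percent(entry, close)
        for entry, close in prices
        if entry is not None and close is not None
    ]
    return mean(values) if values else None


# Hypothetical executed decisions: (entry price, closing price).
# The third decision has no closing price and is excluded from the mean.
decisions = [(0.40, 0.44), (0.55, 0.55), (0.25, None), (0.60, 0.57)]
print(mean_clv(decisions))
```

Returning None when no decision has both prices keeps an empty sample from being reported as a CLV of zero.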

Brier score

Mean squared error of our predicted probability against the realized binary outcome. Lower is better. Range [0, 1]; coin-flip ≈ 0.25; perfect oracle = 0.

Brier = (1 / N) × Σ_i (p_i − y_i)^2
where p_i is Qelly's predicted probability and y_i ∈ {0, 1} is the realized outcome.

Includes both executed and LLM-rejected decisions — the forecast was made regardless of whether we acted on it. A low Brier score with a small sample is informative but not conclusive.
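As a minimal sketch of the Brier definition (mean squared error between predicted probability and binary outcome), assuming forecasts and outcomes arrive as two parallel lists; the function name and the sample data are hypothetical.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean of (p - y)^2 over all forecasts; lower is better."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical forecasts, including ones that were never executed.
forecasts = [0.8, 0.3, 0.6, 0.1]
results = [1, 0, 0, 0]

print(brier_score(forecasts, results))
# Reference point: always forecasting 0.5 scores 0.25 whatever the outcomes.
print(brier_score([0.5] * len(results), results))
```

Comparing the score against the constant-0.5 baseline on the same outcomes gives a quick sanity check that the forecasts carry any information at all.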

Calibration buckets

A reliability diagram: 10 bins by predicted probability, observed positive rate per bin.

A perfectly calibrated model has each bucket's mean predicted probability ≈ its observed rate (for narrow buckets, roughly the bucket midpoint). A model that says "70% confident" should resolve "yes" about 70% of the time. Calibration is independent of edge: you can be calibrated and unprofitable, or miscalibrated and lucky.
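The bucketing behind a reliability diagram can be sketched as follows. This assumes 10 equal-width bins on [0, 1] with a forecast of exactly 1.0 placed in the top bin; the function name, dictionary keys, and sample data are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Union

Row = Dict[str, Optional[Union[int, float]]]


def calibration_buckets(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> List[Row]:
    """Per-bin count, mean predicted probability, and observed positive rate."""
    counts = [0] * n_bins
    sum_pred = [0.0] * n_bins
    sum_pos = [0] * n_bins
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 lands in the last bin
        counts[idx] += 1
        sum_pred[idx] += p
        sum_pos[idx] += y
    rows: List[Row] = []
    for i in range(n_bins):
        n = counts[i]
        rows.append({
            "lo": i / n_bins,
            "hi": (i + 1) / n_bins,
            "n": n,
            # Empty bins report None rather than a misleading 0.0.
            "mean_pred": sum_pred[i] / n if n else None,
            "observed": sum_pos[i] / n if n else None,
        })
    return rows


# Hypothetical forecasts and outcomes.
probs = [0.72, 0.75, 0.78, 0.71, 0.12, 0.15]
outcomes = [1, 1, 0, 1, 0, 0]
for row in calibration_buckets(probs, outcomes):
    if row["n"]:
        print(row)
```

Reporting the count alongside each observed rate makes it visible when a bucket's rate rests on only a handful of decisions.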

Shadow strategy: implied-probability bettor

A hypothetical equity curve that replays the same market opportunities, sized with the same fractional Kelly multiplier, but using the market-implied probability instead of Qelly's.

In an efficient market this strategy has an expected return of ≈ 0% before fees (slightly negative after them). The gap between Qelly's equity curve and this baseline is Qelly's alpha: the portion of P&L attributable to having a better probability estimate than the market itself.

Shadow strategy: Kelly-only (no LLM)

A hypothetical equity curve that executes every candidate that survived Kelly sizing + risk policy, ignoring the LLM's approve/reject vote.

This isolates the LLM's marginal contribution. If LLM-filtered equity ≈ Kelly-only equity, the LLM isn't earning its keep and we should drop it. If the gap is meaningful and positive, the LLM is doing useful work.
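The Kelly-only comparison can be sketched as a replay over the same candidates with and without the LLM vote. This is a simplified illustration under stated assumptions: the Candidate fields are hypothetical, and each candidate's P&L is treated as a fixed amount rather than being re-sized as the bankroll changes, which a real Kelly replay would do.

```python
from dataclasses import dataclass
from itertools import accumulate
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    pnl: float          # settled P&L at the Kelly-sized stake (hypothetical field)
    llm_approved: bool  # the LLM's approve/reject vote


def equity_curve(start: float, pnls: Sequence[float]) -> List[float]:
    """Running bankroll, beginning at the starting value."""
    return list(accumulate(pnls, initial=start))


def llm_gap(
    start: float, candidates: Sequence[Candidate]
) -> Tuple[List[float], List[float], float]:
    """LLM-filtered curve, Kelly-only curve, and the final gap between them."""
    filtered = equity_curve(
        start, [c.pnl if c.llm_approved else 0.0 for c in candidates]
    )
    kelly_only = equity_curve(start, [c.pnl for c in candidates])
    return filtered, kelly_only, filtered[-1] - kelly_only[-1]


# Hypothetical candidates that all survived Kelly sizing and risk policy.
candidates = [
    Candidate(12.0, True),
    Candidate(-8.0, False),  # rejected by the LLM; only the shadow curve takes it
    Candidate(5.0, True),
    Candidate(-3.0, True),
]
filtered, kelly_only, gap = llm_gap(1000.0, candidates)
print(filtered)
print(kelly_only)
print(gap)
```

A positive gap means the rejected candidates lost money on balance; a gap near zero is the case where the filter adds nothing.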

Optimizer backends

Every decision is tagged with the optimizer that produced its sizing recommendation.

  • Classical — convex constrained optimizer running on CPU. The production default for stable, well-conditioned problems.
  • Quantum (CPU) — the same QUBO formulation we route to quantum hardware, solved on CPU using quantum-annealer-derived heuristics. Used when problem structure (sparse, block-diagonal, cardinality-constrained) makes the QUBO encoding the better path. The CPU path acts as a shadow / pre-QPU staging environment for the next item.
  • D-Wave QPU — direct calls to D-Wave Advantage / Advantage2 quantum annealers. Currently being calibrated ahead of production routing as part of the World Cup pilot.
  • Manual — an operator override (rare, audited).

Sample size caveats

  • CLV becomes meaningful at roughly n ≥ 30 settled decisions. Under that, treat it as suggestive, not diagnostic.
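  • Brier and calibration similarly need at least n ≥ 30 before their confidence intervals narrow usefully. With 10 calibration buckets, n = 30 averages only three decisions per bucket, so per-bucket rates stay noisy well beyond that.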
  • Shadow strategies inherit the variance of the underlying outcomes. Treat the equity curves as illustrative until the sample is large enough.
  • Reported numbers are not a performance claim. They are skill diagnostics on a small, company-funded research portfolio.
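To illustrate why small samples are only suggestive, the sketch below puts a normal-approximation 95% interval (mean ± 1.96 · s / √n) around a mean CLV. The interval method is an assumption chosen for simplicity, not a method stated on this page, and the sample values are invented; repeating the same six values five times is an artificial way to hold the mean fixed while n grows.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_with_interval(
    values: Sequence[float], z: float = 1.96
) -> Tuple[float, float, float]:
    """Sample mean with a normal-approximation interval: mean +/- z * s / sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate spread")
    centre = mean(values)
    half_width = z * stdev(values) / sqrt(n)
    return centre, centre - half_width, centre + half_width


# Hypothetical per-decision CLV values in percent.
small = [4.0, -6.0, 10.0, -2.0, 9.0, -3.0]  # n = 6
large = small * 5                           # n = 30, same mean by construction

for sample in (small, large):
    centre, low, high = mean_with_interval(sample)
    print(f"n={len(sample):2d}  mean={centre:.2f}%  interval=({low:.2f}%, {high:.2f}%)")
```

With the mean held fixed, the n = 30 interval is less than half as wide as the n = 6 one, which is the sense in which the smaller sample is suggestive rather than diagnostic.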

Open caveats

The things a serious reviewer will note by their absence. Stated here on purpose.

  • Slippage and market impact. Reported skill metrics assume best-bid / best-ask fills. On thin Polymarket order books, realized vs. paper edge can differ by 20–100%. We will publish realized-fill slippage once we have enough data.
  • Capacity. Position-size caps are calibrated for the current research portfolio. Scaling beyond the current bankroll would require revisiting venue depth and market-impact assumptions.
  • Regime sensitivity. Sports markets, political markets, crypto-milestone markets, and macro markets have different liquidity, sharpness, and predictability. Per-category performance will be reported once samples permit.
  • Selection bias. The engine only acts on events that pass minimum-edge and risk-policy filters. Reported accuracy is conditional on this selection.
  • Settlement risk. Polymarket resolutions are occasionally contested (notably during the 2024 election cycle). The risk policy does not yet model resolution-dispute risk explicitly.
  • Multiple testing. If we iterate model variants or risk policies during the pilot, displayed skill numbers are conditioned on the selected configuration. Deflated Sharpe ratio and probability-of-backtest-overfitting (PBO) checks will be applied before any public benchmark publication.