Methodology

Boringly rigorous, not salesy

Most GTM systems fail because the feedback loop is noisy, laggy, and easy to fool. We treat revenue generation as an engineering problem—aiming for convergence, not luck.

Experimental Design

Multi-arm, multi-segment Bernoulli experiments with pre-defined sample sizes and stopping rules.

Statistical Power

Sample sizes calculated for defensible confidence at ~35% minimum detectable effect—not guessed.

Academic Foundation

Same rigor used in clinical trials and quantitative finance. Bayesian updating with honest error bars.

The Five Architectural Primitives

We break GTM down into five components that must be explicitly designed for statistical validity. Skip any one, and your conclusions are noise.

01

Unit of Inference

Response probability, not contracts

The fundamental error most organizations make is attempting to infer market fit from the contract layer. Contracts are sparse—0.1% conversion rates in typical B2B would require tens of thousands of samples for statistical validity.

We shift the estimand to response. At 3-5% baseline response rates, we get sufficient data volume for Bayesian updating within an operational window. Response is the biomarker of commercial health.

p_{i,m} = P(response | ICP_i, message_m)
Primary estimand
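
As a minimal sketch (not our production model), the update behind this estimand can be expressed as a Beta-Binomial conjugate step; the uniform prior and the 28/700 counts below are illustrative assumptions, not campaign data.

```python
from scipy.stats import beta

# Beta-Binomial update for p_{i,m} = P(response | ICP_i, message_m)
# Illustrative counts: 700 contacts, 28 responses (~4% observed rate)
prior_a, prior_b = 1, 1            # uniform Beta(1, 1) prior -- an assumption
contacts, responses = 700, 28

post_a = prior_a + responses
post_b = prior_b + contacts - responses

posterior_mean = post_a / (post_a + post_b)
lo, hi = beta.ppf([0.025, 0.975], post_a, post_b)
print(f"posterior mean {posterior_mean:.3f}, 95% credible interval [{lo:.3f}, {hi:.3f}]")
```
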
02

Portfolio Requirement

9-12 ICPs tested simultaneously

Selection bias is the most common failure mode in GTM research. A team picks 2-3 segments they 'feel' are right, tests them, fails, and concludes the product has no market. This is scientifically invalid.

Testing fewer than 8 ICPs creates a high probability that the global maximum isn't even in your test set. With 10 segments, we can quickly identify laggards and reallocate to leaders: the essence of explore/exploit.

I ∈ [9, 12]
ICP range for robustness
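
One way to run that explore/exploit reallocation is Thompson sampling over the ICP arms. The sketch below assumes hypothetical response counts and uniform priors; it only shows the mechanic of shifting volume toward leaders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Thompson sampling over 10 ICP arms (hypothetical counts, uniform Beta(1, 1) priors)
icps = [f"ICP_{i}" for i in range(1, 11)]
sent = np.full(10, 300)                                    # contacts sent per ICP so far
responses = np.array([6, 14, 9, 4, 11, 7, 16, 5, 8, 10])   # hypothetical replies

# Draw one plausible response rate per ICP from its posterior,
# then allocate the next batch to the ICP with the highest draw.
draws = rng.beta(1 + responses, 1 + sent - responses)
print("allocate next batch to:", icps[int(np.argmax(draws))])
```
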
03

Control Layer

A/B/C message variants per ICP

A segment is not a monolith—it's a group reacting to a specific stimulus. If you send one message to a segment and it fails, you don't know if the segment is bad or the message was bad.

Three variants allow detection of nonlinear effects, protection against a bad control, and separation of signal from noise. By averaging across 3 distinct messages, we isolate true ICP quality from creative variance.

M = 3 variants (A/B/C)
Message conditioning
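
A minimal way to separate segment quality from creative variance is to score each ICP by the average of its per-variant posterior means. The counts below are hypothetical; the point is that one weak message does not sink the segment.

```python
import numpy as np

# Hypothetical results for one ICP: 700 contacts per variant (A / B / C)
sent = np.array([700, 700, 700])
responses = np.array([31, 12, 27])     # variant B underperforms (e.g. a weak subject line)

# Per-variant posterior means under a Beta(1, 1) prior (assumption)
variant_rates = (1 + responses) / (2 + sent)

# ICP quality = average across the three messages, insulating the segment
# verdict from a single bad creative
icp_score = variant_rates.mean()
print(variant_rates.round(4), "-> ICP score:", round(icp_score, 4))
```
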
04

Power Layer

700 contacts per variant

Most sales teams guess at sample sizes. This is statistically illiterate. We must calculate the sample size required to support posterior convergence—enough data to update beliefs with confidence.

At N=100, a 4% response rate yields only ~4 responses, with a standard error so large the estimate is useless. At N=700, we expect ~28 responses with narrow credible intervals, enough to distinguish a 4% segment from a 6% one.

N ≈ 700 per variant
Defensible confidence, 35% MDE
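
A quick way to sanity-check that claim is to compare the Beta posteriors implied by the expected counts at N = 700 (28/700 vs 42/700). The priors and counts here are assumptions for illustration, not a power analysis of a real campaign.

```python
import numpy as np

rng = np.random.default_rng(1)

# Can N = 700 per variant separate a 4% segment from a 6% segment?
# Posterior draws under Beta(1, 1) priors, using the expected response counts.
n = 700
p_low  = rng.beta(1 + 28, 1 + n - 28, size=100_000)   # ~4% segment
p_high = rng.beta(1 + 42, 1 + n - 42, size=100_000)   # ~6% segment

# Probability that the higher-rate segment is truly better, given the data
print("P(6% segment > 4% segment):", round((p_high > p_low).mean(), 3))   # roughly 0.95
```
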
05

Entity Layer

Company-level clustering

Outbound responses are not IID at the company level. If you email 3 people at Microsoft, their responses are correlated—they share a boss, budget, and business context.

We treat the company as the cluster, adjusting for correlation across the 2-3 contacts per account. This prevents 'contact bloat', where high-volume outreach hits the same companies repeatedly and skews the data.

Companies = People / 2.5
Clustering adjustment
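
In effect this is a design-effect correction. The sketch below assumes 2.5 contacts per company and an intra-company correlation of 0.15; both numbers are placeholders (the correlation would be estimated from data in practice).

```python
# Design-effect adjustment for company-level clustering
contacts_per_variant = 700
contacts_per_company = 2.5     # average of the 2-3 contacts per account
icc = 0.15                     # assumed intra-company correlation; estimated from data in practice

design_effect = 1 + (contacts_per_company - 1) * icc
effective_n = contacts_per_variant / design_effect
companies_per_variant = contacts_per_variant / contacts_per_company

print(f"design effect {design_effect:.3f}, "
      f"effective N ~ {effective_n:.0f}, "
      f"companies per variant ~ {companies_per_variant:.0f}")
```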

The Core Calculations

From first principles to concrete numbers. This is the "Bill of Materials" for a statistically defensible GTM experiment.

Step 1

People per ICP

M × N = 3 × 700 = 2,100

Three message variants at 700 contacts each give a stable estimate of the response probability for one ICP.

Step 2

Total Exploration Set

I × M × N = 10 × 3 × 700 = 21,000

The total contact volume required to validate 10 distinct ICP hypotheses simultaneously with adequate statistical power.

Step 3

Target Companies

21,000 / 2.5 ≈ 8,400

Defensible range: 8,000-11,000 companies, a figure critical for TAM validation before the sprint begins.
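
The same bill of materials, written out as a few lines of arithmetic (parameters taken from the steps above):

```python
# Bill of materials for the exploration sprint
I, M, N = 10, 3, 700                      # ICPs, message variants, contacts per variant
contacts_per_company = 2.5

people_per_icp = M * N                    # 2,100
total_contacts = I * M * N                # 21,000
total_companies = total_contacts / contacts_per_company   # ~8,400
total_sequences = I * M                   # 30

print(people_per_icp, total_contacts, round(total_companies), total_sequences)
```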

Experiment Parameters Summary

Parameter | Value | Rationale
ICPs (I) | 10 | Portfolio approach, bandit exploration logic
Message Variants (M) | 3 | A/B/C for nonlinear effect detection
Contacts per Variant (N) | 700 | Defensible confidence, 35% MDE, 3-5% baseline
Total Contacts | 21,000 | Statistically defensible exploration set
Total Companies | 8,000-11,000 | 2-3 contacts per company, clustering adjustment
Total Sequences | 30 | 10 ICPs × 3 variants each

Why This is Academically Sound

  • Separates signal from noise—we never discard a segment because of a bad subject line
  • Avoids post-hoc cherry-picking—sample size defined ex-ante, not when we see a good result
  • Supports Bayesian updating—large enough samples for meaningful posterior distributions
  • Respects independence assumptions—company-level clustering ensures honest error bars

Academic Description

"A multi-arm, multi-segment Bernoulli experiment with sufficient sample sizes to support posterior convergence and cross-segment comparison."

This design relies on first principles of experimental design rather than sales heuristics—the same rigor used in clinical trials, high-frequency trading, and A/B testing at scale.

The Difference in Practice

Most GTM approaches mistake the illusion of rigor for actual rigor. Here's what separates statistical validation from guesswork.

Traditional GTM Research

"Let's test 2-3 ICPs we feel are right"

Selection bias—high probability your best segment isn't even tested

"We sent 100 emails and got 2 replies"

Statistically illiterate—credible interval so wide it tells you nothing

"This message didn't work, so this ICP is bad"

Confounds variables—can't distinguish market signal from creative noise

"We closed a deal, so this segment works"

False positive risk—one bluebird deal proves nothing repeatable

"Let's see how Q4 goes before deciding"

Quarterly reviews are autopsies—market drift makes old data obsolete

Celerio Statistical Validation

Test 10+ ICPs with portfolio diversification

Bandit logic ensures your winner is actually in the test set

700 contacts per variant, calculated ex-ante

Narrow credible intervals that support real decisions

A/B/C variants separate ICP from message

Averaging across messages isolates true segment quality

Response probability as primary estimand

Leading indicators with sufficient data volume for inference

2-week sprint with pre-defined stopping rules

Posterior convergence while the game is still being played

The questions boards and investors actually care about

Which segment works? Is this real signal or noise? Are we scaling a winner or amplifying variance? Most teams answer with gut feel and cherry-picked stories. We answer with math.

80% Power · 35% MDE · 2-week Sprint

Ready to validate with rigor?

Stop guessing which segments work. Get statistically defensible answers within 2 weeks.

Discuss your GTM challenges

30-minute discovery call to assess if our methodology fits your stage, TAM, and timeline.

Get early platform access

Join the waitlist for self-serve ICP hypothesis generation and validation dashboard.
