Boringly rigorous, not salesy
Most GTM systems fail because the feedback loop is noisy, laggy, and easy to fool. We treat revenue generation as an engineering problem—aiming for convergence, not luck.
Experimental Design
Multi-arm, multi-segment Bernoulli experiments with pre-defined sample sizes and stopping rules.
Statistical Power
Sample sizes calculated for defensible confidence at ~35% minimum detectable effect—not guessed.
Academic Foundation
Same rigor used in clinical trials and quantitative finance. Bayesian updating with honest error bars.
The Five Architectural Primitives
We break GTM down into five components that must be explicitly designed for statistical validity. Skip any one, and your conclusions are noise.
Unit of Inference
Response probability, not contracts
The fundamental error most organizations make is attempting to infer market fit from the contract layer. Contracts are sparse—0.1% conversion rates in typical B2B would require tens of thousands of samples for statistical validity.
We shift the estimand to response. At 3-5% baseline response rates, we get sufficient data volume for Bayesian updating within an operational window. Response is the biomarker of commercial health.
p_{i,m} = P(response | ICP_i, message_m)
Primary estimand
Portfolio Requirement
9-12 ICPs tested simultaneously
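For the statistically curious, here is a minimal Beta-Bernoulli sketch of how a single (ICP, message) cell could be updated; the prior and the counts are illustrative assumptions, not the production model.

```python
# Minimal Beta-Bernoulli sketch for one (ICP_i, message_m) cell.
# The prior and counts are illustrative, not the production model.
from scipy import stats

# Weakly informative prior centred near the 3-5% baseline response rate.
alpha_prior, beta_prior = 1, 24          # prior mean = 1 / 25 = 4%

# Hypothetical observed data for this cell.
contacts, responses = 700, 28            # 4% observed response rate

# Conjugate update: posterior over p_{i,m} = P(response | ICP_i, message_m).
posterior = stats.beta(alpha_prior + responses, beta_prior + contacts - responses)

print(f"posterior mean:        {posterior.mean():.2%}")
print(f"95% credible interval: {posterior.ppf(0.025):.2%} - {posterior.ppf(0.975):.2%}")
```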
Selection bias is the most common failure mode in GTM research. A team picks 2-3 segments they 'feel' are right, tests them, fails, and concludes the product has no market. This is scientifically invalid.
Testing fewer than 8 ICPs creates a high probability that the global maximum isn't even in your test set. With 10 segments, we can quickly identify laggards and reallocate to leaders—the essence of explore/exploit.
I ∈ [9, 12]
ICP range for robustness
Control Layer
A/B/C message variants per ICP
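A hedged sketch of that explore/exploit reallocation, using Thompson sampling over per-ICP Beta posteriors; the segment names, mid-sprint counts, and flat prior are hypothetical.

```python
# Thompson-sampling sketch of explore/exploit reallocation across ICP arms.
# Segment names, counts, and the Beta(1,1) prior are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mid-sprint state: ICP -> (responses, contacts sent so far).
arms = {
    "ICP-01": (30, 700),
    "ICP-02": (18, 700),
    "ICP-03": (42, 700),
}

def pick_next_arm(arms):
    """Draw once from each arm's Beta posterior and send the next batch to the best draw."""
    draws = {name: rng.beta(1 + r, 1 + n - r) for name, (r, n) in arms.items()}
    return max(draws, key=draws.get)

# Simulate 1,000 allocation decisions: laggards get starved, leaders absorb budget.
allocation = [pick_next_arm(arms) for _ in range(1_000)]
print({name: allocation.count(name) for name in arms})
```

Run repeatedly during a sprint, this is what "identify laggards and reallocate to leaders" looks like mechanically.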
A segment is not a monolith—it's a group reacting to a specific stimulus. If you send one message to a segment and it fails, you can't tell whether the segment is bad or the message is bad.
Three variants allow detection of nonlinear effects, protection against a bad control, and separation of signal from noise. By averaging across 3 distinct messages, we isolate true ICP quality from creative variance.
M = 3 variants (A/B/C)
Message conditioning
Power Layer
700 contacts per variant
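One way to read the averaging-across-variants step as code, assuming simple per-variant response counts; all numbers are illustrative.

```python
# Sketch: isolate segment quality from creative variance by averaging
# response rates across the A/B/C variants within one ICP.
# All counts are hypothetical.

variants = {        # variant -> (responses, contacts)
    "A": (14, 700),
    "B": (35, 700),
    "C": (30, 700),
}

rates = {v: r / n for v, (r, n) in variants.items()}
icp_quality = sum(rates.values()) / len(rates)        # pooled view of the segment
creative_spread = max(rates.values()) - min(rates.values())

for v, rate in rates.items():
    print(f"variant {v}: {rate:.1%}")
print(f"ICP quality (mean across messages): {icp_quality:.1%}")
print(f"creative spread (max - min):        {creative_spread:.1%}")
# A weak variant (A here) no longer condemns the whole segment.
```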
Most sales teams guess at sample sizes. This is statistically illiterate. We must calculate the sample size required to support posterior convergence—enough data to update beliefs with confidence.
At N=100, a 4% response rate yields only 4 responses with massive standard deviation. At N=700, we expect ~28 responses with narrow credible intervals—enough to distinguish between 4% and 6% segments.
N ≈ 700 per variant
Defensible confidence, 35% MDE
Entity Layer
Company-level clustering
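A back-of-envelope check of the N=100 versus N=700 comparison, using Beta posteriors under a flat prior as a stand-in for the full power calculation.

```python
# Back-of-envelope check: credible-interval width at N=100 vs N=700,
# both observing a 4% response rate, under a flat Beta(1,1) prior.
from scipy import stats

for n, responses in [(100, 4), (700, 28)]:
    post = stats.beta(1 + responses, 1 + n - responses)
    lo, hi = post.ppf(0.025), post.ppf(0.975)
    print(f"N={n:3d}: 95% credible interval {lo:.1%} - {hi:.1%} (width {hi - lo:.1%})")

# The N=100 interval is several times wider than the N=700 one, which is
# what makes a 4% segment distinguishable from a 6% one at the larger sample.
```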
Outbound is not IID at the company level. If you email 3 people at Microsoft, their responses are correlated—they share a boss, budget, and business context.
We treat the company as the cluster, adjusting for correlation with 2-3 contacts per account. This prevents 'contact bloat' where high-volume outreach hits the same companies repeatedly, skewing data.
Companies = People / 2.5
Clustering adjustment
The Core Calculations
From first principles to concrete numbers. This is the "Bill of Materials" for a statistically defensible GTM experiment.
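A sketch of the standard cluster design-effect adjustment this implies, assuming 2.5 contacts per company; the intra-company correlation value is an assumption for illustration.

```python
# Sketch: how company-level clustering shrinks the effective sample size.
# The intra-cluster correlation (icc) is an illustrative assumption.

contacts_per_variant = 700
contacts_per_company = 2.5      # 2-3 contacts per account
icc = 0.15                      # assumed correlation between colleagues' responses

# Kish design effect for clustered samples: DEFF = 1 + (m - 1) * ICC.
deff = 1 + (contacts_per_company - 1) * icc
effective_n = contacts_per_variant / deff
companies = contacts_per_variant / contacts_per_company

print(f"design effect:         {deff:.2f}")
print(f"effective sample size: {effective_n:.0f} of {contacts_per_variant} contacts")
print(f"companies touched:     {companies:.0f}")
```

The larger the within-company correlation, the more the error bars widen; treating companies as the cluster keeps them honest.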
People per ICP
M × N = 3 × 700 = 2,100
3 message variants × 700 contacts per variant gives stable estimation of response probability.
Total Exploration Set
I × M × N = 10 × 3 × 700 = 21,000
The total statistical power required to validate 10 distinct hypotheses simultaneously.
Target Companies
21,000 / 2.5 ≈ 8,400
Defensible range: 8,000-11,000 companies, critical for TAM validation before the sprint begins.
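The same bill of materials as a single worked calculation, using the parameters summarised in the table below.

```python
# Worked "Bill of Materials" for the exploration set, using the parameters
# summarised in the table below.

I = 10                         # ICPs under test
M = 3                          # message variants per ICP (A/B/C)
N = 700                        # contacts per variant
contacts_per_company = 2.5     # clustering adjustment

people_per_icp = M * N                                    # 2,100
total_contacts = I * M * N                                # 21,000
target_companies = total_contacts / contacts_per_company  # ~8,400
total_sequences = I * M                                   # 30

print(people_per_icp, total_contacts, round(target_companies), total_sequences)
```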
Experiment Parameters Summary
| Parameter | Value | Rationale |
|---|---|---|
| ICPs (I) | 10 | Portfolio approach, Bandit exploration logic |
| Message Variants (M) | 3 | A/B/C for nonlinear effect detection |
| Contacts per Variant (N) | 700 | Defensible confidence, 35% MDE, 3-5% baseline |
| Total Contacts | 21,000 | Statistically defensible exploration set |
| Total Companies | 8,000-11,000 | 2-3 contacts per company, clustering adjustment |
| Total Sequences | 30 | 10 ICPs × 3 variants each |
Why This is Academically Sound
- Separates signal from noise—we never discard a segment because of a bad subject line
- Avoids post-hoc cherry-picking—sample size defined ex-ante, not when we see a good result
- Supports Bayesian updating—large enough samples for meaningful posterior distributions
- Respects independence assumptions—company-level clustering ensures honest error bars
Academic Description
"A multi-arm, multi-segment Bernoulli experiment with sufficient sample sizes to support posterior convergence and cross-segment comparison."
This design relies on first principles of experimental design rather than sales heuristics—the same rigor used in clinical trials, high-frequency trading, and A/B testing at scale.
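As a minimal illustration of that claim, the simulation below runs the multi-arm, multi-segment Bernoulli design with invented true response rates and shows the kind of cross-segment separation the sample sizes are meant to deliver.

```python
# Simulate the multi-arm, multi-segment Bernoulli design end to end.
# The true response rates below are invented for the demo.
import numpy as np

rng = np.random.default_rng(42)
N = 700                                    # contacts per variant

true_rates = {                             # ICP -> true P(response) for variants A/B/C
    "ICP-A": [0.035, 0.045, 0.040],
    "ICP-B": [0.055, 0.065, 0.060],
}

for icp, rates in true_rates.items():
    # Bernoulli outcomes per variant, pooled into one Beta posterior per ICP.
    responses = sum(int(rng.binomial(N, p)) for p in rates)
    contacts = N * len(rates)
    post_mean = (1 + responses) / (2 + contacts)
    print(f"{icp}: posterior mean {post_mean:.2%} from {responses}/{contacts} contacts")

# With 2,100 contacts per ICP, the ~4% and ~6% segments separate by several
# posterior standard deviations; a cross-segment comparison you can defend.
```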
The Difference in Practice
Most GTM approaches confuse the illusion of rigor with actual rigor. Here's what separates statistical validation from guesswork.
Traditional GTM Research
"Let's test 2-3 ICPs we feel are right"
Selection bias—high probability your best segment isn't even tested
"We sent 100 emails and got 2 replies"
Statistically illiterate—credible interval so wide it tells you nothing
"This message didn't work, so this ICP is bad"
Confounds variables—can't distinguish market signal from creative noise
"We closed a deal, so this segment works"
False positive risk—one bluebird deal proves nothing repeatable
"Let's see how Q4 goes before deciding"
Quarterly reviews are autopsies—market drift makes old data obsolete
Celerio Statistical Validation
Test 10+ ICPs with portfolio diversification
Bandit logic ensures your winner is actually in the test set
700 contacts per variant, calculated ex-ante
Narrow credible intervals that support real decisions
A/B/C variants separate ICP from message
Averaging across messages isolates true segment quality
Response probability as primary estimand
Leading indicators with sufficient data volume for inference
2-week sprint with pre-defined stopping rules
Posterior convergence while the game is still being played
The questions boards and investors actually care about
Which segment works? Is this real signal or noise? Are we scaling a winner or amplifying variance? Most teams answer with gut feel and cherry-picked stories. We answer with math.
Ready to validate with rigor?
Stop guessing which segments work. Get statistically defensible answers within 2 weeks.
Discuss your GTM challenges
30-minute discovery call to assess if our methodology fits your stage, TAM, and timeline.
Get early platform access
Join the waitlist for self-serve ICP hypothesis generation and validation dashboard.