Experiments & A/B Testing

Run controlled experiments with variants, traffic allocation, statistical analysis, and guardrail metrics.
Flaggr's experiment system lets you run controlled A/B tests with configurable variants, traffic allocation, metric tracking, and statistical analysis. Experiments are built on top of feature flags — each experiment is backed by a flag that controls variant assignment.
The experiments API is under active development. This guide documents the data model and planned API surface. The core evaluation and variant assignment features are available today through feature flags with variants and rollout percentages.
How Experiments Work
An experiment extends a feature flag with:
- Variants with explicit weights (traffic allocation)
- A control group for comparison
- Metrics to measure success (conversion, revenue, engagement)
- Statistical analysis to determine significance
- Guardrail metrics that auto-pause if key indicators degrade
```
┌────────────────────────────────────────────────┐
│ Experiment: "Shorter Signup Flow"              │
│                                                │
│ Hypothesis: Reducing signup steps from 4 to 2  │
│ will increase conversion by 15%                │
│                                                │
│   ┌──────────────┐       ┌──────────────┐      │
│   │ Control (50%)│       │Treatment(50%)│      │
│   │ 4-step flow  │       │ 2-step flow  │      │
│   └──────────────┘       └──────────────┘      │
│                                                │
│ Primary metric: signup_completed (increase)    │
│ Guardrail: error_rate < 5% (pause if exceeded) │
└────────────────────────────────────────────────┘
```
Experiment Data Model
Experiment
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique experiment identifier |
| `projectId` | string | Project this experiment belongs to |
| `serviceId` | string | Service that evaluates the flag |
| `name` | string | Human-readable experiment name |
| `description` | string | What you're testing and why |
| `hypothesis` | string | Your predicted outcome |
| `status` | string | `draft`, `running`, `paused`, `completed`, `archived` |
| `flagKey` | string | The feature flag that controls variant assignment |
| `variants` | array | Variant definitions with weights |
| `trafficAllocation` | number | Percentage of total traffic in the experiment (0-100) |
| `primaryMetric` | object | The main metric you're trying to improve |
| `secondaryMetrics` | array | Additional metrics to track |
| `guardrailMetrics` | array | Metrics that trigger auto-pause if they degrade |
| `statisticalConfig` | object | Statistical analysis configuration |
| `minimumSampleSize` | number | Minimum users before results are significant |
| `maximumDurationDays` | number | Auto-stop after this many days |
| `autoStopOnSignificance` | boolean | Stop early when results reach significance |
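If it helps to see the whole model in one place, here is a rough TypeScript sketch of the fields above. These types are illustrative only, assembled from the tables in this guide; they are not the shipped SDK or API types.

```typescript
// Illustrative types assembled from the tables in this guide; not shipped API types.
type ExperimentStatus = "draft" | "running" | "paused" | "completed" | "archived";
type MetricType = "conversion" | "revenue" | "count" | "duration" | "custom";
type Direction = "increase" | "decrease";

interface Variant {
  name: string;
  value: unknown;     // the flag value served to this variant
  weight: number;     // traffic percentage; weights sum to 100
  isControl: boolean;
}

interface Metric {
  id: string;
  name: string;
  type: MetricType;
  eventName: string;
  direction: Direction;
  filters?: Array<{ property: string; operator: string; value: unknown }>;
}

interface GuardrailMetric {
  metric: Metric;
  threshold: number;
  action: "alert" | "pause" | "stop";
}

interface StatisticalConfig {
  confidenceLevel: number;          // e.g. 0.95
  minimumDetectableEffect: number;  // e.g. 0.05 = 5%
  method: "frequentist" | "bayesian";
  correctionMethod: "bonferroni" | "benjamini-hochberg";
}

interface Experiment {
  id: string;
  projectId: string;
  serviceId: string;
  name: string;
  description: string;
  hypothesis: string;
  status: ExperimentStatus;
  flagKey: string;
  variants: Variant[];
  trafficAllocation: number;        // 0-100, share of traffic entering the experiment
  primaryMetric: Metric;
  secondaryMetrics: Metric[];
  guardrailMetrics: GuardrailMetric[];
  statisticalConfig: StatisticalConfig;
  minimumSampleSize: number;
  maximumDurationDays: number;
  autoStopOnSignificance: boolean;
}
```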
Experiment Lifecycle
```
┌───────┐   start   ┌─────────┐   significance   ┌───────────┐
│ Draft │ ────────> │ Running │ ───────────────> │ Completed │
└───────┘           └─────────┘                  └───────────┘
                      │  ▲                             │
                pause │  │ resume              archive │
                      ▼  │                             ▼
                  ┌────────┐                    ┌──────────┐
                  │ Paused │                    │ Archived │
                  └────────┘                    └──────────┘
```
| Status | Description |
|---|---|
| `draft` | Experiment designed but not started. Variants and metrics can be edited. |
| `running` | Actively assigning users to variants and collecting metrics. |
| `paused` | Temporarily halted. Existing assignments are preserved. |
| `completed` | Results are in. Winner determined or no significant difference. |
| `archived` | Historical record. Results preserved, experiment inactive. |
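As a rough guide, the transitions in the diagram can be expressed as a lookup table. This is a sketch of the lifecycle as documented here, not a guarantee of which transitions the API will accept.

```typescript
// Status transitions implied by the lifecycle diagram; illustrative only.
const allowedTransitions: Record<string, string[]> = {
  draft: ["running"],               // start
  running: ["paused", "completed"], // pause, or significance / duration reached
  paused: ["running"],              // resume
  completed: ["archived"],          // archive
  archived: [],                     // terminal
};

function canTransition(from: string, to: string): boolean {
  return (allowedTransitions[from] ?? []).includes(to);
}

canTransition("draft", "running");   // true
canTransition("paused", "archived"); // false, per the diagram above
```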
Variants
Each experiment has two or more variants. One must be marked as the control.
```json
{
"variants": [
{
"name": "control",
"value": false,
"weight": 50,
"isControl": true
},
{
"name": "treatment-a",
"value": true,
"weight": 25,
"isControl": false
},
{
"name": "treatment-b",
"value": "v3",
"weight": 25,
"isControl": false
}
]
}
```

| Field | Type | Description |
|---|---|---|
| `name` | string | Variant identifier (e.g., `control`, `treatment-a`) |
| `value` | any | The flag value served to users in this variant |
| `weight` | number | Traffic percentage (all weights must sum to 100) |
| `isControl` | boolean | Whether this is the control/baseline variant |
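Before starting an experiment it is worth checking the variant list locally. The helper below is hypothetical (not part of the Flaggr SDK); it simply encodes the constraints stated above: at least two variants, weights summing to 100, and exactly one control.

```typescript
interface Variant {
  name: string;
  value: unknown;
  weight: number;
  isControl: boolean;
}

// Hypothetical client-side check for the constraints described above.
function validateVariants(variants: Variant[]): string[] {
  const errors: string[] = [];
  if (variants.length < 2) {
    errors.push("an experiment needs at least two variants");
  }
  const totalWeight = variants.reduce((sum, v) => sum + v.weight, 0);
  if (totalWeight !== 100) {
    errors.push(`variant weights sum to ${totalWeight}, expected 100`);
  }
  const controls = variants.filter((v) => v.isControl).length;
  if (controls !== 1) {
    errors.push(`expected exactly one control variant, found ${controls}`);
  }
  return errors;
}
```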
Traffic Allocation
The `trafficAllocation` field controls what percentage of total traffic enters the experiment. The remaining traffic gets the flag's default value.
```
Total traffic: 100%
├── In experiment (trafficAllocation: 80%)
│   ├── Control:   50% of 80% = 40% of total
│   └── Treatment: 50% of 80% = 40% of total
└── Not in experiment: 20% of total (gets default value)
```
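The arithmetic in the diagram generalizes: a variant's share of total traffic is its weight applied within the experiment slice. A quick sketch:

```typescript
// Share of *total* traffic that sees a given variant, as a percentage.
function effectiveTrafficShare(trafficAllocation: number, variantWeight: number): number {
  return (trafficAllocation / 100) * variantWeight;
}

effectiveTrafficShare(80, 50); // => 40, matching the diagram above
effectiveTrafficShare(80, 25); // => 20
```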
Metrics
Metric Types
| Type | Description | Example |
|---|---|---|
| `conversion` | Binary outcome (did/didn't) | Signup completed, purchase made |
| `revenue` | Monetary value | Order total, subscription value |
| `count` | Frequency count | Page views, API calls, clicks |
| `duration` | Time measurement | Session length, time to checkout |
| `custom` | Custom numeric metric | NPS score, engagement index |
Defining Metrics
```json
{
"primaryMetric": {
"id": "signup-conversion",
"name": "Signup Conversion Rate",
"type": "conversion",
"eventName": "signup_completed",
"direction": "increase",
"filters": [
{ "property": "source", "operator": "equals", "value": "organic" }
]
},
"secondaryMetrics": [
{
"id": "time-to-signup",
"name": "Time to Signup",
"type": "duration",
"eventName": "signup_completed",
"direction": "decrease"
},
{
"id": "page-views",
"name": "Pages Viewed During Signup",
"type": "count",
"eventName": "page_view",
"direction": "decrease"
}
]
}
```

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique metric identifier |
| `name` | string | Human-readable metric name |
| `type` | string | `conversion`, `revenue`, `count`, `duration`, `custom` |
| `eventName` | string | Event name to track (matches your analytics events) |
| `direction` | string | `increase` (higher is better) or `decrease` (lower is better) |
| `filters` | array | Optional event filters to narrow the metric scope |
Guardrail Metrics
Guardrail metrics protect against unexpected negative effects. If a guardrail threshold is breached, the experiment automatically takes the configured action (alert, pause, or stop):
```json
{
"guardrailMetrics": [
{
"metric": {
"id": "error-rate",
"name": "Error Rate",
"type": "conversion",
"eventName": "error_occurred",
"direction": "decrease"
},
"threshold": 0.05,
"action": "pause"
},
{
"metric": {
"id": "page-load",
"name": "Page Load Time",
"type": "duration",
"eventName": "page_loaded",
"direction": "decrease"
},
"threshold": 3000,
"action": "alert"
}
]
}
```

| Action | Behavior |
|---|---|
| `alert` | Send notification, experiment continues |
| `pause` | Pause the experiment for manual review |
| `stop` | Stop the experiment immediately |
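The examples above imply a simple breach rule: for a metric whose desired direction is `decrease` (error rate, page load time), the guardrail fires when the observed value rises above the threshold, and vice versa for `increase`. The sketch below encodes that reading; it is an assumption about the semantics, not Flaggr's internal implementation.

```typescript
type GuardrailAction = "alert" | "pause" | "stop";

interface Guardrail {
  direction: "increase" | "decrease"; // the direction the metric is supposed to move
  threshold: number;                  // breach level, in the metric's own units
  action: GuardrailAction;
}

// Assumed breach semantics: a "decrease" metric breaches when it goes above
// the threshold; an "increase" metric breaches when it falls below it.
function guardrailBreached(guardrail: Guardrail, observed: number): boolean {
  return guardrail.direction === "decrease"
    ? observed > guardrail.threshold
    : observed < guardrail.threshold;
}

// Error rate observed at 6% against the 5% threshold from the example above.
guardrailBreached({ direction: "decrease", threshold: 0.05, action: "pause" }, 0.06); // => true
```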
Statistical Configuration
Configure how results are analyzed:
```json
{
"statisticalConfig": {
"confidenceLevel": 0.95,
"minimumDetectableEffect": 0.05,
"method": "frequentist",
"correctionMethod": "bonferroni"
}
}
```

| Field | Type | Description |
|---|---|---|
| `confidenceLevel` | number | Required confidence level (0.0–1.0, typically 0.95) |
| `minimumDetectableEffect` | number | Smallest effect size worth detecting (e.g., 0.05 = 5%) |
| `method` | string | `frequentist` (p-values) or `bayesian` (posterior probabilities) |
| `correctionMethod` | string | Multiple comparison correction: `bonferroni` or `benjamini-hochberg` |
Frequentist vs Bayesian
| Method | Best For | Output |
|---|---|---|
| Frequentist | Classic hypothesis testing, clear yes/no decisions | p-value, confidence interval |
| Bayesian | Continuous monitoring, "probability of being best" | Posterior probability |
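To make the frequentist output concrete, here is a minimal two-proportion z-test for a conversion metric. It reproduces the kind of numbers shown in the Results section below; it is a statistics illustration, not Flaggr's analysis engine.

```typescript
// Minimal frequentist analysis for a conversion metric: two-proportion z-test.
function twoProportionZTest(
  controlConversions: number, controlSample: number,
  treatmentConversions: number, treatmentSample: number,
) {
  const p1 = controlConversions / controlSample;
  const p2 = treatmentConversions / treatmentSample;
  const pooled = (controlConversions + treatmentConversions) / (controlSample + treatmentSample);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / controlSample + 1 / treatmentSample));
  const z = (p2 - p1) / se;
  const pValue = 2 * (1 - standardNormalCdf(Math.abs(z))); // two-sided
  return { z, pValue, relativeEffect: (p2 - p1) / p1 };
}

// Polynomial approximation of the standard normal CDF (Abramowitz & Stegun 26.2.17).
function standardNormalCdf(x: number): number {
  const t = 1 / (1 + 0.2316419 * x);
  const d = Math.exp((-x * x) / 2) / Math.sqrt(2 * Math.PI);
  const poly =
    t * (0.31938153 + t * (-0.356563782 + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
  return 1 - d * poly;
}

// 12% vs 15% conversion on 5,000 users per variant, as in the Results example below:
twoProportionZTest(600, 5000, 750, 5000);
// => relativeEffect 0.25, p-value far below 0.05 at a 0.95 confidence level
```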
Multiple Comparison Correction
When testing multiple variants or metrics, correction methods prevent false positives. For example, with a 0.95 confidence level (α = 0.05) and three comparisons, Bonferroni requires each p-value to clear 0.05 / 3 ≈ 0.0167:
| Method | Approach |
|---|---|
| Bonferroni | Conservative — divides significance level by number of comparisons |
| Benjamini-Hochberg | Less conservative — controls false discovery rate |
Results
When an experiment completes, results include per-variant analysis:
```json
{
"results": {
"totalSampleSize": 10000,
"variantResults": [
{
"variantName": "control",
"sampleSize": 5000,
"metrics": [
{
"metricId": "signup-conversion",
"value": 0.12,
"confidenceInterval": [0.11, 0.13],
"pValue": null,
"relativeEffect": 0,
"isSignificant": false
}
]
},
{
"variantName": "treatment-a",
"sampleSize": 5000,
"metrics": [
{
"metricId": "signup-conversion",
"value": 0.15,
"confidenceInterval": [0.14, 0.16],
"pValue": 0.001,
"relativeEffect": 0.25,
"isSignificant": true
}
]
}
],
"winner": "treatment-a",
"recommendation": "ship_winner",
"lastUpdated": "2026-02-21T12:00:00.000Z"
}
}
```

Recommendations
| Recommendation | Meaning |
|---|---|
| `ship_winner` | Clear winner found — ship the winning variant |
| `extend_experiment` | Not enough data yet — continue collecting |
| `no_significant_difference` | No meaningful difference between variants |
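The recommendation roughly follows from sample size and significance. The decision sketch below is a simplification of what the table describes; the real analysis also takes guardrails, `confidenceLevel`, and `maximumDurationDays` into account.

```typescript
type Recommendation = "ship_winner" | "extend_experiment" | "no_significant_difference";

// Simplified decision logic implied by the table above; illustrative only.
function recommend(input: {
  totalSampleSize: number;
  minimumSampleSize: number;
  winner: string | null; // variant with a significant win on the primary metric
}): Recommendation {
  if (input.totalSampleSize < input.minimumSampleSize) return "extend_experiment";
  if (input.winner) return "ship_winner";
  return "no_significant_difference";
}

recommend({ totalSampleSize: 10000, minimumSampleSize: 5000, winner: "treatment-a" });
// => "ship_winner", matching the example results above
```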
Running Experiments with Flags Today
While the full experiments API is in development, you can run A/B tests using feature flags with variants:
```bash
# Create a flag with variants for an A/B test
curl -X POST /api/flags \
-H "Authorization: Bearer flg_your_token" \
-H "Content-Type: application/json" \
-d '{
"key": "exp-shorter-signup",
"name": "Shorter Signup Experiment",
"type": "string",
"enabled": true,
"defaultValue": "control",
"serviceId": "web-app",
"environment": "production",
"variants": [
{ "name": "control", "value": "control", "weight": 50 },
{ "name": "treatment", "value": "treatment", "weight": 50 }
],
"tags": ["experiment"]
}'
```

```js
// In your application
const variant = await client.getStringValue("exp-shorter-signup", "control", {
targetingKey: userId,
});
if (variant === "treatment") {
renderShortSignup();
} else {
renderFullSignup();
}
// Track metrics in your analytics system
analytics.track("signup_completed", { variant, userId });
```

Variant assignment is deterministic — the same `targetingKey` always gets the same variant via consistent hashing.
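If you are curious how deterministic assignment can work, the sketch below shows the usual approach: hash the targeting key together with the flag key into a 0-99 bucket and walk the cumulative variant weights. The hash function here (FNV-1a) is only an example; Flaggr's actual hashing is an implementation detail covered in Advanced Evaluation.

```typescript
// Sketch of weight-based bucketing with a stable hash; illustrative only.
function assignVariant(
  flagKey: string,
  targetingKey: string,
  variants: Array<{ name: string; weight: number }>,
): string {
  // FNV-1a over "flagKey:targetingKey" gives a stable 32-bit hash.
  let hash = 2166136261;
  for (const char of `${flagKey}:${targetingKey}`) {
    hash ^= char.charCodeAt(0);
    hash = Math.imul(hash, 16777619);
  }
  const bucket = (hash >>> 0) % 100; // 0-99

  let cumulative = 0;
  for (const variant of variants) {
    cumulative += variant.weight;
    if (bucket < cumulative) return variant.name;
  }
  return variants[variants.length - 1].name; // weights should sum to 100
}

// The same user always lands in the same bucket, and therefore the same variant.
assignVariant("exp-shorter-signup", "user-123", [
  { name: "control", weight: 50 },
  { name: "treatment", weight: 50 },
]);
```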
Related
- Concepts — Core data model and evaluation flow
- Targeting Rules — Conditions and operators for targeting
- Progressive Rollouts — Staged deployment for non-experiment rollouts
- Mutual Exclusion — Prevent experiment interference
- Advanced Evaluation — Consistent hashing for deterministic assignment