Experiments & A/B Testing

Run controlled experiments with variants, traffic allocation, statistical analysis, and guardrail metrics.
Flaggr's experiment system lets you run controlled A/B tests with configurable variants, traffic allocation, metric tracking, and statistical analysis. Experiments are built on top of feature flags — each experiment is backed by a flag that controls variant assignment.
The experiments API is under active development. This guide documents the data model and planned API surface. The core evaluation and variant assignment features are available today through feature flags with variants and rollout percentages.
How Experiments Work
An experiment extends a feature flag with:
- Variants with explicit weights (traffic allocation)
- A control group for comparison
- Metrics to measure success (conversion, revenue, engagement)
- Statistical analysis to determine significance
- Guardrail metrics that auto-pause if key indicators degrade
```
┌────────────────────────────────────────────────┐
│ Experiment: "Shorter Signup Flow"              │
│                                                │
│ Hypothesis: Reducing signup steps from 4 to 2  │
│ will increase conversion by 15%                │
│                                                │
│   ┌──────────────┐       ┌──────────────┐      │
│   │ Control (50%)│       │Treatment(50%)│      │
│   │ 4-step flow  │       │ 2-step flow  │      │
│   └──────────────┘       └──────────────┘      │
│                                                │
│ Primary metric: signup_completed (increase)    │
│ Guardrail: error_rate < 5% (pause if exceeded) │
└────────────────────────────────────────────────┘
```
Experiment Data Model
Experiment
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique experiment identifier |
| `projectId` | string | Project this experiment belongs to |
| `serviceId` | string | Service that evaluates the flag |
| `name` | string | Human-readable experiment name |
| `description` | string | What you're testing and why |
| `hypothesis` | string | Your predicted outcome |
| `status` | string | `draft`, `running`, `paused`, `completed`, `archived` |
| `flagKey` | string | The feature flag that controls variant assignment |
| `variants` | array | Variant definitions with weights |
| `trafficAllocation` | number | Percentage of total traffic in the experiment (0-100) |
| `primaryMetric` | object | The main metric you're trying to improve |
| `secondaryMetrics` | array | Additional metrics to track |
| `guardrailMetrics` | array | Metrics that trigger auto-pause if they degrade |
| `statisticalConfig` | object | Statistical analysis configuration |
| `minimumSampleSize` | number | Minimum users before results are significant |
| `maximumDurationDays` | number | Auto-stop after this many days |
| `autoStopOnSignificance` | boolean | Stop early when results reach significance |
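If it helps to see the whole model in one place, here is a rough TypeScript sketch of the fields above. These types are illustrative only, assembled from the tables in this guide; they are not the shipped SDK or API types.

```typescript
// Illustrative types assembled from the tables in this guide; not shipped API types.
type ExperimentStatus = "draft" | "running" | "paused" | "completed" | "archived";
type MetricType = "conversion" | "revenue" | "count" | "duration" | "custom";
type Direction = "increase" | "decrease";

interface Variant {
  name: string;
  value: unknown;     // the flag value served to this variant
  weight: number;     // traffic percentage; weights sum to 100
  isControl: boolean;
}

interface Metric {
  id: string;
  name: string;
  type: MetricType;
  eventName: string;
  direction: Direction;
  filters?: Array<{ property: string; operator: string; value: unknown }>;
}

interface GuardrailMetric {
  metric: Metric;
  threshold: number;
  action: "alert" | "pause" | "stop";
}

interface StatisticalConfig {
  confidenceLevel: number;          // e.g. 0.95
  minimumDetectableEffect: number;  // e.g. 0.05 = 5%
  method: "frequentist" | "bayesian";
  correctionMethod: "bonferroni" | "benjamini-hochberg";
}

interface Experiment {
  id: string;
  projectId: string;
  serviceId: string;
  name: string;
  description: string;
  hypothesis: string;
  status: ExperimentStatus;
  flagKey: string;
  variants: Variant[];
  trafficAllocation: number;        // 0-100, share of traffic entering the experiment
  primaryMetric: Metric;
  secondaryMetrics: Metric[];
  guardrailMetrics: GuardrailMetric[];
  statisticalConfig: StatisticalConfig;
  minimumSampleSize: number;
  maximumDurationDays: number;
  autoStopOnSignificance: boolean;
}
```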
Experiment Lifecycle
```
┌───────┐   start   ┌─────────┐   significance   ┌───────────┐
│ Draft │ ────────> │ Running │ ───────────────> │ Completed │
└───────┘           └─────────┘                  └───────────┘
                      │  ▲                             │
                pause │  │ resume              archive │
                      ▼  │                             ▼
                  ┌────────┐                    ┌──────────┐
                  │ Paused │                    │ Archived │
                  └────────┘                    └──────────┘
```
| Status | Description |
|---|---|
| `draft` | Experiment designed but not started. Variants and metrics can be edited. |
| `running` | Actively assigning users to variants and collecting metrics. |
| `paused` | Temporarily halted. Existing assignments are preserved. |
| `completed` | Results are in. Winner determined or no significant difference. |
| `archived` | Historical record. Results preserved, experiment inactive. |
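As a rough guide, the transitions in the diagram can be expressed as a lookup table. This is a sketch of the lifecycle as documented here, not a guarantee of which transitions the API will accept.

```typescript
// Status transitions implied by the lifecycle diagram; illustrative only.
const allowedTransitions: Record<string, string[]> = {
  draft: ["running"],               // start
  running: ["paused", "completed"], // pause, or significance / duration reached
  paused: ["running"],              // resume
  completed: ["archived"],          // archive
  archived: [],                     // terminal
};

function canTransition(from: string, to: string): boolean {
  return (allowedTransitions[from] ?? []).includes(to);
}

canTransition("draft", "running");   // true
canTransition("paused", "archived"); // false, per the diagram above
```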
Variants
Each experiment has two or more variants. One must be marked as the control.
```json
{
"variants": [
{
"name": "control",
"value": false,
"weight": 50,
"isControl": true
},
{
"name": "treatment-a",
"value": true,
"weight": 25,
"isControl": false
},
{
"name": "treatment-b",
"value": "v3",
"weight": 25,
"isControl": false
}
]
}
```

| Field | Type | Description |
|---|---|---|
| `name` | string | Variant identifier (e.g., `control`, `treatment-a`) |
| `value` | any | The flag value served to users in this variant |
| `weight` | number | Traffic percentage (all weights must sum to 100) |
| `isControl` | boolean | Whether this is the control/baseline variant |
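Before starting an experiment it is worth checking the variant list locally. The helper below is hypothetical (not part of the Flaggr SDK); it simply encodes the constraints stated above: at least two variants, weights summing to 100, and exactly one control.

```typescript
interface Variant {
  name: string;
  value: unknown;
  weight: number;
  isControl: boolean;
}

// Hypothetical client-side check for the constraints described above.
function validateVariants(variants: Variant[]): string[] {
  const errors: string[] = [];
  if (variants.length < 2) {
    errors.push("an experiment needs at least two variants");
  }
  const totalWeight = variants.reduce((sum, v) => sum + v.weight, 0);
  if (totalWeight !== 100) {
    errors.push(`variant weights sum to ${totalWeight}, expected 100`);
  }
  const controls = variants.filter((v) => v.isControl).length;
  if (controls !== 1) {
    errors.push(`expected exactly one control variant, found ${controls}`);
  }
  return errors;
}
```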
Traffic Allocation
The `trafficAllocation` field controls what percentage of total traffic enters the experiment. The remaining traffic gets the flag's default value.
```
Total traffic: 100%
├── In experiment (trafficAllocation: 80%)
│   ├── Control:   50% of 80% = 40% of total
│   └── Treatment: 50% of 80% = 40% of total
└── Not in experiment: 20% of total (gets default value)
```
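The arithmetic in the diagram generalizes: a variant's share of total traffic is its weight applied within the experiment slice. A quick sketch:

```typescript
// Share of *total* traffic that sees a given variant, as a percentage.
function effectiveTrafficShare(trafficAllocation: number, variantWeight: number): number {
  return (trafficAllocation / 100) * variantWeight;
}

effectiveTrafficShare(80, 50); // => 40, matching the diagram above
effectiveTrafficShare(80, 25); // => 20
```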
Metrics
Metric Types
| Type | Description | Example |
|---|---|---|
| `conversion` | Binary outcome (did/didn't) | Signup completed, purchase made |
| `revenue` | Monetary value | Order total, subscription value |
| `count` | Frequency count | Page views, API calls, clicks |
| `duration` | Time measurement | Session length, time to checkout |
| `custom` | Custom numeric metric | NPS score, engagement index |
Defining Metrics
```json
{
"primaryMetric": {
"id": "signup-conversion",
"name": "Signup Conversion Rate",
"type": "conversion",
"eventName": "signup_completed",
"direction": "increase",
"filters": [
{ "property": "source", "operator": "equals", "value": "organic" }
]
},
"secondaryMetrics": [
{
"id": "time-to-signup",
"name": "Time to Signup",
"type": "duration",
"eventName": "signup_completed",
"direction": "decrease"
},
{
"id": "page-views",
"name": "Pages Viewed During Signup",
"type": "count",
"eventName": "page_view",
"direction": "decrease"
}
]
}
```

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique metric identifier |
| `name` | string | Human-readable metric name |
| `type` | string | `conversion`, `revenue`, `count`, `duration`, `custom` |
| `eventName` | string | Event name to track (matches your analytics events) |
| `direction` | string | `increase` (higher is better) or `decrease` (lower is better) |
| `filters` | array | Optional event filters to narrow the metric scope |
Guardrail Metrics
Guardrail metrics protect against unexpected negative effects. If a guardrail threshold is breached, the experiment automatically takes the configured action (alert, pause, or stop):
```json
{
"guardrailMetrics": [
{
"metric": {
"id": "error-rate",
"name": "Error Rate",
"type": "conversion",
"eventName": "error_occurred",
"direction": "decrease"
},
"threshold": 0.05,
"action": "pause"
},
{
"metric": {
"id": "page-load",
"name": "Page Load Time",
"type": "duration",
"eventName": "page_loaded",
"direction": "decrease"
},
"threshold": 3000,
"action": "alert"
}
]
}
```

| Action | Behavior |
|---|---|
| `alert` | Send notification, experiment continues |
| `pause` | Pause the experiment for manual review |
| `stop` | Stop the experiment immediately |
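The examples above imply a simple breach rule: for a metric whose desired direction is `decrease` (error rate, page load time), the guardrail fires when the observed value rises above the threshold, and vice versa for `increase`. The sketch below encodes that reading; it is an assumption about the semantics, not Flaggr's internal implementation.

```typescript
type GuardrailAction = "alert" | "pause" | "stop";

interface Guardrail {
  direction: "increase" | "decrease"; // the direction the metric is supposed to move
  threshold: number;                  // breach level, in the metric's own units
  action: GuardrailAction;
}

// Assumed breach semantics: a "decrease" metric breaches when it goes above
// the threshold; an "increase" metric breaches when it falls below it.
function guardrailBreached(guardrail: Guardrail, observed: number): boolean {
  return guardrail.direction === "decrease"
    ? observed > guardrail.threshold
    : observed < guardrail.threshold;
}

// Error rate observed at 6% against the 5% threshold from the example above.
guardrailBreached({ direction: "decrease", threshold: 0.05, action: "pause" }, 0.06); // => true
```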
Statistical Configuration
Configure how results are analyzed:
```json
{
"statisticalConfig": {
"confidenceLevel": 0.95,
"minimumDetectableEffect": 0.05,
"method": "frequentist",
"correctionMethod": "bonferroni"
}
}
```

| Field | Type | Description |
|---|---|---|
| `confidenceLevel` | number | Required confidence level (0.0–1.0, typically 0.95) |
| `minimumDetectableEffect` | number | Smallest effect size worth detecting (e.g., 0.05 = 5%) |
| `method` | string | `frequentist` (p-values) or `bayesian` (posterior probabilities) |
| `correctionMethod` | string | Multiple comparison correction: `bonferroni` or `benjamini-hochberg` |
Frequentist vs Bayesian
| Method | Best For | Output |
|---|---|---|
| Frequentist | Classic hypothesis testing, clear yes/no decisions | p-value, confidence interval |
| Bayesian | Continuous monitoring, "probability of being best" | Posterior probability |
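To make the frequentist output concrete, here is a minimal two-proportion z-test for a conversion metric. It reproduces the kind of numbers shown in the Results section below; it is a statistics illustration, not Flaggr's analysis engine.

```typescript
// Minimal frequentist analysis for a conversion metric: two-proportion z-test.
function twoProportionZTest(
  controlConversions: number, controlSample: number,
  treatmentConversions: number, treatmentSample: number,
) {
  const p1 = controlConversions / controlSample;
  const p2 = treatmentConversions / treatmentSample;
  const pooled = (controlConversions + treatmentConversions) / (controlSample + treatmentSample);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / controlSample + 1 / treatmentSample));
  const z = (p2 - p1) / se;
  const pValue = 2 * (1 - standardNormalCdf(Math.abs(z))); // two-sided
  return { z, pValue, relativeEffect: (p2 - p1) / p1 };
}

// Polynomial approximation of the standard normal CDF (Abramowitz & Stegun 26.2.17).
function standardNormalCdf(x: number): number {
  const t = 1 / (1 + 0.2316419 * x);
  const d = Math.exp((-x * x) / 2) / Math.sqrt(2 * Math.PI);
  const poly =
    t * (0.31938153 + t * (-0.356563782 + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
  return 1 - d * poly;
}

// 12% vs 15% conversion on 5,000 users per variant, as in the Results example below:
twoProportionZTest(600, 5000, 750, 5000);
// => relativeEffect 0.25, p-value far below 0.05 at a 0.95 confidence level
```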
Multiple Comparison Correction
When testing multiple variants or metrics, correction methods prevent false positives. For example, with a 0.95 confidence level (α = 0.05) and three comparisons, Bonferroni requires each p-value to clear 0.05 / 3 ≈ 0.0167:
| Method | Approach |
|---|---|
| Bonferroni | Conservative — divides significance level by number of comparisons |
| Benjamini-Hochberg | Less conservative — controls false discovery rate |
Results
When an experiment completes, results include per-variant analysis:
```json
{
"results": {
"totalSampleSize": 10000,
"variantResults": [
{
"variantName": "control",
"sampleSize": 5000,
"metrics": [
{
"metricId": "signup-conversion",
"value": 0.12,
"confidenceInterval": [0.11, 0.13],
"pValue": null,
"relativeEffect": 0,
"isSignificant": false
}
]
},
{
"variantName": "treatment-a",
"sampleSize": 5000,
"metrics": [
{
"metricId": "signup-conversion",
"value": 0.15,
"confidenceInterval": [0.14, 0.16],
"pValue": 0.001,
"relativeEffect": 0.25,
"isSignificant": true
}
]
}
],
"winner": "treatment-a",
"recommendation": "ship_winner",
"lastUpdated": "2026-02-21T12:00:00.000Z"
}
}
```

Recommendations
| Recommendation | Meaning |
|---|---|
| `ship_winner` | Clear winner found — ship the winning variant |
| `extend_experiment` | Not enough data yet — continue collecting |
| `no_significant_difference` | No meaningful difference between variants |
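The recommendation roughly follows from sample size and significance. The decision sketch below is a simplification of what the table describes; the real analysis also takes guardrails, `confidenceLevel`, and `maximumDurationDays` into account.

```typescript
type Recommendation = "ship_winner" | "extend_experiment" | "no_significant_difference";

// Simplified decision logic implied by the table above; illustrative only.
function recommend(input: {
  totalSampleSize: number;
  minimumSampleSize: number;
  winner: string | null; // variant with a significant win on the primary metric
}): Recommendation {
  if (input.totalSampleSize < input.minimumSampleSize) return "extend_experiment";
  if (input.winner) return "ship_winner";
  return "no_significant_difference";
}

recommend({ totalSampleSize: 10000, minimumSampleSize: 5000, winner: "treatment-a" });
// => "ship_winner", matching the example results above
```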
Running Experiments with Flags Today
While the full experiments API is in development, you can run A/B tests using feature flags with variants:
```bash
# Create a flag with variants for an A/B test
curl -X POST /api/flags \
-H "Authorization: Bearer flg_your_token" \
-H "Content-Type: application/json" \
-d '{
"key": "exp-shorter-signup",
"name": "Shorter Signup Experiment",
"type": "string",
"enabled": true,
"defaultValue": "control",
"serviceId": "web-app",
"environment": "production",
"variants": [
{ "name": "control", "value": "control", "weight": 50 },
{ "name": "treatment", "value": "treatment", "weight": 50 }
],
"tags": ["experiment"]
}'
```

```js
// In your application
const variant = await client.getStringValue("exp-shorter-signup", "control", {
targetingKey: userId,
});
if (variant === "treatment") {
renderShortSignup();
} else {
renderFullSignup();
}
// Track metrics in your analytics system
analytics.track("signup_completed", { variant, userId });
```

Variant assignment is deterministic — the same `targetingKey` always gets the same variant via consistent hashing.
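If you are curious how deterministic assignment can work, the sketch below shows the usual approach: hash the targeting key together with the flag key into a 0-99 bucket and walk the cumulative variant weights. The hash function here (FNV-1a) is only an example; Flaggr's actual hashing is an implementation detail covered in Advanced Evaluation.

```typescript
// Sketch of weight-based bucketing with a stable hash; illustrative only.
function assignVariant(
  flagKey: string,
  targetingKey: string,
  variants: Array<{ name: string; weight: number }>,
): string {
  // FNV-1a over "flagKey:targetingKey" gives a stable 32-bit hash.
  let hash = 2166136261;
  for (const char of `${flagKey}:${targetingKey}`) {
    hash ^= char.charCodeAt(0);
    hash = Math.imul(hash, 16777619);
  }
  const bucket = (hash >>> 0) % 100; // 0-99

  let cumulative = 0;
  for (const variant of variants) {
    cumulative += variant.weight;
    if (bucket < cumulative) return variant.name;
  }
  return variants[variants.length - 1].name; // weights should sum to 100
}

// The same user always lands in the same bucket, and therefore the same variant.
assignVariant("exp-shorter-signup", "user-123", [
  { name: "control", weight: 50 },
  { name: "treatment", weight: 50 },
]);
```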
Related
- Concepts — Core data model and evaluation flow
- Targeting Rules — Conditions and operators for targeting
- Progressive Rollouts — Staged deployment for non-experiment rollouts
- Mutual Exclusion — Prevent experiment interference
- Advanced Evaluation — Consistent hashing for deterministic assignment