Experiments & A/B Testing

Run controlled experiments with variants, traffic allocation, statistical analysis, and guardrail metrics.

Flaggr's experiment system lets you run controlled A/B tests with configurable variants, traffic allocation, metric tracking, and statistical analysis. Experiments are built on top of feature flags — each experiment is backed by a flag that controls variant assignment.

Warning

The experiments API is under active development. This guide documents the data model and planned API surface. The core evaluation and variant assignment features are available today through feature flags with variants and rollout percentages.

How Experiments Work

An experiment extends a feature flag with:

  1. Variants with explicit weights (traffic allocation)
  2. A control group for comparison
  3. Metrics to measure success (conversion, revenue, engagement)
  4. Statistical analysis to determine significance
  5. Guardrail metrics that auto-pause if key indicators degrade
┌─────────────────────────────────────────────────┐
│  Experiment: "Shorter Signup Flow"               │
│                                                  │
│  Hypothesis: Reducing signup steps from 4 to 2   │
│  will increase conversion by 15%                 │
│                                                  │
│  ┌──────────────┐  ┌──────────────┐              │
│  │ Control (50%)│  │Treatment(50%)│              │
│  │ 4-step flow  │  │ 2-step flow  │              │
│  └──────────────┘  └──────────────┘              │
│                                                  │
│  Primary metric: signup_completed (increase)     │
│  Guardrail: error_rate < 5% (pause if exceeded) │
└─────────────────────────────────────────────────┘

Experiment Data Model

Experiment

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique experiment identifier |
| projectId | string | Project this experiment belongs to |
| serviceId | string | Service that evaluates the flag |
| name | string | Human-readable experiment name |
| description | string | What you're testing and why |
| hypothesis | string | Your predicted outcome |
| status | string | draft, running, paused, completed, archived |
| flagKey | string | The feature flag that controls variant assignment |
| variants | array | Variant definitions with weights |
| trafficAllocation | number | Percentage of total traffic in the experiment (0-100) |
| primaryMetric | object | The main metric you're trying to improve |
| secondaryMetrics | array | Additional metrics to track |
| guardrailMetrics | array | Metrics that trigger auto-pause if they degrade |
| statisticalConfig | object | Statistical analysis configuration |
| minimumSampleSize | number | Minimum users before results are significant |
| maximumDurationDays | number | Auto-stop after this many days |
| autoStopOnSignificance | boolean | Stop early when results reach significance |
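
For reference, the fields above map roughly onto the following TypeScript shape. This is an illustrative sketch derived from the table, not a published SDK type; the actual API types may differ.

// Illustrative sketch of the experiment data model described above.
// Field names follow the table; exact API types may differ.
type ExperimentStatus = "draft" | "running" | "paused" | "completed" | "archived";

interface Variant {
  name: string;        // e.g. "control", "treatment-a"
  value: unknown;      // flag value served to users in this variant
  weight: number;      // traffic percentage; all weights sum to 100
  isControl: boolean;
}

interface Metric {
  id: string;
  name: string;
  type: "conversion" | "revenue" | "count" | "duration" | "custom";
  eventName: string;
  direction: "increase" | "decrease";
  filters?: { property: string; operator: string; value: unknown }[];
}

interface Experiment {
  id: string;
  projectId: string;
  serviceId: string;
  name: string;
  description: string;
  hypothesis: string;
  status: ExperimentStatus;
  flagKey: string;
  variants: Variant[];
  trafficAllocation: number;           // 0-100, share of total traffic
  primaryMetric: Metric;
  secondaryMetrics: Metric[];
  guardrailMetrics: { metric: Metric; threshold: number; action: "alert" | "pause" | "stop" }[];
  statisticalConfig: {
    confidenceLevel: number;           // e.g. 0.95
    minimumDetectableEffect: number;   // e.g. 0.05
    method: "frequentist" | "bayesian";
    correctionMethod?: "bonferroni" | "benjamini-hochberg";
  };
  minimumSampleSize: number;
  maximumDurationDays: number;
  autoStopOnSignificance: boolean;
}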

Experiment Lifecycle

  ┌───────┐   start   ┌─────────┐   significance   ┌───────────┐
  │ Draft │ ────────> │ Running │ ───────────────> │ Completed │
  └───────┘           └─────────┘                  └───────────┘
                         │  ▲                           │
                   pause │  │ resume              archive │
                         ▼  │                           ▼
                      ┌────────┐                  ┌──────────┐
                      │ Paused │                  │ Archived │
                      └────────┘                  └──────────┘
| Status | Description |
| --- | --- |
| draft | Experiment designed but not started. Variants and metrics can be edited. |
| running | Actively assigning users to variants and collecting metrics. |
| paused | Temporarily halted. Existing assignments are preserved. |
| completed | Results are in. Winner determined or no significant difference. |
| archived | Historical record. Results preserved, experiment inactive. |
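
The diagram implies a small set of legal transitions. A guard like the following sketch can keep status changes consistent on the client; it is purely illustrative, and Flaggr enforces the lifecycle server-side.

// Legal status transitions implied by the lifecycle diagram above.
// Illustrative only; the Flaggr API enforces this server-side.
type ExperimentStatus = "draft" | "running" | "paused" | "completed" | "archived";

const allowedTransitions: Record<ExperimentStatus, ExperimentStatus[]> = {
  draft:     ["running"],              // start
  running:   ["paused", "completed"],  // pause / significance reached
  paused:    ["running"],              // resume
  completed: ["archived"],             // archive
  archived:  [],
};

function canTransition(from: ExperimentStatus, to: ExperimentStatus): boolean {
  return allowedTransitions[from].includes(to);
}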

Variants

Each experiment has two or more variants. One must be marked as the control.

{
  "variants": [
    {
      "name": "control",
      "value": false,
      "weight": 50,
      "isControl": true
    },
    {
      "name": "treatment-a",
      "value": true,
      "weight": 25,
      "isControl": false
    },
    {
      "name": "treatment-b",
      "value": "v3",
      "weight": 25,
      "isControl": false
    }
  ]
}
| Field | Type | Description |
| --- | --- | --- |
| name | string | Variant identifier (e.g., control, treatment-a) |
| value | any | The flag value served to users in this variant |
| weight | number | Traffic percentage (all weights must sum to 100) |
| isControl | boolean | Whether this is the control/baseline variant |
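
The two invariants called out above (weights sum to 100, exactly one control) are easy to check client-side before creating an experiment. A minimal sketch, assuming the Variant shape from the data model table:

// Sanity-check variant definitions before submitting an experiment.
// Illustrative only; the API performs its own validation.
interface Variant {
  name: string;
  value: unknown;
  weight: number;
  isControl: boolean;
}

function validateVariants(variants: Variant[]): string[] {
  const errors: string[] = [];
  if (variants.length < 2) errors.push("An experiment needs at least two variants.");
  const totalWeight = variants.reduce((sum, v) => sum + v.weight, 0);
  if (totalWeight !== 100) errors.push(`Variant weights must sum to 100 (got ${totalWeight}).`);
  const controls = variants.filter((v) => v.isControl).length;
  if (controls !== 1) errors.push(`Exactly one variant must be the control (got ${controls}).`);
  return errors;
}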

Traffic Allocation

The trafficAllocation field controls what percentage of total traffic enters the experiment. The remaining traffic gets the flag's default value.

Total traffic: 100%
├── In experiment (trafficAllocation: 80%)
│   ├── Control: 50% of 80% = 40% of total
│   └── Treatment: 50% of 80% = 40% of total
└── Not in experiment: 20% of total (gets default value)
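
Each variant's effective share of total traffic is simply trafficAllocation/100 multiplied by weight/100, as in this small sketch:

// Effective share of *total* traffic for one variant.
// e.g. trafficAllocation 80 and weight 50 -> 0.4 (40% of all users).
function effectiveShare(trafficAllocation: number, variantWeight: number): number {
  return (trafficAllocation / 100) * (variantWeight / 100);
}

console.log(effectiveShare(80, 50)); // 0.4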

Metrics

Metric Types

| Type | Description | Example |
| --- | --- | --- |
| conversion | Binary outcome (did/didn't) | Signup completed, purchase made |
| revenue | Monetary value | Order total, subscription value |
| count | Frequency count | Page views, API calls, clicks |
| duration | Time measurement | Session length, time to checkout |
| custom | Custom numeric metric | NPS score, engagement index |

Defining Metrics

{
  "primaryMetric": {
    "id": "signup-conversion",
    "name": "Signup Conversion Rate",
    "type": "conversion",
    "eventName": "signup_completed",
    "direction": "increase",
    "filters": [
      { "property": "source", "operator": "equals", "value": "organic" }
    ]
  },
  "secondaryMetrics": [
    {
      "id": "time-to-signup",
      "name": "Time to Signup",
      "type": "duration",
      "eventName": "signup_completed",
      "direction": "decrease"
    },
    {
      "id": "page-views",
      "name": "Pages Viewed During Signup",
      "type": "count",
      "eventName": "page_view",
      "direction": "decrease"
    }
  ]
}
| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique metric identifier |
| name | string | Human-readable metric name |
| type | string | conversion, revenue, count, duration, custom |
| eventName | string | Event name to track (matches your analytics events) |
| direction | string | increase (higher is better) or decrease (lower is better) |
| filters | array | Optional event filters to narrow the metric scope |
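
Filters narrow which analytics events count toward a metric: an event contributes only if its name matches eventName and every filter matches. A rough sketch of that check, limited to the equals operator shown in the example above (other operators are not documented here):

// Whether an analytics event counts toward a metric, given its filters.
// Illustrative only; evaluation happens inside Flaggr's metric pipeline.
interface MetricFilter {
  property: string;
  operator: "equals";
  value: unknown;
}

function eventMatchesMetric(
  event: { name: string; properties: Record<string, unknown> },
  eventName: string,
  filters: MetricFilter[] = []
): boolean {
  if (event.name !== eventName) return false;
  return filters.every((f) => event.properties[f.property] === f.value);
}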

Guardrail Metrics

Guardrail metrics protect against unexpected negative effects. If a guardrail threshold is breached, the experiment is automatically paused or stopped:

{
  "guardrailMetrics": [
    {
      "metric": {
        "id": "error-rate",
        "name": "Error Rate",
        "type": "conversion",
        "eventName": "error_occurred",
        "direction": "decrease"
      },
      "threshold": 0.05,
      "action": "pause"
    },
    {
      "metric": {
        "id": "page-load",
        "name": "Page Load Time",
        "type": "duration",
        "eventName": "page_loaded",
        "direction": "decrease"
      },
      "threshold": 3000,
      "action": "alert"
    }
  ]
}
| Action | Behavior |
| --- | --- |
| alert | Send notification, experiment continues |
| pause | Pause the experiment for manual review |
| stop | Stop the experiment immediately |
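
Conceptually, a guardrail is breached when the observed value crosses the threshold in the "bad" direction for that metric: a decrease metric (lower is better) breaches when the value rises above the threshold, and an increase metric breaches when it falls below. A minimal sketch of that check; the evaluation itself happens inside Flaggr, not in your code.

// Conceptual guardrail check. Illustrative only; Flaggr evaluates
// guardrails server-side against the metrics it collects.
type GuardrailAction = "alert" | "pause" | "stop";

interface Guardrail {
  direction: "increase" | "decrease";
  threshold: number;
  action: GuardrailAction;
}

function guardrailBreached(g: Guardrail, observed: number): GuardrailAction | null {
  const breached =
    g.direction === "decrease" ? observed > g.threshold : observed < g.threshold;
  return breached ? g.action : null;
}

// Example from above: error-rate guardrail with threshold 0.05
guardrailBreached({ direction: "decrease", threshold: 0.05, action: "pause" }, 0.07); // "pause"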

Statistical Configuration

Configure how results are analyzed:

{
  "statisticalConfig": {
    "confidenceLevel": 0.95,
    "minimumDetectableEffect": 0.05,
    "method": "frequentist",
    "correctionMethod": "bonferroni"
  }
}
| Field | Type | Description |
| --- | --- | --- |
| confidenceLevel | number | Required confidence level (0.0–1.0, typically 0.95) |
| minimumDetectableEffect | number | Smallest effect size worth detecting (e.g., 0.05 = 5%) |
| method | string | frequentist (p-values) or bayesian (posterior probabilities) |
| correctionMethod | string | Multiple comparison correction: bonferroni or benjamini-hochberg |
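
To get a feel for how these settings translate into experiment duration, the usual two-proportion approximation gives a ballpark per-variant sample size from the baseline rate, the MDE, and the confidence and power levels. This sketch is generic statistics, not a Flaggr API; it assumes 80% power and treats the MDE as a relative lift over the baseline.

// Rough per-variant sample size for a conversion metric, using the common
// two-proportion approximation:
//   n ≈ (zAlpha + zBeta)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
// Assumptions (not from the Flaggr docs): 80% power, MDE is a relative lift.
function requiredSampleSizePerVariant(
  baselineRate: number,             // e.g. 0.12 (current conversion rate)
  minimumDetectableEffect: number,  // e.g. 0.05 -> detect a 5% relative lift
  zAlpha = 1.96,                    // two-sided 95% confidence
  zBeta = 0.84                      // 80% power
): number {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + minimumDetectableEffect);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (p2 - p1) ** 2);
}

console.log(requiredSampleSizePerVariant(0.12, 0.05)); // ~46,980 users per variant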

Frequentist vs Bayesian

| Method | Best For | Output |
| --- | --- | --- |
| Frequentist | Classic hypothesis testing, clear yes/no decisions | p-value, confidence interval |
| Bayesian | Continuous monitoring, "probability of being best" | Posterior probability |

Multiple Comparison Correction

When testing multiple variants or metrics, correction methods prevent false positives:

| Method | Approach |
| --- | --- |
| Bonferroni | Conservative; divides the significance level by the number of comparisons |
| Benjamini-Hochberg | Less conservative; controls the false discovery rate |
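
The two corrections behave quite differently in practice. For example, with alpha = 0.05 and three treatment-vs-control comparisons, Bonferroni tests each comparison at 0.05 / 3 ≈ 0.0167. A minimal sketch of both procedures (generic statistics, not part of the Flaggr API):

// Multiple-comparison corrections, sketched for illustration.
// Bonferroni: each of m tests uses alpha / m.
// Benjamini-Hochberg: sort p-values ascending; the i-th (1-based) is
// significant if p_i <= (i / m) * alpha, taking the largest such i and
// everything ranked below it.
function bonferroniAlpha(alpha: number, m: number): number {
  return alpha / m;
}

function benjaminiHochberg(pValues: number[], alpha: number): boolean[] {
  const m = pValues.length;
  const order = pValues.map((p, idx) => ({ p, idx })).sort((a, b) => a.p - b.p);
  let cutoffIndex = -1;
  order.forEach(({ p }, i) => {
    if (p <= ((i + 1) / m) * alpha) cutoffIndex = i;
  });
  const significant = new Array(m).fill(false);
  for (let i = 0; i <= cutoffIndex; i++) significant[order[i].idx] = true;
  return significant;
}

console.log(bonferroniAlpha(0.05, 3));                     // ~0.0167
console.log(benjaminiHochberg([0.001, 0.02, 0.04], 0.05)); // [true, true, true]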

Results

When an experiment completes, results include per-variant analysis:

{
  "results": {
    "totalSampleSize": 10000,
    "variantResults": [
      {
        "variantName": "control",
        "sampleSize": 5000,
        "metrics": [
          {
            "metricId": "signup-conversion",
            "value": 0.12,
            "confidenceInterval": [0.11, 0.13],
            "pValue": null,
            "relativeEffect": 0,
            "isSignificant": false
          }
        ]
      },
      {
        "variantName": "treatment-a",
        "sampleSize": 5000,
        "metrics": [
          {
            "metricId": "signup-conversion",
            "value": 0.15,
            "confidenceInterval": [0.14, 0.16],
            "pValue": 0.001,
            "relativeEffect": 0.25,
            "isSignificant": true
          }
        ]
      }
    ],
    "winner": "treatment-a",
    "recommendation": "ship_winner",
    "lastUpdated": "2026-02-21T12:00:00.000Z"
  }
}
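
In the example above, relativeEffect is the lift relative to the control: (0.15 - 0.12) / 0.12 = 0.25, i.e. a 25% relative improvement for treatment-a.

// How relativeEffect in the results above is derived:
// (treatment value - control value) / control value.
const controlRate = 0.12;
const treatmentRate = 0.15;
const relativeEffect = (treatmentRate - controlRate) / controlRate; // 0.25 -> +25%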

Recommendations

| Recommendation | Meaning |
| --- | --- |
| ship_winner | Clear winner found; ship the winning variant |
| extend_experiment | Not enough data yet; continue collecting |
| no_significant_difference | No meaningful difference between variants |

Running Experiments with Flags Today

While the full experiments API is in development, you can run A/B tests using feature flags with variants:

# Create a flag with variants for an A/B test
curl -X POST /api/flags \
  -H "Authorization: Bearer flg_your_token" \
  -H "Content-Type: application/json" \
  -d '{
    "key": "exp-shorter-signup",
    "name": "Shorter Signup Experiment",
    "type": "string",
    "enabled": true,
    "defaultValue": "control",
    "serviceId": "web-app",
    "environment": "production",
    "variants": [
      { "name": "control", "value": "control", "weight": 50 },
      { "name": "treatment", "value": "treatment", "weight": 50 }
    ],
    "tags": ["experiment"]
  }'
// In your application
const variant = await client.getStringValue("exp-shorter-signup", "control", {
  targetingKey: userId,
});
 
if (variant === "treatment") {
  renderShortSignup();
} else {
  renderFullSignup();
}
 
// Track metrics in your analytics system
analytics.track("signup_completed", { variant, userId });

Variant assignment is deterministic — the same targetingKey always gets the same variant via consistent hashing.
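
Under the hood this is the standard bucketing technique: hash the targetingKey together with the flag key into a number in the range 0-99 and pick the variant whose cumulative weight range contains it. The sketch below illustrates the idea only; it is not Flaggr's actual hash function, so its assignments will not match the server's.

// Illustrative deterministic bucketing: the same (flagKey, targetingKey)
// pair always lands in the same variant. This is NOT Flaggr's actual
// hashing scheme, just the general technique.
import { createHash } from "node:crypto";

interface Variant { name: string; weight: number }

function assignVariant(flagKey: string, targetingKey: string, variants: Variant[]): string {
  const digest = createHash("sha256").update(`${flagKey}:${targetingKey}`).digest();
  const bucket = digest.readUInt32BE(0) % 100; // 0..99
  let cumulative = 0;
  for (const v of variants) {
    cumulative += v.weight;
    if (bucket < cumulative) return v.name;
  }
  return variants[variants.length - 1].name; // weights should sum to 100
}

assignVariant("exp-shorter-signup", "user-42", [
  { name: "control", weight: 50 },
  { name: "treatment", weight: 50 },
]);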