bootcamp / III: Analytics / step 5 of 10

Time Series Basics

3-4 hours - tier 2 - time-series analytics bootcamp

Step 5 of 10 in Bootcamp III: Operational Analytics.

Why This Step Exists

Agent metrics are time series. Token usage per day. Error rate per hour. Cost per week. Finding count per review cycle. Every backlog item has a creation timestamp. The catch-log has dates. The events log has dates. The moment you ask “is this going up?” or “is that spike normal?” you are doing time series analysis.

Most introductions to time series start with ARIMA models, stationarity tests, and autocorrelation functions. That is the wrong starting point for operational analytics. You do not need to forecast next week’s token usage. You need to answer three practical questions:

Is this going up or down? Linear regression on a time index gives you a slope and a p-value. Done.
Is there a repeating pattern? Resample to the suspected frequency and look. Decompose if you want to separate trend from pattern.
Is this spike an anomaly? Compare it to the rolling mean and standard deviation. If it is 3 standard deviations away, it is unusual.

These are time series questions with practical answers that do not require Fourier transforms or differencing. This step teaches the tools for answering them - in both pandas and DuckDB - using project data from events.yaml and catch-log.tsv.

The goal: turn append-only event logs into operational instruments that tell you when something changed, whether the change matters, and whether the new behavior is stable.

Resampling and Frequency Conversion (~25 min)
Moving Averages (~30 min)
Trend Detection (~30 min)
Seasonality (~30 min)
Anomaly Detection (~35 min)
Change Point Detection (~40 min)
Challenges (~60-90 min)
Key Takeaways
Recommended Reading
What to Read Next

1. Resampling and Frequency Conversion

Estimated time: 25 minutes

Raw event data arrives at whatever granularity the system produces - one row per event, one row per commit, one row per catch-log entry. The first step in any time series analysis is choosing a regular frequency and aggregating the raw data to match.

Resampling is frequency conversion. You take data at one granularity (individual events with timestamps) and convert it to another (daily counts, weekly sums, monthly averages). This is not a statistical technique. It is data preparation. But getting it wrong - choosing the wrong frequency, using the wrong aggregation, or failing to handle gaps - will make every downstream analysis incorrect.

Resampling in pandas

The resample() method requires a DatetimeIndex. If your DataFrame has a date column but uses a default integer index, you must convert:

import pandas as pd
import yaml

# Load events.yaml
with open("docs/internal/events.yaml") as f:
  raw = yaml.safe_load(f)

events = pd.DataFrame(raw["events"])
events["date"] = pd.to_datetime(events["date"])
events = events.set_index("date")
events = events.sort_index()  # resample requires sorted index

Once the index is a DatetimeIndex, resampling is a groupby on time:

# Count events per day
daily_counts = events.resample("D")["type"].count()

# Count events per week (week ending Sunday by default)
weekly_counts = events.resample("W")["type"].count()

The resample() call returns a Resampler object, not data. You must chain an aggregation: .count(), .sum(), .mean(), .agg(). Without the aggregation, you get a Resampler that prints as an opaque object.

Common frequency aliases in pandas 2.x:

Alias	Meaning
`"D"`	Calendar day
`"W"`	Week (ending Sunday)
`"ME"`	Month end
`"MS"`	Month start
`"h"`	Hour

Note the 2.x changes: "M" became "ME", "H" became "h", "Y" became "YE". Using the old aliases produces FutureWarning in 2.1 and ValueError in 2.2+.

Resampling in DuckDB

DuckDB uses DATE_TRUNC for the same operation, combined with GROUP BY:

import duckdb

result = duckdb.sql("""
  SELECT
    DATE_TRUNC('week', date::DATE) AS week,
    COUNT(*) AS event_count
  FROM read_csv_auto('docs/internal/weaver/catch-log.tsv', delim='\t')
  GROUP BY 1
  ORDER BY 1
""").fetchdf()

DATE_TRUNC('week', date) truncates each date to the Monday of its week. This is the DuckDB equivalent of resample("W"), but anchored to Monday rather than Sunday. The distinction matters when comparing output between the two tools.

Handling Gaps

Real event data has gaps. No events on a Saturday means no row for Saturday. This creates problems for rolling calculations and visualizations. Fill the gaps explicitly:

# pandas: create a complete date range and reindex
full_range = pd.date_range(
  start=daily_counts.index.min(),
  end=daily_counts.index.max(),
  freq="D"
)
daily_counts = daily_counts.reindex(full_range, fill_value=0)

In DuckDB, generate the date range with generate_series:

WITH date_range AS (
  SELECT UNNEST(generate_series(
    DATE '2026-03-04',
    DATE '2026-03-10',
    INTERVAL 1 DAY
  )) AS date
),
raw_counts AS (
  SELECT date::DATE AS date, COUNT(*) AS n
  FROM read_csv_auto('docs/internal/weaver/catch-log.tsv', delim='\t')
  GROUP BY 1
)
SELECT
  dr.date,
  COALESCE(rc.n, 0) AS event_count
FROM date_range dr
LEFT JOIN raw_counts rc ON dr.date = rc.date
ORDER BY dr.date

AGENTIC GROUNDING: When an agent reports “event frequency is stable at 5 per day,” check whether it accounted for zero-event days. If the agent computed mean(daily_counts) on a DataFrame that only contains rows where events occurred, every zero-count day is invisible. The mean is inflated. Reindexing with fill_value=0 before computing statistics is a one-line fix that changes the answer.

2. Moving Averages

Estimated time: 30 minutes

A moving average smooths noisy data by averaging each point with its neighbors. This is the single most useful time series operation for operational analytics. It answers: “ignoring the day-to-day noise, what is the underlying level?”

There are two kinds that matter in practice: the Simple Moving Average (SMA) and the Exponentially Weighted Moving Average (EWMA).

Simple Moving Average (SMA)

The SMA gives equal weight to every observation in the window. A 7-day SMA averages the current value with the previous 6 days:

# pandas
df["sma_7"] = df["event_count"].rolling(7).mean()

The first 6 values of sma_7 will be NaN because there are not enough prior observations to fill the window. If you want partial windows (compute the mean of whatever is available), set min_periods=1:

df["sma_7"] = df["event_count"].rolling(7, min_periods=1).mean()

If you already worked through Step 3, this should look familiar. A rolling mean is a window function applied to time-ordered data. The difference from Step 3’s SQL window functions is that pandas rolling() operates on rows (or time offsets), while SQL window functions operate on partitioned result sets.

An important distinction: window=7 means 7 rows, not 7 calendar days. If your data has gaps (missing weekends), a 7-row window covers more than a week. Use a time-based offset for calendar-based windows:

# Calendar-based: always covers exactly 7 days, regardless of gaps
df["sma_7d"] = df["event_count"].rolling("7D").mean()

The DuckDB equivalent uses a window function:

SELECT
  date,
  event_count,
  AVG(event_count) OVER (
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS sma_7
FROM daily_events

Note: ROWS BETWEEN 6 PRECEDING AND CURRENT ROW gives a 7-row window (6 + current = 7). This is the row-based equivalent. For calendar-based windows in DuckDB:

SELECT
  date,
  event_count,
  AVG(event_count) OVER (
    ORDER BY date
    RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW
  ) AS sma_7d
FROM daily_events

Exponentially Weighted Moving Average (EWMA)

The EWMA gives more weight to recent observations and exponentially decaying weight to older ones. It tracks recent changes more closely than the SMA:

# pandas
df["ewma_7"] = df["event_count"].ewm(span=7).mean()

The span parameter controls how fast the weights decay. span=7 means the center of mass is at 3 observations, and roughly 86% of the total weight falls within the last 7 periods. A higher span produces smoother output; a lower span tracks the raw data more closely.

You must specify exactly one of com, span, halflife, or alpha. They are mathematically related: alpha = 2 / (span + 1). For instructional purposes, span is the most intuitive.

Unlike the SMA, the EWMA has no leading NaN values. It is defined from the first observation because the weighting scheme does not require a full window.

SMA vs EWMA: When to Use Each

Property	SMA	EWMA
Weights	Equal across window	Exponentially decaying
Lag	Higher (centered at middle of window)	Lower (weighted toward recent)
Leading NaN	Yes (window - 1 values)	No
Best for	Smoothing noise, stable baselines	Tracking regime shifts, recent trends

Use the SMA when you want to remove noise and all recent observations are equally relevant. Use the EWMA when you want the smoothed line to respond faster to recent changes - for example, detecting that cost started climbing this week rather than last month.

AGENTIC GROUNDING: An agent monitoring its own token usage might report a 7-day moving average to smooth out variance from task complexity. But if the agent uses SMA and a regime change happened 3 days ago (new prompt, different task distribution), the SMA will not fully reflect the change for another 4 days. The EWMA reacts faster. When reviewing agent self-reporting of smoothed metrics, ask which smoothing method was used and whether the lag is appropriate for the question being asked.

3. Trend Detection

Estimated time: 30 minutes

“Is this going up?” is the most common question in operational analytics. A visual inspection of a line chart often suffices (Step 6 covers visualization), but sometimes you need a number: the rate of change, and whether it is statistically significant.

The simplest quantitative trend detection is linear regression on a time index. You fit a line y = slope * t + intercept where t is time (as a numeric sequence) and y is the metric. The slope tells you the rate of change. The p-value tells you whether the slope is significantly different from zero.

Linear Regression with scipy

import numpy as np
from scipy.stats import linregress

# Prepare: x must be numeric, not datetime
x = np.arange(len(df))  # 0, 1, 2, ..., n-1
y = df["event_count"].values

result = linregress(x, y)

The LinregressResult has five useful fields:

Field	Meaning
`result.slope`	Rate of change per time unit
`result.intercept`	Starting level (y when x=0)
`result.rvalue`	Pearson correlation coefficient
`result.pvalue`	P-value for H0: slope = 0
`result.stderr`	Standard error of the slope

Interpret the result:

if result.pvalue < 0.05:
  direction = "increasing" if result.slope > 0 else "decreasing"
  r_squared = result.rvalue ** 2
  printf(f"Significant {direction} trend: "
         f"slope={result.slope:.4f}/day, "
         f"R^2={r_squared:.3f}, "
         f"p={result.pvalue:.4f}")
else:
  printf("No significant trend detected "
         f"(slope={result.slope:.4f}, p={result.pvalue:.4f})")

The rvalue**2 (R-squared) tells you how much of the variance is explained by the linear trend. An R-squared of 0.7 means 70% of the variation in the metric is explained by the passage of time. The remaining 30% is noise, seasonality, or other factors.

Practical Example: Catch-Log Firing Rate

Load the catch-log and compute a daily control firing count:

import pandas as pd
from scipy.stats import linregress

catches = pd.read_csv(
  "docs/internal/weaver/catch-log.tsv", sep="\t", parse_dates=["date"]
)
daily = catches.groupby("date").size().reset_index(name="count")
daily = daily.sort_values("date")

x = np.arange(len(daily))
y = daily["count"].values

result = linregress(x, y)
printf(f"Slope: {result.slope:.3f} catches/day, p={result.pvalue:.4f}")

With only 2 distinct dates in the current catch-log (2026-03-04 and 2026-03-05), this will produce a line through 2 points - a perfect fit with no residual, which is not meaningful. This illustrates an important constraint: trend detection requires enough data points to distinguish signal from noise. With fewer than 10-15 observations, linear regression is unreliable. The p-value may look significant, but the confidence interval on the slope will be wide.

Trend Detection in DuckDB

DuckDB does not have a built-in LINREGRESS function, but you can compute the slope manually using the regression formulas:

SELECT
  (COUNT(*) * SUM(x * y) - SUM(x) * SUM(y)) /
  (COUNT(*) * SUM(x * x) - SUM(x) * SUM(x)) AS slope,
  (SUM(y) - slope * SUM(x)) / COUNT(*) AS intercept
FROM (
  SELECT
    ROW_NUMBER() OVER (ORDER BY date) - 1 AS x,
    count AS y
  FROM daily_counts
)

This is the closed-form formula for simple linear regression. It computes the same slope as linregress() but does not give you a p-value. For operational work, use DuckDB for the data preparation (resampling, gap filling, window functions) and scipy for the statistical test.

HISTORY: Linear regression was invented by Francis Galton in the 1880s while studying the heights of parents and children. He noticed that tall parents tended to have shorter children (and vice versa) - a phenomenon he called “regression towards mediocrity.” The name stuck, even though modern regression has nothing to do with mediocrity. The technique became the foundation of statistical modeling because it decomposes observed data into a systematic component (the line) and a random component (the residual) - the same decomposition you use in trend detection.

AGENTIC GROUNDING: When an agent reports “token usage increased by 15% this week,” ask for the baseline. A 15% increase from a noisy baseline with no significant trend is meaningless. A 15% increase with a slope p-value of 0.002 is a real signal. The p-value from linregress() is the tool that separates “it went up this particular week” from “it is consistently going up.”

4. Seasonality

Estimated time: 30 minutes

Seasonality is a repeating pattern at a known frequency. Agent systems often show weekly patterns: more activity on weekdays, less on weekends. Cost data shows monthly billing cycles. CI/CD systems show deploy-day spikes.

The question is not “does this data have seasonality?” (visual inspection answers that). The question is: “after removing the seasonal pattern, what is the real trend?” Seasonal decomposition separates the observed time series into three components:

Trend - the long-term direction (up, down, flat)
Seasonal - the repeating pattern (weekday/weekend cycle, monthly cycle)
Residual - everything left over (noise, anomalies, one-off events)

Decomposition with statsmodels

from statsmodels.tsa.seasonal import seasonal_decompose

# Prepare: need a Series with DatetimeIndex
daily_counts = catches.groupby("date").size()
daily_counts.index = pd.to_datetime(daily_counts.index)
daily_counts = daily_counts.sort_index()

# Fill gaps - decomposition requires no missing values
full_range = pd.date_range(
  start=daily_counts.index.min(),
  end=daily_counts.index.max(),
  freq="D"
)
daily_counts = daily_counts.reindex(full_range, fill_value=0)

result = seasonal_decompose(
  daily_counts,
  model="additive",
  period=7,
  extrapolate_trend="freq"
)

The three critical parameters:

model - "additive" or "multiplicative". Use additive when the seasonal variation is roughly constant regardless of the trend level (the weekday/weekend difference is always about 3 events). Use multiplicative when the seasonal variation scales with the trend (the weekday/weekend difference is always about 20% of the current level). When in doubt, start with additive.
period - the number of observations in one complete seasonal cycle. For daily data with weekly seasonality: period=7. For hourly data with daily patterns: period=24. For monthly data with yearly patterns: period=12. Getting this wrong produces nonsensical decomposition.
extrapolate_trend="freq" - by default, the trend component has NaN at the edges because the centered moving average cannot be computed at the boundaries. This parameter fills those NaN values by extrapolation.

Accessing the Components

The result has four attributes, each a pandas Series:

trend = result.trend       # the smoothed long-term direction
seasonal = result.seasonal # one cycle repeated across the full series
residual = result.resid    # what is left after removing trend + seasonal
observed = result.observed # the original data

For additive decomposition: observed = trend + seasonal + residual. You can verify this:

reconstructed = result.trend + result.seasonal + result.resid
assert (reconstructed - result.observed).abs().max() < 1e-10

Visualizing the Decomposition

The result object has a built-in plot method:

fig = result.plot()
fig.set_size_inches(12, 8)
fig.tight_layout()
fig.savefig("decomposition.png", dpi=150)

This produces a 4-panel figure: observed, trend, seasonal, residual. The trend panel answers “is the underlying level changing?” The seasonal panel answers “what does the repeating pattern look like?” The residual panel answers “which data points are not explained by trend or seasonality?”

Seasonality in DuckDB

DuckDB does not have a built-in decomposition function, but you can compute the seasonal averages manually:

WITH daily AS (
  SELECT
    date::DATE AS date,
    COUNT(*) AS event_count,
    EXTRACT(dow FROM date::DATE) AS day_of_week
  FROM read_csv_auto('docs/internal/weaver/catch-log.tsv', delim='\t')
  GROUP BY 1
)
SELECT
  day_of_week,
  AVG(event_count) AS avg_events,
  COUNT(*) AS n_weeks
FROM daily
GROUP BY day_of_week
ORDER BY day_of_week

This tells you the average event count per day of the week - a simple seasonal profile. For the full trend-seasonal-residual decomposition, use Python.

AGENTIC GROUNDING: Seasonal decomposition is particularly useful for evaluating agent process changes. If you changed the darkcat review instructions on a Wednesday and catch rates went up on Thursday, did the instructions work, or is Thursday always a high-catch day? Decompose the time series, then look at the residual. If Thursday’s residual is unusually high (not just its seasonal component), the change had an effect beyond the normal weekly pattern.

5. Anomaly Detection

Estimated time: 35 minutes

An anomaly is a data point that deviates significantly from the expected pattern. In operational analytics, anomalies answer: “is this spike a problem?” Token usage doubled today. Cost tripled this week. The catch rate dropped to zero. Are these anomalies that warrant investigation, or normal variation?

Two methods cover the vast majority of practical cases: the z-score method and the IQR method. Both build directly on descriptive statistics from Step 2. Neither requires machine learning.

Z-Score Method

The z-score measures how many standard deviations a value is from the mean. For time series, use a rolling mean and rolling standard deviation instead of the global mean and std:

# Compute rolling statistics
window = 7
df["rolling_mean"] = df["event_count"].rolling(window, min_periods=1).mean()
df["rolling_std"] = df["event_count"].rolling(window, min_periods=1).std()

# Compute z-score
df["z_score"] = (df["event_count"] - df["rolling_mean"]) / df["rolling_std"]

# Flag anomalies: |z| > 2 is unusual, |z| > 3 is extreme
df["is_anomaly"] = df["z_score"].abs() > 2

Why rolling statistics instead of global? Because the mean and standard deviation of a time series change over time. If token usage trends upward, a value that was anomalously high in January might be normal in March. The rolling window adapts the baseline.

The threshold is a design choice:

Threshold	Meaning	False positive rate (normal data)
	z	> 2
	z	> 2.5
	z	> 3

Choose based on the cost of false positives vs missed anomalies. For cost monitoring, |z| > 2 is aggressive but catches problems early. For error alerting, |z| > 3 reduces noise.

Z-Score Method in DuckDB

WITH stats AS (
  SELECT
    date,
    event_count,
    AVG(event_count) OVER (
      ORDER BY date
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_mean,
    STDDEV_SAMP(event_count) OVER (
      ORDER BY date
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_std
  FROM daily_events
)
SELECT
  date,
  event_count,
  rolling_mean,
  rolling_std,
  (event_count - rolling_mean) / NULLIF(rolling_std, 0) AS z_score,
  ABS((event_count - rolling_mean) / NULLIF(rolling_std, 0)) > 2 AS is_anomaly
FROM stats

Note the NULLIF(rolling_std, 0) guard. If all values in the window are identical, the standard deviation is zero, and division by zero produces an error (DuckDB) or infinity (pandas). Guarding against zero std is essential.

IQR Method

The Interquartile Range (IQR) method is more robust to outliers than the z-score method. It does not assume the data is normally distributed:

# Compute rolling IQR
q1 = df["event_count"].rolling(window, min_periods=1).quantile(0.25)
q3 = df["event_count"].rolling(window, min_periods=1).quantile(0.75)
iqr = q3 - q1

# Define fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag anomalies
df["iqr_anomaly"] = (
  (df["event_count"] < lower_fence) |
  (df["event_count"] > upper_fence)
)

The 1.5 multiplier is the standard Tukey fence. For more aggressive detection, use 1.0. For fewer false positives, use 2.0 or 3.0 (which matches the “extreme outlier” convention).

Z-Score vs IQR: When to Use Each

Property	Z-Score	IQR
Assumption	Approximately normal	None
Sensitivity to outliers	High (outliers inflate std)	Low (quartiles resist outliers)
Best for	Roughly symmetric data	Skewed data, heavy tails
Simplicity	Simpler formula	Slightly more computation

For most operational metrics (token counts, error rates, latencies), the IQR method is safer because these distributions tend to be right-skewed with occasional spikes. The z-score method is easier to explain and compute.

Practical Example: Catch-Log Anomaly Detection

import pandas as pd

catches = pd.read_csv(
  "docs/internal/weaver/catch-log.tsv", sep="\t", parse_dates=["date"]
)
daily = catches.groupby("date").size().reset_index(name="count")
daily = daily.sort_values("date")

# With only 2 dates, rolling statistics are not meaningful.
# This illustrates the minimum data requirement.
# In practice, you need at least 2-3x the window size in data points.
printf(f"Data points: {len(daily)}")
printf(f"Minimum for 7-day rolling: ~14-21 days")

AGENTIC GROUNDING: When an agent system flags its own anomalies - “unusual token usage detected” - verify the detection method. A common failure mode is computing z-scores against a global mean rather than a rolling mean. If usage has been trending up for weeks, today’s “anomalous” value might just be the continuation of a trend, not a spike. Rolling baselines prevent this conflation. A second failure mode: the agent uses a 1-day window for the rolling std, which makes every day that differs from yesterday an “anomaly.”

6. Change Point Detection

Estimated time: 40 minutes

Anomaly detection finds individual data points that deviate from the expected pattern. Change point detection finds the moment when the pattern itself changed. These are different questions:

Anomaly: “Tuesday’s token usage was unusually high.”
Change point: “Starting around March 5th, the average token usage shifted from 50k to 80k per day.”

The first is a spike. The second is a regime change. For operational analytics, change points answer questions like: “Did the prompt update actually change the error rate?” “When did the new model’s behavior stabilize?” “Did the darkcat alley process improve the catch rate?”

The CUSUM Algorithm

CUSUM (Cumulative Sum) is the simplest and most widely used change point detection method. The idea: compute the cumulative sum of deviations from the mean. When the cumulative sum trends sharply away from zero, something changed.

Here is the intuition. If your data fluctuates around a stable mean, the deviations (value - mean) will be roughly balanced: some positive, some negative, and their cumulative sum will stay near zero. If the mean shifts upward at some point, the deviations after that point will be consistently positive, and the cumulative sum will start climbing. The change point is where the cumulative sum begins its climb.

CUSUM Implementation

The complete algorithm in Python:

import numpy as np

def cusum(data, threshold=None):
  """Detect change points using the CUSUM algorithm.

  Args:
    data: 1D array of numeric values
    threshold: detection threshold (default: 2 * std of data)

  Returns:
    List of indices where change points were detected
  """
  data = np.asarray(data, dtype=float)
  mean = data.mean()
  std = data.std(ddof=1)
  if threshold is None:
    threshold = 2 * std

  cumsum = np.cumsum(data - mean)

  # Find where the cumulative sum deviates beyond the threshold
  change_points = []
  for i in range(1, len(cumsum)):
    if abs(cumsum[i] - cumsum[i - 1]) > threshold:
      change_points.append(i)

  return cumsum, change_points

This is a simplified version. The core operation is one line: cumsum = np.cumsum(data - mean). Everything else is bookkeeping.

Let’s trace through an example to make it concrete. Suppose you have 10 days of event counts:

import numpy as np

# Days 1-5: baseline (~4 events/day)
# Days 6-10: elevated (~8 events/day)
data = np.array([3, 5, 4, 4, 3, 8, 9, 7, 8, 9])

mean = data.mean()  # 6.0
deviations = data - mean
# [-3, -1, -2, -2, -3, 2, 3, 1, 2, 3]

cumsum = np.cumsum(deviations)
# [-3, -4, -6, -8, -11, -9, -6, -5, -3, 0]

The cumulative sum drops steadily for the first 5 days (values below the mean) and climbs back for the last 5 days (values above the mean). The minimum of the cumulative sum at index 4 (day 5) marks the change point - the moment when the data shifted from below-mean to above-mean behavior.

A more robust approach uses the maximum deviation of the CUSUM from its running minimum (or maximum) to detect the actual point of change:

def cusum_detect(data, drift=0):
  """CUSUM change point detection with directional tracking.

  Args:
    data: 1D array of numeric values
    drift: allowable drift before signaling (default: 0)

  Returns:
    cumsum: the raw cumulative sum of deviations
    change_idx: index of the most likely change point
  """
  data = np.asarray(data, dtype=float)
  mean = data.mean()
  cumsum = np.cumsum(data - mean - drift)

  # The change point is at the extremum of the cumulative sum
  min_idx = np.argmin(cumsum)
  max_idx = np.argmax(cumsum)

  # The extremum with the largest absolute value indicates
  # the direction and location of the change
  if abs(cumsum[min_idx]) > abs(cumsum[max_idx]):
    change_idx = min_idx
  else:
    change_idx = max_idx

  return cumsum, change_idx

Applying CUSUM to Project Data

Consider this scenario: the control firing rate changed after updating the darkcat review instructions. We have the catch-log with dates. Did the firing rate actually change?

import pandas as pd
import numpy as np

catches = pd.read_csv(
  "docs/internal/weaver/catch-log.tsv", sep="\t", parse_dates=["date"]
)

# Compute daily firing counts
daily = catches.groupby("date").size().reset_index(name="count")
daily = daily.sort_values("date")

# Simulate extended data to illustrate CUSUM
# (The actual catch-log has only 2 dates)
np.random.seed(42)
baseline = np.random.poisson(lam=3, size=14)    # 2 weeks at ~3/day
elevated = np.random.poisson(lam=6, size=14)    # 2 weeks at ~6/day
simulated = np.concatenate([baseline, elevated])

dates = pd.date_range("2026-02-20", periods=28, freq="D")
sim_df = pd.DataFrame({"date": dates, "count": simulated})

# Apply CUSUM
cumsum, change_idx = cusum_detect(sim_df["count"].values)
printf(f"Detected change point at index {change_idx}: "
       f"{sim_df['date'].iloc[change_idx].strftime('%Y-%m-%d')}")

Interpreting CUSUM Results

The CUSUM curve itself is informative:

Flat cumulative sum - no change, data fluctuates around a stable mean
Steadily rising - values are consistently above the mean
Steadily falling - values are consistently below the mean
V-shape or inverted V - a change point exists at the vertex

The sharpness of the V indicates how sudden the change was. A gradual transition produces a smooth curve; an abrupt shift produces a sharp corner.

Limitations

CUSUM assumes a single change point. If there are multiple regime changes, the algorithm finds the dominant one. For multiple change points, you can apply CUSUM recursively to each segment, or use more sophisticated methods (the ruptures Python package, for example). For operational analytics, the single-change-point assumption is usually sufficient because you are typically testing a specific hypothesis: “Did this deployment change the metric?”

CUSUM also uses the global mean as the reference level. If the data has a strong upward trend, CUSUM may flag the trend itself as a “change” rather than finding a specific shift. Detrend the data first (subtract the linear regression line from Section 3) if you need to detect change points in trended data.

HISTORY: CUSUM was developed by E.S. Page in 1954 for industrial quality control - detecting when a manufacturing process shifted out of specification. It was designed for exactly the scenario we face in agent operations: continuous monitoring of a metric with the question “has the process changed?” The algorithm has been used in fields from epidemiology (detecting disease outbreaks) to finance (regime detection) to network security (intrusion detection). Its simplicity - a running sum of deviations - is its strength. It requires no distributional assumptions beyond a stable mean.

AGENTIC GROUNDING: CUSUM answers the operationally critical question: “When did the change take effect?” If you updated an agent’s system prompt on March 5th, but CUSUM detects the change point on March 7th, there is a 2-day lag. This might mean the change took time to propagate, or that the first few interactions under the new prompt happened to look like the old regime. Either way, the deployment timestamp is not the same as the effect timestamp. CUSUM finds the latter.

7. Challenges

Estimated time: 60-90 minutes total

Challenge: Resample and Smooth Event Data

Estimated time: 15 minutes

Goal: Load the events.yaml file, compute daily event counts, apply a moving average, and determine if the smoothing changes your interpretation.

Load docs/internal/events.yaml into a pandas DataFrame with a DatetimeIndex.
Resample to daily counts. Fill zero-event days with 0.
Compute a 3-day simple moving average (the dataset is small, so use a small window).
Compute a 3-day EWMA.
Print all three columns (raw count, SMA, EWMA) side by side.

import pandas as pd
import yaml

with open("docs/internal/events.yaml") as f:
  raw = yaml.safe_load(f)

events = pd.DataFrame(raw["events"])
events["date"] = pd.to_datetime(events["date"])
# Your code here

Verification: The raw daily counts should be: 2026-03-04 has 4 events, 2026-03-05 has 2 events (check against the file). The SMA and EWMA should differ because the EWMA weights recent values more heavily.

Extension: Repeat the exercise in DuckDB. Load the events into a DuckDB table via a pandas DataFrame, compute the daily counts with DATE_TRUNC, and compute the SMA with a window function.

Challenge: Detect a Trend in Simulated Data

Estimated time: 20 minutes

Goal: Generate time series data with a known trend, add noise, and use linregress() to recover the trend.

Generate 60 days of data with a known slope of 0.5 per day and normally distributed noise:

import numpy as np
from scipy.stats import linregress

np.random.seed(42)
n_days = 60
true_slope = 0.5
noise = np.random.normal(0, 3, n_days)
y = true_slope * np.arange(n_days) + 10 + noise

Run linregress(np.arange(n_days), y) and print the estimated slope, p-value, and R-squared.
How close is the estimated slope to the true slope of 0.5?
Increase the noise level from 3 to 10. Repeat. Does the slope estimate change? Does the p-value change?

Verification: With noise=3 and seed=42, the estimated slope should be close to 0.5 (within about 0.1) and the p-value should be well below 0.05. With noise=10, the slope estimate should be similar but the p-value will be higher and the R-squared lower.

What you are learning: Noise does not bias the slope estimate (on average), but it does reduce your ability to detect the trend (higher p-value, lower R-squared). This is why smoothing and longer time windows improve trend detection.

Challenge: Anomaly Detection with Both Methods

Estimated time: 20 minutes

Goal: Apply both z-score and IQR anomaly detection to the same dataset and compare the results.

Generate 90 days of Poisson-distributed event counts with two injected anomalies:

import numpy as np
import pandas as pd

np.random.seed(42)
counts = np.random.poisson(lam=5, size=90)
counts[30] = 25  # inject anomaly at day 30
counts[60] = 0   # inject anomaly at day 60 (unusual low)

dates = pd.date_range("2026-01-01", periods=90, freq="D")
df = pd.DataFrame({"date": dates, "count": counts})
df = df.set_index("date")

Apply the z-score method with a 7-day rolling window and threshold |z| > 2.
Apply the IQR method with a 7-day rolling window and 1.5x fences.
How many anomalies does each method detect? Do both methods catch the injected anomalies at days 30 and 60?

Verification: Both methods should catch day 30 (the spike to 25 is far from the baseline of ~5). Day 60 (a drop to 0) may or may not be flagged depending on the rolling window context. Compare the total number of flagged days between the two methods.

Hints

For the z-score method:

rolling_mean = df["count"].rolling(7, min_periods=1).mean()
rolling_std = df["count"].rolling(7, min_periods=1).std()
z = (df["count"] - rolling_mean) / rolling_std
anomalies_z = df[z.abs() > 2]

For the IQR method:

q1 = df["count"].rolling(7, min_periods=1).quantile(0.25)
q3 = df["count"].rolling(7, min_periods=1).quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
anomalies_iqr = df[(df["count"] < lower) | (df["count"] > upper)]

Challenge: CUSUM Change Point Detection

Estimated time: 20 minutes

Goal: Generate data with a known change point, apply the CUSUM algorithm, and verify it finds the correct location.

Generate 56 days of data: 28 days with a mean of 4, then 28 days with a mean of 8:

import numpy as np

np.random.seed(42)
before = np.random.poisson(lam=4, size=28)
after = np.random.poisson(lam=8, size=28)
data = np.concatenate([before, after])
# The true change point is at index 28

Implement the cusum_detect function from Section 6 (or copy it).
Run it on data and print the detected change point index.
How close is the detected index to the true change point at 28?

Verification: The detected change point should be within a few indices of 28. Plot the cumulative sum if you have completed Step 6 - it should show a clear V-shape with the vertex near index 28.

Extension: Try different mean values for the “after” period. How large must the shift be (relative to the Poisson variance) for CUSUM to reliably detect it? Try shifts of +1, +2, +4, +8 from the baseline mean of 4. At what shift size does the detected change point consistently land within 3 indices of the true location?

AGENTIC GROUNDING: In practice, you rarely know the true change point. You know when you deployed a change, and you want to know when the effect became visible in the data. Running CUSUM on your metrics with the deployment date as a reference lets you measure the deployment-to-effect lag. If the lag is longer than expected, the change may not have propagated correctly, or confounding factors may be masking the effect.

Key Takeaways

Before moving to Step 6, verify you can answer these questions without looking anything up:

What does df.resample("W").sum() require of the DataFrame’s index? What happens if the index is not a DatetimeIndex?
What is the DuckDB equivalent of pandas resample("W")?
What is the difference between rolling(7) (integer window) and rolling("7D") (offset window) when data has gaps?
When would you choose EWMA over SMA? What does the span parameter control?
What does linregress() require as the x argument? Can you pass datetime objects?
What does a low p-value from linregress() tell you? What does R-squared tell you?
In seasonal_decompose(), what happens if you set period incorrectly?
Why use rolling mean/std instead of global mean/std for z-score anomaly detection on a time series?
What is the IQR method’s advantage over the z-score method for skewed data?
What does the CUSUM algorithm compute, and how do you interpret its output?

What to Read Next

Step 6: Visualization for Decision-Making - A trend line is a number. A chart of that trend line is a story. Step 6 covers matplotlib’s object-oriented API for producing functional plots: line charts for trends over time, scatter plots for correlations, histograms for distributions, and multi-panel layouts for decompositions. Every technique in this step - moving averages, trend lines, seasonal decomposition panels, CUSUM curves, anomaly flags overlaid on raw data - becomes more useful when you can see it. Step 6 turns the numbers from Steps 2-5 into visual instruments for operational decision-making.

‹ index ›

Why This Step Exists

Table of Contents

1. Resampling and Frequency Conversion

Resampling in pandas

Resampling in DuckDB

Handling Gaps

2. Moving Averages

Simple Moving Average (SMA)

Exponentially Weighted Moving Average (EWMA)

SMA vs EWMA: When to Use Each

3. Trend Detection

Linear Regression with scipy

Practical Example: Catch-Log Firing Rate

Trend Detection in DuckDB

4. Seasonality

Decomposition with statsmodels

Accessing the Components

Visualizing the Decomposition

Seasonality in DuckDB

5. Anomaly Detection

Z-Score Method

Z-Score Method in DuckDB

IQR Method

Z-Score vs IQR: When to Use Each

Practical Example: Catch-Log Anomaly Detection

6. Change Point Detection

The CUSUM Algorithm

CUSUM Implementation

Applying CUSUM to Project Data

Interpreting CUSUM Results

Limitations

7. Challenges

Challenge: Resample and Smooth Event Data

Challenge: Detect a Trend in Simulated Data

Challenge: Anomaly Detection with Both Methods

Challenge: CUSUM Change Point Detection

Key Takeaways

Recommended Reading

What to Read Next