Siya Gupte
Jun 6, 2022 · 10 min read

Causal Impact Analysis: An understanding of the inner workings to optimize model results

Let’s start with why we need it. If you are a marketer running some combination of a PR event, partnerships, or local offline media, chances are you cannot fully measure the incremental impact of that campaign or activity on sales. Why? Because each of these has one thing in common: we cannot measure the impact at an individual user level, which makes incrementality and/or attribution difficult.

If you want to avoid throwing darts when making decisions around lift from marketing spend, and need a relatively low-lift way to make those decisions, Causal Impact is a nifty tool to have in your back pocket. I have seen it used increasingly over the years to inform budgeting decisions, often paired with other tools such as media mix modeling (MMM).

The purpose of this document is to break down the technical parameters of Causal Impact for those of you trying to implement it, so that you can use it with a more in-depth understanding.

Causal Impact Analysis is a statistical approach used to estimate the incremental impact of an event or designated media intervention in market. More specifically, we build a Bayesian structural time series model (by regressing the test market on a series of comparable control groups (or markets)) and use the model to project (or forecast) the counterfactual, i.e., how the response metric would have evolved after the intervention if the intervention had never occurred.

I. High level Technical Workflow:

  1. Pre-screening step: Match your test markets to a series of control markets (using dynamic time warping*). This step creates a series of candidate markets that match the test market.
  2. Inference step:
  • Fit a Bayesian structural time series model that uses the test markets as dependent variables and the multiple comparable control markets identified in 1) as regressors.
  • Use this model to create a synthetic control series by estimating a counterfactual of what would have happened in the post period in the absence of marketing.
  • Then calculate the difference between the synthetic control and the test market for the post-intervention period — which will give you the estimated impact of the event — and compare to the posterior interval to measure uncertainty (Bayesian conditional probability component)
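In practice, the two steps map onto the MarketMatching and CausalImpact R packages referenced at the end of this post. Below is a minimal sketch; the data frame weekly_sales (columns market, week, sales), the zoo series ci_data, and all market names and dates are hypothetical placeholders.

```r
# Step 1 - pre-screening: DTW-based matching of candidate control markets
library(MarketMatching)
library(CausalImpact)
library(zoo)

mm <- best_matches(data               = weekly_sales,
                   id_variable        = "market",
                   date_variable      = "week",
                   matching_variable  = "sales",
                   matches            = 5,        # number of candidate controls to keep
                   warping_limit      = 1,
                   dtw_emphasis       = 1,        # rely fully on DTW in the pre-screen
                   parallel           = FALSE,
                   start_match_period = "2021-01-04",
                   end_match_period   = "2021-12-27")

# Step 2 - inference: regress the test market on the matched controls with BSTS.
# ci_data is a zoo series whose first column is the test market and whose
# remaining columns are the matched control markets.
pre.period  <- as.Date(c("2021-01-04", "2021-12-27"))
post.period <- as.Date(c("2022-01-03", "2022-03-28"))
impact <- CausalImpact(ci_data, pre.period, post.period)

summary(impact)   # estimated absolute/relative effect with posterior intervals
plot(impact)      # observed vs. counterfactual, pointwise and cumulative effect
```

MarketMatching also ships an inference() wrapper that runs the CausalImpact step directly on the matched-markets object, if you prefer not to assemble the zoo series yourself.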

Benefits over the traditional difference-in-differences approach, i.e., (Test_post − Control_post) − (Test_pre − Control_pre); a quick numeric example follows the list below:

  • Captures unknown evolution of control markets that are not explained by known events
  • Captures the uncertainty of the relationship between test and control markets (using Bayesian priors). This reduces reliance on historical fit by estimating an uncertainty distribution around the prediction
  • The spike-and-slab priors in BSTS help to minimize overfitting by limiting the number of control markets that enter the model
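For contrast, the classic difference-in-differences estimate above boils down to arithmetic on four aggregates (the numbers below are purely illustrative):

```r
# Classic diff-in-diff on illustrative pre/post averages
test_pre     <- 100; test_post    <- 130
control_pre  <-  90; control_post <- 105
did_lift <- (test_post - control_post) - (test_pre - control_pre)   # 25 - 10 = 15
```

Causal Impact instead forecasts an entire counterfactual series for the post-period and reports a posterior interval around the effect, rather than relying on a single pre/post average per group.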

II. Understanding some key concepts:

  1. Control Groups vs. Test Group
  • The control groups are markets where the event (e.g., the marketing campaign) did not run, so we do not expect to see any impact from the marketing activity
  • The Test group is the market where we had the media event (TV etc.) and expect some degree of impact on consumer behavior (new services, incremental services etc.)

2. Market matching: Helps us to identify a series of control markets that best predict the test market in the pre-intervention period. This can be done via several methods:

  • Euclidean distance: the ordinary straight-line distance between two points. Limitation: this approach implicitly over-penalizes cases where the relationship between two markets is shifted in time.
  • Correlation analysis: measures the linear relationship between two variables. Limitation: it does not factor in the size of the markets.
  • Dynamic time warping (DTW): a more flexible approach that finds a one-to-many mapping along a warping curve, rather than using the raw data directly, where the warping curve represents the best alignment between the two time series (subject to user-defined constraints). Euclidean distance is a special case of DTW. A toy comparison follows below.
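Here is a toy comparison of the two distance notions, using the dtw package referenced at the end of this post (the two series are illustrative and deliberately shifted by one period):

```r
library(dtw)

test_mkt    <- c(10, 12, 15, 14, 13, 16, 18)
control_mkt <- c(11, 10, 12, 15, 14, 13, 16)     # same shape, shifted by one period

euclid <- sqrt(sum((test_mkt - control_mkt)^2))  # straight-line distance over-penalizes the shift

alignment <- dtw(test_mkt, control_mkt,
                 window.type = "sakoechiba",     # user-defined constraint on the warping band
                 window.size = 1)
alignment$distance                               # distance along the optimal warping curve
cbind(alignment$index1, alignment$index2)        # the one-to-many point mapping
```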

3. Spike-and-slab priors: Allow the model to reduce the number of control markets selected (thereby minimizing multicollinearity), while placing a prior distribution on the coefficients of the selected markets, which incorporates uncertainty.

  • Spike: determines a given market’s probability of having a non-zero coefficient, i.e., of being selected into the model. It is a set of independent Bernoulli distributions, where the probability of selection can be set based on the expected model size
  • Slab: a Gaussian prior on the coefficient which shrinks it towards a certain value (typically zero)

Note: while the Causal Impact package can utilize spike and slab priors to select the most predictive markets, there is a handy addition introduced by Kim Larsen on his MarketMatching R package page. This allows you to wrap the control group picking logic with an additional pre-process step, to reduce and control the number of candidate markets upfront.
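If you want to make the spike-and-slab machinery explicit, you can build the prior yourself with the bsts package (which CausalImpact uses under the hood). This is only a sketch: pre_df is a hypothetical data frame holding the pre-period test-market series in column sales plus the candidate control markets.

```r
library(bsts)

ss <- AddLocalLevel(list(), pre_df$sales)                    # random-walk baseline term

prior <- SpikeSlabPrior(x = model.matrix(sales ~ ., data = pre_df),
                        y = pre_df$sales,
                        expected.model.size = 3)             # spike: expect ~3 controls to be selected

fit <- bsts(sales ~ ., state.specification = ss, data = pre_df,
            prior = prior, niter = 1000)

plot(fit, "coefficients")   # posterior inclusion probabilities for each control market
```

Raising or lowering expected.model.size is the main lever for how aggressively the spike prunes candidate markets.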

4. Baseline: What is a synthetic time series baseline?

A series of values we would have expected without the impacting (marketing) event. To predict these values, this algorithm builds the model based on the actual data of the Control groups for the post-event time period and predicts the baseline values. This is unlike other typical prediction/forecasting algorithms, which would build predictive models based on the past data of the Test group itself. This means that the Control groups need to be somewhat similar or correlated to the Test group in order for the algorithm to predict the baseline values for the Test group in a reliable way.

Implication: The local level baseline* term is estimated

  • Using actuals of control markets in the post-period to predict the baseline impact.
  • Based on a random walk that is specified for the model.
  • The cumulative standard error (SE) will increase over time, making the credible intervals somewhat wider, and the effect size needed to overcome them somewhat larger as well (illustrated in the simulation sketch below).
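A quick simulation illustrates that last point: under a random-walk baseline, the forecast interval keeps widening the further you move into the post-period (the sigma value below is arbitrary and purely illustrative).

```r
# Simulate many random-walk baseline paths and track the width of the 90%
# interval at each forecast horizon
set.seed(42)
sigma_eta <- 0.05
horizon   <- 26                                  # weeks into the post-period

paths <- replicate(2000, cumsum(rnorm(horizon, mean = 0, sd = sigma_eta)))
interval_width <- apply(paths, 1, function(x) diff(quantile(x, c(0.05, 0.95))))

plot(interval_width, type = "l",
     xlab = "Weeks after intervention", ylab = "Width of 90% interval")
```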

III. Optimizing your parameters:

While the Causal Impact package can be run with its default parameters, it helps to understand your parameter options in order to improve model fit (as measured by MAPE, the Mean Absolute Percent Error) in the pre-period while balancing the standard error, allowing for more accurate estimates in the post-intervention period. The parameters can be split into pre-modeling (market matching) and modeling parameters.

A. Market Matching (Pre-screening): Parameters

Allows you to find the number and list of control markets that minimize historical MAPE. A parameter-sweep sketch follows at the end of this subsection.

  1. Number of control matched markets: This parameter specifies the number of markets to match the test market to:
  • 3
  • 6
  • 8
  • 10

2. Warping limit: match window parameter, which specifies the maximum number of data points that each week for a given test market can match to:

  • 1: (1 data point can be matched to at most 2 data points)
  • 2: (1 data point can be matched to at most 4 data points)

3. Parallel: Specifies whether we want simultaneous or sequential matching of markets

  • true (simultaneous matching of markets) vs.
  • false (sequential matching of markets)

4. Dynamic time warping (DTW) emphasis: Specifies whether the matched-market pre-screen should rely fully on DTW or also on correlation

  • 1= algorithm relies on 100% DTW to match markets pre-screening, or
  • 0.5 = equal weighting to DTW and correlation
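One way to explore these four knobs is a simple grid over the values listed above; a hedged sketch, reusing the hypothetical weekly_sales data frame from the earlier workflow example:

```r
library(MarketMatching)

grid <- expand.grid(matches       = c(3, 6, 8, 10),
                    warping_limit = c(1, 2),
                    dtw_emphasis  = c(1, 0.5))

candidate_screens <- lapply(seq_len(nrow(grid)), function(i) {
  best_matches(data               = weekly_sales,
               id_variable        = "market",
               date_variable      = "week",
               matching_variable  = "sales",
               matches            = grid$matches[i],
               warping_limit      = grid$warping_limit[i],
               dtw_emphasis       = grid$dtw_emphasis[i],
               parallel           = TRUE,
               start_match_period = "2021-01-04",
               end_match_period   = "2021-12-27")
})
# Each screen is then pushed through the BSTS/CausalImpact step and scored on
# pre-period MAPE, standard error, and Durbin-Watson (see Section IV)
```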

B. BSTS (Model estimation/inference): Equation and Parameters:

Here’s a bit of technical detail on the Bayesian structural time series equation and parameters that can be optimized.

  1. Equation:
  • y_t (test market sales) = μ_t + β·x_t + ε_t, where x_t denotes the sales of control markets 1…n
  • μ_{t+1} = μ_t + η_t
  • In words: sales (y) at time t are a function of a time-varying, stochastic trend term μ_t, which allows the baseline to follow a random walk, plus a regression component on the control-market sales and an error term

2. Dynamic baseline: ‘local level’ means local in time, i.e., locality on the time axis is a small interval of time. Over any such small interval, the trend term is approximately flat (it moves very slowly)

a. Local level term (default setting): the level follows a random walk, with an error term centered around 0

  • μ_{t+1} = μ_t + η_t

b. Local linear trend term (stochastic trend term with a growth component): if the time series increases linearly with time, add a slope term δ (the linear trend) plus error

  • μ_{t+1} = μ_t + δ_t + η_{0,t}
  • Because the level follows a random walk, the forecast variance grows without bound: μ_t ∼ N(μ_0, t·σ²_η) as t → ∞

c. Semilocal linear trend term: replaces the random walk with a stationary AR process

  • An AR(1) level: μ_{t+1} = ρ·μ_t + η_t
  • η_t ∼ N(0, σ²_η) and |ρ| < 1
  • This model has stationary distribution μ_∞ ∼ N(0, σ²_η / (1 − ρ²)), which means that uncertainty grows to a finite asymptote, rather than to infinity, in the distant future.
  • The semilocal linear trend adds a mean-reverting slope: μ_{t+1} = μ_t + δ_t + η_{0,t} and δ_{t+1} = D + ρ·(δ_t − D) + η_{1,t}
  • The parameter D is the long-run slope of the trend component, to which δ_t will eventually revert.
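In the bsts package these three trend choices correspond to different state components; a minimal sketch, again assuming a hypothetical pre_df with the test-market series in column sales plus the matched controls:

```r
library(bsts)

y <- pre_df$sales

ss_level  <- AddLocalLevel(list(), y)              # (a) random-walk level
ss_linear <- AddLocalLinearTrend(list(), y)        # (b) random walk plus a stochastic slope
ss_semi   <- AddSemilocalLinearTrend(list(), y)    # (c) slope reverts to a long-run value D via an AR(1)

fit <- bsts(sales ~ ., state.specification = ss_semi,
            data = pre_df, niter = 1000)
```

Note that CausalImpact’s default model uses the local level term; swapping in another trend requires fitting the bsts model yourself and passing it to CausalImpact through its custom-model (bsts.model) argument.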

3. Local level prior standard deviation: specifies the standard deviation of the random walk, i.e., how wiggly the baseline term is allowed to be. If it allows too much variation, the model will overfit historically and inflate SEs in the future

  • Test a range from 0.001 to 0.1

4. Number of MCMC iterations: the number of MCMC draws used to fit the model on the pre-period; more draws give more stable estimates

  • e.g., niter = 1000 (specify the number of iterations)

5. Seasonality component: the number of seasons we want the model to estimate

  • monthly data (nseasons = 12)
  • weekly data (nseasons = 52)
  • Note: the season duration depends on whether the data are daily or weekly
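Parameters 3–5 map directly onto CausalImpact’s model.args; a sketch, reusing the hypothetical ci_data, pre.period, and post.period objects from the workflow example:

```r
library(CausalImpact)

impact <- CausalImpact(ci_data, pre.period, post.period,
                       model.args = list(
                         prior.level.sd  = 0.01,   # local level sd; try values between 0.001 and 0.1
                         niter           = 1000,   # number of MCMC iterations
                         nseasons        = 52,     # e.g., weekly data with yearly seasonality
                         season.duration = 1))     # number of data points per season
summary(impact)
```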

IV. Model evaluation criteria:

You can run iterations of the model algorithm to optimize parameters across the 1) Market Match (pre-screening) step and the 2) BSTS Modeling estimation step. The objective is to evaluate a series of model parameters that will:

  1. Minimize model error, or Mean Absolute Percent Error (MAPE), without overfitting: the average of weekly |Actual − Predicted| / Actual
  2. Minimize serial correlation in the time series residuals: a characteristic of time series data is that the residual error at time t is correlated with the residual error at time t−1. This needs to be controlled for (and is handled by the dynamic baseline* described above)

Durbin-Watson: a test statistic used to measure autocorrelation in the residuals at lag 1; d ≈ 2(1 − r), where r is the lag-1 autocorrelation between residuals (a computation sketch follows at the end of this section)

  • Durbin Watson = 2 (no autocorrelation)
  • Durbin Watson > 2 (negative autocorrelation, successive error terms are negatively correlated)
  • Durbin Watson < 2 (positive autocorrelation, successive error terms are positively correlated)

3. Balance the tradeoff: aim for the lowest possible standard error while keeping historical error (MAPE) low and the Durbin-Watson statistic close to 2.

4. Provide stability and reasonable coefficients for each of the matched markets
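Both fit diagnostics can be computed directly from the pre-period residuals of a fitted CausalImpact object; a sketch reusing impact and pre.period from the earlier examples (the column names below follow the package’s output series, but verify them against your version):

```r
library(zoo)

pre_rows  <- index(impact$series) <= pre.period[2]
actual    <- as.numeric(impact$series$response[pre_rows])
predicted <- as.numeric(impact$series$point.pred[pre_rows])

# 1) Mean Absolute Percent Error over the pre-period
mape <- mean(abs(actual - predicted) / actual)

# 2) Durbin-Watson statistic on the pre-period residuals (~2 means no lag-1 autocorrelation)
res <- actual - predicted
dw  <- sum(diff(res)^2) / sum(res^2)
```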

V. Stability/Stress Testing the Model

In order to stress test the model, and ensure stability of the coefficients, you can run several variations of your model, for instance:

  1. Data frequency: Ex. daily vs. weekly data
  2. Outcome data: Test a few key performance indicators or outcome metrics
  3. Time-shift the data: to change pre/post period (make sure training period is continuous/without gaps)
  4. Type I error: This checks whether we would reject a true null hypothesis, i.e., falsely infer the existence of an effect that is not there. Run iterations of placebo interventions (i.e., artificially change the outcome in the post-period) to see whether, and at what size, the model detects an impact (see the sketch after this list). Vary:
  • Outcome effect sizes: 1%–10%
  • Intervention timeframes: 1 month, 3 months, 6 months, 9 months
  • Confidence intervals: test 80%, 90%, 95%, etc., to understand, for instance, the lift we would need to see at 80% confidence

5. Type II error: failing to reject a false null hypothesis, i.e., falsely inferring the absence of an effect that is there

6. More than one test market: test to see if the same optimal parameter settings hold for each test market.

7. Business reasoning: Ensure that the list of matched markets makes sense from a business standpoint.
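A hedged sketch of the placebo check from item 4, reusing the hypothetical ci_data, pre.period, and post.period objects: inject artificial lifts into the post-period of the test market (the first column) and record which ones the model flags at a given confidence level.

```r
library(CausalImpact)
library(zoo)

effect_sizes <- c(0.01, 0.03, 0.05, 0.10)

detected <- sapply(effect_sizes, function(lift) {
  fake <- ci_data
  post_rows <- index(fake) >= post.period[1]
  fake[post_rows, 1] <- fake[post_rows, 1] * (1 + lift)   # artificial lift on the test market
  imp <- CausalImpact(fake, pre.period, post.period,
                      alpha = 0.1)                        # 90% credible interval
  imp$summary$AbsEffect.lower[1] > 0                      # TRUE if the lift is detected
})

data.frame(effect_sizes, detected)   # the smallest detected lift approximates the MDE
```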

Together, these checks will allow you to create:

  1. A final list of parameters (market matching step + BSTS model)
  2. A final list of control markets
  3. The minimum detectable effect needed at various confidence levels, plus the estimated size of the business impact
  4. An impact analysis: building out a table of spend, duration, cost per incremental outcome, number of incremental outcomes, and lift % will allow you to measure the directional ROI of the marketing impact (a small example follows below)
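A tiny illustration of that impact table (every number below is made up; the incremental outcomes and lift would come from the model’s cumulative estimates):

```r
impact_table <- data.frame(
  channel              = "Local TV test",
  spend                = 250000,    # campaign spend in the test market
  duration_weeks       = 12,
  incremental_outcomes = 5400,      # cumulative absolute effect from the model
  lift_pct             = 4.2        # relative effect (%) from the model
)
impact_table$cost_per_incremental <- impact_table$spend / impact_table$incremental_outcomes
impact_table
```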

VI. Methodology Caveats:

The methodology has a fundamental underlying assumption that both test and control markets have the same ‘opportunity to see’ other impacting events (paid acquisition media, coupon changes, etc.). This is a MAJOR (and often unrealistic) assumption, since we cannot control media interventions, pricing, or coupon changes made by other parts of the organization. If the intervention timeframe is long (e.g., 6–9 months), the possibility of differential impacting events across markets is high, which will increase model error and reduce the ability to detect a significant lift.

Using the back-of-the-envelope calculation above to measure estimated lift, this approach can be used to measure other offline channels (where user-level data is not available) as long as:

  1. There is sufficient spending with the marketing intervention in the test market, to overcome error from model estimation (uncertainty distribution or credible interval).
  • For this, start by calculating the Minimum Detectable Effect (MDE) as well as estimated impact for a given spend level
  • Minimize other activity going on in test/control markets i.e. minimize differences in ‘opportunities to see’ for other marketing events in test/control markets to the extent possible

2. You ensure that the time series data quality for test/control markets is good. Establishing a stable/good set of matched markets that correlate with the test market is important for the model to predict the baseline values (and hence calculation of lift) in a reliable way.

Putting it all together

Causal Impact, when used with understanding, is a great, relatively low-lift way to measure the impact of a marketing intervention or campaign where individual-level A/B testing or measurement isn’t possible.

With a bit more understanding of the technical nuances, you can optimize your results, balancing the tradeoff between historical model fit (MAPE) and standard error, while reducing multicollinearity and keeping the Durbin-Watson statistic near 2.

*****************************************************************

References

[1] Fitting Bayesian Structural Time Series with the BSTS R package, Steven Scott, https://www.unofficialgoogledatascience.com/2017/07/fitting-bayesian-structural-time-series.html

[2] Making Causal Impact Analysis easy, https://multithreaded.stitchfix.com/blog/2016/01/13/market-watch/

Relevant R packages:

[1] CausalImpact version 1.0.3, Brodersen et al., Annals of Applied Statistics (2015) https://github.com/google/CausalImpact

[2] Market matching package, Kim Larsen https://github.com/klarsen1/MarketMatching

[3] Vignette for the dtw package: https://cran.r-project.org/web/packages/dtw/vignettes/dtw.pdf

*****************************************************************


Fractal is one of the most prominent players in the Artificial Intelligence space. Fractal’s mission is to power every human decision in the enterprise, bringing AI, engineering, and design to the world’s most admired Fortune 500 companies. Fractal has consistently been rated one of India’s best companies to work for by The Great Place to Work® Institute, featured as a leader in the Specialized Insights Service Providers Wave™ 2020 and the Customer Analytics Service Providers Wave™ 2019 by Forrester Research, and recognized as an “Honorable Vendor” in the 2020 magic quadrant for data & analytics by Gartner. For more information visit fractal.ai.


Credits: Bella Wang