## The SO1 Supermarket Gym – Data Generating Process & Theoretical Background

*In the first part of this interview with Sebastian, our Chief AI Officer at SO1 (Segment of One), we introduced the Supermarket Gym, a tool to simulate the shopping behaviour of individual customers. A first step in utilizing this tool is to generate data, which this part (edited for clarity and length) focuses on.*

**Sebastian, last time you outlined the key components of the Supermarket Gym. That’s the simulation we use to train the SO1 algorithm, our “feed generator”, so that individual shoppers receive tailored special offers while maximizing redemption and revenue for the retailer. This time, let’s take a deeper dive into what makes this simulation so special.**

OK. To simulate shopping baskets you need a so-called data generating process, or DGP. A DGP is basically a statistical model that allows you to sample data – in our case, shopping baskets. Now, a shopping basket is defined as the set of products that were purchased in the same shopping trip, that is, by the same consumer, at the same time, at the same retailer.

Generating this kind of data is not trivial, because the product relationships are very complex. For example, a product A (like Pepsi) can be a substitute and a complement at the same time – a complement to product B (Pringles), but a substitute to product C (Coca-Cola).

There are different ways to generate this data. One way is to use so-called generative machine learning models: you input tons and tons of basket data, try to learn the underlying multivariate distribution, and then sample shopping baskets from that distribution. However, this does not really solve the “black-box” problem I mentioned in the first part of the interview: although the baskets look real, you do not know what drives customer choices.

For our use cases at SO1, we therefore take a different route with the Supermarket Gym. When you look at quantitative marketing research over the last several decades, you will find that researchers have proposed many models to analyze the different parts of the consumer decision-making process, for example retailer choice, category choice, and product choice. We incorporated these models from econometric research and built a structural approach to this problem, one that combines several structural models. This combination of models is our DGP. Moreover, we truly understand the different components, and we calibrate them based on our expert knowledge and our understanding of the industry. All of this allows us to generate shopping baskets and to connect several baskets into a consumer’s shopping history – as, by definition, consumers are the ones making the decisions.

To summarize this point: there are both unstructured and structural approaches to data generation. We opted for the structural approach because that way we can leverage a lot of institutional knowledge from econometric research and from marketing science. Marketing research has shown over and over again that these models reproduce real data very nicely. So why not make use of all of that knowledge? It allows us to build a DGP that is very transparent and that we can be sure works well.

There’s also a nice side benefit to this approach that is very important for SO1. We simulate individual consumers making decisions, so we actually simulate loyalty card data. Of course we could discard the loyalty card ID, that is, the user identifier in our simulation, and simply generate baskets, which would also work. But since the user ID is in there for free – because we simulate individual retail customers – we make good use of it.

**So the data you generate represents 100% loyalty card penetration, right?**

Exactly, and then we can always remove loyalty card IDs to simulate a case where we only have, let’s say, 40% to 50% penetration, as we observe for Payback or other loyalty programs at retailers.

**You’ve talked a little about the data generating process. Can you also tell us more about the theoretical models that were used in the development of the Supermarket Gym?**

Sure. As I mentioned, this data generating process consists of several models, and I’ll just mention a few. The model that we use to generate what we refer to as *category purchase incidence* is a *multivariate probit model* (Manchanda, Ansari & Gupta 1999. The “Shopping Basket”: A Model for Multicategory Purchase Incidence Decisions. Marketing Science). We use that model to simulate whether customers purchase categories or not, so the output is binary. The important thing here is that the decisions whether to buy categories or not are correlated. Basically, there’s complementarity or co-incidence across categories. A typical example of complementarity: if you buy milk, you might be more likely to buy cookies as well, because we know that milk and cookies go well together.

Or if you buy pasta, you might also buy pasta sauce, because those two go well together, too. So this is the correlation that I was referring to, which is directly modeled by this multivariate probit model. That’s the model that we use for determining or simulating whether consumers buy a category or not in a given week or on a given shopping trip.
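The multivariate probit idea can be sketched in a few lines: each category has a latent purchase utility, the utilities are drawn from a correlated multivariate normal, and a category lands in the basket exactly when its utility is positive. The category names, baseline propensities, and correlation matrix below are made-up illustrations, not SO1's actual calibration.

```python
import numpy as np

def sample_category_incidence(mean_utility, corr, rng):
    """Multivariate probit sketch: draw correlated latent utilities and
    buy a category iff its latent utility is positive (binary output)."""
    latent = rng.multivariate_normal(mean_utility, corr)
    return (latent > 0).astype(int)

# Illustrative calibration: milk and cookies are complements, so their
# latent utilities are positively correlated; detergent is independent.
categories = ["milk", "cookies", "detergent"]
mean_utility = np.array([0.3, -0.2, -1.0])  # baseline purchase propensities
corr = np.array([
    [1.0, 0.6, 0.0],  # milk <-> cookies: positive correlation
    [0.6, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

rng = np.random.default_rng(0)
basket = sample_category_incidence(mean_utility, corr, rng)
print(dict(zip(categories, basket)))
```

Because of the positive latent correlation, simulated baskets containing milk will contain cookies more often than chance alone would predict, which is exactly the co-incidence effect described above.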

In addition to that, we need to model product choice. There we use a different class of models, because the output data has a different shape. It’s not a series of correlated binary purchase events as in the case of category choice; within a category, consumers typically choose one product from all the available products. That sounds a little abstract, but what I mean is that whenever you buy detergent, you typically do not buy five or six different detergents. You usually pick one detergent in the category – for example, you might choose Tide but not Purex. It would be very, very unusual for someone to buy several detergents. This is a different modeling problem, because you have a set of J products – let’s index them 1 to J – and you typically buy one from among this set of J products. So the decision process is not binary anymore, and the model for it is what is called a *multinomial model*.

You pick one out of the J alternatives that are available to you. And there’s a lot of research on that in econometrics, namely on so-called discrete choice models. Discrete choice means there is a choice among alternatives and you pick one out of these alternatives – but you don’t pick many. This is the type of model that we use to model product choice – specifically the *mixed logit model* (Train 2009. Discrete Choice Methods with Simulation. Cambridge University Press) or the *multinomial probit model* (Chintagunta 1992. Estimating a Multinomial Probit Model of Brand Choice Using the Method of Simulated Moments. Marketing Science). An important point here is that product choice is conditioned on category choice. By that I mean that we only sample a product purchase if you purchase the category. Here comes the sequential aspect I mentioned earlier: first, we model whether you buy a category, which is a binary event – either you buy the category, that’s the “1”, or you don’t, that’s the “0”. Then, conditioned on category purchase incidence, that is, having sampled a “1” in our binary model, we decide which product you purchase. So we pick one out of J alternatives, and this is a sequential process.
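A minimal sketch of the mixed logit step: each consumer draws individual taste coefficients from a population distribution, and then one of the J products is chosen by maximizing utility with Gumbel-distributed errors, which is equivalent to sampling from logit choice probabilities. The products, attributes, and parameter values are invented for illustration.

```python
import numpy as np

def mixed_logit_choice(attributes, beta_mean, beta_cov, rng):
    """Mixed logit sketch: draw consumer-specific tastes, then pick one of
    J products by maximizing utility with Gumbel (type-I extreme value)
    errors, which is equivalent to a logit draw."""
    beta = rng.multivariate_normal(beta_mean, beta_cov)      # individual tastes
    utility = attributes @ beta + rng.gumbel(size=len(attributes))
    return int(np.argmax(utility))                           # chosen alternative

# Illustrative detergent category with J = 3 products and two attributes
# (price, brand strength); all numbers are made up.
attributes = np.array([
    [2.9, 1.0],  # Tide
    [2.1, 0.4],  # Purex
    [2.5, 0.7],  # store brand
])
beta_mean = np.array([-1.0, 2.0])  # price-sensitive, brand-loving on average
beta_cov = np.diag([0.2, 0.5])     # taste heterogeneity across consumers

rng = np.random.default_rng(0)
category_purchased = True  # outcome of the category-incidence step
if category_purchased:     # product choice only conditional on incidence
    print("chosen product index:", mixed_logit_choice(attributes, beta_mean, beta_cov, rng))
```

Note the conditioning: the product draw only happens inside the `if`, mirroring the sequential structure where product choice is sampled only when the category-incidence model produced a “1”.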

That’s the core of the simulation. But even before the *category choice* we might simulate which retailer you go to, and after the *product choice* we might sample how many items of that particular product you purchase. So there might be two additional models: *retailer choice* before the *category choice* and *quantity choice* after the *product choice*. But the core that is always present consists of the two models in the middle, namely, category and product choice.
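Putting the pieces in order, one shopping trip can be sketched as a short pipeline: retailer choice, then correlated category incidence, then a product pick within each purchased category, then a quantity draw. Every distribution and number here is a placeholder standing in for the calibrated models described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trip(rng):
    """One shopping trip, sampled as a sequence of choices:
    retailer -> category incidence -> product -> quantity."""
    # 1. Retailer choice: multinomial over two hypothetical retailers.
    retailer = rng.choice(["R1", "R2"], p=[0.7, 0.3])
    # 2. Category incidence: correlated binary purchases (probit-style).
    latent = rng.multivariate_normal([0.3, -0.2], [[1.0, 0.6], [0.6, 1.0]])
    basket = {}
    for cat, u in zip(["milk", "cookies"], latent):
        if u > 0:                    # category purchased
            # 3. Product choice: pick 1 of J products within the category
            #    (placeholder for the logit/probit step).
            product = rng.choice(3)
            # 4. Quantity choice: at least one unit, e.g. shifted Poisson.
            basket[(cat, product)] = 1 + rng.poisson(0.4)
    return retailer, basket

print(simulate_trip(rng))
```

Repeating `simulate_trip` for the same simulated consumer, with persistent individual tastes, is what connects single baskets into a loyalty-card-style shopping history.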

**In the upcoming two parts of this interview we will look at some concrete applications of the Supermarket Gym.**