Sales Forecasting for Retailers: Models, Mistakes, and How to Get It Right

In retail, sales forecasting uses algorithms to predict a store’s future performance based on the current performance of similar stores. This empowers retailers to make intelligent real estate decisions based on which locations and trade areas represent the best financial opportunities. Of course, a sales forecast’s accuracy depends on the reliability of your data and your ability to incorporate the variables that actually impact performance.

An inaccurate sales forecast can undermine your whole location strategy and lead you to prioritize less-effective (or even unprofitable) sites. Unfortunately, many retailers make multimillion-dollar decisions based on forecasts with uncomfortably high margins of error.

Producing accurate sales forecasts requires moving beyond the criteria you use for trade area analysis to include regional variables like weather, crime rate, and annual events that can change demand and traffic flow throughout the year. Decades ago, a sales forecast may have incorporated a handful of variables that captured the major drivers of performance. Today’s process can take thousands of variables into consideration and whittle them down to dozens that capture both the major drivers and the nuanced factors that help forecast extremes and special cases in a way that was not possible just 10 years ago.

Retailers may use a wide variety of sales forecasting models based on their available data and location characteristics. Given the volume of data and the number of variables that affect a store’s sales, most retailers turn to predictive analytics software with advanced machine learning and artificial intelligence capabilities to generate their sales forecasts.

To build sales forecasts you can trust, you need to customize the process around your specific business, rule out the factors that have no impact, and include every variable that affects a store’s bottom line. But that can be difficult to pull off—especially if your business lacks the tools or expertise.

In this article, we’ll walk you through some common sales forecast methods retailers use, mistakes to avoid, and what it takes to get this critical process right.

Sales forecast methods for retailers

Traditionally, sales forecasting started with a formula, often a linear regression model, and analysts would simply plug key variables into it. This is still how some predictive analytics software works today, and it can be somewhat effective. But the best sales forecasting begins with identifying the variables first, then selecting the machine learning algorithm, or combination of algorithms, that best suits the client’s data. In fact, by applying machine learning and artificial intelligence to their datasets, retailers can efficiently run their data through multiple layers of predictive models to ensure they account for all the variables that matter and weed out the ones that don’t.

The goal of this process is to train machine learning algorithms to understand the relationships between variables across your portfolio: for example, how the populations of various target demographics, the proximity and number of nearby competitors, and the vehicle and/or pedestrian traffic around the store correlate with actual sales. By analyzing these variables across your portfolio and comparing sites with matching characteristics, you can use the models to predict future sales for potential locations. You can also identify underperformers in the current store network, determine which have upside potential, and decide which would benefit from operational or capital investments.

Here are some of the most common models used in retail sales forecasting. The best predictive analytics tools will utilize these and more, combining them into ensemble models that forecast volumes far better than any single model, whether those volumes are sales, counts, gallons, service hours, gym members, e-com deliveries, or something else.

Linear regression model

Linear regression is one of the most commonly used predictive models. In retail predictive analytics, linear regression may use an outcome such as raw sales figures or EBITDA (earnings before interest, tax, depreciation, and amortization) as the dependent variable, and any number of site or trade area characteristics as independent variables.

This model establishes a best-fit linear relationship between the dependent and independent variables. Linear regression can consider many independent variables, but by design, it’s best at identifying broad relationships, so it loses some of the nuances that can yield more accurate forecasts. Additionally, you’ll always have outliers: stores whose residuals (the gaps between actual and predicted results) show they don’t fit the model well. You may need to dig into these more and apply other layers of predictive models to ensure your forecast can account for them.
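To make the idea concrete, here is a minimal sketch of a linear regression with two independent variables, fitted by ordinary least squares, followed by a residual check to find the store that fits worst. The store figures, the two variables (trade area population and competitor count), and the helper names are all made up for illustration. They are not real data, and this is not how any particular product implements regression.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augment the matrix so row operations apply to both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_linear(rows, sales):
    """Ordinary least squares; returns [intercept, coef_1, coef_2, ...]."""
    design = [[1.0] + list(r) for r in rows]  # the leading 1.0 is the intercept column
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, sales)) for i in range(k)]
    return solve(xtx, xty)


def predict(beta, row):
    """Predicted sales for one store from the fitted coefficients."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))


# Hypothetical stores: (trade area population in thousands, nearby competitors)
stores = [(20, 1), (35, 3), (50, 2), (65, 5), (80, 4), (95, 6)]
# Hypothetical annual sales in $ thousands
sales = [640, 900, 1310, 1500, 1930, 2180]

beta = fit_linear(stores, sales)
residuals = [y - predict(beta, s) for s, y in zip(stores, sales)]
# The store with the largest absolute residual is the first one worth investigating.
worst_fit = max(range(len(stores)), key=lambda i: abs(residuals[i]))
```

In practice you would likely reach for a statistics library rather than hand-rolled linear algebra. The point here is only that the fitted line leaves a residual for every store, and the largest ones mark the outliers to dig into.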

Random forest

Decision trees are useful building blocks for machine learning, as they help software understand highly specific relationships between inputs and outputs. But a single tree doesn’t translate well into a predictive model because it’s too specific to the data it was built on (in this case, your individual stores). Individual trees account for too many variables and don’t understand which ones carry the most weight.

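Random forest leverages machine learning to examine decision trees in aggregate. It builds many trees, each from a different random sample of stores and variables, and averages their predictions. Looking across the whole forest reveals the patterns that hold throughout your portfolio and shows which variables carry the most significance.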
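The sketch below shows the core mechanics in plain Python: each tree is grown on a bootstrap sample of stores, each split considers only a random subset of variables, and the forest’s forecast is the average of the trees. It is a deliberately simplified illustration, not a production implementation. The store data, variable choices, and parameter defaults (25 trees, depth 3) are assumptions made for the example.

```python
import random


def sse(values):
    """Sum of squared errors around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(rows, targets, rng, depth=0, max_depth=3, min_leaf=2):
    """Grow one regression tree; each split looks at a random subset of variables."""
    mean = sum(targets) / len(targets)
    if depth >= max_depth or len(rows) < 2 * min_leaf:
        return mean  # leaf: average sales of the stores that landed here
    n_vars = len(rows[0])
    candidates = rng.sample(range(n_vars), max(1, n_vars // 2))
    best = None
    for var in candidates:
        for threshold in sorted(set(r[var] for r in rows)):
            left = [t for r, t in zip(rows, targets) if r[var] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[var] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)  # lower means a cleaner split
            if best is None or score < best[0]:
                best = (score, var, threshold)
    if best is None:
        return mean
    _, var, threshold = best
    left = [(r, t) for r, t in zip(rows, targets) if r[var] <= threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[var] > threshold]
    return (
        var,
        threshold,
        build_tree([r for r, _ in left], [t for _, t in left], rng, depth + 1, max_depth, min_leaf),
        build_tree([r for r, _ in right], [t for _, t in right], rng, depth + 1, max_depth, min_leaf),
    )


def predict_tree(node, row):
    """Walk from the root to a leaf and return that leaf's average."""
    while isinstance(node, tuple):
        var, threshold, left, right = node
        node = left if row[var] <= threshold else right
    return node


def build_forest(rows, targets, n_trees=25, seed=0):
    """Train n_trees trees, each on a bootstrap sample (stores drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [targets[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    """The forest's forecast is the average of its trees' forecasts."""
    return sum(predict_tree(tree, row) for tree in forest) / len(forest)


# Hypothetical stores: (population in thousands, competitors, daily traffic count)
stores = [(20, 1, 300), (35, 3, 450), (50, 2, 700), (65, 5, 650),
          (80, 4, 900), (95, 6, 1100), (30, 2, 400), (70, 3, 800)]
sales = [640, 900, 1310, 1500, 1930, 2180, 820, 1700]  # hypothetical, $ thousands

forest = build_forest(stores, sales)
estimate = predict_forest(forest, (60, 3, 750))  # forecast for a candidate site
```

Because every tree sees a different slice of the portfolio, the quirks of any one store tend to average out across the forest.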

Common sales forecasting mistakes

Sales forecasting is a crucial, high-stakes exercise. Mistakes could lead you and others to close stores, open stores that have no hope or would only cannibalize sales from another location, or relocate stores into worse trade areas. And it’s frighteningly easy to get projections wrong. In fact, with any sales forecast, you’re always going to have a margin of error. The goal is just to reduce it as much as possible.

Here are some of the most common mistakes that create sales forecasts you can’t trust.

Not accounting for all the variables

The single greatest challenge with producing accurate sales forecasts is incorporating every factor that affects a store’s performance. At Tango, we generally separate these into three broad areas, each of which has numerous subcategories: the regional or surrounding area, the immediate vicinity, and the store itself. In other words, variables span regional, trade area, and site characteristics.

The facility itself has numerous qualities that can affect its ability to access the demand in a trade area, including visibility, accessibility, features, and operations. Your site models need to include assessments of each site’s parking lot, ingress and egress, capacity, signage, and other key criteria.

Each trade area has many variables as well, including raw demographic data, traffic flow, crime, weather, points of interest, competitors, and more. These determine the size of the opportunity the trade area presents.

Together, site and trade area characteristics could encompass hundreds or thousands of variables. This can easily get too granular—you wouldn’t want to incorporate the color of a garage, for example—but your goal should always be to include every variable that has significance, particularly if you can rationalize its impact. However, most predictive analytics software forces retailers to overlook many of these variables, “simplifying” the process by learning less about what makes your stores effective.

Tango Predictive Analytics helps you bring more variables into consideration through a series of 30 or so questions about your facility, which help create more accurate site models, useful comparisons, and ultimately, more reliable forecasts. Plus, you can create custom inputs based on your own knowledge of your unique business.

Choosing a sales forecast model that doesn’t fit the location

Sales forecasting works best when you choose the formula that best fits the data—not when you choose the data that best fits a formula. Unfortunately, software limitations or lack of expertise often force retailers to use cookie-cutter approaches that start with a formula instead of starting with the data.

The problem is that you often don’t know which variables matter most to your business’s success or how to weight them properly until you’ve trained a machine learning algorithm. So if you start with a model and preselected variables, you’ll wind up relying on some that have little relevance and overlooking others that directly impact performance.

Every business is different, and you can’t know which inputs yield the most reliable outputs until you’ve analyzed the available data in its entirety.

“Overfitting” the data

Machine learning is a powerful process that lets retailers analyze massive volumes of data to identify patterns human analysts could never recognize on their own. But in order for an algorithm to learn how to classify data and make predictions, it has to “train” on a dataset. The goal is for it to learn relationships between variables based on the training data, and then apply its understanding of those relationships to new data. (In this case, you analyze your current sites to make predictions about a potential new site or forecast sales for a current one.)

But there’s a problem retailers frequently encounter when using machine learning for sales forecasting: when the algorithm trains too long on the dataset, it actually learns so much about the intricacies of the dataset that it can no longer apply what it learned to new information. It “overfits” the data and doesn’t generalize well to other datasets. You can tell this has occurred when the algorithm yields a low error rate on the training dataset and a high error rate on the test dataset.
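As a rough sketch of that diagnostic, the example below computes the error rate on training stores and on held-out test stores, then flags the model when the test error is far worse. The sales figures are invented, and the 2x threshold in looks_overfit is an arbitrary illustrative choice, not an industry standard.

```python
def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.10 means 10%)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def looks_overfit(train_error, test_error, max_ratio=2.0):
    """Flag a model whose test error is far worse than its training error.

    max_ratio is a hypothetical cutoff chosen for this example.
    """
    return test_error > max_ratio * train_error


# Hypothetical annual sales ($ thousands) for stores the model trained on...
train_actual = [1000, 1200, 800, 1500]
train_predicted = [1010, 1190, 805, 1490]
# ...and for stores it has never seen.
test_actual = [900, 1100, 1300]
test_predicted = [1150, 820, 1700]

train_error = mape(train_actual, train_predicted)
test_error = mape(test_actual, test_predicted)
overfit = looks_overfit(train_error, test_error)
```

Here the model is off by less than 1% on the stores it trained on but by more than 25% on the held-out stores, which is the signature of overfitting described above.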

To prevent this issue, you need tight controls for addressing overfitting if it arises. You need the ability to vary your training data and cross-validate it. In Tango Predictive Analytics, we enable you to use different samples of your training and test data to ensure you can trust how the algorithm will behave and prevent overfitting.
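One common way to vary the training data is k-fold cross-validation: split the stores into k groups, hold each group out in turn, train on the rest, and measure the error on the held-out group. The sketch below illustrates the general technique with a simple one-variable regression standing in for the model. The data and function names are hypothetical, and this is not a description of how Tango Predictive Analytics works internally.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle store indices and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """One-variable least squares; returns (intercept, slope)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope


def cross_validate(xs, ys, k=4, seed=0):
    """Return the mean absolute error on each held-out fold."""
    errors = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        fold_error = sum(abs(ys[i] - (intercept + slope * xs[i])) for i in fold) / len(fold)
        errors.append(fold_error)
    return errors


# Hypothetical stores: daily traffic (hundreds of vehicles) and annual sales ($ thousands)
store_traffic = [12, 18, 25, 31, 40, 44, 52, 60, 67, 75, 83, 90]
store_sales = [410, 520, 700, 790, 1010, 1060, 1290, 1400, 1620, 1750, 1990, 2100]

fold_errors = cross_validate(store_traffic, store_sales, k=4)
```

If one fold’s error is much higher than the others, the model may be leaning on patterns that only exist in part of your portfolio.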

Lack of refinement capabilities

Sales forecasting should take a generally applicable, proven process and shape it to fit your unique business. Over time, you should be able to experiment and test different variables to refine your models and reduce the error rate even further. Outliers and anomalies create opportunities to investigate why a store’s performance didn’t fit the projections—which could lead you to discover a new significant variable you left out, like the presence of a nearby complementary business, a piece of specialized equipment, a unique product offering, Yelp reviews, or an exceptional manager.

Tango AutoML does a lot of the heavy lifting by removing variables with near-zero variance (meaning they barely change from store to store and so carry little information about the outcome) and learning which techniques work best for your business.
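To illustrate what a near-zero variance filter does in general, the sketch below rescales each variable to a 0-1 range and drops any whose variance falls below a cutoff. This is a generic example, not Tango AutoML’s actual method. The column names, the data, and the 0.05 threshold are assumptions made for illustration.

```python
from statistics import pvariance


def scaled_variance(values):
    """Variance after rescaling values to the 0-1 range (0.0 for a constant column)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0
    return pvariance([(v - lo) / (hi - lo) for v in values])


def drop_near_zero_variance(columns, threshold=0.05):
    """Split column names into (kept, dropped); threshold is an illustrative cutoff."""
    kept, dropped = [], []
    for name, values in columns.items():
        (kept if scaled_variance(values) > threshold else dropped).append(name)
    return kept, dropped


# Hypothetical variables for a 20-store portfolio
columns = {
    "has_parking": [1] * 20,               # every store has parking: no information
    "drive_thru": [0] * 19 + [1],          # only one store differs: almost no information
    "population_k": list(range(20, 120, 5)),
    "competitors": [1, 2, 3, 4, 5] * 4,
}

kept, dropped = drop_near_zero_variance(columns)
```

With this data, the parking and drive-thru flags are dropped because nearly every store shares the same value, while population and competitor counts are kept.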

But Tango has another process for helping you refine your sales forecasting methodology: Portfolio Reviews. Every Tango Predictive Analytics customer gets free consultations from our data experts. Not only will our analysts work with you to recommend actions for each location based on performance, but they’ll also help you examine outliers and anomalies to refine your models. Of course, Tango Predictive Analytics empowers your own analysts to dig into anomalies, too, but many customers lean on our expert insights to support their internal decisions and improve their refinement capabilities.

Sampling bias

Sampling bias makes your training data worthless. It occurs when your training data doesn’t reflect the actual distribution of stores in your existing network. For example, you might train your machine learning model on a set of stores from Utah, but that’s obviously not going to generalize well when you want to create forecasts for a potential site in California.

To avoid overfitting your data, you want to break your training data into samples and mix and match which stores you let the algorithm work with. But you still have to ensure that each of these samples is representative of your real estate portfolio. Some key dimensions to consider in your sample are urbanity, market size, and classifications that are relevant to your business, like mall vs. off-mall, street locations, non-traditional sites like an airport, or drive-thru units.
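One way to keep each sample representative is stratified sampling: group stores by a classification, then spread each group evenly across the samples. The sketch below shows the idea with a single hypothetical "format" field. The store records and field names are invented for the example, and a real portfolio would likely stratify on several dimensions at once.

```python
import random
from collections import defaultdict


def stratified_folds(stores, key, k=3, seed=0):
    """Assign stores to k folds so each fold mirrors the portfolio's mix of `key` values."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for store in stores:
        by_stratum[store[key]].append(store)
    folds = [[] for _ in range(k)]
    for members in by_stratum.values():
        rng.shuffle(members)  # randomize which stores of this type land in which fold
        for i, store in enumerate(members):
            folds[i % k].append(store)
    return folds


# Hypothetical portfolio: 6 mall stores, 3 street locations, 3 drive-thru units
portfolio = (
    [{"id": i, "format": "mall"} for i in range(6)]
    + [{"id": i, "format": "street"} for i in range(6, 9)]
    + [{"id": i, "format": "drive_thru"} for i in range(9, 12)]
)

folds = stratified_folds(portfolio, key="format", k=3)
```

Each of the three folds ends up with two mall stores, one street location, and one drive-thru unit, matching the 2:1:1 mix of the full portfolio.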

When your model performs very differently on your training and test data, that often reveals either sampling bias or an immature model that needs further refinement.

Ignoring data from closed stores

When we ask customers to provide performance data from the past several years, we’re not just talking about stores that are currently open. A machine learning algorithm can glean a lot from closed locations, especially in terms of how one location may affect others nearby. Your closed stores may have certain characteristics in common, performance patterns that could indicate potential problems, or real estate decisions that led to closures (such as store cannibalization).

Stores that closed in the last few years can still serve as ideal site models for a current store or potential one, so you certainly don’t want to rule them out of your training data. They’ll also help you better understand the ripple effects of future decisions and yield better sales forecasts.

Improve your sales forecasting with Tango

Tango Predictive Analytics significantly reduces your margin of error by incorporating all the variables that matter, eliminating the ones that don’t, and bringing robust, versatile machine learning and artificial intelligence to your models. You’ll have all the refinement capabilities you need to produce reliable sales forecasts that fit your business, plus tools to diagnose and prevent overfitting, sampling bias, and other sales forecasting challenges.

Want to see what Tango Predictive Analytics can do for you?

Schedule a demo today.
