
An overview of our Intent model

Find out more about the Made With Intent model

Written by Charley Bader
Updated over 3 weeks ago

This document explains how our Intent model performs and what its outputs mean in real-world terms. The model is designed to support better decision-making by providing probabilistic predictions—not just binary outcomes.

The model provides:

  • Reliable probabilities for risk-based decisions.

  • Effective ranking of outcomes for prioritisation.

  • Strong performance when used for automated or manual decisions.

Read more about why our model is informed by global eCommerce behaviours and continually fine-tuned here.


How the model works

The model is a deep learning framework with multiple inputs and outputs, designed to predict customer intent on e-commerce platforms. It identifies four intent signals - conversion, exit, return and add to cart - each independently optimised and weighted. The model utilises two distinct input streams to capture different behavioural patterns.

The first input is designed to learn individual user shopping behaviour. It processes a range of real-time behavioural signals, including clicks, page navigation, site metadata, mouse movements, time since user clicked X, events since user viewed Y, dwell time, and a comprehensive history of interactions. This input comprises approximately 600 features and is updated with each inference request.

The second input provides an aggregated view of the website’s environment. It incorporates information about product categories, page structures, pricing distributions, traffic composition, device mix, and behavioural patterns of both converting and non-converting users. This input comprises approximately 800 features. It enables the model to differentiate between various e-commerce environments, such as travel vs. retail, department stores vs. boutiques, or luxury vs. economy brands. The data is updated every three hours using a rolling seven-day window.

The architecture follows a wide-and-deep model structure with multiple branches to accommodate multi-input, multi-output learning. Model training is optimised over a dataset of more than 50 million events, with performance validated against smaller, single-output benchmark models. These smaller models serve as accuracy baselines for the deep learning framework and also support model explainability and feature exploration.
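
For illustration, here is a minimal sketch of what a wide-and-deep, multi-input and multi-output network of this shape can look like, written in Keras. The two input sizes mirror the feature counts described above; the layer sizes, names and loss weights are illustrative assumptions, not the production architecture.

# Minimal sketch of a wide-and-deep, multi-input / multi-output network.
# Input sizes mirror the article (~600 user features, ~800 site features);
# layer sizes, names and loss weights are illustrative assumptions only.
from tensorflow.keras import layers, Model

user_in = layers.Input(shape=(600,), name="user_behaviour")  # per-session signals
site_in = layers.Input(shape=(800,), name="site_context")    # aggregated site view

# "Deep" branch: learns interactions between the two input streams.
deep = layers.Concatenate()([user_in, site_in])
deep = layers.Dense(256, activation="relu")(deep)
deep = layers.Dense(64, activation="relu")(deep)

# "Wide" branch: passes the raw features straight through to the output heads.
wide = layers.Concatenate()([user_in, site_in])
merged = layers.Concatenate()([wide, deep])

# One independently optimised sigmoid head per intent signal.
outputs = [
    layers.Dense(1, activation="sigmoid", name=name)(merged)
    for name in ("convert", "exit", "return", "add_to_cart")
]

model = Model(inputs=[user_in, site_in], outputs=outputs)
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    loss_weights={"convert": 1.0, "exit": 1.0, "return": 1.0, "add_to_cart": 1.0},
)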

Each output undergoes independent calibration, transforming raw scores into well-calibrated probabilities via a sigmoid function. Model evaluation is conducted using both classification metrics (F1-score, AUC) to assess predictive accuracy and regression-based metrics (MSE, LogLoss) to evaluate probability calibration. Additionally, scenario-based testing is performed per website to validate performance across various contexts. These include accuracy assessments at different interaction stages, such as the first event, across different page types, varying page views, and pre- vs. post-add-to-cart interactions. Outliers identified through these evaluations are fed back into the training and feature engineering pipeline to refine model performance.
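
For illustration, here is a minimal sketch of sigmoid (Platt-style) calibration on a single output, assuming you have raw, uncalibrated scores and held-out outcomes. The synthetic data and variable names are illustrative, not the production pipeline.

# Minimal sketch of sigmoid (Platt-style) calibration for a single output.
# The synthetic scores and labels below stand in for held-out model data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
raw_scores = rng.normal(size=1000)                                 # uncalibrated scores
outcomes = (raw_scores + rng.normal(size=1000) > 0.5).astype(int)  # observed outcomes

# A one-feature logistic regression maps raw scores onto calibrated probabilities.
calibrator = LogisticRegression().fit(raw_scores.reshape(-1, 1), outcomes)
calibrated = calibrator.predict_proba(raw_scores.reshape(-1, 1))[:, 1]

print(log_loss(outcomes, calibrated))  # lower LogLoss indicates better calibration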

What do we predict?

Find out more about the Intent Framework here

How are the predictions made?

The specific model prediction we use here is our Intent to Convert (xC™).

Rather than giving a simple “yes” or “no,” our model provides a confidence score - a probability estimate of how likely something is to be true. For example:

“There’s a 65% chance this customer will churn.”

“This user has an 80% likelihood of being a bot.”

These scores allow you to act based on your own risk thresholds, rather than forcing a single outcome. We know user journeys are inherently progressive; there isn’t a particular moment when a user has Intent to Convert. Instead, we interpret our predictions in the context of a shopping journey, starting low and getting progressively higher until a purchase is made.
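
As a concrete illustration, here is a minimal sketch of acting on a score against your own risk thresholds. The function name, the journey-stage actions and the threshold values are hypothetical examples, not recommendations.

# Minimal sketch: turning an Intent to Convert probability into an action.
# The action names and threshold values are hypothetical examples.
def choose_action(intent_to_convert: float) -> str:
    if intent_to_convert >= 0.80:
        return "nudge_to_checkout"   # late journey: reinforce the purchase
    if intent_to_convert >= 0.40:
        return "show_social_proof"   # mid journey: build confidence
    return "capture_email"           # early journey: focus on engagement

print(choose_action(0.65))  # -> "show_social_proof"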

Why We Don’t Say ‘It’s X% Accurate’

A common question is: “How accurate is your Intent model?” But for probabilistic models like ours, the answer isn’t a single number like “80% accurate.”

  • Traditional accuracy applies to models that always give yes/no answers and are judged by how often they’re correct.

  • Our model gives probabilities, not final decisions. The quality of the model lies in how well-calibrated those probabilities are and how effectively the model distinguishes between different outcomes.

Instead of a single “accuracy” score, we report on:

  • Calibration – Are the confidence scores reliable?

  • Discrimination – Can the model tell different outcomes apart?

  • Decisioning – How useful is the model for making binary decisions?

1. Calibration: Do Confidence Scores Reflect Reality?

Calibration evaluates how well the model’s probabilities align with actual outcomes.

  • Expected Calibration Error (ECE): measures the average gap between predicted confidence and actual outcomes. When the model predicts something with 70% confidence, it should be right about 70 times out of 100. A low ECE indicates the model’s probabilities are trustworthy.

Real-world analogy:

Like a weather forecast: when it says there’s a 30% chance of rain, and it rains 3 out of 10 times, the forecast is well-calibrated. Our model works the same way.
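
For the curious, here is a minimal sketch of how ECE can be computed by binning predictions, assuming you have arrays of predicted probabilities and the outcomes that actually occurred.

# Minimal sketch of Expected Calibration Error (ECE) via confidence bins.
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    # Assign each prediction to a confidence bin (0-10%, 10-20%, ...).
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(probs[mask].mean() - outcomes[mask].mean())
            ece += gap * mask.mean()  # weight by the share of predictions in the bin
    return ece

# Ten predictions of 70% confidence, seven of which come true: ECE is 0 (well calibrated).
print(expected_calibration_error([0.7] * 10, [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]))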

2. Discrimination: Can the Model Tell Outcomes Apart?

Discrimination measures how well the model separates different classes—for example, identifying likely positives versus likely negatives.

  • AUC (Area Under the Curve): for example, a score of 0.84 means that, given a randomly chosen positive case and a randomly chosen negative case, the model ranks the positive one higher about 84% of the time.

Real-world analogy:

Think of sorting emails into spam and non-spam. A high AUC means the model ranks spam messages near the top of the list—even before you draw a cutoff.
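
For the curious, here is a minimal sketch of computing AUC with scikit-learn; the outcomes and probabilities below are made-up toy data.

# Minimal sketch of AUC on held-out data; the numbers below are made up.
from sklearn.metrics import roc_auc_score

outcomes = [1, 0, 1, 1, 0, 0, 1, 0]                   # 1 = the outcome happened
probs    = [0.9, 0.2, 0.7, 0.3, 0.4, 0.1, 0.8, 0.5]   # model's predicted probabilities

# Probability that a random positive case is ranked above a random negative case.
print(roc_auc_score(outcomes, probs))  # -> 0.875 for this toy data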

3. Decisioning: Precision vs Recall

When a specific threshold is chosen (e.g. flag if probability > 0.8), we can also evaluate how well the model performs when it commits to a yes/no decision.

  • Precision: Of all the times the model predicted “yes,” how many were correct?

  • Recall: Of all actual “yes” cases, how many did the model identify?

  • F1 Score: A single measure that balances precision and recall.

This helps assess performance in workflows that act on the model’s predictions; see the short worked example at the end of this section.

Real-world analogy:

Like a medical test. You want to know:

  • If it says you’re positive, how often is that right? (Precision)

  • If you’re sick, how often does it catch it? (Recall)

  • The F1 Score balances both.
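
For the curious, here is a minimal sketch of evaluating decisions at a chosen threshold with scikit-learn; the probabilities, outcomes and the 0.8 cutoff are made-up examples.

# Minimal sketch of precision, recall and F1 at a chosen decision threshold.
from sklearn.metrics import precision_score, recall_score, f1_score

probs    = [0.92, 0.85, 0.40, 0.81, 0.30, 0.75, 0.10, 0.95]  # model probabilities
outcomes = [1,    1,    1,    0,    0,    1,    0,    1]     # what actually happened

# Commit to yes/no decisions at the chosen risk threshold (flag if probability > 0.8).
decisions = [1 if p > 0.8 else 0 for p in probs]

print(precision_score(outcomes, decisions))  # of the flagged cases, how many were right (0.75)
print(recall_score(outcomes, decisions))     # of the actual "yes" cases, how many were caught (0.6)
print(f1_score(outcomes, decisions))         # single score balancing the two (~0.67)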
