The Encoding Trap We All Fall Into
When I first started in data science, the recipe for handling categorical variables seemed straightforward: just one-hot encode everything! It's the default approach taught in most ML courses, and for good reason—it's simple to understand and implement. But after years of building models across various domains, I've learned a hard truth: relying solely on one-hot encoding is leaving significant performance on the table.
In this post, I'll share two game-changing approaches we've implemented that dramatically improved our model performance:
- Statistical encodings for high-cardinality categorical variables
- Cyclical encoding with basis functions for time-based features
Both techniques have transformed how our models handle categorical data, especially in complex forecasting scenarios. Let's dive in!
The High-Cardinality Problem
We recently tackled a demand forecasting project with over 50,000 unique SKUs. Our first instinct was to one-hot encode the SKU IDs, but that immediately raised red flags:
- Explosion of dimensions: 50,000 SKUs = 50,000 new binary features!
- Sparsity issues: Each sample would have just one "1" and 49,999 "0"s
- Cold-start problem: New SKUs would have no representation
- Lost information: The encoding doesn't capture any relationship between similar SKUs
Here's what one-hot encoding looks like in pseudocode:
# Traditional one-hot encoding
for each sku in dataset:
    for each possible_sku_value in all_possible_skus:
        if sku == possible_sku_value:
            encoded_feature[possible_sku_value] = 1
        else:
            encoded_feature[possible_sku_value] = 0
For a dataset with 50,000 SKUs and just 1 million rows, this creates a matrix with 50 billion elements—most of which are zeros! Not only is this computationally inefficient, but it also fails to leverage the rich information contained in the SKU identifiers.
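In practice no one writes that loop by hand; with pandas it is a one-liner. Here is a minimal sketch (the DataFrame and column name are assumptions for illustration) which, even with sparse storage, leaves the dimensionality problem intact:

```
import pandas as pd

# One binary column per unique SKU. Sparse storage avoids materializing 50 billion
# zeros in memory, but the model still faces 50,000 mostly-empty features.
one_hot = pd.get_dummies(df["sku"], prefix="sku", sparse=True)
print(one_hot.shape)  # (n_rows, n_unique_skus)
```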
Statistical Encoding: Turning IDs Into Information Gold Mines
We completely reimagined how to handle SKU IDs by encoding them with statistical properties of their historical time series. This approach was inspired by techniques used in the M5 forecasting competition, where winners extracted rich representations from product identifiers.
Here's how we implemented it:
# Pseudocode for statistical encoding of high-cardinality SKU IDs
Function CreateStatisticalEncodings(all_skus, historical_data):
    sku_encodings = EmptyDictionary()

    For each sku in all_skus:
        # Get this SKU's historical time series
        sku_time_series = FilterDataForSKU(historical_data, sku)

        # Create empty dictionary for this SKU's statistics
        sku_stats = EmptyDictionary()

        # Basic volume and variation metrics
        sku_stats["mean"] = CalculateMean(sku_time_series)
        sku_stats["median"] = CalculateMedian(sku_time_series)
        sku_stats["std_dev"] = CalculateStandardDeviation(sku_time_series)
        sku_stats["cv"] = sku_stats["std_dev"] / sku_stats["mean"]  # Coefficient of variation
        sku_stats["min"] = CalculateMinimum(sku_time_series)
        sku_stats["max"] = CalculateMaximum(sku_time_series)
        sku_stats["range"] = sku_stats["max"] - sku_stats["min"]

        # Trend metrics
        sku_stats["trend_coefficient"] = FitLinearTrend(sku_time_series)
        sku_stats["trend_strength"] = CalculateTrendStrength(sku_time_series)

        # Seasonality metrics
        sku_stats["weekly_seasonality"] = DetectWeeklySeasonality(sku_time_series)
        sku_stats["monthly_seasonality"] = DetectMonthlySeasonality(sku_time_series)
        sku_stats["quarterly_seasonality"] = DetectQuarterlySeasonality(sku_time_series)
        sku_stats["yearly_seasonality"] = DetectYearlySeasonality(sku_time_series)

        # Autocorrelation metrics
        For each lag in [1, 7, 14, 28, 90, 365]:
            sku_stats["autocorr_" + str(lag)] = CalculateAutocorrelation(sku_time_series, lag)

        # Quantile metrics
        For each quantile in [0.1, 0.25, 0.5, 0.75, 0.9]:
            sku_stats["quantile_" + str(quantile)] = CalculateQuantile(sku_time_series, quantile)

        # Time series decomposition
        decomposition = DecomposeTimeSeries(sku_time_series)
        sku_stats["seasonal_strength"] = CalculateVariance(decomposition.seasonal) / CalculateVariance(sku_time_series)
        sku_stats["residual_strength"] = CalculateVariance(decomposition.residual) / CalculateVariance(sku_time_series)

        # Frequency domain metrics
        frequency_analysis = PerformFFT(sku_time_series)
        sku_stats["dominant_frequency"] = GetDominantFrequency(frequency_analysis)
        sku_stats["spectral_entropy"] = CalculateSpectralEntropy(frequency_analysis)

        # Store all metrics for this SKU
        sku_encodings[sku] = sku_stats

    Return sku_encodings
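For readers who want something runnable, here is a minimal sketch of the same idea using pandas and NumPy. It covers only a handful of the 48 features we actually compute, and the column names and daily `sales` Series are assumptions for illustration:

```
import numpy as np
import pandas as pd

def statistical_encoding(sales: pd.Series) -> dict:
    """Compute a small subset of statistical features for one SKU's sales history."""
    mean = sales.mean()
    std = sales.std()
    # Linear trend: slope of a least-squares line fitted over the observation index
    slope = np.polyfit(np.arange(len(sales)), sales.to_numpy(), deg=1)[0]

    stats = {
        "mean": mean,
        "median": sales.median(),
        "std_dev": std,
        "cv": std / mean if mean != 0 else 0.0,  # coefficient of variation
        "min": sales.min(),
        "max": sales.max(),
        "range": sales.max() - sales.min(),
        "trend_coefficient": slope,
    }
    # Autocorrelation at a few representative lags
    for lag in [1, 7, 28]:
        stats[f"autocorr_{lag}"] = sales.autocorr(lag=lag)
    # Distributional shape via quantiles
    for q in [0.1, 0.25, 0.5, 0.75, 0.9]:
        stats[f"quantile_{q}"] = sales.quantile(q)
    return stats

# Usage (assumed layout: one row per SKU-day with 'sku' and 'sales' columns):
# sku_encodings = {sku: statistical_encoding(g) for sku, g in sales_df.groupby("sku")["sales"]}
```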
In total, we computed 48 statistical features for each SKU, transforming a simple identifier into a rich representation of its historical behavior. This approach yielded several tremendous benefits:
- Dimensionality reduction: Instead of 50,000 binary features, we had just 48 continuous features
- Information preservation: Each feature captured meaningful aspects of the SKU's behavior
- Cross-SKU learning: The model could now recognize patterns across similar SKUs
- Cold-start handling: New SKUs could be characterized by their early sales patterns
The impact was dramatic—our forecasting accuracy improved by 22.7% compared to models using one-hot encoding!
Real Example: Pattern Recognition Across Products
Let me share a concrete example of why this was so powerful. In our dataset, we had two energy drink SKUs that had never been sold in the same store. With one-hot encoding, our model couldn't transfer learning between them. But with statistical encoding, the model immediately recognized similarities in their:
- Weekly seasonality patterns (both peaked on weekends)
- Price elasticity (both had similar responses to promotions)
- Growth trends (both showed similar upward trajectories)
When a promotion was run on one SKU, the model could now accurately predict how the other would respond to a similar promotion—something impossible with traditional encoding.
Cyclical Variables: When One-Hot Actually Destroys Information
A second major insight came when modeling time-based features like day of week, month, and holidays. The standard approach is often to one-hot encode these variables:
# One-hot encoding for day of week (0=Monday, 6=Sunday)
monday_feature = 1 if day_of_week == 0 else 0
tuesday_feature = 1 if day_of_week == 1 else 0
# ... and so on
But this approach creates a fundamental problem: it breaks the natural cyclical relationship between values. With one-hot encoding, Sunday (6) and Monday (0) appear as completely unrelated as Sunday and Wednesday—the model can't see that they're adjacent days!
Basis Functions: Smooth Representations of Cyclical Data
Instead, we implemented cyclical encodings using basis functions—a technique that preserves the natural cyclical relationships and even allows for "spillover" effects between adjacent values.
Here's our approach:
# For cyclical features like day of week (0-6)
import math

def cyclical_encoding(value, period, num_basis_functions=3):
    """
    Create a smooth cyclical encoding using sine/cosine basis functions.

    value: the value to encode (e.g., day of week 0-6)
    period: the full cycle length (e.g., 7 for days of week)
    num_basis_functions: number of sine/cosine pairs to use
    """
    encodings = []
    for k in range(1, num_basis_functions + 1):
        # Create sine and cosine features for this frequency
        sin_feature = math.sin(2 * math.pi * k * value / period)
        cos_feature = math.cos(2 * math.pi * k * value / period)
        encodings.extend([sin_feature, cos_feature])
    return encodings

# Example: encoding day of week (day_of_week is an integer 0-6)
day_features = cyclical_encoding(day_of_week, period=7, num_basis_functions=3)
# This gives 6 features that smoothly represent the cyclical pattern
This approach creates a smooth representation where adjacent days have similar encodings, and the cycle wraps around properly from Sunday to Monday.
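A quick sanity check with the function above makes that concrete. The distances below are approximate, and the effect is clearest with a single sine/cosine pair; as more harmonics are added the encoding approaches a complete basis and starts to behave like one-hot again:

```
import math

def distance(a, b):
    """Euclidean distance between two encoding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

monday = cyclical_encoding(0, period=7, num_basis_functions=1)
wednesday = cyclical_encoding(2, period=7, num_basis_functions=1)
sunday = cyclical_encoding(6, period=7, num_basis_functions=1)

print(round(distance(sunday, monday), 2))     # ~0.87 -- adjacent days; the cycle wraps around
print(round(distance(sunday, wednesday), 2))  # ~1.95 -- days far apart within the week
```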
Handling Holiday Effects with Basis Functions
We took this approach even further for modeling holiday effects. Instead of a simple binary flag for "is holiday," we created a continuous effect using a Gaussian basis function:
# Modeling holiday effects with tapering
import math

def holiday_effect(target_date, holiday_date, width=3):
    """
    Create a tapering holiday effect that builds up and winds down.

    target_date, holiday_date: datetime.date (or similar) objects
    width: controls how quickly the effect tapers off (in days)
    """
    days_difference = abs((target_date - holiday_date).days)
    effect = math.exp(-(days_difference ** 2) / (2 * width ** 2))
    return effect

# Example: modeling the Christmas effect for a given date
christmas_effect = holiday_effect(current_date, christmas_date, width=5)
This function creates an effect that:
- Peaks at 1.0 on the holiday itself
- Gradually builds up before the holiday
- Gradually tapers off after the holiday
- Eventually approaches zero for dates far from the holiday
The beauty of this approach is that it models reality much more accurately. The effect of Christmas doesn't suddenly appear on December 25th and disappear on December 26th—it builds up through December and gradually fades afterward.
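A small loop over the week around the holiday shows the shape directly (the dates and the width value are illustrative); a larger `width` stretches the same bell curve over more days:

```
from datetime import date, timedelta

christmas = date(2024, 12, 25)
for offset in range(-6, 7, 2):
    day = christmas + timedelta(days=offset)
    print(day, round(holiday_effect(day, christmas, width=3), 2))

# 2024-12-19 0.14, 2024-12-21 0.41, 2024-12-23 0.8, 2024-12-25 1.0,
# 2024-12-27 0.8, 2024-12-29 0.41, 2024-12-31 0.14
```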
Visualization of Effect Difference
Here's what this looks like visually for Christmas sales in our retail data:
```
One-Hot Encoding:                     Basis Function Encoding:

Effect                                Effect
  |          ___                        |           _/\_
  |         |   |                       |         _/    \_
  |         |   |                       |       _/        \_
  |_________|   |_________              |_____/              \_____
  +----------------------->             +------------------------->
   Dec 20    Dec 25   Dec 30             Dec 20    Dec 25    Dec 30
```
The basis function approach captures the gradual build-up and tapering effect that matches the real-world pattern, while one-hot encoding creates an unrealistic step function.
Putting It All Together: Combined Impact on Our Models
When we implemented both statistical encodings for high-cardinality variables AND basis functions for cyclical variables, the combined effect was greater than the sum of its parts. Our models could now:
- Recognize similar SKUs through their statistical signatures
- Properly model the smooth transitions between cyclical time periods
- Capture the gradual build-up and tapering of holiday effects
The results were remarkable:
- 22.7% accuracy improvement from statistical encodings
- 8.3% additional improvement from cyclical basis functions
- Massive reduction in model complexity (48 features vs. 50,000+ with one-hot)
- Better generalization to new products and time periods
Implementation Tips and Challenges
While these techniques are powerful, implementing them successfully required navigating some challenges:
For Statistical Encodings:
- Feature stability: We had to ensure that statistical features were robust to outliers in the time series
- Leakage concerns: Care must be taken to use only historical data when creating these encodings; a leakage-safe sketch follows this list
- Computation time: Calculating 48 statistical features for 50,000 SKUs required optimization
- Missing history: For new SKUs, we developed a fallback strategy using category averages
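As a rough illustration of the leakage and cold-start points above (the column names, cutoff logic, and category-average fallback here are simplified assumptions, not our production code), the encodings can be built strictly from data observed before the training cutoff:

```
import pandas as pd

def build_encodings(history: pd.DataFrame, cutoff: pd.Timestamp, min_obs: int = 28) -> dict:
    """Per-SKU statistical encodings computed only from data before `cutoff`."""
    past = history[history["date"] < cutoff]  # never peek into the forecast horizon

    sku_stats = past.groupby("sku")["sales"].agg(["mean", "median", "std", "count"])
    category_stats = past.groupby("category")["sales"].agg(["mean", "median", "std"])
    sku_to_category = history.drop_duplicates("sku").set_index("sku")["category"]

    encodings = {}
    for sku, category in sku_to_category.items():
        if sku in sku_stats.index and sku_stats.loc[sku, "count"] >= min_obs:
            # Enough history: use the SKU's own statistics
            encodings[sku] = sku_stats.loc[sku, ["mean", "median", "std"]].to_dict()
        else:
            # New or sparse SKU: fall back to its category's average profile
            encodings[sku] = category_stats.loc[category].to_dict()
    return encodings
```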
For Basis Functions:
- Choosing the right width: too narrow and the model misses the build-up before an event; too wide and the effect bleeds into unrelated days
- Balancing complexity: more basis functions capture finer detail, but enough of them effectively recreates one-hot encoding and invites overfitting
- Interaction with other features: smooth cyclical features can interact with the rest of the feature set (promotion flags, for example) differently than binary columns do, so interaction terms may need revisiting
Code Example: A Complete Implementation
Here's a simplified version of our implementation for a demand forecasting scenario:
# Pseudocode for our encoding approach
# (unique_skus, skus_to_forecast, holidays, holiday_widths, and
#  extract_statistical_features come from the surrounding pipeline)
import pandas as pd

def prepare_features(historical_data, forecast_dates):
    features = []

    # 1. Extract SKU statistical encodings
    sku_stats = {}
    for sku in unique_skus:
        sku_history = historical_data[historical_data.sku == sku]
        sku_stats[sku] = extract_statistical_features(sku_history.sales)

    # 2. Prepare for forecasting
    for date in forecast_dates:
        for sku in skus_to_forecast:
            # Get basic features
            feature_row = {
                'sku': sku,
                'date': date,
            }

            # Add statistical encoding features
            for stat_name, stat_value in sku_stats[sku].items():
                feature_row[f'sku_stat_{stat_name}'] = stat_value

            # Add cyclical time features
            day_of_week = date.weekday()
            month = date.month
            day_of_month = date.day  # could be encoded the same way (period ~ days in month)

            # Cyclical encoding for day of week
            dow_features = cyclical_encoding(day_of_week, period=7, num_basis_functions=3)
            for i, value in enumerate(dow_features):
                feature_row[f'dow_cyclical_{i}'] = value

            # Cyclical encoding for month
            month_features = cyclical_encoding(month, period=12, num_basis_functions=4)
            for i, value in enumerate(month_features):
                feature_row[f'month_cyclical_{i}'] = value

            # Add holiday effects
            for holiday_date, holiday_name in holidays:
                effect = holiday_effect(date, holiday_date, width=holiday_widths[holiday_name])
                feature_row[f'holiday_{holiday_name}'] = effect

            features.append(feature_row)

    return pd.DataFrame(features)
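From there, the frame can feed almost any tabular learner. A hedged sketch of what training and forecasting might look like (the model choice, `historical_dates`, and `train_targets` are illustrative assumptions, not our production setup):

```
from sklearn.ensemble import GradientBoostingRegressor

# Build features for the historical period with the same function, then train
train_features = prepare_features(historical_data, historical_dates)
X_train = train_features.drop(columns=['sku', 'date'])  # identifiers are bookkeeping only
y_train = train_targets  # actual sales aligned to the rows of train_features

model = GradientBoostingRegressor()
model.fit(X_train, y_train)

# Forecast: encode the future dates through the identical pipeline, then predict
X_future = prepare_features(historical_data, forecast_dates).drop(columns=['sku', 'date'])
predictions = model.predict(X_future)
```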
Industry State-of-the-Art: A Rapidly Evolving Landscape
It's worth noting that the field of categorical encoding for machine learning is experiencing a remarkable revolution. Major tech companies and research institutions are continuously pushing the boundaries of what's possible:
- Google has published research on learned categorical embeddings in their Wide & Deep architecture, which is now used for various recommendation systems.
- Uber leverages a combination of statistical encodings and learned embeddings for their demand forecasting systems, serving millions of rides daily.
- Amazon has developed specialized encoding techniques for their massive product catalog, allowing them to handle millions of SKUs efficiently.
- Netflix has pioneered work in content embeddings that transform categorical attributes of shows and movies into rich representations.
- Financial institutions are applying advanced cyclical encodings to temporal patterns in transaction data to improve fraud detection.
SOTA (State-of-the-Art) approaches are being published almost monthly, and what was cutting-edge just six months ago is now considered a baseline approach. This rapid evolution makes it an exciting time to be working in this field.
In our own work, we're not just implementing techniques from academic papers—we're actively contributing to the field through experimentation and refinement of these methods for real-world applications. Our approach to statistical encoding of high-cardinality variables, for instance, incorporates several innovations that we haven't seen published elsewhere.
What makes this particularly exciting is that advances in encoding techniques often yield more significant gains than switching to more complex model architectures. A well-encoded feature set can make even a simple model perform remarkably well, which has important implications for computational efficiency and interpretability in production environments.
The Cutting Edge: Learned Embeddings with Transformer Architectures
While statistical encodings and basis functions have dramatically improved our models, we're already pushing into the next frontier: learned embeddings within neural network architectures.
As noted above, this area is moving quickly, with Google, Amazon, Uber, and several AI research labs publishing new approaches at a rapid clip, and we're working to contribute to it ourselves.
Our research team is currently experimenting with transformer architectures for time series forecasting that can learn optimal representations of categorical variables automatically during training. Here's a high-level pseudocode approach we're exploring:
# Conceptual pseudocode for embedding categorical features in a transformer architecture
Function BuildTimeSeriesTransformer(categorical_variables, time_features):
    # Create an embedding space for each categorical variable
    embeddings = EmptyDictionary()
    For each variable in categorical_variables:
        embeddings[variable] = CreateEmbeddingLayer(dimension scaled to the variable's cardinality)

    # Process time features with positional encoding
    encoded_time_features = AddPositionalEncoding(time_features)

    # Multi-head attention blocks
    attention_output = MultiHeadAttention(encoded_time_features)

    # Combine learned embeddings with the attention output
    processed_embeddings = EmptyList()
    For each embedding in embeddings:
        processed_embeddings.Append(FeedForwardNetwork(embedding))
    combined_features = Concatenate(processed_embeddings, attention_output)

    # Final prediction layers
    predictions = FeedForwardNetwork(combined_features)

    # Assemble everything into a trainable model
    model = ComposeModel(inputs=[categorical_variables, time_features], outputs=predictions)
    Return model
The beauty of this approach is that the embeddings are learned automatically during model training to optimize for the prediction task. This offers several advantages over manual encoding:
- Task-optimized representations: The model learns a representation tuned directly to minimize prediction error
- Automatic feature interactions: Embeddings naturally capture interactions between categorical variables
- Transfer learning potential: Embeddings from one task can be reused for related tasks
- Reduced manual engineering: Less need for domain expertise in feature creation
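To make the idea tangible, here is a minimal PyTorch sketch of the embedding mechanism on its own, with a plain feed-forward head rather than the full attention stack; the vocabulary size, dimensions, and layer sizes are illustrative assumptions:

```
import torch
import torch.nn as nn

class EmbeddingForecaster(nn.Module):
    """Toy model: a learned SKU embedding concatenated with numeric time features."""

    def __init__(self, num_skus: int, embedding_dim: int = 16, num_time_features: int = 8):
        super().__init__()
        # Trainable lookup table: one embedding_dim vector per SKU, shaped by the loss
        self.sku_embedding = nn.Embedding(num_skus, embedding_dim)
        self.head = nn.Sequential(
            nn.Linear(embedding_dim + num_time_features, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, sku_ids: torch.Tensor, time_features: torch.Tensor) -> torch.Tensor:
        sku_vectors = self.sku_embedding(sku_ids)            # (batch, embedding_dim)
        x = torch.cat([sku_vectors, time_features], dim=-1)  # (batch, embedding_dim + time feats)
        return self.head(x).squeeze(-1)                      # (batch,) demand predictions

# Usage with dummy data: 50,000 SKUs, a batch of 32 rows
model = EmbeddingForecaster(num_skus=50_000)
preds = model(torch.randint(0, 50_000, (32,)), torch.randn(32, 8))
```

After training, SKUs with similar demand behavior tend to end up close together in the embedding space, which is the same cross-SKU generalization we engineered by hand with the statistical features.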
Our early experiments with transformer-based architectures are showing promising results, particularly for complex time series with multiple categorical factors. In some test cases, we're seeing an additional 5-7% accuracy improvement over our statistical encoding approach.
What's particularly exciting is how quickly these approaches are being advanced by major tech companies and research labs: Uber applies similar techniques to its demand forecasting, and large retailers are adopting transformer architectures for inventory management.
However, these approaches come with their own challenges:
- Data requirements: Learning good embeddings typically requires more data
- Computational complexity: Training transformer models is significantly more resource-intensive
- Interpretability: Learned embeddings are less interpretable than statistical features
- Engineering complexity: Deploying these models in production requires more infrastructure
We're actively working to address these challenges and hope to move this approach into production within the next development cycle.
Conclusion: Rethinking Categorical Encoding
The key lesson we've learned is that categorical encoding isn't just a mechanical preprocessing step—it's an opportunity to inject domain knowledge and statistical insight into your models.
By moving beyond one-hot encoding to statistical representations and basis functions, we've significantly improved our model performance while actually reducing computational complexity. And our research into learned embeddings with transformer architectures suggests even more gains ahead.
It's vital to reconsider your encoding approach based on the specific use case and task at hand. A tremendous amount of rich information is hidden within categorical variables, and the right encoding technique can unlock this hidden value. What might appear as a simple identifier (like a product SKU or customer ID) often contains implicit patterns, relationships, and behaviors that traditional encoding methods completely discard.
I encourage you to look at your own categorical variables with fresh eyes:
- Which high-cardinality variables might benefit from statistical encodings?
- Which variables have inherent cyclical relationships that one-hot encoding breaks?
- Where might gradual tapering effects better represent real-world phenomena than binary flags?
- Could your project benefit from learned embeddings in a neural network architecture?
The answers to these questions could unlock significant improvements in your own models.
In future posts, I'll dive deeper into our transformer-based experiments and explore other advanced encoding techniques, including target encoding and hybrid approaches. Until then, I'd love to hear your experiences with different encoding methods in the comments!