The brighter you are, the more you have to learn.

Reinforcement Learning for Ad Spend Optimization

Author – Author – Ankit Gadi, Data Scientist at Decision Tree Analytics and Services

All companies with any digital spend are faced with a unique proposition, the media agency has supplied a host of creatives, however they are not sure which ad would work and which wouldn’t. Traditionally marketers have used their ‘gut feel’ and ‘experience’ to allocate spends across different ads.

However, with advances in technology, one can leverage machine learning algorithms to remove the guesswork from this task. This problem can be framed mathematically to optimize spends across campaigns to maximize a performance metric like CTR. Multi armed bandits is one such technique, which can be applied here.

MULTI – ARMED BANDITS: A multi-armed bandit solution is a reinforcement-learning algorithm that uses machine-learning algorithms to dynamically allocate traffic to variations that are performing well, while allocating less traffic to variations that are underperforming.

The algorithm starts in an ignorant state, where it knows nothing, and begins to acquire data by testing the system. As it acquires data and results, it learns what the best and worst behaviors are. Furthermore, it tries to solve the explore-exploit problem in a different way. Instead of two distinct periods of pure exploration and pure exploitation, bandit tests are adaptive, and simultaneously include exploration and exploitation. This algorithm suggests that we should not discard the choices that didn’t do well, but we should pick them at a decreasing rate as we gather confidence that there exist better options.

Algorithms of multi - armed bandits include:

Epsilon Greedy Algorithm: The epsilon-Greedy algorithm works by randomly oscillating between purely randomized experimentation and objective of maximizing profits. It gives importance to prior information with probability epsilon, which means that it randomly explores choices with probability 1- epsilon.

Softmax Algorithm:The Softmax algorithm tries to cope with choices differing in estimated value by explicitly incorporating erstwhile information about the available choices into its method by choosing the option to select when it explores. Suppose two arms, A and B, based on the past experiences have had two different rates of success: rA and rB (CTR). With those assumptions, the implementation of a Softmax algorithm would have you choose Arm A with probability exp(rA) / (exp(rA) + exp(rB)) and Arm B with probability exp(rB) / (exp(rA) + exp(rB)).

MULTI ARMED BANDIT FOR AD OPTIMISATION: For ad optimization, multi armed bandits are used to explore and exploit different ads to find out what batch of ads works the best, i.e., the winner set of ads. Not only it provides us with the optimal weights for ads but simultaneously studies how to grow customer engagement in the future by continuously learning from the past behavior. Implementation of multi armed bandit model requires the probabilities of each ad getting clicked upon an impression where Click Through Rate (CTR) of each ad provides us with that probability. Algorithm is applied iteratively so that all possible combinations can be considered and combination with highest reward is then chosen for further analysis. By reward we mean stochastic rewards from the ads, +1 reward for success and 0 reward for failure.


Greedy Algorithm works randomly without considering historical data while exploring. Now, random experimentation will seem like a great approach but it can lose a lot of money for a business, because it means you’re as likely to try a bad idea, as you are to try a good idea. Also this algorithm tends to converge towards a particular ad quite soon. In contrast, softmax selects and explores ads entirely on the basis of their previously obtained CTRs, which seems like a more realistic approach. On a client’s dataset, weights obtained from this algorithm are depicting an optimum balance between exploration and exploitation.

Multi-armed bandits are thus quite efficient because they move traffic towards winning ads progressively by making use of information obtained in the previous runs in future iterations to land up at the winner set of ads, instead of forcing to wait for a “final answer” at the end of an experiment. They’re faster because resources that would have gone to obviously inferior ads can be assigned to potential winners. Further, extra data collected on the high-performing ads can help separate the “good” ads from the “best” ones more quickly.

Finally, on the basis of the weights procured using MAB along with other constraints like minimum exposure required for each ad and fixed total budget, budget is then allocated across ads using linear optimization. On our client’s data set we found that overall Clicks could have been 13.33% (7560) more if investment across ads was made according to the weights indicated by MAB.