How we use trees to predict customer behaviour: random forests in action
To gain deep business insights, we need to understand how customers will behave. This means decoding whatever information we have about a customer to predict their needs and, ultimately, whether they will buy.
How can we look at all the available information and connect to a clear signal, filtering out everything that’s irrelevant to us? Intuitively, we understand that a salesperson can classify prospective customers based on cues such as the clothes they are wearing, their apparent age and other factors.
- But how do we formalise these heuristic methods in a way that a computer could emulate?
- And how do we let the data reveal which features or combination of features have the greatest predictive value for buying a new car or booking a holiday to the Bahamas?
The random forest algorithm solves both of these problems.
To understand how this algorithm works, think of the two-player board game Guess Who?, popular in the 1980s. It shows named cartoon images of 24 fictional people, from which each player selects one. Then by taking turns to ask questions with a ‘yes’ or ‘no’ answer, they attempt to guess the identity of their opponent’s character. Whoever manages to do this first is the winner.
Guess Who? is very simple to understand, and children quickly pick up some of its subtler nuances through experience. Although all the questions they ask must have a binary outcome of yes or no, some will be more valuable than others. For example, you could start the game by asking: “Is this person Susan?”. If by some miraculous stroke of luck you were correct, you would immediately win the game. But far more likely, it would turn out to be a terrible strategy, because a negative answer would allow you to eliminate just one character out of 24.
A wiser initial question might be: “Is this person blond?”. No matter what the answer, you would be able to eliminate either all the characters who are blond or all those who are not. For the sake of argument, let’s assume that 20% of individuals are blond and 80% have some other hair colour. If you are lucky, your opponent’s character is blond and you just need to sift through the remaining 20%. If unlucky, then this wasn’t a very good question to start with, since there is still 80% of the data left to go through. It quickly becomes clear that the optimal question is one whose answer would split the data evenly into 50% yes and 50% no.
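This intuition can be checked with a quick back-of-the-envelope calculation (a sketch, not part of the game itself): the expected number of characters still in play after a yes/no question depends only on the fraction who have the attribute you asked about.

```python
# Expected number of characters still in play after a yes/no question,
# given the fraction p of characters who have the attribute asked about.
# With probability p the answer is "yes" (p * total remain); with
# probability 1 - p it is "no" ((1 - p) * total remain).
def expected_remaining(p, total=24):
    return (p * p + (1 - p) * (1 - p)) * total

print(round(expected_remaining(0.5), 2))     # even split      -> 12.0
print(round(expected_remaining(0.2), 2))     # "blond" split   -> 16.32
print(round(expected_remaining(1 / 24), 2))  # "Is it Susan?"  -> 22.08
```

The even 50:50 question leaves the fewest characters on average, which is exactly why it is the optimal opener.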
The second subtlety is that the order in which you ask the questions matters, and the answers revealed by each will influence what your next optimal question will be. Imagine that in the previous round you asked if the person has brown eyes and the answer was ‘yes’. If there is a 50:50 split between brown-eyed people who wear glasses and brown-eyed people who do not, then asking if this person wears glasses would be a good choice for your next question. If, on the other hand, the answer to the previous round was ‘no’ and only 10% of the non-brown-eyed people wear glasses, asking if they wear glasses is not a wise move.
Making predictions on new data
This method of repeatedly splitting the data by finding the optimal split at each stage is known as a decision tree. Visually speaking, the data starts out at the trunk of the tree and every split sends two branches of data off in different directions. When our goal is classification rather than guessing a single character, “optimal” means something slightly different: the optimal split is the one that divides the data as purely as possible with regard to whatever it is we are trying to predict.
If we are looking for an algorithm that predicts whether or not someone is going to buy a certain product, we find the variables that can most accurately split our training data into buyers and non-buyers.
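The idea of split purity can be made concrete with, for example, Gini impurity (one common choice; the text above does not name a specific measure):

```python
# Gini impurity: the probability that two labels drawn at random
# from a group disagree. A pure group scores 0.
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Weighted impurity of a candidate split; lower means purer.
def split_score(left, right):
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

# A split that cleanly separates buyers ("B") from non-buyers ("N"):
print(split_score(["B", "B", "B"], ["N", "N", "N"]))  # -> 0.0
# A useless split that leaves both sides mixed:
print(split_score(["B", "N"], ["B", "N"]))            # -> 0.5
```

A tree-building algorithm simply tries every available variable, scores each candidate split this way, and keeps the one with the lowest score.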
A single decision tree is simple to use, but it is prone to overfitting. This means that although it will do an excellent job of classifying the data you train it on, it will do poorly when trying to make predictions on new data.
Random forest algorithm
The random forest algorithm remedies this by creating many decision trees, each trained on a subset of all the available data and only able to use a subset of all possible features in the data to make its splits. When new data is presented, each tree in the forest gets to vote on which class it believes a given data point belongs to. Whichever class gets the most votes is the prediction made by the random forest algorithm as a whole. The beauty of the method is that it is conceptually easy and it tends to perform very well.
We are better able to understand each individual consumer and know what he or she wants – perhaps even before they do.