In last week's blog, we started demystifying the wonderful world of Machine Learning. In this blog we look at how Decision Trees help make sense of complexity.
This morning Mbongeni, our gardener, asked me quite nonchalantly for money to set up a small-scale gold mining business in Zimbabwe. Immediately, a number of questions raced through my mind:
- Does Mbongeni know anything about gold mining?
- Will his business be legal?
- Will I receive a return?
- Will the garden fall into ruin while he’s off striking it rich?
You can imagine how these questions, each with a ‘Yes’ or ‘No’ answer, can be strung together into a form that looks surprisingly like an (upside-down) tree, with a root and several branches, each ending in a leaf.
The Investment Decision: Mbongeni's El Dorado Mine
We are beset by decisions all day long - from where to go on holiday and what to wear in the morning to whether to invest in a new venture, such as Mbongeni's El Dorado Mine. We routinely reduce these to a set of simple decisions. Why? Because it’s easier that way.
Decision Trees are one of the most widely used machine learning methods. Besides being versatile enough to tackle a broad range of data problems, they lend themselves to simple graphical representations that require no statistical expertise to understand.
Just as in the example above, where I considered a number of questions (call them features) before deciding whether or not to invest in Mbongeni’s El Dorado Mine, Decision Trees step sequentially through a range of features that ultimately inform a decision.
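To make the idea concrete, here is a minimal Python sketch of the investment decision as a hand-built tree. The order of the questions, the decision at each leaf and the names `TREE` and `decide` are all assumptions made for illustration; a real Decision Tree would learn this structure from data rather than have it written out by hand.

```python
# A hand-built tree for the El Dorado Mine decision (illustrative only).
# Each internal node asks one yes/no question; each leaf is a decision string.
TREE = {
    "question": "knows_gold_mining",
    "no": "don't invest",
    "yes": {
        "question": "business_is_legal",
        "no": "don't invest",
        "yes": {
            "question": "return_expected",
            "no": "don't invest",
            "yes": "invest",
        },
    },
}


def decide(node, answers):
    """Walk from the root to a leaf, following the answer at each node."""
    while isinstance(node, dict):
        branch = "yes" if answers[node["question"]] else "no"
        node = node[branch]
    return node


# Hypothetical answers: knowledgeable and legal, but no return expected.
answers = {
    "knows_gold_mining": True,
    "business_is_legal": True,
    "return_expected": False,
}
print(decide(TREE, answers))
```

Each dictionary is a decision node and each plain string is a leaf, so following the answers from the root always ends in exactly one decision.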
One common application of Decision Trees is found in credit scoring models, where a lender tries to predict whether or not an applicant will default on a loan. In doing so, the Decision Tree might evaluate features such as the applicant’s credit history, account balance, term of loan, demographics, online profile and so on.
Decision Trees grow
Our gardener Mbongeni would tell you that real trees grow. Decision Trees are no different. The algorithm works down each branch, creating a decision node at each selected feature and so splitting the data into smaller and smaller subsets. Obviously this could go on for quite a while, so how does the algorithm know when to stop?
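A single split, and then a split of a split, can be sketched in a few lines of Python. The loan records below are made up purely for illustration, and the field names (`good_history`, `long_term`, `defaulted`) and the helper `split` are hypothetical.

```python
def split(rows, feature):
    """Partition rows into (yes, no) subsets on a True/False feature."""
    yes = [row for row in rows if row[feature]]
    no = [row for row in rows if not row[feature]]
    return yes, no


# Made-up loan applicants, for illustration only.
applicants = [
    {"good_history": True, "long_term": False, "defaulted": False},
    {"good_history": True, "long_term": False, "defaulted": False},
    {"good_history": True, "long_term": True, "defaulted": True},
    {"good_history": True, "long_term": False, "defaulted": False},
    {"good_history": False, "long_term": True, "defaulted": True},
    {"good_history": False, "long_term": False, "defaulted": True},
]

# First decision node: split everyone on credit history.
good, bad = split(applicants, "good_history")

# Second decision node, one branch down: split the 'good' subset again.
good_long, good_short = split(good, "long_term")

print(len(applicants), len(good), len(bad), len(good_long), len(good_short))
```

Every split hands a smaller subset to the next node down, which is why the tree needs some rule for deciding when to stop.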
As we saw in our previous blog on Cluster Analyses, one commonly used splitting criterion seeks to create homogeneous groups after each split. The algorithm stops when a further split yields no appreciable increase in homogeneity.
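One common way to put a number on homogeneity is Gini impurity, which is 0 for a perfectly uniform group and rises as the classes mix. This blog doesn't commit to a particular measure, so Gini is an assumption here, as are the function names, the labels and the `min_gain` threshold of 0.01 in this rough sketch.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when every label is the same, higher when mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gain(parent, left, right):
    """Drop in impurity from splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


def worth_splitting(parent, left, right, min_gain=0.01):
    """Stopping rule: only split if homogeneity improves appreciably."""
    return gain(parent, left, right) > min_gain


# Made-up outcomes for four borrowers.
parent = ["default", "default", "repaid", "repaid"]

# A split that separates the two groups cleanly is worth making...
print(worth_splitting(parent, ["default", "default"], ["repaid", "repaid"]))

# ...whereas one that leaves both halves just as mixed is not.
print(worth_splitting(parent, ["default", "repaid"], ["default", "repaid"]))
```

The first candidate split removes all of the impurity, so the tree keeps growing; the second removes none, so the branch stops there.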
So there you have it. If one day you come across an ex-gardener in a private jet with gold-plated trimmings and Cristal champagne on tap, you’ll know it was all down to Machine Learning and Decision Trees.