Introducing GRANT (pt 1)

The behaviour of Tree Ensemble models, like Random Forest and Gradient Boosting, is much easier to explain than is commonly understood.

One of the barriers to understanding the behaviour of a Tree Ensemble is that they are usually pictured as a collection of decision trees. 

While this is an accurate depiction of their architecture, it’s helpful to think of each split of the decision tree as splitting the answer space into ever smaller segments. 

Obviously this method doesn’t allow us to describe the behaviour of a model with more than two explanatory variables. But it does make clear that the decision boundaries have a rectangular shape and that they are contiguous, non-overlapping, and cover the whole answer space from -infinity to +infinity for every explanatory variable.
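To make the rectangular partitioning concrete, here is a minimal sketch (using scikit-learn as an assumed framework; the `leaf_rectangles` helper is hypothetical, not part of GRANT) that walks a fitted decision tree and lists the rectangle each leaf covers, from -infinity to +infinity in every explanatory variable:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data with two explanatory variables.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = X[:, 0] + np.sin(X[:, 1])

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
t = tree.tree_

def leaf_rectangles(node=0, lo=(-np.inf, -np.inf), hi=(np.inf, np.inf)):
    """Walk the fitted tree, tracking the (lo, hi] bounds each split imposes."""
    if t.children_left[node] == -1:                  # leaf: one rectangle
        return [(lo, hi, float(t.value[node][0][0]))]
    f, thr = t.feature[node], t.threshold[node]
    hi_l = list(hi); hi_l[f] = min(hi[f], thr)       # left child: x[f] <= thr
    lo_r = list(lo); lo_r[f] = max(lo[f], thr)       # right child: x[f] > thr
    return (leaf_rectangles(t.children_left[node], lo, tuple(hi_l))
            + leaf_rectangles(t.children_right[node], tuple(lo_r), hi))

for lo, hi, pred in leaf_rectangles():
    print(lo, hi, round(pred, 3))
```

The rectangles printed are contiguous, non-overlapping, and between them cover the whole answer space, exactly as described above.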

Criticisms of Explainable AI (XAI)

In Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, Cynthia Rudin correctly identifies the problems with the current state of XAI, but makes two mistakes in arguing that uninterpretable modelling techniques shouldn’t be used for important decisions.

Firstly, the claim that “it is a myth that there is necessarily a trade-off between accuracy and interpretability” is overstated. It is true that regression models are a good choice for problems that are well described by the explanatory data, and that research is continuing into more complex, interpretable models. But the results of countless data science competitions have shown that, today, “black box” models are usually superior when accuracy matters.

When describing complex techniques, Rudin uses the “this looks like that” network as an example of a method to produce non-post hoc interpretations. This architecture is interesting, but isn’t necessarily explainable – there is no guarantee that the model doesn’t have unknown sensitivities that may result in unexpected behaviour. If there were a poor outcome as a result of such a model, it may not be possible to satisfactorily answer how the model arrived at such a conclusion.

Wagtail Labs favours a pragmatic approach to modelling. Analysis should start with a first-principles view (e.g. trying to solve it with physics, engineering, etc.), because this provides the most detailed knowledge of the problem. Statistical models (e.g. regressions) should then be used if all the necessary explanatory data isn’t accurately measured or the relationships are uncertain. Machine Learning should be applied in cases where the functional form of the problem is complex (i.e. highly interacting with significant non-linearity) or statistical models aren’t yielding sufficient accuracy. The Machine Learning models should then be explained in a way that allows a user, although not necessarily a lay user, to understand the model’s sensitivities and weaknesses. Lastly, insights should be extracted from the statistical and machine learning approaches and included with the first-principles body of knowledge.

The second claim that “Explainable ML methods provide explanations that are not faithful to what the original model computes” is the subject of this post.

Introducing GRANT

There is a huge gap between the quality of explanations for Decision Tree and Tree Ensemble models. While the behaviour of Decision Trees is well described, most techniques for Tree Ensembles use approximations that oversimplify the model. To address this, Wagtail Labs has created GRANT, a collection of five new techniques that allows the behaviour of Tree Ensemble models to be completely explained.


Graft

Graft converts a Tree Ensemble into a series of rules that completely cover all possible predictions, which allows a user to explore the model in detail and not be surprised by unexpected predictions. This is important because Tree Ensembles create prediction boundaries that are not covered by training data, which may not align with expectation.

While there have been many discussions on how to extract the rules of a Decision Tree, it seems that only inTrees has attempted to combine the rules from a Tree Ensemble into a coherent explanation. The weakness of inTrees is that the results are displayed as code, which complicates understanding the vast number of rules created by a Tree Ensemble.

In contrast, Graft returns a dataframe. This format allows a user to easily explore the rules and answer questions about the prediction boundaries that are otherwise difficult, such as “at what temperature do the predictions stop changing?” or “what is the lowest possible prediction for a Saturday in October?”. It also makes it easier to find anomalies; for example, Graft showed that a test model always returns the same result, 1280.823161, after 23:25.

A Tree Ensemble has an enormous number of rules, potentially (number of leaves per tree)^(number of trees), because predictions are the overlapping area from one leaf of each tree. The sheer volume of potential rules means that an efficient algorithm is needed to guarantee that all have been uncovered. Graft uses a novel iterative search algorithm to efficiently find every rule.
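As a back-of-the-envelope check on that bound (a sketch, not GRANT code), the toy model used in the code walkthrough gives:

```python
# Upper bound on distinct rules: every prediction is the intersection of
# one leaf from each tree, so with L leaves per tree and T trees there
# are at most L ** T overlapping regions.
leaves_per_tree, n_trees = 2 ** 10, 2   # two trees, max depth ten
print(leaves_per_tree ** n_trees)       # 1048576 potential regions
```

Most of those intersections are empty, which is why the walkthrough model realises far fewer actual decision boundaries than the bound suggests.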

Consider a model with three training records:

And a Random Forest trained on the data, producing three trees:

First, the edges of the decision boundary that extends to -infinity for every explanatory variable are found:

Then each edge is tested to find the adjacent decision boundaries:

And the process is repeated until no more boundaries are found:
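In one dimension the steps above reduce to walking from edge to edge. A sketch with hypothetical helper names (GRANT’s actual search works across all explanatory variables simultaneously):

```python
# 1-D sketch of Graft's iterative search: start from the rule covering
# -inf, find its upper edge, then probe for the adjacent rule, repeating
# until a rule extends to +inf.
def find_rules_1d(upper_edge_of_rule_at):
    rules, lo = [], float("-inf")
    while True:
        hi = upper_edge_of_rule_at(lo)   # where does this rule stop?
        rules.append((lo, hi))
        if hi == float("inf"):           # reached the last rule
            return rules
        lo = hi                          # step into the adjacent rule

# A toy model whose predictions change at 2.0 and 5.0:
edges = [2.0, 5.0]
def upper_edge(lo):
    return next((e for e in edges if e > lo), float("inf"))

print(find_rules_1d(upper_edge))   # [(-inf, 2.0), (2.0, 5.0), (5.0, inf)]
```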

The output is returned in a dataframe showing the low (<x) and high (x<=) values of each explanatory variable in the decision boundary:
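The post does not show the exact schema, but a Graft-style dataframe might look like the following sketch (column names and values are hypothetical), in which questions such as “what is the lowest prediction when y > 2?” become simple filters:

```python
import pandas as pd

# One row per decision boundary: exclusive lower bound (lo) and inclusive
# upper bound (hi) for each explanatory variable, plus the prediction.
rules = pd.DataFrame({
    "x_lo": [float("-inf"), 1.0, float("-inf"), 1.0],
    "x_hi": [1.0, float("inf"), 1.0, float("inf")],
    "y_lo": [float("-inf"), float("-inf"), 2.0, 2.0],
    "y_hi": [2.0, 2.0, float("inf"), float("inf")],
    "prediction": [10.0, 12.5, 11.0, 14.0],
})

# Lowest possible prediction among the rules where y > 2:
print(rules.loc[rules["y_lo"] >= 2.0, "prediction"].min())   # 11.0
```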

GRANT’s code walkthrough uses a toy model (a Random Forest of two trees with a max depth of ten, giving 2^10 = 1024 leaves on each tree) to demonstrate the concepts, and even this tiny model still has 17,322 separate decision boundaries.

But the exciting aspect of Graft is that it opens up a range of new possibilities, beyond just exploring the decision boundaries of a model.


Reassemble

Sometimes speed matters. When it does, searching through each tree of a Tree Ensemble can be slow, because each prediction requires (number of trees × depth of each tree) calculations.

Reassemble improves the performance of scoring a Tree Ensemble by taking the output of Graft and using it to create a single, optimised Decision Tree that produces the same result.

The first step is to find the split that most evenly divides the Graft rules into two groups; in some cases this will mean that a single rule is found on both sides. The optimisation criterion is:

(abs(lt_len - gt_len) + 2)^2 + (lt_len + gt_len)^2

This maximises the evenness of the number of rules uniquely on one side of the split and minimises the number of rules that appear on both sides. lt_len is the number of rules on the less-than side and gt_len is the number of rules on the greater-than-or-equal side.
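Assuming the criterion is minimised over candidate splits, it can be sketched as (hypothetical function name):

```python
def split_cost(lt_len, gt_len):
    """Reassemble's split criterion as given in the post: lower is better.
    Rules straddling the split are counted on both sides, inflating the
    second term, so even, non-overlapping splits are preferred."""
    return (abs(lt_len - gt_len) + 2) ** 2 + (lt_len + gt_len) ** 2

# An even split of 10 rules with no straddlers beats an uneven one:
print(split_cost(5, 5) < split_cost(8, 2))   # True
```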
In the example above this would split on x=1:

This process is then recursively repeated for each group:

Until there is only a single rule in each:

Testing the Reassembled model has shown a performance boost of 15-20%:

And this performance boost can be increased by selectively removing unreachable rules. In some cases the Tree Ensemble may have created rules that cannot be reached during use, for example with historic timeseries data when the model will only be used to predict future events. In this case, the rules that ended in the past can be safely dropped before Reassembling – providing an additional performance boost:
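Dropping past-only rules before Reassembling can be sketched as follows (hypothetical column names; assumes a “day” explanatory variable measured as days since some epoch):

```python
import pandas as pd

# Hypothetical Graft rules over a "day" feature.
rules = pd.DataFrame({"day_lo": [float("-inf"), 100.0, 200.0],
                      "day_hi": [100.0, 200.0, float("inf")],
                      "prediction": [3.1, 7.4, 2.2]})

today = 150.0
# Rules whose entire day range is in the past can never match future data:
future_rules = rules[rules["day_hi"] > today]
print(len(future_rules))   # 2 of the 3 rules survive
```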

Amalgamate

The concept of reducing the complexity of Tree Ensemble models by cutting branches off each tree (a.k.a. pruning) is well established. Typically this is achieved by using a criterion to determine the homogeneity of each split in each tree, dropping the branches below those that meet a user-defined benchmark. This approach works well, but doesn’t guarantee that useful splits aren’t lost, or allow a user to make an informed tradeoff between accuracy and complexity.

Amalgamate takes a bottom-up approach to pruning, finding and merging adjacent Graft rules that form a rectangular boundary and whose predictions differ by less than a user-defined threshold.
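The merge test can be sketched as follows (a hypothetical implementation, not GRANT’s code): two axis-aligned rules union to a rectangle only if they agree on every variable except one, along which they must share an edge.

```python
def can_merge(a, b, threshold):
    """Each rule is (bounds, prediction), where bounds maps each feature
    to its (lo, hi) interval. Returns True if the two rules may be merged
    without breaking the rectangular-boundary or prediction tests."""
    if abs(a[1] - b[1]) >= threshold:     # predictions differ too much
        return False
    differing = [f for f in a[0] if a[0][f] != b[0][f]]
    if len(differing) != 1:               # union would not be a rectangle
        return False
    (a_lo, a_hi), (b_lo, b_hi) = a[0][differing[0]], b[0][differing[0]]
    return a_hi == b_lo or b_hi == a_lo   # adjacent along that feature

green = ({"x": (0.0, 1.0), "y": (0.0, 2.0)}, 10.0)
brown = ({"x": (1.0, 2.0), "y": (0.0, 2.0)}, 10.2)
print(can_merge(green, brown, threshold=0.5))   # True
```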

For example, it may be that green and brown are so close that, to the user, the difference is trivial, and the boundary of the two rules is still a rectangle:

In this case we can merge the two boundaries into a single green/brown rule without significantly compromising the behaviour of the model:

Then the process can be repeated:

And so on, until the rules all either fail the rectangular boundary or prediction threshold tests.

The code walkthrough example first tests Amalgamate with a very low threshold, but this does little to reduce complexity.

An increased threshold though does reduce the complexity of the model by around half, while only increasing the error rate on the validation set by about 5%.

However, increasing the threshold further provides only a limited additional decrease in complexity. This suggests a natural plateau in the behaviour of the model has been found. Curiously, the validation scores improve marginally.

Amalgamate is an important technique because it allows for a direct complexity/accuracy tradeoff to be made. By increasing or decreasing the threshold, a user is able to have an informed conversation with stakeholders on the relative merits of more complex models.


Next steps for GRANT

While GRANT contains both Restructure and Assess techniques, this post only covers the former. A follow up post will cover the remaining techniques.

Wagtail Labs is actively developing GRANT by extending the range of supported languages, frameworks, and layers, as well as improving user experience through code performance and by creating new visualisation tools. Follow us on GitHub, Twitter, and wagtaillabs.com.

6 replies on “Introducing GRANT (pt 1)”

Hi,

Thanks for sharing this. I think this is an interesting approach; however, I am curious as to what the benefit of using this method is versus using the original Random Forest?

You mention that even for a small RF with 2 trees, this method generates 17,322 rules for the different decision boundaries. To Rudin’s point that you mention in the intro — yes this is faithful to the original model but is reading a 17k rule list any more interpretable than the Random Forest?

My understanding is that to understand the prediction for an individual point I could look at the single rule that applies, however to understand the global structure of the model wouldn’t I need to be reading thousands of rules?

Thanks for your comments.

I think the important point here is not so much that the Tree Ensemble becomes easy to interpret, but that Graft makes it possible to explore the entire model. We can be honest about the complexity of the tools we’re wielding, and if a decision maker isn’t able to accept the model in all its detail, then a complexity / accuracy tradeoff needs to be made. Perhaps that means using a simpler technique (like a regression), or perhaps it means using Amalgamate to selectively prune rules.

For problems that are simple enough for a human to completely interpret the global structure my opinion is that Tree Ensembles should only be used to identify structures and relationships. They do an excellent job of modelling highly non-linear, highly interacting problems, and we can use tools like Neighbour Sensitivity and Training Delta (follow up post coming soon) to extract the insights that allow simpler techniques to have better results.

Give me a few weeks and I’ll write up a more detailed answer, but I’m definitely happy to keep chatting in the meantime 🙂

Hi – It seems you have misquoted me. You wrote that I “make two mistakes in arguing that simple modeling techniques should always be used for important decisions.” In fact I *never* stated this, nor would I ever say this. I said we should use interpretable models, not simple models. Perhaps you missed my section on interpretable neural networks? I like your approach, though it has the disadvantage that it is built from heuristics (it is not globally optimal). This leads to the possibility that a much simpler model exists with the same performance. But it seems like a reasonable place to start to build interpretable models from more complex models.

Firstly, my apologies. I have re-read your article and I did misquote you. I have amended the paragraphs where I discuss your work and would be happy to revise further if you feel that I am not representing it correctly.

I assume that the heuristics you’re referring to are the machine learning techniques used to build the model (I contend that, unlike most XAI techniques, Graft itself isn’t a heuristic because it finds all possible outcomes, although I accept that I am yet to prove this). If my assumption is correct, then of course I completely agree with you. The random nature of these techniques cannot guarantee a superior outcome and it is definitely possible that a simpler model may yield the same performance.

My preferred approach is to start with the simplest models and to only accept more complex ones if they have merit (i.e. the improvement in accuracy is worth the loss of interpretability). It is a failure of the practitioner if a simpler model with suitable accuracy exists and isn’t found.

Right now I’m working on techniques that allow the complexity of a model to be dialled back, but am exploring a middle ground. Hopefully I’ll have it ready to put out there soon.
