Specifying models using model builder
Treelite supports loading models from major tree libraries, such as XGBoost and scikit-learn. However, you may want to use models trained by other tree libraries that Treelite does not directly support. The model builder covers this use case.
What is the model builder?
The ModelBuilder class is a tool used to specify decision tree ensembles programmatically.
Example: Regressor
Consider the following regression model, consisting of two trees:
Note
Provision for missing data: default directions
Decision trees in Treelite accommodate missing data by indicating a default direction for every test node. In the diagram above, the default direction is indicated by the label “Missing.” For instance, the root node of the first tree shown above sends all data points that lack a value for feature 0 to the left.
For now, let’s assume that we’ve somehow found optimal choices of default directions at training time. For details on how to actually decide default directions, see Section 3.4 of the XGBoost paper.
Let us construct this ensemble using the model builder. The first step is to assign a unique integer key to each node. In the following diagram, integer keys are indicated in red. Note that integer keys need to be unique only within the same tree.
Next, we create a model builder object by calling the constructor for ModelBuilder with some model metadata.
import treelite
from treelite.model_builder import (
    Metadata,
    ModelBuilder,
    PostProcessorFunc,
    TreeAnnotation,
)

builder = ModelBuilder(
    threshold_type="float32",
    leaf_output_type="float32",
    metadata=Metadata(
        num_feature=3,
        task_type="kRegressor",  # Regression model
        average_tree_output=False,
        num_target=1,
        num_class=[1],  # Set num_class=[1] for a regression model
        leaf_vector_shape=(1, 1),  # Each tree outputs a scalar
    ),
    # Every tree generates output for target 0, class 0
    tree_annotation=TreeAnnotation(num_tree=2, target_id=[0, 0], class_id=[0, 0]),
    # The link function for the output is the identity function
    postprocessor=PostProcessorFunc(name="identity"),
    # Add this value to all outputs. Also known as the intercept.
    base_scores=[0.0],
)
The model generates output for a single output target, so we set num_target=1. Also, the model produces a continuous output, so we set num_class=[1]. num_class is an array because each output target may have a different number of classes. We will later look at an example where a model produces multiple output targets.
For the tree_annotation field, specify the number of trees you will build via the num_tree argument. Set target_id=[0] * num_tree and class_id=[0] * num_tree, since each tree produces output for Target 0, Class 0. Later, we will look at an example where the tree model produces outputs for multiple targets and multiple classes.
With the builder object, we are now ready to construct the trees.
# Tree 0
builder.start_tree()
# Tree 0, Node 0
builder.start_node(0)
builder.numerical_test(
    feature_id=0,
    threshold=5.0,
    default_left=True,
    opname="<",
    left_child_key=1,
    right_child_key=2,
)
builder.end_node()
# Tree 0, Node 1
builder.start_node(1)
builder.numerical_test(
    feature_id=2,
    threshold=-3.0,
    default_left=False,
    opname="<",
    left_child_key=3,
    right_child_key=4,
)
builder.end_node()
# Tree 0, Node 2
builder.start_node(2)
builder.leaf(0.6)
builder.end_node()
# Tree 0, Node 3
builder.start_node(3)
builder.leaf(-0.4)
builder.end_node()
# Tree 0, Node 4
builder.start_node(4)
builder.leaf(1.2)
builder.end_node()
builder.end_tree()

# Tree 1
builder.start_tree()
# Tree 1, Node 0
builder.start_node(0)
builder.numerical_test(
    feature_id=1,
    threshold=2.5,
    default_left=False,
    opname="<",
    left_child_key=1,
    right_child_key=2,
)
builder.end_node()
# Tree 1, Node 1
builder.start_node(1)
builder.leaf(1.6)
builder.end_node()
# Tree 1, Node 2
builder.start_node(2)
builder.numerical_test(
    feature_id=2,
    threshold=-1.2,
    default_left=True,
    opname="<",
    left_child_key=3,
    right_child_key=4,
)
builder.end_node()
# Tree 1, Node 3
builder.start_node(3)
builder.leaf(0.1)
builder.end_node()
# Tree 1, Node 4
builder.start_node(4)
builder.leaf(-0.3)
builder.end_node()
builder.end_tree()
It is important to declare the start and end of each tree and node by calling the start_* and end_* methods. Failure to do so will generate an error.
Note
The first node is assumed to be the root node
You may specify the nodes in a tree in an arbitrary order. There is one requirement, however: the first node to be specified is always assumed to be the root node. In the example above, node 0 is the root node because it is specified first.
We are now done building the member trees. The last step is to call commit() to finalize the ensemble into a Model object:
# Finalize and obtain Model object
model = builder.commit()
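The result is an ordinary treelite.Model, no different from one loaded from XGBoost or scikit-learn. For example, assuming your Treelite version provides the checkpoint API (Model.serialize / Model.deserialize), it can be saved to disk and loaded back; this is a minimal sketch with an arbitrary file name:
# Hypothetical file name; serialize() writes a Treelite checkpoint file
model.serialize("regressor.tl")
restored = treelite.Model.deserialize("regressor.tl")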
Let’s inspect the content of the model by looking at its JSON dump:
print(model.dump_as_json())
which produces
{
    "num_feature": 3,
    "task_type": "kRegressor",
    "average_tree_output": false,
    "num_target": 1,
    "num_class": [1],
    "leaf_vector_shape": [1, 1],
    "target_id": [0, 0],
    "class_id": [0, 0],
    "postprocessor": "identity",
    "sigmoid_alpha": 1.0,
    "ratio_c": 1.0,
    "base_scores": [0.0],
    "attributes": "{}",
    "trees": [{
        "num_nodes": 5,
        "has_categorical_split": false,
        "nodes": [{
            "node_id": 0,
            "split_feature_id": 0,
            "default_left": true,
            "node_type": "numerical_test_node",
            "comparison_op": "<",
            "threshold": 5.0,
            "left_child": 1,
            "right_child": 2
        }, {
            "node_id": 1,
            "split_feature_id": 2,
            "default_left": false,
            "node_type": "numerical_test_node",
            "comparison_op": "<",
            "threshold": -3.0,
            "left_child": 3,
            "right_child": 4
        }, {
            "node_id": 2,
            "leaf_value": 0.6000000238418579
        }, {
            "node_id": 3,
            "leaf_value": -0.4000000059604645
        }, {
            "node_id": 4,
            "leaf_value": 1.2000000476837159
        }]
    }, {
        "num_nodes": 5,
        "has_categorical_split": false,
        "nodes": [{
            "node_id": 0,
            "split_feature_id": 1,
            "default_left": false,
            "node_type": "numerical_test_node",
            "comparison_op": "<",
            "threshold": 2.5,
            "left_child": 1,
            "right_child": 2
        }, {
            "node_id": 1,
            "leaf_value": 1.600000023841858
        }, {
            "node_id": 2,
            "split_feature_id": 2,
            "default_left": true,
            "node_type": "numerical_test_node",
            "comparison_op": "<",
            "threshold": -1.2000000476837159,
            "left_child": 3,
            "right_child": 4
        }, {
            "node_id": 3,
            "leaf_value": 0.10000000149011612
        }, {
            "node_id": 4,
            "leaf_value": -0.30000001192092898
        }]
    }]
}
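Since dump_as_json() returns a valid JSON string, you can also inspect the model programmatically. A small sketch using Python's standard json module (the expected values come from the dump above):
import json

doc = json.loads(model.dump_as_json())
print(doc["num_feature"])            # 3
print(len(doc["trees"]))             # 2
print(doc["trees"][0]["num_nodes"])  # 5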
We can also pass in some test data for prediction:
import numpy as np

X = np.array(
    [
        [0.0, 0.0, -5.0],
        [0.0, 0.0, -2.0],
        [0.0, 0.0, 1.0],
        [0.0, 5.0, -5.0],
        [0.0, 5.0, -2.0],
        [0.0, 5.0, 1.0],
        [10.0, 0.0, -5.0],
        [10.0, 0.0, -2.0],
        [10.0, 0.0, 1.0],
        [10.0, 5.0, -5.0],
        [10.0, 5.0, -2.0],
        [10.0, 5.0, 1.0],
    ],
    dtype=np.float32,
)
print(treelite.gtil.predict(model, X))
[[ 1.2 ]
[ 2.8000002 ]
[ 2.8000002 ]
[-0.3 ]
[ 1.3000001 ]
[ 0.90000004]
[ 2.2 ]
[ 2.2 ]
[ 2.2 ]
[ 0.70000005]
[ 0.70000005]
[ 0.3 ]]
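Recall the earlier note about default directions: at prediction time a missing value is encoded as NaN, and a data point with a missing feature is routed in the test node's default direction. A small sketch, assuming GTIL's NaN convention for missing values:
# Feature 0 is missing. Tree 0's root has default_left=True, so the data
# point goes left (node 1); feature 2 = 0.0 fails "< -3.0", giving leaf 1.2.
# Tree 1: feature 1 = 0.0 passes "< 2.5", giving leaf 1.6. Sum = 2.8.
X_missing = np.array([[np.nan, 0.0, 0.0]], dtype=np.float32)
print(treelite.gtil.predict(model, X_missing))  # ~[[2.8]]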
Example: Binary classifier
In the first example, we simply added the output of each tree to obtain the final prediction. Summing the tree outputs is sufficient for regression models, where the target variable is a real value.
In this example, let’s look at binary classifiers, where the target variable is a binary label. We follow the common practice of producing a probability score in the range [0, 1] to indicate the relative strength of the positive and negative classes. (Scores close to 0 indicate a strong vote for the negative class; scores close to 1 indicate a strong vote for the positive class.)
To obtain probability scores, we pass the sum of the tree outputs through the link function sigmoid(x) = 1/(1 + exp(-x)). In the model builder API, the link function is specified by the postprocessor argument. (Consult List of postprocessor functions for the list of available postprocessors.)
Let’s look at how the builder object is constructed:
builder = ModelBuilder(
    threshold_type="float32",
    leaf_output_type="float32",
    metadata=Metadata(
        num_feature=3,
        task_type="kBinaryClf",
        average_tree_output=False,
        num_target=1,
        num_class=[1],
        leaf_vector_shape=(1, 1),
    ),
    # Every tree generates output for target 0, class 0
    tree_annotation=TreeAnnotation(num_tree=2, target_id=[0, 0], class_id=[0, 0]),
    # The link function for the output is the sigmoid function
    postprocessor=PostProcessorFunc(name="sigmoid"),
    # Add this value to all outputs. Also known as the intercept.
    base_scores=[0.0],
)
Note that we’ve also changed task_type to kBinaryClf.
Using the same definition for the two trees, we now obtain probability scores:
# Same tree construction logic as the first example
# ...
model = builder.commit()

X = np.array(
    [
        [0.0, 0.0, -5.0],
        [0.0, 0.0, -2.0],
        [0.0, 0.0, 1.0],
        [0.0, 5.0, -5.0],
        [0.0, 5.0, -2.0],
        [0.0, 5.0, 1.0],
        [10.0, 0.0, -5.0],
        [10.0, 0.0, -2.0],
        [10.0, 0.0, 1.0],
        [10.0, 5.0, -5.0],
        [10.0, 5.0, -2.0],
        [10.0, 5.0, 1.0],
    ],
    dtype=np.float32,
)
print(treelite.gtil.predict(model, X))
[[0.7685248 ]
[0.9426758 ]
[0.9426758 ]
[0.4255575 ]
[0.785835 ]
[0.7109495 ]
[0.90024954]
[0.90024954]
[0.90024954]
[0.6681878 ]
[0.6681878 ]
[0.5744425 ]]
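To see the link function at work, we can reproduce the first row by hand. For the input [0, 0, -5], Tree 0 routes to the leaf -0.4 (feature 0 < 5, then feature 2 < -3) and Tree 1 routes to the leaf 1.6 (feature 1 < 2.5), for a raw sum of 1.2:
# Applying the sigmoid to the raw margin reproduces the first row above
raw = -0.4 + 1.6  # leaf outputs from Tree 0 and Tree 1
print(1.0 / (1.0 + np.exp(-raw)))  # 0.76852477..., matching 0.7685248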
Example: multi-class classifier with vector leaf
Now let’s consider a multi-class classifier, where the target variable is a class label taking one of the values {0, 1, 2, ..., num_class - 1}. The tree model should now produce a 2D array of probability scores, where score[i, k] represents the i-th row’s probability score for class k.
For the sake of brevity, consider a multi-class classifier consisting of a single decision tree stump:
The model has a single output target, for which there are 3 possible class labels, so we set num_target=1 and num_class=[3]. To indicate that the tree outputs a vector of length 3, set leaf_vector_shape=(1, 3).
The softmax function softmax(x) = exp(x) / sum(exp(x)) is used as the link function, to convert the tree output to probability scores in the range [0, 1].
builder = ModelBuilder(
    threshold_type="float32",
    leaf_output_type="float32",
    metadata=Metadata(
        num_feature=1,
        task_type="kMultiClf",  # To indicate multi-class classification
        average_tree_output=False,
        num_target=1,
        num_class=[3],
        leaf_vector_shape=(1, 3),
    ),
    # Every tree generates probability scores for all classes, so class_id=-1
    tree_annotation=TreeAnnotation(num_tree=1, target_id=[0], class_id=[-1]),
    # The link function for the output is the softmax function
    postprocessor=PostProcessorFunc(name="softmax"),
    # base_scores must have length (num_target * max(num_class))
    base_scores=[0.0, 0.0, 0.0],
)
builder.start_tree()
builder.start_node(0)
builder.numerical_test(
    feature_id=0,
    threshold=0.0,
    default_left=True,
    opname="<",
    left_child_key=1,
    right_child_key=2,
)
builder.end_node()
builder.start_node(1)
builder.leaf([0.5, 0.5, 0.0])
builder.end_node()
builder.start_node(2)
builder.leaf([0.0, 0.0, 1.0])
builder.end_node()
builder.end_tree()
model = builder.commit()
X = np.array([[-1.0], [1.0]])
print(treelite.gtil.predict(model, X))
[[0.38365173 0.38365173 0.23269653]
[0.21194156 0.21194156 0.57611686]]
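We can verify the first row by hand: the input -1.0 satisfies feature 0 < 0.0, so the tree returns the leaf vector [0.5, 0.5, 0.0], and applying the softmax reproduces the probabilities above:
# softmax([0.5, 0.5, 0.0]) matches the first row of the prediction
leaf = np.array([0.5, 0.5, 0.0])
print(np.exp(leaf) / np.sum(np.exp(leaf)))  # [0.3837 0.3837 0.2327]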
Example: multi-class classifier with scalar leaf
It is also possible to build a multi-class classifier where each tree produces a scalar output: each class’s score is computed by summing the output of a subset of the decision trees. How do we know which decision tree contributes to which class? This is where TreeAnnotation becomes useful.
The class_id field in TreeAnnotation is assigned an array of integers, where class_id[i] gives the class toward which the i-th tree’s output counts. In the following example, the outputs of Trees 0, 1, and 2 count toward Classes 0, 1, and 2, respectively:
builder = ModelBuilder(
    threshold_type="float32",
    leaf_output_type="float32",
    metadata=Metadata(
        num_feature=1,
        task_type="kMultiClf",  # To indicate multi-class classification
        average_tree_output=False,
        num_target=1,
        num_class=[3],
        leaf_vector_shape=(1, 1),
    ),
    # Tree i produces score for class i
    tree_annotation=TreeAnnotation(
        num_tree=3,
        target_id=[0, 0, 0],
        class_id=[0, 1, 2],  # Tree i contributes towards the score of Class i
    ),
    # The link function for the output is the softmax function
    postprocessor=PostProcessorFunc(name="softmax"),
    # base_scores must have length (num_target * max(num_class))
    base_scores=[0.0, 0.0, 0.0],
)
In this example, we have three trees, and the tree at index i produces the score for class i. In general, we would use longer arrays for class_id to associate multiple decision trees with each class, as in the sketch below.
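For instance, a hypothetical 6-tree model that assigns trees to classes round-robin would be annotated as follows (a sketch, not part of this example):
# Hypothetical: trees 0 and 3 score Class 0, trees 1 and 4 score Class 1,
# and trees 2 and 5 score Class 2
TreeAnnotation(
    num_tree=6,
    target_id=[0] * 6,
    class_id=[0, 1, 2, 0, 1, 2],
)
Returning to the three-tree example, we construct the trees in a loop: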
for tree_id in range(3):
    builder.start_tree()
    builder.start_node(0)
    builder.numerical_test(
        feature_id=0,
        threshold=0.0,
        default_left=True,
        opname="<",
        left_child_key=1,
        right_child_key=2,
    )
    builder.end_node()
    builder.start_node(1)
    builder.leaf(0.5 if tree_id < 2 else 0.0)
    builder.end_node()
    builder.start_node(2)
    builder.leaf(1.0 if tree_id == 2 else 0.0)
    builder.end_node()
    builder.end_tree()
model = builder.commit()
X = np.array([[-1.0], [1.0]])
print(treelite.gtil.predict(model, X))
[[0.38365173 0.38365173 0.23269653]
[0.21194156 0.21194156 0.57611686]]
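As a sanity check, the scalar-leaf formulation reproduces the vector-leaf model from the previous example exactly. Assuming you kept that model under a hypothetical name model_vector, the predictions match:
# Per-class raw scores are [0.5, 0.5, 0.0] for x < 0 and [0.0, 0.0, 1.0]
# otherwise, same as the vector leaves, so the softmax outputs are identical
np.testing.assert_allclose(
    treelite.gtil.predict(model, X),
    treelite.gtil.predict(model_vector, X),
)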
Example: multi-target regressor
[To be added later]