Specifying models using model builder

Treelite supports loading models from major tree libraries, such as XGBoost and scikit-learn. However, you may want to use models trained by other tree libraries that are not directly supported by Treelite. The model builder is useful in this use case.

What is the model builder?

The ModelBuilder class is a tool used to specify decision tree ensembles programmatically.

Example: Regressor

Consider the following regression model, consisting of two trees:

[Figure: the two decision trees of the example regression model]

Note

Provision for missing data: default directions

Decision trees in Treelite accommodate missing data by indicating the default direction for every test node. In the diagram above, the default direction is indicated by the label “Missing.” For instance, the root node of the first tree shown above sends to the left all data points that lack a value for feature 0.

For now, let’s assume that we’ve somehow found optimal choices of default directions at training time. For details on how default directions are actually chosen, see Section 3.4 of the XGBoost paper.

Let us construct this ensemble using the model builder. The first step is to assign a unique integer key to each node. In the following diagram, integer keys are indicated in red. Note that integer keys need to be unique only within the same tree.

[Figure: the same two trees, with integer node keys shown in red]

Next, we create a model builder object by calling the constructor for ModelBuilder with some model metadata.

import treelite
from treelite.model_builder import (
  Metadata,
  ModelBuilder,
  PostProcessorFunc,
  TreeAnnotation,
)
builder = ModelBuilder(
  threshold_type="float32",
  leaf_output_type="float32",
  metadata=Metadata(
    num_feature=3,
    task_type="kRegressor",  # Regression model
    average_tree_output=False,
    num_target=1,
    num_class=[1],   # Set num_class=1 for regression model
    leaf_vector_shape=(1, 1),  # Each tree outputs a scalar
  ),
  # Every tree generates output for target 0, class 0
  tree_annotation=TreeAnnotation(num_tree=2, target_id=[0, 0], class_id=[0, 0]),
  # The link function for the output is the identity function
  postprocessor=PostProcessorFunc(name="identity"),
  # Add this value for all outputs. Also known as the intercept.
  base_scores=[0.0],
)

The model generates output for a single output target, so we set num_target=1. Also, the model produces a continuous output, so we set num_class=[1]. num_class is an array because each output target can have a different number of classes. We will later look at an example where a model produces multiple output targets.

For the tree_annotation field, specify the number of trees you will build via the num_tree argument. Set target_id=[0] * num_tree and class_id=[0] * num_tree, since each tree produces output for Target 0, Class 0, as shown in the snippet below. Later, we will look at an example where the tree model produces outputs for multiple targets and multiple classes.
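
For reference, the annotation used in this example can be written equivalently as follows (and then passed as the tree_annotation argument); the pattern scales to any number of trees:

from treelite.model_builder import TreeAnnotation

num_tree = 2  # This example has two trees
# Every tree contributes to Target 0, Class 0
tree_annotation = TreeAnnotation(
  num_tree=num_tree,
  target_id=[0] * num_tree,
  class_id=[0] * num_tree,
)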

With the builder object, we are now ready to construct the trees.

# Tree 0
builder.start_tree()
# Tree 0, Node 0
builder.start_node(0)
builder.numerical_test(
  feature_id=0,
  threshold=5.0,
  default_left=True,
  opname="<",
  left_child_key=1,
  right_child_key=2,
)
builder.end_node()
# Tree 0, Node 1
builder.start_node(1)
builder.numerical_test(
  feature_id=2,
  threshold=-3.0,
  default_left=False,
  opname="<",
  left_child_key=3,
  right_child_key=4,
)
builder.end_node()
# Tree 0, Node 2
builder.start_node(2)
builder.leaf(0.6)
builder.end_node()
# Tree 0, Node 3
builder.start_node(3)
builder.leaf(-0.4)
builder.end_node()
# Tree 0, Node 4
builder.start_node(4)
builder.leaf(1.2)
builder.end_node()
builder.end_tree()

# Tree 1
builder.start_tree()
# Tree 1, Node 0
builder.start_node(0)
builder.numerical_test(
  feature_id=1,
  threshold=2.5,
  default_left=False,
  opname="<",
  left_child_key=1,
  right_child_key=2,
)
builder.end_node()
# Tree 1, Node 1
builder.start_node(1)
builder.leaf(1.6)
builder.end_node()
# Tree 1, Node 2
builder.start_node(2)
builder.numerical_test(
  feature_id=2,
  threshold=-1.2,
  default_left=True,
  opname="<",
  left_child_key=3,
  right_child_key=4,
)
builder.end_node()
# Tree 1, Node 3
builder.start_node(3)
builder.leaf(0.1)
builder.end_node()
# Tree 1, Node 4
builder.start_node(4)
builder.leaf(-0.3)
builder.end_node()
builder.end_tree()

It is important to declare the start and end of each tree and node by calling the start_* and end_* methods. Failure to do so will generate an error.

Note

The first node is assumed to be the root node

You may specify the nodes in a tree in an arbitrary order. There is one requirement, however: the first node to be specified is always assumed to be the root node. In the example above, node 0 is the root node because it is specified first.
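
As an illustration, the following sketch builds a tree identical to Tree 0 above, with its nodes declared out of order (assuming a fresh builder constructed with the same metadata as before). Node 0 is still specified first, so it is taken as the root:

builder.start_tree()
# Node 0 comes first, so it becomes the root
builder.start_node(0)
builder.numerical_test(
  feature_id=0,
  threshold=5.0,
  default_left=True,
  opname="<",
  left_child_key=1,
  right_child_key=2,
)
builder.end_node()
# Leaf nodes may be declared before the test node that references them
builder.start_node(3)
builder.leaf(-0.4)
builder.end_node()
builder.start_node(4)
builder.leaf(1.2)
builder.end_node()
builder.start_node(2)
builder.leaf(0.6)
builder.end_node()
builder.start_node(1)
builder.numerical_test(
  feature_id=2,
  threshold=-3.0,
  default_left=False,
  opname="<",
  left_child_key=3,
  right_child_key=4,
)
builder.end_node()
builder.end_tree()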

We are now done building the member trees. The last step is to call commit() to finalize the ensemble into a Model object:

# Finalize and obtain Model object
model = builder.commit()

Let’s inspect the content of the model by looking at its JSON dump:

print(model.dump_as_json())

which produces

{
    "num_feature": 3,
    "task_type": "kRegressor",
    "average_tree_output": false,
    "num_target": 1,
    "num_class": [1],
    "leaf_vector_shape": [1, 1],
    "target_id": [0, 0],
    "class_id": [0, 0],
    "postprocessor": "identity",
    "sigmoid_alpha": 1.0,
    "ratio_c": 1.0,
    "base_scores": [0.0],
    "attributes": "{}",
    "trees": [{
            "num_nodes": 5,
            "has_categorical_split": false,
            "nodes": [{
                    "node_id": 0,
                    "split_feature_id": 0,
                    "default_left": true,
                    "node_type": "numerical_test_node",
                    "comparison_op": "<",
                    "threshold": 5.0,
                    "left_child": 1,
                    "right_child": 2
                }, {
                    "node_id": 1,
                    "split_feature_id": 2,
                    "default_left": false,
                    "node_type": "numerical_test_node",
                    "comparison_op": "<",
                    "threshold": -3.0,
                    "left_child": 3,
                    "right_child": 4
                }, {
                    "node_id": 2,
                    "leaf_value": 0.6000000238418579
                }, {
                    "node_id": 3,
                    "leaf_value": -0.4000000059604645
                }, {
                    "node_id": 4,
                    "leaf_value": 1.2000000476837159
                }]
        }, {
            "num_nodes": 5,
            "has_categorical_split": false,
            "nodes": [{
                    "node_id": 0,
                    "split_feature_id": 1,
                    "default_left": false,
                    "node_type": "numerical_test_node",
                    "comparison_op": "<",
                    "threshold": 2.5,
                    "left_child": 1,
                    "right_child": 2
                }, {
                    "node_id": 1,
                    "leaf_value": 1.600000023841858
                }, {
                    "node_id": 2,
                    "split_feature_id": 2,
                    "default_left": true,
                    "node_type": "numerical_test_node",
                    "comparison_op": "<",
                    "threshold": -1.2000000476837159,
                    "left_child": 3,
                    "right_child": 4
                }, {
                    "node_id": 3,
                    "leaf_value": 0.10000000149011612
                }, {
                    "node_id": 4,
                    "leaf_value": -0.30000001192092898
                }]
        }]
}

We can also pass in some test data for prediction:

import numpy as np

X = np.array(
    [
        [0.0, 0.0, -5.0],
        [0.0, 0.0, -2.0],
        [0.0, 0.0, 1.0],
        [0.0, 5.0, -5.0],
        [0.0, 5.0, -2.0],
        [0.0, 5.0, 1.0],
        [10.0, 0.0, -5.0],
        [10.0, 0.0, -2.0],
        [10.0, 0.0, 1.0],
        [10.0, 5.0, -5.0],
        [10.0, 5.0, -2.0],
        [10.0, 5.0, 1.0],
    ],
    dtype=np.float32
)
print(treelite.gtil.predict(model, X))
[[ 1.2       ]
 [ 2.8000002 ]
 [ 2.8000002 ]
 [-0.3       ]
 [ 1.3000001 ]
 [ 0.90000004]
 [ 2.2       ]
 [ 2.2       ]
 [ 2.2       ]
 [ 0.70000005]
 [ 0.70000005]
 [ 0.3       ]]
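
The note above described how missing values follow each node’s default direction. We can observe this by passing a row made entirely of missing values (encoded as NaN). Tracing the trees by hand: Tree 0 sends the row left at the root, then right, reaching the leaf 1.2; Tree 1 sends it right at the root, then left, reaching the leaf 0.1:

# A fully-missing row follows the default directions in both trees,
# so the prediction should be 1.2 + 0.1, approximately [[1.3]]
X_missing = np.array([[np.nan, np.nan, np.nan]], dtype=np.float32)
print(treelite.gtil.predict(model, X_missing))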

Example: Binary classifier

In the first example, we simply added the output of each tree to obtain the final prediction. Summing the tree outputs is sufficient for regression models, where the target variable is a real value.

In this example, let’s look at binary classifiers, where the target variable is a binary label. Following common practice, we produce a probability score in the range [0, 1] to indicate the relative strength of the positive and negative classes. (Scores close to 0 indicate a strong vote for the negative class; scores close to 1 indicate a strong vote for the positive class.)

To obtain probability scores, we pass the sum of the tree outputs through the sigmoid link function, sigmoid(x) = 1/(1+exp(-x)). In the model builder API, the link function is specified via the postprocessor argument. (Consult List of postprocessor functions for the list of available postprocessors.) Let’s look at how the builder object is constructed:

builder = ModelBuilder(
  threshold_type="float32",
  leaf_output_type="float32",
  metadata=Metadata(
    num_feature=3,
    task_type="kBinaryClf",
    average_tree_output=False,
    num_target=1,
    num_class=[1],
    leaf_vector_shape=(1, 1),
  ),
  # Every tree generates output for target 0, class 0
  tree_annotation=TreeAnnotation(num_tree=2, target_id=[0, 0], class_id=[0, 0]),
  # The link function for the output is the sigmoid function
  postprocessor=PostProcessorFunc(name="sigmoid"),
  # Add this value for all outputs. Also known as the intercept.
  base_scores=[0.0],
)

Note that we’ve also changed task_type to kBinaryClf.

Using the same definition for the two trees, we now obtain probability scores:

# Same tree construction logic as the first example
# ...

model = builder.commit()

X = np.array(
    [
        [0.0, 0.0, -5.0],
        [0.0, 0.0, -2.0],
        [0.0, 0.0, 1.0],
        [0.0, 5.0, -5.0],
        [0.0, 5.0, -2.0],
        [0.0, 5.0, 1.0],
        [10.0, 0.0, -5.0],
        [10.0, 0.0, -2.0],
        [10.0, 0.0, 1.0],
        [10.0, 5.0, -5.0],
        [10.0, 5.0, -2.0],
        [10.0, 5.0, 1.0],
    ],
    dtype=np.float32
)
print(treelite.gtil.predict(model, X))
[[0.7685248 ]
 [0.9426758 ]
 [0.9426758 ]
 [0.4255575 ]
 [0.785835  ]
 [0.7109495 ]
 [0.90024954]
 [0.90024954]
 [0.90024954]
 [0.6681878 ]
 [0.6681878 ]
 [0.5744425 ]]
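
As a quick sanity check, each score above is the sigmoid of the corresponding margin from the regression example. For instance, the first row’s margin was 1.2:

import numpy as np

# sigmoid(1.2) should reproduce the first probability score above
print(1.0 / (1.0 + np.exp(-1.2)))  # approximately 0.7685248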

Example: Multi-class classifier with vector leaf

Now let’s consider a multi-class classifier, where the target variable is now a class label whose value can be one of {0, 1, 2, ..., num_class - 1}. The tree model should now produce a 2D array of probability scores where score[i, k] represents the i-th row’s probability score for class k.

For the sake of brevity, consider a multi-class classifier consisting of a single decision tree stump:

[Figure: a single decision tree stump with vector leaves]

The model has a single output target, for which there are 3 possible class labels, so we set num_target=1 and num_class=[3]. To indicate that the tree outputs a vector of length 3, set leaf_vector_shape=(1, 3). The softmax function softmax(x) = exp(x) / sum(exp(x)) is used as the link function, to convert the tree output to probability scores in the range [0, 1].

builder = ModelBuilder(
  threshold_type="float32",
  leaf_output_type="float32",
  metadata=Metadata(
    num_feature=1,
    task_type="kMultiClf",  # To indicate multi-class classification
    average_tree_output=False,
    num_target=1,
    num_class=[3],
    leaf_vector_shape=(1, 3),
  ),
  # Every tree generates probability scores for all classes, so class_id=-1
  tree_annotation=TreeAnnotation(num_tree=1, target_id=[0], class_id=[-1]),
  # The link function for the output is the softmax function
  postprocessor=PostProcessorFunc(name="softmax"),
  # base_scores must have length (num_target * max(num_class))
  base_scores=[0.0, 0.0, 0.0],
)

builder.start_tree()
builder.start_node(0)
builder.numerical_test(
  feature_id=0,
  threshold=0.0,
  default_left=True,
  opname="<",
  left_child_key=1,
  right_child_key=2,
)
builder.end_node()
builder.start_node(1)
builder.leaf([0.5, 0.5, 0.0])
builder.end_node()
builder.start_node(2)
builder.leaf([0.0, 0.0, 1.0])
builder.end_node()
builder.end_tree()

model = builder.commit()
X = np.array([[-1.0], [1.0]])
print(treelite.gtil.predict(model, X))
[[0.38365173 0.38365173 0.23269653]
 [0.21194156 0.21194156 0.57611686]]
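
As a sanity check, these scores are exactly the softmax of the two leaf vectors:

import numpy as np

def softmax(x):
  # Link function: exp(x) / sum(exp(x))
  e = np.exp(x)
  return e / e.sum()

print(softmax(np.array([0.5, 0.5, 0.0])))  # first row of the output above
print(softmax(np.array([0.0, 0.0, 1.0])))  # second row of the output above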

Example: Multi-class classifier with scalar leaf

It is also possible to build a multi-class classifier where each tree produces a scalar output: each class’s score is computed by summing the output from a subset of the decision trees. How do we know which decision tree contributes to which class? This is where TreeAnnotation becomes useful. The class_id field in TreeAnnotation is assigned an array of integers, where class_id[i] gives the class ID toward which the i-th tree’s output counts. In the following example, the outputs of Trees 0, 1, and 2 count towards Classes 0, 1, and 2, respectively:

builder = ModelBuilder(
  threshold_type="float32",
  leaf_output_type="float32",
  metadata=Metadata(
    num_feature=1,
    task_type="kMultiClf",  # To indicate multi-class classification
    average_tree_output=False,
    num_target=1,
    num_class=[3],
    leaf_vector_shape=(1, 1),
  ),
  # Tree i produces score for class i
  tree_annotation=TreeAnnotation(
    num_tree=3,
    target_id=[0, 0, 0],
    class_id=[0, 1, 2],  # Tree i contributes towards the score of Class i
  ),
  # The link function for the output is the softmax function
  postprocessor=PostProcessorFunc(name="softmax"),
  # base_scores must have length (num_target * max(num_class))
  base_scores=[0.0, 0.0, 0.0],
)

In this example, we will have three trees, where tree i produces the score for class i; the three stumps below encode the same leaf values as the vector-leaf stump from the previous example, so the model produces identical predictions. In general, class_id would be a longer array that associates multiple decision trees with each class.
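
For instance, a hypothetical ensemble with six trees over the same three classes, with trees assigned to classes in round-robin order (the layout XGBoost uses for its multi-class models), would be annotated as:

# Hypothetical: 6 trees over 3 classes, assigned round-robin, so that
# Trees 0 and 3 score Class 0, Trees 1 and 4 score Class 1, and so on
tree_annotation = TreeAnnotation(
  num_tree=6,
  target_id=[0] * 6,
  class_id=[0, 1, 2, 0, 1, 2],
)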

[Figure: three decision tree stumps, one for each class]

for tree_id in range(3):
  builder.start_tree()
  builder.start_node(0)
  builder.numerical_test(
    feature_id=0,
    threshold=0.0,
    default_left=True,
    opname="<",
    left_child_key=1,
    right_child_key=2,
  )
  builder.end_node()
  builder.start_node(1)
  builder.leaf(0.5 if tree_id < 2 else 0.0)
  builder.end_node()
  builder.start_node(2)
  builder.leaf(1.0 if tree_id == 2 else 0.0)
  builder.end_node()
  builder.end_tree()
model = builder.commit()
X = np.array([[-1.0], [1.0]])
print(treelite.gtil.predict(model, X))
[[0.38365173 0.38365173 0.23269653]
 [0.21194156 0.21194156 0.57611686]]

Example: Multi-target regressor

[To be added later]