Specifying models using JSON string

Treelite supports loading models from major tree libraries, such as XGBoost and scikit-learn. However, you may want to use models trained by other tree libraries that Treelite does not support directly. The JSON importer is useful in this case. (Alternatively, consider using the model builder.)

Toy Example 

Consider the following tree ensemble, consisting of two regression trees:

where each node is assigned a unique integer key, indicated in red. Note that integer keys need to be unique only within the same tree.

You can construct this tree ensemble by calling import_from_json() with an appropriately formatted JSON string. We give the example code first; the following section explains the meaning of each field in the JSON string.

Note

dump_as_json() will NOT preserve the JSON string that’s passed into import_from_json()

The operation performed in import_from_json() is strictly one-way. So the output of dump_as_json() will differ from the JSON string you used in calling import_from_json().

import treelite

json_str = """
{
    "num_feature": 3,
    "task_type": "kBinaryClfRegr",
    "average_tree_output": false,
    "task_param": {
        "output_type": "float",
        "grove_per_class": false,
        "num_class": 1,
        "leaf_vector_size": 1
    },
    "model_param": {
        "pred_transform": "identity",
        "global_bias": 0.0
    },
    "trees": [
        {
            "root_id": 0,
            "nodes": [
                {
                    "node_id": 0,
                    "split_feature_id": 1,
                    "default_left": true,
                    "split_type": "categorical",
                    "categories_list": [1, 2, 4],
                    "categories_list_right_child": false,
                    "left_child": 1,
                    "right_child": 2
                },
                {
                    "node_id": 1,
                    "split_feature_id": 2,
                    "default_left": false,
                    "split_type": "numerical",
                    "comparison_op": "<",
                    "threshold": -3.0,
                    "left_child": 3,
                    "right_child": 4
                },
                {"node_id": 2, "leaf_value": 0.6},
                {"node_id": 3, "leaf_value": -0.4},
                {"node_id": 4, "leaf_value": 1.2}
            ]
        },
        {
            "root_id": 1,
            "nodes": [
                {
                    "node_id": 1,
                    "split_feature_id": 0,
                    "default_left": false,
                    "split_type": "numerical",
                    "comparison_op": "<",
                    "threshold": 2.5,
                    "left_child": 2,
                    "right_child": 4
                },
                {
                    "node_id": 4,
                    "split_feature_id": 2,
                    "default_left": true,
                    "split_type": "numerical",
                    "comparison_op": "<",
                    "threshold": -1.2,
                    "left_child": 6,
                    "right_child": 8
                },
                {"node_id": 2, "leaf_value": 1.6},
                {"node_id": 6, "leaf_value": 0.1},
                {"node_id": 8, "leaf_value": -0.3}
            ]
        }
    ]
}
"""
model = treelite.Model.import_from_json(json_str)
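When the model comes from another library, it is often easier to assemble the JSON as Python dictionaries and serialize it with the standard json module than to write the string by hand. The following is a minimal sketch of that approach. The helpers leaf, numerical_split, and check_tree are hypothetical names introduced here for illustration and are not part of Treelite; check_tree only covers the structural rules stated on this page (node keys unique within a tree, a root_id that exists, child references that resolve).

```python
import json


def leaf(node_id, value):
    """Leaf node; value is a scalar, or a list for a leaf vector."""
    return {"node_id": node_id, "leaf_value": value}


def numerical_split(node_id, feature, threshold, left, right, default_left=False):
    """Internal node with the numerical test `feature < threshold`."""
    return {
        "node_id": node_id,
        "split_feature_id": feature,
        "default_left": default_left,
        "split_type": "numerical",
        "comparison_op": "<",
        "threshold": threshold,
        "left_child": left,
        "right_child": right,
    }


def check_tree(tree):
    """Raise ValueError if the tree breaks a basic structural rule."""
    ids = [node["node_id"] for node in tree["nodes"]]
    if len(ids) != len(set(ids)):
        raise ValueError("node_id values must be unique within a tree")
    if tree["root_id"] not in ids:
        raise ValueError("root_id does not match any node")
    for node in tree["nodes"]:
        if "leaf_value" in node:
            continue  # leaf nodes have no children to check
        for key in ("left_child", "right_child"):
            if node[key] not in ids:
                raise ValueError(f"{key} of node {node['node_id']} does not exist")


# Second tree of the toy example, rebuilt with the helpers
tree = {
    "root_id": 1,
    "nodes": [
        numerical_split(1, feature=0, threshold=2.5, left=2, right=4),
        numerical_split(4, feature=2, threshold=-1.2, left=6, right=8,
                        default_left=True),
        leaf(2, 1.6),
        leaf(6, 0.1),
        leaf(8, -0.3),
    ],
}
check_tree(tree)

model_dict = {
    "num_feature": 3,
    "task_type": "kBinaryClfRegr",
    "average_tree_output": False,
    "task_param": {
        "output_type": "float",
        "grove_per_class": False,
        "num_class": 1,
        "leaf_vector_size": 1,
    },
    "model_param": {"pred_transform": "identity", "global_bias": 0.0},
    "trees": [tree],
}
# json.dumps converts Python False/True into JSON false/true
json_str = json.dumps(model_dict, indent=4)
```

The resulting json_str can be passed to treelite.Model.import_from_json() in the same way as the hand-written string above.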

Building model components using JSON 

Model metadata 

At the top level of the JSON object, we must specify the following metadata of the model.

num_feature: Number of features (columns) in the training data
average_tree_output: Whether to average the outputs of trees. Set this to true if the model is a random forest.
task_type / task_param: Parameters that together define a machine learning task.
model_param: Other important parameters in the model.

Task Parameters: Define a machine learning task

The task_type parameter is closely related to the content of task_param. The task_param object has the following parameters:

output_type: Type of leaf output. Either float or int.
grove_per_class: Boolean indicating whether the model is a multi-class classifier organized as "grove-per-class" (see kMultiClfGrovePerClass below).
num_class: Number of target classes in a multi-class classifier. Set this to 1 if the model is a binary classifier or a non-classifier.
leaf_vector_size: Length of leaf output. A value of 1 indicates scalar output.

The docstring of TaskType explains the relationship between task_type and the parameters in task_param:

enum class treelite::TaskType : uint8_t

Enum type representing the task type.

The task type places constraints on the parameters of TaskParam. See the docstring for each enum constant for more details.

Values:

enumerator kBinaryClfRegr

Catch-all task type encoding all tasks that are not multi-class classification, such as binary classification, regression, and learning-to-rank.

The kBinaryClfRegr task type implies the following constraints on the task parameters: output_type=float, grove_per_class=false, num_class=1, leaf_vector_size=1.

enumerator kMultiClfGrovePerClass

The multi-class classification task, in which the prediction for each class is given by the sum of outputs from a subset of the trees. We refer to this method as “grove-per-class”.

In this setting, each leaf node in a tree produces a single scalar output. To obtain predictions for each class, we divide the trees into multiple groups (“groves”) and then compute the sum of outputs of the trees in each group. The prediction for the i-th class is given by the sum of the outputs of the trees whose index is congruent to [i] modulo [num_class].

Examples of “grove-per-class” classifiers are found in XGBoost, LightGBM, and GradientBoostingClassifier of scikit-learn.

The kMultiClfGrovePerClass task type implies the following constraints on the task parameters: output_type=float, grove_per_class=true, num_class>1, leaf_vector_size=1. In addition, we require that the number of trees is evenly divisible by [num_class].
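To make the grouping rule concrete, the sketch below sums scalar tree outputs into one score per class. The tree outputs are made-up numbers, and grove_sums is a hypothetical helper rather than a Treelite function; it computes raw sums only and does not model pred_transform or global_bias.

```python
def grove_sums(tree_outputs, num_class):
    """Sum scalar tree outputs into one score per class.

    Tree t contributes to class (t % num_class).
    """
    if len(tree_outputs) % num_class != 0:
        raise ValueError("number of trees must be divisible by num_class")
    scores = [0.0] * num_class
    for t, out in enumerate(tree_outputs):
        scores[t % num_class] += out
    return scores


# Six trees, three classes:
# trees 0 and 3 -> class 0, trees 1 and 4 -> class 1, trees 2 and 5 -> class 2
outputs = [0.5, -1.0, 2.0, 0.25, 1.0, -0.5]
scores = grove_sums(outputs, num_class=3)
print(scores)  # [0.75, 0.0, 1.5]
```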

enumerator kMultiClfProbDistLeaf

The multi-class classification task, in which each tree produces a vector of probability predictions for all the classes.

In this setting, each leaf node in a tree produces a vector output whose length is [num_class]. The vector represents probability predictions for all the classes. The outputs of the trees are combined via summing or averaging, depending on the value of the [average_tree_output] field. In effect, each tree is casting a set of weighted (fractional) votes for the classes.

Examples of kMultiClfProbDistLeaf task type are found in RandomForestClassifier of scikit-learn and RandomForestClassifier of cuML.

The kMultiClfProbDistLeaf task type implies the following constraints on the task parameters: output_type=float, grove_per_class=false, num_class>1, leaf_vector_size=num_class.
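As an illustration of how vector leaf outputs are combined, the sketch below sums or averages one leaf vector per tree. The vectors are invented for the example and combine_leaf_vectors is a hypothetical helper, not part of Treelite.

```python
def combine_leaf_vectors(leaf_vectors, average_tree_output):
    """Combine one leaf vector per tree by summing or averaging."""
    num_class = len(leaf_vectors[0])
    totals = [0.0] * num_class
    for vec in leaf_vectors:
        if len(vec) != num_class:
            raise ValueError("every leaf vector must have length num_class")
        for c, p in enumerate(vec):
            totals[c] += p
    if average_tree_output:
        totals = [s / len(leaf_vectors) for s in totals]
    return totals


# Four trees, three classes; each row is the leaf vector reached in one tree
vectors = [
    [0.5, 0.25, 0.25],
    [0.0, 1.0, 0.0],
    [0.25, 0.25, 0.5],
    [0.25, 0.5, 0.25],
]
summed = combine_leaf_vectors(vectors, average_tree_output=False)
averaged = combine_leaf_vectors(vectors, average_tree_output=True)
print(summed)    # [1.0, 2.0, 1.0]
print(averaged)  # [0.25, 0.5, 0.25]
```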

enumerator kMultiClfCategLeaf

The multi-class classification task, in which each tree produces a single integer output representing an unweighted vote for a particular class.

In this setting, each leaf node in a tree produces a single integer output between 0 and [num_class-1] that indicates a vote for a particular class. The outputs of the trees are combined by summing one_hot(tree(i)), where one_hot(x) represents the one-hot-encoded vector with 1 in index [x] and 0 everywhere else, and tree(i) is the output from the i-th tree. Models of type kMultiClfCategLeaf can be converted into the kMultiClfProbDistLeaf type, by converting the output of every leaf node into the equivalent one-hot-encoded vector.
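The vote counting described above can be sketched as follows. The per-tree outputs are made-up class indices, and one_hot and sum_votes are hypothetical helpers written for this illustration.

```python
def one_hot(x, num_class):
    """Vector with 1 at index x and 0 everywhere else."""
    vec = [0] * num_class
    vec[x] = 1
    return vec


def sum_votes(tree_outputs, num_class):
    """Sum one_hot(tree(i)) over all trees."""
    totals = [0] * num_class
    for out in tree_outputs:
        if not 0 <= out < num_class:
            raise ValueError("leaf output must be between 0 and num_class-1")
        for c, v in enumerate(one_hot(out, num_class)):
            totals[c] += v
    return totals


# Five trees voting for classes 2, 0, 2, 1, 2
votes = sum_votes([2, 0, 2, 1, 2], num_class=3)
print(votes)  # [1, 1, 3]
```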

The kMultiClfCategLeaf task type implies the following constraints on the task parameters: output_type=int, grove_per_class=false, num_class>1, leaf_vector_size=1.
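The constraints listed for the four task types can be collected into a small pre-flight check that runs before the JSON is handed to Treelite. This is a sketch under the assumption that only the constraints quoted in the docstring above matter; check_task_param is a hypothetical helper and Treelite may perform additional validation of its own.

```python
def check_task_param(task_type, task_param, num_tree):
    """Return a list of violated constraints (empty if none)."""
    n = task_param["num_class"]
    # task_type -> (output_type, grove_per_class, num_class ok?, leaf_vector_size)
    rules = {
        "kBinaryClfRegr": ("float", False, n == 1, 1),
        "kMultiClfGrovePerClass": ("float", True, n > 1, 1),
        "kMultiClfProbDistLeaf": ("float", False, n > 1, n),
        "kMultiClfCategLeaf": ("int", False, n > 1, 1),
    }
    if task_type not in rules:
        raise ValueError(f"unknown task_type: {task_type}")
    output_type, grove, num_class_ok, leaf_size = rules[task_type]

    problems = []
    if task_param["output_type"] != output_type:
        problems.append(f"output_type must be {output_type}")
    if task_param["grove_per_class"] is not grove:
        problems.append(f"grove_per_class must be {str(grove).lower()}")
    if not num_class_ok:
        problems.append("num_class is out of range for this task type")
    if task_param["leaf_vector_size"] != leaf_size:
        problems.append(f"leaf_vector_size must be {leaf_size}")
    # Grove-per-class also needs the tree count to be a multiple of num_class
    if task_type == "kMultiClfGrovePerClass" and n > 1 and num_tree % n != 0:
        problems.append("number of trees must be divisible by num_class")
    return problems


toy_param = {
    "output_type": "float",
    "grove_per_class": False,
    "num_class": 1,
    "leaf_vector_size": 1,
}
ok = check_task_param("kBinaryClfRegr", toy_param, num_tree=2)
bad = check_task_param("kMultiClfProbDistLeaf", toy_param, num_tree=2)
print(ok)   # []
print(bad)  # two problems: num_class is 1, so it is out of range
```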

Other Model Parameters 

The model_param field contains the parameters described in Model Parameters. You may safely omit a parameter as long as it has a default value.

Tree nodes 

Each tree object must have a root_id field to indicate which node is the root node.

The nodes array must contain node objects. Each node object must have a node_id field. It will also have other fields, depending on the type of the node. A typical leaf node looks like this:

{"node_id": 2, "leaf_value": 0.6}

To output a leaf vector, use a list instead.

{"node_id": 2, "leaf_value": [0.6, 0.4]}

A typical internal node with numerical test:

{
    "node_id": 1,
    "split_feature_id": 2,
    "default_left": false,
    "split_type": "numerical",
    "comparison_op": "<",
    "threshold": -3.0,
    "left_child": 3,
    "right_child": 4
}

A typical internal node with categorical test:

{
    "node_id": 0,
    "split_feature_id": 1,
    "default_left": true,
    "split_type": "categorical",
    "categories_list": [1, 2, 4],
    "categories_list_right_child": false,
    "left_child": 1,
    "right_child": 2
}

For the categorical test, the test criterion is in the form of

[Feature value] \in [categories_list]

where the categories_list defines a (mathematical) set. When the test criterion evaluates to true, the prediction function traverses to the left child node (if categories_list_right_child=false) or to the right child node (if categories_list_right_child=true).
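To see how the node fields fit together, the sketch below walks a single tree in plain Python. It is not Treelite's prediction code: predict_tree is a hypothetical helper, and it assumes that a true numerical test leads to the left child and that default_left gives the direction taken when the feature value is missing (None or NaN). Only the "<" operator used on this page is handled.

```python
import math


def predict_tree(tree, row):
    """Follow one data row from the root to a leaf and return leaf_value."""
    nodes = {node["node_id"]: node for node in tree["nodes"]}
    node = nodes[tree["root_id"]]
    while "leaf_value" not in node:
        value = row[node["split_feature_id"]]
        if value is None or math.isnan(value):
            # Missing value: assumed to follow default_left
            go_left = node["default_left"]
        elif node["split_type"] == "numerical":
            if node["comparison_op"] != "<":
                raise NotImplementedError("this sketch only handles '<'")
            go_left = value < node["threshold"]
        else:
            # Categorical: a match goes left unless categories_list_right_child
            matched = value in node["categories_list"]
            go_left = matched != node["categories_list_right_child"]
        node = nodes[node["left_child"] if go_left else node["right_child"]]
    return node["leaf_value"]


# First tree of the toy example
tree = {
    "root_id": 0,
    "nodes": [
        {"node_id": 0, "split_feature_id": 1, "default_left": True,
         "split_type": "categorical", "categories_list": [1, 2, 4],
         "categories_list_right_child": False,
         "left_child": 1, "right_child": 2},
        {"node_id": 1, "split_feature_id": 2, "default_left": False,
         "split_type": "numerical", "comparison_op": "<", "threshold": -3.0,
         "left_child": 3, "right_child": 4},
        {"node_id": 2, "leaf_value": 0.6},
        {"node_id": 3, "leaf_value": -0.4},
        {"node_id": 4, "leaf_value": 1.2},
    ],
}

print(predict_tree(tree, [0.0, 2, -5.0]))   # category matches, -5.0 < -3.0: -0.4
print(predict_tree(tree, [0.0, 3, 0.0]))    # category does not match: 0.6
print(predict_tree(tree, [0.0, None, 1.0])) # missing at root goes left: 1.2
```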
