Specifying models using JSON string

Treelite supports loading models from major tree libraries, such as XGBoost and scikit-learn. However, you may want to use models trained by other tree libraries that Treelite does not directly support. The JSON importer is useful in such cases. (Alternatively, consider using the model builder.)

Note

import_from_json() is strict about which JSON strings to accept

Some tree libraries, such as XGBoost, CatBoost, and cuML RandomForest, let users dump tree models as JSON strings. However, import_from_json() will not accept these strings: it requires a particular set of fields, as outlined in the tutorial below. Here are the suggested methods for converting your tree model into Treelite format:

  1. If you are using XGBoost, LightGBM, or scikit-learn, use the load() and import_model() methods.

  2. If you are using cuML RandomForest, convert the model directly to Treelite objects as follows:

cuml_rf_model.convert_to_treelite_model().to_treelite_checkpoint("checkpoint.bin")
tl_model = treelite.Model.deserialize("checkpoint.bin")
  3. If you are using CatBoost or another tree library that Treelite does not support directly, write a custom program that converts your tree model into a correctly formatted JSON string. Make sure that all required fields, such as task_param and model_param, are present.

Toy Example

Consider the following tree ensemble, consisting of two regression trees:

where each node is assigned a unique integer key, indicated in red. Note that integer keys need to be unique only within the same tree.

You can construct this tree ensemble by calling import_from_json() with an appropriately formatted JSON string. We will give the example code first; in the following section, we will explain the meaning of each field in the JSON string.

Note

dump_as_json() will NOT preserve the JSON string that’s passed into import_from_json()

The operation performed by import_from_json() is strictly one-way, so the output of dump_as_json() will differ from the JSON string you passed to import_from_json().

import treelite

json_str = """
{
    "num_feature": 3,
    "task_type": "kBinaryClfRegr",
    "average_tree_output": false,
    "task_param": {
        "output_type": "float",
        "grove_per_class": false,
        "num_class": 1,
        "leaf_vector_size": 1
    },
    "model_param": {
        "pred_transform": "identity",
        "global_bias": 0.0
    },
    "trees": [
        {
            "root_id": 0,
            "nodes": [
                {
                    "node_id": 0,
                    "split_feature_id": 1,
                    "default_left": true,
                    "split_type": "categorical",
                    "categories_list": [1, 2, 4],
                    "categories_list_right_child": false,
                    "left_child": 1,
                    "right_child": 2
                },
                {
                    "node_id": 1,
                    "split_feature_id": 2,
                    "default_left": false,
                    "split_type": "numerical",
                    "comparison_op": "<",
                    "threshold": -3.0,
                    "left_child": 3,
                    "right_child": 4
                },
                {"node_id": 2, "leaf_value": 0.6},
                {"node_id": 3, "leaf_value": -0.4},
                {"node_id": 4, "leaf_value": 1.2}
            ]
        },
        {
            "root_id": 1,
            "nodes": [
                {
                    "node_id": 1,
                    "split_feature_id": 0,
                    "default_left": false,
                    "split_type": "numerical",
                    "comparison_op": "<",
                    "threshold": 2.5,
                    "left_child": 2,
                    "right_child": 4
                },
                {
                    "node_id": 4,
                    "split_feature_id": 2,
                    "default_left": true,
                    "split_type": "numerical",
                    "comparison_op": "<",
                    "threshold": -1.2,
                    "left_child": 6,
                    "right_child": 8
                },
                {"node_id": 2, "leaf_value": 1.6},
                {"node_id": 6, "leaf_value": 0.1},
                {"node_id": 8, "leaf_value": -0.3}
            ]
        }
    ]
}
"""
model = treelite.Model.import_from_json(json_str)
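
To sanity-check the traversal semantics by hand, here is a minimal pure-Python evaluator of the same toy ensemble. This is an illustrative sketch only, not Treelite's actual implementation; it assumes the identity pred_transform and the "<" operator used in this example.

```python
import math

# The toy ensemble from above, written as a Python dict (same fields as the JSON)
model = {
    "num_feature": 3,
    "average_tree_output": False,
    "model_param": {"pred_transform": "identity", "global_bias": 0.0},
    "trees": [
        {"root_id": 0, "nodes": [
            {"node_id": 0, "split_feature_id": 1, "default_left": True,
             "split_type": "categorical", "categories_list": [1, 2, 4],
             "categories_list_right_child": False, "left_child": 1, "right_child": 2},
            {"node_id": 1, "split_feature_id": 2, "default_left": False,
             "split_type": "numerical", "comparison_op": "<", "threshold": -3.0,
             "left_child": 3, "right_child": 4},
            {"node_id": 2, "leaf_value": 0.6},
            {"node_id": 3, "leaf_value": -0.4},
            {"node_id": 4, "leaf_value": 1.2},
        ]},
        {"root_id": 1, "nodes": [
            {"node_id": 1, "split_feature_id": 0, "default_left": False,
             "split_type": "numerical", "comparison_op": "<", "threshold": 2.5,
             "left_child": 2, "right_child": 4},
            {"node_id": 4, "split_feature_id": 2, "default_left": True,
             "split_type": "numerical", "comparison_op": "<", "threshold": -1.2,
             "left_child": 6, "right_child": 8},
            {"node_id": 2, "leaf_value": 1.6},
            {"node_id": 6, "leaf_value": 0.1},
            {"node_id": 8, "leaf_value": -0.3},
        ]},
    ],
}

def eval_tree(tree, x):
    nodes = {n["node_id"]: n for n in tree["nodes"]}
    node = nodes[tree["root_id"]]
    while "leaf_value" not in node:
        fval = x[node["split_feature_id"]]
        if math.isnan(fval):  # missing value: follow the default direction
            goes_left = node["default_left"]
        elif node["split_type"] == "numerical":
            goes_left = fval < node["threshold"]  # only "<" handled in this sketch
        else:  # categorical: test membership in the category set
            matches = int(fval) in node["categories_list"]
            goes_left = matches != node["categories_list_right_child"]
        node = nodes[node["left_child"] if goes_left else node["right_child"]]
    return node["leaf_value"]

def predict(model, x):
    total = sum(eval_tree(tree, x) for tree in model["trees"])
    if model["average_tree_output"]:
        total /= len(model["trees"])
    return total + model["model_param"]["global_bias"]  # identity pred_transform
```

For example, the input [0.0, 0.0, 0.0] reaches leaf 0.6 in the first tree (feature 1 is not in the category list) and leaf 1.6 in the second tree (0 < 2.5), for a total of approximately 2.2.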

Building model components using JSON

Model metadata

We begin by specifying the model's metadata.

Task Parameters: Define a machine learning task

The task_type parameter is closely related to the content of task_param. The task_param object has the following parameters:

  • output_type: Type of leaf output. Either float or int.

  • grove_per_class: Boolean indicating whether the model uses the "grove-per-class" organization for multi-class classification (see kMultiClfGrovePerClass below).

  • num_class: Number of target classes in a multi-class classifier. Set this to 1 if the model is a binary classifier or a non-classifier.

  • leaf_vector_size: Length of leaf output. A value of 1 indicates scalar output.

The docstring of TaskType explains the relationship between task_type and the parameters in task_param:

enum class treelite::TaskType : uint8_t

Enum type representing the task type.

The task type places constraints on the parameters of TaskParam. See the docstring for each enum constant for more details.

Values:

enumerator kBinaryClfRegr

Catch-all task type encoding all tasks that are not multi-class classification, such as binary classification, regression, and learning-to-rank.

The kBinaryClfRegr task type implies the following constraints on the task parameters: output_type=float, grove_per_class=false, num_class=1, leaf_vector_size=1.

enumerator kMultiClfGrovePerClass

The multi-class classification task, in which the prediction for each class is given by the sum of outputs from a subset of the trees. We refer to this method as “grove-per-class”.

In this setting, each leaf node in a tree produces a single scalar output. To obtain predictions for each class, we divide the trees into multiple groups (“groves”) and then compute the sum of outputs of the trees in each group. The prediction for the i-th class is given by the sum of the outputs of the trees whose index is congruent to [i] modulo [num_class].

Examples of “grove-per-class” classifier are found in XGBoost, LightGBM, and GradientBoostingClassifier of scikit-learn.

The kMultiClfGrovePerClass task type implies the following constraints on the task parameters: output_type=float, grove_per_class=true, num_class>1, leaf_vector_size=1. In addition, we require that the number of trees is evenly divisible by [num_class].
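
The grove-per-class aggregation described above can be sketched as follows (illustrative only; the function name is my own, not part of the Treelite API):

```python
# Sketch of grove-per-class aggregation: tree i contributes its scalar
# output to class (i % num_class).
def grove_per_class(tree_outputs, num_class):
    assert len(tree_outputs) % num_class == 0  # trees must divide evenly
    scores = [0.0] * num_class
    for i, out in enumerate(tree_outputs):
        scores[i % num_class] += out
    return scores
```

With 6 trees and num_class=3, trees 0 and 3 contribute to class 0, trees 1 and 4 to class 1, and trees 2 and 5 to class 2.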

enumerator kMultiClfProbDistLeaf

The multi-class classification task, in which each tree produces a vector of probability predictions for all the classes.

In this setting, each leaf node in a tree produces a vector output whose length is [num_class]. The vector represents probability predictions for all the classes. The outputs of the trees are combined via summing or averaging, depending on the value of the [average_tree_output] field. In effect, each tree is casting a set of weighted (fractional) votes for the classes.

Examples of kMultiClfProbDistLeaf task type are found in RandomForestClassifier of scikit-learn and RandomForestClassifier of cuML.

The kMultiClfProbDistLeaf task type implies the following constraints on the task parameters: output_type=float, grove_per_class=false, num_class>1, leaf_vector_size=num_class.
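
A sketch of this vector-leaf aggregation (illustrative only; the function name is my own, not part of the Treelite API):

```python
# Each tree yields a leaf vector of length num_class; combine by summing,
# then average when average_tree_output is true.
def combine_leaf_vectors(leaf_vectors, average_tree_output):
    num_class = len(leaf_vectors[0])
    scores = [sum(vec[c] for vec in leaf_vectors) for c in range(num_class)]
    if average_tree_output:
        scores = [s / len(leaf_vectors) for s in scores]
    return scores
```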

enumerator kMultiClfCategLeaf

The multi-class classification task, in which each tree produces a single integer output representing an unweighted vote for a particular class.

In this setting, each leaf node in a tree produces a single integer output between 0 and [num_class-1] that indicates a vote for a particular class. The outputs of the trees are combined by summing one_hot(tree(i)), where one_hot(x) represents the one-hot-encoded vector with 1 in index [x] and 0 everywhere else, and tree(i) is the output from the i-th tree. Models of type kMultiClfCategLeaf can be converted into the kMultiClfProbDistLeaf type, by converting the output of every leaf node into the equivalent one-hot-encoded vector.

The kMultiClfCategLeaf task type implies the following constraints on the task parameters: output_type=int, grove_per_class=false, num_class>1, leaf_vector_size=1.
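
The one-hot vote tallying described above can be sketched as follows (illustrative only; the function name is my own, not part of the Treelite API):

```python
# Sketch of kMultiClfCategLeaf aggregation: each tree emits one integer
# class index, and the votes are tallied via one-hot encoding.
def tally_votes(tree_outputs, num_class):
    votes = [0] * num_class
    for cls in tree_outputs:
        votes[cls] += 1  # equivalent to adding one_hot(cls)
    return votes
```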

Other Model Parameters

The model_param field contains the parameters described in Model Parameters. You may safely omit a parameter as long as it has a default value.

Tree nodes

Each tree object must have a root_id field to indicate which node is the root.

The nodes array contains the node objects. Each node object must have a node_id field, plus other fields depending on the node's type. A typical leaf node looks like this:

{"node_id": 2, "leaf_value": 0.6}

To output a leaf vector, use a list instead:

{"node_id": 2, "leaf_value": [0.6, 0.4]}

A typical internal node with a numerical test:

{
    "node_id": 1,
    "split_feature_id": 2,
    "default_left": false,
    "split_type": "numerical",
    "comparison_op": "<",
    "threshold": -3.0,
    "left_child": 3,
    "right_child": 4
}
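
The branch taken at such a node could be computed as follows. This is a sketch under the assumption that only the "<" operator appears; Treelite supports other comparison operators as well, and the function name is my own:

```python
import math

# Decide which child a numerical test node routes to (illustrative sketch).
# Missing values (NaN) follow the default_left direction.
def numerical_branch(node, feature_value):
    if math.isnan(feature_value):
        goes_left = node["default_left"]
    else:
        goes_left = feature_value < node["threshold"]  # comparison_op "<"
    return node["left_child"] if goes_left else node["right_child"]

# The numerical node from the example above
node = {"node_id": 1, "split_feature_id": 2, "default_left": False,
        "split_type": "numerical", "comparison_op": "<", "threshold": -3.0,
        "left_child": 3, "right_child": 4}
```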

A typical internal node with a categorical test:

{
    "node_id": 0,
    "split_feature_id": 1,
    "default_left": true,
    "split_type": "categorical",
    "categories_list": [1, 2, 4],
    "categories_list_right_child": false,
    "left_child": 1,
    "right_child": 2
}

For the categorical test, the test criterion is in the form of

[feature value] ∈ [categories_list]

where the categories_list defines a (mathematical) set. When the test criterion evaluates to true, the prediction function traverses to the left child node (if categories_list_right_child=false) or to the right child node (if categories_list_right_child=true).
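
This rule can be sketched as follows (illustrative only; the function name is my own, and missing-value handling via default_left is omitted for brevity):

```python
# Decide which child a categorical test node routes to (illustrative sketch).
def categorical_branch(node, feature_value):
    matches = int(feature_value) in node["categories_list"]
    if node["categories_list_right_child"]:
        return node["right_child"] if matches else node["left_child"]
    return node["left_child"] if matches else node["right_child"]

# The categorical node from the example above
node = {"node_id": 0, "split_feature_id": 1, "default_left": True,
        "split_type": "categorical", "categories_list": [1, 2, 4],
        "categories_list_right_child": False, "left_child": 1, "right_child": 2}
```

For this node, a feature value of 2 matches the set {1, 2, 4} and routes to the left child (node 1), while a value of 3 routes to the right child (node 2).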