Specifying models using JSON string
Treelite supports loading models from major tree libraries, such as XGBoost and scikit-learn. However, you may want to use models trained by other tree libraries that are not directly supported by Treelite. The JSON importer is useful in this use case. (Alternatively, consider using the model builder instead.)
Toy Example
Consider the following tree ensemble, consisting of two regression trees:
where each node is assigned a unique integer key, indicated in red. Note that integer keys need to be unique only within the same tree.
You can construct this tree ensemble by calling import_from_json() with an appropriately formatted JSON string. The example code comes first; the following section explains the meaning of each field in the JSON string.
Note
dump_as_json() will NOT preserve the JSON string that's passed into import_from_json(). The operation performed in import_from_json() is strictly one-way, so the output of dump_as_json() will differ from the JSON string you used in calling import_from_json().
import treelite

json_str = """
{
    "num_feature": 3,
    "task_type": "kBinaryClfRegr",
    "average_tree_output": false,
    "task_param": {
        "output_type": "float",
        "grove_per_class": false,
        "num_class": 1,
        "leaf_vector_size": 1
    },
    "model_param": {
        "pred_transform": "identity",
        "global_bias": 0.0
    },
    "trees": [
        {
            "root_id": 0,
            "nodes": [
                {
                    "node_id": 0,
                    "split_feature_id": 1,
                    "default_left": true,
                    "split_type": "categorical",
                    "categories_list": [1, 2, 4],
                    "categories_list_right_child": false,
                    "left_child": 1,
                    "right_child": 2
                },
                {
                    "node_id": 1,
                    "split_feature_id": 2,
                    "default_left": false,
                    "split_type": "numerical",
                    "comparison_op": "<",
                    "threshold": -3.0,
                    "left_child": 3,
                    "right_child": 4
                },
                {"node_id": 2, "leaf_value": 0.6},
                {"node_id": 3, "leaf_value": -0.4},
                {"node_id": 4, "leaf_value": 1.2}
            ]
        },
        {
            "root_id": 1,
            "nodes": [
                {
                    "node_id": 1,
                    "split_feature_id": 0,
                    "default_left": false,
                    "split_type": "numerical",
                    "comparison_op": "<",
                    "threshold": 2.5,
                    "left_child": 2,
                    "right_child": 4
                },
                {
                    "node_id": 4,
                    "split_feature_id": 2,
                    "default_left": true,
                    "split_type": "numerical",
                    "comparison_op": "<",
                    "threshold": -1.2,
                    "left_child": 6,
                    "right_child": 8
                },
                {"node_id": 2, "leaf_value": 1.6},
                {"node_id": 6, "leaf_value": 0.1},
                {"node_id": 8, "leaf_value": -0.3}
            ]
        }
    ]
}
"""
model = treelite.Model.import_from_json(json_str)
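Hand-writing a long JSON string is error-prone, especially when converting a model from another library. As a minimal sketch, the same kind of document can be assembled from ordinary Python dictionaries and serialized with the standard json module. The helpers leaf and numerical_split are hypothetical conveniences, not part of Treelite; only the field names are taken from the example above. For brevity, the sketch rebuilds just the second tree of the toy ensemble.

```python
import json


def leaf(node_id, value):
    """Leaf node object (hypothetical helper, not a Treelite API)."""
    return {"node_id": node_id, "leaf_value": value}


def numerical_split(node_id, feature, threshold, left, right, default_left):
    """Internal node with the numerical test `feature < threshold` (hypothetical helper)."""
    return {
        "node_id": node_id,
        "split_feature_id": feature,
        "default_left": default_left,
        "split_type": "numerical",
        "comparison_op": "<",
        "threshold": threshold,
        "left_child": left,
        "right_child": right,
    }


# Second tree of the toy ensemble, rebuilt from the helpers.
second_tree = {
    "root_id": 1,
    "nodes": [
        numerical_split(1, feature=0, threshold=2.5, left=2, right=4, default_left=False),
        numerical_split(4, feature=2, threshold=-1.2, left=6, right=8, default_left=True),
        leaf(2, 1.6),
        leaf(6, 0.1),
        leaf(8, -0.3),
    ],
}

model_spec = {
    "num_feature": 3,
    "task_type": "kBinaryClfRegr",
    "average_tree_output": False,
    "task_param": {
        "output_type": "float",
        "grove_per_class": False,
        "num_class": 1,
        "leaf_vector_size": 1,
    },
    "model_param": {"pred_transform": "identity", "global_bias": 0.0},
    "trees": [second_tree],
}

# json.dumps converts Python's False/True into JSON's false/true.
json_str = json.dumps(model_spec, indent=2)
```

The resulting json_str can then be passed to import_from_json() in the same way as the hand-written string.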
Building model components using JSON
Model metadata
First, we must specify certain metadata of the model.
- num_feature: Number of features (columns) in the training data.
- average_tree_output: Whether to average the outputs of trees. Set this to true if the model is a random forest.
- task_type / task_param: Parameters that together define a machine learning task.
- model_param: Other important parameters in the model.
Task Parameters: Define a machine learning task
The task_type parameter is closely related to the content of task_param.
The task_param object has the following parameters:
- output_type: Type of leaf output. Either float or int.
- grove_per_class: Boolean indicating whether a multi-class classifier organizes its trees as "grove-per-class", where each class is predicted by its own subset of trees.
- num_class: Number of target classes in a multi-class classifier. Set this to 1 if the model is a binary classifier or a non-classifier.
- leaf_vector_size: Length of leaf output. A value of 1 indicates scalar output.
The docstring of TaskType explains the relationship between task_type and the parameters in task_param:
-
enum class treelite::TaskType : uint8_t
Enum type representing the task type.
The task type places constraints on the parameters of TaskParam. See the docstring for each enum constant for more details.
Values:
-
enumerator kBinaryClfRegr
Catch-all task type encoding all tasks that are not multi-class classification, such as binary classification, regression, and learning-to-rank.
The kBinaryClfRegr task type implies the following constraints on the task parameters: output_type=float, grove_per_class=false, num_class=1, leaf_vector_size=1.
-
enumerator kMultiClfGrovePerClass
The multi-class classification task, in which the prediction for each class is given by the sum of outputs from a subset of the trees. We refer to this method as “grove-per-class”.
In this setting, each leaf node in a tree produces a single scalar output. To obtain predictions for each class, we divide the trees into multiple groups (“groves”) and then compute the sum of outputs of the trees in each group. The prediction for the i-th class is given by the sum of the outputs of the trees whose index is congruent to [i] modulo [num_class].
Examples of “grove-per-class” classifiers are found in XGBoost, LightGBM, and GradientBoostingClassifier of scikit-learn.
The kMultiClfGrovePerClass task type implies the following constraints on the task parameters: output_type=float, grove_per_class=true, num_class>1, leaf_vector_size=1. In addition, we require that the number of trees is evenly divisible by [num_class].
-
enumerator kMultiClfProbDistLeaf
The multi-class classification task, in which each tree produces a vector of probability predictions for all the classes.
In this setting, each leaf node in a tree produces a vector output whose length is [num_class]. The vector represents probability predictions for all the classes. The outputs of the trees are combined via summing or averaging, depending on the value of the [average_tree_output] field. In effect, each tree is casting a set of weighted (fractional) votes for the classes.
Examples of kMultiClfProbDistLeaf task type are found in RandomForestClassifier of scikit-learn and RandomForestClassifier of cuML.
The kMultiClfProbDistLeaf task type implies the following constraints on the task parameters: output_type=float, grove_per_class=false, num_class>1, leaf_vector_size=num_class.
-
enumerator kMultiClfCategLeaf
The multi-class classification task, in which each tree produces a single integer output representing an unweighted vote for a particular class.
In this setting, each leaf node in a tree produces a single integer output between 0 and [num_class-1] that indicates a vote for a particular class. The outputs of the trees are combined by summing one_hot(tree(i)), where one_hot(x) represents the one-hot-encoded vector with 1 in index [x] and 0 everywhere else, and tree(i) is the output from the i-th tree. Models of type kMultiClfCategLeaf can be converted into the kMultiClfProbDistLeaf type, by converting the output of every leaf node into the equivalent one-hot-encoded vector.
The kMultiClfCategLeaf task type implies the following constraints on the task parameters: output_type=int, grove_per_class=false, num_class>1, leaf_vector_size=1.
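The constraints listed in the docstring can be checked before a JSON document is handed to the importer. The following is a minimal sketch in plain Python; check_task_param is a hypothetical helper, not a Treelite function, and it encodes only the constraints quoted above.

```python
def check_task_param(task_type, task_param, num_tree):
    """Return a list of violated constraints (empty when all hold)."""
    num_class = task_param["num_class"]
    # (output_type, grove_per_class, leaf_vector_size) required by each task type.
    required = {
        "kBinaryClfRegr": ("float", False, 1),
        "kMultiClfGrovePerClass": ("float", True, 1),
        "kMultiClfProbDistLeaf": ("float", False, num_class),
        "kMultiClfCategLeaf": ("int", False, 1),
    }
    if task_type not in required:
        return [f"unknown task_type {task_type!r}"]

    errors = []
    names = ("output_type", "grove_per_class", "leaf_vector_size")
    for name, want in zip(names, required[task_type]):
        if task_param[name] != want:
            errors.append(f"{name} must be {want!r}, got {task_param[name]!r}")

    if task_type == "kBinaryClfRegr":
        if num_class != 1:
            errors.append("num_class must be 1")
    elif num_class <= 1:
        errors.append("num_class must be greater than 1")
    elif task_type == "kMultiClfGrovePerClass" and num_tree % num_class != 0:
        errors.append("number of trees must be divisible by num_class")
    return errors


# The task_param object from the toy example satisfies kBinaryClfRegr.
toy_param = {
    "output_type": "float",
    "grove_per_class": False,
    "num_class": 1,
    "leaf_vector_size": 1,
}
toy_errors = check_task_param("kBinaryClfRegr", toy_param, num_tree=2)
```

For the toy example, toy_errors is an empty list. A grove-per-class model with 3 classes and 7 trees would instead report that the number of trees is not divisible by num_class.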
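To make the combination rules for kMultiClfGrovePerClass and kMultiClfCategLeaf concrete, here is a minimal sketch in plain Python. The function names are hypothetical, and the code deliberately ignores pred_transform, global_bias, and averaging; it only illustrates how per-tree outputs are mapped to per-class totals.

```python
def combine_grove_per_class(tree_outputs, num_class):
    """Sum scalar tree outputs into per-class scores.

    The tree at index i contributes to class i % num_class.
    """
    scores = [0.0] * num_class
    for i, output in enumerate(tree_outputs):
        scores[i % num_class] += output
    return scores


def combine_categ_leaf(tree_outputs, num_class):
    """Sum one-hot votes; each tree outputs a class index in [0, num_class - 1]."""
    votes = [0] * num_class
    for class_index in tree_outputs:
        votes[class_index] += 1
    return votes


# Six trees, three classes: trees 0 and 3 feed class 0, trees 1 and 4 feed
# class 1, and trees 2 and 5 feed class 2.
scores = combine_grove_per_class([0.5, -1.0, 2.0, 0.25, 0.5, -1.0], num_class=3)
# scores == [0.75, -0.5, 1.0]

# Four trees voting for classes 2, 0, 2 and 1.
votes = combine_categ_leaf([2, 0, 2, 1], num_class=3)
# votes == [1, 1, 2]
```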
Other Model Parameters
The model_param field contains the parameters described in Model Parameters. You may safely omit a parameter as long as it has a default value.
Tree nodes
Each tree object must have a root_id field to indicate which node is the root node. The nodes array must contain node objects, and each node object must have a node_id field. A node object also has other fields, depending on the type of the node. A typical leaf node looks like this:
{"node_id": 2, "leaf_value": 0.6}
To output a leaf vector, use a list instead.
{"node_id": 2, "leaf_value": [0.6, 0.4]}
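Since leaf_vector_size in task_param fixes the length of every leaf output, it can be useful to verify that all leaves in a tree agree with it. The sketch below uses hypothetical helper names and treats a scalar leaf_value as having length 1; it is an illustration of the rule, not a Treelite API.

```python
def leaf_length(node):
    """Length of a leaf's output: 1 for a scalar, len() for a list."""
    value = node["leaf_value"]
    return len(value) if isinstance(value, list) else 1


def mismatched_leaves(tree, leaf_vector_size):
    """Return the node_id of every leaf whose output length differs
    from leaf_vector_size. Internal nodes are skipped."""
    return [
        node["node_id"]
        for node in tree["nodes"]
        if "leaf_value" in node and leaf_length(node) != leaf_vector_size
    ]


# A small tree fragment with one scalar leaf and one vector leaf.
example_tree = {
    "root_id": 0,
    "nodes": [
        {"node_id": 0, "split_feature_id": 0, "default_left": True,
         "split_type": "numerical", "comparison_op": "<", "threshold": 0.0,
         "left_child": 1, "right_child": 2},
        {"node_id": 1, "leaf_value": 0.6},
        {"node_id": 2, "leaf_value": [0.6, 0.4]},
    ],
}
scalar_mismatch = mismatched_leaves(example_tree, leaf_vector_size=1)
vector_mismatch = mismatched_leaves(example_tree, leaf_vector_size=2)
```

Here scalar_mismatch flags node 2 (a vector leaf in a scalar model) and vector_mismatch flags node 1 (a scalar leaf in a model with leaf_vector_size=2).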
A typical internal node with numerical test:
{
    "node_id": 1,
    "split_feature_id": 2,
    "default_left": false,
    "split_type": "numerical",
    "comparison_op": "<",
    "threshold": -3.0,
    "left_child": 3,
    "right_child": 4
}
A typical internal node with categorical test:
{
    "node_id": 0,
    "split_feature_id": 1,
    "default_left": true,
    "split_type": "categorical",
    "categories_list": [1, 2, 4],
    "categories_list_right_child": false,
    "left_child": 1,
    "right_child": 2
}
For the categorical test, the test criterion has the form
[Feature value] \in [categories_list]
where categories_list defines a (mathematical) set. When the test criterion evaluates to true, the prediction function traverses to the left child node (if categories_list_right_child=false) or to the right child node (if categories_list_right_child=true).
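The node fields described in this section can be tied together with a small reference traversal. This is a sketch in plain Python, not Treelite's implementation, and it rests on three assumptions that this section does not state explicitly: a numerical test that evaluates to true sends the traversal to the left child, a missing value (None or NaN) follows default_left, and comparison operators other than "<" behave like their Python counterparts. The function names are hypothetical.

```python
import math
import operator

# Only "<" appears in the examples above; the other operators are assumed.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def next_node_id(node, x):
    """Pick the child of an internal node for the feature vector x."""
    value = x[node["split_feature_id"]]
    if value is None or math.isnan(value):
        # Assumption: missing values follow the default direction.
        go_left = node["default_left"]
    elif node["split_type"] == "numerical":
        # Assumption: a true numerical test goes to the left child.
        go_left = OPS[node["comparison_op"]](value, node["threshold"])
    else:
        in_set = value in node["categories_list"]
        # A true test goes left unless categories_list_right_child is set.
        go_left = in_set != node["categories_list_right_child"]
    return node["left_child"] if go_left else node["right_child"]


def predict_tree(tree, x):
    """Walk from the root to a leaf and return its leaf_value."""
    nodes = {node["node_id"]: node for node in tree["nodes"]}
    node = nodes[tree["root_id"]]
    while "leaf_value" not in node:
        node = nodes[next_node_id(node, x)]
    return node["leaf_value"]


# First tree of the toy ensemble.
first_tree = {
    "root_id": 0,
    "nodes": [
        {"node_id": 0, "split_feature_id": 1, "default_left": True,
         "split_type": "categorical", "categories_list": [1, 2, 4],
         "categories_list_right_child": False,
         "left_child": 1, "right_child": 2},
        {"node_id": 1, "split_feature_id": 2, "default_left": False,
         "split_type": "numerical", "comparison_op": "<", "threshold": -3.0,
         "left_child": 3, "right_child": 4},
        {"node_id": 2, "leaf_value": 0.6},
        {"node_id": 3, "leaf_value": -0.4},
        {"node_id": 4, "leaf_value": 1.2},
    ],
}

in_set_output = predict_tree(first_tree, [0.0, 2.0, -5.0])       # category 2 is in the set
not_in_set_output = predict_tree(first_tree, [0.0, 3.0, 0.0])    # category 3 is not
missing_output = predict_tree(first_tree, [0.0, float("nan"), 0.0])
```

Under these assumptions, the first input reaches node 3 (output -0.4), the second goes straight to node 2 (output 0.6), and the input with a missing categorical feature follows default_left to node 1 and then ends at node 4 (output 1.2).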