Treelite Serialization Format v4

The v4 serialization format was designed with the following goals in mind:

First-class support for multi-target models
Support “boosting from the average” in scikit-learn, where a simple base estimator is fitted from the class label distribution (or the average label, for regression) and is used as the initial learner in the ensemble model.
Use integer types with defined widths (so int32_t instead of int)

We first define a set of enum types to be used in the serialization format.

TypeInfo: underlying type uint8_t. Indicates the data type of another field
- kInvalid (0)
- kUInt32 (1)
- kFloat32 (2)
- kFloat64 (3)
TaskType: underlying type uint8_t. Indicates the type of the learning task.
- kBinaryClf (0): binary classifier
- kRegressor (1): regressor
- kMultiClf (2): multi-class classifier
- kLearningToRank (3): learning-to-rank
- kIsolationForest (4): isolation forest
TreeNodeType: underlying type int8_t. Indicates the type of a node in a tree.
- kLeafNode (0)
- kNumericalTestNode (1)
- kCategoricalTestNode (2)
Operator: underlying type int8_t. Indicates the comparison operator used in an internal test node in a tree.
- kNone (0)
- kEQ (1)
- kLT (2)
- kLE (3)
- kGT (4)
- kGE (5)

The model type is currently parametrized with two template parameters: ThresholdType and LeafOutputType. In v4, the following combinations are allowed:

ThresholdType	LeafOutputType
`float`	`float`
`double`	`double`

A given Treelite model object is to be serialized as follows, with the fields to be written to the byte sequence in the exact order they appear in the following list.

Header
- Major version (major_ver): single int32_t scalar. Set it to 4 for the v4 version.
- Minor version (minor_ver): single int32_t scalar.
- Patch version (patch_ver): single int32_t scalar.
- Threshold type (threshold_type): single uint8_t scalar representing enum TypeInfo.
- Leaf output type (leaf_output_type): single uint8_t scalar representing enum TypeInfo.
Number of trees (num_tree): single uint64_t scalar.
Header 2
- Number of features in data (num_feature): single int32_t scalar.
- Task type (task_type): single uint8_t scalar representing enum TaskType.
- average_tree_output: single bool scalar indicating whether to average tree outputs. When this field is set to True, each output out[target_id, row_id, class_id] is divided by the number of trees that are associated with the same target and class. (See target_id and class_id fields below.)
- Task parameters
  - num_target: single int32_t scalar. Number of targets in the model. num_target > 1 indicates a multi-target models. Negative value is invalid.
  - num_class: an array of int32_t with length num_target. Negative value is invalid. Set num_class=[1, 1, 1, ...] for regression and other non-classifier models.
    
    Note
    
    Writing an array to the disk or a stream
    
    When writing an array to the disk or a stream, we first write the length of the array (uint64_t scalar), and then the content of the array.
  - leaf_vector_shape: an array of int32_t with length 2. The first dimension is either 1 or num_target. The second dimension is either 1 or max(num_class).
- Per-tree Metadata
  - target_id: an array of int32_t. target_id[i] indicates the target for which the i-th tree produces output. If the tree is a multi-target tree (i.e. it yields output for all targets), target_id[i] is set to -1. This array is expected to have length num_tree.
  - class_id: an array of int32_t. class_id[i] indicates the class for which the i-th tree produces output. For vector-leaf trees that produce outputs for multiple classes, the corresponding class_id[i] is set to -1. For regression and other non-classifier models, class_id[i] should be 0 for all trees. The class_id array is expected to have length num_tree.
- Model parameters
  - postprocessor: an array of char. Stores a human-readable name of the post-processing function that’s applied to prediction outputs. Consult List of postprocessor functions for the list of available postprocessors.
  - sigmoid_alpha: single float scalar. This model parameter is relevant when postprocessor="sigmoid".
  - ratio_c: single float scalar. This model parameter is relevant when postprocessor="exponential_standard_ratio".
  - base_scores: an array of double. This vector is expected to have length num_target * max(num_class). The elements will be laid out in the row-major layout. The predicted margin scores of all data points will be adjusted by this vector.
  - attributes: an array of char containing a JSON string. The JSON string can store arbitrary model attributes. The JSON string must be a valid JSON object. To indicate the lack of an attribute, you may either:
    - Set the field to an empty string (zero length) or
    - Set the field to {}.
Extension slot 1: Per-model optional fields. This field is currently not used.
- num_opt_field_per_model: single int32_t scalar. Set this value to 0, to indicate the lack of optional fields.
Tree 0: First tree, which is to be represented by the following fields.
- num_nodes: single int32_t scalar indicating the number of nodes
- has_categorical_split: single bool scalar indicating if categorical splits exist
- node_type: an array of int8_t representing enum TreeNodeType. node_type[i] indicates the type of node i.
- cleft: an array of int32_t, so that cleft[i] identifies the left child node of node i. Set to -1 to indicate the lack of the left child.
- cright: an array of int32_t, so that cright[i] identifies the right child node of node i. Set to -1 to indicate the lack of the right child.
- split_index: an array of int32_t, where split_index[i] gives the feature ID used in the test node i. If node i is not a test node, split_index[i] shall be -1.
- default_left: an array of bool, where default_left[i] indicates the default direction for the missing value in the test node i.
- leaf_value: an array of LeafOutputType, where leaf_value[i] is the output of the leaf node i. leaf_value[i] is only valid if node i is a leaf node with a scalar output. To access the output of a leaf node that produces a vector output, use leaf_vector instead. (See below.)
- threshold: an array of ThresholdType, where threshold[i] is the threshold used in the test node i. threshold[i] is only valid if node i is a test node with a numerical test (of form [feature value] [op] [threshold]). For categorical test nodes, use category_list instead. (See below.)
- cmp: an array of int8_t (representing enum Operator). cmp[i] is the comparison operator used in the test node i. cmp[i] is only valid if node i is a numerical test node.
- category_list_right_child: an array of bool where category_list_right_child[i] indicates which child node should be followed when a categorical test (of form [feature value] in [category list]). category_list_right_child[i] is not defined if node i is not a categorical test node.
- Leaf vectors
  - Content (leaf_vector): an array of LeafOutputType. This array stores the leaf vectors for all nodes, such that the sub-array leaf_vector[leaf_vector_begin[i]:leaf_vector_end[i]] yields the leaf vector for the i-th node. The leaf vector uses the row-major layout to store a 2D array. If node i is not a leaf node with a vector output, the sub-array should be empty (leaf_vector_begin[i] == leaf_vector_end[i]).
  - Beginning offset of each segment (leaf_vector_begin): an array of uint64_t.
  - Ending offset of each segment (leaf_vector_end): an array of uint64_t.
- Category list (for categorical splits)
  - Content (category_list): an array of uint32_t. This array stores the category lists of all nodes, such that the sub-array category_list[category_list_begin[i]:category_list_end[i]] yields the category list of the i-th node. If node i is not a categorical test node, the sub-array should be empty (category_list_begin[i] == category_list_end[i]).
  - Beginning offset of each segment (category_list_begin): an array of uint64_t.
  - Ending offset of each segment (category_list_end): an array of uint64_t.
- Metadata for node statistics
  - data_count: an array of uint64_t. data_count[i] indicates the number of data points in the training data set whose traversal paths include node i. LightGBM provides this statistics.
  - data_count_present: an array of bool. data_count_present[i] indicates whether data_count[i] is available. You may assign an empty array (length 0) to data_count and data_count_present if data count is unavailable for all nodes.
  - sum_hess: an array of double. sum_hess[i] indicates the sum of the Hessian values for all data points whose traversal paths include node i. This information is available in XGBoost and is used as a proxy of the number of data points.
  - sum_hess_present: an array of bool. sum_hess_present[i] indicates whether sum_hess[i] is available. You may assign an empty array (length 0) to sum_hess and sum_hess_present if Hessian sum is unavailable for all nodes.
  - gain: an array of double. gain[i] indicates the change in the loss function that is attributed to the particular split at node i.
  - gain_present: an array of bool. gain_present[i] indicates whether gain[i] is present. You may assign an empty array (length 0) to gain and gain_present if gain is unavailable for all nodes.
- Extension slot 2: Per-tree optional fields. This field is currently not used.
  - num_opt_field_per_tree: single int32_t scalar. Set this value to 0, to indicate the lack of optional fields.
- Extension slot 3: Per-node optional fields. This field is currently not used.
  - num_opt_field_per_node: single int32_t scalar. Set this value to 0, to indicate the lack of optional fields.
Tree 1: Use the same set of fields as Tree 0.
Other trees …

Note

Caveat for multi-target, multi-class classifiers

When the number of classes are different for targets, we use the larget number of classes (max_num_class) to shape the leaf vector (and base_scores). The leaf vector will have shape (num_target, max_num_class), with extra elements padded with 0. This heuristic has the following consequences: If a target has significantly more classes than other targets, a lot of space will be wasted.

This is the method used in scikit-learn’s sklearn.ensemble.RandomForestClassifier.

Note

A few v3 models are not representable using v4

We designed the v4 format to be mostly backwards compatible with v3, but there are a few exceptions:

The task type kMultiClfCategLeaf is no longer supported. This task type has not found any use in the wild. Neither GTIL nor TL2cgen supports it.
It is no longer possible to output integers from leaves. So LeafOutputType can no longer be uint32_t; output_type can no longer be kInt. Leaf outputs will now be assumed to be float or double. The output_type field is removed in v4. Integer outputs are being removed, as they have found little use in practice.

Note

Always use the little-endian order when reading and writing bytes

Always use the little-endian byte order when reading and writing scalars and arrays.