Treelite Serialization Format v4
The v4 serialization format was designed with the following goals in mind:
First-class support for multi-target models
Support “boosting from the average” in scikit-learn, where a simple base estimator is fitted from the class label distribution (or the average label, for regression) and is used as the initial learner in the ensemble model.
Use integer types with defined widths (so int32_t instead of int)
We first define a set of enum types to be used in the serialization format.
TypeInfo: underlying type uint8_t. Indicates the data type of another field. Values: kInvalid (0), kUInt32 (1), kFloat32 (2), kFloat64 (3).
TaskType: underlying type uint8_t. Indicates the type of the learning task. Values: kBinaryClf (0): binary classifier; kRegressor (1): regressor; kMultiClf (2): multi-class classifier; kLearningToRank (3): learning-to-rank; kIsolationForest (4): isolation forest.
TreeNodeType: underlying type int8_t. Indicates the type of a node in a tree. Values: kLeafNode (0), kNumericalTestNode (1), kCategoricalTestNode (2).
Operator: underlying type int8_t. Indicates the comparison operator used in an internal test node in a tree. Values: kNone (0), kEQ (1), kLT (2), kLE (3), kGT (4), kGE (5).
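The enum definitions above can be mirrored in a reader or writer. The following is a minimal sketch in Python, not Treelite's own code: the names ENUM_FORMAT and pack_enum are hypothetical helpers introduced here, and the struct format characters "B" and "b" stand for uint8_t and int8_t respectively.

```python
import struct
from enum import IntEnum


class TypeInfo(IntEnum):  # underlying type uint8_t
    kInvalid = 0
    kUInt32 = 1
    kFloat32 = 2
    kFloat64 = 3


class TaskType(IntEnum):  # underlying type uint8_t
    kBinaryClf = 0
    kRegressor = 1
    kMultiClf = 2
    kLearningToRank = 3
    kIsolationForest = 4


class TreeNodeType(IntEnum):  # underlying type int8_t
    kLeafNode = 0
    kNumericalTestNode = 1
    kCategoricalTestNode = 2


class Operator(IntEnum):  # underlying type int8_t
    kNone = 0
    kEQ = 1
    kLT = 2
    kLE = 3
    kGT = 4
    kGE = 5


# Hypothetical lookup table: struct code for each enum's underlying type.
ENUM_FORMAT = {TypeInfo: "B", TaskType: "B", TreeNodeType: "b", Operator: "b"}


def pack_enum(value):
    """Encode one enum value as a single byte (hypothetical helper)."""
    return struct.pack("<" + ENUM_FORMAT[type(value)], int(value))


encoded = pack_enum(TypeInfo.kFloat64)
```

Because every enum occupies exactly one byte, decoding is the reverse: unpack one byte and pass the integer to the enum class, for example Operator(2).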
The model type is currently parametrized with two template parameters: ThresholdType and LeafOutputType.
In v4, the following combinations are allowed:
ThresholdType | LeafOutputType
float | float
double | double
A given Treelite model object is to be serialized as follows, with the fields to be written to the byte sequence in the exact order they appear in the following list.
Header
Major version (major_ver): single int32_t scalar. Set it to 4 for the v4 version.
Minor version (minor_ver): single int32_t scalar.
Patch version (patch_ver): single int32_t scalar.
Threshold type (threshold_type): single uint8_t scalar representing enum TypeInfo.
Leaf output type (leaf_output_type): single uint8_t scalar representing enum TypeInfo.
Number of trees (num_tree): single uint64_t scalar.
Header 2
Number of features in data (num_feature): single int32_t scalar.
Task type (task_type): single uint8_t scalar representing enum TaskType.
average_tree_output: single bool scalar indicating whether to average tree outputs. When this field is set to True, each output out[target_id, row_id, class_id] is divided by the number of trees that are associated with the same target and class. (See target_id and class_id fields below.)
Task parameters
num_target: single int32_t scalar. Number of targets in the model. num_target > 1 indicates a multi-target model. A negative value is invalid.
num_class: an array of int32_t with length num_target. A negative value is invalid. Set num_class=[1, 1, 1, ...] for regression and other non-classifier models.
Note
Writing an array to the disk or a stream
When writing an array to the disk or a stream, we first write the length of the array (uint64_t scalar), and then the content of the array.
leaf_vector_shape: an array of int32_t with length 2. The first dimension is either 1 or num_target. The second dimension is either 1 or max(num_class).
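The scalar fields of the first header and the array-writing rule from the note above can be sketched with Python's struct module. This is an illustrative sketch rather than Treelite's implementation: write_scalar, write_array, and read_array are hypothetical helper names, and the field values (10 trees, two targets with 3 and 2 classes) are made up for the example.

```python
import io
import struct


def write_scalar(stream, fmt, value):
    """Write one little-endian scalar; fmt is a struct code such as "i" or "Q"."""
    stream.write(struct.pack("<" + fmt, value))


def write_array(stream, fmt, values):
    """Write the uint64_t length first, then the elements themselves."""
    write_scalar(stream, "Q", len(values))
    stream.write(struct.pack(f"<{len(values)}{fmt}", *values))


def read_array(stream, fmt):
    """Read back an array written by write_array."""
    (length,) = struct.unpack("<Q", stream.read(8))
    item_size = struct.calcsize("<" + fmt)
    payload = stream.read(length * item_size)
    return list(struct.unpack(f"<{length}{fmt}", payload))


buf = io.BytesIO()
for version_part in (4, 0, 0):   # major_ver, minor_ver, patch_ver (int32_t)
    write_scalar(buf, "i", version_part)
write_scalar(buf, "B", 2)        # threshold_type = kFloat32 (uint8_t)
write_scalar(buf, "B", 2)        # leaf_output_type = kFloat32 (uint8_t)
write_scalar(buf, "Q", 10)       # num_tree (uint64_t), example value
header_size = buf.tell()         # 3 * 4 + 1 + 1 + 8 bytes

write_array(buf, "i", [3, 2])    # num_class for two targets, example values
buf.seek(header_size)
num_class = read_array(buf, "i")
```

An empty array is still written as an 8-byte length of zero with no payload, so a reader always knows how many bytes to consume next.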
Per-tree Metadata
target_id: an array of int32_t. target_id[i] indicates the target for which the i-th tree produces output. If the tree is a multi-target tree (i.e. it yields output for all targets), target_id[i] is set to -1. This array is expected to have length num_tree.
class_id: an array of int32_t. class_id[i] indicates the class for which the i-th tree produces output. For vector-leaf trees that produce outputs for multiple classes, the corresponding class_id[i] is set to -1. For regression and other non-classifier models, class_id[i] should be 0 for all trees. The class_id array is expected to have length num_tree.
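To make the interaction between average_tree_output, target_id, and class_id concrete, the sketch below counts how many trees are associated with each (target, class) pair. It rests on an assumption that the text does not spell out: a tree with target_id[i] == -1 is counted toward every target, and a tree with class_id[i] == -1 is counted toward every class of the targets it covers. The function name trees_per_output is hypothetical.

```python
def trees_per_output(num_target, num_class, target_id, class_id):
    """Count the trees contributing to each (target, class) output.

    Assumption: -1 in target_id or class_id means "all targets" or
    "all classes" respectively.
    """
    max_num_class = max(num_class)
    count = [[0] * max_num_class for _ in range(num_target)]
    for t_id, c_id in zip(target_id, class_id):
        targets = range(num_target) if t_id == -1 else [t_id]
        for t in targets:
            classes = range(num_class[t]) if c_id == -1 else [c_id]
            for c in classes:
                count[t][c] += 1
    return count


# Single-target, 3-class model with one tree per class per round (2 rounds).
per_class = trees_per_output(1, [3], [0] * 6, [0, 1, 2, 0, 1, 2])

# Two-target regressor built from two multi-target trees.
multi_target = trees_per_output(2, [1, 1], [-1, -1], [0, 0])
```

Under this reading, when average_tree_output is True each summed output would be divided by the matching entry of the count table.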
Model parameters
postprocessor: an array of char. Stores a human-readable name of the post-processing function that’s applied to prediction outputs. Consult the List of postprocessor functions for the available postprocessors.
sigmoid_alpha: single float scalar. This model parameter is relevant when postprocessor="sigmoid".
ratio_c: single float scalar. This model parameter is relevant when postprocessor="exponential_standard_ratio".
base_scores: an array of double. This vector is expected to have length num_target * max(num_class). The elements are laid out in row-major order. The predicted margin scores of all data points will be adjusted by this vector.
attributes: an array of char containing a JSON string. The JSON string can store arbitrary model attributes and must be a valid JSON object. To indicate the lack of attributes, you may either:
Set the field to an empty string (zero length), or
Set the field to {}.
Extension slot 1: Per-model optional fields. This field is currently not used.
num_opt_field_per_model: single int32_t scalar. Set this value to 0 to indicate the lack of optional fields.
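The model parameters and the first extension slot can be written out as in the following sketch. It is illustrative only: write_char_array and encode_attributes are hypothetical helpers, the parameter values are examples, and the sketch assumes that char arrays hold the raw bytes of the string with no trailing NUL terminator.

```python
import io
import json
import struct


def write_char_array(stream, text):
    """Write a string as an array of char: uint64_t length, then the bytes.

    Assumption: no NUL terminator is stored.
    """
    data = text.encode("utf-8")
    stream.write(struct.pack("<Q", len(data)))
    stream.write(data)


def encode_attributes(attrs):
    """Serialize model attributes, insisting on a JSON object at the top level."""
    if not isinstance(attrs, dict):
        raise TypeError("attributes must be a JSON object")
    return json.dumps(attrs)


buf = io.BytesIO()
write_char_array(buf, "sigmoid")                  # postprocessor
buf.write(struct.pack("<f", 1.0))                 # sigmoid_alpha (float)
buf.write(struct.pack("<f", 1.0))                 # ratio_c (float)
base_scores = [0.5]                               # num_target * max(num_class) == 1
buf.write(struct.pack("<Q", len(base_scores)))    # array length
buf.write(struct.pack(f"<{len(base_scores)}d", *base_scores))
write_char_array(buf, encode_attributes({}))      # attributes, stored as "{}"
buf.write(struct.pack("<i", 0))                   # num_opt_field_per_model
total_bytes = len(buf.getvalue())
```

Writing {} rather than an empty string keeps the field parseable by any JSON reader without a special case, although both spellings are permitted.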
Tree 0: First tree, which is to be represented by the following fields.
num_nodes: single int32_t scalar indicating the number of nodes.
has_categorical_split: single bool scalar indicating whether categorical splits exist.
node_type: an array of int8_t representing enum TreeNodeType. node_type[i] indicates the type of node i.
cleft: an array of int32_t, so that cleft[i] identifies the left child node of node i. Set to -1 to indicate the lack of the left child. When the tree is traversed, the left child node is chosen whenever the test in the test node evaluates to True. (For missing values, the test’s outcome is unknown. See default_left field.)
cright: an array of int32_t, so that cright[i] identifies the right child node of node i. Set to -1 to indicate the lack of the right child. When the tree is traversed, the right child node is chosen whenever the test in the test node evaluates to False. (For missing values, the test’s outcome is unknown. See default_left field.)
split_index: an array of int32_t, where split_index[i] gives the feature ID used in the test node i. If node i is not a test node, split_index[i] shall be -1.
default_left: an array of bool, where default_left[i] indicates the default direction for a missing value in the test node i.
leaf_value: an array of LeafOutputType, where leaf_value[i] is the output of the leaf node i. leaf_value[i] is only valid if node i is a leaf node with a scalar output. To access the output of a leaf node that produces a vector output, use leaf_vector instead. (See below.)
threshold: an array of ThresholdType, where threshold[i] is the threshold used in the test node i. threshold[i] is only valid if node i is a test node with a numerical test (of form [feature value] [op] [threshold]). For categorical test nodes, use category_list instead. (See below.)
cmp: an array of int8_t (representing enum Operator). cmp[i] is the comparison operator used in the test node i. cmp[i] is only valid if node i is a numerical test node.
category_list_right_child: an array of bool, where category_list_right_child[i] indicates which child node should be followed when a categorical test (of form [feature value] in [category list]) evaluates to True: the right child if the flag is True, the left child otherwise. category_list_right_child[i] is not defined if node i is not a categorical test node.
Leaf vectors
Content (leaf_vector): an array of LeafOutputType. This array stores the leaf vectors for all nodes, such that the sub-array leaf_vector[leaf_vector_begin[i]:leaf_vector_end[i]] yields the leaf vector for the i-th node. The leaf vector uses the row-major layout to store a 2D array. If node i is not a leaf node with a vector output, the sub-array should be empty (leaf_vector_begin[i] == leaf_vector_end[i]).
Beginning offset of each segment (leaf_vector_begin): an array of uint64_t.
Ending offset of each segment (leaf_vector_end): an array of uint64_t.
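The begin and end offsets work like a ragged-array index. The sketch below recovers the 2D leaf vector of one node from the flat array, assuming that the 2D shape is the model's leaf_vector_shape; the function name leaf_vector_of and the sample values are hypothetical.

```python
def leaf_vector_of(node, leaf_vector, begin, end, shape):
    """Return the leaf vector of `node` as a list of rows, or None if empty.

    Assumption: `shape` is the model-wide leaf_vector_shape.
    """
    flat = leaf_vector[begin[node]:end[node]]
    if not flat:
        return None  # node is not a leaf with a vector output
    rows, cols = shape
    if len(flat) != rows * cols:
        raise ValueError("segment length does not match leaf_vector_shape")
    # Row-major layout: row r occupies flat[r * cols : (r + 1) * cols].
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


# Three nodes: node 0 is a test node, nodes 1 and 2 are vector leaves.
leaf_vector = [0.1, 0.2, 0.7, 0.6, 0.3, 0.1]
leaf_vector_begin = [0, 0, 3]
leaf_vector_end = [0, 3, 6]
shape = (1, 3)  # one target, three classes

root = leaf_vector_of(0, leaf_vector, leaf_vector_begin, leaf_vector_end, shape)
left = leaf_vector_of(1, leaf_vector, leaf_vector_begin, leaf_vector_end, shape)
right = leaf_vector_of(2, leaf_vector, leaf_vector_begin, leaf_vector_end, shape)
```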
Category list (for categorical splits)
Content (category_list): an array of uint32_t. This array stores the category lists of all nodes, such that the sub-array category_list[category_list_begin[i]:category_list_end[i]] yields the category list of the i-th node. If node i is not a categorical test node, the sub-array should be empty (category_list_begin[i] == category_list_end[i]).
Beginning offset of each segment (category_list_begin): an array of uint64_t.
Ending offset of each segment (category_list_end): an array of uint64_t.
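A categorical test can be evaluated from these arrays as sketched below. The sketch assumes that a True value in category_list_right_child sends matching categories to the right child and a False value sends them to the left child; missing values are not handled here. The dictionary layout and both function names are hypothetical.

```python
def category_list_of(tree, node):
    """Slice the flat category_list array down to one node's categories."""
    begin = tree["category_list_begin"][node]
    end = tree["category_list_end"][node]
    return tree["category_list"][begin:end]


def next_child_categorical(tree, node, category):
    """Pick the child to visit for a non-missing categorical feature value.

    Assumption: category_list_right_child[node] == True means a matching
    category goes right; False means a matching category goes left.
    """
    matched = category in category_list_of(tree, node)
    goes_right = matched == tree["category_list_right_child"][node]
    return tree["cright"][node] if goes_right else tree["cleft"][node]


# Node 0 tests "feature value in [1, 3]"; nodes 1 and 2 are leaves.
tree = {
    "cleft": [1, -1, -1],
    "cright": [2, -1, -1],
    "category_list": [1, 3],
    "category_list_begin": [0, 2, 2],
    "category_list_end": [2, 2, 2],
    "category_list_right_child": [False, False, False],
}
child_for_match = next_child_categorical(tree, 0, 3)
child_for_miss = next_child_categorical(tree, 0, 2)
```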
Metadata for node statistics
data_count: an array of uint64_t. data_count[i] indicates the number of data points in the training data set whose traversal paths include node i. LightGBM provides this statistic.
data_count_present: an array of bool. data_count_present[i] indicates whether data_count[i] is available. You may assign an empty array (length 0) to data_count and data_count_present if data count is unavailable for all nodes.
sum_hess: an array of double. sum_hess[i] indicates the sum of the Hessian values for all data points whose traversal paths include node i. This information is available in XGBoost and is used as a proxy for the number of data points.
sum_hess_present: an array of bool. sum_hess_present[i] indicates whether sum_hess[i] is available. You may assign an empty array (length 0) to sum_hess and sum_hess_present if Hessian sum is unavailable for all nodes.
gain: an array of double. gain[i] indicates the change in the loss function that is attributed to the particular split at node i.
gain_present: an array of bool. gain_present[i] indicates whether gain[i] is present. You may assign an empty array (length 0) to gain and gain_present if gain is unavailable for all nodes.
Extension slot 2: Per-tree optional fields. This field is currently not used.
num_opt_field_per_tree: single int32_t scalar. Set this value to 0 to indicate the lack of optional fields.
Extension slot 3: Per-node optional fields. This field is currently not used.
num_opt_field_per_node: single int32_t scalar. Set this value to 0 to indicate the lack of optional fields.
Tree 1: Use the same set of fields as Tree 0.
Other trees …
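The per-tree arrays described for Tree 0 are enough to route a data row to a leaf. The following is a rough sketch for trees with numerical tests and scalar leaves only. It makes several assumptions that the format description does not state: the root is node 0, a missing value is represented as NaN, and the tree is held in a plain dictionary keyed by field name. The name predict_leaf is hypothetical.

```python
import math
import operator

# Operator enum values mapped to Python comparisons (kNone = 0 is omitted).
CMP = {1: operator.eq, 2: operator.lt, 3: operator.le,
       4: operator.gt, 5: operator.ge}

K_LEAF_NODE = 0  # TreeNodeType.kLeafNode


def predict_leaf(tree, row):
    """Walk one tree and return the scalar leaf output.

    Assumptions: node 0 is the root, NaN marks a missing value, and every
    test node is a numerical test node.
    """
    node = 0
    while tree["node_type"][node] != K_LEAF_NODE:
        value = row[tree["split_index"][node]]
        if math.isnan(value):
            go_left = tree["default_left"][node]
        else:
            # Test has the form [feature value] [op] [threshold].
            go_left = CMP[tree["cmp"][node]](value, tree["threshold"][node])
        node = tree["cleft"][node] if go_left else tree["cright"][node]
    return tree["leaf_value"][node]


# One test node (feature 0 < 0.5) with two leaves.
tree = {
    "node_type": [1, 0, 0],
    "cleft": [1, -1, -1],
    "cright": [2, -1, -1],
    "split_index": [0, -1, -1],
    "default_left": [True, False, False],
    "threshold": [0.5, 0.0, 0.0],
    "cmp": [2, 0, 0],
    "leaf_value": [0.0, -1.0, 1.0],
}
low = predict_leaf(tree, [0.2])
high = predict_leaf(tree, [0.9])
missing = predict_leaf(tree, [float("nan")])
```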
Note
Caveat for multi-target, multi-class classifiers
When the number of classes differs across targets, we use the largest number of
classes (max_num_class) to shape the leaf vector (and base_scores). The leaf vector
will have shape (num_target, max_num_class), with extra elements padded with 0. This heuristic has the following
consequence: if a target has significantly more classes than the other targets, a lot
of space will be wasted.
This is the method used in scikit-learn’s sklearn.ensemble.RandomForestClassifier.
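The padding rule can be illustrated with a small sketch. The function name pad_leaf_vector and the sample probabilities are hypothetical; the point is only to show the (num_target, max_num_class) shape and where the zero padding lands.

```python
def pad_leaf_vector(per_target_outputs):
    """Pad each target's class outputs with zeros up to max_num_class."""
    max_num_class = max(len(v) for v in per_target_outputs)
    return [list(v) + [0.0] * (max_num_class - len(v))
            for v in per_target_outputs]


# Target 0 has 2 classes, target 1 has 4 classes.
outputs = [[0.2, 0.8], [0.1, 0.3, 0.4, 0.2]]
padded = pad_leaf_vector(outputs)

# Row-major flattening, as stored in leaf_vector.
flat = [x for row in padded for x in row]

# Fraction of stored elements that are padding.
used = sum(len(v) for v in outputs)
wasted_fraction = 1 - used / len(flat)
```

In this example a quarter of the stored elements are padding; the fraction grows as the class counts of the targets diverge.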
Note
A few v3 models are not representable using v4
We designed the v4 format to be mostly backwards compatible with v3, but there are a few exceptions:
The task type kMultiClfCategLeaf is no longer supported. This task type has not found any use in the wild. Neither GTIL nor TL2cgen supports it.
It is no longer possible to output integers from leaves. So LeafOutputType can no longer be uint32_t; output_type can no longer be kInt. Leaf outputs will now be assumed to be float or double. The output_type field is removed in v4. Integer outputs are being removed, as they have found little use in practice.
Note
Always use the little-endian order when reading and writing bytes
Always use the little-endian byte order when reading and writing scalars and arrays.
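As a short illustration of the byte order rule, Python's struct module selects little-endian encoding with the "<" prefix regardless of the host machine. The values below are examples only.

```python
import struct

# "<" forces little-endian with standard sizes, independent of the host CPU.
major_ver = struct.pack("<i", 4)     # int32_t
num_tree = struct.pack("<Q", 258)    # uint64_t, 258 == 0x0102

# The least significant byte comes first.
decoded = int.from_bytes(num_tree, "little")
```

Omitting the prefix, or using "@" or "=", would fall back to the native byte order of the machine, which produces an incompatible stream on big-endian hosts.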