Treelite Serialization Format v4
The v4 serialization format was designed with the following goals in mind:
First-class support for multi-target models
Support “boosting from the average” in scikit-learn, where a simple base estimator is fitted from the class label distribution (or the average label, for regression) and is used as the initial learner in the ensemble model.
Use integer types with defined widths (so int32_t instead of int)
We first define a set of enum types to be used in the serialization format.
TypeInfo: underlying type uint8_t. Indicates the data type of another field.
  kInvalid (0)
  kUInt32 (1)
  kFloat32 (2)
  kFloat64 (3)
TaskType: underlying type uint8_t. Indicates the type of the learning task.
  kBinaryClf (0): binary classifier
  kRegressor (1): regressor
  kMultiClf (2): multi-class classifier
  kLearningToRank (3): learning-to-rank
  kIsolationForest (4): isolation forest
TreeNodeType: underlying type int8_t. Indicates the type of a node in a tree.
  kLeafNode (0)
  kNumericalTestNode (1)
  kCategoricalTestNode (2)
Operator: underlying type int8_t. Indicates the comparison operator used in an internal test node in a tree.
  kNone (0)
  kEQ (1)
  kLT (2)
  kLE (3)
  kGT (4)
  kGE (5)
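The four enums above can be mirrored in a reader or writer script. The following is a minimal Python sketch, not part of Treelite itself; the class names simply reuse the enum names from the text, and the underlying integer widths (uint8_t or int8_t) only matter at the point where a value is packed into bytes.

```python
from enum import IntEnum


class TypeInfo(IntEnum):
    """Stored as uint8_t. Data type of another field."""
    kInvalid = 0
    kUInt32 = 1
    kFloat32 = 2
    kFloat64 = 3


class TaskType(IntEnum):
    """Stored as uint8_t. Type of the learning task."""
    kBinaryClf = 0
    kRegressor = 1
    kMultiClf = 2
    kLearningToRank = 3
    kIsolationForest = 4


class TreeNodeType(IntEnum):
    """Stored as int8_t. Type of a node in a tree."""
    kLeafNode = 0
    kNumericalTestNode = 1
    kCategoricalTestNode = 2


class Operator(IntEnum):
    """Stored as int8_t. Comparison operator of a numerical test node."""
    kNone = 0
    kEQ = 1
    kLT = 2
    kLE = 3
    kGT = 4
    kGE = 5
```

Converting a raw byte back to an enum member, for example TypeInfo(3), raises ValueError for values outside the listed range, which is a convenient way to reject malformed input.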
The model type is currently parametrized with two template parameters: ThresholdType and LeafOutputType.
In v4, the following combinations are allowed:
ThresholdType |
LeafOutputType |
|
|
|
|
A given Treelite model object is to be serialized as follows, with the fields to be written to the byte sequence in the exact order they appear in the following list.
Header
  Major version (major_ver): single int32_t scalar. Set it to 4 for the v4 version.
  Minor version (minor_ver): single int32_t scalar.
  Patch version (patch_ver): single int32_t scalar.
  Threshold type (threshold_type): single uint8_t scalar representing enum TypeInfo.
  Leaf output type (leaf_output_type): single uint8_t scalar representing enum TypeInfo.
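The header fields listed above can be sketched with Python's struct module. This is an illustration under stated assumptions, not Treelite's own writer: pack_header is a hypothetical helper, and the sketch assumes that scalars are written back to back with no padding, which the text implies but does not state outright. The "<" prefix selects little-endian order, as required by the note at the end of this document.

```python
import struct

# "<" = little-endian, no padding; "i" = int32_t; "B" = uint8_t
HEADER_FORMAT = "<iiiBB"


def pack_header(minor_ver, patch_ver, threshold_type, leaf_output_type):
    """Pack major_ver (fixed to 4), minor_ver, patch_ver and the two TypeInfo codes."""
    return struct.pack(
        HEADER_FORMAT, 4, minor_ver, patch_ver, threshold_type, leaf_output_type
    )


# Hypothetical model: version 4.0.0, thresholds and leaf outputs both kFloat32 (2)
header = pack_header(0, 0, 2, 2)
fields = struct.unpack(HEADER_FORMAT, header)
```

Under the no-padding assumption the header occupies 3 * 4 + 2 * 1 = 14 bytes.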
Number of trees (num_tree): single uint64_t scalar.
Header 2
  Number of features in data (num_feature): single int32_t scalar.
  Task type (task_type): single uint8_t scalar representing enum TaskType.
  average_tree_output: single bool scalar indicating whether to average tree outputs. When this field is set to True, each output out[target_id, row_id, class_id] is divided by the number of trees that are associated with the same target and class. (See the target_id and class_id fields below.)
  Task parameters
    num_target: single int32_t scalar. Number of targets in the model. num_target > 1 indicates a multi-target model. A negative value is invalid.
    num_class: an array of int32_t with length num_target. A negative value is invalid. Set num_class=[1, 1, 1, ...] for regression and other non-classifier models.
    Note
    Writing an array to the disk or a stream
    When writing an array to the disk or a stream, we first write the length of the array (uint64_t scalar), and then the content of the array.
    leaf_vector_shape: an array of int32_t with length 2. The first dimension is either 1 or num_target. The second dimension is either 1 or max(num_class).
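The array convention described in the note above (a uint64_t length followed by the elements) can be sketched as follows. The helper names write_array and read_array are hypothetical, and the sketch assumes that elements are stored contiguously with no padding.

```python
import io
import struct


def write_array(stream, fmt_char, values):
    """Write a uint64_t length, then the elements, all little-endian."""
    stream.write(struct.pack("<Q", len(values)))
    stream.write(struct.pack(f"<{len(values)}{fmt_char}", *values))


def read_array(stream, fmt_char):
    """Read back an array written by write_array."""
    (length,) = struct.unpack("<Q", stream.read(8))
    item_size = struct.calcsize(f"<{fmt_char}")
    return list(struct.unpack(f"<{length}{fmt_char}", stream.read(length * item_size)))


# Hypothetical two-target model: 3 classes for target 0, 1 class for target 1
buf = io.BytesIO()
write_array(buf, "i", [3, 1])       # num_class, an array of int32_t
raw = buf.getvalue()

buf.seek(0)
num_class = read_array(buf, "i")
```

An empty array is written as the length 0 alone, so it still occupies 8 bytes in the stream.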
Per-tree Metadata
  target_id: an array of int32_t. target_id[i] indicates the target for which the i-th tree produces output. If the tree is a multi-target tree (i.e. it yields output for all targets), target_id[i] is set to -1. This array is expected to have length num_tree.
  class_id: an array of int32_t. class_id[i] indicates the class for which the i-th tree produces output. For vector-leaf trees that produce outputs for multiple classes, the corresponding class_id[i] is set to -1. For regression and other non-classifier models, class_id[i] should be 0 for all trees. The class_id array is expected to have length num_tree.
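To make the per-tree metadata concrete, the sketch below fills target_id and class_id for a hypothetical single-target, three-class model with six scalar-leaf trees, where each tree contributes to exactly one class. It then counts the trees per (target, class) pair, which is the divisor that average_tree_output refers to. Trees marked with -1 are not handled here; how they enter the count is not covered by this sketch.

```python
from collections import Counter

# Hypothetical model: 1 target, 3 classes, 6 trees (one tree per class per round)
num_tree = 6
target_id = [0, 0, 0, 0, 0, 0]
class_id = [0, 1, 2, 0, 1, 2]

# Both arrays are expected to have length num_tree
assert len(target_id) == num_tree and len(class_id) == num_tree

# Number of trees associated with each (target, class) pair
trees_per_output = Counter(zip(target_id, class_id))
```

Here every (target, class) pair is served by two trees, so with average_tree_output enabled each summed output would be divided by 2.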
Model parameters
  postprocessor: an array of char. Stores a human-readable name of the post-processing function that’s applied to prediction outputs. Consult List of postprocessor functions for the list of available postprocessors.
  sigmoid_alpha: single float scalar. This model parameter is relevant when postprocessor="sigmoid".
  ratio_c: single float scalar. This model parameter is relevant when postprocessor="exponential_standard_ratio".
  base_scores: an array of double. This vector is expected to have length num_target * max(num_class). The elements will be laid out in the row-major layout. The predicted margin scores of all data points will be adjusted by this vector.
  attributes: an array of char containing a JSON string. The JSON string can store arbitrary model attributes. The JSON string must be a valid JSON object. To indicate the lack of an attribute, you may either:
    Set the field to an empty string (zero length) or
    Set the field to {}.
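The row-major layout of base_scores and the JSON encoding of attributes can be illustrated with a short sketch. The model dimensions and the helper base_score_index are hypothetical; the sketch also assumes that the char arrays hold UTF-8 bytes, which the text does not specify.

```python
import json

# Hypothetical model: 2 targets with 3 and 2 classes respectively
num_target = 2
num_class = [3, 2]
max_num_class = max(num_class)

# base_scores has length num_target * max(num_class), laid out row-major
base_scores = [0.0] * (num_target * max_num_class)


def base_score_index(target, cls):
    """Flat index of the (target, cls) entry in a row-major 2D layout."""
    return target * max_num_class + cls


base_scores[base_score_index(1, 0)] = 0.5

# attributes must be a JSON object; {} signals that no attribute is stored
attributes = json.dumps({}).encode("utf-8")
postprocessor = "sigmoid".encode("utf-8")
```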
Extension slot 1: Per-model optional fields. This field is currently not used.
  num_opt_field_per_model: single int32_t scalar. Set this value to 0 to indicate the lack of optional fields.
Tree 0: First tree, which is to be represented by the following fields.
  num_nodes: single int32_t scalar indicating the number of nodes
  has_categorical_split: single bool scalar indicating if categorical splits exist
  node_type: an array of int8_t representing enum TreeNodeType. node_type[i] indicates the type of node i.
  cleft: an array of int32_t, so that cleft[i] identifies the left child node of node i. Set to -1 to indicate the lack of the left child.
  cright: an array of int32_t, so that cright[i] identifies the right child node of node i. Set to -1 to indicate the lack of the right child.
  split_index: an array of int32_t, where split_index[i] gives the feature ID used in the test node i. If node i is not a test node, split_index[i] shall be -1.
  default_left: an array of bool, where default_left[i] indicates the default direction for the missing value in the test node i.
  leaf_value: an array of LeafOutputType, where leaf_value[i] is the output of the leaf node i. leaf_value[i] is only valid if node i is a leaf node with a scalar output. To access the output of a leaf node that produces a vector output, use leaf_vector instead. (See below.)
  threshold: an array of ThresholdType, where threshold[i] is the threshold used in the test node i. threshold[i] is only valid if node i is a test node with a numerical test (of form [feature value] [op] [threshold]). For categorical test nodes, use category_list instead. (See below.)
  cmp: an array of int8_t (representing enum Operator). cmp[i] is the comparison operator used in the test node i. cmp[i] is only valid if node i is a numerical test node.
  category_list_right_child: an array of bool where category_list_right_child[i] indicates which child node should be followed when a categorical test (of form [feature value] in [category list]) is satisfied: true selects the right child, false the left child. category_list_right_child[i] is not defined if node i is not a categorical test node.
  Leaf vectors
    Content (leaf_vector): an array of LeafOutputType. This array stores the leaf vectors for all nodes, such that the sub-array leaf_vector[leaf_vector_begin[i]:leaf_vector_end[i]] yields the leaf vector for the i-th node. The leaf vector uses the row-major layout to store a 2D array. If node i is not a leaf node with a vector output, the sub-array should be empty (leaf_vector_begin[i] == leaf_vector_end[i]).
    Beginning offset of each segment (leaf_vector_begin): an array of uint64_t.
    Ending offset of each segment (leaf_vector_end): an array of uint64_t.
  Category list (for categorical splits)
    Content (category_list): an array of uint32_t. This array stores the category lists of all nodes, such that the sub-array category_list[category_list_begin[i]:category_list_end[i]] yields the category list of the i-th node. If node i is not a categorical test node, the sub-array should be empty (category_list_begin[i] == category_list_end[i]).
    Beginning offset of each segment (category_list_begin): an array of uint64_t.
    Ending offset of each segment (category_list_end): an array of uint64_t.
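Both leaf_vector and category_list use the same begin/end offset scheme to pack variable-length segments into one flat array. The sketch below shows the slicing for a hypothetical five-node tree in which nodes 0 and 2 are categorical test nodes and nodes 1, 3 and 4 are leaves; the helper categories_of is illustrative only.

```python
# Flat storage: node 0 owns categories [1, 4, 7], node 2 owns [2]
category_list = [1, 4, 7, 2]
category_list_begin = [0, 3, 3, 4, 4]
category_list_end = [3, 3, 4, 4, 4]


def categories_of(node):
    """Category list of a node; empty when begin == end."""
    return category_list[category_list_begin[node]:category_list_end[node]]


per_node = [categories_of(i) for i in range(5)]
```

The same slicing, applied to leaf_vector with leaf_vector_begin and leaf_vector_end, yields the leaf vector of a node.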
  Metadata for node statistics
    data_count: an array of uint64_t. data_count[i] indicates the number of data points in the training data set whose traversal paths include node i. LightGBM provides this statistic.
    data_count_present: an array of bool. data_count_present[i] indicates whether data_count[i] is available. You may assign an empty array (length 0) to data_count and data_count_present if data count is unavailable for all nodes.
    sum_hess: an array of double. sum_hess[i] indicates the sum of the Hessian values for all data points whose traversal paths include node i. This information is available in XGBoost and is used as a proxy of the number of data points.
    sum_hess_present: an array of bool. sum_hess_present[i] indicates whether sum_hess[i] is available. You may assign an empty array (length 0) to sum_hess and sum_hess_present if Hessian sum is unavailable for all nodes.
    gain: an array of double. gain[i] indicates the change in the loss function that is attributed to the particular split at node i.
    gain_present: an array of bool. gain_present[i] indicates whether gain[i] is present. You may assign an empty array (length 0) to gain and gain_present if gain is unavailable for all nodes.
  Extension slot 2: Per-tree optional fields. This field is currently not used.
    num_opt_field_per_tree: single int32_t scalar. Set this value to 0 to indicate the lack of optional fields.
  Extension slot 3: Per-node optional fields. This field is currently not used.
    num_opt_field_per_node: single int32_t scalar. Set this value to 0 to indicate the lack of optional fields.
Tree 1: Use the same set of fields as Tree 0.
Other trees …
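To show how the per-tree arrays fit together, here is a minimal traversal sketch for a tree that contains only numerical test nodes and scalar leaves. It is not Treelite's predictor. It rests on several assumptions that the format description does not spell out: the root is node 0, a satisfied test sends the row to the left child, and missing values are encoded as NaN. The function predict_leaf and the dictionary layout are hypothetical.

```python
import math
import operator

# Operator enum codes mapped to Python comparisons (kNone = 0 is omitted)
CMP = {1: operator.eq, 2: operator.lt, 3: operator.le, 4: operator.gt, 5: operator.ge}
K_LEAF_NODE = 0


def predict_leaf(tree, row):
    """Walk from the root (assumed to be node 0) to a leaf and return leaf_value."""
    node = 0
    while tree["node_type"][node] != K_LEAF_NODE:
        value = row[tree["split_index"][node]]
        if math.isnan(value):
            # Missing value: follow the default direction
            go_left = tree["default_left"][node]
        else:
            # Assumption: the test [feature value] [op] [threshold] being true means "go left"
            go_left = CMP[tree["cmp"][node]](value, tree["threshold"][node])
        node = tree["cleft"][node] if go_left else tree["cright"][node]
    return tree["leaf_value"][node]


# Hypothetical 3-node tree: node 0 tests feature 0 with kLT (2) against 0.5
tree = {
    "node_type": [1, 0, 0],
    "cleft": [1, -1, -1],
    "cright": [2, -1, -1],
    "split_index": [0, -1, -1],
    "default_left": [True, False, False],
    "threshold": [0.5, 0.0, 0.0],
    "cmp": [2, 0, 0],
    "leaf_value": [0.0, -1.0, 1.0],
}
```

Categorical test nodes and vector leaves would need category_list and leaf_vector respectively and are left out to keep the sketch short.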
Note
Caveat for multi-target, multi-class classifiers
When the number of classes differs between targets, we use the largest number of classes (max_num_class) to shape the leaf vector (and base_scores). The leaf vector will have shape (num_target, max_num_class), with extra elements padded with 0. This heuristic has the following consequence: If a target has significantly more classes than other targets, a lot of space will be wasted.
This is the method used in scikit-learn’s sklearn.ensemble.RandomForestClassifier.
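The padding rule can be illustrated with a hypothetical two-target classifier whose targets have 2 and 4 classes. The per-target values below are made up for the example; only the padding and the row-major flattening follow the text.

```python
num_class = [2, 4]
max_num_class = max(num_class)

# Hypothetical per-target class outputs for one leaf
per_target = [
    [0.3, 0.7],
    [0.1, 0.2, 0.3, 0.4],
]

# Pad each row with zeros up to max_num_class, then flatten row-major
leaf_vector = []
for outputs in per_target:
    leaf_vector.extend(outputs + [0.0] * (max_num_class - len(outputs)))

padded_slots = len(leaf_vector) - sum(num_class)
```

In this example 2 of the 8 slots are padding; the wasted share grows as the class counts become more uneven.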
Note
A few v3 models are not representable using v4
We designed the v4 format to be mostly backwards compatible with v3, but there are a few exceptions:
The task type kMultiClfCategLeaf is no longer supported. This task type has not found any use in the wild. Neither GTIL nor TL2cgen supports it.
It is no longer possible to output integers from leaves. So LeafOutputType can no longer be uint32_t; output_type can no longer be kInt. Leaf outputs will now be assumed to be float or double. The output_type field is removed in v4. Integer outputs are being removed, as they have found little use in practice.
Note
Always use the little-endian order when reading and writing bytes
Always use the little-endian byte order when reading and writing scalars and arrays.
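As a quick illustration of the byte order rule, the sketch below packs a few scalars with an explicit little-endian format. It uses only Python's standard struct module; the chosen values are arbitrary.

```python
import struct

# "<" forces little-endian regardless of the host machine
major_ver_bytes = struct.pack("<i", 4)        # int32_t
num_tree_bytes = struct.pack("<Q", 258)       # uint64_t, 258 = 0x0102

# The least significant byte comes first
decoded = int.from_bytes(num_tree_bytes, byteorder="little")
```

Using the "<" prefix, rather than the native "@" or "=" modes, keeps the output identical on big-endian and little-endian hosts.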