Treelite Serialization Format v3
We first define a set of enum types to be used in the serialization format.
TypeInfo: underlying type uint8_t. Indicates the data type of another field.
  kInvalid (0)
  kUInt32 (1)
  kFloat32 (2)
  kFloat64 (3)
TaskType: underlying type uint8_t. Indicates the type of the learning task.
  kBinaryClfRegr (0)
  kMultiClfGrovePerClass (1)
  kMultiClfProbDistLeaf (2)
  kMultiClfCategLeaf (3)
OutputType: underlying type uint8_t. Indicates whether the trees produce a float or integer output.
  kFloat (0)
  kInt (1)
SplitFeatureType: underlying type int8_t. Indicates the type of an internal test node in a tree.
  kNone (0)
  kNumerical (1)
  kCategorical (2)
Operator: underlying type int8_t. Indicates the comparison operator used in an internal test node in a tree.
  kNone (0)
  kEQ (1)
  kLT (2)
  kLE (3)
  kGT (4)
  kGE (5)
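As a reading aid, the enum types above can be mirrored in a few lines of Python. This is only a sketch: the class and member names copy the ones listed above, while the use of Python's IntEnum is a choice made for this illustration and is not part of the format itself.

```python
from enum import IntEnum


class TypeInfo(IntEnum):
    """Data type of another field; stored as uint8_t."""
    kInvalid = 0
    kUInt32 = 1
    kFloat32 = 2
    kFloat64 = 3


class TaskType(IntEnum):
    """Type of the learning task; stored as uint8_t."""
    kBinaryClfRegr = 0
    kMultiClfGrovePerClass = 1
    kMultiClfProbDistLeaf = 2
    kMultiClfCategLeaf = 3


class OutputType(IntEnum):
    """Float or integer tree output; stored as uint8_t."""
    kFloat = 0
    kInt = 1


class SplitFeatureType(IntEnum):
    """Type of an internal test node; stored as int8_t."""
    kNone = 0
    kNumerical = 1
    kCategorical = 2


class Operator(IntEnum):
    """Comparison operator of an internal test node; stored as int8_t."""
    kNone = 0
    kEQ = 1
    kLT = 2
    kLE = 3
    kGT = 4
    kGE = 5
```

Because IntEnum members behave like integers, they can be passed directly to a byte-packing routine, and a raw code read from a stream can be converted back with, for example, Operator(5).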
The model type is currently parametrized with two template parameters: ThresholdType and LeafOutputType.
In v3, the following combinations are allowed:
ThresholdType | LeafOutputType
float | float
float | uint32_t
double | double
double | uint32_t
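A reader of the format may want to reject unsupported type pairs early. The sketch below assumes that the allowed pairs are (float, float), (float, uint32_t), (double, double) and (double, uint32_t), and that float and double correspond to the TypeInfo codes kFloat32 (2) and kFloat64 (3). The function name is_allowed_combination is hypothetical and not part of any library.

```python
# TypeInfo codes as listed earlier in this document.
K_UINT32, K_FLOAT32, K_FLOAT64 = 1, 2, 3

# Assumed set of allowed (ThresholdType, LeafOutputType) pairs in v3.
ALLOWED_COMBINATIONS = {
    (K_FLOAT32, K_FLOAT32),
    (K_FLOAT32, K_UINT32),
    (K_FLOAT64, K_FLOAT64),
    (K_FLOAT64, K_UINT32),
}


def is_allowed_combination(threshold_type, leaf_output_type):
    """Return True if the pair of TypeInfo codes is in the assumed allowed set."""
    return (threshold_type, leaf_output_type) in ALLOWED_COMBINATIONS
```

Under these assumptions, a float threshold with a double leaf output is rejected, as is any pair whose threshold type is an integer.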
A given Treelite model object is to be serialized as follows, with the fields to be written to the byte sequence in the exact order they appear in the following list.
Header
  Major version: single int32_t scalar. Set it to 3 for the v3 version.
  Minor version: single int32_t scalar.
  Patch version: single int32_t scalar.
  Threshold type: single uint8_t scalar representing enum TypeInfo.
  Leaf output type: single uint8_t scalar representing enum TypeInfo.
Number of trees: single uint64_t scalar.
Header 2
  Number of features in data: single int32_t scalar.
  Task type: single uint8_t scalar representing enum TaskType.
  Whether to average tree outputs: single bool scalar.
  Task parameters (TaskParam): a structure with the following fields:
    output_type: single uint8_t scalar representing enum OutputType.
    grove_per_class: single bool scalar.
    num_class: single unsigned int scalar.
    leaf_vector_size: single unsigned int scalar.
  Model parameters (ModelParam): a structure with the following fields:
    pred_transform: 256-character long char array. Stores a human-readable name of the transformation function that’s applied to prediction outputs. The unused elements in the array should be padded with null characters (\0).
    sigmoid_alpha: single float scalar. This model parameter is relevant when pred_transform="sigmoid".
    ratio_c: single float scalar. This model parameter is relevant when pred_transform="exponential_standard_ratio".
    global_bias: single float scalar.
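The first header, the tree count, and the padded pred_transform field can be sketched with Python's struct module. Several things here are assumptions rather than statements of the format: the byte order is taken to be little-endian, the function names pack_header and pack_pred_transform are hypothetical, and the name is limited to 255 bytes so that at least one null character remains. The TaskParam and ModelParam structures are deliberately not packed as a whole, because their in-memory layout (including any padding) is not specified in this section.

```python
import struct


def pack_header(major, minor, patch, threshold_type, leaf_output_type, num_tree):
    """Pack the first header followed by the number of trees.

    Assumes little-endian byte order and scalars written back to back.
    """
    # Three int32_t version numbers, then two uint8_t TypeInfo codes.
    header = struct.pack("<iiiBB", major, minor, patch,
                         threshold_type, leaf_output_type)
    # Number of trees as a uint64_t scalar.
    return header + struct.pack("<Q", num_tree)


def pack_pred_transform(name):
    """Encode a transformation name into a 256-byte, null-padded char array."""
    raw = name.encode("ascii")
    if len(raw) > 255:
        # Conservative assumption: keep room for at least one null character.
        raise ValueError("pred_transform name is too long")
    return raw.ljust(256, b"\0")
```

For example, pack_header(3, 0, 0, 2, 2, 10) describes a v3 model with float thresholds, float leaf outputs, and ten trees, and pack_pred_transform("sigmoid") yields the name followed by null padding.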
Extension slot 1: Per-model optional fields. This field is unused in the v3 version.
  Number of fields: single int32_t scalar. Set this value to 0 to indicate the lack of optional fields.
Tree 0: First tree, which is to be represented by the following fields.
  Number of nodes: single int scalar.
  If categorical splits exist: single bool scalar.
  Array of nodes: an array of Node structures, where Node consists of the following fields:
    cleft_: single int32_t scalar. Indicates the ID of the left child node. Set to -1 to indicate the lack of a left child.
    cright_: single int32_t scalar. Indicates the ID of the right child node. Set to -1 to indicate the lack of a right child.
    sindex_: single uint32_t scalar. This field gives both the feature ID used in the current test node (split_index) and the default direction for missing values (default_left). Set this value by computing split_index | (default_left ? (1U << 31U) : 0).
    info_: a union type containing leaf_value (of type LeafOutputType) and threshold (of type ThresholdType). To set this field, determine whether the node is a leaf node or an internal test node. Use leaf_value for leaf nodes; use threshold for internal test nodes.
    data_count_: single uint64_t scalar. Indicates the number of data points in the training data set whose traversal paths include this node. LightGBM provides this statistic.
    sum_hess_: single double scalar. Indicates the sum of the Hessian values for all data points whose traversal paths include this node. This information is available in XGBoost and is used as a proxy for the number of data points.
    gain_: single double scalar. Indicates the change in the loss function that is attributed to this particular split.
    split_type_: single int8_t scalar representing enum SplitFeatureType.
    cmp_: single int8_t scalar representing enum Operator.
    data_count_present_: single bool scalar. Indicates whether data_count_ is present.
    sum_hess_present_: single bool scalar. Indicates whether sum_hess_ is present.
    gain_present_: single bool scalar. Indicates whether gain_ is present.
    categories_list_right_child_: single bool scalar.
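The bit packing used by sindex_ can be illustrated as follows. The encoder follows the expression given for sindex_; the decoder is simply the inverse implied by that expression and is included for illustration. The function names encode_sindex and decode_sindex are hypothetical.

```python
DEFAULT_LEFT_BIT = 1 << 31  # highest bit of a uint32_t


def encode_sindex(split_index, default_left):
    """Combine the feature ID and the default direction into one uint32_t value."""
    if not 0 <= split_index < DEFAULT_LEFT_BIT:
        raise ValueError("split_index must fit in the lower 31 bits")
    return split_index | (DEFAULT_LEFT_BIT if default_left else 0)


def decode_sindex(sindex):
    """Recover (split_index, default_left) from a packed sindex_ value."""
    split_index = sindex & (DEFAULT_LEFT_BIT - 1)
    default_left = bool(sindex & DEFAULT_LEFT_BIT)
    return split_index, default_left
```

A consequence of this layout is that feature IDs are limited to 31 bits, since the top bit is reserved for the default direction.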
    Note
    Writing an array to the disk or a stream
    When writing an array to the disk or a stream, we first write the length of the array (uint64_t scalar), and then the content of the array (sizeof(Node) * len bytes).
  Leaf vectors
    Content (leaf_vector_): an array of LeafOutputType. This array stores the leaf vectors for all nodes, such that the sub-array leaf_vector_[leaf_vector_begin_[i]:leaf_vector_end_[i]] yields the leaf vector for the i-th node.
    Beginning offset of each segment (leaf_vector_begin_): an array of size_t.
    Ending offset of each segment (leaf_vector_end_): an array of size_t.
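The length-prefixed array convention and the leaf vector lookup can be sketched as below. The sketch assumes little-endian byte order and uses struct format characters to stand in for element types (for example "f" for float, or "Q" for size_t on a platform where size_t is 8 bytes). The helper names pack_array, unpack_array and leaf_vector_of are hypothetical.

```python
import struct


def pack_array(values, fmt_char):
    """Write the element count as uint64_t, then the elements back to back."""
    count = len(values)
    return struct.pack("<Q", count) + struct.pack(f"<{count}{fmt_char}", *values)


def unpack_array(buf, offset, fmt_char):
    """Read one length-prefixed array; return (values, offset past the array)."""
    (count,) = struct.unpack_from("<Q", buf, offset)
    offset += 8
    values = list(struct.unpack_from(f"<{count}{fmt_char}", buf, offset))
    offset += count * struct.calcsize(f"<{fmt_char}")
    return values, offset


def leaf_vector_of(leaf_vector, begin, end, i):
    """Return the leaf vector of the i-th node from the flattened content array."""
    return leaf_vector[begin[i]:end[i]]
```

With this layout, a node that has no leaf vector has equal beginning and ending offsets, so its slice is empty.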
  Matching categories (for categorical splits)
    Content (matching_categories_): an array of uint32_t. This array stores the category lists of all nodes, such that the sub-array matching_categories_[matching_categories_offset_[i]:matching_categories_offset_[i+1]] yields the category list of the i-th node.
    Beginning offset of each segment (matching_categories_offset_): an array of size_t.
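The offset scheme for matching categories can be illustrated with a short sketch. Since the category list of node i ends where the list of node i+1 begins, a single offset array with one more entry than the number of nodes is enough. The helper names build_matching_categories and categories_of are hypothetical.

```python
def build_matching_categories(per_node_categories):
    """Flatten per-node category lists into a content array and an offset array."""
    content = []
    offsets = [0]
    for categories in per_node_categories:
        content.extend(categories)
        # The end of this node's segment is the start of the next one.
        offsets.append(len(content))
    return content, offsets


def categories_of(content, offsets, i):
    """Return the category list of the i-th node."""
    return content[offsets[i]:offsets[i + 1]]
```

Nodes without a categorical split contribute an empty segment, which shows up as two consecutive equal offsets.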
  Extension slot 2: Per-tree optional fields. This field is unused in the v3 version.
    Number of fields: single int32_t scalar. Set this value to 0 to indicate the lack of optional fields.
  Extension slot 3: Per-node optional fields. This field is unused in the v3 version.
    Number of fields: single int32_t scalar. Set this value to 0 to indicate the lack of optional fields.
Tree 1: Use the same set of fields as Tree 0.
Other trees …