Treelite Serialization Format v3
We first define a set of enum types to be used in the serialization format.
TypeInfo: underlying type uint8_t. Indicates the data type of another field. Values: kInvalid (0), kUInt32 (1), kFloat32 (2), kFloat64 (3).
TaskType: underlying type uint8_t. Indicates the type of the learning task. Values: kBinaryClfRegr (0), kMultiClfGrovePerClass (1), kMultiClfProbDistLeaf (2), kMultiClfCategLeaf (3).
OutputType: underlying type uint8_t. Indicates whether the trees produce a float or integer output. Values: kFloat (0), kInt (1).
SplitFeatureType: underlying type int8_t. Indicates the type of an internal test node in a tree. Values: kNone (0), kNumerical (1), kCategorical (2).
Operator: underlying type int8_t. Indicates the comparison operator used in an internal test node in a tree. Values: kNone (0), kEQ (1), kLT (2), kLE (3), kGT (4), kGE (5).
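For readers who handle the format from a script, the enum types above can be mirrored as follows. This is a minimal Python sketch rather than part of Treelite itself; the class names simply reuse the enum names from the list, and the underlying C types are noted in comments only.

```python
from enum import IntEnum


class TypeInfo(IntEnum):
    # Stored as uint8_t; data type of another field.
    kInvalid = 0
    kUInt32 = 1
    kFloat32 = 2
    kFloat64 = 3


class TaskType(IntEnum):
    # Stored as uint8_t; type of the learning task.
    kBinaryClfRegr = 0
    kMultiClfGrovePerClass = 1
    kMultiClfProbDistLeaf = 2
    kMultiClfCategLeaf = 3


class OutputType(IntEnum):
    # Stored as uint8_t; float or integer tree output.
    kFloat = 0
    kInt = 1


class SplitFeatureType(IntEnum):
    # Stored as int8_t; type of an internal test node.
    kNone = 0
    kNumerical = 1
    kCategorical = 2


class Operator(IntEnum):
    # Stored as int8_t; comparison operator of an internal test node.
    kNone = 0
    kEQ = 1
    kLT = 2
    kLE = 3
    kGT = 4
    kGE = 5
```

Because these are IntEnum members, they can be passed directly wherever an integer code is expected, for example when packing a one-byte field.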
The model type is currently parametrized with two template parameters: ThresholdType and LeafOutputType.
In v3, the following combinations are allowed:
ThresholdType | LeafOutputType
float | float
float | uint32_t
double | double
double | uint32_t
A given Treelite model object is to be serialized as follows, with the fields to be written to the byte sequence in the exact order they appear in the following list.
Header
Major version: single int32_t scalar. Set it to 3 for the v3 version.
Minor version: single int32_t scalar.
Patch version: single int32_t scalar.
Threshold type: single uint8_t scalar representing enum TypeInfo.
Leaf output type: single uint8_t scalar representing enum TypeInfo.
Number of trees: single uint64_t scalar.
Header 2
Number of features in data: single int32_t scalar.
Task type: single uint8_t scalar representing enum TaskType.
Whether to average tree outputs: single bool scalar.
Task parameters (TaskParam): a structure with the following fields:
  output_type: single uint8_t scalar representing enum OutputType.
  grove_per_class: single bool scalar.
  num_class: single unsigned int scalar.
  leaf_vector_size: single unsigned int scalar.
Model parameters (ModelParam): a structure with the following fields:
  pred_transform: 256-character long char array. Stores a human-readable name of the transformation function that’s applied to prediction outputs. The unused elements in the array should be padded with null characters (\0).
  sigmoid_alpha: single float scalar. This model parameter is relevant when pred_transform="sigmoid".
  ratio_c: single float scalar. This model parameter is relevant when pred_transform="exponential_standard_ratio".
  global_bias: single float scalar.
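The byte layout of the first header and of ModelParam can be sketched with Python's struct module. The sketch assumes little-endian byte order, which the text above does not state, and the function names pack_header and pack_model_param are hypothetical. TaskParam is left out because its in-memory padding is not specified here.

```python
import struct

# Assumption: little-endian byte order ("<"), with no padding between fields.


def pack_header(threshold_type, leaf_output_type, num_tree, minor=0, patch=0):
    """Pack the first header: three int32_t version numbers, two uint8_t
    TypeInfo codes and the uint64_t number of trees, in that order."""
    return struct.pack("<iiiBBQ", 3, minor, patch,
                       threshold_type, leaf_output_type, num_tree)


def pack_model_param(pred_transform, sigmoid_alpha, ratio_c, global_bias):
    """Pack ModelParam: a 256-byte char array followed by three floats."""
    name = pred_transform.encode("ascii")
    if len(name) > 255:
        # Conservative choice of this sketch: keep at least one null byte.
        raise ValueError("pred_transform must be shorter than 256 characters")
    # The "256s" format pads the name with null bytes up to 256 bytes.
    return struct.pack("<256sfff", name, sigmoid_alpha, ratio_c, global_bias)


# Example: a float32 model (TypeInfo code 2 for both types) with 10 trees.
header = pack_header(2, 2, 10)
model_param = pack_model_param("sigmoid", 1.0, 1.0, 0.0)
```

Under these assumptions the header occupies 22 bytes and ModelParam occupies 268 bytes.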
Extension slot 1: Per-model optional fields. This field is unused in the v3 version.
Number of fields: single int32_t scalar. Set this value to 0 to indicate the lack of optional fields.
Tree 0: First tree, which is to be represented by the following fields.
Number of nodes: single int scalar.
If categorical splits exist: single bool scalar.
Array of nodes: an array of Node structures, where Node consists of the following fields:
  cleft_: single int32_t scalar. Indicates the ID of the left child node. Set to -1 to indicate the lack of the left child.
  cright_: single int32_t scalar. Indicates the ID of the right child node. Set to -1 to indicate the lack of the right child.
  sindex_: single uint32_t scalar. This field gives both the feature ID used in the current test node (split_index) and the default direction for missing values (default_left). Set this value by computing split_index | (default_left ? (1U << 31U) : 0).
  info_: a union type containing leaf_value (of type LeafOutputType) and threshold (of type ThresholdType). To set this field, determine whether the node is a leaf node or an internal test node. Use leaf_value for leaf nodes; use threshold for internal test nodes.
  data_count_: single uint64_t scalar. Indicates the number of data points in the training data set whose traversal paths include this node. LightGBM provides this statistic.
  sum_hess_: single double scalar. Indicates the sum of the Hessian values for all data points whose traversal paths include this node. This information is available in XGBoost and is used as a proxy for the number of data points.
  gain_: single double scalar. Indicates the change in the loss function that is attributed to this particular split.
  split_type_: single int8_t scalar representing enum SplitFeatureType.
  cmp_: single int8_t scalar representing enum Operator.
  data_count_present_: single bool scalar. Indicates whether data_count_ is present.
  sum_hess_present_: single bool scalar. Indicates whether sum_hess_ is present.
  gain_present_: single bool scalar. Indicates whether gain_ is present.
  categories_list_right_child_: single bool scalar.
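The sindex_ formula and the Node field order can be illustrated with a short Python sketch. The helper names are hypothetical. The struct layout is an assumption: little-endian, ThresholdType and LeafOutputType both float, natural alignment, and two trailing padding bytes as a C compiler on a typical 64-bit platform would add. None of the padding details are specified in the text above.

```python
import struct

DEFAULT_LEFT_BIT = 1 << 31  # top bit of the uint32_t sindex_ field


def encode_sindex(split_index, default_left):
    """Compute split_index | (default_left ? (1U << 31U) : 0)."""
    if not 0 <= split_index < DEFAULT_LEFT_BIT:
        raise ValueError("split_index must fit in 31 bits")
    return split_index | (DEFAULT_LEFT_BIT if default_left else 0)


def decode_sindex(sindex):
    """Recover (split_index, default_left) from a packed sindex_ value."""
    return sindex & (DEFAULT_LEFT_BIT - 1), bool(sindex & DEFAULT_LEFT_BIT)


# Assumed layout of Node for float thresholds (48 bytes in total):
# cleft_, cright_, sindex_, info_, data_count_, sum_hess_, gain_,
# split_type_, cmp_, four bool flags, then 2 padding bytes.
NODE_F32 = struct.Struct("<iiIfQddbb????2x")


def pack_test_node_f32(cleft, cright, split_index, default_left,
                       threshold, cmp_op):
    """Pack a numerical test node that carries no training statistics."""
    return NODE_F32.pack(
        cleft, cright, encode_sindex(split_index, default_left), threshold,
        0, 0.0, 0.0,   # data_count_, sum_hess_, gain_ (marked absent below)
        1,             # split_type_ = kNumerical
        cmp_op,        # cmp_, an Operator code such as kLT = 2
        False, False, False,  # data_count_/sum_hess_/gain_ present flags
        False)         # categories_list_right_child_


# Example: feature 5, missing values go left, test "x < 0.5".
raw_node = pack_test_node_f32(1, 2, 5, True, 0.5, 2)
```

A leaf node would be packed the same way, with leaf_value stored in the info_ slot instead of threshold.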
Note
Writing an array to the disk or a stream
When writing an array to the disk or a stream, we first write the length of the array (uint64_t scalar), and then the content of the array (sizeof(Node) * len bytes).
Leaf vectors
Content (leaf_vector_): an array of LeafOutputType. This array stores the leaf vectors for all nodes, such that the sub-array leaf_vector_[leaf_vector_begin_[i]:leaf_vector_end_[i]] yields the leaf vector for the i-th node.
Beginning offset of each segment (leaf_vector_begin_): an array of size_t.
Ending offset of each segment (leaf_vector_end_): an array of size_t.
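The length-prefixed array convention and the begin/end offsets of the leaf vectors can be sketched as follows. This is an illustrative Python sketch, not Treelite code: pack_array and leaf_vector_of are hypothetical helpers, little-endian byte order is assumed, and the example offsets for a three-node tree are made up (in particular, giving the internal node an empty segment is an assumption).

```python
import struct


def pack_array(fmt_char, values):
    """Write the array length as uint64_t, then the elements back to back.
    fmt_char is a struct format character, e.g. "f" for float."""
    body = struct.pack(f"<{len(values)}{fmt_char}", *values)
    return struct.pack("<Q", len(values)) + body


def leaf_vector_of(node_id, leaf_vector, begin, end):
    """Return leaf_vector_[leaf_vector_begin_[i]:leaf_vector_end_[i]]."""
    return leaf_vector[begin[node_id]:end[node_id]]


# Hypothetical tree: node 0 is a test node, nodes 1 and 2 are leaves that
# each carry a leaf vector of size 2.
leaf_vector = [0.25, 0.75, 0.5, 0.5]
leaf_vector_begin = [0, 0, 2]
leaf_vector_end = [0, 2, 4]

serialized = (
    pack_array("f", leaf_vector)          # LeafOutputType = float here
    + pack_array("Q", leaf_vector_begin)  # size_t assumed to be 8 bytes
    + pack_array("Q", leaf_vector_end)
)
```

Storing both a begin and an end offset per node lets a node with no leaf vector be represented by an empty segment.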
Matching categories (for categorical splits)
Content (matching_categories_): an array of uint32_t. This array stores the category lists of all nodes, such that the sub-array matching_categories_[matching_categories_offset_[i]:matching_categories_offset_[i+1]] yields the category list of the i-th node.
Beginning offset of each segment (matching_categories_offset_): an array of size_t.
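The offset scheme for category lists can be sketched in a few lines of Python. The helper names are hypothetical and the example tree is made up. Since the slice uses offset [i+1], the sketch assumes the offset array holds one more entry than there are nodes.

```python
def build_category_arrays(per_node_categories):
    """Flatten per-node category lists into a content array and an offset
    array with one more entry than there are nodes."""
    content, offsets = [], [0]
    for categories in per_node_categories:
        content.extend(categories)
        offsets.append(len(content))
    return content, offsets


def categories_of(node_id, matching_categories, offsets):
    """Return matching_categories_[offset_[i]:offset_[i+1]] for node i."""
    return matching_categories[offsets[node_id]:offsets[node_id + 1]]


# Hypothetical tree: node 0 is a categorical test node matching categories
# 1, 4 and 6; nodes 1 and 2 are leaves and have empty category lists.
matching_categories, matching_categories_offset = build_category_arrays(
    [[1, 4, 6], [], []]
)
```

With this layout, a node without a categorical split simply has two equal consecutive offsets.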
Extension slot 2: Per-tree optional fields. This field is unused in the v3 version.
Number of fields: single int32_t scalar. Set this value to 0 to indicate the lack of optional fields.
Extension slot 3: Per-node optional fields. This field is unused in the v3 version.
Number of fields: single int32_t scalar. Set this value to 0 to indicate the lack of optional fields.
Tree 1: Use the same set of fields as Tree 0.
Other trees …