Treelite Serialization Format v3

We first define a set of enum types to be used in the serialization format.

  • TypeInfo: underlying type uint8_t. Indicates the data type of another field

    • kInvalid (0)

    • kUInt32 (1)

    • kFloat32 (2)

    • kFloat64 (3)

  • TaskType: underlying type uint8_t. Indicates the type of the learning task.

    • kBinaryClfRegr (0)

    • kMultiClfGrovePerClass (1)

    • kMultiClfProbDistLeaf (2)

    • kMultiClfCategLeaf (3)

  • OutputType: underlying type uint8_t. Indicates whether the trees produce a float or integer output.

    • kFloat (0)

    • kInt (1)

  • SplitFeatureType: underlying type int8_t. Indicates the type of an internal test node in a tree.

    • kNone (0)

    • kNumerical (1)

    • kCategorical (2)

  • Operator: underlying type int8_t. Indicates the comparison operator used in an internal test node in a tree.

    • kNone (0)

    • kEQ (1)

    • kLT (2)

    • kLE (3)

    • kGT (4)

    • kGE (5)

The model type is currently parametrized with two template parameters: ThresholdType and LeafOutputType. In v3, the following combinations are allowed:

ThresholdType

LeafOutputType

float

uint32_t

float

float

double

uint32_t

double

double

A given Treelite model object is to be serialized as follows, with the fields to be written to the byte sequence in the exact order they appear in the following list.

  1. Header

    • Major version: single int32_t scalar. Set it to 3 for the v3 version.

    • Minor version: single int32_t scalar.

    • Patch version: single int32_t scalar.

    • Threshold type: single uint8_t scalar representing enum TypeInfo.

    • Leaf output type: single uint8_t scalar representing enum TypeInfo.

  2. Number of trees: single uint64_t scalar.

  3. Header 2

    • Number of features in data: single int32_t scalar.

    • Task type: single uint8_t scalar representing enum TaskType.

    • Whether to average tree outputs: single bool scalar.

    • Task parameters (TaskParam): a structure with following fields

      • output_type: single uint8_t scalar representing enum OutputType

      • grove_per_class: single bool scalar

      • num_class: single unsigned int scalar

      • leaf_vector_size: single unsigned int scalar

    • Model parameters (ModelParam) a structure with following fields

      • pred_transform: 256-character long char array. Stores a human-readable name of the transformation function that’s applied to prediction outputs. The unused elements in the array should be padded with null characters (\0).

      • sigmoid_alpha: single float scalar. This model parameter is relevant when pred_transform="sigmoid".

      • ratio_c: single float scalar. This model parameter is relevant when pred_transform="exponential_standard_ratio".

      • global_bias: single float scalar.

  4. Extension slot 1: Per-model optional fields. This field is unused in the v3 version.

    • Number of fields: single int32_t scalar. Set this value to 0, to indicate the lack of optional fields.

  5. Tree 0: First tree, which is to be represented by the following fields.

    • Number of nodes: single int scalar.

    • If categorical splits exist: single bool scalar.

    • Array of nodes: an array of Node structure, where Node consists of the following fields:

      • cleft_: single int32_t scalar. Indicates the ID of the left child node. Set to -1 to indicate the lack of the left child.

      • cright_: single int32_t scalar. Indicates the ID of the right child node. Set to -1 to indicate the lack of the right child.

      • sindex_: single uint32_t scalar. This field gives both the feature ID used in the current test node (split_index), as well as the default direction for the missing value (default_left). Set this value by computing split_index | (default_left ? (1U << 31U) : 0).

      • info_: a union type containing leaf_value (of type LeafOutputType) and threshold (of type ThresholdType). To set this field, determine whether the node is a leaf node or an internal test node. Use leaf_value for leaf nodes; use threshold for internal test nodes.

      • data_count_: single uint64_t scalar. Indicates the number of data points in the training data set whose traversal paths include this node. LightGBM provides this statistics.

      • sum_hess_: single double scalar. Indicates the sum of the Hessian values for all data points whose traversal paths include this node. This information is available in XGBoost and is used as a proxy of the number of data points.

      • gain_: single double scalar. Indicates the change in the loss function that is attributed to this particular split.

      • split_type_: single int8_t scalar representing enum SplitFeatureType.

      • cmp_: single int8_t scalar representing enum Operator.

      • data_count_present_: single bool scalar. Indicates whether data_count_ is present.

      • sum_hess_present_: single bool scalar. Indicates whether sum_hess_ is present.

      • gain_present_: single bool scalar. Indicates whether gain_ is present.

      • categories_list_right_child_: single bool scalar.

      Note

      Writing an array to the disk or a stream

      When writing an array to the disk or a stream, we first write the length of the array (uint64_t scalar), and then the content of the array (sizeof(Node) * len bytes).

    • Leaf vectors

      • Content (leaf_vector_): an array of LeafOutputType. This array stores the leaf vectors for all nodes, such that the sub-array leaf_vector_[leaf_vector_begin[i]_:leaf_vector_end_[i]] yields the leaf vector for the i-th node.

      • Beginning offset of each segment (leaf_vector_begin_): an array of size_t.

      • Ending offset of each segment (leaf_vector_end_): an array of size_t.

    • Matching categories (for categorical splits)

      • Content (matching_categories_): an array of uint32_t. This array stores the category lists of all nodes, such that the sub-array matching_categories_[matching_categories_offset_[i]:matching_categories_offset_[i+1]] yields the category list of the i-th node.

      • Beginning offset of each segment (matching_categories_offset_): an array of size_t.

    • Extension slot 2: Per-tree optional fields. This field is unused in the v3 version.

      • Number of fields: single int32_t scalar. Set this value to 0, to indicate the lack of optional fields.

    • Extension slot 3: Per-node optional fields. This field is unused in the v3 version.

      • Number of fields: single int32_t scalar. Set this value to 0, to indicate the lack of optional fields.

  6. Tree 1: Use the same set of fields as Tree 0.

  7. Other trees …