Optimizing prediction subroutine

Treelite offers system-level optimizations to boost prediction performance. Note that the model information is wholly preserved; the optimizations only affect the manner in which prediction is performed.

Annotate conditional branches

This optimization analyzes and annotates every threshold condition in the test nodes to improve performance.

How to use

The first step is to generate the branch annotation record for your ensemble model. Make sure to have your training data ready.

# model = your ensemble model (object of type treelite.Model)
# dmat = training data (object of type treelite.DMatrix)

# Create annotator object
annotator = treelite.Annotator()
# Annotate branches by iterating over the training data
annotator.annotate_branch(model=model, dmat=dmat, verbose=True)
# Save the branch annotation record as a JSON file
annotator.save(path='mymodel-annotation.json')

To utilize the branch annotation record, supply the compiler parameter annotate_in when exporting the model:

# Export a source directory
model.compile(dirpath='./mymodel', verbose=True,
              params={'annotate_in': 'mymodel-annotation.json'})

# Export a source directory, packaged in a zip archive
model.export_srcpkg(platform='unix', toolchain='gcc', pkgpath='./mymodel.zip',
                    libname='mymodel.so', verbose=True,
                    params={'annotate_in': 'mymodel-annotation.json'})

# Export a shared library
model.export_lib(toolchain='gcc', libpath='./mymodel.so', verbose=True,
                 params={'annotate_in': 'mymodel-annotation.json'})

Technical details

Rationale

Modern CPUs heavily rely on a technique known as branch prediction, in which they “guess” the result of the conditional expression in each if-else branch ahead of time. Given a program

if ( [conditional expression] ) {
  foo();
} else {
  bar();
}

the CPU will pre-fetch the instructions for the function foo() if the given condition is likely to be true. On the other hand, if the condition is likely to be false, the CPU will pre-fetch the instructions for the function bar(). Suffice it to say that correctly predicting conditional branches has a great impact on performance. Each time the CPU predicts a branch correctly, it gets to keep the instructions it pre-fetched earlier. Each time it mispredicts, it must throw away the pre-fetched instructions and fetch a new set. If you’d like to learn more about the importance of branch prediction, read this excellent introductory article from Stanford.

The prediction subroutine for a decision tree ensemble is problematic, as it is replete with conditional branches that will have to be guessed well:

/* A slice of prediction subroutine */
float predict_margin(const float* data) {
  float sum = 0.0f;
  if (!(data[0].missing != -1) || data[0].fvalue <= 9.5) {
    if (!(data[0].missing != -1) || data[0].fvalue <= 3.5) {
      if (!(data[10].missing != -1) || data[10].fvalue <= 0.74185) {
        if (!(data[0].missing != -1) || data[0].fvalue <= 1.5) {
          if (!(data[2].missing != -1) || data[2].fvalue <= 2.08671) {
            if ( (data[4].missing != -1) && data[4].fvalue <= 2.02632) {
              if (!(data[3].missing != -1) || data[3].fvalue <= 0.763339) {
                sum += (float)0.00758165;
              } else {
                sum += (float)0.0060202;
              }
            } else {
              if ( (data[1].missing != -1) && data[1].fvalue <= 0.0397456) {
                sum += (float)0.00415399;
              } else {
                sum += (float)0.00821985;
              }
            }
/* and so forth... */

In fact, every threshold condition in the test nodes has to be predicted. While the CPU lacks adequate information to make good guesses on these conditions, we can help by supplying that information.

Mechanism for supplying the C compiler with branch information

We predict the likelihood of each condition by counting the number of training data points that satisfy that condition; the two labeled cases below illustrate how the count translates into an annotation.

If a condition is true at least 50% of the time (over the training data), the condition is labeled as “expected to be true”:

/* expected to be true */
if ( __builtin_expect( [condition], 1 ) ) {
  ...
} else {
  ...
}

On the other hand, if a condition is false at least 50% of the time, the condition is labeled as “expected to be false”:

/* expected to be false */
if ( __builtin_expect( [condition], 0 ) ) {
  ...
} else {
  ...
}
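
The counting itself happens inside Treelite; the Python sketch below is only an illustration of the idea, operating directly on a training matrix rather than on the actual Treelite data structures. (In the real annotator, only the data points that reach a given test node are counted toward that node's condition.)

# Illustrative sketch only -- not the Treelite API.
# Estimate how often a test-node condition holds over the training data
# and derive the corresponding __builtin_expect hint.
import numpy as np

def branch_hint(X, feature_id, threshold):
    """X: 2D array of training data, with NaN marking missing values.
    Returns 1 if x[feature_id] <= threshold holds for at least 50% of
    the rows, and 0 otherwise. Comparisons against NaN are False, so
    missing values count as not satisfying the condition here."""
    column = X[:, feature_id]
    num_true = np.count_nonzero(column <= threshold)
    return 1 if 2 * num_true >= column.shape[0] else 0

# A node whose condition holds for most training rows gets the hint 1, so the
# generated C code wraps its condition in __builtin_expect([condition], 1).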

Note

On the expression __builtin_expect

The __builtin_expect expression is a compiler intrinsic to supply the C compiler with branch prediction information. Both gcc and clang support it. Unfortunately, Microsoft Visual C++ does not. To take advantage of branch annotation, make sure to use gcc or clang on the target machine.

Use integer thresholds for conditions

This optimization replaces all thresholds in the test nodes with integers so that each threshold condition performs integer comparison instead of the usual floating-point comparison. The thresholds are said to be quantized into integer indices.

BEFORE:

if (data[3].fvalue < 1.5) {  /* floating-point comparison */
  ...
}

AFTER:

if (data[3].qvalue < 3) {     /* integer comparison */
  ...
}

How to use

Simply add the compiler parameter quantize=1 when exporting the model:

# Export a source directory
model.compile(dirpath='./mymodel', verbose=True,
              params={'quantize': 1})

# Export a source directory, packaged in a zip archive
model.export_srcpkg(platform='unix', toolchain='gcc', pkgpath='./mymodel.zip',
                    libname='mymodel.so', verbose=True,
                    params={'quantize': 1})

# Export a shared library
model.export_lib(toolchain='gcc', libpath='./mymodel.so', verbose=True,
                 params={'quantize': 1})

Technical details

Rationale

On some platforms such as x86-64, replacing floating-point thresholds with integers helps improve performance by 1) reducing executable code size and 2) improving data locality. This is so because on these platforms, integer constants can be embedded as part of the comparison instruction, whereas floating-point constants cannot.

Let’s look at the x86-64 platform. The integer comparison

a <= 4

produces one assembly instruction:

cmpl    $4, 8(%rsp)       ;    8(%rsp) contains the variable a

Since the integer constant 4 got embedded into the comparison instruction cmpl, we only had to fetch the variable a from memory.

On the other hand, the floating-point comparison

b < 1.2f

produces two assembly instructions:

movss   250(%rip), %xmm0  ;  250(%rip) contains the constant 1.2f
ucomiss  12(%rsp), %xmm0  ;   12(%rsp) contains the variable b

Notice that the floating-point constant 1.2f did not get embedded into the comparison instruction ucomiss. The constant had to be fetched (with movss) into the register xmm0 before the comparison could take place. To summarize,

  • a floating-point comparison takes twice as many instructions as an integer comparison, increasing the executable code size;

  • a floating-point comparison involves an extra fetch instruction (movss), potentially causing a cache miss.

Caveats. As we’ll see in the next section, using integer thresholds adds overhead at prediction time. You should ensure that the benefits of integer comparisons outweigh the overhead costs.

Mechanism for mapping features

When the quantize option is enabled, Treelite will collect all thresholds occurring in the tree ensemble model. For each feature, a list of its thresholds is generated, sorted in ascending order:

/* example of what the per-feature threshold lists may look like */

Feature 0:  [1.5, 6.5, 12.5]
Feature 3:  [0.15, 0.35, 1.5]
Feature 6:  [7, 9, 10, 135]
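
As an illustration (the node representation here is hypothetical; Treelite traverses its own model objects internally), building these lists amounts to gathering every (feature, threshold) pair from the test nodes and sorting the unique values per feature:

from collections import defaultdict

# Illustrative sketch only -- test_nodes is an assumed list of
# (feature_id, threshold) pairs gathered from the ensemble's test nodes.
def collect_thresholds(test_nodes):
    per_feature = defaultdict(set)
    for feature_id, threshold in test_nodes:
        per_feature[feature_id].add(threshold)
    # sort each feature's thresholds in ascending order
    return {fid: sorted(vals) for fid, vals in per_feature.items()}

# collect_thresholds([(0, 6.5), (0, 1.5), (3, 0.35), (0, 12.5)])
#   -> {0: [1.5, 6.5, 12.5], 3: [0.35]}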

Using these lists, we may convert any data point into integer indices via simple look-ups. For feature 0 in the example above, values will be mapped to integer indices as follows:

Let x be the value of feature 0.

Assign -1 if          x  <  1.5
Assign  0 if          x ==  1.5
Assign  1 if   1.5  < x  <  6.5
Assign  2 if          x ==  6.5
Assign  3 if   6.5  < x  < 12.5
Assign  4 if          x == 12.5
Assign  5 if          x  > 12.5

Let’s look at a specific example of how a floating-point vector gets translated into a vector of integer indices:

feature id   0     1        2      3      4        5      6
            [7, missing, missing, 0.2, missing, missing, 20 ]
         => [3, missing, missing,   1, missing, missing,  5 ]
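
The look-up can be carried out with a binary search over the sorted threshold list. The Python sketch below reproduces the mapping rule stated above; it is an illustration of the rule, not the generated code (Treelite emits the equivalent search in C):

import bisect

def quantize(x, thresholds):
    """Map a feature value x to its integer index, given the feature's
    threshold list sorted in ascending order."""
    pos = bisect.bisect_left(thresholds, x)   # number of thresholds < x
    if pos < len(thresholds) and thresholds[pos] == x:
        return 2 * pos        # x coincides with the pos-th threshold
    return 2 * pos - 1        # x falls strictly between thresholds

# Reproducing the example above:
#   quantize(7,   [1.5, 6.5, 12.5])  -> 3
#   quantize(0.2, [0.15, 0.35, 1.5]) -> 1
#   quantize(20,  [7, 9, 10, 135])   -> 5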

Since the prediction subroutine still needs to accept floating-point features, the features will be converted internally before actual prediction. If, without the quantize option, the prediction subroutine looked like the following,

float predict_margin(const Entry* data) {
  /* ... Run through the trees to compute the leaf output score ... */

  return score;
}

it will now have an extra step of mapping the incoming data vector into integers:

float predict_margin(const Entry* data) {
  /* ... Quantize feature values in data into integer indices   ... */

  /* ... Run through the trees to compute the leaf output score ... */
  return score;
}
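
Note that the two optimizations address different costs (branch misprediction on one hand, code size and data locality on the other), so they are independent of each other. Assuming both compiler parameters can be passed in the same dictionary, a single export can presumably enable both:

# Combining both optimizations (parameter names as used earlier on this page)
model.export_lib(toolchain='gcc', libpath='./mymodel.so', verbose=True,
                 params={'annotate_in': 'mymodel-annotation.json',
                         'quantize': 1})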