Treelite offers system-level optimizations to boost prediction performance. Notice that the model information is wholly preserved; the optimizations only affect the manner at which prediction is performed.
Contents
This optimization analyzes and annotates every threshold conditions in the test nodes to improve performance.
The first step is to generate the branch annotation record for your ensemble model. Make sure to have your training data ready.
# model = your ensemble model (object of type treelite.Model)
# dmat = training data (object of type treelite.DMatrix)
# Create annotator object
annotator = treelite.Annotator()
# Annotate branches by iterating over the training data
annotator.annotate_branch(model=model, dmat=dmat, verbose=True)
# Save the branch annotation record as a JSON file
annotator.save(path='mymodel-annotation.json')
To utilize the branch annotation record, supply the compiler parameter
annotate_in
when exporting the model:
# Export a source directory
model.compile(dirpath='./mymodel', verbose=True,
params={'annotate_in': 'mymodel-annotation.json'})
# Export a source directory, packaged in a zip archive
model.export_srcpkg(platform='unix', toolchain='gcc', pkgpath='./mymodel.zip',
libname='mymodel.so', verbose=True,
params={'annotate_in': 'mymodel-annotation.json'})
# Export a shared library
model.export_lib(toolchain='gcc', libpath='./mymodel.so', verbose=True,
params={'annotate_in': 'mymodel-annotation.json'})
Modern CPUs heavily rely on a technique known as
branch prediction, in which
they “guess” the result of the conditional expression in each if
-else
branch ahead of time. Given a program
if ( [conditional expression] ) {
foo();
} else {
bar();
}
the CPU will pre-fetch the instructions for the function foo()
if the given
condition is likely to be true. On the other hand, if the condition is likely
to be false, the CPU will pre-fetch the instructions for the function bar()
.
It suffices to say that correctly predicting conditional branches has
great impact on performance. Each time the CPU predicts a branch correctly, it
can keep the instructions it had pre-fetched earlier. Each time the CPU fails to
predict, it must throw away the pre-fetched instructions and fetch anew another
set of instructions. If you’d like to learn more about the importance of branch
prediction, read
this excellent introductory article
from Stanford.
The prediction subroutine for a decision tree ensemble is problematic, as it is replete with conditional branches that will have to be guessed well:
/* A slice of prediction subroutine */
float predict_margin(const float* data) {
float sum = 0.0f;
if (!(data[0].missing != -1) || data[0].fvalue <= 9.5) {
if (!(data[0].missing != -1) || data[0].fvalue <= 3.5) {
if (!(data[10].missing != -1) || data[10].fvalue <= 0.74185) {
if (!(data[0].missing != -1) || data[0].fvalue <= 1.5) {
if (!(data[2].missing != -1) || data[2].fvalue <= 2.08671) {
if ( (data[4].missing != -1) && data[4].fvalue <= 2.02632) {
if (!(data[3].missing != -1) || data[3].fvalue <= 0.763339) {
sum += (float)0.00758165;
} else {
sum += (float)0.0060202;
}
} else {
if ( (data[1].missing != -1) && data[1].fvalue <= 0.0397456) {
sum += (float)0.00415399;
} else {
sum += (float)0.00821985;
}
}
/* and so forth... */
In fact, each threshold condition in the test nodes will need to be predicted. While CPUs lack adequate information to make good guesses on these conditions, we can help by providing that information.
We predict the likelihood of each condition by counting the number of data points from the training data that satisfy that condition. See the diagram below for an illustration.
If a condition is true at least 50% of the time (over the training data), the condition is labeled as “expected to be true”:
/* expected to be true */
if ( __builtin_expect( [condition], 1 ) ) {
...
} else {
...
}
On the other hand, if a condition is false at least 50% of the time, the condition is labeled as “expected to be false”:
/* expected to be false */
if ( __builtin_expect( [condition], 0 ) ) {
...
} else {
...
}
Note
On the expression __builtin_expect
The __builtin_expect
expression is a compiler intrinsic to supply the C
compiler with branch prediction information. Both
gcc
and clang
support it. Unfortunately, Microsoft Visual C++ does not. To take advantage
of branch annotation, make sure to use gcc or clang on the target machine.
This optimization replaces all thresholds in the test nodes with integers so that each threshold condition performs integer comparison instead of the usual floating-point comparison. The thresholds are said to be quantized into integer indices.
BEFORE:
if (data[3].fvalue < 1.5) { /* floating-point comparison */
...
}
AFTER:
if (data[3].qvalue < 3) { /* integer comparison */
...
}
Simply add the compiler parameter quantize=1
when exporting the model:
# Export a source directory
model.compile(dirpath='./mymodel', verbose=True,
params={'quantize': 1})
# Export a source directory, packaged in a zip archive
model.export_srcpkg(platform='unix', toolchain='gcc', pkgpath='./mymodel.zip',
libname='mymodel.so', verbose=True,
params={'quantize': 1})
# Export a shared library
model.export_lib(toolchain='gcc', libpath='./mymodel.so', verbose=True,
params={'quantize': 1})
On some platforms such as x86-64, replacing floating-point thresholds with integers helps improve performance by 1) reducing executable code size and 2) improving data locality. This is so because on these platforms, integer constants can be embedded as part of the comparison instruction, whereas floating-point constants cannot.
Let’s look at x86-64 platform. The integer comparison
a <= 4
produces one assembly instruction:
cmpl $4, 8(%rsp) ; 8(%rsp) contains the variable a
Since the integer constant 4
got embedded into the comparison instruction
cmpl, we only
had to fetch the variable a
from memory.
On the other hand, the floating-point comparison
b < 1.2f
produces two assembly instructions:
movss 250(%rip), %xmm0 ; 250(%rip) contains the constant 1.2f
ucomiss 12(%rsp), %xmm0 ; 12(%rsp) contains the variable b
Notice that the floating-point constant 1.2f
did not get embedded into
the comparison instruction
ucomiss. The
constant had to be fetched (with
movss) into the
register xmm0
before the comparsion could take place. To summarize,
a floating-point comparison takes twice as many instructions as an integer comparsion, increasing the executable code size;
a floating-point comparison involves an extra fetch instruction (movss
),
potentially causing a
cache miss.
Caveats. As we’ll see in the next section, using integer thresholds will add overhead costs at prediction time. You should ensure that the benefits of integer comparisons outweights the overhead costs.
When quantize
option is enabled, Treelite will collect all thresholds
occuring in the tree ensemble model. For each feature, one list will be
generated that lists the thresholds in ascending order:
/* example of how per-feature threshold list may look like */
Feature 0: [1.5, 6.5, 12.5]
Feature 3: [0.15, 0.35, 1.5]
Feature 6: [7, 9, 10, 135]
Using these lists, we may convert any data point into integer indices via simple look-ups. For feature 0 in the example above, values will be mapped to integer indices as follows:
Let x be the value of feature 0.
Assign -1 if x < 1.5
Assign 0 if x == 1.5
Assign 1 if 1.5 < x < 6.5
Assign 2 if x == 6.5
Assign 3 if 6.5 < x < 12.5
Assign 4 if x == 12.5
Assign 5 if x > 12.5
Let’s look at a specific example of how a floating-point vector gets translated into a vector of integer indices:
feature id 0 1 2 3 4 5 6
[7, missing, missing, 0.2, missing, missing, 20 ]
=> [3, missing, missing, 1, missing, missing, 5 ]
Since the prediction subroutine still needs to accept floating-point features,
the features will be internally converted before actual prediction. If the
prediction subroutine looked like below without quantize
option,
float predict_margin(const Entry* data) {
/* ... Run through the trees to compute the leaf output score ... */
return score;
}
it will now have an extra step of mapping the incoming data vector into integers:
float predict_margin(const Entry* data) {
/* ... Quantize feature values in data into integer indices ... */
/* ... Run through the trees to compute the leaf output score ... */
return score;
}