Version: v1.6.0

Profiler

Overview

Taichi includes a collection of profiling tools to help with code debugging and optimization. These tools collect hardware and Taichi-related information to measure program performance and identify bottlenecks.

Currently, Taichi provides two profiling tools:

ScopedProfiler, which is responsible for analyzing the performance of the Taichi JIT compiler (host).
KernelProfiler, which is responsible for analyzing the performance of Taichi kernels (device). Its advanced mode, which works with the CUDA backend only, provides detailed low-level performance metrics, such as memory bandwidth consumption.

ScopedProfiler

ScopedProfiler is a profiler in Taichi that tracks the time spent on host tasks such as JIT compilation. It is enabled by default. To display results in a hierarchical format, you can call ti.profiler.print_scoped_profiler_info().

For example:

import taichi as ti

ti.init(arch=ti.cpu)
var = ti.field(ti.f32, shape=1)

@ti.kernel
def compute():
    var[0] = 1.0
    print("Setting var[0] =", var[0])

compute()
ti.profiler.print_scoped_profiler_info()

note

ScopedProfiler is a C++ class in Taichi.

KernelProfiler

KernelProfiler retrieves kernel profiling records from the backend, aggregates them in the Python scope, and prints the results to the console. Note that KernelProfiler supports the CPU and CUDA backends only. If your program runs on a GPU, call ti.sync() before performance profiling.

To enable the profiler, set kernel_profiler=True when calling ti.init().
Use ti.profiler.print_kernel_profiler_info() to display profiling results. There are two printing modes:
1. In "count" mode (the default), profiling records with the same kernel name are aggregated into a single profiling result.
2. In "trace" mode, the profiler prints a list of kernels launched on hardware during the profiling period, including comprehensive performance and hardware characteristics for each kernel.
Use ti.profiler.clear_kernel_profiler_info() to clear the entries in this profiler.

For example:

import taichi as ti

ti.init(arch=ti.cpu, kernel_profiler=True)
x = ti.field(ti.f32, shape=1024*1024)

@ti.kernel
def fill():
    for i in x:
        x[i] = i

for i in range(8):
    fill()
ti.profiler.print_kernel_profiler_info('trace')
ti.profiler.clear_kernel_profiler_info()  # Clears all records

for i in range(100):
    fill()
ti.profiler.print_kernel_profiler_info()  # The default mode: 'count'

The profiler outputs the following:

=========================================================================
X64 Profiler(trace)
=========================================================================
[      % |     time    ] Kernel name
[  0.00% |    0.000  ms] jit_evaluator_0_kernel_0_serial
[ 60.11% |    2.668  ms] fill_c4_0_kernel_1_range_for
[  6.06% |    0.269  ms] fill_c4_0_kernel_1_range_for
[  5.73% |    0.254  ms] fill_c4_0_kernel_1_range_for
[  5.68% |    0.252  ms] fill_c4_0_kernel_1_range_for
[  5.61% |    0.249  ms] fill_c4_0_kernel_1_range_for
[  5.63% |    0.250  ms] fill_c4_0_kernel_1_range_for
[  5.61% |    0.249  ms] fill_c4_0_kernel_1_range_for
[  5.59% |    0.248  ms] fill_c4_0_kernel_1_range_for
-------------------------------------------------------------------------
[100.00%] Total kernel execution time:   0.004 s   number of records:  9
=========================================================================
=========================================================================
X64 Profiler(count)
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
[100.00%   0.033 s    100x |    0.244     0.329     2.970 ms] fill_c4_0_kernel_1_range_for
-------------------------------------------------------------------------
[100.00%] Total kernel execution time:   0.033 s   number of records:  1
=========================================================================

note

jit_evaluator_xxx can be ignored because it is automatically generated by the system.
Taichi recommends running performance profiling multiple times to observe the minimum or average execution time.
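
To make the relationship between the two printing modes concrete, the following minimal sketch aggregates the eight fill records from the 'trace' output above into a 'count'-style summary. It is plain Python and does not call Taichi. The summarize helper and the hard-coded record list are illustrative assumptions, not Taichi's actual implementation.

```python
from collections import defaultdict

# (kernel name, time in ms) pairs copied from the 'trace' output above.
trace_records = [
    ("fill_c4_0_kernel_1_range_for", 2.668),
    ("fill_c4_0_kernel_1_range_for", 0.269),
    ("fill_c4_0_kernel_1_range_for", 0.254),
    ("fill_c4_0_kernel_1_range_for", 0.252),
    ("fill_c4_0_kernel_1_range_for", 0.249),
    ("fill_c4_0_kernel_1_range_for", 0.250),
    ("fill_c4_0_kernel_1_range_for", 0.249),
    ("fill_c4_0_kernel_1_range_for", 0.248),
]


def summarize(records):
    """Group (name, ms) records by kernel name (hypothetical helper)."""
    grouped = defaultdict(list)
    for name, ms in records:
        grouped[name].append(ms)
    # One summary entry per kernel name, similar in spirit to 'count' mode.
    return {
        name: {
            "count": len(times),
            "total": sum(times),
            "min": min(times),
            "avg": sum(times) / len(times),
            "max": max(times),
        }
        for name, times in grouped.items()
    }


summary = summarize(trace_records)
for name, s in summary.items():
    print(f"{s['count']}x  min={s['min']:.3f}  avg={s['avg']:.3f}  max={s['max']:.3f} ms  {name}")
```

In this data the first launch takes roughly ten times longer than the others, so the average sits well above the minimum. This is why the minimum, or an average taken over many runs, gives a steadier picture than any single record.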

Advanced mode

KernelProfiler provides an experimental GPU profiling toolkit for the CUDA backend, based on NVIDIA CUPTI. It offers minimal and predictable profiling overhead and can record over 6,000 hardware metrics.

To use the CUPTI-based GPU profiler, you must first satisfy the following prerequisites:

Install the CUDA Toolkit.
Build Taichi from source with the CUDA Toolkit using the command: TAICHI_CMAKE_ARGS="-DTI_WITH_CUDA_TOOLKIT:BOOL=ON" python3 setup.py develop --user.
Resolve any permission issues related to the NVIDIA profiling module:
- Add the line "options nvidia NVreg_RestrictProfilingToAdminUsers=0" to the /etc/modprobe.d/nvidia-kernel-common.conf file.
- Reboot the system after modifying the configuration file, which should resolve the permission issue. You may need to run update-initramfs -u before rebooting.
- Refer to the ERR_NVGPUCTRPERM documentation for more information.
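
As a quick sanity check for the permission step, the sketch below scans the text of a modprobe configuration file for the option line described above. It uses only the Python standard library. The profiling_unrestricted helper is a hypothetical name introduced for illustration, and the parsing assumes the usual "options <module> <key>=<value>" layout with "#" comments.

```python
def profiling_unrestricted(conf_text):
    """Return True if an 'options nvidia' line sets NVreg_RestrictProfilingToAdminUsers=0.

    Hypothetical helper: it inspects configuration text only and says nothing
    about what the currently loaded driver is using.
    """
    for raw in conf_text.splitlines():
        # Drop anything after '#', then split the rest on whitespace.
        tokens = raw.split("#", 1)[0].split()
        if len(tokens) >= 3 and tokens[0] == "options" and tokens[1] == "nvidia":
            if "NVreg_RestrictProfilingToAdminUsers=0" in tokens[2:]:
                return True
    return False


# Sample text standing in for the contents of the configuration file.
sample = "# sample file\noptions nvidia NVreg_RestrictProfilingToAdminUsers=0\n"
print(profiling_unrestricted(sample))
```

To check a real system, pass in the contents of /etc/modprobe.d/nvidia-kernel-common.conf instead of the sample string. A positive result only confirms that the line is present in the file; the setting takes effect after the reboot described above.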