
Taichi AOT, the solution for deploying kernels on mobile devices

Ye Kuang

Physical simulation, the field where Taichi Lang excels, has wide applications on mobile devices, such as real-time physically-based interactions in mobile games or cool visual effects in short videos. Taichi's fast prototyping and cross-platform GPU acceleration make it a natural fit for these scenarios.

However, Taichi is currently a language embedded in a Python frontend. Python is far from ideal for deployment: its heavyweight virtual machine makes it hard to embed in other host languages. We have therefore been thinking constantly about how to let Taichi users enjoy both the rapid iteration of Python and seamless deployment in real industrial scenarios.

Today, we are delighted to introduce Taichi AOT, a deployment solution jointly developed by Taichi and OPPO US Research Center. AOT is short for ahead-of-time, a pre-compilation mechanism. Unlike the regular Taichi workflow, which is embedded in Python and compiles kernels just in time (JIT) as the program runs, Taichi AOT compiles the required Taichi kernels in advance into instructions for a specified backend, in this case SPIR-V shaders. These shaders can then be loaded and run by the C++ AOT runtime library provided by Taichi.

We chose Vulkan as the first GPU backend supported by Taichi AOT. Its advantages are obvious: cross-platform portability, rigorous API specifications, and a mature software and hardware ecosystem. Of course, Taichi aims at universal deployment, and Vulkan support is just the beginning: we are working hard to support other backends such as Apple Metal, OpenGL, and CUDA, and we welcome contributors from industry and academia to join this effort.


To give you a sense of how the Taichi AOT solution works, we will deploy and run a Taichi program on an Android phone. :-)

A demo Taichi program

The Taichi program we chose is a reworking of the implicit FEM demo released in v0.9.0. To make the effect a bit more interesting, we replaced the small cube with an Armadillo 3D mesh.

# Data containers
ox = ti.Vector.ndarray(args.dim, dtype=ti.f32, shape=n_verts)  # rest positions of the mesh vertices
vertices = ti.Vector.ndarray(4, dtype=ti.i32, shape=n_cells)   # the four vertex indices of each tetrahedron
indices = ti.field(ti.i32, shape=n_faces * 3)                  # surface triangle indices, used for rendering
edges = ti.Vector.ndarray(2, dtype=ti.i32, shape=n_edges)      # vertex index pairs of the mesh edges
c2e = ti.Vector.ndarray(6, dtype=ti.i32, shape=n_cells)        # the six edge indices of each tetrahedron (cell-to-edge map)

# Load 3D Armadillo info into containers
ox.from_numpy(ox_np)
vertices.from_numpy(vertices_np)
indices.from_numpy(indices_np.reshape(-1))
edges.from_numpy(np.array(list(edges_np)))
c2e.from_numpy(c2e_np)

Perhaps the first thing you have noticed here is that containers like ox are defined as ti.Vector.ndarray rather than as Taichi fields.

Ndarray

Ndarray is a data container designed specifically for the Taichi AOT solution. Those familiar with NumPy or PyTorch will recognize the concept. In Taichi AOT scenarios, Ndarray is usually more flexible than Taichi fields:

  • Generic data interface to external GPU pipelines: All fields in a Taichi program reside in one contiguous piece of physical memory, and each field corresponds to a different offset within it. In an AOT scenario, having to query the offset of a specific field every time before reading it quickly becomes a drain on development effort. Each Ndarray, by contrast, corresponds to an independent piece of memory (a VkBuffer). Ndarrays can easily be bound, in a zero-copy fashion, to the shaders generated by Taichi AOT or to an existing Vulkan pipeline in your app.

  • Dynamic size: The shape of a field is a compile-time constant, which is advantageous in JIT mode. In an AOT scenario, however, we often need to allocate memory of different sizes for different situations. Suppose your Taichi program uses an Ndarray to store particle positions: when deploying it on different devices, you can allocate Ndarrays of different sizes without regenerating shaders with Taichi AOT.

  • Dynamic binding: To have a Taichi kernel accept different fields in Python/JIT mode, you need to annotate the parameter as ti.template(). Whenever you pass in a new field, the Taichi runtime actually generates a new kernel and stores it in the kernel cache. In the Python environment, these complexities are well hidden from users because the Taichi runtime manages the kernel cache by itself. In an AOT scenario, you would have to locate each corresponding shader manually. Taichi AOT does support kernel templates, but they lack flexibility. An easier way is to have your Taichi kernel accept an Ndarray argument: then you only need to bind different Ndarrays to the same kernel at runtime, as sketched below.
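To make the last two points concrete, here is a minimal sketch (not part of the demo; the kernel name fill and the shapes are made up for illustration) of a kernel that takes an Ndarray argument and is bound to Ndarrays of different sizes at runtime:

import taichi as ti

ti.init(arch=ti.vulkan)  # any Ndarray-capable backend works for this sketch

# The kernel is compiled once; the Ndarray it operates on is bound at call time.
@ti.kernel
def fill(x: ti.types.ndarray()):
    for i in x:
        x[i] = i * 0.5

# Ndarrays of different sizes, e.g. for devices with different memory budgets
small = ti.ndarray(ti.f32, shape=1024)
large = ti.ndarray(ti.f32, shape=1024 * 1024)

fill(small)  # one compiled kernel ...
fill(large)  # ... bound to Ndarrays of different sizes at runtime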

AOT module

We need to create a ti.aot.Module and call add_kernel() to save all the Taichi kernels used in the demo. For more details, see the corresponding API reference and this tutorial.

mod = ti.aot.Module(ti.vulkan)
mod.add_kernel(get_force,
               template_args={
                   'x': x,
                   'f': f,
                   'vertices': vertices
               })
...
mod.save(dir_name, '')

Import AOT runtime

The Taichi AOT runtime library allows you to import a Taichi AOT module, search for Taichi kernels in it, and execute them. You can use Taichi's pre-compiled dynamic link library directly:

https://github.com/taichi-dev/taichi/releases/download/v1.0.0/libtaichi_export_core.so

You can also build the library for the Android platform yourself by passing the CMake toolchain arguments to Taichi's build:

export ANDROID_NDK_ROOT="<path_to_android_ndk>"  # e.g. ~/Android/Sdk/ndk/22.1.7171670/
python setup.py clean
TAICHI_CMAKE_ARGS="-DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake -DANDROID_NATIVE_API_LEVEL=29 -DANDROID_ABI=arm64-v8a" python3 setup.py build_ext

Reproduce call sequence

The AOT module stores all the kernels that need to run on the mobile side. All we need to do next is load them from the AOT module and call them from the Android app in the same order in which they were called in Python. If you are interested in the details, refer to the run_render_loop() function in the implicit_fem demo.
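As a rough illustration of what "the same order" means, here is a minimal, self-contained sketch; the kernel names init and substep and the loop counts are hypothetical and much simpler than the demo's real kernels:

import taichi as ti

ti.init(arch=ti.vulkan)

@ti.kernel
def init(x: ti.types.ndarray()):
    for i in x:
        x[i] = 0.0

@ti.kernel
def substep(x: ti.types.ndarray(), dt: ti.f32):
    for i in x:
        x[i] += dt  # stand-in for the real force and position updates

x = ti.ndarray(ti.f32, shape=16)

# This call sequence is exactly what the Android app replays after loading
# `init` and `substep` from the AOT module.
init(x)
for frame in range(3):
    for _ in range(8):  # several substeps per frame
        substep(x, 1.0 / 480.0)
    # in the demo, the updated positions are handed to the renderer here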

The principle is easy to understand, but reproducing the call sequence by hand is tedious to debug. Therefore, Taichi also plans to record these call sequences automatically and build a computation graph, automating the entire Taichi AOT solution in a future release. You are welcome to follow the Taichi AOT Roadmap, join our discussions, and take an active part in its development. :-D

Execution

We are now one step away from completion. The next thing to do is to render the simulation results on the screen. For convenience, the rendering here uses a pipeline built with the Taichi Unified Device APIs. In this pipeline, the Ndarray that stores the particle positions can be bound directly to the vertex shader as a VBO (assuming you don't need to store additional per-vertex attributes). This also demonstrates the strength of Ndarray as a generic data interface to external GPU pipelines.

// Draw mesh
{
  auto resource_binder = render_mesh_pipeline_->resource_binder();
  resource_binder->buffer(0, 0, render_constants_.get_ptr(0));
  // Bind the Ndarray holding the vertex positions directly as the VBO,
  // and the surface triangle indices as a 32-bit index buffer.
  resource_binder->vertex_buffer(devalloc_x_.get_ptr(0));
  resource_binder->index_buffer(devalloc_indices_.get_ptr(0), 32);

  cmd_list->bind_pipeline(render_mesh_pipeline_.get());
  cmd_list->bind_resources(resource_binder);
  cmd_list->draw_indexed(N_FACES * 3);
}
cmd_list->end_renderpass();
stream->submit_synced(cmd_list.get());

surface_->present_image();

Done!

Use the following command to build and install the Android app, and enjoy the cool visual effects enabled by the Taichi AOT solution on your phone!

./gradlew assembleDebug && adb install ./app/build/outputs/apk/debug/app-debug.apk