跳转至主要内容
Version: v1.1.3

使用 Taichi 加速 Python

Taichi 是一个嵌入在 Python 中的领域特定语言(DSL)。 Taichi 的主要功能之一是加速计算密集的 Python 程序,帮助这些程序 实现可以媲美 C/C++ 甚至 CUDA 的性能。 这使得 Taichi 在科学计算领域处于更有利的地位。

在下面的章节中,我们提供两个示例让你感觉一下 Taichi 能给你的 Python 程序带来多少加速。

统计质数个数

Python 的大型 for 循环或嵌套 for 循环总是导致运行时性能不佳。 下面的演示计数在指定范围内的质数,并涉及循环嵌套(完整版本见 这里)。 只需导入 Taichi 或切换到 Taichi 的 GPU 后端,您将看到整体性能的极大提升:

"""Count the prime numbers in the range [1, n]
"""

# Checks if a positive integer is a prime number
def is_prime(n: int):
result = True
# Traverses the range between 2 and sqrt(n)
# - Returns False if n can be divided by one of them;
# - otherwise, returns True
for k in range(2, int(n ** 0.5) + 1):
if n % k == 0:
result = False
break
return result

# Traverses the range between 2 and n
# Counts the primes according to the return of is_prime()
def count_primes(n: int) -> int:
count = 0
for k in range(2, n):
if is_prime(k):
count += 1

return count

print(count_primes(1000000))
  1. 将代码保存为 count_prime.py 并在您的终端中运行以下命令:
time python count_primes.py

质数个数和执行时间会出现在屏幕上。 运行该程序需要 2.235 秒。

78498

real 0m2.235s
user 0m2.235s
sys 0m0.000s
  1. 现在,让我们对代码稍作修改:导入 Taichi 到您的 Python 代码并使用 CPU 后端初始化它:
import taichi as ti
ti.init(arch=ti.cpu)
  1. @ti.func 装饰 is_prime(),并用 @ti.kernel 装饰 count_primes()
  • Taichi's compiler compiles the Python code decorated with @ti.kernel onto different devices, such as CPU and GPU, for high-performance computation.
  • 请参阅 kernel 与函数 以详细了解 Taichi 的核心概念:内核与函数。
@ti.func
def is_prime(n: int):
result = True
for k in range(2, int(n ** 0.5) + 1):
if n % k == 0:
result = False
break
return result

@ti.kernel
def count_primes(n: int) -> int:
count = 0
for k in range(2, n):
if is_prime(k):
count += 1

return count
  1. 再次运行 count_primes.py
time python count_primes.py

计算速度提高了6倍(2.235/0.363)。

78498

real 0m0.363s
user 0m0.546s
sys 0m0.179s
  1. N 增大十倍到 10,000,000 并再次运行 count_primes.py:

    The calculation time with Taichi is 0.8s vs. 55s with Python only. Taichi 实现了 70x 的加速。

  2. 将 Taichi 的后端从 CPU 更改为 GPU 并重新运行:

ti.init(arch=ti.gpu)

Taichi 的计算时间是 0.45 秒,作为对比,单独使用 Python 的计算时间是 55 秒。 Taichi 的计算速度进一步提高到 120x。

动态规划:最长公共子序列

The core philosophy behind dynamic programming is that it sacrifices some storage space for less execution time and stores intermediate results to avoid repetitive computation. In the following section, we will walk you through a complete implementation of DP, and demonstrate another area where Taichi can make a real 'acceleration'.

The example below follows the philosophy of DP to work out the length of the longest common subsequence (LCS) of two given sequences. For instance, the LCS of sequences a = [0, 1, 0, 2, 4, 3, 1, 2, 1] and b = [4, 0, 1, 4, 5, 3, 1, 2], is [0, 1, 4, 3, 1, 2], and the LCS' length is six. Let's get started:

  1. Import NumPy and Taichi to your Python program:
import taichi as ti
import numpy as np
  1. Initialize Taichi:
ti.init(arch=ti.cpu)
  1. Create two 15,000-long NumPy arrays of random integers in the range of [0, 100] to compare:
N = 15000
a_numpy = np.random.randint(0, 100, N, dtype=np.int32)
b_numpy = np.random.randint(0, 100, N, dtype=np.int32)
  1. Here we define an N×N Taichi field f, using its [i, j]-th element to represent the length of the LCS of sequence a's first i elements and sequence b's first j elements:
f = ti.field(dtype=ti.i32, shape=(N + 1, N + 1))
  1. Now we turn the dynamic programming issue to the traversal of a field f, where a and b are the two sequences to compare:
f[i, j] = max(f[i - 1, j - 1] + (a[i - 1] == b[j - 1]),
max(f[i - 1, j], f[i, j - 1]))
  1. Define a kernel function compute_lcs(), which takes in two sequences and works out the length of their LCS.
@ti.kernel
def compute_lcs(a: ti.types.ndarray(), b: ti.types.ndarray()) -> ti.i32:
len_a, len_b = a.shape[0], b.shape[0]

ti.loop_config(serialize=True) # Disable auto-parallelism in Taichi
for i in range(1, len_a + 1):
for j in range(1, len_b + 1):
f[i, j] = max(f[i - 1, j - 1] + (a[i - 1] == b[j - 1]),
max(f[i - 1, j], f[i, j - 1]))

return f[len_a, len_b]
  • NumPy arrays are stored as ndarray in Taichi.
  • Ensure that you set ti.loop_config(serialize=True) to disable auto-parallelism in Taichi. The iterations here should not happen in parallelism because the computation of a loop iteration is dependent on its previous iterations.
  1. Print the result of compute_lcs(a_numpy, b_numpy). Now you get the following program:
import taichi as ti
import numpy as np

ti.init(arch=ti.cpu)

benchmark = True

N = 15000

f = ti.field(dtype=ti.i32, shape=(N + 1, N + 1))

if benchmark:
a_numpy = np.random.randint(0, 100, N, dtype=np.int32)
b_numpy = np.random.randint(0, 100, N, dtype=np.int32)
else:
a_numpy = np.array([0, 1, 0, 2, 4, 3, 1, 2, 1], dtype=np.int32)
b_numpy = np.array([4, 0, 1, 4, 5, 3, 1, 2], dtype=np.int32)

@ti.kernel
def compute_lcs(a: ti.types.ndarray(), b: ti.types.ndarray()) -> ti.i32:
len_a, len_b = a.shape[0], b.shape[0]

ti.loop_config(serialize=True) # Disable auto-parallelism in Taichi
for i in range(1, len_a + 1):
for j in range(1, len_b + 1):
f[i, j] = max(f[i - 1, j - 1] + (a[i - 1] == b[j - 1]),
max(f[i - 1, j], f[i, j - 1]))

return f[len_a, len_b]


print(compute_lcs(a_numpy, b_numpy))
  1. Save the above code as lcs.py and run:
time python lcs.py

The system prints the length of the LCS, along with the execution time.

2721

real 0m1.409s
user 0m1.112s
sys 0m0.549s

In this repo, we provide our implementation of this dynamic planning algorithm in Taichi and NumPy:

  • With Python only, it takes 476s to work out the length of the LCS of two 15,000-long random sequences.
  • With Taichi, it only takes about 0.9s and sees an up to 500x speed up!
note

The actual execution time may vary depending on your machine, but we believe that the performance improvements you will see is comparable to ours.