Version: v1.4.0

# Accelerate Python with Taichi

Taichi is a domain-specific language embedded in Python. One of its key features is that it can accelerate computation-intensive Python programs to performance comparable to C/C++ or even CUDA. This makes Taichi well suited to scientific computation.

In the following sections, we provide two examples to give you a sense as to how much acceleration Taichi can bring to your Python programs.

## Count the primes

Large-scale or nested for loops in Python often lead to poor runtime performance. The following demo counts the primes within a specified range and involves nested for loops (see here for the complete version). Simply by importing Taichi or switching to Taichi's GPU backends, you will see a significant boost to the overall performance:

```python
"""Count the prime numbers in the range [1, n]"""

# Checks whether a positive integer is a prime number
def is_prime(n: int):
    result = True
    # Traverses the range between 2 and sqrt(n)
    # - Returns False if n can be divided by one of them;
    # - otherwise, returns True
    for k in range(2, int(n ** 0.5) + 1):
        if n % k == 0:
            result = False
            break
    return result

# Traverses the range between 2 and n
# Counts the primes according to the return of is_prime()
def count_primes(n: int) -> int:
    count = 0
    for k in range(2, n):
        if is_prime(k):
            count += 1
    return count

print(count_primes(1000000))
```
1. Save the code as count_primes.py and run the following command in your terminal:
```shell
time python count_primes.py
```

The count of prime numbers along with the execution time appears on the screen. It takes 2.235s to run this program.

```
78498

real	0m2.235s
user	0m2.235s
sys	0m0.000s
```
2. Now, let's change the code a bit: import Taichi into your Python code and initialize it with the CPU backend:
```python
import taichi as ti
ti.init(arch=ti.cpu)
```
3. Decorate `is_prime()` with `@ti.func` and `count_primes()` with `@ti.kernel`:
• Taichi's compiler compiles the Python code decorated with `@ti.kernel` onto different devices, such as CPU and GPU, for high-performance computation.
• See Kernels & Functions for a detailed explanation of Taichi's core concepts: kernels and functions.
```python
@ti.func
def is_prime(n: int):
    result = True
    for k in range(2, int(n ** 0.5) + 1):
        if n % k == 0:
            result = False
            break
    return result

@ti.kernel
def count_primes(n: int) -> int:
    count = 0
    for k in range(2, n):
        if is_prime(k):
            count += 1
    return count
```
4. Rerun count_primes.py:
```shell
time python count_primes.py
```

The calculation is about six times faster (2.235s / 0.363s).

```
78498

real	0m0.363s
user	0m0.546s
sys	0m0.179s
```
5. Increase the upper bound tenfold to `10,000,000` and rerun count_primes.py:

With Taichi, the calculation takes 0.8s vs. 55s with Python alone, a speedup of about 70x.

6. Change Taichi's backend from CPU to GPU and rerun:

```python
ti.init(arch=ti.gpu)
```

With the GPU backend, the calculation takes 0.45s vs. 55s with Python alone, pushing the speedup further to about 120x.
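As a sanity check on the count itself, the result 78498 can also be reproduced with a classic sieve of Eratosthenes in plain Python. This is a minimal, Taichi-independent sketch, not part of the benchmark above:

```python
def sieve_prime_count(n: int) -> int:
    """Count the primes below n using the sieve of Eratosthenes."""
    if n < 3:
        return 0
    is_prime = [True] * n
    is_prime[0] = is_prime[1] = False
    for k in range(2, int(n ** 0.5) + 1):
        if is_prime[k]:
            # Mark every multiple of k starting from k*k as composite
            for m in range(k * k, n, k):
                is_prime[m] = False
    return sum(is_prime)

print(sieve_prime_count(1000000))  # 78498, matching count_primes(1000000)
```

Because `count_primes(n)` iterates over `range(2, n)`, both functions count primes strictly below `n`, so the two results should agree.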

## Dynamic programming: longest common subsequence

The core idea behind dynamic programming is to trade storage space for execution time: it stores intermediate results to avoid repetitive computation. In the following section, we walk you through a complete DP implementation and demonstrate another area where Taichi delivers real acceleration.
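To make this space-for-time trade-off concrete, here is a tiny illustration (separate from the LCS example below): computing Fibonacci numbers naively re-solves the same subproblems exponentially many times, while caching each intermediate result makes the computation linear:

```python
from functools import lru_cache

def fib_naive(n: int) -> int:
    # Recomputes fib(k) for the same k over and over: exponential time
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_dp(n: int) -> int:
    # Each fib(k) is computed once and stored: space traded for time
    return n if n < 2 else fib_dp(n - 1) + fib_dp(n - 2)

print(fib_dp(30))  # 832040, returned instantly
```

The LCS table `f` used below plays the same role as the cache here: every subproblem is solved exactly once.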

The example below applies DP to work out the length of the longest common subsequence (LCS) of two given sequences. For instance, the LCS of sequences a = [0, 1, 0, 2, 4, 3, 1, 2, 1] and b = [4, 0, 1, 4, 5, 3, 1, 2] is [0, 1, 4, 3, 1, 2], whose length is six. Let's get started:
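Before moving to Taichi, the recurrence can be checked in plain Python on the small sequences above. This reference sketch uses the same table layout as the field-based version developed in the steps below:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (classic DP)."""
    # f[i][j] = LCS length of a's first i elements and b's first j elements
    f = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Extend the diagonal on a match, else carry over the best so far
            f[i][j] = max(f[i - 1][j - 1] + (a[i - 1] == b[j - 1]),
                          f[i - 1][j], f[i][j - 1])
    return f[len(a)][len(b)]

a = [0, 1, 0, 2, 4, 3, 1, 2, 1]
b = [4, 0, 1, 4, 5, 3, 1, 2]
print(lcs_length(a, b))  # 6
```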

1. Import Taichi and NumPy into your Python program:
```python
import taichi as ti
import numpy as np
```
2. Initialize Taichi:
```python
ti.init(arch=ti.cpu)
```
3. Create two 15,000-element NumPy arrays of random integers in the range [0, 100) to compare:
```python
N = 15000
a_numpy = np.random.randint(0, 100, N, dtype=np.int32)
b_numpy = np.random.randint(0, 100, N, dtype=np.int32)
```
4. Here we define an `(N + 1)`×`(N + 1)` Taichi field `f`, using its `[i, j]`-th element to represent the length of the LCS of sequence `a`'s first `i` elements and sequence `b`'s first `j` elements:
```python
f = ti.field(dtype=ti.i32, shape=(N + 1, N + 1))
```
5. This turns the dynamic programming problem into a traversal of the field `f`, where `a` and `b` are the two sequences to compare:
```python
f[i, j] = ti.max(f[i - 1, j - 1] + (a[i - 1] == b[j - 1]),
                 ti.max(f[i - 1, j], f[i, j - 1]))
```
6. Define a kernel function `compute_lcs()`, which takes in two sequences and works out the length of their LCS.
```python
@ti.kernel
def compute_lcs(a: ti.types.ndarray(), b: ti.types.ndarray()) -> ti.i32:
    len_a, len_b = a.shape[0], b.shape[0]
    ti.loop_config(serialize=True)  # Disable auto-parallelism in Taichi
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            f[i, j] = ti.max(f[i - 1, j - 1] + (a[i - 1] == b[j - 1]),
                             ti.max(f[i - 1, j], f[i, j - 1]))
    return f[len_a, len_b]
```
• NumPy arrays are stored as ndarray in Taichi.
• Ensure that you set `ti.loop_config(serialize=True)` to disable auto-parallelism in Taichi. The iterations here must not run in parallel because each loop iteration depends on the results of previous iterations.
7. Print the result of `compute_lcs(a_numpy, b_numpy)`. Now you get the following program:
```python
import taichi as ti
import numpy as np

ti.init(arch=ti.cpu)

benchmark = True

N = 15000

f = ti.field(dtype=ti.i32, shape=(N + 1, N + 1))

if benchmark:
    a_numpy = np.random.randint(0, 100, N, dtype=np.int32)
    b_numpy = np.random.randint(0, 100, N, dtype=np.int32)
else:
    a_numpy = np.array([0, 1, 0, 2, 4, 3, 1, 2, 1], dtype=np.int32)
    b_numpy = np.array([4, 0, 1, 4, 5, 3, 1, 2], dtype=np.int32)

@ti.kernel
def compute_lcs(a: ti.types.ndarray(), b: ti.types.ndarray()) -> ti.i32:
    len_a, len_b = a.shape[0], b.shape[0]
    ti.loop_config(serialize=True)  # Disable auto-parallelism in Taichi
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            f[i, j] = ti.max(f[i - 1, j - 1] + (a[i - 1] == b[j - 1]),
                             ti.max(f[i - 1, j], f[i, j - 1]))
    return f[len_a, len_b]

print(compute_lcs(a_numpy, b_numpy))
```
8. Save the above code as lcs.py and run:
```shell
time python lcs.py
```

The system prints the length of the LCS, along with the execution time.

```
2721

real	0m1.409s
user	0m1.112s
sys	0m0.549s
```

In this repo, we provide our implementations of this dynamic programming algorithm in Taichi and NumPy:

• With Python only, it takes 476s to work out the length of the LCS of two 15,000-long random sequences.
• With Taichi, it takes only about 0.9s, a speedup of over 500x!
##### note

The actual execution time may vary depending on your machine, but the performance improvements you see should be comparable to ours.