# Accelerate Python with Taichi

Taichi is a domain-specific language *embedded* in Python. One of its key features is that Taichi can accelerate computation-intensive Python programs and help these programs achieve comparable performance to C/C++ or even CUDA. This makes Taichi much better positioned in the area of scientific computation.

In the following sections, we provide two examples to give you a sense as to how much acceleration Taichi can bring to your Python programs.

## Count the primes

Large-scale or nested for loops in Python always leads to poor runtime performance. The following demo counts the primes within a specified range and involves nested for loops (see here for the complete version). Simply by importing Taichi or switching to Taichi's GPU backends, you will see a significant boost to the overall performance:

`"""Count the prime numbers in the range [1, n]`

"""

# Checks if a positive integer is a prime number

def is_prime(n: int):

result = True

# Traverses the range between 2 and sqrt(n)

# - Returns False if n can be divided by one of them;

# - otherwise, returns True

for k in range(2, int(n ** 0.5) + 1):

if n % k == 0:

result = False

break

return result

# Traverses the range between 2 and n

# Counts the primes according to the return of is_prime()

def count_primes(n: int) -> int:

count = 0

for k in range(2, n):

if is_prime(k):

count += 1

return count

print(count_primes(1000000))

- Save the code as
**count_prime.py**and run the following command in your terminal:

`time python count_primes.py`

*The count of prime numbers along with the execution time appears on the screen. It takes 2.235s to run this program.*

`78498`

real 0m2.235s

user 0m2.235s

sys 0m0.000s

- Now, let's change the code a bit: import Taichi to your Python code and initialize it using the CPU backend:

`import taichi as ti`

ti.init(arch=ti.cpu)

- Decorate
`is_prime()`

with`@ti.func`

and`count_primes()`

with`@ti.kernel`

:

- Taichi's compiler compiles the Python code decorated with
`@ti.kernel`

onto different devices, such as CPU and GPU, for high-performance computation.- See Kernels & Functions for a detailed explanation of Taichi's core concepts: kernels and functions.

`@ti.func`

def is_prime(n: int):

result = True

for k in range(2, int(n ** 0.5) + 1):

if n % k == 0:

result = False

break

return result

@ti.kernel

def count_primes(n: int) -> int:

count = 0

for k in range(2, n):

if is_prime(k):

count += 1

return count

- Rerun
**count_primes.py**：

`time python count_primes.py`

*The calculation speed is six times up (2.235/0.363).*

`78498`

real 0m0.363s

user 0m0.546s

sys 0m0.179s

Increase

`N`

tenfold to`10,000,000`

and rerun**count_primes.py**:*The calculation time with Taichi is 0.8s vs. 55s with Python only. The calculation speed with Taichi is 70x up.*Change Taichi's backend from CPU to GPU and give it a rerun:

`ti.init(arch=ti.gpu)`

*The calculation time with Taichi is 0.45s vs. 55s with Python only. The calculation speed with Taichi is taken further to 120x up.*

## Dynamic programming: longest common subsequence

The core philosophy behind dynamic programming is that it sacrifices some storage space for less execution time and stores intermediate results to avoid repetitive computation. In the following section, we will walk you through a complete implementation of DP, and demonstrate another area where Taichi can make a real 'acceleration'.

The example below follows the philosophy of DP to work out the length of the longest common subsequence (LCS) of two given sequences. For instance, the LCS of sequences a = [**0**, **1**, 0, 2, **4**, **3**, 1, **2**, 1] and b = [4, **0**, **1**, **4**, 5, **3**, 1, **2**], is [0, 1, 4, 3, 1, 2], and the LCS' length is six. Let's get started:

- Import NumPy and Taichi to your Python program:

`import taichi as ti`

import numpy as np

- Initialize Taichi:

`ti.init(arch=ti.cpu)`

- Create two 15,000-long NumPy arrays of random integers in the range of [0, 100] to compare:

`N = 15000`

a_numpy = np.random.randint(0, 100, N, dtype=np.int32)

b_numpy = np.random.randint(0, 100, N, dtype=np.int32)

- Here we define an
`N`

×`N`

Taichi field`f`

, using its`[i, j]`

-th element to represent the length of the LCS of sequence`a`

's first`i`

elements and sequence`b`

's first`j`

elements:

`f = ti.field(dtype=ti.i32, shape=(N + 1, N + 1))`

- Now we turn the dynamic programming issue to the traversal of a field
`f`

, where`a`

and`b`

are the two sequences to compare:

`f[i, j] = ti.max(f[i - 1, j - 1] + (a[i - 1] == b[j - 1]),`

ti.max(f[i - 1, j], f[i, j - 1]))

- Define a kernel function
`compute_lcs()`

, which takes in two sequences and works out the length of their LCS.

`@ti.kernel`

def compute_lcs(a: ti.types.ndarray(), b: ti.types.ndarray()) -> ti.i32:

len_a, len_b = a.shape[0], b.shape[0]

ti.loop_config(serialize=True) # Disable auto-parallelism in Taichi

for i in range(1, len_a + 1):

for j in range(1, len_b + 1):

f[i, j] = ti.max(f[i - 1, j - 1] + (a[i - 1] == b[j - 1]),

ti.max(f[i - 1, j], f[i, j - 1]))

return f[len_a, len_b]

- NumPy arrays are stored as ndarray in Taichi.
- Ensure that you set
`ti.loop_config(serialize=True)`

to disable auto-parallelism in Taichi. The iterationshereshould not happen in parallelism because the computation of a loop iteration is dependent on its previous iterations.

- Print the result of
`compute_lcs(a_numpy, b_numpy)`

.*Now you get the following program:*

`import taichi as ti`

import numpy as np

ti.init(arch=ti.cpu)

benchmark = True

N = 15000

f = ti.field(dtype=ti.i32, shape=(N + 1, N + 1))

if benchmark:

a_numpy = np.random.randint(0, 100, N, dtype=np.int32)

b_numpy = np.random.randint(0, 100, N, dtype=np.int32)

else:

a_numpy = np.array([0, 1, 0, 2, 4, 3, 1, 2, 1], dtype=np.int32)

b_numpy = np.array([4, 0, 1, 4, 5, 3, 1, 2], dtype=np.int32)

@ti.kernel

def compute_lcs(a: ti.types.ndarray(), b: ti.types.ndarray()) -> ti.i32:

len_a, len_b = a.shape[0], b.shape[0]

ti.loop_config(serialize=True) # Disable auto-parallelism in Taichi

for i in range(1, len_a + 1):

for j in range(1, len_b + 1):

f[i, j] = ti.max(f[i - 1, j - 1] + (a[i - 1] == b[j - 1]),

ti.max(f[i - 1, j], f[i, j - 1]))

return f[len_a, len_b]

print(compute_lcs(a_numpy, b_numpy))

- Save the above code as
**lcs.py**and run:

`time python lcs.py`

*The system prints the length of the LCS, along with the execution time.*

`2721`

real 0m1.409s

user 0m1.112s

sys 0m0.549s

In this repo, we provide our implementation of this dynamic planning algorithm in Taichi and NumPy:

- With Python only, it takes 476s to work out the length of the LCS of two 15,000-long random sequences.
- With Taichi, it only takes about 0.9s and sees an up to 500x speed up!

##### note

The actual execution time may vary depending on your machine, but we believe that the performance improvements you will see is comparable to ours.