跳转至主要内容
Version: v1.1.3

Field(进阶版)

现代处理器内核的运算速度往往比配套的内存系统高出几个数量级。 为了缩小这一性能差距,计算机结构中采用了多级缓存系统和高带宽多通道内存。

在你熟悉了Taichi Field的基础知识之后, 本文将帮助你进一步了解对编写高性能 Taichi 程序至关重要的底层内存布局。 我们还将特别介绍如何有效地组织数据布局以及如何管理内存占用。

组织一个高效的数据布局

本节将介绍如何组织 Taichi field 的数据布局。 高效数据布局的核心原则是 locality。 Generally speaking, a program with desirable locality has at least one of the following features:

  • 稠密数据结构
  • 在小范围内循环数据(小于 32KB 对大多数处理器合适)
  • 序列加载/存储
note

请注意,数据总是从内存中以 block(page)为单位获取的。 硬件对 block 内某一具体的数据元素是如何使用的并不知情。 处理器只管从请求的内存地址读取整个 block 的数据 因此,当数据没有得到充分利用时,内存带宽将被浪费。

如需了解稀疏 field,请参阅 稀疏计算

布局 101:从 shapeti.root.X

在基本用法中,我们使用 shape 描述符来构建一个 field。 Taichi 提供了灵活的语句 ti.root.X,用于描述更复杂的数据组织。 让我们来熟悉一些例子:

  • 声明一个 0 维 filed:
x = ti.field(ti.f32)
ti.root.place(x)
# is equivalent to:
x = ti.field(ti.f32, shape=())
  • 声明一个尺寸为 3 的 1 维的 field:
x = ti.field(ti.f32)
ti.root.dense(ti.i, 3).place(x)
# is equivalent to:
x = ti.field(ti.f32, shape=3)
  • 声明一个尺寸为 (3, 4) 的二维 field:
x = ti.field(ti.f32)
ti.root.dense(ti.ij, (3, 4)).place(x)
# is equivalent to:
x = ti.field(ti.f32, shape=(3, 4))

你也可以把两个 1 维 dense 语句嵌套起来描述相同的二维数组。

x = ti.field(ti.f32)
ti.root.dense(ti.i, 3).dense(ti.j, 4).place(x)

简而言之, ti.root.X 语句逐步将 field 的形状绑定到对应的轴。 通过多个语句的嵌套,我们可以构建一个更高维度的 field。

你可以使用 struct-for 来遍历嵌套语句:

for i, j in A:
A[i, j] += 1

读取 A的顺序, 也就是迭代 ij的顺序,会对程序性能产生少量影响。 Taichi 编译器能够自动计算底层的数据布局并适当调整数据读取顺序。 这是 Taichi 编程语言与其他大多数通用编程语言的一大优势。在这些语言中,数据读取顺序只能手动优化。

行主序 vs. 列主序

你可能已经通过计算机体系架构课程中了解到内存地址空间是线性的。 在不失一般性的情况下,我们忽略了数据类型的差异,并假定每个数据元素都有尺寸 1。 此外,我们把一个 field 的起始内存地址标为 basis, 那么一个 1 维 Taichi field 的第 i 个元素的索引公式是 base + i

对于多维 field,我们可以通过以下两种方式将高维度索引平展到线性内存地址空间: 举一个形状为 (M, N) 的 2 维 field 的例子:我们可以存储 M 行与 N长度1D 缓冲区, 说 行主序 路径,或存储 N 列,表示 列主序 The index flatten formula for the第 (i, j) 个元素的展平公式对于行主序 field 是 base + i * N + j,对于列主序 field 是 base + j * M + i

We can easily derive that elements in the same row are close in memory for row-major fields. The selection of the optimal layout is based on how the elements are accessed, namely, the access patterns. Patterns such as frequently accessing elements of the same row in a column-major field typically lead to performance degradation.

默认的 Taichi field 的布局是行主序的。 你可以通过 ti.root 语句这样定义 field:

ti.root.dense(ti.i, M).dense(ti.j, N).place(x)   # row-major
ti.root.dense(ti.j, N).dense(ti.i, M).place(y) # column-major

In the code above, the axis denotation in the rightmost dense statement indicates the continuous axis. For the x field, elements in the same row (with same i and different j) are close in memory, hence it's row-major; For the y field, elements in the same column (same j and different i) are close, hence it's column-major. 我们通过 (2, 3) 这个例子可视化了 xy 的内存布局:

# 地址: 小 ........................................... 大
# x: x[0, 0] x[0, 1] x[1, 0] x[1, 1] x[2, 0] x[2, 1]
# y: y[0, 0] y[1, 0] y[2, 0] y[0, 1] y[1, 1] y[2, 1]

It is worth noting that the accessor is unified for Taichi fields: the (i, j)-th element in the field is accessed with the identical 2D index x[i, j] and y[i, j]. Taichi handles the layout variants and applies proper indexing equations internally. Thanks to this feature, users can specify their desired layout at definition, and use the fields without concerning about the underlying memory organizations. To change the layout, it's sufficient to just swap the order of dense statements, and leave rest of the code intact.

note

以下 C 语言代码片段为熟悉 C/C++ 的读者演示了 2D 数组的数据访问:

int x[3][2];  // 行主序
int y[2][3]; // 列主序

for (int i = 0; i < 3; i++) {
for (int j = 0; j < 2; j++) {
do_something(x[i][j]);
do_something(y[j][i]);
}
}

The accessors of x and y are in reverse order between row-major arrays and column-major arrays, respectively. Compared with Taichi fields, there is much more code to revise when you change the memory layout.

AoS versus SoA

AoS means array of structures and SoA means structure of arrays. Consider an RGB image with 4 pixels and 3 color channels, an AoS layout stores RGBRGBRGBRGB while an SoA layout stores RRRRGGGGBBBB.

The selection of AoS or SoA layout largely depends on the access pattern to the field. Let's discuss a scenario to process large RGB images. The two layouts have the following arrangements in memory:

# address: low ...................... high
# AoS: RGBRGBRGBRGBRGBRGB.............
# SoA: RRRRR...RGGGGGGG...GBBBBBBB...B

要计算每个像素的灰度尺度,你需要所有颜色通道但不需要其他像素的值。 In this case, the AoS layout has a better memory access pattern: Since color channels are stored continuously, and adjacent channels can be fetched instantly. The SoA layout is not a good option because the color channels of a pixel are stored far apart in the memory space.

我们将介绍如何使用 ti.root.X 语句构建 AoS 和 SoA field。 The SoA fields are trivial:

x = ti.field(ti.f32)
y = ti.field(ti.f32)
ti.root.dense(ti.i, M).place(x)
ti.root.dense(ti.i, M).place(y)

其中 M 是 xy 的长度。 xy 中的数据元素在内存中是连续的:

#  address: low ................................. high
# x[0] x[1] x[2] ... y[0] y[1] y[2] ...

我们这样构建 AoS field

x = ti.field(ti.f32)
y = ti.field(ti.f32)
ti.root.dense(ti.i, M).place(x, y)

内存布局就变成了

#  address: low .............................. high
# x[0] y[0] x[1] y[1] x[2] y[2] ...

Here, place interleaves the elements of Taichi fields x and y.

As previously introduced, the access methods to x and y remain the same for both AoS and SoA. Therefore, the data layout can be changed flexibly without revising the application logic.

For better illustration, let's see an example of an 1D wave equation solver:

N = 200000
pos = ti.field(ti.f32)
vel = ti.field(ti.f32)
# SoA placement
ti.root.dense(ti.i, N).place(pos)
ti.root.dense(ti.i, N).place(vel)

@ti.kernel
def step():
pos[i] += vel[i] * dt
vel[i] += -k * pos[i] * dt

上面的代码片段定义了 SoA field 和一个可以顺序访问每个元素对 step kernel。 The kernel fetches an element from pos and vel for every iteration, respectively. For SoA fields, the closest distance of any two elements in memory is N, which is unlikely to be efficient for large N.

如下面的代码所示,我们把布局切换为 AoS:

N = 200000
pos = ti.field(ti.f32)
vel = ti.field(ti.f32)
# AoS placement
ti.root.dense(ti.i, N).place(pos, vel)

@ti.kernel
def step():
pos[i] += vel[i] * dt
vel[i] += -k * pos[i] * dt

Merely revising the place statement is sufficient to change the layout. With this optimization, the instant elements pos[i] and vel[i] are now adjacent in memory, which is more efficient.<!-- ```python

SoA version

N = 200000 pos = ti.field(ti.f32) vel = ti.field(ti.f32) ti.root.dense(ti.i, N).place(pos) ti.root.dense(ti.i, N).place(vel)

@ti.kernel def step(): pos[i] += vel[i] dt vel[i] += -k pos[i] * dt


```python
# AoS version
N = 200000
pos = ti.field(ti.f32)
vel = ti.field(ti.f32)
ti.root.dense(ti.i, N).place(pos)
ti.root.dense(ti.i, N).place(vel)

@ti.kernel
def step():
pos[i] += vel[i] * dt
vel[i] += -k * pos[i] * dt

Here, pos and vel for SoA are placed separately, so the distance in address space between pos[i] and vel[i] is 200000. This results in poor spatial locality and poor performance. A better way is to place them together:

ti.root.dense(ti.i, N).place(pos, vel)

Then vel[i] is placed right next to pos[i], which can increase spatial locality and therefore improve performance. --><!-- For example, the following places two 1D fields of shape 3 together, which is called Array of Structures (AoS):

ti.root.dense(ti.i, 3).place(x, y)

Their memory layout is:

By contrast, the following places these two fields separately, which is called Structure of Arrays (SoA): --><!-- ```python ti.root.dense(ti.i, 3).place(x) ti.root.dense(ti.i, 3).place(y)

Now their memory layout is:


**To improve spatial locality of memory accesses, it may be helpful to
place data elements that are often accessed together within
relatively close addresses.** -->
### AoS extension: hierarchical fields
<!-- haidong: I hope to remove this subsection. This content just repeats the AoS topic --> Sometimes we want to access memory in a complex but fixed pattern, like traversing an image in 8x8 blocks. The apparent best practice is to flatten each 8x8 block and concatenate them together. From a Taichi user's perspective, however, the field is no longer a flat buffer. It now has a hierarchy with two levels: The image level and the block level. Equivalently, the field is an array of implicit 8x8 block structures.
We demonstrate the statements as follows:

```python
# Flat field
val = ti.field(ti.f32)
ti.root.dense(ti.ij, (M, N)).place(val)
# Hierarchical field
val = ti.field(ti.f32)
ti.root.dense(ti.ij, (M // 8, N // 8)).dense(ti.ij, (8, 8)).place(val)

其中 M and N 是 8 的倍数。 一定要尝试一下啊! 你会发现性能的差异巨大!

管理内存占用

Field 的手动分配和销毁

总的来说 Taichi 对内存的管理和销毁对用户是不可见的。 不过,用户有时会需要接管他们的内存分配。

针对这种情况,Taichi 提供了 FieldsBuilder,用于支持 field 相关内存的手动分配和销毁。 FieldsBuilder features identical declaration APIs as ti.root. 还需要在所有声明之后调用 finalize()finalize() 返回一个 SNodeTree 对象用于处理随后的内存销毁。

我们来看一个简单的例子:

import taichi as ti
ti.init()

@ti.kernel
def func(v: ti.template()):
for I in ti.grouped(v):
v[I] += 1

fb1 = ti.FieldsBuilder()
x = ti.field(dtype=ti.f32)
fb1.dense(ti.ij, (5, 5)).place(x)
fb1_snode_tree = fb1.finalize() # Finalizes the FieldsBuilder and returns a SNodeTree
func(x)
fb1_snode_tree.destroy() # Destruction

fb2 = ti.FieldsBuilder()
y = ti.field(dtype=ti.f32)
fb2.dense(ti.i, 5).place(y)
fb2_snode_tree = fb2.finalize() # Finalizes the FieldsBuilder and returns a SNodeTree
func(y)
fb2_snode_tree.destroy() # Destruction

尽管 ti.root 在实际上具有自动管理内存分配和回收的能力,上面演示的 ti.root 语句是用由 FieldsBuilder实现的。

紧凑排布模式

默认情况下,Taichi 会自动把一个 field 放进一个尺寸为 2 次方的略大一些的缓存当中。 我们采用 2 的 n 次幂的填充规则,因为它在计算机图形学中被广泛采用。 The design enables fast indexing with bitwise arithmetic and better memory address alignment, while trading off memory occupations.

例如:一个 (18, 65) 的 field 在物理内存中占据一块 (32, 128) 的缓存,这还是可以接受的。 但是,随着 field 大小的增长,这一填充策略就会变得令人无法忍受:(129, 6553600) 的 field 将独占 (256, 6335600) 的缓冲区,造成大量内存未被使用。 针对这种情况,Taichi 提供了可选的紧凑排布模式为指定的 field 分配更为合理和紧凑的缓存空间。 当内存占用成为一个重点考量时,这一模式就变得尤为有用。

如需利用紧凑排布模式,请在调用 ti.init() 时指定参数 packed

ti.init()  # default: packed=False
a = ti.field(ti.i32, shape=(18, 65)) # padded to (32, 128)
ti.init(packed=True)
a = ti.field(ti.i32, shape=(18, 65)) # no padding

你可能会发现少量的性能回退,这是因为紧凑排布模式下的内存寻址与内存对齐会更加复杂。 因此,只有在内存容量是成为主要问题时才需要设置紧凑排布模式。