⚠ 转载请注明出处：作者：ZobinHuang，更新日期：July 28 2022

本作品由 ZobinHuang 采用知识共享署名-非商业性使用-禁止演绎 4.0 国际许可协议进行许可，在进行使用或分享前请查看权限要求。若发现侵权行为，会采取法律手段维护作者正当合法权益，谢谢配合。

本文我们将讨论 Square Matrix 乘法的 CUDA 实现。对于 Square Matrix 乘法来说，目标矩阵 $\text{Matrix C}$ 的第 $(x,y)$ 个元素，是由源矩阵 $\text{Matrix A}$ 的第 $x$ 行元素，和源矩阵 $\text{Matrix B}$ 的第 $y$ 列对应元素相乘后相加所得到的。在我们基础实现的版本中，我们为目标矩阵 $\text{Matrix C}$ 中的每一个元素分配一条线程进行处理，CUDA Kernel 如下所示:

/*!
 * \brief [CUDA Kernel] Conduct square matrix multiplication (A*B=C)
 * \param matrix_A  source matrix
 * \param matrix_B  source matrix
 * \param matrix_C  destination matrix
 * \param d         dimension of square matrix
 */
__global__ void squareMatrixMul(
    const int *matrix_A,
    const int *matrix_B,
    int *matrix_C,
    int d){
  // check kernel shape
  assert(blockDim.x == blockDim.y);
  assert(gridDim.x == gridDim.y);

  // obtain corresponding row and column for current thread
  int row_index = blockIdx.y * blockDim.y + threadIdx.y;
  int col_index = blockIdx.x * blockDim.x + threadIdx.x;

  // initialize destination element
  int dest_index = row_index*d+col_index;
  matrix_C[dest_index] = 0;

  // sum of product
  if(dest_index < d*d){
    for(int i=0; i<d; i++){
      matrix_C[dest_index] += matrix_A[row_index*d+i] * matrix_B[i*d+col_index];
    }
  }
}

在 Line 18~19 中，结合 basic_mul 图示，我们可以根据 Thread Block 和 Thread ID 对当前 Thread 处理的目标矩阵 $\text{Matrix C}$ 元素的行序号和列序号进行计算。然后在 Line 27~29 的 for 循环中我们就可以根据源矩阵 $\text{Matrix A}$ 的目标行序号和 $\text{Matrix B}$ 的目标列序号展开计算，并将最后结果写入到 $\text{Matrix C}$ 中。

基于 Memory Access Coalescing 的矩阵乘法

在上面矩阵乘法的操作中，针对 $\text{Matrix A}$ 是按行访问，针对 $\text{Matrix A}$ 是按列进行访问。二者的 Memory Access Pattern 如 matrix_a_access 和 matrix_b_access 所示。可以看到 $\text{Matrix A}$ 事实上访问的是一段连续的内存空间，而 $\text{Matrix B}$ 访问的是离散的内存空间。对于 GPU 来说，如果指令访问的是多段连续的内存空间，那么多次对小内存的访问可以被合并为一次多大内存的访问，因此 $\text{Matrix A}$ 的访问事实上会比 $\text{Matrix B}$ 更快。

为了使得 $\text{Matrix B}$ 的访问有和 $\text{Matrix A}$ 一样的性能，我们可以采用的优化方法是将 $\text{Matrix B}$ 转置后进行存储，并且使用和 $\text{Matrix A}$ 一样的访存计算方式进行访问，这样一来 $\text{Matrix B}$ 的 Memory Access Pattern 就有和 $\text{Matrix A}$ 一样的性能了。

优化后的 CUDA Kernel 如下所示:

/*!
 * \brief [CUDA Kernel] Conduct square matrix multiplication (A*B=C)
 *        with aligned memory access pattern
 * \param matrix_A  source matrix
 * \param matrix_B  source matrix (after transposed)
 * \param matrix_C  destination matrix
 * \param d         dimension of square matrix
*/
__global__ void alignedSquareMatrixMul(
    const int *matrix_A,
    const int *matrix_B,
    int *matrix_C,
    const int d){
  // check kernel shape
  assert(blockDim.x == blockDim.y);
  assert(gridDim.x == gridDim.y);

  // obtain corresponding row and column for current thread
  int row_index = blockIdx.y * blockDim.y + threadIdx.y;
  int col_index = blockIdx.x * blockDim.x + threadIdx.x;

  // initialize destination element
  int dest_index = row_index*d+col_index;
  matrix_C[dest_index] = 0;

  // sum of product
  if(dest_index < d*d){
    for(int i=0; i<d; i++){
      matrix_C[dest_index] += matrix_A[row_index*d+i] * matrix_B[col_index*d+i];
    }
  }
}

/*!
 * \brief Transpose square matrix
 * \param matrix    source matrix
 */
void transposeSquareMatrix(std::vector<int> &matrix){
    // obtained matrix size
    int size = matrix.size();
    double _dimension = sqrt(size);
    int dimension = (int)_dimension;

    // assertion to check matrix shape
    assert(_dimension-dimension == 0.0f);
    
    // temp intermediate matrix
    std::vector<int> tmp_matrix;
    tmp_matrix.reserve(size);
    tmp_matrix = matrix;

    // for all rows
    for(int i=0; i<dimension; i++){
        // for all column
        for(int j=0; j<dimension; j++){
            matrix[i*dimension+j] = tmp_matrix[j*dimension+i];
        }
    }
}

基于 Cache Tiling 的矩阵乘法

上面展示的基础版本的矩阵乘法中，我们在进行 sum of product 时，在 Line 28 中引入了大量的内存访问。与 Shared Memory 和其它 Cache 相比，访存的开销是较大的。访存将导致运行 CUDA Kernel 的 SM 陷入 Stall 的状态。本节中我们将把矩阵加载到 Shared Memory 中，以加速 CUDA Kernel 的执行过程。

如 tiled_mul 所示，以 $\text{Matrix C}$ 中标有颜色的 Thread Block 为例。要计算出该方块中的元素的值，则必须加载 $\text{Matrix A}$ 和 $\text{Matrix B}$ 中带有颜色的方块中的数据。由于 Shared Memory 通常容量有限，因此在这里我们采取 Matrix Tiling 的策略。我们首先加载 $\text{Matrix A}$ 和 $\text{Matrix B}$ 中蓝色部分的数据，并进行 sum of product 的计算，然后将中间结果记录下来；然后加载红色部分的数据，然后进行 sum of product 的计算，最后和中间结果进行相加，就能得到 $\text{Matrix C}$ 的最终目标值。每一个 Thread Block 都进行上述步骤，最终我们就能得到整个 $\text{Matrix C}$ 的计算结果。

基于 Cache Tiling 的 CUDA Kernel 源代码如下所示:

/*!
 * \brief [CUDA Kernel] Conduct square matrix multiplication (A*B=C)
 *        based on cache tiling, make sure tile size is equal to 
 *        block size
 * \param matrix_A  source matrix
 * \param matrix_B  source matrix
 * \param matrix_C  destination matrix
 * \param tile_size size of each tile
 * \param d         dimension of square matrix
 */
__global__ void tiledSquareMatrixMul(
    const int *matrix_A,
    const int *matrix_B,
    int *matrix_C,
    const int tile_size,
    const int d){
  // check kernel shape
  assert(blockDim.x == blockDim.y);
  assert(gridDim.x == gridDim.y);
  assert(tile_size == blockDim.x);
  
  // declare share memory are for submatrix
  extern __shared__ int tile[];
  int* tile_A = tile;
  int* tile_B = tile+tile_size*tile_size;

  // obtain global row and column index for current thread
  int row_index = blockIdx.y * blockDim.y + threadIdx.y;
  int col_index = blockIdx.x * blockDim.x + threadIdx.x;
  
  // obtain shared memory index for current thread
  int shared_tile_index = threadIdx.y*blockDim.x+threadIdx.x;

  int tmp = 0;

  // traverse all tiles
  for(int i=0; i<d; i += tile_size){
    // load element into shared memory
    tile_A[shared_tile_index] = matrix_A[row_index*d+threadIdx.x+i];
    tile_B[shared_tile_index] = matrix_B[threadIdx.y*d+col_index+i*d];

    // wait for both tiles to be loaded by threads in current CLB
    __syncthreads();

    // computation
    for(int j=0; j<tile_size; j++){
      tmp += tile_A[threadIdx.y*blockDim.x+j]*tile_B[j*blockDim.x+threadIdx.x];
    }

    // wait for all threads finish computation for current tiles
    // before loading in new one
    __syncthreads();
  }

  // write result to global memory
  matrix_C[row_index*d+col_index] = tmp;
}

在上面的代码中，Line 23~25 我们首先基于 __shared__ 关键字声明了一块位于 Shared Memory 中的内存空间，值得注意的是我们是通过 extern 关键字以动态分配的方式声明这段空间的。接着我们在 Line 37~53 这个 for loop 中，分 Tiles 地进行数据的加载和计算，其中在 Line 39~40 我们将当前 Thread Block 在当前 Tile 所需要的数据拉取到 Shared Memory 中，然后在 Line 46~48 的 for loop 中完成相应的计算。

另外值得注意的是，我们在将数据加载到 Shared Memory 后，在 Line 43 我们调用了 __syncthreads() API，这个 API 会阻塞等待当前 Thread Block 中所有的 Threads 运行到当前这一步，因为只有在当前 Thread Block 中所有的 Threads 都完成数据向 Shared Memory 中的拷贝后，我们才能进行后续的计算操作。在当前 Thread Block 中所有的 Threads 完成计算过程后，在 Line 52 我们同样调用了该 API，因为只有当所有的 Threads 完成运算之后，我们才能继续将目标数据拷贝到 Shared Memory 中，这个操作将会覆盖 Shared Memory 中原有的数据。

另外，由于该 Kernel 采用了动态分配 Shared Memory 的方式，因此我们在 CPU 侧代码 launch 这个 kernel 的时候，需要在 <<<>>> 中加入第三个参数，用于指定动态分配的 Shared Memory 空间的大小，如下所示:

// initialize kernel configuration
// number of kernels per block (one dimension)
int NUM_THREADS_PER_BLOCK = 1 << 5;

// obtain shared memory size for each thread block
int shared_memory_size = 2*NUM_THREADS_PER_BLOCK*NUM_THREADS_PER_BLOCK*sizeof(int);

// launch kernel
tiledSquareMatrixMul<<<blocks, threads, shared_memory_size>>>(
              d_matrix_A, d_matrix_B, d_matrix_C, NUM_THREADS_PER_BLOCK, N);

Zobin

Matrix Multiplication

⚠ 转载请注明出处：作者：ZobinHuang，更新日期：July 28 2022

目录

矩阵在内存中的存储形式

基础矩阵乘法

基于 Memory Access Coalescing 的矩阵乘法

基于 Cache Tiling 的矩阵乘法