Tiled Matrix Multiplication -- CUDA implementation

Aim

Tiled MM reduces global memory accesses on GPUs: each tile is loaded into shared memory once and reused by every thread in the block.

Mathematical operation of tiling MM

$$C_{i,j}^{T\times T} = \sum_{k=1}^{n/T} A_{i,k}^{T\times T} \, B_{k,j}^{T\times T}$$

Pasted image 20230705160636.png
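As a sanity check of the block formula, a minimal C sketch that computes one T×T block of C as the sum of tile-by-tile products (the sizes N = 4 and T = 2 are arbitrary test values, not from the note):

```c
#define N 4
#define T 2

/* Compute block (bi, bj) of C = A * B as the sum over tile index k
 * of the T x T product A[bi][k] * B[k][bj], per the formula above. */
void block_matmul(const double A[N][N], const double B[N][N],
                  double Cblk[T][T], int bi, int bj) {
    for (int r = 0; r < T; r++)
        for (int c = 0; c < T; c++)
            Cblk[r][c] = 0.0;
    for (int k = 0; k < N / T; k++)      /* sum over tile index k */
        for (int r = 0; r < T; r++)
            for (int c = 0; c < T; c++)
                for (int t = 0; t < T; t++)
                    Cblk[r][c] += A[bi * T + r][k * T + t] * B[k * T + t][bj * T + c];
}
```

Each summand touches only one T×T tile of A and one of B, which is exactly what makes the shared-memory staging on the GPU possible.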

Algorithm

for (int m = 0; m < M; m += T) { /* Step through the tiles along A's rows */
	for (int n = 0; n < N; n += T) { /* Step through the tiles along B's columns */
		for (int k = 0; k < K; k += T) { /* Step through A's columns and B's rows to complete the computation for one tile of C */
			/* inside the tile, do the matrix multiplication */
			for (int mt = m; mt < m + T && mt < M; mt++) {
				for (int nt = n; nt < n + T && nt < N; nt++) {
					for (int kt = k; kt < k + T && kt < K; kt++) {
						C[mt * N + nt] += A[mt * K + kt] * B[kt * N + nt]; /* += accumulates across tiles */
					}
				}
			}
		}
	}
}
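A self-contained version of the loop nest above, checked against the naive triple loop (the sizes M = N = K = 4 and T = 2 are arbitrary test values; matrices are row-major: C is M×N, A is M×K, B is K×N):

```c
#include <string.h>

#define M 4
#define N 4
#define K 4
#define T 2

/* Tiled matrix multiply: C (MxN) = A (MxK) * B (KxN), row-major. */
void tiled_matmul(const double *A, const double *B, double *C) {
    memset(C, 0, sizeof(double) * M * N);
    for (int m = 0; m < M; m += T)
        for (int n = 0; n < N; n += T)
            for (int k = 0; k < K; k += T)
                for (int mt = m; mt < m + T && mt < M; mt++)
                    for (int nt = n; nt < n + T && nt < N; nt++)
                        for (int kt = k; kt < k + T && kt < K; kt++)
                            C[mt * N + nt] += A[mt * K + kt] * B[kt * N + nt];
}

/* Naive reference for comparison. */
void naive_matmul(const double *A, const double *B, double *C) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < K; k++)
                s += A[i * K + k] * B[k * N + j];
            C[i * N + j] = s;
        }
}
```

On a CPU the two produce identical results; tiling only changes the order in which the partial products are accumulated, which improves cache locality.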

On GPUs

notes/CUDA - grids, blocks and threads
Each thread is still responsible for computing one element of C

Access each element of matrices in the thread
int bx = blockIdx.x, by = blockIdx.y;
int tx = threadIdx.x, ty = threadIdx.y;
int row = by * blockDim.y + ty;
int col = bx * blockDim.x + tx;
/*
A[row][tx]
B[ty][col]
*/

Each thread loads the tiles block by block and accumulates the partial products to compute its final element

    __shared__ T mat_1_tile[BLOCK_DIM][BLOCK_DIM];
    __shared__ T mat_2_tile[BLOCK_DIM][BLOCK_DIM];

    T acc_sum = 0;
    for (size_t tile_id = 0; tile_id < n / BLOCK_DIM; ++tile_id) {

        /* load mat_1_tile: row comes from the block, column from the tile index */
        size_t i = blockIdx.y * blockDim.y + threadIdx.y;
        size_t j = tile_id * blockDim.x + threadIdx.x;
        mat_1_tile[threadIdx.y][threadIdx.x] = mat_1[i * n + j];

        /* load mat_2_tile: row comes from the tile index, column from the block */
        i = tile_id * blockDim.y + threadIdx.y;
        j = blockIdx.x * blockDim.x + threadIdx.x;
        mat_2_tile[threadIdx.y][threadIdx.x] = mat_2[i * n + j];

        __syncthreads(); /* wait until both tiles are fully loaded */
        for (size_t k = 0; k < BLOCK_DIM; ++k)
        {
            acc_sum += mat_1_tile[threadIdx.y][k] * mat_2_tile[k][threadIdx.x];
        }
        __syncthreads(); /* wait before the tiles are overwritten next iteration */
    }

(diagram: each thread's output element of C sits at column bx*blockDim.x + tx, row by*blockDim.y + ty)
Example for the first iteration, i.e. tile_id = 0:
Each thread loads its element of the corresponding mat_1 and mat_2 tiles, block by block, across the tiling iterations, then computes the partial sum for that tile. (Notice this happens in parallel across all the threads.) As the loop continues, the contributions of the remaining tiles are added to acc_sum.
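To make the accumulation concrete, here is a host-side C emulation of what a single thread computes, with blockIdx/threadIdx passed in as plain ints (BLOCK_DIM = 2 and the matrix size are assumed test values; on the real GPU the tile is staged in shared memory first, while here the thread just reads the same elements directly):

```c
#define BLOCK_DIM 2

/* Emulate one thread of the tiled kernel: accumulate its single output
 * element of C = mat_1 * mat_2 (both n x n, row-major), tile by tile. */
double thread_acc_sum(const double *mat_1, const double *mat_2, int n,
                      int bx, int by, int tx, int ty) {
    double acc_sum = 0.0;
    int row = by * BLOCK_DIM + ty;   /* this thread's output row    */
    int col = bx * BLOCK_DIM + tx;   /* this thread's output column */
    for (int tile_id = 0; tile_id < n / BLOCK_DIM; ++tile_id)
        /* one inner-product chunk per tile; on the GPU this reads the
         * shared-memory tiles loaded by the whole block */
        for (int k = 0; k < BLOCK_DIM; ++k)
            acc_sum += mat_1[row * n + tile_id * BLOCK_DIM + k]
                     * mat_2[(tile_id * BLOCK_DIM + k) * n + col];
    return acc_sum;
}
```

After the loop, acc_sum holds the full dot product of row `row` of mat_1 with column `col` of mat_2, which the kernel then writes to C exactly once.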