|
6 | 6 | "source": [ |
7 | 7 | "# 3D Grid on GPU with Kernel Tuner\n", |
8 | 8 | "\n", |
9 | | - "In this tutrial we are going to see how to map a series of Gaussian functions, each located at a different point on a 3D a grid. We are going to optimize the kernel of the GPU code and compare its performance with the CPU implementation. \n", |
| 9 | + "In this tutorial we are going to see how to map a series of Gaussian functions, each located at a different point on a 3D a grid. We are going to optimize the GPU code and compare its performance with the CPU implementation. \n", |
10 | 10 | "\n", |
11 | 11 | "<div class=\"alert alert-info\">\n", |
12 | 12 | "\n", |
|
21 | 21 | "source": [ |
22 | 22 | "## Let's start on the CPU\n", |
23 | 23 | "\n", |
24 | | - "Before dwelving on the GPU implementation, let's start with a simple CPU implementation of the problem. The problem at hand is to compute the values of the following function \n", |
| 24 | + "Before delving into the GPU implementation, let's start with a simple CPU implementation of the problem. The problem at hand is to compute the values of the following function \n", |
25 | 25 | "\n", |
26 | 26 | "\\begin{equation} \\nonumber\n", |
27 | 27 | "f = \\sum_{i=1}^{N}\\exp\\left(-\\beta \\sqrt{(x-x_i)^2+(y-y_i)^2+(z-z_i)^2}\\right)\n", |
28 | 28 | "\\end{equation}\n", |
29 | 29 | "\n", |
30 | | - "on a 3d grid. The $x$, $y$ and $z$ vectors contain here the coordinate of the points in the cartesian space. We can define a simple python function that computes the value of the function $f$ for one given Gaussian. " |
| 30 | + "on a 3d grid. The $x$, $y$ and $z$ vectors contain the coordinate of the points in the Cartesian space. We can define a simple Python function that computes the value of the function $f$ for one given Gaussian. Don't forget to execute all the code cells, like the one below, as you read through this notebook by selecting the cell and pressing *shift+enter*." |
31 | 31 | ] |
32 | 32 | }, |
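| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "A minimal sketch of such a function could look like the cell below. The function name `add_gaussian` and the default width `beta` are only placeholders; the actual names and values may differ." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": { |
| | + "collapsed": false |
| | + }, |
| | + "outputs": [], |
| | + "source": [ |
| | + "import numpy as np\n", |
| | + "\n", |
| | + "def add_gaussian(xgrid, ygrid, zgrid, x0, y0, z0, beta=1.0):\n", |
| | + "    # distance of every grid point to the Gaussian center (x0, y0, z0)\n", |
| | + "    d = np.sqrt((xgrid - x0)**2 + (ygrid - y0)**2 + (zgrid - z0)**2)\n", |
| | + "    # value of the Gaussian, exp(-beta * d), on the whole grid\n", |
| | + "    return np.exp(-beta*d)" |
| | + ] |
| | + }, |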
33 | 33 | { |
|
54 | 54 | "cell_type": "markdown", |
55 | 55 | "metadata": {}, |
56 | 56 | "source": [ |
57 | | - "For a given center, this function returns the values of the corresponding Gaussian function mapped on the 3d Grid. The grid points are here defined by the variables `xgrid`, `ygrid` and `zgrid`. These variables are themselves 3D grid obtained as we wil see in an instant with the `numpy.meshgrid` function. \n", |
| 57 | + "For a given center, this function returns the values of the corresponding Gaussian function mapped on the 3D grid. The grid points are here defined by the variables `xgrid`, `ygrid` and `zgrid`. These variables are themselves 3D grids obtained, as we will see in an instant, using the `numpy.meshgrid` function. \n", |
58 | 58 | "\n", |
59 | | - "To use this function we simply have to create the x,y and z grids. Since we want to later on send these vectors on the GPU we define them as float. For simplicity we here select the interval $[-1:1]$ to define the grid. We use $n=256$ grid points in order to have a sufficiently large probelm without requiring too long calculations. We then create meshgrids to be passed to the function above. We define here 100 gaussian centers that are randomly distributed within the grid space." |
| 59 | + "To use this function we simply have to create the grid, defined by the vectors x, y, and z. Since we want to later on send these vectors to the GPU we define them as 32-bit floats. For simplicity, we here select the interval $[-1:1]$ to define our grid. We use $n=256$ grid points in order to have a sufficiently large problem without requiring too long calculations. We then create meshgrids to be passed to the function above. We define here 100 gaussian centers that are randomly distributed within the 3D space." |
60 | 60 | ] |
61 | 61 | }, |
62 | 62 | { |
63 | 63 | "cell_type": "code", |
64 | 64 | "execution_count": 2, |
65 | | - "metadata": {}, |
| 65 | + "metadata": { |
| 66 | + "collapsed": false |
| 67 | + }, |
66 | 68 | "outputs": [ |
67 | 69 | { |
68 | 70 | "name": "stdout", |
|
88 | 90 | "# centers\n", |
89 | 91 | "npts = 100\n", |
90 | 92 | "center = (-1 + 2*np.random.rand(npts,3)).astype(np.float32)\n", |
91 | | - "center = center.astype(np.float32)\n", |
92 | 93 | "\n", |
93 | 94 | "# compute the grid and time the operation\n", |
94 | 95 | "t0 = time()\n", |
|
101 | 102 | "cell_type": "markdown", |
102 | 103 | "metadata": {}, |
103 | 104 | "source": [ |
104 | | - "Depending on your hardware it might take a few seconds for the calculations of the function to finish." |
| 105 | + "Depending on your hardware it might take a few seconds for the calculations above to finish." |
105 | 106 | ] |
106 | 107 | }, |
107 | 108 | { |
108 | 109 | "cell_type": "markdown", |
109 | 110 | "metadata": {}, |
110 | 111 | "source": [ |
111 | | - "## Let's go on GPU\n", |
| 112 | + "## Let's move to the GPU\n", |
112 | 113 | "\n", |
113 | | - "Let's see now how that will look like on the GPU. We first write a kernel template that does the same calculation as the function above. As yp can see see below, the variables *block_size_x, block_size_y* and *block_size_z* are npt explicitly defined here. These variables define how many thread will run simultaneously on the GPU and are the main parameters that the kernel tuner will optimize.Therefore during the tuning phase, the kernel tuner will automatically insert **#define** statements for these parameters at the top of the kernel code. So for now we don't have to sepcify their values. \n", |
| 114 | + "Let's see now how that will look like on the GPU. We first write a kernel that does the same calculation as the above function. As you can see see below, the variables `block_size_x`, `block_size_y` and `block_size_z` are not yet defined here. These variables are used to set the number of threads per thread block on the GPU and are the main parameters that we will optimize in this tutorial. During tuning, Kernel Tuner will automatically insert `#define` statements for these parameters at the top of the kernel code. So for now we don't have to specify their values. \n", |
114 | 115 | "\n", |
115 | | - "The dimensions of the problem, called here *nx, ny, nz*, are parameters of the template. We are going to see in the following how to use this template to generate a kernel code with specific dimensions." |
| 116 | + "The dimensions of the problem `nx`, `ny`, and `nz`, are the number of grid points in the x, y, and z dimensions. We can again use Kernel Tuner to insert these parameters into the code." |
116 | 117 | ] |
117 | 118 | }, |
118 | 119 | { |
|
127 | 128 | "# several parameters are available\n", |
128 | 129 | "# block sizes : bx, by, bz \n", |
129 | 130 | "# dimensions : nx, ny, nz\n", |
130 | | - "kernel_code_template = \"\"\"\n", |
| 131 | + "kernel_code = \"\"\"\n", |
131 | 132 | "#include <math.h>\n", |
132 | 133 | "\n", |
133 | 134 | "// a simple gaussian function\n", |
|
146 | 147 | " int y = threadIdx.y + block_size_y * blockIdx.y;\n", |
147 | 148 | " int z = threadIdx.z + block_size_z * blockIdx.z;\n", |
148 | 149 | "\n", |
149 | | - " if ( ( x < %(nx)s ) && (y < %(ny)s) && (z < %(nz)s) )\n", |
| 150 | + " if ( ( x < nx ) && (y < ny) && (z < nz) )\n", |
150 | 151 | " {\n", |
151 | 152 | "\n", |
152 | 153 | " float dx = xvect[x]-x0;\n", |
153 | 154 | " float dy = yvect[y]-y0;\n", |
154 | 155 | " float dz = zvect[z]-z0;\n", |
155 | 156 | " float d = sqrt(dx*dx + dy*dy + dz*dz);\n", |
156 | | - " out[y * %(nx)s * %(nz)s + x * %(nz)s + z] = f(d);\n", |
| 157 | + " out[y * nx * nz + x * nz + z] = f(d);\n", |
157 | 158 | " }\n", |
158 | 159 | "}\n", |
159 | 160 | "\"\"\"" |
|
165 | 166 | "source": [ |
166 | 167 | "### Tune the kernel\n", |
167 | 168 | "\n", |
168 | | - "We can now use the kernel tuner to optimize the block sizes on our GPU. To do so we define the tune_params dictionary that assigns to each block size the values we want the kernel tuner to explore. We also define a list containing the arguments of the CUDA function (AddGrid) above. Since we only want to optimize the performance of the kernel we only consider here one center in the middle of the grid. Note that the kernel tuner needs either numpy.ndarray or numpy.scalar as argumenst of the kernel. Hence we need to be specific on the types of the gaussians positions. " |
| 169 | + "We can now use the tuner to optimize the thread block dimensions on our GPU. To do so we define the tunable parameters of our kernel using the `tune_params` dictionary, which assigns to each block size the values we want the tuner to explore. We also use the tunable parameters to insert the domain dimensions `nx`, `ny`, and `nz`.\n", |
| 170 | + "\n", |
| 171 | + "We also define a list containing the arguments of the CUDA function (AddGrid) above. Since we only want to optimize the performance of the kernel we only consider here one center in the middle of the grid. Note that Kernel Tuner needs either `numpy.ndarray` or `numpy.scalar` as arguments of the kernel. Hence we need to be specific on the types of the Gaussians positions. " |
169 | 172 | ] |
170 | 173 | }, |
171 | 174 | { |
172 | 175 | "cell_type": "code", |
173 | | - "execution_count": 4, |
| 176 | + "execution_count": null, |
174 | 177 | "metadata": { |
175 | 178 | "collapsed": true, |
176 | 179 | "scrolled": true |
|
185 | 188 | "tune_params['block_size_x'] = [2,4,8,16,32]\n", |
186 | 189 | "tune_params['block_size_y'] = [2,4,8,16,32]\n", |
187 | 190 | "tune_params['block_size_z'] = [2,4,8,16,32]\n", |
| 191 | + "tune_params['nx'] = [n]\n", |
| 192 | + "tune_params['ny'] = [n]\n", |
| 193 | + "tune_params['nz'] = [n]\n", |
188 | 194 | "\n", |
189 | 195 | "# define the final grid\n", |
190 | 196 | "grid = np.zeros_like(xgrid)\n", |
|
201 | 207 | "cell_type": "markdown", |
202 | 208 | "metadata": {}, |
203 | 209 | "source": [ |
204 | | - "We then generate the kernel code by replacing the dimensions of the problem by their values. This is simply done by using the kernel template to replace the variables *nx,ny,nz*, that are written *%(nx)s, %(ny)s, %(nz)s* in the template by their values. As mentionned earlier, the kernel tuner will automatically insert #define statements at the top of the kernel to define the block sizes so we don't need to specify them here. Then we simply call the tune_kernel function. " |
| 210 | + "As mentioned earlier, the tuner will automatically insert `#define` statements at the top of the kernel to define the block sizes and domain dimensions, so we don't need to specify them here. Then, we simply call the `tune_kernel` function. " |
205 | 211 | ] |
206 | 212 | }, |
207 | 213 | { |
208 | 214 | "cell_type": "code", |
209 | 215 | "execution_count": 5, |
210 | | - "metadata": {}, |
| 216 | + "metadata": { |
| 217 | + "collapsed": false |
| 218 | + }, |
211 | 219 | "outputs": [ |
212 | 220 | { |
213 | 221 | "name": "stdout", |
|
309 | 317 | } |
310 | 318 | ], |
311 | 319 | "source": [ |
312 | | - "# generate the kernel code from the template\n", |
313 | | - "kernel_code = kernel_code_template % {'nx' : n, 'ny': n, 'nz' : n}\n", |
314 | | - "\n", |
315 | 320 | "# call the kernel tuner\n", |
316 | | - "result = tune_kernel('AddGrid', kernel_code,problem_size,args,tune_params)" |
| 321 | + "result = tune_kernel('AddGrid', kernel_code, problem_size, args, tune_params)" |
317 | 322 | ] |
318 | 323 | }, |
319 | 324 | { |
320 | 325 | "cell_type": "markdown", |
321 | 326 | "metadata": {}, |
322 | 327 | "source": [ |
323 | | - "The kernel explore all the possible combinations of tunable parameters (here only the block size). For each of them the kernel automatically introduces **#define** statements that specify the block sizes, compile the code and time the execution. At the end of the the run the tuner output the optimal conbination of the tunable parameters. As you can see the range of performances is quite large. With our GPU (GeForce GTX 1080 Ti) we obtained a maximum time of 5.30 ms and minimum one of 0.84 ms. Optimising the kernel allows dividing the execution time by a factor 6 !" |
| 328 | + "The `tune_kernel` function explores all the possible combinations of tunable parameters (here only the block size). For each possible kernel configuration, the tuner compiles the code and its measures execution time (by default using 7 iterations). At the end of the the run, the `tune_kernel` outputs the optimal combination of the tunable parameters. But the measured execution time of all benchmarked kernels is also returned by `tune_kernel` for programmatic access to the data.\n", |
| 329 | + "\n", |
| 330 | + "As you can see the range of performances is quite large. With our GPU (GeForce GTX 1080 Ti) we obtained a maximum time of 5.30 ms and minimum one of 0.84 ms. The performance of the kernel varies by a factor 6 depending on the thread block size!" |
324 | 331 | ] |
325 | 332 | }, |
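| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "As a quick sketch of such programmatic access, the cell below assumes that `result` is a tuple whose first element is a list of dictionaries, each holding the tunable parameter values together with a measured `time` in milliseconds." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": { |
| | + "collapsed": false |
| | + }, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# sketch: pick the fastest configuration from the tuner output\n", |
| | + "# (assumes each record holds the block sizes plus a 'time' field in ms)\n", |
| | + "records, env = result\n", |
| | + "best = min(records, key=lambda rec: rec['time'])\n", |
| | + "print('fastest configuration:', best)" |
| | + ] |
| | + }, |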
326 | 333 | { |
|
344 | 351 | "import pycuda.autoinit\n", |
345 | 352 | "\n", |
346 | 353 | "# optimal values of the block size\n", |
347 | | - "block = [4,2,16]\n", |
| 354 | + "block = [4, 2, 16]\n", |
348 | 355 | "\n", |
349 | 356 | "# corresponding grid size\n", |
350 | | - "grid_dim = [ int(np.ceil(n/b)) for b,n in zip(block,problem_size)]" |
| 357 | + "grid_dim = [int(np.ceil(n/b)) for b, n in zip(block, problem_size)]" |
351 | 358 | ] |
352 | 359 | }, |
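| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "For example, assuming `problem_size` is $(256, 256, 256)$ as above, the block size `[4, 2, 16]` gives a grid of $\\lceil 256/4 \\rceil \\times \\lceil 256/2 \\rceil \\times \\lceil 256/16 \\rceil = 64 \\times 128 \\times 16$ thread blocks." |
| | + ] |
| | + }, |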
353 | 360 | { |
354 | 361 | "cell_type": "markdown", |
355 | 362 | "metadata": {}, |
356 | 363 | "source": [ |
357 | | - "Before using the kernel we need to specify the block size in its definition. There are different ways of doing this, we here simply replace the *block_size_x,block_size_y* and *block_size_z* by their values determined by the tuner.Inorder to do that we create a dictionary that associates the name of the block size and their values and simply make the substitution. Once the block size are specified, we can compile the kernel ourselves and get the function." |
| 364 | + "Before using the kernel we need to specify the block size in its definition. There are different ways of doing this, we here simply replace the `block_size_x`, `block_size_y` and `block_size_z` by their values determined by the tuner. In order to do that we create a dictionary that associates the name of the block size and their values and simply make the substitution. Once the block size are specified, we can compile the kernel ourselves and get the function." |
358 | 365 | ] |
359 | 366 | }, |
360 | 367 | { |
|
370 | 377 | "fixed_params['block_size_x'] = block[0]\n", |
371 | 378 | "fixed_params['block_size_y'] = block[1]\n", |
372 | 379 | "fixed_params['block_size_z'] = block[2]\n", |
| 380 | + "fixed_params['nx'] = n\n", |
| 381 | + "fixed_params['ny'] = n\n", |
| 382 | + "fixed_params['nz'] = n\n", |
373 | 383 | "\n", |
374 | 384 | "for k,v in fixed_params.items():\n", |
375 | 385 | " kernel_code = kernel_code.replace(k,str(v))\n", |
|
389 | 399 | { |
390 | 400 | "cell_type": "code", |
391 | 401 | "execution_count": 8, |
392 | | - "metadata": {}, |
| 402 | + "metadata": { |
| 403 | + "collapsed": false |
| 404 | + }, |
393 | 405 | "outputs": [ |
394 | 406 | { |
395 | 407 | "name": "stdout", |
|
441 | 453 | "name": "python", |
442 | 454 | "nbconvert_exporter": "python", |
443 | 455 | "pygments_lexer": "ipython3", |
444 | | - "version": "3.6.1" |
| 456 | + "version": "3.5.1" |
445 | 457 | } |
446 | 458 | }, |
447 | 459 | "nbformat": 4, |
|