Profiling CUDA Utilizing Nsight Techniques: A Numba Instance | by Carlos Costa, Ph.D. | Could, 2024
Optimization is a vital a part of writing excessive efficiency code, irrespective of if you’re writing an internet server or computational fluid dynamics simulation software program. Profiling means that you can make knowledgeable selections concerning your code. In a way, optimization with out profiling is like flying blind: principally advantageous for seasoned professionals with skilled information and fine-tuned instinct, however a recipe for catastrophe for nearly everybody else.
Following my preliminary collection CUDA by Numba Examples (see elements 1, 2, 3, and 4), we are going to research a comparability between unoptimized, single-stream code and a barely higher model which makes use of stream concurrency and different optimizations. We’ll study, from the ground-up, the way to use NVIDIA Nsight Systems to profile and analyze CUDA code. All of the code on this tutorial can be discovered within the repo cako/profiling-cuda-nsight-systems.
NVIDIA recommends as greatest observe to observe the APOD framework (Assess, Parallelize, Optimize, Deploy). There are a number of proprietary, open-source, free, and industrial software program for several types of assessments and profiling. Veteran Python customers could also be acquainted with fundamental profilers resembling cProfile
, line_profiler
, memory_profiler
(sadly, unmaintaned as of 2024) and extra superior instruments like PyInstrument and Memray. These profilers goal particular elements of the “host” resembling CPU and RAM utilization.
Nevertheless, profiling “system” (e.g., GPU) code — and its interactions with the host — requires specialised instruments supplied by the system vendor. For NVIDIA GPUs, Nsight Techniques, Nsight Compute, Nsight Graphics can be found for profiling completely different elements of computation. On this tutorial we are going to deal with utilizing Nsight Techniques, which is a system-wide profiler. We’ll use it to profile Python code which interacts with the GPU through Numba CUDA.
To get began, you will want Nsight Techniques CLI and GUI. The CLI will be put in individually and will probably be used to profile the code in a GPGPU-capable system. The total model contains each CLI and GUI. Word that each variations could possibly be put in in a system with out a GPU. Seize the model(s) you want from the NVIDIA website.
To make it simpler to visualise code sections within the GUI, NVIDIA additionally supplies the Python pip
and conda
-installable library nvtx
which we are going to use to annotate sections of our code. Extra on this later.
On this part we are going to set our growth and profiling atmosphere up. Beneath are two quite simple Python scripts: kernels.py
and run_v1.py
. The previous will comprise all CUDA kernels, and the latter will function the entry level to run the instance. On this instance we’re following the “cut back” sample launched in article CUDA by Numba Examples Part 3: Streams and Events to compute the sum of an array.
#%%writefile kernels.py
import numba
from numba import cudaTHREADS_PER_BLOCK = 256
BLOCKS_PER_GRID = 32 * 40
@cuda.jit
def partial_reduce(array, partial_reduction):
i_start = cuda.grid(1)
threads_per_grid = cuda.blockDim.x * cuda.gridDim.x
s_thread = numba.float32(0.0)
for i_arr in vary(i_start, array.dimension, threads_per_grid):
s_thread += array[i_arr]
s_block = cuda.shared.array((THREADS_PER_BLOCK,), numba.float32)
tid = cuda.threadIdx.x
s_block[tid] = s_thread
cuda.syncthreads()
i = cuda.blockDim.x // 2
whereas i > 0:
if tid < i:
s_block[tid] += s_block[tid + i]
cuda.syncthreads()
i //= 2
if tid == 0:
partial_reduction[cuda.blockIdx.x] = s_block[0]
@cuda.jit
def single_thread_sum(partial_reduction, sum):
sum[0] = numba.float32(0.0)
for aspect in partial_reduction:
sum[0] += aspect
@cuda.jit
def divide_by(array, val_array):
i_start = cuda.grid(1)
threads_per_grid = cuda.gridsize(1)
for i in vary(i_start, array.dimension, threads_per_grid):
array[i] /= val_array[0]
#%%writefile run_v1.py
import argparse
import warningsimport numpy as np
from numba import cuda
from numba.core.errors import NumbaPerformanceWarning
from kernels import (
BLOCKS_PER_GRID,
THREADS_PER_BLOCK,
divide_by,
partial_reduce,
single_thread_sum,
)
# Ignore NumbaPerformanceWarning
warnings.simplefilter("ignore", class=NumbaPerformanceWarning)
def run(dimension):
# Outline host array
a = np.ones(dimension, dtype=np.float32)
print(f"Previous sum: {a.sum():.3f}")
# Array copy to system and array creation on the system.
dev_a = cuda.to_device(a)
dev_a_reduce = cuda.device_array((BLOCKS_PER_GRID,), dtype=dev_a.dtype)
dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype)
# Launching kernels to normalize array
partial_reduce[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_reduce)
single_thread_sum[1, 1](dev_a_reduce, dev_a_sum)
divide_by[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_sum)
# Array copy to host
dev_a.copy_to_host(a)
cuda.synchronize()
print(f"New sum: {a.sum():.3f}")
def essential():
parser = argparse.ArgumentParser(description="Easy Instance v1")
parser.add_argument(
"-n",
"--array-size",
kind=int,
default=100_000_000,
metavar="N",
assist="Array dimension",
)
args = parser.parse_args()
run(dimension=args.array_size)
if __name__ == "__main__":
essential()
This can be a easy script that may simply be run with:
$ python run_v1.py
Previous sum: 100000000.000
New sum: 1.000
We additionally run this code by means of our profiler, which simply entails calling nsys
with some choices earlier than the decision to our script:
$ nsys profile
--trace cuda,osrt,nvtx
--gpu-metrics-device=all
--cuda-memory-usage true
--force-overwrite true
--output profile_run_v1
python run_v1.py
GPU 0: Basic Metrics for NVIDIA TU10x (any frequency)
Previous sum: 100000000.000
New sum: 1.000
Producing '/tmp/nsys-report-fb78.qdstrm'
[1/1] [========================100%] profile_run_v1.nsys-rep
Generated:
/content material/profile_run_v1.nsys-rep
You may seek the advice of the Nsight CLI docs for all of the accessible choices to the nsys
CLI. For this tutorial we are going to at all times use those above. Let’s dissect this command:
profile
placesnsys
in profile mode. There are lots of different modes likeexport
andlaunch
.--trace cuda,osrt,nvtx
ensures we “pay attention” to all CUDA calls (cuda
), OS runtime library calls (osrt
) andnvtx
annotations (none on this instance). There are lots of extra hint choices resemblingcublas
,cudnn
,mpi
,dx11
and several other others. Examine the docs for all choices.--gpu-metrics-device=all
information GPU metrics for all GPUs, together with Tensor Core utilization.--cuda-memory-usage
tracks GPU reminiscence utilization of kernels. It could considerably decelerate execution and requires--trace=cuda
. We use it as a result of our scripts our fairly quick in any case.
If the command exited efficiently, we may have a profile_run_v1.nsys-rep
within the present folder. We’ll open this file by launching the Nsight Techniques GUI, File > Open
. The preliminary view is barely complicated. So we are going to begin by decluttering: resize the Occasions View
port to the underside, and reduce CPU
, GPU
and Processes
underneath the Timeline View
port. Now develop solely Processes > python > CUDA HW
. See Figures 1a and 1b.
First up, let’s discover our kernels. On the CUDA HW
line, you’ll find inexperienced and crimson blobs, and really small slivers of sunshine blue (see Determine 1b). Should you hover over these you will note tooltips saying, “CUDA Reminiscence operations in progress” for crimson and inexperienced, and “CUDA Kernel Working (89.7%)” for the sunshine blues. These are going to be the bread and butter of our profiling. On this line, we will inform when and the way reminiscence is being transferred (crimson and inexperienced) and when and the way our kernels are operating (mild blue).
Let’s dig in somewhat bit extra on our kernels. You need to see three very small blue slivers, every representing a kernel name. We’ll zoom into the area by clicking and dragging the mouse from simply earlier than the beginning of the primary kernel name to simply after the tip of the final one, after which urgent Shift + Z. See Determine 2.
Now that now we have discovered our kernels, let’s see some metrics. We open the GPU > GPU Metrics
tabs for that. On this panel, can discover “Warp Occupancy” (beige) for compute kernels. One technique to optimize CUDA code is to make sure that the warp occupancy is as near 100% as potential for so long as potential. Because of this our GPU will not be idling. We discover that that is occurring for the primary and final kernels however not the center kernel. That’s anticipated as the center kernel launches a single thread. One last factor to notice on this part is the GPU > GPU Metrics > SMs Energetic > Tensor Energetic / FP16 Energetic
line. This line will present whether or not the tensor cores are getting used. On this case you must confirm that they don’t seem to be.
Now let’s briefly have a look at the Occasions View. Proper click on Processes > python > CUDA HW
and click on “Present in Occasions View”. Then kind the occasions by descending period. In Determine 3, we see that the slowest occasions are two pageable reminiscence transfers. Now we have seen in CUDA by Numba Examples Part 3: Streams and Events that pageable reminiscence transfers will be suboptimal, and we must always want page-locked or “pinned” reminiscence transfers. If now we have sluggish reminiscence transfers due to make use of of pageable reminiscence, the Occasions View is usually a nice location to establish the place these sluggish transfers will be discovered. Professional tip: you’ll be able to isolate reminiscence transfers by proper clicking Processes > python > CUDA HW > XX% Reminiscence
as an alternative.
On this part we realized the way to profile a Python program which makes use of CUDA, and the way to visualize fundamental info of this program within the Nsight Techniques GUI. We additionally seen that on this easy program, we’re utilizing pageable as an alternative of pinned reminiscence, that one in every of our kernels will not be occupying all warps, that the GPU is idle for fairly a while between kernels being run and that we aren’t utilizing tensor cores.
On this part we are going to learn to enhance our profiling expertise by annotation sections in Nsight Techniques with NVTX. NVTX permits us to mark completely different areas of the code. It could actually mark ranges and instantaneous occasions. For a deeper look, verify the docs. Beneath we create run_v2.py
, which, along with annotating run_v1.py
, additionally modifications this line:
a = np.ones(dimension, dtype=np.float32)
to those:
a = cuda.pinned_array(dimension, dtype=np.float32)
a[...] = 1.0
Due to this fact, along with the annotations, we at the moment are utilizing a pinned reminiscence. If you wish to study extra in regards to the several types of recollections that CUDA helps, see the CUDA C++ Programming Guide. It’s of relevance that this isn’t the one technique to pin an array in Numba. A beforehand created Numpy array can be created with a context, as defined within the Numba documentation.
#%%writefile run_v2.py
import argparse
import warningsimport numpy as np
import nvtx
from numba import cuda
from numba.core.errors import NumbaPerformanceWarning
from kernels import (
BLOCKS_PER_GRID,
THREADS_PER_BLOCK,
divide_by,
partial_reduce,
single_thread_sum,
)
# Ignore NumbaPerformanceWarning
warnings.simplefilter("ignore", class=NumbaPerformanceWarning)
def run(dimension):
with nvtx.annotate("Compilation", shade="crimson"):
dev_a = cuda.device_array((BLOCKS_PER_GRID,), dtype=np.float32)
dev_a_reduce = cuda.device_array((BLOCKS_PER_GRID,), dtype=dev_a.dtype)
dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype)
partial_reduce[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_reduce)
single_thread_sum[1, 1](dev_a_reduce, dev_a_sum)
divide_by[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_sum)
# Outline host array
a = cuda.pinned_array(dimension, dtype=np.float32)
a[...] = 1.0
print(f"Previous sum: {a.sum():.3f}")
# Array copy to system and array creation on the system.
with nvtx.annotate("H2D Reminiscence", shade="yellow"):
dev_a = cuda.to_device(a)
dev_a_reduce = cuda.device_array((BLOCKS_PER_GRID,), dtype=dev_a.dtype)
dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype)
# Launching kernels to normalize array
with nvtx.annotate("Kernels", shade="inexperienced"):
partial_reduce[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_reduce)
single_thread_sum[1, 1](dev_a_reduce, dev_a_sum)
divide_by[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_sum)
# Array copy to host
with nvtx.annotate("D2H Reminiscence", shade="orange"):
dev_a.copy_to_host(a)
cuda.synchronize()
print(f"New sum: {a.sum():.3f}")
def essential():
parser = argparse.ArgumentParser(description="Easy Instance v2")
parser.add_argument(
"-n",
"--array-size",
kind=int,
default=100_000_000,
metavar="N",
assist="Array dimension",
)
args = parser.parse_args()
run(dimension=args.array_size)
if __name__ == "__main__":
essential()
Evaluating the 2 recordsdata, you’ll be able to see it’s so simple as wrapping some GPU kernel calls with
with nvtx.annotate("Area Title", shade="crimson"):
...
Professional tip: it’s also possible to annotate features by inserting the @nvtx.annotate
decorator above their definition, routinely annotate every little thing by calling your script with python -m nvtx run_v2.py
, or apply the autoannotator selectively in you code by enabling or disabling nvtx.Profile()
. See the docs!
Let’s run this new script and open the leads to Nsight Techniques.
$ nsys profile
--trace cuda,osrt,nvtx
--gpu-metrics-device=all
--cuda-memory-usage true
--force-overwrite true
--output profile_run_v2
python run_v2.py
GPU 0: Basic Metrics for NVIDIA TU10x (any frequency)
Previous sum: 100000000.000
New sum: 1.000
Producing '/tmp/nsys-report-69ab.qdstrm'
[1/1] [========================100%] profile_run_v2.nsys-rep
Generated:
/content material/profile_run_v2.nsys-rep
Once more, we begin by minimizing every little thing, leaving solely Processes > python > CUDA HW
open. See Determine 4. Discover that we now have a brand new line, NVTX
. On this line within the timeline window we must always see completely different coloured blocks similar to the annotation areas that we created within the code. These are Compilation
, H2D Reminiscence
, Kernels
and D2H Reminiscence
. A few of these my be too small to learn, however will probably be legible should you zoom into the area.
The profiler confirms that this reminiscence is pinned, guaranteeing that our code is really utilizing pinned reminiscence. As well as, H2D Reminiscence
and D2H Reminiscence
at the moment are taking lower than half of the time that they had been taking earlier than. Typically we are able to count on higher efficiency utilizing pinned reminiscence or prefetched mapped arrays (not supported by Numba).
Now we are going to examine whether or not we are able to enhance this code by introducing streams. The concept is that whereas reminiscence transfers are occurring, the GPU can begin processing the information. This permits a degree of concurrency, which hopefully will make sure that we’re occupying our warps as totally as potential.
Within the code under we are going to cut up the processing of our array into roughly equal elements. Every half will run in a separate stream, together with transferring information and computing the sum of the array. Then, we synchronize all streams and sum their partial sums. At this level we are able to then launch normalization kernels for every stream independently.
We wish to reply just a few questions:
- Will the code under actually create concurrency? May we be introducing a bug?
- Is it sooner than the code which makes use of a single stream?
- Is the warp occupancy higher?
#%%writefile run_v3_bug.py
import argparse
import warnings
from math import ceilimport numpy as np
import nvtx
from numba import cuda
from numba.core.errors import NumbaPerformanceWarning
from kernels import (
BLOCKS_PER_GRID,
THREADS_PER_BLOCK,
divide_by,
partial_reduce,
single_thread_sum,
)
# Ignore NumbaPerformanceWarning
warnings.simplefilter("ignore", class=NumbaPerformanceWarning)
def run(dimension, nstreams):
with nvtx.annotate("Compilation", shade="crimson"):
dev_a = cuda.device_array((BLOCKS_PER_GRID,), dtype=np.float32)
dev_a_reduce = cuda.device_array((BLOCKS_PER_GRID,), dtype=dev_a.dtype)
dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype)
partial_reduce[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_reduce)
single_thread_sum[1, 1](dev_a_reduce, dev_a_sum)
divide_by[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_sum)
# Outline host array
a = cuda.pinned_array(dimension, dtype=np.float32)
a[...] = 1.0
# Outline areas for streams
step = ceil(dimension / nstreams)
begins = [i * step for i in range(nstreams)]
ends = [min(s + step, size) for s in starts]
print(f"Previous sum: {a.sum():.3f}")
# Create streams
streams = [cuda.stream()] * nstreams
cpu_sums = [cuda.pinned_array(1, dtype=np.float32) for _ in range(nstreams)]
devs_a = []
with cuda.defer_cleanup():
for i, (stream, begin, finish) in enumerate(zip(streams, begins, ends)):
cpu_sums[i][...] = np.nan
# Array copy to system and array creation on the system.
with nvtx.annotate(f"H2D Reminiscence Stream {i}", shade="yellow"):
dev_a = cuda.to_device(a[start:end], stream=stream)
dev_a_reduce = cuda.device_array(
(BLOCKS_PER_GRID,), dtype=dev_a.dtype, stream=stream
)
dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype, stream=stream)
devs_a.append(dev_a)
# Launching kernels to sum array
with nvtx.annotate(f"Sum Kernels Stream {i}", shade="inexperienced"):
for _ in vary(50): # Make it spend extra time in compute
partial_reduce[BLOCKS_PER_GRID, THREADS_PER_BLOCK, stream](
dev_a, dev_a_reduce
)
single_thread_sum[1, 1, stream](dev_a_reduce, dev_a_sum)
with nvtx.annotate(f"D2H Reminiscence Stream {i}", shade="orange"):
dev_a_sum.copy_to_host(cpu_sums[i], stream=stream)
# Guarantee all streams are caught up
cuda.synchronize()
# Combination all 1D arrays right into a single 1D array
a_sum_all = sum(cpu_sums)
# Ship it to the GPU
with cuda.pinned(a_sum_all):
with nvtx.annotate("D2H Reminiscence Default Stream", shade="orange"):
dev_a_sum_all = cuda.to_device(a_sum_all)
# Normalize through streams
for i, (stream, begin, finish, dev_a) in enumerate(
zip(streams, begins, ends, devs_a)
):
with nvtx.annotate(f"Divide Kernel Stream {i}", shade="inexperienced"):
divide_by[BLOCKS_PER_GRID, THREADS_PER_BLOCK, stream](
dev_a, dev_a_sum_all
)
# Array copy to host
with nvtx.annotate(f"D2H Reminiscence Stream {i}", shade="orange"):
dev_a.copy_to_host(a[start:end], stream=stream)
cuda.synchronize()
print(f"New sum: {a.sum():.3f}")
def essential():
parser = argparse.ArgumentParser(description="Easy Instance v3")
parser.add_argument(
"-n",
"--array-size",
kind=int,
default=100_000_000,
metavar="N",
assist="Array dimension",
)
parser.add_argument(
"-s",
"--streams",
kind=int,
default=4,
metavar="N",
assist="Array dimension",
)
args = parser.parse_args()
run(dimension=args.array_size, nstreams=args.streams)
if __name__ == "__main__":
essential()
Let’s run the code and gather outcomes.
$ nsys profile
--trace cuda,osrt,nvtx
--gpu-metrics-device=all
--cuda-memory-usage true
--force-overwrite true
--output profile_run_v3_bug_4streams
python run_v3_bug.py -s 4
GPU 0: Basic Metrics for NVIDIA TU10x (any frequency)
Previous sum: 100000000.000
New sum: 1.000
Producing '/tmp/nsys-report-a666.qdstrm'
[1/1] [========================100%] profile_run_v3_bug_4streams.nsys-rep
Generated:
/content material/profile_run_v3_bug_4streams.nsys-rep
This system ran and yielded the right reply. However once we open the profiling file (see Determine 6), we discover that there are two streams as an alternative of 4! And one is mainly fully idle! What’s occurring right here?
There’s a bug within the creation of the streams. By doing
streams = [cuda.stream()] * nstreams
we are literally making a single stream and repeating it nstreams
occasions. So why are we seeing two streams as an alternative of 1? The truth that one will not be doing a lot computation must be an indicator that there’s a stream that we aren’t utilizing. This stream is the default stream, which we aren’t utilizing in any respect in out code since all GPU interactions are given a stream, the stream we created.
We will repair this bug with:
streams = [cuda.stream() for _ in range(nstreams)]
# Guarantee they're all completely different
assert all(s1.deal with != s2.deal with for s1, s2 in zip(streams[:-1], streams[1:]))
The code above may even guarantee they’re actually completely different streams, so it will have caught the bug had we had it within the code. It does so by checking the stream pointer worth.
Now we are able to run the mounted code with 1 stream and eight streams for comparability. See Figures 7 and eight, respectively.
$ nsys profile
--trace cuda,osrt,nvtx
--gpu-metrics-device=all
--cuda-memory-usage true
--force-overwrite true
--output profile_run_v3_1stream
python run_v3.py -s 1
GPU 0: Basic Metrics for NVIDIA TU10x (any frequency)
Previous sum: 100000000.000
New sum: 1.000
Producing '/tmp/nsys-report-de65.qdstrm'
[1/1] [========================100%] profile_run_v3_1stream.nsys-rep
Generated:
/content material/profile_run_v3_1stream.nsys-rep
$ nsys profile
--trace cuda,osrt,nvtx
--gpu-metrics-device=all
--cuda-memory-usage true
--force-overwrite true
--output profile_run_v3_8streams
python run_v3.py -s 8
GPU 0: Basic Metrics for NVIDIA TU10x (any frequency)
Previous sum: 100000000.000
New sum: 1.000
Producing '/tmp/nsys-report-1fb7.qdstrm'
[1/1] [========================100%] profile_run_v3_8streams.nsys-rep
Generated:
/content material/profile_run_v3_8streams.nsys-rep
Once more, each give appropriate outcomes. By opening the one with 8 streams we see that sure, the bug has been mounted (Determine 7). Certainly, we now see 9 streams (8 created + default). As well as, we see that they’re working on the similar time! So now we have achieved concurrency!
Sadly, if we dig a bit deeper we discover that the concurrent code will not be essentially sooner. On my machine the crucial part of each variations, from begin of reminiscence switch to the final GPU-CPU copy takes round 160 ms.
A possible perpetrator is the warp occupancy. We discover that the warp occupancy is considerably higher within the single-stream model. The beneficial properties we’re getting on this instance in compute are seemingly being misplaced by not occupying our GPU as effectively. That is seemingly associated to the construction of the code which (artificially) calls approach too many kernels. As well as, if all threads are stuffed by a single stream, there isn’t a acquire in concurrency, since different streams should be idle till assets unlock.
This instance is vital as a result of it reveals that our preconceived notions of efficiency are simply hypotheses. They should be verified.
At this level of APOD, now we have assessed, parallelized (each by means of threads and concurrency) and so the subsequent step can be to deploy. We additionally seen a slight efficiency regression with concurrency, so for this instance, a single-stream model would seemingly be the one deployed. In manufacturing, the subsequent step can be to observe the subsequent piece of code which is greatest fitted to parallelization and restarting APOD.
On this article we noticed the way to arrange, use and interpret outcomes from profiling Python code in NVIDIA Nsight Techniques. C and C++ code will be analyzed very equally, and certainly many of the materials on the market makes use of C and C++ examples.
We additionally present how profiling can permit us to catch bugs and efficiency take a look at our applications, guaranteeing that the options we introduce actually are enhancing efficiency, and if they don’t seem to be, why.