Overview
GPU performance can vary over time with factors like utilization and temperature. To keep your application running smoothly, monitor these metrics and take action when they exceed your chosen thresholds. This can be accomplished with Python, nvidia-smi, and the psutil library. We recommend instrumenting this as a background process that runs independently of your application, so that monitoring does not interfere with the application's own performance.
Getting GPU Stats
To get GPU stats, you can use the nvidia-smi command, which is available by default on all GPU instances. This command provides a wealth of information about the GPU, including utilization, memory usage, temperature, and more. You can see the complete list of options by running nvidia-smi --help-query-gpu.
To get this information in Python, we can use the subprocess library to run the command and capture its output. Here’s an example of how to do this:
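The following is a minimal sketch rather than a definitive implementation. It assumes nvidia-smi is on the PATH and queries four fields chosen for illustration. The function names `parse_gpu_csv` and `get_gpu_stats` are hypothetical. The flags `--query-gpu` and `--format=csv,noheader,nounits` are standard nvidia-smi options.

```python
import subprocess

# Fields to request; the full list is shown by `nvidia-smi --help-query-gpu`.
GPU_FIELDS = ["utilization.gpu", "memory.used", "memory.total", "temperature.gpu"]


def parse_gpu_csv(output):
    """Turn nvidia-smi CSV output into one dict of floats per GPU."""
    gpus = []
    for line in output.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) != len(GPU_FIELDS):
            continue  # skip lines that do not match the requested fields
        row = {}
        for field, value in zip(GPU_FIELDS, values):
            try:
                row[field] = float(value)
            except ValueError:
                row[field] = None  # e.g. "[N/A]" when a field is unsupported
        gpus.append(row)
    return gpus


def get_gpu_stats(timeout=10):
    """Run nvidia-smi and return a list of per-GPU metric dicts."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(GPU_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
        timeout=timeout,
    )
    return parse_gpu_csv(result.stdout)
```

With `nounits`, utilization is reported in percent, memory in MiB, and temperature in degrees Celsius. Keeping the parsing separate from the subprocess call makes it possible to test the parser on machines without a GPU.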
Getting System Stats
Other system information, such as CPU and memory utilization, can be obtained using the psutil library, a cross-platform library for retrieving information on running processes and system utilization. It is not included in the base Python installation, so you will need to install it separately. You can do this using pip by running pip install psutil.
Once you have psutil installed, you can use it to get system stats like CPU and memory utilization. Here’s an example of how to do this:
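As a minimal sketch, assuming psutil is installed, the function below reads system-wide CPU and memory utilization. The function name `get_system_stats` and the dictionary keys are hypothetical. `psutil.cpu_percent` and `psutil.virtual_memory` are the library's own calls.

```python
import psutil


def get_system_stats(interval=1.0):
    """Return system-wide CPU and memory utilization."""
    memory = psutil.virtual_memory()
    return {
        # cpu_percent blocks for `interval` seconds while it samples usage
        "cpu_percent": psutil.cpu_percent(interval=interval),
        "memory_percent": memory.percent,
        "memory_used_bytes": memory.used,
        "memory_total_bytes": memory.total,
    }
```

A shorter `interval` returns sooner but gives a noisier CPU reading.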
Combined example
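One way to gather both kinds of stats in a single snapshot is sketched below. This is an illustration under stated assumptions, not a definitive implementation. The names `read_gpu_rows` and `collect_stats` are hypothetical, and the choice of fields is arbitrary. Returning an empty GPU list when nvidia-smi cannot be run is a design choice made here so the sketch also runs on machines without a GPU.

```python
import subprocess
import time

import psutil

GPU_FIELDS = ("utilization.gpu", "memory.used", "memory.total", "temperature.gpu")


def read_gpu_rows():
    """Return one dict per GPU, or [] if nvidia-smi cannot be run."""
    command = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(GPU_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, check=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return []  # command missing, timed out, or exited with an error
    rows = []
    for line in result.stdout.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        rows.append(dict(zip(GPU_FIELDS, values)))  # values stay as strings
    return rows


def collect_stats(cpu_interval=1.0):
    """Take one snapshot of system and GPU metrics."""
    return {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=cpu_interval),
        "memory_percent": psutil.virtual_memory().percent,
        "gpus": read_gpu_rows(),
    }
```

The returned dictionary contains only plain types, so it can be passed to `json.dumps` for logging.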
Reallocating Under-Performing Nodes
Now that we know how to get the stats, we can use this information to reallocate under-performing nodes: check the stats and, if they exceed certain thresholds, reallocate the node using the IMDS Reallocation Endpoint. You must provide a reason to the reallocation endpoint. We use this data to continuously improve the quality of our network.
Putting it All Together
You can put all of this together in a single script that runs continuously and checks the stats at regular intervals, taking any required actions. We recommend taking all configuration values from environment variables, so that you can adjust these values easily without rebuilding your container. Add monitor.py & in your Dockerfile CMD to run the script in the background.
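A skeleton for such a script might look like the sketch below. It covers the threshold check, the reallocation request, and the polling loop, and it rests on several assumptions. Every environment variable name is hypothetical. The reallocation URL is deliberately left without a default and must be taken from the IMDS documentation. The JSON body `{"reason": ...}` is an assumed shape based only on the requirement that a reason be provided. `read_stats` stands for any callable that returns a flat dictionary whose keys match the configured limits.

```python
import json
import os
import time
import urllib.request


def load_config(env=os.environ):
    """Read settings from environment variables (all names are hypothetical)."""
    return {
        "interval_s": float(env.get("MONITOR_INTERVAL_SECONDS", "30")),
        "limits": {
            "cpu_percent": float(env.get("MAX_CPU_PERCENT", "95")),
            "memory_percent": float(env.get("MAX_MEMORY_PERCENT", "95")),
            "gpu_temperature_c": float(env.get("MAX_GPU_TEMPERATURE_C", "85")),
        },
        # No default: take the real URL from the IMDS documentation.
        "reallocate_url": env.get("REALLOCATE_URL", ""),
    }


def find_violations(stats, limits):
    """Return a message for every metric that exceeds its limit."""
    return [
        f"{name}={stats[name]} exceeds limit {limit}"
        for name, limit in limits.items()
        if stats.get(name) is not None and stats[name] > limit
    ]


def build_reallocation_request(url, reason):
    """Build a POST carrying the reason; the JSON shape is an assumption."""
    body = json.dumps({"reason": reason}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def run_monitor(read_stats, config, max_checks=None):
    """Poll read_stats() and request reallocation on the first violation."""
    checks = 0
    while max_checks is None or checks < max_checks:
        violations = find_violations(read_stats(), config["limits"])
        if violations:
            if not config["reallocate_url"]:
                raise RuntimeError("REALLOCATE_URL is not set")
            request = build_reallocation_request(
                config["reallocate_url"], "; ".join(violations)
            )
            with urllib.request.urlopen(request, timeout=10):
                pass  # an HTTP error status raises urllib.error.HTTPError
            return violations
        checks += 1
        time.sleep(config["interval_s"])
    return []
```

Calling `run_monitor(read_stats, load_config())` with `max_checks` left as `None` polls until a limit is exceeded. The loop returns after requesting reallocation because the node is expected to be replaced at that point. Passing the stats reader as an argument keeps the threshold logic testable without a GPU.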
Conclusion
In this tutorial, we learned how to implement performance monitoring in your application using nvidia-smi and the psutil library. We also learned how to reallocate under-performing nodes using the IMDS Reallocation Endpoint. By monitoring the performance of your application, you can ensure that it runs smoothly and efficiently, even in the face of changing conditions. This is an important step in building a robust and reliable application.