cuML is a suite of libraries that implement machine learning algorithms and mathematical primitive functions, sharing compatible APIs with other RAPIDS projects.
cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. In most cases, cuML's Python API matches the API from scikit-learn.
For large datasets, these GPU-based implementations can complete 10-50x faster than their CPU equivalents. For details on performance, see the cuML Benchmarks Notebook.
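
Because the Python API mirrors scikit-learn, the same estimator code can often target either library. The sketch below is a hypothetical illustration, not part of cuML: `pick_backend` and `cluster_points` are made-up helper names, and it assumes that both `cuml.cluster.DBSCAN` and `sklearn.cluster.DBSCAN` accept a NumPy array and expose `labels_` after `fit`.

```python
import importlib
import importlib.util


def pick_backend():
    """Return "cuml" if it is installed, else "sklearn", else None."""
    for name in ("cuml", "sklearn"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def cluster_points(points, eps=1.0, min_samples=1):
    """Fit DBSCAN with whichever backend is available and return the labels."""
    backend = pick_backend()
    if backend is None:
        raise RuntimeError("neither cuml nor scikit-learn is installed")
    import numpy as np  # imported lazily so the helpers load without NumPy

    # Assumption: same class name, constructor arguments, and fitted
    # attribute in both libraries.
    DBSCAN = importlib.import_module(backend + ".cluster").DBSCAN
    model = DBSCAN(eps=eps, min_samples=min_samples)
    model.fit(np.asarray(points, dtype=np.float32))
    return [int(label) for label in model.labels_]


if __name__ == "__main__":
    rows = [[1.0, 4.0, 4.0], [2.0, 2.0, 2.0], [5.0, 1.0, 1.0]]
    if pick_backend() is not None:
        print(pick_backend(), cluster_points(rows))
```

Only the import path changes between the two backends; the constructor call, `fit`, and `labels_` stay the same.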
As an example, the following Python snippet loads input and computes DBSCAN clusters, all on GPU, using cuDF:
```python
import cudf
from cuml.cluster import DBSCAN

# Create and populate a GPU DataFrame
gdf_float = cudf.DataFrame()
gdf_float['0'] = [1.0, 2.0, 5.0]
gdf_float['1'] = [4.0, 2.0, 1.0]
gdf_float['2'] = [4.0, 2.0, 1.0]

# Setup and fit clusters
dbscan_float = DBSCAN(eps=1.0, min_samples=1)
dbscan_float.fit(gdf_float)

print(dbscan_float.labels_)
```

Output:

```
0    0
1    1
2    2
dtype: int32
```
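
To see why each row in the DBSCAN example above gets its own label, the clustering can be reproduced on the CPU in plain Python. This is a minimal sketch, not cuML's implementation: `eps_neighbors` and `label_components` are hypothetical helpers, and the shortcut of labelling connected groups is only valid because `min_samples=1` makes every point a core point.

```python
import math

# The three rows of the GPU DataFrame, as (column '0', '1', '2') tuples.
points = [
    (1.0, 4.0, 4.0),
    (2.0, 2.0, 2.0),
    (5.0, 1.0, 1.0),
]
eps = 1.0


def eps_neighbors(points, eps):
    """For each point, list the indices of points within eps (itself included)."""
    return [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]


def label_components(neighbors):
    """Give every connected group of eps-neighbors its own cluster label."""
    labels = [-1] * len(neighbors)
    next_label = 0
    for start in range(len(neighbors)):
        if labels[start] != -1:
            continue  # already reached from an earlier point
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in neighbors[i]:
                if labels[j] == -1:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


neighbors = eps_neighbors(points, eps)
labels = label_components(neighbors)
print(neighbors)  # [[0], [1], [2]]
print(labels)     # [0, 1, 2]
```

The closest pair of rows is 3.0 apart, well beyond `eps=1.0`, so no two points share a neighborhood and each forms a single-point cluster.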
cuML also features multi-GPU and multi-node-multi-GPU operation, using Dask, for a growing list of algorithms. The following Python snippet reads input from a CSV file and performs a NearestNeighbors query across a cluster of Dask workers, using multiple GPUs on a single node:
Initialize a `LocalCUDACluster` configured with UCX for fast transport of CUDA arrays:

```python
# Initialize UCX for high-speed transport of CUDA arrays
from dask_cuda import LocalCUDACluster

# Create a Dask single-node CUDA cluster w/ one worker per device
cluster = LocalCUDACluster(protocol="ucx",
                           enable_tcp_over_ucx=True,
                           enable_nvlink=True,
                           enable_infiniband=False)
```

Load data and perform k-Nearest Neighbors search. `cuml.dask` estimators also support `Dask.Array` as input:

```python
from dask.distributed import Client
client = Client(cluster)

# Read CSV file in parallel across workers
import dask_cudf
df = dask_cudf.read_csv("/path/to/csv")

# Fit a NearestNeighbors model and query it
from cuml.dask.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=10, client=client)
nn.fit(df)
neighbors = nn.kneighbors(df)
```
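
For intuition about what the nearest-neighbors query computes, here is a brute-force CPU sketch in plain Python. It is an illustration only: `brute_force_kneighbors` is a hypothetical helper, the sample points are made up, and it assumes Euclidean distance with results returned as a `(distances, indices)` pair. cuML's actual search runs on GPUs and, per the algorithm table, uses Faiss.

```python
import heapq
import math


def brute_force_kneighbors(data, queries, n_neighbors):
    """Return (distances, indices) of the n_neighbors closest rows per query."""
    distances, indices = [], []
    for q in queries:
        # Pair every data row with its distance to the query, keep the closest.
        nearest = heapq.nsmallest(
            n_neighbors,
            ((math.dist(q, p), i) for i, p in enumerate(data)),
        )
        distances.append([d for d, _ in nearest])
        indices.append([i for _, i in nearest])
    return distances, indices


# Made-up 2-D points; the fitted data is also used as the query set,
# as in the snippet above.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
distances, indices = brute_force_kneighbors(data, data, n_neighbors=2)
print(indices)  # [[0, 1], [1, 0], [2, 0], [3, 2]]
```

When a model is queried with the same data it was fitted on, each row's first neighbor is the row itself at distance 0, so `n_neighbors=10` yields nine other rows per point.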
For additional examples, browse our complete API documentation, or check out our example walkthrough notebooks. Finally, you can find complete end-to-end examples in the notebooks-contrib repo.

| Category | Algorithm | Notes |
| --- | --- | --- |
| Clustering | Density-Based Spatial Clustering of Applications with Noise (DBSCAN) | |
| | K-Means | Multi-node multi-GPU via Dask |
| Dimensionality Reduction | Principal Components Analysis (PCA) | Multi-node multi-GPU via Dask |
| | Incremental PCA | Experimental |
| | Truncated Singular Value Decomposition (tSVD) | Multi-node multi-GPU via Dask |
| | Uniform Manifold Approximation and Projection (UMAP) | Multi-node multi-GPU Inference via Dask |
| | Random Projection | |
| | t-Distributed Stochastic Neighbor Embedding (TSNE) | |
| Linear Models for Regression or Classification | Linear Regression (OLS) | Multi-node multi-GPU via Dask |
| | Linear Regression with Lasso or Ridge Regularization | Multi-node multi-GPU via Dask |
| | ElasticNet Regression | |
| | Logistic Regression | Multi-node multi-GPU via Dask-GLM demo |
| | Naive Bayes | Multi-node multi-GPU via Dask |
| | Stochastic Gradient Descent (SGD), Coordinate Descent (CD), and Quasi-Newton (QN) (including L-BFGS and OWL-QN) solvers for linear models | |
| Nonlinear Models for Regression or Classification | Random Forest (RF) Classification | Experimental multi-node multi-GPU via Dask |
| | Random Forest (RF) Regression | Experimental multi-node multi-GPU via Dask |
| | Inference for decision tree-based models | Forest Inference Library (FIL) |
| | K-Nearest Neighbors (KNN) Classification | Multi-node multi-GPU via Dask+UCX, uses Faiss for Nearest Neighbors Query. |
| | K-Nearest Neighbors (KNN) Regression | Multi-node multi-GPU via Dask+UCX, uses Faiss for Nearest Neighbors Query. |
| | Support Vector Machine Classifier (SVC) | |
| | Epsilon-Support Vector Regression (SVR) | |
| Time Series | Holt-Winters Exponential Smoothing | |
| | Auto-regressive Integrated Moving Average (ARIMA) | Supports seasonality (SARIMA) |
| Model Explanation | SHAP Kernel Explainer | Based on SHAP (experimental) |
| | SHAP Permutation Explainer | Based on SHAP (experimental) |
| Other | K-Nearest Neighbors (KNN) Search | Multi-node multi-GPU via Dask+UCX, uses Faiss for Nearest Neighbors Query. |