High-level Computation¶
Overview¶
In this tutorial, you learn:¶
High level Xarray computations using CuPy arrays.
Applying custom kernels to DataArray with CuPy and NumPy
Prerequisites¶
Concepts |
Importance |
Notes |
---|---|---|
Necessary |
||
Necessary |
||
Necessary |
Time to learn: 40 minutes
Introduction¶
In the previous tutorial, we introduced the powerful combination of Xarray and CuPy for handling multi-dimensional datasets and leveraging GPU acceleration to significantly improve performance.
In this tutorial, we are going to explore high-level Xarray functions such as groupby
, rolling mean
, and weighted mean
, and compared their execution times with traditional NumPy-based implementations.
High-level Xarray Functions: CuPy vs. NumPy¶
In this tutorial, we’ll explore the performance differences between high-level Xarray functions using CuPy and NumPy. CuPy is a GPU-based NumPy-compatible library, while NumPy is the well-known CPU-based library for numerical computations. We’ll focus on three high-level functions: groupby, rolling mean, and weighted mean. We’ll also compare the time it takes to execute each function using both CuPy and NumPy. Let’s create some sample data to work with.
We’ll use a 3-dimensional dataset (time, latitude, longitude) with random values:
import time
import numpy as np
import pandas as pd
import xarray as xr
import cupy as cp
import cupy_xarray # Adds .cupy to Xarray objects
np.random.seed(0)
# Create the time range.
date = pd.date_range("2010-01-01", "2020-12-31", freq="M")
# Create the latitude range.
lat = np.arange(-90, 90, 1)
# Create the longitude range.
lon = np.arange(-180, 180, 1)
# Create random data
data_np = np.random.rand(len(date), len(lat), len(lon))
data_cp = cp.array(data_np)
# -- Create DataArray with Numpy data
data_xr_np = xr.DataArray(
data_np,
dims=["time", "lat", "lon"],
coords=[date, lat, lon],
)
# -- Create DataArray with CuPy data
data_xr_cp = xr.DataArray(
data_cp,
dims=["time", "lat", "lon"],
coords=[date, lat, lon],
)
Groupby¶
The groupby
function is used to group data based on one or more dimensions. Here, we’ll group our data by the season in the time
dimension using both CuPy and NumPy:
start_time_np = time.time()
grouped_data_np = data_xr_np.groupby("time.season")
mean_np = grouped_data_np.mean()
end_time_np = time.time()
time_np = end_time_np - start_time_np
The data type of data in grouped_data_np is numpy.ndarray
.
[type(arr.data) for group, arr in grouped_data_np]
[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]
start_time_cp = time.time()
grouped_data_cp = data_xr_cp.groupby("time.season")
mean_cp = grouped_data_cp.mean()
end_time_cp = time.time()
time_cp = end_time_cp - start_time_cp
print(
"GroupBy with Xarray DataArrays using CuPy provides a",
round(time_np / time_cp, 2),
"x speedup over NumPy.\n",
)
GroupBy with Xarray DataArrays using CuPy provides a 79.83 x speedup over NumPy.
What about the CuPy arrays? Does it preserve the array type?
[type(arr.data) for group, arr in grouped_data_cp]
[cupy.ndarray, cupy.ndarray, cupy.ndarray, cupy.ndarray]
What about different sizes of arrays? How does the performance compare then?¶
The example above showed a 1 degree DataArray. What if we increase the data size to 0.5 degree?
# Create the latitude range.
lat = np.arange(-90, 90, 0.5)
# Create the longitude range.
lon = np.arange(-180, 180, 0.5)
# Create random data
data_np = np.random.rand(len(date), len(lat), len(lon))
data_cp = cp.array(data_np)
# -- Create DataArray with Numpy data
data_xr_np = xr.DataArray(
data_np,
dims=["time", "lat", "lon"],
coords=[date, lat, lon],
)
# -- Create DataArray with CuPy data
data_xr_cp = xr.DataArray(
data_cp,
dims=["time", "lat", "lon"],
coords=[date, lat, lon],
)
start_time_np = time.time()
grouped_data_np = data_xr_np.groupby("time.season").mean()
mean_np = grouped_data_np.mean()
end_time_np = time.time()
time_np = end_time_np - start_time_np
start_time_cp = time.time()
grouped_data_cp = data_xr_cp.groupby("time.season").mean()
mean_cp = grouped_data_cp.mean()
end_time_cp = time.time()
time_cp = end_time_cp - start_time_cp
print(
"GroupBy with Xarray DataArrays using CuPy provides a",
round(time_np / time_cp, 2),
"x speedup over NumPy.\n",
)
GroupBy with Xarray DataArrays using CuPy provides a 89.87 x speedup over NumPy.
Attention
Is this consistent with what you have learned in the previous modules? What if we test a very low resolution dataset?
Rolling Mean¶
The rolling()
method is available in DataArray objects, providing support for rolling window aggregation. This feature allows for the computation of aggregated values over a sliding window of data points within the array.
A rolling window refers to a fixed-size window that moves sequentially across the data, calculating aggregated statistics or applying functions to the values within each window. This capability is particularly useful for analyzing time series or spatial data, where examining data within a specific window can reveal patterns, trends, or relationships.
The rolling mean is a widely used technique for smoothing data over a specified window.
In the example below, we calculate the rolling mean along the 'time'
dimension with a window size of 10:
xr.set_options(use_bottleneck=False)
<xarray.core.options.set_options at 0x2ad1d0ccdac0>
start_time_np = time.time()
rolling_mean_np = data_xr_np.rolling(time=10).mean()
end_time_np = time.time()
time_np = end_time_np - start_time_np
start_time_cp = time.time()
rolling_mean_cp = data_xr_cp.rolling(time=10).mean()
end_time_cp = time.time()
time_cp = end_time_cp - start_time_cp
print(
"Rolling mean with Xarray DataArrays using CuPy provides a",
round(time_np / time_cp, 2),
"x speedup over NumPy.\n",
)
Rolling mean with Xarray DataArrays using CuPy provides a 30.22 x speedup over NumPy.
Weighted Array Reductions¶
Weighted array reductions in Xarray empower users with the ability to perform aggregations on multidimensional arrays while considering the weights assigned to each element. They currently support weighted sum
, mean
, std
, var
and quantile
. By default, aggregation results in Xarray’s rolling window operations are assigned the coordinate at the end of each window. However, it is possible to center the results by specifying center=True
when creating the Rolling object.
For example, the weighted mean is another way to smooth data, taking into account the varying importance of each data point.
Here, we’ll use a uniform weight along the time
dimension:
start_time_np = time.time()
weights_np = xr.DataArray(np.ones_like(data_np), dims=["time", "lat", "lon"])
weighted_mean_np = data_xr_np.weighted(weights_np).mean(dim="time")
end_time_np = time.time()
time_np = end_time_np - start_time_np
start_time_cp = time.time()
weights_cp = xr.DataArray(cp.ones_like(data_cp), dims=["time", "lat", "lon"])
weighted_mean_cp = data_xr_cp.weighted(weights_cp).mean(dim="time")
end_time_cp = time.time()
time_cp = end_time_cp - start_time_cp
print(
"Weighted mean with Xarray DataArrays using CuPy provides a",
round(time_np / time_cp, 2),
"x speedup over NumPy.\n",
)
Weighted mean with Xarray DataArrays using CuPy provides a 13.32 x speedup over NumPy.
Similarly we can calculate weighted sum or weighted quantile, etc. To learn more about weighted array reduction, please see the user guide.
Coarsen large arrays¶
In Xarray, the coarsen
operation is a powerful tool for downsampling or reducing the size of large arrays. When dealing with large datasets, coarsening allows for efficient summarization of data by aggregating multiple values into a single value within a defined coarsening window. This process is particularly useful when working with high-resolution or fine-grained data, as it enables the transformation of large arrays into smaller ones while preserving the overall structure and key characteristics of the data.
In order to take a block mean for every 3 days along time dimension and every 2 points along lat and lon, we can use the following:
start_time_np = time.time()
coarsen_np = data_xr_np.coarsen(time=3, lat=2, lon=2).mean()
end_time_np = time.time()
time_np = end_time_np - start_time_np
coarsen
also works in similar fashion when using CuPy:
start_time_cp = time.time()
coarsen_cp = data_xr_cp.coarsen(time=3, lat=2, lon=2).mean()
end_time_cp = time.time()
time_cp = end_time_cp - start_time_cp
print(
"Coarsen with Xarray DataArrays using CuPy provides a",
round(time_np / time_cp, 2),
"x speedup over NumPy.\n",
)
Coarsen with Xarray DataArrays using CuPy provides a 443.79 x speedup over NumPy.
Summary¶
In this notebook, we have learned about:
High level Xarray computations using CuPy arrays.
Applying custom kernels to DataArray with CuPy and NumPy