Motivation
River is a Python library for online machine learning. One of the main features of River is its support for data streams, which are continuous streams of data that are produced and consumed in real-time. This makes it well-suited for handling large volumes of data, as well as for handling data that is generated continuously.
I started contributing to the project in February 2019, mainly in the stats and time_series modules.
Some users need high performance for online statistics tasks and have run into performance issues with the library. User fox-ds, for example, hit this when computing rolling quantiles.
At our last in-person River developers meeting at Ile d’Yeu (June 2022), I suggested exploring whether some of the bottlenecks in River could be mitigated by moving them to a compiled language like Rust.
I chose Rust for this task because of its performance, its reputation for being boring to maintain, its thriving ecosystem, and its excellent tooling.
Taking this idea forward, I developed the watermill crate (a Rust library), which contains the stats module of River implemented in Rust. Testing the Rust bindings in the River stats module showed promising results, with a significant performance boost:
Statistics | Pure Python (s) | Rust binding (s) | Speedup (x) |
---|---|---|---|
Quantile | 2.359 | 0.148 | 15.955 |
Peak To Peak | 0.216 | 0.047 | 4.609 |
EWMean | 0.158 | 0.105 | 1.512 |
EWVar | 0.426 | 0.104 | 4.075 |
IQR | 4.541 | 0.169 | 26.846 |
Kurtosis | 1.785 | 0.106 | 16.872 |
Skewness | 1.086 | 0.105 | 10.354 |
Rolling Quantile | 323.520 | 77.247 | 4.573 |
Rolling IQR | 636.528 | 76.688 | 9.113 |
The benchmark measures the total time to perform 1 million updates. For rolling statistics, the window size is 1 million and there are 2 million updates.
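A timing of this kind can be reproduced with a loop like the one below. This is only a minimal sketch, not the script behind the table: PyPeakToPeak is a hypothetical pure-Python stand-in for a statistic with update and get methods, and time_updates is an illustrative helper name.

```python
import random
import time


class PyPeakToPeak:
    """Hypothetical pure-Python stand-in: tracks max - min of the values seen."""

    def __init__(self):
        self.min = float("inf")
        self.max = float("-inf")

    def update(self, x):
        if x < self.min:
            self.min = x
        if x > self.max:
            self.max = x
        return self

    def get(self):
        return self.max - self.min


def time_updates(stat, n_updates, seed=42):
    """Return the wall-clock seconds spent on n_updates calls to stat.update."""
    rng = random.Random(seed)
    # Generate the data before starting the clock so only updates are timed.
    data = [rng.random() for _ in range(n_updates)]
    start = time.perf_counter()
    for x in data:
        stat.update(x)
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = time_updates(PyPeakToPeak(), n_updates=1_000_000)
    print(f"1M updates: {elapsed:.3f} s")
```

Passing a Rust-backed statistic to the same helper would give the second column of the comparison, provided it exposes the same update method.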
The significant performance improvement we saw in the proof of concept for the Rust bindings motivated us to integrate Rust into the stats module. In the following sections, we will dive into the technical details of how we call a Rust struct from Python and bind it into the stats module, as well as how we build Python wheels with the Rust bindings using GitHub Actions CI.
Calling a Rust Struct from Python
To call a Rust struct from Python, we can use PyO3, a library for developing Python extensions in Rust. It provides Rust bindings for the Python C API and allows you to write Rust code that can be called from Python.
To use PyO3, we need to add it to our Cargo.toml file as a dependency, along with other dependencies such as watermill, bincode, and serde (we will cover those further below). Our Cargo.toml file should look like this:
[package]
name = "river"
version = "0.1.0"
authors = ["Adil Zouitine <adilzouitinegm@gmail.com>"]
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[lib]
name = "river"
path = "rust_src/lib.rs"
crate-type = ["cdylib"]
[dependencies]
pyo3 = { version = "0.16.5", features = ["extension-module"] }
watermill = "0.1.0"
bincode = "1.3.3"
serde = { version = "1.0", features = ["derive"] }
Once you have added PyO3 as a dependency and annotated your Rust functions with the appropriate attributes (we will cover that further below), you can use the PyO3 library to create a Python extension module that can be imported into a Python script.
In a typical Rust project, the main source code directory is called src. However, in the Cargo.toml file shown above, the main source code directory has been renamed to rust_src, because a src directory was confusing for River maintainers. By renaming the Rust source directory to rust_src and specifying this path in the Cargo.toml file, we can keep the Rust and Python source code separate and avoid any confusion. The lib.rs file, which contains the main code for the Rust crate, is located in the rust_src directory. Thanks to the path key in the Cargo.toml file, the cargo build tool is able to locate and compile the crate’s source code even though it is not in the default src directory.
Now that we’ve set up our Rust project and configured our dependencies, let’s dive into the code. We can create our Rust struct and annotate it with #[pyclass] to make it accessible from Python. For example, if we have a struct called RsEWMean, we can annotate it this way:
use pyo3::prelude::*;
use watermill::{ewmean::EWMean, ...};

#[pyclass(module = "river.stats._rust_stats")]
pub struct RsEWMean {
    ewmean: EWMean<f64>,
    // Stored next to the statistic so the constructor argument can be recovered later.
    alpha: f64,
}
The full code for the EWMean struct is available here.
The #[pyclass(module = "river.stats._rust_stats")] annotation on the Rust struct indicates that it will be exposed in the river.stats._rust_stats module, which corresponds to the _rust_stats.so file. The Python interpreter loads this .so file like any other extension module and uses it to execute the Rust code, so the struct can be imported and called from Python as a regular class.
To make the methods of our struct available from Python, we can annotate the impl block with #[pymethods]. For example, if our RsEWMean struct has methods called new, update, and get, we can annotate them this way:
use pyo3::prelude::*;

#[pymethods]
impl RsEWMean {
    // #[new] marks the constructor that Python calls for RsEWMean(alpha).
    #[new]
    pub fn new(alpha: f64) -> RsEWMean {
        RsEWMean {
            ewmean: EWMean::new(alpha),
            alpha,
        }
    }

    pub fn update(&mut self, x: f64) {
        self.ewmean.update(x);
    }

    pub fn get(&self) -> f64 {
        self.ewmean.get()
    }
}
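For intuition about what update and get compute, the recurrence behind an exponentially weighted mean fits in a few lines of Python. This is a sketch under assumptions: the class name PyEWMean is hypothetical, and seeding the mean with the first observation is one common convention that may differ from what watermill does internally.

```python
class PyEWMean:
    """Hypothetical pure-Python reference for an exponentially weighted mean."""

    def __init__(self, alpha=0.5):
        if not 0 <= alpha <= 1:
            raise ValueError("alpha must be between 0 and 1")
        self.alpha = alpha
        self._mean = None  # no observation seen yet

    def update(self, x):
        if self._mean is None:
            # Assumption: the first observation initialises the mean.
            self._mean = float(x)
        else:
            # New value weighted by alpha, history weighted by (1 - alpha).
            self._mean = self.alpha * x + (1.0 - self.alpha) * self._mean
        return self

    def get(self):
        return 0.0 if self._mean is None else self._mean


ewm = PyEWMean(alpha=0.5)
for x in [1, 3, 5, 4]:
    print(ewm.update(x).get())
# prints 1.0, 2.0, 3.5, 3.75 (one value per line)
```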
To save and load objects in the river library, we use Python’s pickle module, allowing them to be easily stored and transferred between programs.
To use pickle with our Rust struct, we need to ensure that the struct is properly serializable and deserializable. To do this, we can use the serde crate and annotate the struct with #[derive(Serialize, Deserialize)]. This generates the serde trait implementations that the serialize and deserialize functions of the bincode crate rely on.
use serde::{Deserialize, Serialize};
use watermill::{ewmean::EWMean, ...};

#[derive(Serialize, Deserialize)]
#[pyclass(module = "river.stats._rust_stats")]
pub struct RsEWMean {
    ewmean: EWMean<f64>,
    alpha: f64,
}
In addition to the #[derive(Serialize, Deserialize)] annotation, we also need to add the following three methods to the struct’s implementation to enable proper serialization and deserialization from Python:
use pyo3::prelude::*;
use pyo3::types::PyBytes;
use bincode::{deserialize, serialize};

#[pymethods]
impl RsEWMean {
    // Other methods...

    pub fn __setstate__(&mut self, state: &PyBytes) -> PyResult<()> {
        // Deserialize the data contained in the PyBytes object
        // and update the struct with the deserialized values.
        *self = deserialize(state.as_bytes()).unwrap();
        Ok(())
    }

    pub fn __getstate__<'py>(&self, py: Python<'py>) -> PyResult<&'py PyBytes> {
        // Serialize the struct and return a PyBytes object
        // containing the serialized data.
        Ok(PyBytes::new(py, &serialize(&self).unwrap()))
    }

    pub fn __getnewargs__(&self) -> PyResult<(f64,)> {
        // Return the arguments needed to create a new instance
        // of the struct.
        Ok((self.alpha,))
    }
}
The __setstate__ method is called when our struct is unpickled and receives a PyBytes object containing the serialized data. We use the deserialize function from the bincode crate to decode the data and overwrite our struct with the decoded values. The __getstate__ method is called when our struct is pickled; its py argument is the Python token passed in by PyO3, which is needed to allocate Python objects. We use the serialize function from the bincode crate to serialize our struct and return a PyBytes object containing the serialized data. The __getnewargs__ method is also called at pickling time: it returns the arguments that pickle will pass to the constructor when it recreates the object during unpickling. In this case, it returns the alpha parameter.
These methods are necessary to avoid the following error, which is raised when pickle calls the new method of our struct without arguments during unpickling:
RsEWMean.__new__() missing 1 required positional argument: 'alpha'
This error is due to a known issue in PyO3. By implementing these methods, we can avoid the error and properly deserialize and construct our struct from Python.
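The same failure and the same fix can be reproduced in pure Python, which makes the role of each method easier to see. The sketch below is an illustration, not the River code: WithoutNewArgs and WithNewArgs are hypothetical classes whose __new__ requires alpha, as a PyO3 constructor does, and struct.pack stands in for bincode.

```python
import pickle
import struct


class WithoutNewArgs:
    """Hypothetical stand-in for a PyO3 class: __new__ requires alpha."""

    def __new__(cls, alpha):
        obj = super().__new__(cls)
        obj.alpha = alpha
        obj.mean = 0.0
        return obj


class WithNewArgs(WithoutNewArgs):
    def __getnewargs__(self):
        # Recorded at pickling time, passed to __new__ at unpickling time.
        return (self.alpha,)

    def __getstate__(self):
        # Plays the role of bincode's serialize: the state is an opaque bytes blob.
        return struct.pack("<dd", self.alpha, self.mean)

    def __setstate__(self, state):
        # Plays the role of bincode's deserialize.
        self.alpha, self.mean = struct.unpack("<dd", state)


def roundtrip(obj):
    return pickle.loads(pickle.dumps(obj))


try:
    roundtrip(WithoutNewArgs(0.5))
except TypeError as err:
    # pickle called __new__ without the alpha argument.
    print("unpickling failed:", err)

stat = WithNewArgs(0.3)
stat.mean = 1.25
copy = roundtrip(stat)
print(copy.alpha, copy.mean)  # prints 0.3 1.25
```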
The Python module itself is defined in the rust_src/lib.rs file. This file is the main source code file for the crate and is located in the rust_src directory.
To define a Python module in Rust, the lib.rs file includes a function annotated with the #[pymodule] attribute. This function takes two arguments: a Python token and a reference to a PyModule object. The Python token proves that the interpreter lock (GIL) is held and gives Rust access to Python objects and functions. The PyModule object represents the Python module and allows Rust to add classes and functions to it, making them accessible from Python.
For example, to define a Python module called _rust_stats in Rust, you can use the following code in the lib.rs file:
use pyo3::prelude::*;

#[pymodule]
fn _rust_stats(_py: Python, m: &PyModule) -> PyResult<()> {
    // Register the class so that it is importable from the module.
    m.add_class::<RsEWMean>()?;
    Ok(())
}
This code defines a Rust function called _rust_stats that is annotated with the #[pymodule] attribute; the function name becomes the module name. It returns a PyResult, which is Ok(()) if the module was set up successfully and Err if a Python exception was raised along the way.
To add the RsEWMean struct to the module, we call the add_class method on m, the reference to the _rust_stats Python module. The add_class method takes a generic type parameter, written with the ::<> syntax, that names the Rust struct to register, here RsEWMean. It also returns a PyResult, and the trailing ? operator propagates any error to the caller instead of continuing.
Overall, the _rust_stats function defines a Python module called _rust_stats that can be imported and used from Python. The module contains a class called RsEWMean, implemented in Rust, which can be accessed and used like any other Python class.
The full implementation of lib.rs is here.
Figure 1: Visual representation of the Rust binding
Integration of the Rust binding in River
To integrate the Rust binding into River, we need to create a few files and modify some existing files.
First, we need to modify the MANIFEST.in file, which tells setuptools which files to include in the source distribution.
The MANIFEST.in file should look like this:
global-include *.pyx
global-include *.pxd
include river/datasets/*.csv
include river/datasets/*.gz
include river/datasets/*.zip
include river/stream/*.zip
include Cargo.toml
recursive-include rust_src *
The recursive-include rust_src * line in the MANIFEST.in file tells setuptools to include all files in the rust_src directory and its subdirectories in the source distribution. Together with the include Cargo.toml line, this ensures that the Rust binding code is shipped with the package, so that users who install the river library from source can build it.
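One way to check that the manifest does what we expect is to inspect a built source distribution. The helper below is a sketch that uses only the standard library; the function name rust_sources_in_sdist is hypothetical, and it assumes the usual sdist layout where every member sits under a single top-level folder.

```python
import tarfile


def rust_sources_in_sdist(sdist_path):
    """List Cargo.toml and the rust_src files found in a .tar.gz sdist."""
    with tarfile.open(sdist_path, "r:gz") as archive:
        names = archive.getnames()
    # Members look like "<name>-<version>/rust_src/lib.rs":
    # drop the top-level folder to get paths relative to the project root.
    relative = [name.split("/", 1)[1] for name in names if "/" in name]
    return sorted(
        path
        for path in relative
        if path == "Cargo.toml" or path.startswith("rust_src/")
    )
```

Calling it on the archive produced by a source build should return Cargo.toml and rust_src/lib.rs, among any other Rust sources; an empty list means the MANIFEST.in lines were not picked up.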
Next, we need to create a stub file called stats/_rust_stats.pyi, which declares the types of the Python objects that will be created from the Rust structs.
The stub file gives editors and type checkers the signatures of the RsEWMean methods, which can be helpful for developers who are using the class. This step is optional.
The stats/_rust_stats.pyi file should look like this:
class RsEWMean:
    def __init__(self, alpha: float): ...
    def update(self, x: float): ...
    def get(self) -> float: ...
Finally, we need to create a wrapper for the Rust binding in the stats module. This wrapper is necessary because the stats in River are instances of stats.base.Univariate, but the Rust binding does not inherit from this class. To create a wrapper, we can define a new class called EWMean which extends stats.base.Univariate and contains an instance of RsEWMean from the Rust binding. The stats module should contain the following code:
from river import stats
from river.stats import _rust_stats  # <- Pay attention here


class EWMean(stats.base.Univariate):
    """Exponentially weighted mean.

    Parameters
    ----------
    alpha
        The closer `alpha` is to 1 the more the statistic will adapt to recent values.

    Attributes
    ----------
    mean : float
        The running exponentially weighted mean.

    """

    def __init__(self, alpha=0.5):
        if not 0 <= alpha <= 1:
            raise ValueError("alpha must be between 0 and 1")
        self.alpha = alpha
        self._ewmean = _rust_stats.RsEWMean(alpha)  # <- Pay attention here
        self.mean = 0

    @property
    def name(self):
        return f"ewm_{self.alpha}"

    def update(self, x):
        # Delegate the computation to the Rust object.
        self._ewmean.update(x)
        return self

    def get(self):
        return self._ewmean.get()
The last step is to modify the setup.py file to include the Rust extension in the package. This ensures that the Rust binding is built and made available to users of the river library when they install the package.
The setup.py file should look like this:
import platform
...
import setuptools
from setuptools_rust import Binding, RustExtension
...
setuptools.setup(
    ...
    ext_modules=cythonize(
        module_list=[
            setuptools.Extension(
                "*",
                sources=["**/*.pyx"],
                include_dirs=[get_include()],
                libraries=[] if platform.system() == "Windows" else ["m"],
                define_macros=[("NPY_NO_DEPRECATED_API", "NPY_1_7_API_VERSION")],
            )
        ],
        compiler_directives={
            "language_level": 3,
            "binding": True,
            "embedsignature": True,
        },
    ),
    rust_extensions=[RustExtension("river.stats._rust_stats", binding=Binding.PyO3)],
    # rust extensions are not zip safe, just like C-extensions.
    zip_safe=False,
)
The full code is here.
The setup.py file is used to build and distribute the Python package, with the help of the setuptools and setuptools_rust libraries. The setuptools.setup function defines the package’s configuration, including the package’s name, version, dependencies, and other information.
The ext_modules parameter is used to specify the Cython extensions that will be built. Cython is a language that allows us to write Python-like code that can be compiled to C. In the setup.py above, we use cythonize to compile the Cython extensions from all .pyx files in the project.
The rust_extensions parameter is used to specify the Rust extensions that will be built. The RustExtension class is used to define a Rust extension, and the binding parameter is used to specify how the Rust code will be called from Python. In the setup.py above, we use Binding.PyO3, which specifies that we will use the PyO3 library to call the Rust code from Python.
The command python setup.py build_rust --inplace --release builds the Rust part of the project. The setup.py file is the Python script that holds the build instructions, and the build_rust argument tells it to build the Rust extension. The --inplace flag specifies that the project should be built in place, meaning that the built files will be placed in the same directory as the River source code. The --release flag tells the build tool to build the project in release mode, which enables optimizations and disables debugging information in the resulting binary.
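After the build, a quick way to confirm that the compiled module is importable is to try loading it and fall back gracefully if it is missing. This is a small sketch; the helper name load_rust_stats is hypothetical.

```python
import importlib


def load_rust_stats(module_name="river.stats._rust_stats"):
    """Return the compiled module if it can be imported, else None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Raised when the package or the compiled extension is not available.
        return None


module = load_rust_stats()
if module is None:
    print("Rust extension not found: run the build command first")
else:
    print("Rust extension loaded from", module.__file__)
```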
Building Python wheels with Rust binding
One reason to build a Python wheel is to make it easier for users to install the package. Instead of having to compile the package from source code, users can simply install the pre-built wheel file. This can be especially useful when the package includes compiled code or extensions, which can be difficult for users to build on their own.
Another reason to build a Python wheel is to make it easier to distribute the package. Wheels are a standardized format, so users can be confident that the package will work on their system if they have a compatible version of Python installed. This can be especially useful for packages that are intended to be used on a wide range of systems, as it allows the package maintainers to build and distribute a single package that many users can use.
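The compatibility information is encoded in the wheel's filename, which follows the pattern distribution-version(-build)-python-abi-platform.whl. The sketch below splits a filename into these parts; the helper name parse_wheel_filename and the example filename (with a placeholder version) are hypothetical.

```python
from typing import NamedTuple, Optional


class WheelTags(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python: str
    abi: str
    platform: str


def parse_wheel_filename(filename):
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python, abi, platform = parts
        build = None
    elif len(parts) == 6:
        # The optional build tag sits between the version and the Python tag.
        name, version, build, python, abi, platform = parts
    else:
        raise ValueError(f"unexpected number of components: {filename}")
    return WheelTags(name, version, build, python, abi, platform)


tags = parse_wheel_filename("river-0.0.0-cp310-cp310-manylinux2014_x86_64.whl")
print(tags.python, tags.abi, tags.platform)
# prints cp310 cp310 manylinux2014_x86_64
```

An installer compares these tags with the running interpreter and platform, which is why one wheel per Python version and platform has to be built.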
Building Python wheels that include compiled code can be a painful process for developers. It requires managing multiple build environments and ensuring that the compiled code is compatible with all the different systems the wheel may be installed on. This can be particularly challenging when building wheels for binary extensions that need to be compiled against different versions of the C library or other system libraries.
To alleviate this pain, PyPA’s cibuildwheel tool runs on a continuous integration (CI) service and builds wheels for multiple platforms and Python versions. By specifying the package’s build requirements in a configuration file, developers can create wheels compatible with a wide range of systems without the burden of building and testing on each platform themselves.
In short, cibuildwheel significantly simplifies the often painful process of building Python wheels with compiled code, saving time and resources.
Here is an example of a .yml file that can be used with GitHub Actions to build Python wheels with the Rust binding using cibuildwheel.
jobs:
  build_wheels:
    name: Build wheels on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        arch: [main, alt]
        include:
          # include the main arch for each os and the alt arch for each os
          # ...
    steps:
      - uses: actions/checkout@v2
      - name: set up rust
        # use the rust toolchain
        # add the target for the alt arch
        # ...
      - name: Build wheels
        uses: pypa/cibuildwheel@v2.3.0
        env:
          CIBW_BUILD: "cp38-* cp39-* cp310-* cp311-*"
          CIBW_BEFORE_BUILD: >
            pip install setuptools-rust cython &&
            rustup default nightly &&
            rustup show
          CIBW_SKIP: "*-musllinux_i686"
          CIBW_ARCHS: ${{ matrix.alt_arch_name || 'auto' }}
          CIBW_MANYLINUX_X86_64_IMAGE: "manylinux2014"
          CIBW_MUSLLINUX_X86_64_IMAGE: "musllinux_1_1"
          CIBW_MANYLINUX_AARCH64_IMAGE: "manylinux2014"
          CIBW_MUSLLINUX_AARCH64_IMAGE: "musllinux_1_1"
          CIBW_ENVIRONMENT: 'PATH="$HOME/.cargo/bin:$PATH"'
          CIBW_ENVIRONMENT_LINUX: 'PATH="$HOME/.cargo/bin:$PATH" CARGO_NET_GIT_FETCH_WITH_CLI="true"'
          CIBW_MANYLINUX_I686_IMAGE: "manylinux2014"
          CIBW_ENVIRONMENT_WINDOWS: 'PATH="$UserProfile\.cargo\bin;$PATH"'
          CIBW_BEFORE_BUILD_LINUX: >
            pip install cython numpy setuptools wheel setuptools-rust &&
            curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain=nightly --profile=minimal -y &&
            rustup show
The full code is here.
Two notable environment variables in the configuration file are CIBW_SKIP and CIBW_ENVIRONMENT_LINUX.
The CIBW_SKIP environment variable tells cibuildwheel to skip every build whose identifier matches *-musllinux_i686, that is, the musllinux wheels for the i686 architecture. This is because the Rust toolchain is not available for musllinux on the i686 architecture.
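To see which builds the CIBW_BUILD and CIBW_SKIP values above select, the patterns can be tried against a few build identifiers with Python's fnmatch module. This only approximates cibuildwheel's own selector logic, and the helper name is_selected is hypothetical.

```python
from fnmatch import fnmatchcase


def is_selected(identifier, build_patterns, skip_patterns):
    """Return True if identifier matches a build pattern and no skip pattern."""
    build = any(fnmatchcase(identifier, p) for p in build_patterns.split())
    skip = any(fnmatchcase(identifier, p) for p in skip_patterns.split())
    return build and not skip


BUILD = "cp38-* cp39-* cp310-* cp311-*"
SKIP = "*-musllinux_i686"

for identifier in [
    "cp310-manylinux_x86_64",  # built
    "cp310-musllinux_i686",    # skipped by CIBW_SKIP
    "cp37-manylinux_x86_64",   # not listed in CIBW_BUILD
]:
    print(identifier, is_selected(identifier, BUILD, SKIP))
# prints True, then False, then False
```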
The CIBW_ENVIRONMENT_LINUX environment variable sets the PATH and CARGO_NET_GIT_FETCH_WITH_CLI environment variables for Linux builds. As in the CIBW_ENVIRONMENT variable, PATH is extended with the $HOME/.cargo/bin directory. The CARGO_NET_GIT_FETCH_WITH_CLI variable is set to true to fix the following error, which can occur when building Rust extensions on Linux:
cargo rustc --lib --message-format=json-render-diagnostics --manifest-path Cargo.toml --release -v --features pyo3/extension-module -- --crate-type cdylib failed with code -9
This solution was found in the following GitHub issue.
Et voilà! It works!
Wrap-up
This blog post discussed how we extended the River stats module with Rust using PyO3. We motivated the need for better performance in the River stats module, and explained how we addressed it by creating the watermill crate, which contains the stats module of River implemented in Rust.
We are excited to use Rust in the River codebase. Rust provides excellent performance, as demonstrated by the significant improvements we saw in the stats module. This has allowed us to progress faster and achieve our goals more efficiently.
If you have any questions, please do not hesitate to contact me by email or on Twitter.
If you want to discuss River, you can join the Discord channel.