CTranslate2 is a fast inference engine for OpenNMT-py and OpenNMT-tf models supporting both CPU and GPU execution. The goal is to provide comprehensive inference features and be the most efficient and cost-effective solution to deploy standard neural machine translation systems such as Transformer models.
The project is production-oriented and comes with backward compatibility guarantees, but it also includes experimental features related to model compression and inference acceleration.
Some of these features are difficult to achieve with standard deep learning frameworks and are the motivation for this project.
The translation API supports several decoding options; see the Decoding documentation for examples.
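For illustration, a couple of these options could be passed through the Python API as in the sketch below. The model directory comes from the Quickstart that follows, and `beam_size` and `num_hypotheses` are shown only as two examples of decoding parameters; see the Decoding documentation for the full list.

```python
import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2/")

# Beam search of size 4, keeping the 2 best hypotheses per example.
results = translator.translate_batch(
    [["▁H", "ello", "▁world", "!"]],
    beam_size=4,
    num_hypotheses=2,
)
```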
The steps below assume a Linux OS and a Python installation (3.5 or above).
1. Install the Python package:
pip install --upgrade pip
pip install ctranslate2
2. Convert a model trained with OpenNMT-py or OpenNMT-tf, for example the pretrained Transformer model (choose one of the two models):
a. OpenNMT-py
pip install OpenNMT-py
wget https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
tar xf transformer-ende-wmt-pyOnmt.tar.gz
ct2-opennmt-py-converter --model_path averaged-10-epoch.pt --model_spec TransformerBase \
--output_dir ende_ctranslate2
b. OpenNMT-tf
pip install OpenNMT-tf
wget https://s3.amazonaws.com/opennmt-models/averaged-ende-export500k-v2.tar.gz
tar xf averaged-ende-export500k-v2.tar.gz
ct2-opennmt-tf-converter --model_path averaged-ende-export500k-v2 --model_spec TransformerBase \
--output_dir ende_ctranslate2
3. Translate tokenized inputs, for example with the Python API:
import ctranslate2
translator = ctranslate2.Translator("ende_ctranslate2/")
translator.translate_batch([["▁H", "ello", "▁world", "!"]])
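The call returns the translation hypotheses for each example in the batch. Below is a minimal sketch of reading the best hypothesis, assuming the 1.x return format where each example maps to a list of dictionaries with a "tokens" entry:

```python
import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2/")
results = translator.translate_batch([["▁H", "ello", "▁world", "!"]])

# Assumption: each batch entry is a list of hypotheses (best first),
# stored as a dictionary with a "tokens" key (1.x Python API).
best_tokens = results[0][0]["tokens"]
print(" ".join(best_tokens))  # expected output for this model: ▁Hallo ▁Welt !
```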
Python packages are published on PyPI for Linux and macOS:
pip install ctranslate2
All software dependencies are included in the package, including CUDA libraries for GPU support on Linux. The macOS version only supports CPU execution.
Requirements:

- OS: Linux or macOS
- Python: 3.5 or above
The opennmt/ctranslate2 repository on Docker Hub contains images for multiple Linux distributions, with or without GPU support:
docker pull opennmt/ctranslate2:latest-ubuntu18-cuda11.0
The images include:

- the translation clients presented in Translating
- the `libctranslate2.so` library and its development files

To build the project from the sources instead, see Building.
The core CTranslate2 implementation is framework-agnostic. The framework-specific logic is moved to a conversion step that serializes trained models into a simple binary format.
The following frameworks and models are currently supported:
| | OpenNMT-tf | OpenNMT-py |
| --- | --- | --- |
| Transformer (Vaswani et al. 2017) | ✓ | ✓ |
| + relative position representations (Shaw et al. 2018) | ✓ | ✓ |
If you are using a model that is not listed above, consider opening an issue to discuss future integration.
Conversion scripts are part of the Python package and should be run in the same environment as the selected training framework:

- `ct2-opennmt-py-converter`
- `ct2-opennmt-tf-converter`
The converter Python API can also be used to convert Transformer models with any number of layers, hidden dimensions, and attention heads.
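For example, converting a non-standard Transformer could look like the sketch below. The `TransformerSpec` arguments and the `quantization`/`force` keyword arguments are assumptions based on the Python converter API; check the Python reference for the exact signatures.

```python
import ctranslate2

# Assumed spec for a Transformer with 6 layers and 8 attention heads;
# the remaining dimensions should be read from the checkpoint weights.
model_spec = ctranslate2.specs.TransformerSpec(num_layers=6, num_heads=8)

converter = ctranslate2.converters.OpenNMTPyConverter("averaged-10-epoch.pt")
converter.convert(
    "ende_ctranslate2/",   # output directory
    model_spec,
    quantization="int8",   # assumed to mirror the --quantization option
    force=True,            # assumed to overwrite an existing output directory
)
```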
Models can also be converted directly from the supported training frameworks; see their documentation.
The converters support reducing the weights precision to save on space and possibly accelerate the model execution. The `--quantization` option accepts the following values:

- `int8`
- `int16`
- `float16`
When loading a quantized model, the library tries to use the same type for computation. If the current platform or backend does not support optimized execution for this computation type (e.g. `int16` is not optimized on GPU), then the library converts the model weights to another optimized type. The tables below document the fallback types:
On CPU:
| CPU vendor | int8 | int16 | float16 |
| --- | --- | --- | --- |
| Intel | int8 | int16 | float |
| other | int8 | int8 | float |
(This table only applies to prebuilt binaries or when compiling with both Intel MKL and oneDNN backends.)
On GPU:
| Compute Capability | int8 | int16 | float16 |
| --- | --- | --- | --- |
| >= 7.0 | int8 | float16 | float16 |
| 6.1 | int8 | float | float |
| <= 6.0 | float | float | float |
Notes:

- The computation type can also be changed when loading the model, using the `--compute_type` argument (see the sketch after this note).
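With the Python API, this could look like the following sketch, assuming the `Translator` constructor exposes a matching `compute_type` parameter:

```python
import ctranslate2

# Assumption: compute_type mirrors the --compute_type command line argument.
# Request int8 computation regardless of the type the model was saved in;
# the library falls back to an optimized type if int8 is not supported.
translator = ctranslate2.Translator("ende_ctranslate2/", compute_type="int8")
```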
Each converter should populate a model specification with trained weights coming from an existing model. The model specification declares the variable names and layout expected by the CTranslate2 core engine.
See the existing converter implementations, which can be used as templates.
The examples use the English-German model converted in the Quickstart. This model requires a SentencePiece tokenization.
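For raw text input, the tokens can be produced with the SentencePiece model distributed with the pretrained checkpoint. Below is a minimal sketch using the `sentencepiece` Python package; the `sentencepiece.model` file name is a placeholder, use the path from the downloaded archive, and the result format assumes the 1.x Python API.

```python
import sentencepiece as spm
import ctranslate2

# Hypothetical file name: use the SentencePiece model shipped with the
# pretrained checkpoint downloaded in the Quickstart.
sp = spm.SentencePieceProcessor()
sp.load("sentencepiece.model")

translator = ctranslate2.Translator("ende_ctranslate2/")

tokens = sp.encode_as_pieces("Hello world!")
results = translator.translate_batch([tokens])

# Assumption: each example maps to a list of hypothesis dictionaries.
print(sp.decode_pieces(results[0][0]["tokens"]))
```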
echo "▁H ello ▁world !" | docker run --gpus=all -i --rm -v $PWD:/data \
opennmt/ctranslate2:latest-ubuntu18-cuda11.0 --model /data/ende_ctranslate2 --device cuda
See `docker run --rm opennmt/ctranslate2:latest-ubuntu18-cuda11.0 --help` for additional options.
import ctranslate2
translator = ctranslate2.Translator("ende_ctranslate2/")
translator.translate_batch([["▁H", "ello", "▁world", "!"]])
See the Python reference for more advanced usages.
#include <iostream>
#include <ctranslate2/translator.h>
int main() {
ctranslate2::Translator translator("ende_ctranslate2/");
ctranslate2::TranslationResult result = translator.translate({"▁H", "ello", "▁world", "!"});
for (const auto& token : result.output())
std::cout << token << ' ';
std::cout << std::endl;
return 0;
}
See the Translator class for more advanced usages, and the TranslatorPool class for running translations in parallel.
Some environment variables can be configured to customize the execution:
- `CT2_CUDA_ALLOW_FP16`: Allow using FP16 computation on GPU even if the device does not have efficient FP16 support.
- `CT2_CUDA_CACHING_ALLOCATOR_CONFIG`: Tune the CUDA caching allocator (see Performance).
- `CT2_FORCE_CPU_ISA`: Force CTranslate2 to select a specific instruction set architecture (ISA). Possible values are: `GENERIC`, `AVX`, `AVX2`. Note: this does not impact backend libraries (such as Intel MKL) which usually have their own environment variables to configure ISA dispatching.
- `CT2_TRANSLATORS_CORE_OFFSET`: If set to a non-negative value, parallel translators are pinned to cores in the range `[offset, offset + inter_threads]`. Requires compilation with `-DOPENMP_RUNTIME=NONE`.
- `CT2_USE_EXPERIMENTAL_PACKED_GEMM`: Enable the packed GEMM API for Intel MKL (see Performance).
- `CT2_USE_MKL`: Force CTranslate2 to use (or not) Intel MKL. By default, the runtime automatically decides whether to use Intel MKL based on the CPU vendor.
- `CT2_VERBOSE`: Enable some verbose logs to help debug the run configuration.

The Docker images build all translation clients presented in Translating. The `docker build` command should be run from the project root directory, e.g.:
docker build -t opennmt/ctranslate2:latest-ubuntu18 -f docker/Dockerfile.ubuntu .
When building GPU images, the CUDA version can be selected with `--build-arg CUDA_VERSION=11.0`.
See the `docker/` directory for available images.
The project uses CMake for compilation. The following options can be set with `-DOPTION=VALUE`:
| CMake option | Accepted values (default in bold) | Description |
| --- | --- | --- |
| CMAKE_CXX_FLAGS | compiler flags | Defines additional compiler flags |
| ENABLE_CPU_DISPATCH | OFF, **ON** | Compiles CPU kernels for multiple ISA and dispatches at runtime (should be disabled when explicitly targeting an architecture with the `-march` compilation flag) |
| ENABLE_PROFILING | **OFF**, ON | Enables the integrated profiler (usually disabled in production builds) |
| LIB_ONLY | **OFF**, ON | Disables the translation client |
| OPENMP_RUNTIME | **INTEL**, COMP, NONE | Selects or disables the OpenMP runtime (INTEL: Intel OpenMP; COMP: OpenMP runtime provided by the compiler; NONE: no OpenMP runtime) |
| WITH_CUDA | **OFF**, ON | Compiles with the CUDA backend |
| WITH_DNNL | **OFF**, ON | Compiles with the oneDNN backend (a.k.a. DNNL) |
| WITH_MKL | OFF, **ON** | Compiles with the Intel MKL backend |
| WITH_ACCELERATE | **OFF**, ON | Compiles with the Apple Accelerate backend |
| WITH_OPENBLAS | **OFF**, ON | Compiles with the OpenBLAS backend |
| WITH_TESTS | **OFF**, ON | Compiles the tests |
Some build options require external dependencies:
- `-DWITH_MKL=ON` requires Intel MKL
- `-DWITH_DNNL=ON` requires oneDNN
- `-DWITH_ACCELERATE=ON` requires the Apple Accelerate framework
- `-DWITH_OPENBLAS=ON` requires OpenBLAS
- `-DWITH_CUDA=ON` requires the CUDA Toolkit
Multiple backends can be enabled for a single build. When building with both Intel MKL and oneDNN, the backend will be selected at runtime based on the CPU information.
Use the following instructions to install Intel MKL:
wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
sudo apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
sudo sh -c 'echo "deb https://apt.repos.intel.com/oneapi all main" > /etc/apt/sources.list.d/oneAPI.list'
sudo apt-get update
sudo apt-get install intel-oneapi-mkl-devel
See the Intel MKL documentation for other installation methods.
See the NVIDIA documentation for information on how to download and install CUDA.
Under the project root, run the following commands:
git submodule update --init
mkdir build && cd build
cmake -DWITH_MKL=ON -DWITH_CUDA=ON ..
make -j4
(If you did not install one of Intel MKL or CUDA, set its corresponding flag to OFF in the CMake command line.)
These steps should produce the `cli/translate` binary. You can try it with the model converted in the Quickstart section:
$ echo "▁H ello ▁world !" | ./cli/translate --model ende_ctranslate2/ --device auto
▁Hallo ▁Welt !
To enable the tests, you should configure the project with `cmake -DWITH_TESTS=ON`. The binary `tests/ctranslate2_test` runs all tests using Google Test. It expects the path to the test data as argument:
./tests/ctranslate2_test ../tests/data
# Install the CTranslate2 library.
cd build && make install && cd ..
# Build and install the Python wheel.
cd python
pip install -r install_requirements.txt
python setup.py bdist_wheel
pip install dist/*.whl
# Run the tests with pytest.
pip install -r tests/requirements.txt
pytest tests/test.py
Depending on your build configuration, you might need to set `LD_LIBRARY_PATH` if missing libraries are reported when running `tests/test.py`.
We compare CTranslate2 with OpenNMT-py and OpenNMT-tf on their pretrained English-German Transformer models (available on the website). For this benchmark, CTranslate2 models use the weights of the OpenNMT-py model.
| | Model size |
| --- | --- |
| OpenNMT-py | 542MB |
| OpenNMT-tf | 367MB |
| CTranslate2 | 364MB |
| - int16 | 187MB |
| - float16 | 182MB |
| - int8 | 100MB |
CTranslate2 models are generally lighter and can go as low as 100MB when quantized to int8. This also results in a faster loading time and noticeably lower memory usage during runtime.
We translate the test set newstest2014 and report:

- the number of target tokens generated per second
- the maximum memory usage
- the BLEU score

Translations run beam search with a size of 4 and a maximum batch size of 32.
See the directory `tools/benchmark` for more details about the benchmark procedure and how to run it. Also see the Performance document to further improve CTranslate2 performance.
Please note that the results presented below are only valid for the configuration used during this benchmark: absolute and relative performance may change with different settings.
| | Tokens per second | Max. memory | BLEU |
| --- | --- | --- | --- |
| OpenNMT-tf 2.14.0 (with TensorFlow 2.4.0) | 279.3 | 2308MB | 26.93 |
| OpenNMT-py 2.0.0 (with PyTorch 1.7.0) | 292.9 | 1840MB | 26.77 |
| - int8 | 383.3 | 1784MB | 26.86 |
| CTranslate2 1.17.0 | 593.2 | 970MB | 26.77 |
| - int16 | 777.2 | 718MB | 26.84 |
| - int8 | 921.5 | 635MB | 26.92 |
| - int8 + vmap | 1143.4 | 621MB | 26.75 |
Executed with 4 threads on a c5.metal Amazon EC2 instance equipped with an Intel(R) Xeon(R) Platinum 8275CL CPU.
| | Tokens per second | Max. GPU memory | Max. CPU memory | BLEU |
| --- | --- | --- | --- | --- |
| OpenNMT-tf 2.14.0 (with TensorFlow 2.4.0) | 1753.4 | 4958MB | 2525MB | 26.93 |
| OpenNMT-py 2.0.0 (with PyTorch 1.7.0) | 1189.4 | 2838MB | 2666MB | 26.77 |
| CTranslate2 1.17.0 | 2721.1 | 1164MB | 954MB | 26.77 |
| - int8 | 3710.0 | 882MB | 541MB | 26.86 |
| - float16 | 3965.8 | 924MB | 590MB | 26.75 |
| - float16 + local sorting | 4869.4 | 1148MB | 591MB | 26.75 |
Executed with CUDA 11.0 on a g4dn.xlarge Amazon EC2 instance equipped with an NVIDIA T4 GPU (driver version: 450.80.02).
The original CTranslate project shares a similar goal, which is to provide a custom execution engine for OpenNMT models that is lightweight and fast. However, it has some limitations that were hard to overcome:
CTranslate2 addresses these issues in several ways:
The implementation has been generously tested in production environments, so people can rely on it in their applications. The project versioning follows Semantic Versioning 2.0.0. The following APIs are covered by backward compatibility guarantees:

Python:

- `ctranslate2.Translator`
- `ctranslate2.converters.OpenNMTPyConverter`
- `ctranslate2.converters.OpenNMTTFConverter`

C++:

- `ctranslate2::models::Model`
- `ctranslate2::TranslationOptions`
- `ctranslate2::TranslationResult`
- `ctranslate2::Translator`
- `ctranslate2::TranslatorPool`
Other APIs are expected to evolve to increase efficiency, genericity, and model support.
Here are some scenarios where this project could be used:
However, you should probably not use this project when:
CPU
CTranslate2 supports x86-64 and ARM64 processors. It includes optimizations for AVX, AVX2, and NEON and supports multiple BLAS backends that should be selected based on the target platform (see Building).
Prebuilt binaries are designed to run on any x86-64 processors supporting at least SSE 4.2. The binaries implement runtime dispatch to select the best backend and instruction set architecture (ISA) for the platform. In particular, they are compiled with both Intel MKL and oneDNN so that Intel MKL is only used on Intel processors where it performs best, whereas oneDNN is used on other x86-64 processors such as AMD.
GPU
CTranslate2 supports NVIDIA GPUs with a Compute Capability greater than or equal to 3.0 (Kepler). FP16 execution requires a Compute Capability greater than or equal to 7.0.
The driver requirement depends on the CUDA version. See the CUDA Compatibility guide for more information.
The current approach only exports the weights from existing models and redefines the computation graph via the code. This implies a strong assumption about the graph architecture executed by the original framework.
We are actively looking to ease this assumption by supporting ONNX for model parts.
There are many ways to make this project better and even faster. See the open issues for an overview of current and planned features. Here are some things we would like to get to.
What is the difference between `intra_threads` and `inter_threads`?

- `intra_threads` is the number of OpenMP threads that is used per translation: increase this value to decrease the latency.
- `inter_threads` is the maximum number of CPU translations executed in parallel: increase this value to increase the throughput. Even though the model data are shared, this execution mode will increase the memory usage as some internal buffers are duplicated for thread safety.

The total number of computing threads launched by the process is summarized by this formula:
num_threads = inter_threads * intra_threads
Note that these options are only defined for CPU translation and are forced to 1 when executing on GPU. Parallel translations on GPU require multiple GPUs; see the option `device_index`, which accepts multiple device IDs.
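The sketch below shows both modes, assuming the `Translator` constructor exposes `inter_threads`, `intra_threads`, and `device_index` parameters as described above:

```python
import ctranslate2

# CPU: up to 2 translations in parallel, 4 OpenMP threads each
# (2 * 4 = 8 computing threads in total, following the formula above).
cpu_translator = ctranslate2.Translator(
    "ende_ctranslate2/", device="cpu", inter_threads=2, intra_threads=4
)

# GPU: parallel translations require multiple devices; device_index is
# assumed to accept a list of GPU IDs as described above.
gpu_translator = ctranslate2.Translator(
    "ende_ctranslate2/", device="cuda", device_index=[0, 1]
)
```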
The OpenNMT-py REST server is able to serve CTranslate2 models. See the code integration to learn more.
How do I generate a vocabulary mapping file? See here.