Awesome Open Source

Bert in Production

A small collection of resources on using BERT and related language models in production environments.


Implementations and production-ready tools related to BERT.

  • microsoft/onnxruntime This library was recently open-sourced by Microsoft; it contains several model-specific optimisations, including one for transformer models. A model's architecture is exported to the Open Neural Network Exchange (ONNX) format and optionally optimised for a specific platform's hardware.

  • google-research/bert The original code. TensorFlow code and pre-trained models for BERT.

  • pytorch/fairseq Facebook AI Research Sequence-to-Sequence Toolkit written in Python. Contains the original code for RoBERTa.

  • google-research/google-research Google AI Research. Contains the original code for ALBERT.

  • huggingface/transformers Transformers: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch. The transformers library is focussed on using publicly-available pretrained models and has wide support for many of the most popular varieties.

  • huggingface/tokenizers Fast State-of-the-Art Tokenizers optimized for Research and Production

  • spacy-transformers spaCy pipelines for pre-trained BERT, XLNet and GPT-2

  • codertimo/BERT-pytorch Google AI 2018 BERT pytorch implementation

  • kaushaltrivedi/fast-bert Super easy library for BERT based NLP models

  • CyberZHG/keras-bert Implementation of BERT that could load official pre-trained models for feature extraction and prediction

  • hanxiao/bert-as-service bert-as-service uses BERT as a sentence encoder and hosts it as a service via ZeroMQ, allowing you to map sentences into fixed-length representations in just two lines of code.

Descriptive Resources

Articles and papers describing how BERT works.

Deep Analysis

These papers do a deep analysis of the internals of BERT. Understanding the internals of a model can enable more efficient optimisations.

General Resources

Original papers describing architectures and methodologies intrinsic to a BERT-style language model.


One of the big problems with running BERT-like models in production is the time required to infer; a logical conclusion is that a faster model is a more production-ready model.
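Before comparing faster variants, it helps to measure inference time in a consistent way. The following is a minimal sketch using only the Python standard library; `measure_latency` and `fake_model` are hypothetical names introduced here for illustration, and `fake_model` merely stands in for a real model's forward pass.

```python
import statistics
import time


def measure_latency(predict, example, warmup=3, runs=20):
    """Return (median, p95) wall-clock latency in milliseconds for predict(example)."""
    # Warm-up calls are discarded so one-off costs (caches, lazy init) do not skew results.
    for _ in range(warmup):
        predict(example)

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(example)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    timings_ms.sort()
    # Index of the 95th percentile, clamped to the last element for small run counts.
    p95_index = min(len(timings_ms) - 1, int(0.95 * len(timings_ms)))
    return statistics.median(timings_ms), timings_ms[p95_index]


def fake_model(text):
    # Hypothetical stand-in for a real model call.
    return sum(ord(ch) for ch in text) % 2


if __name__ == "__main__":
    median_ms, p95_ms = measure_latency(fake_model, "BERT in production")
    print(f"median={median_ms:.4f} ms  p95={p95_ms:.4f} ms")
```

Reporting a tail percentile alongside the median is a deliberate choice: production latency budgets are usually set by the slow requests, not the typical one.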

Knowledge Distillation

One way to make a model faster is to reduce the amount of computation required to generate its output. Knowledge distillation is the process of training a smaller "student" model from a larger "teacher" network. The smaller model is then deployed to production.
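As a rough illustration of the training signal involved, the sketch below computes one common form of distillation loss: a KL-divergence term between temperature-softened teacher and student outputs, mixed with ordinary cross-entropy on the true label. This is an assumed, generic formulation, not necessarily the loss used by any paper listed here; the function names and the `temperature` and `alpha` parameters are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and a hard-label cross-entropy term."""
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)

    # KL(teacher || student) on the softened distributions.
    kl = sum(t * math.log(t / s)
             for t, s in zip(teacher_soft, student_soft) if t > 0)

    # Standard cross-entropy of the student against the ground-truth label.
    hard_ce = -math.log(softmax(student_logits)[true_label])

    # Scaling by temperature^2 keeps the soft term on a scale comparable to the hard term.
    return alpha * (temperature ** 2) * kl + (1.0 - alpha) * hard_ce


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    student = [1.0, 0.8, -0.5]
    print(distillation_loss(student, teacher, true_label=0))
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to incorrect classes.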


  • Small and Practical BERT Models for Sequence Labeling Starting from a public multilingual BERT checkpoint, their final model is 6x smaller and 27x faster, and has higher accuracy than a state-of-the-art multilingual baseline.

  • ALBERT: A Lite BERT for Self-supervised Learning of Language Representations ALBERT primarily aims to reduce the number of trainable parameters in a BERT model. ALBERT shares all weights in the transformer encoder layers and decouples the dimension of the word embeddings from the dimensions of the transformer. The result is a model that has far fewer trainable parameters. Inference time is not reduced, since every layer is still executed.

  • Compression BERT for faster prediction Learn how to use pruning to speed up BERT.

  • Extreme Language Model Compression with Optimal Subwords and Shared Projections The authors utilise knowledge distillation to train the teacher and student models simultaneously to obtain optimal word embeddings for the student vocabulary. Their method compresses the BERT_BASE model by more than 60x and to under 7MB.

  • Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT The authors propose a new quantization scheme and achieve comparable performance to baseline even with up to 13× compression of the model parameters and up to 4× compression of the embedding table as well as activations.

  • PoWER-BERT: Accelerating BERT inference for Classification Tasks BERT has emerged as a popular model for natural language understanding. Given its compute intensive nature, even for inference, many recent studies have considered optimization of two important performance characteristics: model size and inference time. We consider classification tasks and propose a novel method, called PoWER-BERT, for improving the inference time for the BERT model without significant loss in the accuracy. The method works by eliminating word-vectors (intermediate vector outputs) from the encoder pipeline. We design a strategy for measuring the significance of the word-vectors based on the self-attention mechanism of the encoders which helps us identify the word-vectors to be eliminated. Experimental evaluation on the standard GLUE benchmark shows that PoWER-BERT achieves up to 4.5x reduction in inference time over BERT with < 1% loss in accuracy. We show that compared to the prior inference time reduction methods, PoWER-BERT offers better trade-off between accuracy and inference time. Lastly, we demonstrate that our scheme can also be used in conjunction with ALBERT (a highly compressed version of BERT) and can attain up to 6.8x factor reduction in inference time with < 1% loss in accuracy.

  • Q8BERT: Quantized 8Bit BERT Recently, pre-trained Transformer based language models such as BERT and GPT, have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large amount of parameters. The emergence of even larger and more accurate models such as GPT2 and Megatron, suggest a trend of large pre-trained Transformer models. However, using these large models in production environments is a complex task requiring a large amount of compute, memory and power resources. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by 4× with minimal accuracy loss. Furthermore, the produced quantized model can accelerate inference speed if it is optimized for 8bit Integer supporting hardware.

  • TinyBERT: Distilling BERT for Natural Language Understanding Language model pre-training, such as BERT, has significantly improved the performances of many natural language processing tasks. However, pre-trained language models are usually computationally expensive and memory intensive, so it is difficult to effectively execute them on some resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we firstly propose a novel transformer distillation method that is a specially designed knowledge distillation (KD) method for transformer-based models. By leveraging this new KD method, the plenty of knowledge encoded in a large teacher BERT can be well transferred to a small student TinyBERT. Moreover, we introduce a new two-stage learning framework for TinyBERT, which performs transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture both the general-domain and task-specific knowledge of the teacher BERT. TinyBERT is empirically effective and achieves more than 96% the performance of teacher BERT_BASE on GLUE benchmark while being 7.5x smaller and 9.4x faster on inference. TinyBERT is also significantly better than state-of-the-art baselines on BERT distillation, with only about 28% parameters and about 31% inference time of them.
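Several entries above rely on quantization. The sketch below shows the basic idea with simple symmetric 8-bit quantization applied to an already-trained list of weights. It is an assumed, simplified illustration: Q8BERT applies quantization during fine-tuning and Q-BERT uses a Hessian-based scheme, neither of which is reproduced here, and the function names are hypothetical. Storing each weight as an 8-bit integer instead of a 32-bit float is where a 4× size reduction comes from.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale maps the largest magnitude onto 127; fall back to 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate float values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.02, 1.0]
    quantized, scale = quantize_int8(weights)
    restored = dequantize(quantized, scale)
    worst_error = max(abs(w - r) for w, r in zip(weights, restored))
    print(quantized, scale, worst_error)
```

The rounding error per weight is bounded by half the scale, which is why a tensor with a few very large outliers quantizes poorly under a single shared scale.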

Other Resources
