Made in Vancouver, Canada by Picovoice
This repo is a minimalist and extensible framework for benchmarking different speech-to-text engines.
Word error rate (WER) is the ratio of edit distance between words in a reference transcript and the words in the output of the speech-to-text engine to the number of words in the reference transcript.
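The definition above can be sketched in a few lines of Python. This is a minimal illustration, not necessarily how this repo computes WER: the function names `edit_distance` and `word_error_rate` are hypothetical, and the normalization (lowercasing and splitting on whitespace) is an assumption.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance between word sequences divided by the reference length."""
    # Assumption: case-insensitive comparison on whitespace-separated words.
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Note that WER can exceed 100% when the engine output contains many more words than the reference, because insertions count as errors.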
Real-time factor (RTF) is the ratio of CPU (processing) time to the length of the input speech file. A speech-to-text engine with lower RTF is more computationally efficient. We omit this metric for cloud-based engines.
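As a rough sketch of how RTF could be measured for a local engine, the snippet below times a call with `time.process_time()`, which reports CPU time of the current process. The names `compute_rtf` and `measure_rtf`, and the `transcribe` callable, are hypothetical and only stand in for whatever engine is being benchmarked.

```python
import time


def compute_rtf(cpu_seconds, audio_seconds):
    """RTF = CPU (processing) time divided by the length of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio length must be positive")
    return cpu_seconds / audio_seconds


def measure_rtf(transcribe, audio, audio_seconds):
    """Time one call to a hypothetical `transcribe(audio)` and return its RTF."""
    # process_time() counts user + system CPU time of this process only,
    # so time spent sleeping or waiting on I/O is not included.
    start = time.process_time()
    transcribe(audio)
    cpu_seconds = time.process_time() - start
    return compute_rtf(cpu_seconds, audio_seconds)
```

For example, an engine that spends 2 CPU seconds on a 4-second file has an RTF of 0.5. Work done in child processes is not captured by `time.process_time()`, so this sketch only suits engines that run in-process.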
Model size is the aggregate size of the models (acoustic and language), in MB. We omit this metric for cloud-based engines.
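One way to obtain such a figure is to sum the on-disk sizes of the model files. The sketch below is illustrative only: `aggregate_model_size_mb` is a hypothetical helper, and treating 1 MB as 1024 × 1024 bytes is an assumption, since the text does not say which convention is used.

```python
import os


def aggregate_model_size_mb(model_paths):
    """Sum the sizes of the given model files and return the total in MB."""
    total_bytes = sum(os.path.getsize(path) for path in model_paths)
    # Assumption: 1 MB = 1024 * 1024 bytes.
    return total_bytes / (1024 * 1024)
```

For an engine that ships separate acoustic and language models, pass both file paths in the list.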
This benchmark has been developed and tested on Ubuntu 20.04.
pip3 install -r requirements.txt
Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${AWS_PROFILE} with the name of the AWS profile you wish to use.
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine AMAZON_TRANSCRIBE \
--aws-profile ${AWS_PROFILE}
Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${AZURE_SPEECH_KEY} and ${AZURE_SPEECH_LOCATION} with the information from your Azure account.
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine AZURE_SPEECH_TO_TEXT \
--azure-speech-key ${AZURE_SPEECH_KEY} \
--azure-speech-location ${AZURE_SPEECH_LOCATION}
Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${GOOGLE_APPLICATION_CREDENTIALS} with the credentials downloaded from Google Cloud Platform.
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine GOOGLE_SPEECH_TO_TEXT \
--google-application-credentials ${GOOGLE_APPLICATION_CREDENTIALS}
Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${WATSON_SPEECH_TO_TEXT_API_KEY} and ${WATSON_SPEECH_TO_TEXT_URL} with the credentials from your IBM account.
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine IBM_WATSON_SPEECH_TO_TEXT \
--watson-speech-to-text-api-key ${WATSON_SPEECH_TO_TEXT_API_KEY} \
--watson-speech-to-text-url ${WATSON_SPEECH_TO_TEXT_URL}
Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, ${DEEP_SPEECH_MODEL} with the path to the DeepSpeech model file (.pbmm), and ${DEEP_SPEECH_SCORER} with the path to the DeepSpeech scorer file (.scorer).
python3 benchmark.py \
--engine MOZILLA_DEEP_SPEECH \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--deepspeech-pbmm ${DEEP_SPEECH_MODEL} \
--deepspeech-scorer ${DEEP_SPEECH_SCORER}
Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${PICOVOICE_ACCESS_KEY} with the AccessKey obtained from Picovoice Console.
python3 benchmark.py \
--engine PICOVOICE_CHEETAH \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}
Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset, and ${PICOVOICE_ACCESS_KEY} with the AccessKey obtained from Picovoice Console.
python3 benchmark.py \
--engine PICOVOICE_LEOPARD \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}
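When running several engines in a row, it can help to assemble these commands programmatically. The sketch below is a hypothetical helper, not part of this repo: `ENGINE_FLAGS` and `build_command` are made-up names, while the engine names and flags are copied from the commands above. Dataset names are not listed in this section, so the usage example keeps them as placeholders.

```python
import shlex

# Engine names mapped to their engine-specific flags, as shown in the commands above.
ENGINE_FLAGS = {
    "AMAZON_TRANSCRIBE": ["--aws-profile"],
    "AZURE_SPEECH_TO_TEXT": ["--azure-speech-key", "--azure-speech-location"],
    "GOOGLE_SPEECH_TO_TEXT": ["--google-application-credentials"],
    "IBM_WATSON_SPEECH_TO_TEXT": [
        "--watson-speech-to-text-api-key",
        "--watson-speech-to-text-url",
    ],
    "MOZILLA_DEEP_SPEECH": ["--deepspeech-pbmm", "--deepspeech-scorer"],
    "PICOVOICE_CHEETAH": ["--picovoice-access-key"],
    "PICOVOICE_LEOPARD": ["--picovoice-access-key"],
}


def build_command(engine, dataset, dataset_folder, **engine_args):
    """Return the benchmark.py argument list for one engine (hypothetical helper)."""
    if engine not in ENGINE_FLAGS:
        raise ValueError(f"unknown engine: {engine}")
    command = [
        "python3", "benchmark.py",
        "--engine", engine,
        "--dataset", dataset,
        "--dataset-folder", dataset_folder,
    ]
    for flag in ENGINE_FLAGS[engine]:
        # "--picovoice-access-key" is supplied as picovoice_access_key=...
        key = flag.lstrip("-").replace("-", "_")
        if key not in engine_args:
            raise ValueError(f"missing value for {flag}")
        command += [flag, engine_args[key]]
    return command


if __name__ == "__main__":
    # Placeholder values; substitute a supported dataset and real paths and keys.
    cmd = build_command(
        "PICOVOICE_LEOPARD",
        "${DATASET}",
        "${DATASET_FOLDER}",
        picovoice_access_key="${PICOVOICE_ACCESS_KEY}",
    )
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with keys or paths that contain special characters.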
Engine | LibriSpeech test-clean | LibriSpeech test-other | TED-LIUM | CommonVoice | Average |
---|---|---|---|---|---|
Amazon Transcribe | 5.20% | 9.58% | 4.25% | 15.94% | 8.74% |
Azure Speech-to-Text | 4.96% | 9.66% | 4.99% | 12.09% | 7.93% |
Google Speech-to-Text | 11.23% | 24.94% | 15.00% | 30.68% | 20.46% |
Google Speech-to-Text (Enhanced) | 6.62% | 13.59% | 6.68% | 18.39% | 11.32% |
IBM Watson Speech-to-Text | 11.08% | 26.38% | 11.89% | 38.81% | 22.04% |
Mozilla DeepSpeech | 7.27% | 21.45% | 18.90% | 43.82% | 22.86% |
Picovoice Cheetah | 7.08% | 16.28% | 10.89% | 23.10% | 14.34% |
Picovoice Leopard | 5.39% | 12.45% | 9.04% | 17.13% | 11.00% |
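The Average column appears to be consistent with an unweighted mean of the four per-dataset values. The following sketch checks that reading against the published numbers; the averaging method is an inference from the table, not something the text states, and the names `WER_PERCENT`, `PUBLISHED_AVERAGE`, and `mean` are introduced here for illustration.

```python
# Per-dataset WER (%) copied from the table above, in column order:
# LibriSpeech test-clean, LibriSpeech test-other, TED-LIUM, CommonVoice.
WER_PERCENT = {
    "Amazon Transcribe": [5.20, 9.58, 4.25, 15.94],
    "Azure Speech-to-Text": [4.96, 9.66, 4.99, 12.09],
    "Google Speech-to-Text": [11.23, 24.94, 15.00, 30.68],
    "Google Speech-to-Text (Enhanced)": [6.62, 13.59, 6.68, 18.39],
    "IBM Watson Speech-to-Text": [11.08, 26.38, 11.89, 38.81],
    "Mozilla DeepSpeech": [7.27, 21.45, 18.90, 43.82],
    "Picovoice Cheetah": [7.08, 16.28, 10.89, 23.10],
    "Picovoice Leopard": [5.39, 12.45, 9.04, 17.13],
}

# Average column (%) copied from the table above.
PUBLISHED_AVERAGE = {
    "Amazon Transcribe": 8.74,
    "Azure Speech-to-Text": 7.93,
    "Google Speech-to-Text": 20.46,
    "Google Speech-to-Text (Enhanced)": 11.32,
    "IBM Watson Speech-to-Text": 22.04,
    "Mozilla DeepSpeech": 22.86,
    "Picovoice Cheetah": 14.34,
    "Picovoice Leopard": 11.00,
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


if __name__ == "__main__":
    for engine, values in WER_PERCENT.items():
        computed = mean(values)
        # Allow for rounding to two decimal places in the published table.
        matches = abs(computed - PUBLISHED_AVERAGE[engine]) < 0.006
        print(f"{engine}: computed {computed:.4f}, published {PUBLISHED_AVERAGE[engine]:.2f}, match={matches}")
```

Because the mean is unweighted, each dataset contributes equally regardless of how many utterances or words it contains.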
Measurements were carried out on an Ubuntu 20.04 machine with an Intel CPU (Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz), 64 GB of RAM, and NVMe storage.
Engine | RTF | Model Size |
---|---|---|
Mozilla DeepSpeech | 0.46 | 1142 MB |
Picovoice Cheetah | 0.07 | 19 MB |
Picovoice Leopard | 0.05 | 19 MB |