Data scientists spend the majority of their time building infrastructure, transferring data, and writing boilerplate code. Hub streamlines these tasks so that users can focus on building amazing machine learning models 💻.
Hub enables users to stream unlimited amounts of data from the cloud to any machine without sacrificing performance compared to local storage 🚀. In addition, Hub connects datasets to PyTorch and TensorFlow with minimal boilerplate code, and we are currently adding powerful tools for dataset version control, building machine learning pipelines, and running distributed workloads.
Hub is best suited for unstructured datasets such as images, videos, point clouds, or text. It works locally or on any cloud.
Google, Waymo, Red Cross, Omdena, and Rarebase use Hub.
Databases, data lakes, and data warehouses are best suited for tabular data and are not optimized for deep-learning applications using data such as images, videos, and text. Hub is a Data 2.0 solution that stores datasets as chunked compressed arrays, which significantly increases data transfer speeds between network-connected machines. This eliminates the need to download entire datasets before running code, because computations and data streaming can occur simultaneously without increasing the total runtime.
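The idea of chunked compressed storage can be sketched in plain Python. This is only an illustration of the general technique, not Hub's actual on-disk format: the chunk size, the float32 encoding, the use of zlib, and the names `write_chunks` and `read_item` are all assumptions made for the example. The point it shows is that a single sample can be read by fetching and decompressing only the chunk that holds it, rather than the whole dataset.

```python
import struct
import zlib

CHUNK_ITEMS = 1024          # hypothetical number of values per chunk
ITEM = struct.Struct("<f")  # each value stored as a little-endian float32


def write_chunks(values, chunk_items=CHUNK_ITEMS):
    """Split values into fixed-size chunks and compress each one independently."""
    chunks = []
    for start in range(0, len(values), chunk_items):
        block = values[start:start + chunk_items]
        raw = struct.pack(f"<{len(block)}f", *block)
        chunks.append(zlib.compress(raw))
    return chunks


def read_item(chunks, index, chunk_items=CHUNK_ITEMS):
    """Return one value by decompressing only the chunk that contains it."""
    chunk_id, offset = divmod(index, chunk_items)
    raw = zlib.decompress(chunks[chunk_id])
    return ITEM.unpack_from(raw, offset * ITEM.size)[0]


# Toy data standing in for a tensor of samples.
values = [float(i % 7) for i in range(10_000)]
chunks = write_chunks(values)

raw_size = len(values) * ITEM.size
compressed_size = sum(len(c) for c in chunks)
print(f"{len(chunks)} chunks, {raw_size} raw bytes, {compressed_size} compressed bytes")
print("value at index 5000:", read_item(chunks, 5000))
```

Because every chunk is compressed on its own, a reader on another machine would only need to transfer the chunks it actually touches, which is what makes streaming practical.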
Hub also significantly reduces the time to build machine learning workflows, because its API eliminates boilerplate code that is typically required for data wrangling ✌️.
Hub is written in 100% Python and can be quickly installed using pip.
pip3 install hub
Accessing datasets in Hub requires a single line of code. Run this snippet to get the first image in the MNIST database as a NumPy array:
import hub

# Load the public MNIST training set from Activeloop's storage
mnist = hub.load("hub://activeloop/mnist-train")
# Index the images tensor to fetch only the first sample
mnist_np = mnist.images[0].numpy()
To access your own Hub dataset stored in the cloud and train a classifier on it, run:
import hub

# Replace the path with the location of your own dataset
my_dataset = hub.load("s3://bucket_name/dataset_folder")
my_dataloader = my_dataset.pytorch(batch_size=16, num_workers=4)

for batch in my_dataloader:
    print(batch)
    # Training loop goes here
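To illustrate why streaming does not have to slow training down, here is a minimal sketch of prefetching in plain Python. It is not how Hub or PyTorch implement their loaders: `fetch_batches`, `prefetch`, and `train_step` are hypothetical names, and the sleeps merely stand in for network latency and compute. A background thread keeps fetching upcoming batches into a bounded queue while the main thread processes the current one, so the two kinds of work overlap.

```python
import queue
import threading
import time


def fetch_batches(num_batches, batch_size):
    """Stand-in for a remote dataset: yields batches after a simulated delay."""
    for b in range(num_batches):
        time.sleep(0.01)  # pretend network latency
        yield list(range(b * batch_size, (b + 1) * batch_size))


def prefetch(iterable, buffer_size=4):
    """Yield items from `iterable` while a background thread fetches ahead."""
    sentinel = object()
    buf = queue.Queue(maxsize=buffer_size)

    def worker():
        try:
            for item in iterable:
                buf.put(item)  # blocks when the buffer is full
        finally:
            buf.put(sentinel)  # always signal the end of the stream

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is sentinel:
            return
        yield item


def train_step(batch):
    """Placeholder for real training work on one batch."""
    time.sleep(0.01)  # pretend compute
    return sum(batch)


totals = [train_step(batch) for batch in prefetch(fetch_batches(8, 16))]
print(totals)
```

The bounded queue keeps memory use predictable: the fetching thread pauses once it is `buffer_size` batches ahead of the consumer.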
Getting started guides, examples, tutorials, API reference, and other usage information can be found on our documentation page.
Hub users can access and visualize a variety of popular datasets through a free integration with Activeloop's Platform. Users can also create and store their own datasets and make them available to the public. Free storage of up to 300 GB is available.
Hub and DVC both offer Git-like version control for datasets, but they store data very differently. Hub converts and stores data as chunked compressed arrays, which enables rapid streaming to ML models, whereas DVC operates on top of data stored in less efficient traditional file structures. When a dataset is composed of many files (e.g. many images), the Hub format makes versioning significantly easier than the traditional file structures used by DVC. An additional distinction is that DVC primarily uses a command-line interface, whereas Hub is a Python package. Lastly, Hub offers an API to easily connect datasets to ML frameworks and other common ML tools.
Hub and TFDS both connect popular datasets to ML frameworks. Hub datasets are compatible with both PyTorch and TensorFlow, whereas TFDS datasets are only compatible with TensorFlow. A key difference is that Hub datasets are designed for streaming from the cloud, whereas TFDS datasets must be downloaded locally prior to use. In addition to providing access to popular publicly available datasets, Hub offers tools for creating custom datasets, storing them on a variety of cloud storage providers, and collaborating with others. TFDS focuses primarily on giving the public easy access to commonly available datasets, and management of custom datasets is not its main goal.
Hub and HuggingFace both offer access to popular datasets, but Hub primarily focuses on computer vision, whereas HuggingFace primarily focuses on natural language processing. HuggingFace Transformers and other computational tools for NLP have no analogue among the features offered by Hub.
Join our Slack community to learn more about unstructured dataset management using Hub and to get help from the Activeloop team and other users.
We'd love your feedback. Please take a moment to complete our 3-minute survey.
As always, thanks to our amazing contributors!
Made with contributors-img.
Please read CONTRIBUTING.md to get started with making contributions to Hub.
Using Hub? Add a README badge to let everyone know:
Hub users may have access to a variety of publicly available datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the datasets. It is your responsibility to determine whether you have permission to use the datasets under their license.
If you're a dataset owner and do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thank you for your contribution to the ML community!
By default, we collect anonymous usage data using Bugout (here's the code that does it). It does not collect user data and it only logs the Hub library's own actions. This helps our team understand how the tool is used and how to build features that matter to you! After you register with Activeloop, data is no longer anonymous, but you can opt out of reporting using the CLI command below:
activeloop reporting --off
This technology was inspired by our research work at Princeton University. We would like to thank William Silversmith @SeungLab for his awesome cloud-volume tool.