BitMagic was created as a Algebra of Sets toolkit for Information Retrieval but currently evolved into a more general Data Science components library forn memory compact big data structures and algorithms on succinct data vectors. BitMagic implements compressed bit-vectors and containers (vectors) based on ideas of bit-plane transform and Rank-Select compression. All containers are serializable (with compression) for efficient storage and network transfer.
BitMagic offers sets of method to architect your applications to use HPC techniques to save memory (thus be able to fit more data in one compute unit) and improve storage and traffic patterns when storing data vectors and models in files or object stores (SQL or noSQL).
BitMagic facilitates two big classes of scenarios:
BitMagic was used as a building blocks for:
Please visit our use-case notes: http://bitmagic.io/use-case.html
BitMagic library is a high performance library, implementing custom optimizations for variety of platforms and build targets:
BitMagic uses a data-parallel vectorized design with a goal not just provide a best single threaded performance but to facilitate highly parallel compute on many-core systems.
BitMagic uses a suite of compression algorithms, filters and transformations to reduce memory footprint, storage costs and network data transfer. http://bitmagic.io/design.html
Please visit our tech.notes: http://bitmagic.io/articles.html
BitMagic supports re-interpretation of bit-vectors as collections of non-overlapping ranges of 1s flanked with 0s (for example: 011110110). Regular set functions provide set intersect / unions intreval operations implement interval iterator and searches for interval boundaries. Ranges and intervals has great utility in bioinformatics, because genomics data are often annotated as coordinate ranges. BitMagic offers building blocks for effciient oprations on intervals encoded as bit-vectors (find interval start/end, check if range is an inetrval, iterate intervals
BitMagic uses contept of two-stage serialization-deserialization. The focus is on fast deserialization. BitMagic implements API for fast vector range deserialization and gather deserialization of compressed BLOBs. The ultimate feature of BitMagic is ability to work with compressed data.
This is the main in-RAM operational state, where vectors are kept in memory compact form. Succinct is NOT a compression. It is possible to access random elements in containers, decode blocks, iterate vectors, make updates, run search algorithms. Stage One offers transparent use, it vectors look much like STL. Succinct is memory compact but not fully compressed.
BitMagic can serialize all containers and vectors with additional compression based on block of heuristics and codecs. The workhorse coding techniques are: Binary Interpolative Coding (BIC) and Elias Gamma.
BitMagic containers are called "sparse" vectors but in fact its compression schemes works well for both sparse and dense data.
BitMagic is tested on Gov2 benchmark set of inverted lists and number of proprietory data sets. http://bitmagic.io/bm5-cmpr.html
Deserialization always go back to Stage One, so data are not completely decoded but instead
succinct in RAM. Our goal here is to both reduce application memory footprint and improve deserialization latency. Decompression algorithms support deserialization of arbitrary ranges or even gather deserializatin of elements.
BitMagic supports succinct (memory compact) vectors based on bit-transposition transform
also known as bit-plane compression (BPC) or bit-slicing and Rank-Select compression. Succint vectors are labeled "sparse" but they also work for dense vectors.
Bit transposition solves two purposes: free unused bit plains and isolate regularity and entropy into separate (sparse) bit-vectors. Compression on bit-planes offers both superior memory performance and fast search. One of the design golas was to be able to perform searches on succinct data using vectorized logical operations.
Succinct bit-transposed implementation works well for both integer vectors and string vectors and it rivals other succinct schemes like prefix trees. Succinct vectors can be both sorted and unsorted. The idea here is similar to Apache Arrow-Parquet, but it takes it farther with bit-plane compression and extensive use of accelerated Rank-Select compression.
BitMagic supports serialization evolution - if serialization format changes, old saved data remains readable by the new code. Old code will NOT be able to read new BLOBs. BitMagic changes major version number when serialization format changes.
BitMagic implements memory profiling calls for all vectors. Any vector can be sampled for memory footprint so the top level system can adapt memory managemnet based on the runtime memory profiling. Typical use case is memory cache of objects with compression to RAM and then to eviction to disk based on resource consumption and costs (dynamic balance of demand and supply).
Yes! BitMagic supports 64-bit, can be used with 32-bit address space (less overhead) or full 64-bit address space. 32-bit address space is the default mode 2^31-1 elements should be a good fit for short to medium range IR and data science systems. 64-bit address mode is available using #define BM64ADDR or #include "bm64.h". Current 64-bit implementation allows 2^48-1 vector elements for large scale systems.
BitMagic compiles and work with WebAssmbly (emscripten). Latest versions includes multiple tweaks, specific for the platform. Performance numbers are close to native code without SIMD (sometimes afster). Sample compile line would look like:
emcc -std=c++11 -s ALLOW_MEMORY_GROWTH=1 -O2 -s WASM=1 ...
BitMagic fully supports ARM CPU. All releases are stress tested with Raspberry Pi 4. BitMagic implements some algorithmic tweaks and improvements specific for ARM (like use of LZCNT instruction). BitMagic succinct containers can be very useful on embedded systems for edge computing with limited amount of available memory.
ARM NEON SIMD support is a work in progress.
compressed binary relational and adjacency matrixes and operations on matrixes for Entity-Relationship acceleration, graph operations, social analyticsm materialized RDBMS joins, etc
succinct data structures and containers based on bit-transposed data representation and rank-select compression
memory efficient dictionaries of strings as an alternative to suffix trees
BitMagic C++ is a header only library (easy to build and use in your project) and it comes with a set of examples. It is NOT advised to use tests as a code example to study library usage. Tests do not illustate the best usage patterns and models and often intentionally inefficient.
API documentation and examples: http://www.bitmagic.io/apis.html
Tutorial for Algebra of Sets: http://bitmagic.io/set-algebra.html
Use Cases and Application notes: http://bitmagic.io/use-case.html
Technical notes on performance optmization: http://bitmagic.io/articles.html
Important! We ask you to explicitly mention BitMagic project in any derived work or our published materials. Proper reference on your product/project page is a REQUIREMENT for using BitMagic Library.
BitMagic library pays serious attention to code quality and test coverage.
As a building blocks library BitMagic needs to be stable and conformant to be useful.
We do not rely on unit tests alone, our tests often use "chaos testing" (aka fuzzing) where stress tests are based on randomized, generated sets and randomized operations. We regularly build and run test suits for Release and Debug mode for various combinations of SIMD optimizations.
All variants of test builds take days to run, so the working master branch is NOT guaranteed to be perfect all the time. For production please use stable github release branches or distributions from SourceForge: https://sourceforge.net/projects/bmagic/files/
GitHub master accepts patch requests Our branching policy is that master cannot be considered fully stable between the releases. (for production stability please use release versions)
Need help with mappings to Python and other languages (BitMagic has C bindings)
BitMagic C++ is a header-only software package and you probably can just take the sources and put it into your project directly. All C++ library sources/headers are in src directory.
However if you want to use our makefiles you need to follow the next simple instructions:
Apply a few environment variables by runing bmenv.sh in the project root directory:
$ source ./bmenv.sh
(please use "." "./bmenv.sh" to apply root environment variable)
use GNU make (gmake) to build installation.
or (DEBUG version)
$gmake DEBUG=YES rebuild
The default compiler on Unix and CygWin is g++. If you want to change the default you can do that in makefile.in (should be pretty easy to do)
If you use cygwin installation please follow general Unix recommendations. MSVC - solution and projects are available via CMAKE.
XCODE - project files are available via CMAKE.
BitMagic library for C and JNI mappings.
BitMagic library is available for C language (this is work in progress). The main objective of C build is to bridge BitMagic into other programming languages. C build is in the subdirectory "lang-maps".
C build creates versions of BitMagic build for SSE and AVX and adds CPU identification, so the upper level system can support dynamic CPU identification and code dispatch.
C build uses C++ compiler, but does not use RTTI, exceptions (simulated with long jump) and C++ memory management, so it is C++ language neutral, without runtime dependencies. Algorithms and behavior are shared between C and C++.
Current state of development:
Python support is pending and we need help here. If you are enthusiastic about Python and think you can help please contact: anatoliy.kuznetsov @ yahoo dot com
All BM fine tuning parameters are controlled by the preprocessor defines (and compiler keys).
BM library supports CXX-11. Move semantics, noexept, initalizer lists.
To turn on SSE2 optimization #define BMSSE2OPT in your build environment. To use SSE4.2 #define BMSSE42OPT SSE42 optimization automatically assumes SSE2 as a subset of SSE4.2. (you don’t have to use both BMSSE2OPT and BMSSE42OPT at the same time).
To turn on AVX2 - #define BMAVX2OPT This will automatically enable AVX2 256-bit SIMD, popcount (SSE4.2) and other compatible hardware instructions.
BM library does NOT support multiple code paths and runtime CPU identification. You have to build specifically for your target system or use default portable build.
To correctly build for the target SIMD instruction set - please set correct code generation flags for the build environment.
BitMagic examples and tests can be build with GCC using cmd-line settings:
make BMOPTFLAGS=-DBMAVX2OPT rebuild
make BMOPTFLAGS=-DBMSSE42OPT rebuild
It automatically applies the right set of compiler (GCC) flags for the target build.
CMAKE cd build cmake -DBMOPTFLAGS:STRING=BMSSE42OPT .. make
cmake -DBMOPTFLAGS:STRING=BMAVX2OPT ..
BM library supports "restrict" keyword, some compilers (for example Intel C++) generate better code (out of order load-stores) when restrict keyword is helping. This option is turned OFF by default since most of the C++ compilers does not support it. To turn it ON please #define BM_HASRESTRICT in your project. Some compilers use "__restrict" keyword for this purpose. To correct it define BMRESTRICT macro to correct keyword.
If you want to use BM library in a "no-STL project" (like embedded systems) define BM_NO_STL.
This rule only applies to the core bm::bvector<> methods. Auxiliary algorithms, examples, etc would still use STL.
Follow us on twitter: https://twitter.com/bitmagicio
Thank you for using BitMagic library!
e-mail: [email protected]
WEB site: http://bitmagic.io