Fuzzymatcher

Fast fuzzy regex matcher: specify max edit distance to find approximate matches
Alternatives To Fuzzymatcher
Project NameStarsDownloadsRepos Using ThisPackages Using ThisMost Recent CommitTotal ReleasesLatest ReleaseOpen IssuesLicenseLanguage
Leaderf2,042
3 days ago52apache-2.0Python
An efficient fuzzy finder that helps to locate files, buffers, mrus, gtags, etc. on the fly for both vim and neovim.
Fuzzy1,22112974 months ago6September 01, 20185mitGo
Go library that provides fuzzy string matching optimized for filenames and code symbols in the style of Sublime Text, VSCode, IntelliJ IDEA et al.
Fuzzymatcher31
8 days agobsd-3-clauseC++
Fast fuzzy regex matcher: specify max edit distance to find approximate matches
Fz4
6 years agoJavaScript
Fast, simple, fuzzy string searching for JavaScript.
Bitap3
4 years agoRust
Fuzzy string search algorithm in Rust
Alternatives To Fuzzymatcher
Select To Compare


Alternative Project Comparisons
Readme

FuzzyMatcher

A C++ class extension of the RE/flex Matcher class for efficient fuzzy matching and fuzzy search with regex patterns.

  • specify max error as a parameter, i.e. the max edit distance or Levenshstein distance

  • regex patterns are compiled into DFA VM opcodes for speed

  • practically linear execution time in the length of the input, using DFA-based matching with minimal backtracking limited by the specified max error parameter

  • supports the full RE/flex regex pattern syntax, which is POSIX-based with many additions: https://www.genivia.com/doc/reflex/html/#reflex-patterns

  • no group captures (yet), except for top-level sub-pattern group captures, e.g. (foo)|(bar)|(baz) but not (foo(bar))

  • newlines (\n) and NUL (\0) characters are never deleted or substituted to ensure that fuzzy matches do not extend the pattern match beyond the number of lines specified by the regex pattern

  • quote regex patterns with \Q and \E for fuzzy string matching and search

  • FuzzyMatcher is used in the ugrep project

Requires

RE-Flex version 3.4.0 or greater.

Examples

pattern max fuzzy find() matches but not
abc 1 abc, ab, ac, axc, axbc a, axx, axbxc, bc
año 1 año, ano, ao anno, ño
ab_cd 2 ab_cd, ab-cd, ab Cd, abCd ab\ncd, Ab_cd, Abcd
a[0-9]+z 1 a1z, a123z, az, axz axxz, A123z, 123z

Note that the first character of the pattern must match when searching a corpus with the find() method. By contrast, the matches() method to match a corpus from start to end does not impose this requirement:

pattern max fuzzy matches() matches but not
abc 1 abc, ab, ac, Abc, xbc bc, axc, axbc a, axx, Ab, axbxc
año 1 año, Año, ano, ao, ño anno
ab_cd 2 ab_cd, Ab_Cd, ab-cd, ab Cd, Ab_cd, abCd ab\ncd, AbCd
a[0-9]+z 1 a1z, A1z, a123z, az, Az, axz, 123z axxz

Optimizations

Fuzzy find() and split() make a second pass over a fuzzy-matched pattern when the match has a nonzero error. This second pass checks if an exact match exists or if a better match exists that overlaps with the first pattern found. For example, the pattern abc is found to fuzzy match all of the text aabc with one error (an extra a). The second pass of find() detects an exact match after skipping the first a. Likewise, the pattern abc is found to fuzzy match ababc with a match for aba with one error (substitution of c by an a). The second pass of find() detects an exact match after skipping ab in the text. This approach is faster than minimizing the edit distance when searching text, while returning exact matches when possible.

Usage

Fuzzy searching

#include "fuzzymatcher.h"

// MAX:   optional maximum edit distance, default is 1, up to 255
// INPUT: a string, wide string, FILE*, or std::istream object
reflex::FuzzyMatcher matcher("PATTERN", [MAX,] INPUT);

// find all pattern matches in the input
while (matcher.find())
{
  std::cout << matcher.text() << '\n'  // show each fuzzy match
  std::cout << matcher.edits() << '\n' // edit distance (when > 0 not guaranteed minimal)
}

See the RE/flex user guide for the full list of Matcher class methods available to extract match info. The edits() method is a FuzzyMatcher extension of the Matcher class.

Fuzzy matching

#include "fuzzymatcher.h"

// match the whole input (here in one go with a temporary fuzzy matcher object)
if (reflex::FuzzyMatcher("PATTERN", [MAX,] INPUT).matches())
{
  std::cout << "fuzzy pattern matched\n";
}

Fuzzy splitting (text between matches)

#include "fuzzymatcher.h"

reflex::FuzzyMatcher matcher("PATTERN", [MAX,] INPUT);

// split the input into parts separated by pattern matches
while (matcher.split())
{
  std::cout << matcher.text() << '\n' // show text between fuzzy matches
}

Character insertion, deletion and substitution

The MAX parameter may be combined with one or more of the following flags:

  • reflex::FuzzyMatcher::INS insertions allow extra character(s) in the input
  • reflex::FuzzyMatcher::DEL deletions allow missing character(s) in the input
  • reflex::FuzzyMatcher::SUB substitutions count as one edit
  • reflex::FuzzyMatcher::BIN ASCII/binary fuzzy matching (default is Unicode with Unicode pattern converter, see below)

For example, to allow approximate pattern matches to include up to three character insertions, but no deletions or substitutions (allowing insertions only is actually the most efficient fuzzy matching possible):

reflex::FuzzyMatcher matcher(regex, 3 | reflex::FuzzyMatcher::INS, INPUT);

To allow up to three insertions or deletions (note that a substitution counts as two edits: one insertion and one deletion):

reflex::FuzzyMatcher matcher(regex, 3 | reflex::FuzzyMatcher::INS | reflex::FuzzyMatcher::DEL, INPUT);

When no flags are specified with MAX, fuzzy matching is performed with insertions, deletions, and substitutions, each counting as one edit.

Full Unicode support

To support full Unicode pattern matching, such as \p Unicode character classes, convert the regex pattern before using it as follows:

std::string regex(reflex::Matcher::convert("PATTERN", reflex::convert_flag::unicode));
reflex::FuzzyMatcher matcher(regex, [MAX,] INPUT);

Static regex patterns

Fixed patterns should be constructed (and optionally Unicode converted) just once statically to avoid repeated construction, e.g. in the body of loops and in frequently executed functions:

static const reflex::Pattern pattern(reflex::Matcher::convert("PATTERN", reflex::convert_flag::unicode));
reflex::FuzzyMatcher matcher(pattern, [MAX,] INPUT);

Requires

RE/flex downloaded and locally built or globally installed to access the reflex/include and reflex/lib files.

Compiling

Assuming reflex dir with source code is locally built in the project dir:

c++ -o myapp myapp.cpp -Ireflex/include reflex/lib/libreflex.a

Or when the libreflex library is installed:

c++ -o myapp myapp.cpp -lreflex

Testing

$ make ftest
$ ./ftest 'ab_cd' 'abCd' 2
matches(): match (2 edits)
find():    'abCd' at 0 (2 edits)
split():   '' at 0 (2 edits)
split():   '' at 4 (0 edits)

License

BSD-3

Popular Fuzzy Search Projects
Popular Character Projects
Popular Data Processing Categories
Related Searches

Get A Weekly Email With Trending Projects For These Categories
No Spam. Unsubscribe easily at any time.
C Plus Plus
Character
Regular Expression
Fuzzy Search
Fuzzy Matching