Awesome Open Source
Advertising
📦 10
All Projects
Application Programming Interfaces
📦 124
Applications
📦 192
Artificial Intelligence
📦 78
Blockchain
📦 73
Build Tools
📦 113
Cloud Computing
📦 80
Code Quality
📦 28
Collaboration
📦 32
Command Line Interface
📦 49
Community
📦 83
Companies
📦 60
Compilers
📦 63
Computer Science
📦 80
Configuration Management
📦 42
Content Management
📦 175
Control Flow
📦 213
Data Formats
📦 78
Data Processing
📦 276
Data Storage
📦 135
Economics
📦 64
Frameworks
📦 215
Games
📦 129
Graphics
📦 110
Hardware
📦 152
Integrated Development Environments
📦 49
Learning Resources
📦 166
Legal
📦 29
Libraries
📦 129
Lists Of Projects
📦 22
Machine Learning
📦 347
Mapping
📦 64
Marketing
📦 15
Mathematics
📦 55
Media
📦 239
Messaging
📦 98
Networking
📦 315
Operating Systems
📦 89
Operations
📦 121
Package Managers
📦 55
Programming Languages
📦 245
Runtime Environments
📦 100
Science
📦 42
Security
📦 396
Social Media
📦 27
Software Architecture
📦 72
Software Development
📦 72
Software Performance
📦 58
Software Quality
📦 133
Text Editors
📦 49
Text Processing
📦 136
User Interface
📦 330
User Interface Components
📦 514
Version Control
📦 30
Virtualization
📦 71
Web Browsers
📦 42
Web Servers
📦 26
Web User Interface
📦 210
The Top 50 Tokenizer Open Source Projects
Tokenizer
⭐
4,633
A small library for converting tokenized PHP source code into XML (and potentially other formats)
Chevrotain
⭐
1,584
Parser Building Toolkit for JavaScript
Natasha
⭐
796
Solves basic Russian NLP tasks, API for lower level Natasha projects
Mustard
⭐
691
🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.
Soynlp
⭐
622
A Python library for Korean natural language processing. Provides word extraction, tokenization, part-of-speech tagging, and preprocessing.
Kagome
⭐
556
Self-contained Japanese Morphological Analyzer written in pure Go
Smoothnlp
⭐
449
An NLP toolset with a focus on explainable inference
Open Korean Text
⭐
446
Open Korean Text Processor - An Open-source Korean Text Processor
Ekphrasis
⭐
446
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and Twitter, 330 million English tweets).
Moo
⭐
442
Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
Php Parser
⭐
407
🌿 NodeJS PHP Parser - extract AST or tokens (PHP5 and PHP7)
Jflex
⭐
390
The fast scanner generator for Java™ with full Unicode support
Lexmachine
⭐
340
Lex machinery for Go.
Friso
⭐
318
High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Fully modular and easily embedded in other programs such as MySQL, PostgreSQL, and PHP.
Sacremoses
⭐
308
Python port of Moses tokenizer, truecaser and normalizer
Sentences
⭐
294
A multilingual command line sentence tokenizer in Golang
Jumanpp
⭐
257
Juman++ (a Morphological Analyzer Toolkit)
Js Tokens
⭐
171
Tiny JavaScript tokenizer.
Bitextor
⭐
169
Bitextor generates translation memories from multilingual websites
Query Translator
⭐
166
Query Translator is a search query translator with AST representation
Udpipe
⭐
163
R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
Tokenizers
⭐
162
Fast, Consistent Tokenization of Natural Language Text
Lex
⭐
137
Replaced by foonathan/lexy
Tokenizer
⭐
136
Fast and customizable text tokenization library with BPE and SentencePiece support
Works For Me
⭐
132
Collection of developer toolkits
Fugashi
⭐
132
A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
Syntok
⭐
127
Text tokenization and sentence segmentation (segtok v2)
Tokenizer
⭐
121
Source code tokenizer
Japanesetokenizers
⭐
120
Aims to make JapaneseTokenizer as easy to use as possible
Kadot
⭐
108
Kadot, the unsupervised natural language processing library.
Megamark
⭐
100
😻 Markdown with easy tokenization, a fast highlighter, and a lean HTML sanitizer
Somajo
⭐
88
A tokenizer and sentence splitter for German and English web and social media texts.
Djurl
⭐
85
Simple yet helpful library for writing Django URLs in an easy, short, and intuitive way.
Sentence Splitter
⭐
84
Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder.
Hippo
⭐
82
PHP standards checker.
Cols Agent Tasks
⭐
70
Colin's ALM Corner Custom Build Tasks
Wirb
⭐
69
Ruby Object Inspection for IRB
String Calc
⭐
59
PHP calculator library for mathematical terms (expressions) passed as strings
Thot
⭐
53
Thot toolkit for statistical machine translation
Greynir
⭐
48
The greynir.is natural language processing website for Icelandic
Py Nltools
⭐
47
A collection of basic Python modules for spoken natural language processing
Talismane
⭐
40
NLP framework: sentence detector, tokeniser, pos-tagger and dependency parser
Sharpmath
⭐
36
A small .NET math library.
Nlp Js Tools French
⭐
32
POS tagger, lemmatizer, and stemmer for the French language in JavaScript
Omnicat Bayes
⭐
30
Naive Bayes text classification implementation as an OmniCat classifier strategy. (#ruby #naivebayes)
Lfuzzer
⭐
28
Fuzzing Parsers with Tokens
Lisp Esque Language
⭐
24
💠The Lel programming language
Snl Compiler
⭐
21
SNL (Small Nested Language) Compiler. Maven, JUnit, tokenizer, lexer, syntax parser. Compiler principles, lexical analysis, syntax analysis.
Laravel Token
⭐
10
Laravel token management
React Input Tags
⭐
10
React component for tagging inputs.
1-50 of 50 projects