Awesome Open Source
Search results for tokenizer
661 search results found
Sentencepiece
⭐
8,851
Unsupervised text tokenizer for Neural Network-based text generation.
Tokenizers
⭐
8,056
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Gpt2 Chinese
⭐
7,249
Chinese version of GPT2 training code, using BERT tokenizer.
Php Token Stream
⭐
6,457
Wrapper around PHP's tokenizer extension.
Tokenizer
⭐
5,084
A small library for converting tokenized PHP source code into XML (and potentially other formats)
File Type
⭐
3,366
Detect the file type of a Buffer/Uint8Array/ArrayBuffer
Tntsearch
⭐
3,004
A fully featured full text search engine written in PHP
Chevrotain
⭐
2,381
Parser Building Toolkit for JavaScript
Text
⭐
1,172
Making text a first-class citizen in TensorFlow.
Hazm
⭐
1,102
Persian NLP Toolkit
Natasha
⭐
1,085
Solves basic Russian NLP tasks, API for lower level Natasha projects
Kobert
⭐
1,035
Korean BERT pre-trained cased (KoBERT)
Nlp With Ruby
⭐
1,002
Curated List: Practical Natural Language Processing done in Ruby
Autophrase
⭐
978
AutoPhrase: Automated Phrase Mining from Massive Text Corpora
Superpower
⭐
848
A C# parser construction toolkit with high-quality error reporting
Soynlp
⭐
801
A Python library for Korean natural language processing. Provides word extraction, tokenization, part-of-speech tagging, and preprocessing.
Kagome
⭐
769
Self-contained Japanese Morphological Analyzer written in pure Go
Moo
⭐
763
Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
Libinjection
⭐
759
SQL / SQLI tokenizer parser analyzer
Parsimmon
⭐
714
Parsimmon is a wee linguistics toolkit for iOS written in Swift.
Elasticsearch Inquisitor
⭐
712
Site plugin for Elasticsearch to help understand and debug queries.
Mustard
⭐
686
🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.
Goro
⭐
673
PHP in Go
React Typeahead
⭐
660
Pure react-based typeahead and typeahead-tokenizer
Wordless
⭐
649
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Ru_transformers
⭐
627
Laracms
⭐
598
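LaraCMS is a hobby project that grew out of learning Laravel (advanced hands-on web development plus building an API server in practice). It aims to provide a simple way to quickly build a basic corporate website while keeping flexible extensibility and elegance. It is also a good reference example for learning Laravel.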
Ekphrasis
⭐
583
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from 2 big corpora (English Wikipedia and Twitter, 330 million English tweets).
Open Korean Text
⭐
552
Open Korean Text Processor - An Open-source Korean Text Processor
Jflex
⭐
523
The fast scanner generator for Java™ with full Unicode support
Parsekit
⭐
503
Objective-C Tokenizer and Parser Generator. Supports Grammars.
Php Parser
⭐
500
🌿 NodeJS PHP Parser - extract AST or tokens
Bayes
⭐
494
Naive-Bayes Classifier for node.js
Gretel Synthetics
⭐
490
Synthetic data generators for structured and unstructured text, featuring differentially private learning.
Php Text Analysis
⭐
484
PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language
Transformers.jl
⭐
479
Julia Implementation of Transformer models
Sacremoses
⭐
476
Python port of Moses tokenizer, truecaser and normalizer
Elasticsearch Analysis Vietnamese
⭐
470
Vietnamese Analysis Plugin for Elasticsearch
Friso
⭐
449
High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Fully modular and easily embedded in other programs such as MySQL, PostgreSQL, and PHP.
Cogcomp Nlp
⭐
448
CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
Ckip Transformers
⭐
439
CKIP Transformers
Node Question Answering
⭐
418
Fast and production-ready question answering in Node.js
Bert Japanese
⭐
415
BERT with SentencePiece for Japanese text.
Simple
⭐
411
A SQLite3 fts5 full-text search extension (tokenizer) that supports Chinese and Pinyin
Js Tokens
⭐
410
Tiny JavaScript tokenizer.
Mmseg4j Solr
⭐
403
mmseg4j for lucene or solr analyzer
Tokenmonster
⭐
399
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
Tiktokenizer
⭐
397
Online playground for OpenAI tokenizers
Sentences
⭐
391
A multilingual command line sentence tokenizer in Golang
Kogpt2
⭐
382
Korean GPT-2 pretrained cased (KoGPT2)
Spyglass
⭐
378
A library for mentions on Android
Kldns
⭐
371
"Happy" second-level domain distribution system
Lexmachine
⭐
370
Lex machinery for Go.
Nagisa
⭐
365
A Japanese tokenizer based on recurrent neural networks
Tabloid
⭐
365
A minimal programming language inspired by clickbait headlines
Php Short Array Syntax Converter
⭐
352
Command-line script to convert PHP's array() syntax to PHP 5.4's short array syntax []
Kcbert
⭐
344
🤗 Pretrained BERT model and WordPiece tokenizer trained on Korean comments, together with the dataset
Structural Probes
⭐
340
Codebase for testing whether hidden states of neural networks encode discrete structures.
Fugashi
⭐
339
A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
Melusine
⭐
335
Melusine is a high-level library for email classification and feature extraction, "dedicated to French emails".
Jumanpp
⭐
334
Juman++ (a Morphological Analyzer Toolkit)
Lindera
⭐
326
A morphological analysis library.
Kobart
⭐
320
Korean BART
Smoothnlp
⭐
320
An NLP toolset with a focus on explainable inference
Deepcut
⭐
319
A Thai word tokenization library using Deep Neural Network
Sudachipy
⭐
318
Python version of Sudachi, a Japanese tokenizer.
Mail Parser
⭐
311
Tokenizer for raw mails
Nmt Chatbot
⭐
309
NMT Chatbot
Gpt Tokenizer
⭐
309
JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4. Port of OpenAI's tiktoken with additional features.
Vscode Blockman
⭐
304
VSCode extension to highlight nested code blocks
Ts Parsec
⭐
301
Writing a custom parser is a fairly common need. Although there are already parser combinators in other languages, TypeScript provides a powerful and well-structured foundation for building one. Common weaknesses of parser combinators are error handling and ambiguity resolution, but these are important features of ts-parsec. Additionally, ts-parsec provides a very easy-to-use programming interface, which could help people build programming-language-scale parsers in just a few hours. This technolog
Elasticsearch Analysis Jieba
⭐
296
The plugin includes the `jieba` analyzer, `jieba` tokenizer, and `jieba` token filter, and offers two modes: `index`, used when indexing a document, and `search`, used when searching.
Coccoc Tokenizer
⭐
295
High-performance tokenizer for the Vietnamese language
Tsql Parser
⭐
278
Library Written in C# For Parsing SQL Server T-SQL Scripts in .Net
Vibrato
⭐
275
🎤 vibrato: Viterbi-based accelerated tokenizer
Text2text
⭐
268
Text2Text: Crosslingual NLP/G toolkit
Bitextor
⭐
260
Bitextor generates translation memories from multilingual websites
Llama Tokenizer Js
⭐
250
JS tokenizer for LLaMA
Rust Tokenizers
⭐
232
Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models
Dt Sql Parser
⭐
228
SQL Parsers for BigData, built with antlr4.
Tokenizer
⭐
224
Fast and customizable text tokenization library with BPE and SentencePiece support
Opennlp
⭐
221
Open source NLP tools (sentence splitter, tokenizer, chunker, coref, NER, parse trees, etc.) in C#
Segmentit
⭐
208
A Chinese word segmentation package usable in any JS environment, forked from leizongmin/node-segment
Vaporetto
⭐
206
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
Elasticsearch Analysis Hao
⭐
201
A very handy ("hao") Chinese tokenizer plugin for Elasticsearch (ES)
Konoha
⭐
200
🌿 An easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code.
Udpipe
⭐
198
R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
Kobert Transformers
⭐
185
KoBERT on 🤗 Huggingface Transformers 🤗 (with Bug Fixed)
Mpmd
⭐
181
Magento Project Mess Detector (for n98-magerun)
Query Translator
⭐
176
Query Translator is a search query translator with AST representation
Tokenizers
⭐
170
Fast, Consistent Tokenization of Natural Language Text
Peast
⭐
165
JavaScript parser written in PHP that generates AST from your code according to ECMAScript specification
Tiktoken Rs
⭐
163
Ready-made tokenizer library for working with GPT and tiktoken
Syntok
⭐
158
Text tokenization and sentence segmentation (segtok v2)
Naive Bayes Classifier
⭐
157
Yet another general-purpose naive Bayesian classifier.
Vntk
⭐
155
Vietnamese NLP Toolkit for Node
Transformer Lm
⭐
155
Transformer language model (GPT-2) with sentencepiece tokenizer
Gruut
⭐
153
A tokenizer, text cleaner, and phonemizer for many human languages.
Segtok
⭐
151
Segtok v2 is here: https://github.com/fnl/syntok -- A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features.
Nlc
⭐
150
Neural Language Correction implemented on Tensorflow
1-100 of 661 search results
Copyright 2018-2024 Awesome Open Source. All rights reserved.