Whatlanggo

Natural language detection library for Go

Natural language detection for Go.

Features

  • Supports 84 languages
  • 100% written in Go
  • No external dependencies
  • Fast
  • Recognizes not only a language, but also a script (Latin, Cyrillic, etc)
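
To illustrate what script recognition means, here is a minimal standard-library sketch that picks the dominant Unicode script of a string. It is not whatlanggo's implementation: the `candidateScripts` list and the `dominantScript` helper are hypothetical names used only for this example, and only four scripts are checked.

```go
package main

import (
	"fmt"
	"unicode"
)

// candidateScripts is a small, hypothetical subset of scripts to check.
// The range tables come from the standard "unicode" package.
var candidateScripts = []struct {
	name  string
	table *unicode.RangeTable
}{
	{"Latin", unicode.Latin},
	{"Cyrillic", unicode.Cyrillic},
	{"Hebrew", unicode.Hebrew},
	{"Arabic", unicode.Arabic},
}

// dominantScript returns the name of the candidate script that covers the
// most runes in text, or "" if no rune belongs to any candidate script.
func dominantScript(text string) string {
	counts := make([]int, len(candidateScripts))
	for _, r := range text {
		for i, s := range candidateScripts {
			if unicode.Is(s.table, r) {
				counts[i]++
				break
			}
		}
	}
	best, bestCount := "", 0
	for i, c := range counts {
		if c > bestCount {
			best, bestCount = candidateScripts[i].name, c
		}
	}
	return best
}

func main() {
	fmt.Println(dominantScript("Foje funkcias kaj foje ne funkcias")) // Latin
	fmt.Println(dominantScript("Привет, мир"))                        // Cyrillic
	fmt.Println(dominantScript("האקדמיה ללשון העברית"))               // Hebrew
}
```

Knowing the script first narrows the set of candidate languages, since most languages are written in only one or two scripts.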

Getting started

Installation:

    go get -u github.com/abadojack/whatlanggo

Simple usage example:

package main

import (
	"fmt"

	"github.com/abadojack/whatlanggo"
)

func main() {
	// Detect returns the most likely language, its script, and a confidence score.
	info := whatlanggo.Detect("Foje funkcias kaj foje ne funkcias")
	fmt.Println("Language:", info.Lang.String(), "Script:", whatlanggo.Scripts[info.Script], "Confidence:", info.Confidence)
}

Blacklisting and whitelisting

package main

import (
	"fmt"

	"github.com/abadojack/whatlanggo"
)

func main() {
	// Blacklist: exclude Yiddish (Ydd) from the candidate languages.
	options := whatlanggo.Options{
		Blacklist: map[whatlanggo.Lang]bool{
			whatlanggo.Ydd: true,
		},
	}

	info := whatlanggo.DetectWithOptions("האקדמיה ללשון העברית", options)

	fmt.Println("Language:", info.Lang.String(), "Script:", whatlanggo.Scripts[info.Script])

	// Whitelist: only consider Esperanto (Epo) and Ukrainian (Ukr).
	options1 := whatlanggo.Options{
		Whitelist: map[whatlanggo.Lang]bool{
			whatlanggo.Epo: true,
			whatlanggo.Ukr: true,
		},
	}

	info = whatlanggo.DetectWithOptions("Mi ne scias", options1)
	fmt.Println("Language:", info.Lang.String(), "Script:", whatlanggo.Scripts[info.Script])
}

For more details, please check the documentation.

Requirements

Go 1.8 or higher

How does it work?

How does the language recognition work?

The algorithm is based on trigram language models, which are a particular case of n-grams. To understand the idea, please check the original paper by Cavnar and Trenkle (1994), "N-Gram-Based Text Categorization".
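
The trigram idea can be sketched with the standard library alone. The code below follows the general rank-order ("out-of-place") comparison described by Cavnar and Trenkle, not whatlanggo's actual code: the helper names (`trigramCounts`, `rankedProfile`, `outOfPlace`), the toy sample sentences, and the penalty of 300 are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trigramCounts counts every overlapping 3-rune window in the lowercased text.
func trigramCounts(text string) map[string]int {
	runes := []rune(strings.ToLower(text))
	counts := make(map[string]int)
	for i := 0; i+3 <= len(runes); i++ {
		counts[string(runes[i:i+3])]++
	}
	return counts
}

// rankedProfile returns the trigrams of text ordered from most to least
// frequent. Ties are broken alphabetically so the result is deterministic.
func rankedProfile(text string) []string {
	counts := trigramCounts(text)
	profile := make([]string, 0, len(counts))
	for t := range counts {
		profile = append(profile, t)
	}
	sort.Slice(profile, func(i, j int) bool {
		if counts[profile[i]] != counts[profile[j]] {
			return counts[profile[i]] > counts[profile[j]]
		}
		return profile[i] < profile[j]
	})
	return profile
}

// outOfPlace sums, for each trigram in textProfile, the distance between its
// rank there and its rank in langProfile. A trigram missing from langProfile
// costs maxPenalty. A lower total means a closer match.
func outOfPlace(textProfile, langProfile []string, maxPenalty int) int {
	rank := make(map[string]int, len(langProfile))
	for i, t := range langProfile {
		rank[t] = i
	}
	dist := 0
	for i, t := range textProfile {
		j, ok := rank[t]
		if !ok {
			dist += maxPenalty
			continue
		}
		d := i - j
		if d < 0 {
			d = -d
		}
		dist += d
	}
	return dist
}

func main() {
	// Toy "language profiles" built from single sentences (illustration only;
	// real profiles are built from large corpora).
	english := rankedProfile("the quick brown fox jumps over the lazy dog and the cat")
	esperanto := rankedProfile("la rapida bruna vulpo saltas super la mallaborema hundo kaj la kato")

	input := rankedProfile("the dog and the fox")
	fmt.Println("distance to English:  ", outOfPlace(input, english, 300))
	fmt.Println("distance to Esperanto:", outOfPlace(input, esperanto, 300))
}
```

The language whose profile gives the smallest distance is taken as the detected language.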

How is IsReliable calculated?

It is based on the following factors:

  • How many unique trigrams the given text contains
  • How big the difference is between the first and the second (not returned) detected languages. This metric is called rate in the code base.

Therefore, it can be presented as a 2D space with a threshold function that splits it into "Reliable" and "Not reliable" areas. This function is a hyperbola and it looks like the following one:

[Figure: language recognition threshold plot (whatlang, Rust)]

For more details, please see the blog article "Introduction to Rust Whatlang Library and Natural Language Identification Algorithms".
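
The two factors above can be combined into a hyperbolic threshold along these lines. This is a sketch under stated assumptions, not the library's code: the constants `baseThreshold` and `scale`, the function names, and the exact definition of the rate are hypothetical and chosen only to show the shape of the rule.

```go
package main

import "fmt"

// Hypothetical constants, chosen only for illustration.
const (
	baseThreshold = 0.8 // lead required when the text has very many trigrams
	scale         = 3.0 // how much extra lead short texts must show
)

// requiredRate is a hyperbola in the number of unique trigrams:
// the fewer trigrams a text has, the larger the lead it must show.
func requiredRate(uniqueTrigrams int) float64 {
	return scale/float64(uniqueTrigrams) + baseThreshold
}

// rate is the relative lead of the best language over the runner-up,
// assuming scores where higher means a better match.
func rate(bestScore, secondScore float64) float64 {
	return (bestScore - secondScore) / secondScore
}

// isReliable reports whether the point (uniqueTrigrams, rate) lies above the
// hyperbola. Inputs that cannot be compared are treated as not reliable.
func isReliable(uniqueTrigrams int, bestScore, secondScore float64) bool {
	if uniqueTrigrams <= 0 || secondScore <= 0 {
		return false
	}
	return rate(bestScore, secondScore) > requiredRate(uniqueTrigrams)
}

func main() {
	// Same lead over the runner-up, different text lengths.
	fmt.Println(isReliable(30, 300, 100)) // long text
	fmt.Println(isReliable(2, 300, 100))  // very short text
	// Long text, but the two best languages are almost tied.
	fmt.Println(isReliable(30, 100, 90))
}
```

With these illustrative constants, the same lead counts as reliable for a text with 30 unique trigrams but not for one with only 2, which is the behavior the hyperbola is meant to capture.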

License

MIT

Derivation

whatlanggo is a derivative of Franc (JavaScript, MIT) by Titus Wormer.

Acknowledgements

Thanks to greyblake (Potapov Sergey) for creating whatlang-rs, from which I got the idea and algorithms.
