Awesome Open Source

Search results for language corpus

99 search results found

Nltk ⭐ 12,699

Bert Pytorch ⭐ 5,605

Google AI 2018 BERT pytorch implementation

Corpora ⭐ 4,757

A collection of small corpuses of interesting data for the creation of bots and similar stuff.

Duckling ⭐ 3,974

Language, engine, and tooling for expressing, testing, and evaluating composable language rules on input strings.

Laser ⭐ 3,460

Language-Agnostic SEntence Representations

Nlp_tasks ⭐ 2,904

Natural Language Processing Tasks and References

Chatterbot Corpus ⭐ 1,219

A multilingual dialog corpus

The Classical Language Toolkit

Lingua Rs ⭐ 774

The most accurate natural language detection library for Rust, suitable for short text and mixed-language text

Indicnlp_catalog ⭐ 487

A collaborative catalog of NLP resources for Indic languages

This repository contains source code for the TaBERT model, a pre-trained language model for learning joint representations of natural language utterances and (semi-)structured tables for semantic parsing. TaBERT is pre-trained on a massive corpus of 26M Web tables and their associated natural language context, and could be used as a drop-in replacement for a semantic parser's original encoder to compute representations for utterances and table schemas (columns).

Fast_align ⭐ 377

Simple, fast unsupervised word aligner

Evaluating Cross-lingual Sentence Representations

Bicleaner ⭐ 134

Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.

Bible Corpus ⭐ 134

A multilingual parallel corpus created from translations of the Bible.

A Corpus for Multilingual Document Classification in Eight Languages.

Pre Modern_chinese_corpus_dataset ⭐ 132

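Pre-modern Chinese corpus dataset: natural language processing corpus, ancient Chinese, classical Chinese (wenyanwen), digital humanities, computational linguistics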

How2 Dataset ⭐ 125

This repository contains code and metadata of How2 dataset

Natural Language Preprocessings ⭐ 123

Some recipes of natural language pre-processing

A fast LSTM Language Model for large vocabulary language like Japanese and Chinese

Lingtrain Aligner ⭐ 98

Lingtrain Aligner: an ML-powered library for accurate text alignment.

Opusfilter ⭐ 88

OpusFilter - Parallel corpus processing toolkit

Nlp Resources ⭐ 85

A useful list of NLP(Natural Language Processing) resources

Afl Compiler Fuzzer ⭐ 82

Variation of american fuzzy lop for testing compilers

Translit Rnn ⭐ 78

Automatic transliteration with LSTM

Greek Bert ⭐ 74

A Greek edition of BERT pre-trained language model

Kneser Ney ⭐ 61

Kneser-Ney implementation in Python

Wikiedits ⭐ 61

Automatic extraction of edited sentences from text edition histories.

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl, this contains higher-level tools that use the library as well as the full documentation, validation schemas,

Nlp For Hindi ⭐ 59

State of the Art Language models and Classifier for the Hindi language (spoken in the Indian subcontinent)

Pretraining For Language Understanding ⭐ 59

Pre-training of Language Models for Language Understanding

Wikipedia Parallel Titles ⭐ 53

Tools for extracting parallel corpora from article titles across languages in Wikipedia

List of research and engineering of NLP for American Native/Indigenous Languages.

Parallel Corpora Tools ⭐ 39

Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.

Autocorpus ⭐ 38

AutoCorpus is a set of utilities that enable automatic extraction of language corpora and language models from publicly available datasets. Autocorpus utilities follow the Unix design philosophy and integrate easily into custom data processing pipelines.

Textprep ⭐ 33

Textprep is an analysis tool for both parallel and non-parallel corpora and their downstream Natural Language Processing and Machine Translation tasks. It is designed especially for logographic languages such as Chinese and Japanese.

Streamcorpus ⭐ 33

common data interchange format for document processing pipelines that apply natural language processing tools to large streams of text

Proiel Treebank ⭐ 31

Official releases of the PROIEL treebank of ancient Indo-European languages

Bigfatlm ⭐ 30

Hadoop MapReduce training of modified Kneser-Ney smoothed language models

Generating Text Small Corpus ⭐ 29

Generating style-specific text from a small corpus of 2.5k sentences using a pre-trained language model. Code in PyTorch

Russian Ulmfit ⭐ 27

AWD-LSTM language model trained on newspaper corpora with fast.ai

Odia Nlp Resource Catalog ⭐ 26

Nlp For Tamil ⭐ 26

State of the Art Language models and Classifier for the Tamil language (spoken in India and a few other South Asian countries)

Spacy_russian_tokenizer ⭐ 26

Custom Russian tokenizer for spaCy

Awesome Kurdish ⭐ 25

A curated list of awesome resources and tools for Kurdish language technology

Awesome Azeri Nlp ⭐ 24

Azerbaijani language processing software, models and datasets.

Community Playbook ⭐ 24

Mozilla Voice Community Playbook

Ukuxhumana ⭐ 24

Neural Machine Translation for South African Languages

How I Extracted Ted Talks For Parallel Corpus ⭐ 22

Aelius is a suite of Python, NLTK-based modules and language data for training and evaluating POS-taggers for Brazilian Portuguese and annotating corpora in this language variety.

Cmusphinx Models ⭐ 19

Acoustic and language models for minorised languages.

Quran And Arabic Language Repository ⭐ 18

Projects & Libraries related to Quran & Arabic Language

En Az Parallel Corpus ⭐ 18

English-Azerbaijani parallel language corpus

Thai Language ⭐ 18

Computer tools for the Thai language

Farsinlp.github.io ⭐ 17

Datasets for Farsi (Persian) Natural Language Processing (NLP)

Mboshi French Parallel Corpus ⭐ 17

Covid19 Datashare ⭐ 17

A repo for sharing language resources related to the outbreak (in machine readable format)

Languagetool Neural Network ⭐ 16

A small statistical segmenter for any language.

Thailmcut ⭐ 15

Cnn Ld Tf ⭐ 13

Convolutional Neural Network for Language Detection in Tensorflow

Word2vec Embeddings For Nepali Language ⭐ 13

Word Embeddings (Word2Vec) for Nepali Language

German2vec ⭐ 13

Language Model and Text Classification for German Language using Deep Learning

Lknlp.github.io ⭐ 13

Language Detector ⭐ 13

Detect the language of text

Mslt Corpus ⭐ 13

Microsoft Speech Language Translation (MSLT) Corpus

Course on Language Technologies and NLP

Arabic Language Model based on Bert

Ngram Language Model ⭐ 11

An implementation of an HMM n-gram language model.

Malay Nlp Dataset ⭐ 11

A collection of NLP resources for Malay

Kaldifordummies ⭐ 11

Simple automatic speech recognition system based on digits corpora (Polish language), created in the Kaldi toolkit. Despite the language difference, this is a result of the 'Kaldi for dummies' tutorial published in the kaldi-help discussion group. No audio data is included; this is just an example.

Textsummarizer ⭐ 11

A text summarization tool for Marathi, implemented as a project for the course Advanced NLP (CSCI 544)

Distributed Translation Infrastructure ⭐ 11

A distributed statistical machine translation infrastructure consisting of load balancing, text pre/post-processing, and translation services. Written in C++11, it uses multicore CPUs through multi-threading and allows for secure SSL/TLS communications.

Asosoft Text Corpus ⭐ 11

AsoSoft Text Corpus is the first large scale text corpus for the Kurdish language.

Everyfinnishword ⭐ 10

Every Finnish word

Langdist ⭐ 10

Multilingual Language Modeling Toolkit

Gachalign ⭐ 10

Gale-Church sentence aligner with options for variable parameters

Nlp Tools ⭐ 10

Tools for Natural Language Processing

Language Modeling ⭐ 9

Language modeling on the Penn Treebank (PTB) corpus using a trigram model with linear interpolation, a neural probabilistic language model, and a regularized LSTM.

Mylist_thainlp_group ⭐ 9

📝 Translation of query languages to serialized KoralQuery protocol

Corpus_similarity ⭐ 9

Measure the similarity of text corpora for 74 languages

Nlp For Malyalam ⭐ 8

State of the Art Language models and Classifier for Malayalam, which is spoken by the Malayali people in the Indian state of Kerala and the union territories of Lakshadweep and Puducherry

More Stoplists ⭐ 8

stoplists for African languages generated from the ASP corpus

Token Rnn Tensorflow ⭐ 8

Multi-layer Recurrent Neural Networks (LSTM, RNN) for token-level language models in Python using Tensorflow

Building and Using A Seed Corpus for the Human Language Project

Constructiveness ⭐ 8

Identifying constructive language in online communication

Translation without parallel corpora.

Ted Dataset ⭐ 7

Apertiumpp ⭐ 7

Poio Corpus ⭐ 7

The Poio Corpus is a freely available collection of language resources for the lesser-used languages. The data is extracted from free sources like Wikipedia, dictionaries, documents, websites and others.

Neural_language_model_bangla ⭐ 7

A neural language model trained from the bangla wiki corpus

Gsoc Akkadian ⭐ 7

This code is being made for Google's 2018 Summer of Code on behalf of CLTK.

Facebookdecadecorpora ⭐ 7

Two large language corpora extracted from Facebook, focused primarily on Sinhala text. Timestamped statuses with origin markers. Rudimentary stopwords list included.

Perplexity ⭐ 6

Language Distance Measure

Cuneiform ⭐ 6

Machine translation and word embeddings of cuneiform corpuses

Hindi Nli Data ⭐ 6

A repository containing the details of a natural language inference dataset in Hindi

Fork of Presage (http://presage.sourceforge.net/)

Detect Program Language from Source Code Using Naive Bayes Classifier

Copyright 2018-2024 Awesome Open Source. All rights reserved.