Awesome Open Source
Search results for language corpus
99 search results found
Nltk
⭐
12,699
NLTK Source
Bert Pytorch
⭐
5,605
Google AI 2018 BERT pytorch implementation
Corpora
⭐
4,757
A collection of small corpuses of interesting data for the creation of bots and similar stuff.
Duckling
⭐
3,974
Language, engine, and tooling for expressing, testing, and evaluating composable language rules on input strings.
Laser
⭐
3,460
Language-Agnostic SEntence Representations
Nlp_tasks
⭐
2,904
Natural Language Processing Tasks and References
Chatterbot Corpus
⭐
1,219
A multilingual dialog corpus
Cltk
⭐
810
The Classical Language Toolkit
Lingua Rs
⭐
774
The most accurate natural language detection library for Rust, suitable for short text and mixed-language text
Indicnlp_catalog
⭐
487
A collaborative catalog of NLP resources for Indic languages
Tabert
⭐
436
This repository contains source code for the TaBERT model, a pre-trained language model for learning joint representations of natural language utterances and (semi-)structured tables for semantic parsing. TaBERT is pre-trained on a massive corpus of 26M Web tables and their associated natural language context, and can be used as a drop-in replacement for a semantic parser's original encoder to compute representations for utterances and table schemas (columns).
Fast_align
⭐
377
Simple, fast unsupervised word aligner
Xnli
⭐
334
Evaluating Cross-lingual Sentence Representations
Bicleaner
⭐
134
Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
Bible Corpus
⭐
134
A multilingual parallel corpus created from translations of the Bible.
Mldoc
⭐
132
A Corpus for Multilingual Document Classification in Eight Languages.
Pre Modern_chinese_corpus_dataset
⭐
132
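Pre-modern Chinese corpus dataset: natural language processing, corpora, ancient Chinese, Old Chinese, Classical Chinese (wenyanwen), digital humanities, computational linguistics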
How2 Dataset
⭐
125
This repository contains code and metadata of How2 dataset
Natural Language Preprocessings
⭐
123
Some recipes of natural language pre-processing
Jlm
⭐
99
A fast LSTM Language Model for large vocabulary language like Japanese and Chinese
Lingtrain Aligner
⭐
98
Lingtrain Aligner: an ML-powered library for accurate text alignment.
Opusfilter
⭐
88
OpusFilter - Parallel corpus processing toolkit
Nlp Resources
⭐
85
A useful list of NLP(Natural Language Processing) resources
Afl Compiler Fuzzer
⭐
82
Variation of American Fuzzy Lop (AFL) for testing compilers
Translit Rnn
⭐
78
Automatic transliteration with LSTM
Greek Bert
⭐
74
A Greek edition of BERT pre-trained language model
Kneser Ney
⭐
61
Kneser-Ney implementation in Python
Wikiedits
⭐
61
Automatic extraction of edited sentences from text edition histories.
Folia
⭐
60
FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl, this contains higher-level tools that use the library as well as the full documentation, validation schemas,
Nlp For Hindi
⭐
59
State-of-the-art language models and classifier for the Hindi language (spoken in the Indian subcontinent)
Pretraining For Language Understanding
⭐
59
Pre-training of Language Models for Language Understanding
Wikipedia Parallel Titles
⭐
53
Tools for extracting parallel corpora from article titles across languages in Wikipedia
Naki
⭐
49
List of research and engineering of NLP for American Native/Indigenous Languages.
Parallel Corpora Tools
⭐
39
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
Autocorpus
⭐
38
AutoCorpus is a set of utilities that enable automatic extraction of language corpora and language models from publicly available datasets. Autocorpus utilities follow the Unix design philosophy and integrate easily into custom data processing pipelines.
Textprep
⭐
33
Textprep is an analyzing tool for both parallel and non-parallel corpus and its down-stream Natural Language Processing and Machine Translation tasks. It is designed especially for logographic languages such as Chinese and Japanese.
Streamcorpus
⭐
33
common data interchange format for document processing pipelines that apply natural language processing tools to large streams of text
Proiel Treebank
⭐
31
Official releases of the PROIEL treebank of ancient Indo-European languages
Bigfatlm
⭐
30
Hadoop MapReduce training of modified Kneser-Ney smoothed language models
Generating Text Small Corpus
⭐
29
Generating style-specific text from a small corpus of 2.5k sentences using a pre-trained language model. Code in PyTorch
Russian Ulmfit
⭐
27
AWD-LSTM language model trained on newspaper corpora with fast.ai
Odia Nlp Resource Catalog
⭐
26
Nlp For Tamil
⭐
26
State-of-the-art language models and classifier for the Tamil language (spoken in India and a few other South Asian countries)
Spacy_russian_tokenizer
⭐
26
Custom Russian tokenizer for spaCy
Awesome Kurdish
⭐
25
A curated list of awesome resources and tools for Kurdish language technology
Awesome Azeri Nlp
⭐
24
Azerbaijani language processing software, models and datasets.
Community Playbook
⭐
24
Mozilla Voice Community Playbook
Ukuxhumana
⭐
24
Neural Machine Translation for South African Languages
How I Extracted Ted Talks For Parallel Corpus
⭐
22
Aelius
⭐
20
Aelius is a suite of Python, NLTK-based modules and language data for training and evaluating POS-taggers for Brazilian Portuguese and annotating corpora in this language variety.
Cmusphinx Models
⭐
19
Acoustic and language models for minorised languages.
Quran And Arabic Language Repository
⭐
18
Projects & Libraries related to Quran & Arabic Language
En Az Parallel Corpus
⭐
18
English-Azerbaijani parallel language corpus
Thai Language
⭐
18
Computer tools for the Thai language
Farsinlp.github.io
⭐
17
Datasets for Farsi (Persian) Natural Language Processing (NLP)
Mboshi French Parallel Corpus
⭐
17
Covid19 Datashare
⭐
17
A repo for sharing language resources related to the outbreak (in machine readable format)
Languagetool Neural Network
⭐
16
Konlp
⭐
16
KoNLTK source
Maxixe
⭐
16
A small statistical segmenter for any language.
Thailmcut
⭐
15
Cnn Ld Tf
⭐
13
Convolutional Neural Network for Language Detection in Tensorflow
Word2vec Embeddings For Nepali Language
⭐
13
Word Embeddings (Word2Vec) for Nepali Language
German2vec
⭐
13
Language Model and Text Classification for German Language using Deep Learning
Lknlp.github.io
⭐
13
Language Detector
⭐
13
Detect the language of text
Mslt Corpus
⭐
13
Microsoft Speech Language Translation (MSLT) Corpus
Lt1
⭐
13
Course on Language Technologies and NLP
Arabert
⭐
12
Arabic Language Model based on Bert
Ngram Language Model
⭐
11
An implementation of an HMM Ngram language model.
Malay Nlp Dataset
⭐
11
A collection of NLP resources for Malay
Kaldifordummies
⭐
11
Simple automatic speech recognition system based on a digits corpus (Polish language), created with the Kaldi toolkit. Despite the language difference, this is a result of the 'Kaldi for dummies' tutorial published in the kaldi-help discussion group. No audio data is included; this is just an example.
Textsummarizer
⭐
11
A text summarization tool for Marathi, implemented as a project for the course Advanced NLP (CSCI 544)
Distributed Translation Infrastructure
⭐
11
A distributed statistical machine translation infrastructure consisting of load balancing, text pre/post-processing and translation services. Written in C++11, it utilises multicore CPUs through multi-threading and allows for secure SSL/TLS communications.
Asosoft Text Corpus
⭐
11
AsoSoft Text Corpus is the first large scale text corpus for the Kurdish language.
Everyfinnishword
⭐
10
Every Finnish word
Langdist
⭐
10
Multilingual Language Modeling Toolkit
Gachalign
⭐
10
Gale-Church sentence aligner with options for variable parameters
Nlp Tools
⭐
10
Tools for Natural Language Processing
Language Modeling
⭐
9
Language modeling on the Penn Treebank (PTB) corpus using a trigram model with linear interpolation, a neural probabilistic language model, and a regularized LSTM.
Mylist_thainlp_group
⭐
9
Koral
⭐
9
📝 Translation of query languages to serialized KoralQuery protocol
Corpus_similarity
⭐
9
Measure the similarity of text corpora for 74 languages
Nlp For Malyalam
⭐
8
State of the Art Language models and Classifier for Malayalam, which is spoken by the Malayali people in the Indian state of Kerala and the union territories of Lakshadweep and Puducherry
More Stoplists
⭐
8
stoplists for African languages generated from the ASP corpus
Token Rnn Tensorflow
⭐
8
Multi-layer Recurrent Neural Networks (LSTM, RNN) for token-level language models in Python using Tensorflow
Seedling
⭐
8
Building and Using A Seed Corpus for the Human Language Project
Constructiveness
⭐
8
Identifying constructive language in online communication
Babel
⭐
7
Translation without parallel corpora.
Ted Dataset
⭐
7
Apertiumpp
⭐
7
Apertium++!
Poio Corpus
⭐
7
The Poio Corpus is a freely available collection of language resources for the lesser-used languages. The data is extracted from free sources like Wikipedia, dictionaries, documents, websites and others.
Neural_language_model_bangla
⭐
7
A neural language model trained on the Bangla wiki corpus
Gsoc Akkadian
⭐
7
This code is being made for Google's 2018 Summer of Code on behalf of CLTK.
Facebookdecadecorpora
⭐
7
Two large language corpora extracted from Facebook, focused primarily on Sinhala text. Timestamped statuses with origin markers. Rudimentary stopwords list included.
Perplexity
⭐
6
Language Distance Measure
Cuneiform
⭐
6
Machine translation and word embeddings of cuneiform corpuses
Hindi Nli Data
⭐
6
A repository containing the details of a natural language inference dataset in Hindi
Presage
⭐
6
Fork of Presage (http://presage.sourceforge.net/)
Polyglot
⭐
6
Detect the programming language of source code using a Naive Bayes classifier
Copyright 2018-2024 Awesome Open Source. All rights reserved.