Awesome Open Source

Search results for natural language processing corpus

122 search results found

Nltk ⭐ 12,699

Nlp_chinese_corpus ⭐ 8,344

Large Scale Chinese Corpus for NLP

Bert Pytorch ⭐ 5,605

Google AI 2018 BERT pytorch implementation

Nlp Datasets ⭐ 5,235

Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)

Nlp_tasks ⭐ 2,904

Natural Language Processing Tasks and References

Uer Py ⭐ 2,802

Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo

Cluedatasetsearch ⭐ 2,778

Search all Chinese NLP datasets, with commonly used English NLP datasets included

Awesome Deeplearning Resources ⭐ 2,739

Deep learning and deep reinforcement learning research papers and some code

Trafilatura ⭐ 2,447

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Gpt2 Ml ⭐ 1,674

GPT2 for Multiple Languages, including pretrained models. Multilingual GPT2 support, with a 1.5-billion-parameter Chinese pretrained model

Tensorflow 1.4 Billion Password Analysis ⭐ 1,657

Deep Learning model to analyze a large corpus of clear text passwords.

Chinese Annotator ⭐ 1,431

Annotator for Chinese Text Corpus (UNDER DEVELOPMENT): a Chinese text annotation tool

Entity Recognition Datasets ⭐ 1,386

A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.

Insuranceqa Corpus Zh ⭐ 989

🚁 Insurance industry corpus, chatbot

Seq2seq Chatbot ⭐ 826

Chatbot in 200 lines of code using TensorLayer

Memn2n Tensorflow ⭐ 820

"End-To-End Memory Networks" in Tensorflow

Quanteda ⭐ 818

An R package for the Quantitative Analysis of Textual Data

The Classical Language Toolkit

Lingua Rs ⭐ 774

The most accurate natural language detection library for Rust, suitable for short text and mixed-language text

Bookcorpus ⭐ 698

Crawl BookCorpus

Awesome Persian Nlp Ir ⭐ 658

Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources

Ekphrasis ⭐ 583

Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia and Twitter, 330 million English tweets).

Deep neural network framework for multi-label text classification

Ner Lstm ⭐ 528

Named Entity Recognition using multilayered bidirectional LSTM

Cluecorpus2020 ⭐ 517

Large-scale Pre-training Corpus for Chinese (100 GB)

Efaqa Corpus Zh ⭐ 505

❤️ Emotional First Aid Dataset: psychological counseling Q&A and chatbot corpus

Awesome Korean Nlp ⭐ 495

A curated list of resources for NLP (Natural Language Processing) for Korean

Indicnlp_catalog ⭐ 487

A collaborative catalog of NLP resources for Indic languages

Awesome Bangla ⭐ 472

A collection of tools, datasets and resources on Bangla computing

Chinese Nlp Corpus ⭐ 378

Collections of Chinese NLP corpus

Chinesenlpcorpus ⭐ 362

A collection of Chinese NLP corpora, including a basic Chinese syntactic word set, a semantic word set, domain-specific synchronic and diachronic (historical) corpora, and evaluation corpora.

German Nlp ⭐ 360

Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German

Pykospacing ⭐ 348

Automatic Korean word spacing with Python

A Neural Framework for MT Evaluation

Pycantonese ⭐ 290

Cantonese Linguistics and NLP

Nlp_bahasa_resources ⭐ 260

A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia

Multi Criteria Cws ⭐ 260

Simple Solution for Multi-Criteria Chinese Word Segmentation

Links to Russian corpora + Python functions for loading and parsing

Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.

Spanish Word Embeddings ⭐ 248

Spanish word embeddings computed with different methods and from different corpora

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

PYthon Automated Term Extraction

Mishkal ⭐ 232

Mishkal is an Arabic text vocalization software

Germanwordembeddings ⭐ 224

Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim) and evaluate them with generated test sets

Parsbert ⭐ 222

🤗 ParsBERT: Transformer-based Model for Persian Language Understanding

Naturalcc ⭐ 220

NaturalCC: An Open-Source Toolkit for Code Intelligence

Id Nlp Resource ⭐ 211

A list of Indonesian NLP resources.

Awesome Hungarian Nlp ⭐ 192

A curated list of NLP resources for Hungarian

Unify Emotion Datasets ⭐ 189

A Survey and Experiments on Annotated Corpora for Emotion Classification in Text

Fakenewscorpus ⭐ 184

A dataset of millions of news articles scraped from a curated list of data sources.

Bi Lstm Crf ⭐ 180

A PyTorch implementation of the BI-LSTM-CRF model.

Robbert ⭐ 180

A Dutch RoBERTa-based language model

Awesome Nlp Polish ⭐ 169

A curated list of resources dedicated to Natural Language Processing (NLP) in Polish. Models, tools, datasets.

Wordgcn ⭐ 167

ACL 2019: Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks

Pubmed Rct ⭐ 166

PubMed 200k RCT dataset: a large dataset for sequential sentence classification.

A command-line toolkit to extract text content and category data from Wikipedia dump files

Pre Modern_chinese_corpus_dataset ⭐ 132

Pre-modern Chinese corpus dataset; NLP corpus; ancient Chinese; Classical Chinese; digital humanities; computational linguistics

Id Cnn Cws ⭐ 130

Source codes and corpora of paper "Iterated Dilated Convolutions for Chinese Word Segmentation"

Natural Language Preprocessings ⭐ 123

Some recipes of natural language pre-processing

Colibri Core ⭐ 122

Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e., patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller`` which allows you to build, view, manipulate and query pattern models.

Hubot Natural ⭐ 121

Natural Language Processing Chatbot for RocketChat

Lexicon Thai ⭐ 119

Thai vocabulary repository

Lawcrimemining ⭐ 117

Law and crime mining based on corpus construction and content analysis with NLP methods: a text-mining project on court judgments and criminal cases built on a domain corpus

Open Korean Corpora ⭐ 117

Open Korean NLP Dataset Curation for the Users All Around the Globe

A tool that locates, downloads, and extracts machine translation corpora

A comparison tool of Japanese tokenizers

Crfsharp ⭐ 109

CRFSharp is an implementation of Conditional Random Fields in .NET (C#), a machine learning algorithm for learning from labeled sequences of examples.

Treebankpreprocessing ⭐ 106

Python scripts for preprocessing Penn Treebank and Chinese Treebank

Chinese_nlu_by_using_rasa_nlu ⭐ 106

Use RASA NLU to build a Chinese Natural Language Understanding System (NLU)

Japanese word embedding with Sudachi and NWJC 🌿

Prosody ⭐ 104

Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text

Syntactic ⭐ 102

Lexical categorization engine for large datasets. Good for NLP and Data Mining.

Clustype ⭐ 100

Automatic Entity Recognition and Typing for Domain-Specific Corpora (KDD'15)

Wpcorpus ⭐ 98

wpcorpus - NLP corpus based on Wikipedia's full article dump

Indonesian Nlp Resources ⭐ 98

Data resources for Indonesian-language NLP

Awesome Speech Translation ⭐ 98

Weak Supervision For Ner ⭐ 97

Framework to learn Named Entity Recognition models without labelled data using weak supervision.

Opusfilter ⭐ 88

OpusFilter - Parallel corpus processing toolkit

Day-by-day line-by-line Keras-based Korean NLP

Self_dialogue_corpus ⭐ 86

The Self-dialogue Corpus - a collection of self-dialogues across music, movies and sports

Tutorialbank ⭐ 85

Nlp Resources ⭐ 85

A useful list of NLP(Natural Language Processing) resources

Phrase At Scale ⭐ 84

Detect common phrases in large amounts of text using a data-driven approach. Size of discovered phrases can be arbitrary. Can be used in languages other than English

An automated ingestion service for blogs to construct a corpus for NLP research.

Sadedegel ⭐ 81

A General Purpose NLP library for Turkish

Word2vec ⭐ 81

word2vec++ is a Distributed Representations of Words (word2vec) library and tools implementation, written in C++11 from scratch

Arabic Bert ⭐ 80

Arabic edition of BERT pretrained language models

Translit Rnn ⭐ 78

Automatic transliteration with LSTM

Germalemma ⭐ 77

A lemmatizer for German language text

Russian_news_corpus ⭐ 76

Russian mass media stemmed texts corpus: a corpus of lemmatized (morphologically normalized) texts from Russian mass media

Greek Bert ⭐ 74

A Greek edition of BERT pre-trained language model

Ja.text8 ⭐ 74

Japanese text8 corpus for word embedding.

Jrte Corpus ⭐ 73

Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)

A Java package for the LDA and DMM topic models

Transition-based UCCA Parser

Rdrsegmenter ⭐ 67

A Fast and Accurate Vietnamese Word Segmenter (LREC 2018)

DANeS is an open-source E-newspaper dataset by collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)

Nlp Corpus ⭐ 65

Varied English texts for modern NLP testing

Query Wellformedness ⭐ 63

25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are well-formed natural language questions.

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl, this contains higher-level tools that use the library as well as the full documentation, validation schemas,

1-100 of 122 search results
