Awesome Open Source
Search results for natural language processing corpus
122 search results found
Nltk
⭐
12,699
NLTK Source
Nlp_chinese_corpus
⭐
8,344
Large Scale Chinese Corpus for NLP
Bert Pytorch
⭐
5,605
Google AI 2018 BERT pytorch implementation
Nlp Datasets
⭐
5,235
Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)
Nlp_tasks
⭐
2,904
Natural Language Processing Tasks and References
Uer Py
⭐
2,802
Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo
Cluedatasetsearch
⭐
2,778
Search all Chinese NLP datasets, with commonly used English NLP datasets included
Awesome Deeplearning Resources
⭐
2,739
Deep Learning and deep reinforcement learning research papers and some codes
Trafilatura
⭐
2,447
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Gpt2 Ml
⭐
1,674
GPT2 for Multiple Languages, including pretrained models, with a 1.5-billion-parameter Chinese pretrained model
Tensorflow 1.4 Billion Password Analysis
⭐
1,657
Deep Learning model to analyze a large corpus of clear text passwords.
Chinese Annotator
⭐
1,431
Annotator for Chinese Text Corpus (UNDER DEVELOPMENT)
Entity Recognition Datasets
⭐
1,386
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
Insuranceqa Corpus Zh
⭐
989
🚁 Insurance industry corpus for chatbots
Seq2seq Chatbot
⭐
826
Chatbot in 200 lines of code using TensorLayer
Memn2n Tensorflow
⭐
820
"End-To-End Memory Networks" in Tensorflow
Quanteda
⭐
818
An R package for the Quantitative Analysis of Textual Data
Cltk
⭐
810
The Classical Language Toolkit
Lingua Rs
⭐
774
The most accurate natural language detection library for Rust, suitable for short text and mixed-language text
Bookcorpus
⭐
698
Crawl BookCorpus
Awesome Persian Nlp Ir
⭐
658
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Ekphrasis
⭐
583
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and 330 million English tweets from Twitter).
Magpie
⭐
574
Deep neural network framework for multi-label text classification
Ner Lstm
⭐
528
Named Entity Recognition using multilayered bidirectional LSTM
Cluecorpus2020
⭐
517
Large-scale Pre-training Corpus for Chinese (100 GB)
Efaqa Corpus Zh
⭐
505
❤️ Emotional First Aid Dataset: a psychological counseling Q&A and chatbot corpus
Awesome Korean Nlp
⭐
495
A curated list of resources for NLP (Natural Language Processing) for Korean
Indicnlp_catalog
⭐
487
A collaborative catalog of NLP resources for Indic languages
Awesome Bangla
⭐
472
A collection of tools, datasets and resources on Bangla computing
Chinese Nlp Corpus
⭐
378
Collections of Chinese NLP corpus
Chinesenlpcorpus
⭐
362
A collection of Chinese NLP corpora, including basic Chinese syntactic wordsets, semantic wordsets, synchronic domain corpora, diachronic (historical) corpora, and evaluation corpora.
German Nlp
⭐
360
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
Pykospacing
⭐
348
Automatic Korean word spacing with Python
Comet
⭐
346
A Neural Framework for MT Evaluation
Pycantonese
⭐
290
Cantonese Linguistics and NLP
Nlp_bahasa_resources
⭐
260
A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
Multi Criteria Cws
⭐
260
Simple Solution for Multi-Criteria Chinese Word Segmentation
Corus
⭐
254
Links to Russian corpora + Python functions for loading and parsing
Nlvr
⭐
250
Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
Spanish Word Embeddings
⭐
248
Spanish word embeddings computed with different methods and from different corpora
Ua Gec
⭐
246
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Pyate
⭐
242
PYthon Automated Term Extraction
Mishkal
⭐
232
Mishkal is an Arabic text vocalization software
Germanwordembeddings
⭐
224
Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim), and evaluate them with generated test sets
Parsbert
⭐
222
🤗 ParsBERT: Transformer-based Model for Persian Language Understanding
Naturalcc
⭐
220
NaturalCC: An Open-Source Toolkit for Code Intelligence
Id Nlp Resource
⭐
211
A list of Indonesian NLP resources.
Awesome Hungarian Nlp
⭐
192
A curated list of NLP resources for Hungarian
Unify Emotion Datasets
⭐
189
A Survey and Experiments on Annotated Corpora for Emotion Classification in Text
Fakenewscorpus
⭐
184
A dataset of millions of news articles scraped from a curated list of data sources.
Bi Lstm Crf
⭐
180
A PyTorch implementation of the BI-LSTM-CRF model.
Robbert
⭐
180
A Dutch RoBERTa-based language model
Awesome Nlp Polish
⭐
169
A curated list of resources dedicated to Natural Language Processing (NLP) in Polish: models, tools, and datasets.
Wordgcn
⭐
167
ACL 2019: Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks
Pubmed Rct
⭐
166
PubMed 200k RCT dataset: a large dataset for sequential sentence classification.
Wp2txt
⭐
160
A command-line toolkit to extract text content and category data from Wikipedia dump files
Pre Modern_chinese_corpus_dataset
⭐
132
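Pre-modern Chinese corpus dataset: natural language processing, corpora, ancient Chinese, Classical Chinese, digital humanities, computational linguistics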
Id Cnn Cws
⭐
130
Source codes and corpora of paper "Iterated Dilated Convolutions for Chinese Word Segmentation"
Natural Language Preprocessings
⭐
123
Some recipes of natural language pre-processing
Colibri Core
⭐
122
Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.
Hubot Natural
⭐
121
Natural Language Processing Chatbot for RocketChat
Lexicon Thai
⭐
119
Thai lexicon
Lawcrimemining
⭐
117
Law Crime Mining: a text mining project for court judgments and criminal cases, based on domain corpus construction and content analysis with NLP methods.
Open Korean Corpora
⭐
117
Open Korean NLP Dataset Curation for the Users All Around the Globe
Mtdata
⭐
115
A tool that locates, downloads, and extracts machine translation corpora
Toiro
⭐
110
A comparison tool of Japanese tokenizers
Crfsharp
⭐
109
CRFSharp is a .NET (C#) implementation of Conditional Random Fields, a machine learning algorithm for learning from labeled sequences of examples.
Treebankpreprocessing
⭐
106
Python scripts preprocessing Penn Treebank and Chinese Treebank
Chinese_nlu_by_using_rasa_nlu
⭐
106
Use RASA NLU to build a Chinese Natural Language Understanding System (NLU)
Chive
⭐
105
Japanese word embedding with Sudachi and NWJC 🌿
Prosody
⭐
104
Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Syntactic
⭐
102
Lexical categorization engine for large datasets. Good for NLP and Data Mining.
Clustype
⭐
100
Automatic Entity Recognition and Typing for Domain-Specific Corpora (KDD'15)
Wpcorpus
⭐
98
wpcorpus - NLP corpus based on Wikipedia's full article dump
Indonesian Nlp Resources
⭐
98
Data resources for NLP in Bahasa Indonesia
Awesome Speech Translation
⭐
98
Weak Supervision For Ner
⭐
97
Framework to learn Named Entity Recognition models without labelled data using weak supervision.
Opusfilter
⭐
88
OpusFilter - Parallel corpus processing toolkit
Dlk2nlp
⭐
87
Day-by-day line-by-line Keras-based Korean NLP
Self_dialogue_corpus
⭐
86
The Self-dialogue Corpus - a collection of self-dialogues across music, movies and sports
Tutorialbank
⭐
85
Nlp Resources
⭐
85
A useful list of NLP (Natural Language Processing) resources
Phrase At Scale
⭐
84
Detect common phrases in large amounts of text using a data-driven approach. Size of discovered phrases can be arbitrary. Can be used in languages other than English
Baleen
⭐
82
An automated ingestion service for blogs to construct a corpus for NLP research.
Sadedegel
⭐
81
A General Purpose NLP library for Turkish
Word2vec
⭐
81
word2vec++ is a library and set of tools implementing Distributed Representations of Words (word2vec), written in C++11 from scratch
Arabic Bert
⭐
80
Arabic edition of BERT pretrained language models
Translit Rnn
⭐
78
Automatic transliteration with LSTM
Germalemma
⭐
77
A lemmatizer for German language text
Russian_news_corpus
⭐
76
Corpus of lemmatized (morphologically normalized) texts from Russian mass media
Greek Bert
⭐
74
A Greek edition of BERT pre-trained language model
Ja.text8
⭐
74
Japanese text8 corpus for word embedding.
Jrte Corpus
⭐
73
Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
Jldadmm
⭐
68
A Java package for the LDA and DMM topic models
Tupa
⭐
67
Transition-based UCCA Parser
Rdrsegmenter
⭐
67
A Fast and Accurate Vietnamese Word Segmenter (LREC 2018)
Danes
⭐
65
DANeS is an open-source E-newspaper dataset by collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)
Nlp Corpus
⭐
65
Varied English texts for modern NLP testing
Query Wellformedness
⭐
63
25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are well-formed natural language questions.
Folia
⭐
60
FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this contains higher-level tools that use the library as well as the full documentation, validation schemas,
1-100 of 122 search results