Awesome Open Source
Search results for natural language processing corpus
98 search results found
Nltk
⭐
12,699
NLTK Source
Nlp_chinese_corpus
⭐
8,344
Large Scale Chinese Corpus for NLP
Bert Pytorch
⭐
5,605
Google AI 2018 BERT pytorch implementation
Nlp Datasets
⭐
5,235
Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)
Nlp_tasks
⭐
2,904
Natural Language Processing Tasks and References
Uer Py
⭐
2,802
Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo
Cluedatasetsearch
⭐
2,778
Search all Chinese NLP datasets, with commonly used English NLP datasets included
Awesome Deeplearning Resources
⭐
2,739
Deep learning and deep reinforcement learning research papers and some code
Trafilatura
⭐
2,447
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Gpt2 Ml
⭐
1,674
GPT2 for multiple languages, including pretrained models, with a 1.5-billion-parameter Chinese pretrained model
Entity Recognition Datasets
⭐
1,386
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
Insuranceqa Corpus Zh
⭐
1,011
🚁 Insurance industry corpus, chatbot
Quanteda
⭐
828
An R package for the Quantitative Analysis of Textual Data
Memn2n Tensorflow
⭐
820
"End-To-End Memory Networks" in Tensorflow
Cltk
⭐
810
The Classical Language Toolkit
Bookcorpus
⭐
698
Crawl BookCorpus
Magpie
⭐
574
Deep neural network framework for multi-label text classification
Ner Lstm
⭐
528
Named Entity Recognition using multilayered bidirectional LSTM
Cluecorpus2020
⭐
517
Large-scale pre-training corpus for Chinese (100G)
Efaqa Corpus Zh
⭐
505
❤️ Emotional First Aid Dataset: psychological counseling Q&A and chatbot corpus
Awesome Korean Nlp
⭐
495
A curated list of resources for NLP (Natural Language Processing) for Korean
Awesome Bangla
⭐
472
A collection of tools, datasets and resources on Bangla computing
Chinese Nlp Corpus
⭐
378
Collection of Chinese NLP corpora
Chinesenlpcorpus
⭐
362
A collection of Chinese NLP corpora, including basic Chinese syntactic word sets, semantic word sets, domain-specific synchronic and diachronic (historical) corpora, and evaluation corpora.
German Nlp
⭐
360
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
Pykospacing
⭐
348
Automatic Korean word spacing with Python
Comet
⭐
346
A Neural Framework for MT Evaluation
Pycantonese
⭐
290
Cantonese Linguistics and NLP
Nlp_bahasa_resources
⭐
260
A curated list of datasets and usable library resources for NLP in Bahasa Indonesia
Multi Criteria Cws
⭐
260
Simple Solution for Multi-Criteria Chinese Word Segmentation
Corus
⭐
254
Links to Russian corpora + Python functions for loading and parsing
Nlvr
⭐
249
Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
Spanish Word Embeddings
⭐
248
Spanish word embeddings computed with different methods and from different corpora
Ua Gec
⭐
246
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Pyate
⭐
242
PYthon Automated Term Extraction
Mishkal
⭐
232
Mishkal is Arabic text vocalization software
Germanwordembeddings
⭐
224
Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim) and evaluate them with generated test sets
Naturalcc
⭐
220
NaturalCC: An Open-Source Toolkit for Code Intelligence
Id Nlp Resource
⭐
211
A list of Indonesian NLP resources.
Awesome Hungarian Nlp
⭐
192
A curated list of NLP resources for Hungarian
Unify Emotion Datasets
⭐
189
A Survey and Experiments on Annotated Corpora for Emotion Classification in Text
Fakenewscorpus
⭐
184
A dataset of millions of news articles scraped from a curated list of data sources.
Bi Lstm Crf
⭐
180
A PyTorch implementation of the BI-LSTM-CRF model.
Robbert
⭐
180
A Dutch RoBERTa-based language model
Awesome Nlp Polish
⭐
169
A curated list of resources dedicated to Natural Language Processing (NLP) in Polish: models, tools, datasets.
Wordgcn
⭐
167
ACL 2019: Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks
Pubmed Rct
⭐
166
PubMed 200k RCT dataset: a large dataset for sequential sentence classification.
Wp2txt
⭐
160
A command-line toolkit to extract text content and category data from Wikipedia dump files
Pre Modern_chinese_corpus_dataset
⭐
132
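Pre-modern Chinese corpus dataset: natural language processing, corpus, ancient Chinese, Classical Chinese, literary Chinese, digital humanities, computational linguistics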
Natural Language Preprocessings
⭐
123
Some recipes of natural language pre-processing
Colibri Core
⭐
122
Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.
Hubot Natural
⭐
121
Natural Language Processing Chatbot for RocketChat
Lexicon Thai
⭐
119
Thai lexicon
Lawcrimemining
⭐
117
Law and crime mining based on corpus construction and content analysis with NLP methods: a text mining project for court judgments and criminal cases, built on a domain corpus
Open Korean Corpora
⭐
117
Open Korean NLP Dataset Curation for the Users All Around the Globe
Mtdata
⭐
115
A tool that locates, downloads, and extracts machine translation corpora
Toiro
⭐
110
A comparison tool of Japanese tokenizers
Crfsharp
⭐
109
CRFSharp is a .NET (C#) implementation of Conditional Random Fields, a machine learning algorithm for learning from labeled sequences of examples.
Treebankpreprocessing
⭐
106
Python scripts preprocessing Penn Treebank and Chinese Treebank
Chinese_nlu_by_using_rasa_nlu
⭐
106
Use RASA NLU to build a Chinese Natural Language Understanding System (NLU)
Chive
⭐
105
Japanese word embedding with Sudachi and NWJC 🌿
Prosody
⭐
104
Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Syntactic
⭐
102
Lexical categorization engine for large datasets. Good for NLP and Data Mining.
Wpcorpus
⭐
98
wpcorpus - NLP corpus based on Wikipedia's full article dump
Indonesian Nlp Resources
⭐
98
Data resources for Indonesian (Bahasa Indonesia) NLP
Awesome Speech Translation
⭐
98
Weak Supervision For Ner
⭐
97
Framework to learn Named Entity Recognition models without labelled data using weak supervision.
Opusfilter
⭐
88
OpusFilter - Parallel corpus processing toolkit
Dlk2nlp
⭐
87
Day-by-day line-by-line Keras-based Korean NLP
Nlp Resources
⭐
85
A useful list of NLP (Natural Language Processing) resources
Phrase At Scale
⭐
84
Detect common phrases in large amounts of text using a data-driven approach. Size of discovered phrases can be arbitrary. Can be used in languages other than English
Baleen
⭐
82
An automated ingestion service for blogs to construct a corpus for NLP research.
Sadedegel
⭐
81
A General Purpose NLP library for Turkish
Word2vec
⭐
81
word2vec++ is a library and set of tools implementing Distributed Representations of Words (word2vec), written from scratch in C++11
Germalemma
⭐
77
A lemmatizer for German language text
Russian_news_corpus
⭐
76
Corpus of stemmed Russian mass media texts (lemmatized, i.e. morphologically normalized)
Ja.text8
⭐
74
Japanese text8 corpus for word embedding.
Greek Bert
⭐
74
A Greek edition of BERT pre-trained language model
Jrte Corpus
⭐
73
Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
Jldadmm
⭐
68
A Java package for the LDA and DMM topic models
Tupa
⭐
67
Transition-based UCCA Parser
Rdrsegmenter
⭐
67
A Fast and Accurate Vietnamese Word Segmenter (LREC 2018)
Danes
⭐
65
DANeS is an open-source E-newspaper dataset by collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)
Folia
⭐
60
FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl, this contains higher-level tools that use the library as well as the full documentation, validation schemas,
Pretraining For Language Understanding
⭐
59
Pre-training of Language Models for Language Understanding
Awesome Nlp Chinese Corpus
⭐
59
A curated list of Chinese corpus resources for NLP (Natural Language Processing)
Vietnamese Electra
⭐
59
Electra pre-trained model using Vietnamese corpus
Video_music_book_datasets
⭐
57
NLP NER datasets video/music/book bio
Nlp Corpora
⭐
57
A community-built high-quality repository of NLP corpora
Mypos
⭐
55
myPOS (Myanmar Part-of-Speech) Corpus for Myanmar NLP Research and Developments
Streusle
⭐
55
STREUSLE: a corpus with comprehensive lexical semantic annotation (multiword expressions, supersenses)
Arabicnlptoolslist
⭐
54
Arabic NLP tools List inventory
Taiga_site
⭐
54
Nerus
⭐
51
Large silver-standard Russian corpus with NER, morphology and syntax markup
Tamil Nlp Catalog
⭐
51
Awesome List of Tamil NLP & AI Resources
Chinesehumorsentiment
⭐
51
ChineseHumorSentiment: Chinese humor sentiment mining, including corpus construction and NLP mining methods. The project covers building a humorous-text corpus and humor computation models, including humor level recog…
Poemmining
⭐
50
Chinese classical poem mining project, including corpus building with a web spider and content analysis with NLP methods
Language Models
⭐
48
Build unigram and bigram language models, implement Laplace smoothing and use the models to compute the perplexity of test corpora.
Cvpr_paper_search_tool
⭐
45
Automatic paper clustering and search tool using fastText from Facebook Research
Oa Stm Corpus
⭐
44
Corpus of Open Access articles from multiple fields in Science, Technology, and Medicine.
Copyright 2018-2024 Awesome Open Source. All rights reserved.