Awesome Open Source
Search results for corpus
2,239 search results found
Nltk
⭐
12,699
NLTK Source
Nlp_chinese_corpus
⭐
8,344
Large Scale Chinese Corpus for NLP
Asrt_speechrecognition
⭐
7,253
A Deep-Learning-Based Chinese Speech Recognition System
Glove
⭐
6,480
Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Bert Pytorch
⭐
5,605
Google AI 2018 BERT pytorch implementation
Tensorflow Wavenet
⭐
5,362
A TensorFlow implementation of DeepMind's WaveNet paper
Nlp Datasets
⭐
5,235
Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)
Vespa
⭐
5,115
AI + Data, online. https://vespa.ai
Pycorrector
⭐
4,928
pycorrector is a toolkit for text error correction. It applies models such as Kenlm, T5, MacBERT, ChatGLM3, and LLaMA to error-correction scenarios.
Corpora
⭐
4,757
A collection of small corpora of interesting data for the creation of bots and similar stuff.
Go Fuzz
⭐
4,674
Randomized testing for Go
Duckling
⭐
3,974
Language, engine, and tooling for expressing, testing, and evaluating composable language rules on input strings.
Chinese Names Corpus
⭐
3,719
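Chinese names corpus. Name generator. Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Can be used for Chinese word segmentation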
Speech To Text Wavenet
⭐
3,586
Speech-to-Text-WaveNet : End-to-end sentence level English speech recognition based on DeepMind's WaveNet and tensorflow
Chinese_chatbot_corpus
⭐
3,550
Public Chinese chat corpus
Google 10000 English
⭐
3,537
This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.
Laser
⭐
3,460
Language-Agnostic SEntence Representations
Clue
⭐
3,345
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Markovify
⭐
3,168
A simple, extensible Markov chain generator.
Nlp_tasks
⭐
2,904
Natural Language Processing Tasks and References
Deepqa
⭐
2,878
My TensorFlow implementation of "A neural conversational model", a deep-learning-based chatbot
Uer Py
⭐
2,802
Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo
Cluedatasetsearch
⭐
2,778
Search all Chinese NLP datasets, plus commonly used English NLP datasets
Awesome Deeplearning Resources
⭐
2,739
Deep learning and deep reinforcement learning research papers and some code
Trafilatura
⭐
2,447
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Weibo_terminater
⭐
2,265
Final Weibo crawler. Scrape anything from Weibo: comments, post contents, followers, anything. The Terminator
Awesome Chatbot
⭐
1,977
Awesome Chatbot Projects, Corpus, Papers, Tutorials. Chinese Chatbot =>:
Fuzzilli
⭐
1,731
A JavaScript Engine Fuzzer
Gpt2 Ml
⭐
1,674
GPT2 for Multiple Languages, including pretrained models. Includes a 1.5-billion-parameter Chinese pretrained model
Tensorflow 1.4 Billion Password Analysis
⭐
1,657
Deep Learning model to analyze a large corpus of clear text passwords.
Chatbot Retrieval
⭐
1,545
Dual LSTM Encoder for Dialog Response Generation
Yake
⭐
1,522
Single-document unsupervised keyword extraction
Dialog_corpus
⭐
1,487
Datasets for training Chinese and English chatbot systems
Rasa_nlu_chi
⭐
1,466
Turn Chinese natural language into structured data (Chinese natural language understanding)
Chinese Annotator
⭐
1,431
Annotator for Chinese Text Corpus (UNDER DEVELOPMENT)
Entity Recognition Datasets
⭐
1,386
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
Rc Data
⭐
1,221
Question answering dataset featured in "Teaching Machines to Read and Comprehend"
Chatterbot Corpus
⭐
1,219
A multilingual dialog corpus
Glove Python
⭐
1,171
Toy Python implementation of http://www-nlp.stanford.edu/projects/glove/
Company Names Corpus
⭐
1,106
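Company names corpus. Organization names corpus. Company short names, abbreviations, brand terms, and enterprise names. Can be used for Chinese word segmentation and organization-name entity recognition.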
Insuranceqa Corpus Zh
⭐
989
🚁 Insurance industry corpus, chatbot
Autophrase
⭐
978
AutoPhrase: Automated Phrase Mining from Massive Text Corpora
Cdial Gpt
⭐
944
A Large-scale Chinese Short-Text Conversation Dataset and Chinese pre-training dialog models
Nlp Datasets
⭐
871
A list of datasets/corpora for NLP tasks, in reverse chronological order.
Voice_datasets
⭐
846
🔊 A comprehensive list of open-source datasets for voice and sound computing (95+ datasets).
Seq2seq Chatbot
⭐
826
Chatbot in 200 lines of code using TensorLayer
Quanteda
⭐
824
An R package for the Quantitative Analysis of Textual Data
Memn2n Tensorflow
⭐
820
"End-To-End Memory Networks" in Tensorflow
Pisa
⭐
820
PISA: Performant Indexes and Search for Academia
Cltk
⭐
810
The Classical Language Toolkit
Lingua Rs
⭐
774
The most accurate natural language detection library for Rust, suitable for short text and mixed-language text
Lexvec
⭐
700
This is an implementation of the LexVec word embedding model (similar to word2vec and GloVe) that achieves state of the art results in multiple NLP tasks
Bookcorpus
⭐
698
Crawl BookCorpus
Awesome Persian Nlp Ir
⭐
658
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Wordless
⭐
649
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Ngram2vec
⭐
638
Four word embedding models implemented in Python. Supporting arbitrary context features
S2orc
⭐
634
S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447
Wiki2vec
⭐
587
Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby
Ekphrasis
⭐
583
Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia and Twitter, 330 million English tweets).
Cpu_rec
⭐
578
Recognize cpu instructions in an arbitrary binary file
Magpie
⭐
574
Deep neural network framework for multi-label text classification
Ubuntu Ranking Dataset Creator
⭐
570
A script that creates train, valid and test datasets for the ranking task from Ubuntu corpus dialogs.
Bytenet
⭐
570
A TensorFlow implementation of French-to-English machine translation using DeepMind's ByteNet.
Afl
⭐
558
american fuzzy lop (copy of the source code for easy access)
Dl_eventextractionpapers
⭐
555
A collection of deep-learning-based event extraction papers published since 2015
Text_renderer
⭐
543
Bertweet
⭐
542
BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)
Exbert
⭐
541
A Visual Analysis Tool to Explore Learned Representations in Transformers Models
Jsfuzz
⭐
537
coverage guided fuzz testing for javascript
Cluepretrainedmodels
⭐
536
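A collection of high-quality Chinese pre-trained models: state-of-the-art large models, fastest small models, and specialized similarity models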
Multiturnresponseselection
⭐
534
This repo contains our ACL 2017 paper data and source code
Ner Lstm
⭐
528
Named Entity Recognition using multilayered bidirectional LSTM
Cluecorpus2020
⭐
517
Large-scale Pre-training Corpus for Chinese (100G)
Cblue
⭐
515
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
Efaqa Corpus Zh
⭐
505
❤️ Emotional First Aid Dataset: psychological counseling Q&A and chatbot corpus
Korpora
⭐
500
Korean corpus repository
Awesome Korean Nlp
⭐
495
A curated list of resources for NLP (Natural Language Processing) for Korean
Gensim Data
⭐
492
Data repository for pretrained NLP models and NLP corpora.
Language Style Transfer
⭐
491
Indicnlp_catalog
⭐
487
A collaborative catalog of NLP resources for Indic languages
Fuzzdata
⭐
486
Fuzzing resources for feeding various fuzzers with input. 🔧
Awesome Bangla
⭐
472
A collection of tools, datasets and resources on Bangla computing
Commoncrawl
⭐
466
Common Crawl support library to access 2008-2012 crawl archives (ARC files)
Og Search Engineering
⭐
462
Want to build or improve a search experience? Start here.
Ba Dls Deepspeech
⭐
457
Markov
⭐
441
Markov chain text generator, as used for KingJamesProgramming
Document_cluster
⭐
440
A guide to document clustering in Python
Corpus
⭐
436
Yet another CSS toolkit. Basically the stuff I use for most projects.
Tabert
⭐
436
This repository contains source code for the TaBERT model, a pre-trained language model for learning joint representations of natural language utterances and (semi-)structured tables for semantic parsing. TaBERT is pre-trained on a massive corpus of 26M Web tables and their associated natural language context, and could be used as a drop-in replacement of a semantic parser's original encoder to compute representations for utterances and table schemas (columns).
Chinesewordsegmentation
⭐
427
Chinese word segmentation algorithm without corpus
Mimic Recording Studio
⭐
425
Mimic Recording Studio is a Docker-based application you can install to record voice samples, which can then be trained into a TTS voice with Mimic2
Undreamt
⭐
421
Unsupervised Neural Machine Translation
Fact Extractor
⭐
413
Fact Extraction from Wikipedia Text
Paws
⭐
403
This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification.
Cec Corpus
⭐
399
📚 Chinese Emergency Corpus, Shanghai University, Semantic Intelligence Laboratory
Corpus
⭐
396
Corpora related to natural language processing and knowledge graphs. Organized by task; PRs welcome.
Insuranceqa
⭐
379
A question answering corpus in insurance domain
Kdconv
⭐
378
KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation
Chinese Nlp Corpus
⭐
378
Collections of Chinese NLP corpus
Fast_align
⭐
377
Simple, fast unsupervised word aligner
1-100 of 2,239 search results
Copyright 2018-2024 Awesome Open Source. All rights reserved.