Source code and datasets for "K-BERT: Enabling Language Representation with Knowledge Graph", implemented on top of the UER framework.
Software:
Python3
Pytorch >= 1.0
argparse == 1.1
Download google_model.bin from here, and save it to the models/ directory.
Download CnDbpedia.spo from here, and save it to the brain/kgs/ directory.
Place the datasets in the datasets/ directory.
The directory tree of K-BERT:
K-BERT
├── brain
│ ├── config.py
│ ├── __init__.py
│ ├── kgs
│ │ ├── CnDbpedia.spo
│ │ ├── HowNet.spo
│ │ └── Medical.spo
│ └── knowgraph.py
├── datasets
│ ├── book_review
│ │ ├── dev.tsv
│ │ ├── test.tsv
│ │ └── train.tsv
│ ├── chnsenticorp
│ │ ├── dev.tsv
│ │ ├── test.tsv
│ │ └── train.tsv
│ ...
│
├── models
│ ├── google_config.json
│ ├── google_model.bin
│ └── google_vocab.txt
├── outputs
├── uer
├── README.md
├── requirements.txt
├── run_kbert_cls.py
└── run_kbert_ner.py
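Before launching a run, it can help to confirm that the downloaded files ended up where the tree above expects them. The following is a minimal sketch, not part of the K-BERT code base: the helper name `missing_files` and the list `REQUIRED_FILES` are hypothetical, and the paths are copied from the directory tree.

```python
from pathlib import Path

# Paths copied from the directory tree above.
# REQUIRED_FILES and missing_files are hypothetical names, not part of K-BERT.
REQUIRED_FILES = [
    "models/google_model.bin",
    "models/google_config.json",
    "models/google_vocab.txt",
    "brain/kgs/CnDbpedia.spo",
    "datasets/book_review/train.tsv",
    "datasets/book_review/dev.tsv",
    "datasets/book_review/test.tsv",
]


def missing_files(root, required=REQUIRED_FILES):
    """Return the required relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All required files are in place.")
```

Run it from the K-BERT root directory. Adjust the list if you use another knowledge graph or dataset.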
Run an example on the Book review dataset with CnDbpedia:
CUDA_VISIBLE_DEVICES='0' nohup python3 -u run_kbert_cls.py \
--pretrained_model_path ./models/google_model.bin \
--config_path ./models/google_config.json \
--vocab_path ./models/google_vocab.txt \
--train_path ./datasets/book_review/train.tsv \
--dev_path ./datasets/book_review/dev.tsv \
--test_path ./datasets/book_review/test.tsv \
--epochs_num 5 --batch_size 32 --kg_name CnDbpedia \
--output_model_path ./outputs/kbert_bookreview_CnDbpedia.bin \
> ./outputs/kbert_bookreview_CnDbpedia.log &
Results:
Best accuracy in dev : 88.80%
Best accuracy in test: 87.69%
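To collect these numbers from several runs, the result lines can be pulled out of the log with a regular expression. This is a small sketch that assumes the log contains lines in exactly the form shown above. The function name `parse_best_accuracy` is hypothetical.

```python
import re

# Matches lines such as "Best accuracy in dev : 88.80%".
RESULT_LINE = re.compile(r"Best accuracy in (dev|test)\s*:\s*([0-9.]+)%")


def parse_best_accuracy(log_text):
    """Return a dict like {"dev": 88.8, "test": 87.69} from the result lines.

    If a split appears more than once, the last occurrence wins.
    """
    results = {}
    for split, value in RESULT_LINE.findall(log_text):
        results[split] = float(value)
    return results


sample = """Best accuracy in dev : 88.80%
Best accuracy in test: 87.69%"""
print(parse_best_accuracy(sample))  # {'dev': 88.8, 'test': 87.69}
```

For a real run, read the text from the log file (for example ./outputs/kbert_bookreview_CnDbpedia.log) and pass it to the function.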
Options of run_kbert_cls.py:
usage: [--pretrained_model_path] - Path to the pre-trained model parameters.
       [--config_path] - Path to the model configuration file.
       [--vocab_path] - Path to the vocabulary file.
       --train_path - Path to the training dataset.
       --dev_path - Path to the validation dataset.
       --test_path - Path to the testing dataset.
       [--epochs_num] - The number of training epochs.
       [--batch_size] - Batch size of the training process.
       [--kg_name] - The name of the knowledge graph: "HowNet", "CnDbpedia" or "Medical".
       [--output_model_path] - Path to the output model.
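The option list above maps naturally onto argparse. The sketch below only illustrates that mapping and is not the parser from run_kbert_cls.py. The default values are assumptions taken from the example command, the default output path is hypothetical, and restricting --kg_name to three choices is an assumption based on the names listed above.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the documented options (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Sketch of the run_kbert_cls.py options")
    # Bracketed options in the usage text are optional, so they get defaults here.
    parser.add_argument("--pretrained_model_path", default="./models/google_model.bin",
                        help="Path to the pre-trained model parameters.")
    parser.add_argument("--config_path", default="./models/google_config.json",
                        help="Path to the model configuration file.")
    parser.add_argument("--vocab_path", default="./models/google_vocab.txt",
                        help="Path to the vocabulary file.")
    # Unbracketed options are treated as required.
    parser.add_argument("--train_path", required=True, help="Path to the training dataset.")
    parser.add_argument("--dev_path", required=True, help="Path to the validation dataset.")
    parser.add_argument("--test_path", required=True, help="Path to the testing dataset.")
    parser.add_argument("--epochs_num", type=int, default=5,
                        help="The number of training epochs.")
    parser.add_argument("--batch_size", type=int, default=32,
                        help="Batch size of the training process.")
    parser.add_argument("--kg_name", default="CnDbpedia",
                        choices=["HowNet", "CnDbpedia", "Medical"],
                        help="The name of the knowledge graph.")
    # Hypothetical default path, chosen only for this sketch.
    parser.add_argument("--output_model_path", default="./outputs/kbert_model.bin",
                        help="Path to the output model.")
    return parser


args = build_parser().parse_args([
    "--train_path", "./datasets/book_review/train.tsv",
    "--dev_path", "./datasets/book_review/dev.tsv",
    "--test_path", "./datasets/book_review/test.tsv",
])
print(args.kg_name, args.epochs_num, args.batch_size)
```

With only the three required paths given, the remaining options fall back to the assumed defaults.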
Accuracy (dev/test %) on different datasets:
Dataset | HowNet | CnDbpedia |
---|---|---|
Book review | 88.75/87.75 | 88.80/87.69 |
ChnSentiCorp | 95.00/95.50 | 94.42/95.25 |
Shopping | 97.01/96.92 | 96.94/96.73 |
Weibo | 98.22/98.33 | 98.29/98.33 |
LCQMC | 88.97/87.14 | 88.91/87.20 |
XNLI | 77.11/77.07 | 76.99/77.43 |
Run an example on the msra_ner dataset with CnDbpedia:
CUDA_VISIBLE_DEVICES='0' nohup python3 -u run_kbert_ner.py \
--pretrained_model_path ./models/google_model.bin \
--config_path ./models/google_config.json \
--vocab_path ./models/google_vocab.txt \
--train_path ./datasets/msra_ner/train.tsv \
--dev_path ./datasets/msra_ner/dev.tsv \
--test_path ./datasets/msra_ner/test.tsv \
--epochs_num 5 --batch_size 16 --kg_name CnDbpedia \
--output_model_path ./outputs/kbert_msraner_CnDbpedia.bin \
> ./outputs/kbert_msraner_CnDbpedia.log &
Results:
The best in dev : precision=0.957, recall=0.962, f1=0.960
The best in test: precision=0.953, recall=0.959, f1=0.956
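The f1 value is the harmonic mean of precision and recall. The sketch below uses that standard definition. Whether run_kbert_ner.py computes it in exactly this way is an assumption, and the function name `f1_score` is chosen only for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported test-set precision and recall from the results above.
print(round(f1_score(0.953, 0.959), 3))  # 0.956
```

Recomputing from the dev values (0.957 and 0.962) gives about 0.959 rather than the reported 0.960. A difference in the last digit is expected, because the published precision and recall are themselves rounded.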
Options of run_kbert_ner.py:
usage: [--pretrained_model_path] - Path to the pre-trained model parameters.
       [--config_path] - Path to the model configuration file.
       [--vocab_path] - Path to the vocabulary file.
       --train_path - Path to the training dataset.
       --dev_path - Path to the validation dataset.
       --test_path - Path to the testing dataset.
       [--epochs_num] - The number of training epochs.
       [--batch_size] - Batch size of the training process.
       [--kg_name] - The name of the knowledge graph.
       [--output_model_path] - Path to the output model.
Experimental results on domain-specific tasks (Precision/Recall/F1):
KG | Finance_QA | Law_QA | Finance_NER | Medicine_NER |
---|---|---|---|---|
HowNet | 0.805/0.888/0.845 | 0.842/0.903/0.871 | 0.860/0.888/0.874 | 0.935/0.939/0.937 |
CN-DBpedia | 0.814/0.881/0.846 | 0.814/0.942/0.874 | 0.860/0.887/0.873 | 0.935/0.937/0.936 |
MedicalKG | -- | -- | -- | 0.944/0.943/0.944 |
This work is a joint study with the support of Peking University and Tencent Inc.
If you use this code, please cite this paper:
@inproceedings{weijie2019kbert,
title={{K-BERT}: Enabling Language Representation with Knowledge Graph},
author={Weijie Liu and Peng Zhou and Zhe Zhao and Zhiruo Wang and Qi Ju and Haotang Deng and Ping Wang},
booktitle={Proceedings of AAAI 2020},
year={2020}
}