Xqa

Dataset and baseline for ACL 2019 paper "XQA: A Cross-lingual Open-domain Question Answering Dataset"

XQA

This repo contains the data and a baseline implementation for the ACL 2019 paper "XQA: A Cross-lingual Open-domain Question Answering Dataset".

Setup

Data

The XQA dataset (questions, answers, and top-10 relevant articles) can be downloaded from the following link: The XQA dataset.

We also provide preprocessed wiki dumps for each language at Wiki Dump for XQA. If you plan to use your own retrieval module, please use them as the text corpus.

Dependencies

Our implementation is based on DocumentQA and BERT, which we include as submodules.

After you clone our repo, fetch the submodules with:

git submodule init
git submodule update

We require Python >= 3.5, TensorFlow, and the other supporting libraries for DocumentQA and BERT.

To install the dependencies for DocumentQA other than tensorflow, use

pip install -r documentqa/requirements.txt

The stopword corpus and the punkt sentence tokenizer for NLTK are also needed and can be fetched with:

python -m nltk.downloader punkt stopwords

Note that DocumentQA and BERT require different versions of TensorFlow.

To train and validate models with DocumentQA, use:

pip install tensorflow-gpu==1.3.0

To train and validate models with BERT, use:

pip install tensorflow-gpu==1.11.0
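
Because the two pipelines pin different TensorFlow versions, keeping them in separate virtual environments avoids reinstalling between runs. The following is a minimal sketch of a guard that a script could call before training. The helper name tf_version_ok and the REQUIRED_TF table are hypothetical and not part of this repo; the version numbers are the ones from the pip commands above.

```python
# Hypothetical helper, not part of the XQA repo.
# Version pins are copied from the pip commands in this README.
REQUIRED_TF = {"documentqa": "1.3.0", "bert": "1.11.0"}


def tf_version_ok(pipeline, installed_version):
    """Return True if installed_version matches the version pinned for pipeline."""
    required = REQUIRED_TF[pipeline]
    # Compare the major.minor.patch components as strings.
    return installed_version.split(".")[:3] == required.split(".")


# Example use inside a training script (requires TensorFlow to be installed):
#   import tensorflow as tf
#   if not tf_version_ok("bert", tf.__version__):
#       raise RuntimeError("Unexpected TensorFlow version: " + tf.__version__)
```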

The easiest way to run this code is to add the DocumentQA submodule to PYTHONPATH from the repository root:

export PYTHONPATH=${PYTHONPATH}:`pwd`/documentqa

Word Vectors

The DocumentQA models use the Common Crawl 840-billion-token GloVe word vectors (glove.840B.300d). They are expected to exist at "~/data/glove/glove.840B.300d.txt" or "~/data/glove/glove.840B.300d.txt.gz".

For example:

mkdir -p ~/data
mkdir -p ~/data/glove
cd ~/data/glove
wget http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip glove.840B.300d.zip
rm glove.840B.300d.zip
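
To check that the vectors ended up where they are expected, a small script like the one below can help. It is only a sketch: find_glove_file and iter_glove are hypothetical helpers, not functions from DocumentQA, and the parsing assumes the usual GloVe text layout of one token followed by 300 space-separated floats per line.

```python
import gzip
import os


def find_glove_file(glove_dir="~/data/glove", name="glove.840B.300d.txt"):
    """Return the path of the plain or gzipped vector file, or None if neither exists."""
    base = os.path.join(os.path.expanduser(glove_dir), name)
    for candidate in (base, base + ".gz"):
        if os.path.isfile(candidate):
            return candidate
    return None


def iter_glove(path, dim=300):
    """Yield (token, vector) pairs from a GloVe text file, gzipped or not."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # Split from the right so that a token containing spaces stays intact.
            parts = line.rstrip("\n").rsplit(" ", dim)
            if len(parts) != dim + 1:
                continue  # skip malformed lines
            yield parts[0], [float(x) for x in parts[1:]]
```

Calling find_glove_file() with no arguments looks in the default location given above and returns None when neither file is present.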

Data Preprocessing

First, set "DATA_DIR" in config.py to the path where the XQA data is stored.

To preprocess the data, run the following commands for each corpus (en, de, fr, pl, pt, ru, ta, uk, zh):

python preprocess_data.py <corpus_name>
python evidence_corpus.py --corpus <corpus_name> --n_processes 8
python build_span_corpus.py <corpus_name> --n_processes 8
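
Running the three steps by hand for all nine corpora is repetitive, so a small driver script may be convenient. The sketch below is an assumption about how one might automate this, not something shipped with the repo: preprocessing_commands and preprocess_all are hypothetical names, while the script names and flags are the ones listed above. It is meant to be run from the repository root.

```python
import subprocess
import sys

# Corpus names as listed in this README.
CORPORA = ["en", "de", "fr", "pl", "pt", "ru", "ta", "uk", "zh"]


def preprocessing_commands(corpus, n_processes=8):
    """Return the three preprocessing commands for one corpus, in order."""
    n = str(n_processes)
    return [
        [sys.executable, "preprocess_data.py", corpus],
        [sys.executable, "evidence_corpus.py", "--corpus", corpus, "--n_processes", n],
        [sys.executable, "build_span_corpus.py", corpus, "--n_processes", n],
    ]


def preprocess_all(corpora=CORPORA, n_processes=8, dry_run=False):
    """Run every step for every corpus, stopping at the first failure."""
    for corpus in corpora:
        for cmd in preprocessing_commands(corpus, n_processes):
            print(" ".join(cmd))
            if not dry_run:
                subprocess.run(cmd, check=True)
```

Calling preprocess_all(dry_run=True) only prints the commands, which is a cheap way to review them before starting a long run.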

Training Model with DocumentQA

After data preprocessing, use "ablate_xqa.py" to train DocumentQA models, for example:

python ablate_xqa.py <corpus_name> shared-norm <model_dir>

Evaluating Model with DocumentQA

To evaluate DocumentQA models, use "document_qa_eval.py", for example:

python document_qa_eval.py --n_processes 8 -c <corpus_name> --tokens 400 -o <question_output> -p <paragraph_output> <model_dir> --n_paragraphs 5

Training Model with BERT

To handle multiple paragraphs for a single question, we follow Clark and Gardner and use shared normalization as the training objective for the BERT model, computed over paragraphs sampled for each question. We use the code in DocumentQA to sample paragraphs and convert the data into the format BERT expects, for example:

python cache_train.py en shared-norm
python dump_preprocessed_train.py --input_file train_data.pkl --output_train_file en_train_output.json --num_epoch 10
python cache_dev.py en shared-norm
python dump_preprocessed_dev.py --input_file dev_data_en.pkl --output_dev_file en_dev_output.json
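
The shared-normalization objective mentioned above can be illustrated with a short sketch. The idea, as described by Clark and Gardner, is that the start (or end) scores of all paragraphs sampled for one question are normalized by a single softmax, so scores stay comparable across paragraphs. The code below is plain Python written for illustration only; it is not the TensorFlow implementation used in this repo, and the function names are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def shared_norm_loss(paragraph_scores, answer_positions):
    """Negative log-likelihood of the gold positions under one shared softmax.

    paragraph_scores: one list of start (or end) scores per paragraph.
    answer_positions: one collection of gold token indices per paragraph
                      (empty if the paragraph contains no answer).
    """
    # The normalizer spans every token of every paragraph of the question.
    all_scores = [s for scores in paragraph_scores for s in scores]
    gold_scores = [
        scores[i]
        for scores, gold in zip(paragraph_scores, answer_positions)
        for i in gold
    ]
    if not gold_scores:
        raise ValueError("at least one paragraph must contain a gold position")
    return log_sum_exp(all_scores) - log_sum_exp(gold_scores)
```

With per-paragraph normalization, each paragraph would instead get its own softmax, which lets a paragraph without an answer still produce a confident span; sharing the normalizer is what discourages that.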

Then train the BERT model, for example:

python run_bert_open_qa_train.py --vocab_file=multi_cased_L-12_H-768_A-12/vocab.txt --bert_config_file=multi_cased_L-12_H-768_A-12/bert_config.json --init_checkpoint=multi_cased_L-12_H-768_A-12/bert_model.ckpt --train_file=en_train_output.json --eval_file=en_dev_output.json --train_batch_size=2 --num_gpus 2 --learning_rate=3e-5 --num_train_epochs=1 --max_seq_length=512 --max_query_length=128 --output_dir=<model_dir> --do_lower_case=False

Evaluating Model with BERT

To evaluate the BERT model, first generate the test file, for example:

python cache_test.py --corpus de_test --n_paragraphs 5 --tokens 400
python dump_preprocessed_eval.py --input_file de_test_5.pkl --output_file test_output_de_5.json

Next, run the evaluation and compute the metrics (EM and F1 score), for example:

python run_bert_open_qa_eval.py --vocab_file=multi_cased_L-12_H-768_A-12/vocab.txt --bert_config_file=multi_cased_L-12_H-768_A-12/bert_config.json --init_checkpoint=multi_cased_L-12_H-768_A-12/bert_model.ckpt --predict_file=test_output_de_5.json --predict_batch_size=4 --max_seq_length=512 --max_query_length=128 --model_dir=<model_dir> --do_lower_case=False
python get_evaluation_metric_for_bert_result.py --input_file test_output_de_5.json --prediction_file <model_dir>/test-question-de-5-output.txt
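
For readers unfamiliar with the two metrics, the sketch below shows one common way to compute exact match (EM) and token-level F1 against a set of gold answers. It is an illustration under simplifying assumptions (lowercasing and whitespace tokenization only) and may differ from what get_evaluation_metric_for_bert_result.py actually does, especially for languages that are not whitespace-delimited.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (a simplifying assumption)."""
    return " ".join(text.lower().split())


def exact_match(prediction, gold_answers):
    """Return 1.0 if the prediction equals any gold answer after normalization."""
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))


def f1_score(prediction, gold_answers):
    """Return the best token-overlap F1 between the prediction and any gold answer."""
    pred_tokens = normalize(prediction).split()
    best = 0.0
    for gold in gold_answers:
        gold_tokens = normalize(gold).split()
        # Multiset intersection counts each shared token at most as often as it appears in both.
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Corpus-level scores are then the averages of these per-question values.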

Cite

If you use the code, please cite this paper:

@inproceedings{liu2019xqa,
  title={{XQA}: A Cross-lingual Open-domain Question Answering Dataset},
  author={Liu, Jiahua and Lin, Yankai and Liu, Zhiyuan and Sun, Maosong},
  booktitle={Proceedings of ACL 2019},
  year={2019}
}