The AmazonQA dataset is a large review-based Question Answering dataset (paper).
This repository comprises:
06/16 - We have uploaded the test sets for all the dataset formats below.
The dataset can be downloaded from the following links:
The dataset is in .jsonl format, where each line in the file is a JSON string that corresponds to a question, the existing answers to the question, and the extracted review snippets relevant to the question.
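Since each line is an independent JSON string, a file can be streamed one record at a time without loading it fully into memory. The following is a minimal sketch, not part of this repository: the helper name `iter_jsonl` and the file path in the usage note are assumptions made for illustration.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```

For example, `next(iter_jsonl("train.jsonl")).keys()` would list the fields of the first record, where `train.jsonl` stands in for whichever file you downloaded.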
Each JSON string has many fields. Here are the fields that the QA training pipeline uses:
Here are some other fields that we use for evaluation and analysis:
Our dataset consists of 923k questions, 3.6M answers and 14M reviews across 156k products. We build on the well-known Amazon dataset -
We also collect additional annotations, marking each question as either answerable or unanswerable based on the available reviews.
The src/prepro/ folder contains all the scripts for generating the raw and the various processed datasets.
The script generates the raw train/val/test product splits by combining the well-known Amazon reviews and questions datasets for all categories.
The script creates question-answer pairs with query-relevant review snippets and an is_answerable annotation produced by a trained classifier. More details on this step are given in Section 3.1, Data Processing.
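As a quick sanity check on the processed output, one might measure what share of questions the classifier marked as answerable. The sketch below is illustrative only: it assumes the annotation is stored under a key named `is_answerable` with a truthy value for answerable questions, which should be verified against the actual files. The function name `answerable_fraction` is hypothetical.

```python
from typing import Iterable


def answerable_fraction(records: Iterable[dict], key: str = "is_answerable") -> float:
    """Return the share of records whose `key` value is truthy.

    The default key name is an assumption about the processed data.
    """
    total = 0
    answerable = 0
    for record in records:
        total += 1
        if record.get(key):  # missing or falsy values count as unanswerable
            answerable += 1
    return answerable / total if total else 0.0
```

The function accepts any iterable of parsed records, so it works equally on a list held in memory or on a generator that streams a file.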
We also provide scripts to convert our dataset to other question answering dataset formats such as SQuAD and MS MARCO.
The script converts our dataset to SQuAD format by extracting snippets using different span heuristics. More details on this step are given in Section 5.2, Span-based QA model.
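To make the idea of a span heuristic concrete, here is a rough sketch of one possible heuristic. It is not the repository's implementation: it picks the review sentence that shares the most words with a reference answer and wraps it in a SQuAD v1.1-style paragraph entry. The function names and the word-overlap heuristic are assumptions chosen for illustration.

```python
import re
from typing import Dict, List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased word tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))


def pick_span(answer: str, snippets: List[str]) -> str:
    """Hypothetical heuristic: the review sentence sharing the most words with the answer."""
    sentences = [
        s for snip in snippets for s in re.split(r"(?<=[.!?])\s+", snip) if s
    ]
    answer_tokens = _tokens(answer)
    return max(sentences, key=lambda s: len(_tokens(s) & answer_tokens), default="")


def to_squad_paragraph(qid: str, question: str, answer: str, snippets: List[str]) -> Dict:
    """Build one SQuAD v1.1-style paragraph entry from a question and its snippets."""
    context = " ".join(snippets)
    span = pick_span(answer, snippets)
    start = context.find(span) if span else -1
    # Leave `answers` empty when no span could be located in the context.
    answers = [{"text": span, "answer_start": start}] if start >= 0 else []
    return {
        "context": context,
        "qas": [{"id": qid, "question": question, "answers": answers}],
    }
```

Because the chosen sentence is always a substring of the joined context, `answer_start` points at a real character offset, which is what SQuAD-style readers expect.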
The script converts our dataset to MS MARCO format.
The binary classifier and related files can be found at link