Text2sql Data

A collection of datasets that pair questions with SQL queries.


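This repository contains data and code for building and evaluating systems that map sentences to SQL, developed as part of the ACL 2018 paper "Improving Text-to-SQL Evaluation Methodology" by Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev.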

For a range of domains, we provide:

  • Sentences with annotated variables
  • SQL queries
  • A database schema
  • A database

These are improved forms of prior datasets and a new dataset we developed. We have separate files describing the datasets, systems, and tools.
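
To illustrate how a sentence with annotated variables can relate to its SQL query, here is a minimal Python sketch. The record layout, the field names (`text`, `sql`, `variables`), the placeholder `city_name0`, the toy schema, and the `fill_variables` helper are all hypothetical and chosen only for this example; the dataset description files are the authority on the actual format.

```python
import re

def fill_variables(template: str, variables: dict) -> str:
    """Replace every placeholder name found in `template` with its value."""
    if not variables:
        return template
    # Try longer names first so "city_name10" is not partially matched by "city_name1".
    names = sorted(variables, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    # A single pass over the template means substituted values are never re-scanned.
    return pattern.sub(lambda match: variables[match.group(0)], template)

# Hypothetical record: the same placeholder appears in the sentence and the query.
example = {
    "text": "how many people live in city_name0",
    "sql": 'SELECT population FROM city WHERE city_name = "city_name0" ;',
    "variables": {"city_name0": "boulder"},
}

question = fill_variables(example["text"], example["variables"])
query = fill_variables(example["sql"], example["variables"])
print(question)
print(query)
```

Keeping the variable values separate from the sentence and query templates, as in this sketch, lets one question-query pattern be instantiated with different values.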

Version Description
4 Data fixes
3 Data fixes and addition of data from Spider and WikiSQL
2 Data with fixes for variables incorrectly defined in questions
1 Data used in the ACL 2018 paper

Citing this work

If you use this data in your work, please cite our ACL paper and the appropriate original sources, and list the version number of the data. For example, in your paper you could write (using the BibTeX below):

In this work, we use version 4 of the modified SQL datasets from \citet{data-advising}, based on \citet{data-academic,data-atis-original,data-geography-original,data-atis-geography-scholar,data-imdb-yelp,data-restaurants-logic,data-restaurants-original,data-restaurants,data-spider,data-wikisql}.

If you are only using one dataset, here are example citation commands:

Data Cite
Academic \citet{data-advising,data-academic}
Advising \citet{data-advising}
ATIS \citet{data-advising,data-atis-original,data-atis-geography-scholar}
Geography \citet{data-advising,data-geography-original,data-atis-geography-scholar}
Restaurants \citet{data-advising,data-restaurants-logic,data-restaurants-original,data-restaurants}
Scholar \citet{data-advising,data-atis-geography-scholar}
Spider \citet{data-advising,data-spider}
IMDB \citet{data-advising,data-imdb-yelp}
Yelp \citet{data-advising,data-imdb-yelp}
WikiSQL \citet{data-advising,data-wikisql}
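
If you use several datasets but not all of them, the rows of the table can be merged into a single command. The Python sketch below copies the keys from the table; the `CITE_KEYS` mapping and the `cite_command` helper are names made up for this illustration and are not part of the repository.

```python
# Citation keys per dataset, copied from the table above.
CITE_KEYS = {
    "Academic": ["data-advising", "data-academic"],
    "Advising": ["data-advising"],
    "ATIS": ["data-advising", "data-atis-original", "data-atis-geography-scholar"],
    "Geography": ["data-advising", "data-geography-original", "data-atis-geography-scholar"],
    "Restaurants": ["data-advising", "data-restaurants-logic",
                    "data-restaurants-original", "data-restaurants"],
    "Scholar": ["data-advising", "data-atis-geography-scholar"],
    "Spider": ["data-advising", "data-spider"],
    "IMDB": ["data-advising", "data-imdb-yelp"],
    "Yelp": ["data-advising", "data-imdb-yelp"],
    "WikiSQL": ["data-advising", "data-wikisql"],
}

def cite_command(datasets):
    """Build one \\citet command covering all given datasets, without duplicate keys."""
    keys = []
    for name in datasets:
        for key in CITE_KEYS[name]:
            if key not in keys:  # keep first-seen order, skip repeats
                keys.append(key)
    return "\\citet{" + ",".join(keys) + "}"

print(cite_command(["ATIS", "Geography"]))
```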

@InProceedings{data-advising,
  dataset   = {Advising},
  author    = {Catherine Finegan-Dollak and Jonathan K. Kummerfeld and Li Zhang and Karthik Ramanathan and Sesh Sadasivam and Rui Zhang and Dragomir Radev},
  title     = {Improving Text-to-SQL Evaluation Methodology},
  booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {July},
  year      = {2018},
  location  = {Melbourne, Victoria, Australia},
  pages     = {351--360},
  url       = {http://aclweb.org/anthology/P18-1033},
}

@InProceedings{data-imdb-yelp,
  dataset   = {IMDB and Yelp},
  author    = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
  title     = {SQLizer: Query Synthesis from Natural Language},
  booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
  month     = {October},
  year      = {2017},
  pages     = {63:1--63:26},
  url       = {http://doi.org/10.1145/3133887},
}

@Article{data-academic,
  dataset   = {Academic},
  author    = {Fei Li and H. V. Jagadish},
  title     = {Constructing an Interactive Natural Language Interface for Relational Databases},
  journal   = {Proceedings of the VLDB Endowment},
  volume    = {8},
  number    = {1},
  month     = {September},
  year      = {2014},
  pages     = {73--84},
  url       = {http://dx.doi.org/10.14778/2735461.2735468},
}

@InProceedings{data-atis-geography-scholar,
  dataset   = {Scholar, and Updated ATIS and Geography},
  author    = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
  title     = {Learning a Neural Semantic Parser from User Feedback},
  booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year      = {2017},
  pages     = {963--973},
  location  = {Vancouver, Canada},
  url       = {http://www.aclweb.org/anthology/P17-1089},
}

@Article{data-atis-original,
  dataset   = {ATIS, original},
  author    = {Deborah A. Dahl and Madeleine Bates and Michael Brown and William Fisher and Kate Hunicke-Smith and David Pallett and Christine Pao and Alexander Rudnicky and Elizabeth Shriber},
  title     = {{Expanding the scope of the ATIS task: The ATIS-3 corpus}},
  journal   = {Proceedings of the workshop on Human Language Technology},
  year      = {1994},
  pages     = {43--48},
  url       = {http://dl.acm.org/citation.cfm?id=1075823},
}

@InProceedings{data-geography-original,
  dataset   = {Geography, original},
  author    = {John M. Zelle and Raymond J. Mooney},
  title     = {Learning to Parse Database Queries Using Inductive Logic Programming},
  booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
  year      = {1996},
  pages     = {1050--1055},
  location  = {Portland, Oregon},
  url       = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}

@InProceedings{data-restaurants-logic,
  author    = {Lappoon R. Tang and Raymond J. Mooney},
  title     = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
  booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
  year      = {2000},
  pages     = {133--141},
  location  = {Hong Kong, China},
  url       = {http://www.aclweb.org/anthology/W00-1317},
}

@InProceedings{data-restaurants-original,
  author    = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
  title     = {Towards a Theory of Natural Language Interfaces to Databases},
  booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
  year      = {2003},
  location  = {Miami, Florida, USA},
  pages     = {149--157},
  url       = {http://doi.acm.org/10.1145/604045.604070},
}

@InProceedings{data-restaurants,
  author    = {Alessandra Giordani and Alessandro Moschitti},
  title     = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
  booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
  year      = {2012},
  location  = {Montpellier, France},
  pages     = {59--76},
  url       = {https://doi.org/10.1007/978-3-642-45260-4_5},
}

@InProceedings{data-spider,
  author    = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev},
  title     = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  year      = {2018},
  location  = {Brussels, Belgium},
  pages     = {3911--3921},
  url       = {http://aclweb.org/anthology/D18-1425},
}

@Article{data-wikisql,
  author    = {Victor Zhong and Caiming Xiong and Richard Socher},
  title     = {Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning},
  year      = {2017},
  journal   = {CoRR},
  volume    = {abs/1709.00103},
}


We put substantial effort into fixing bugs in the datasets, but none of them are perfect. If you find a bug, please submit a pull request with a fix. We merge fixes into a development branch and only infrequently merge those changes into the master branch (at which point this page will be updated to note the new release). This approach is intended to balance the need for clear comparisons between systems against the goal of improving the data.

For some ideas of issues to address, see our list of known issues.


This material is based in part upon work supported by IBM under contract 4915012629. Any opinions, findings, conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of IBM.
