Table Extractor

Extract Data from Wikipedia Tables

The table extractor - GSoC 2017 project


Wikipedia is full of data hidden in tables. The aim of this project is to explore the possibility of exploiting all the data represented as tables in wiki pages, in order to populate the different chapters of DBpedia with new data of interest. The Table Extractor is the engine of this data “revolution”: its final purpose is to extract the semi-structured data from all the tables now scattered across most wiki pages.

Get ready


The idea of the project is to analyze resources chosen by the user and to create the related RDF triples. First of all you have to run pyDomainExplorer, passing the right arguments to it. This script creates a settings file in the domain_explorer folder that you have to fill in: it is commented to help you in this work. Then run pyTableExtractor, which reads the filled file and maps all the resources, so that you obtain RDF triples saved in the Extractions folder.


First of all you need the libraries used to develop the code. You can install them from requirements.txt:

pip install -r requirements.txt

Get started

How to run table-extractor

  • Clone repository.

  • Choose a language (--chapter, e.g. en, it, fr ...).

  • Choose a set of resources to analyze, via one of the --where, --single or --topic arguments described below.

  • Choose a value for the output format, which can be 1 or 2. See below to understand how this value influences the settings file.

  • Now you can run:

python pyDomainExplorer.py --chapter <chapter> --output_format <format> (--where | --single | --topic) <value>

This module takes the resources in the language defined by the user and analyzes each table found in their Wikipedia pages. At the end of the execution, it creates a settings file in the domain_explorer folder.

What is this file for? It contains all the sections and headers found while exploring the domain. You will see a dictionary structure with some fields that have to be filled in. Below there is an example of the output.

  • The next step is to fill in the settings file. Remember that you are writing mapping rules: you are associating a table's header (or a table's section) with a DBpedia ontology property.

  • When you have filled it in, you can simply run:

python pyTableExtractor.py
This script reads all the parameters in the domain_explorer/ settings file and prints a .ttl file that contains the RDF triples obtained from the domain analysis.
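To illustrate what such a mapping step could look like, here is a minimal sketch. The dictionary mirrors the example settings file shown later, but the filled-in property names (e.g. gamesPlayed) and the map_row helper are my assumptions, not the project's actual code:

```python
# Hypothetical filled settings section (property names are assumptions).
SECTION_Playoffs = {
    'sectionProperty': '',
    'Year': 'year',
    'Team': 'team',
    'GamesPlayed': 'gamesPlayed',
    'GamesStarted': '',  # left empty: this header stays unmapped
}

def map_row(resource, rules, row):
    """Turn one table row into (subject, predicate, object) triples,
    skipping headers whose mapping rule was left empty."""
    return [(resource, prop, row[header])
            for header, prop in rules.items()
            if header != 'sectionProperty' and prop and header in row]

row = {'Year': '2006', 'Team': 'Lakers', 'GamesPlayed': '23'}
triples = map_row('Kobe_Bryant', SECTION_Playoffs, row)
# e.g. ('Kobe_Bryant', 'gamesPlayed', '23') is among the results
```

Headers left unmapped (empty string) are simply skipped, which matches the idea that the user only writes rules for the columns of interest.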

If all goes well, you will get a dataset in the Extractions folder!
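The triples in that .ttl dataset are Turtle-serialized; a rough sketch of the serialization step (the prefixes and plain-literal quoting here are simplified assumptions, not the project's actual serializer):

```python
def to_turtle(triples):
    """Serialize (subject, predicate, object) tuples as simple Turtle
    lines, treating every object as a plain literal (a simplification)."""
    return '\n'.join(f'dbr:{s} dbo:{p} "{o}" .' for s, p, o in triples)

line = to_turtle([('Kobe_Bryant', 'team', 'Lakers')])
# line == 'dbr:Kobe_Bryant dbo:team "Lakers" .'
```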

Read below for more about the arguments passed to pyDomainExplorer.

Usage examples

  • python pyDomainExplorer.py -c it -f 1 -w "?s a <>" ---> chapter = 'it', output_format = '1'; tries to collect the resources (soccer players) that answer this SPARQL query on DBpedia.

  • python pyDomainExplorer.py -c en -f 2 -t BasketballPlayer ---> chapter = 'en', output_format = '2', topic = 'BasketballPlayer'; collects the resources that are in the DBpedia ontology class 'BasketballPlayer'.

  • python pyDomainExplorer.py -c it -f 2 -s "Kobe_Bryant" ---> the script will work on a single wiki page of the 'it' chapter. It's important to use the exact name of the Wikipedia page.


  • If you choose a topic (-t) or you pass a custom where clause to the script, a list of resources (a .txt file) is created in /Resource_lists .
  • If everything is ok, three files are created in /Extractions : two log files (one for pyDomainExplorer and one for pyTableExtractor) and a .ttl file containing the serialized RDF data set.

pyDomainExplorer arguments

There are three arguments that have to be passed to pyDomainExplorer.

  • -c, --chapter : Required. A 2-letter string representing the desired endpoint/Wikipedia language (e.g. en, it, fr ...). Default value: 'en'.

  • -f, --output_format : Required. A number that can be 1 or 2. Each value corresponds to a different organization of the output file.

  • Exactly one of these arguments is required:

    • -t, --topic : Represents a DBpedia ontology class that you want to explore and analyze. It's important to preserve the CamelCase form. E.g. "BasketballPlayer".

    • -w, --where : A SPARQL where clause. E.g. "?film ?film ?s" is used to collect all the film directors of a wiki chapter. Note: please ensure that the set you want to collect is bound to ?s.

    • -s, --single : can be used to select a single wiki page at a time. E.g. -s 'Channel_Tunnel' takes only the wiki page representing the Channel Tunnel between France and the UK. Note: please use only the name of a wiki page, without spaces (substituted by underscores). E.g. use -s German_federal_election,_1874 and not German federal election, 1874 .
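This underscore convention is easy to apply programmatically; a tiny helper (not part of the project) would be:

```python
def to_wiki_page_name(title):
    """Convert a human-readable title into the underscored form
    expected by the -s argument: spaces become underscores."""
    return title.strip().replace(' ', '_')

name = to_wiki_page_name('German federal election, 1874')
# name == 'German_federal_election,_1874'
```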

Small digression on -f parameter

Filling in all the fields of the settings file can be a burden for the user, so I tried to find ways to facilitate this work. Some of them are searching the DBpedia ontology and checking whether a header already has a property; another one is the -f parameter.

Suppose that you have to analyze a domain like basketball players and you read a table header like points.

In all sections you will observe that this header is always associated with totalPoints in the DBpedia ontology. For this reason, I think that printing this header only once in the settings file helps the user, who doesn't have to rewrite the same property n times.

However, you can set -f to 1, so the same header will be printed several times.

In a nutshell, the output organization is:

  • 1 - The output file will contain the same header repeated in all the sections where it appears.
  • 2 - Each header is unique, so you won't observe the same header in different sections.
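The two behaviors can be sketched as an optional deduplication pass over the discovered headers (a simplification of what pyDomainExplorer writes; the function and variable names here are mine):

```python
def organize_headers(sections, output_format):
    """Given {section: [headers]} in discovery order, either repeat
    headers in every section (format 1) or keep each header only in
    the first section where it appears (format 2)."""
    if output_format == 1:
        return {name: list(headers) for name, headers in sections.items()}
    seen = set()
    result = {}
    for name, headers in sections.items():
        kept = [h for h in headers if h not in seen]
        seen.update(kept)
        result[name] = kept
    return result

sections = {
    'Playoffs': ['Year', 'Team', 'GamesPlayed'],
    'Regular_season': ['Year', 'Team', 'GamesPlayed'],
}
deduped = organize_headers(sections, 2)
# deduped['Regular_season'] == []  (all its headers already appeared)
```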

Below there is a little example that explains better how the -f parameter works.

Example of output organization parameter usage

In a domain like basketball players, you can observe these files. The first one refers to -f equal to 1, while the second one is related to -f equal to 2. You can use this parameter to simplify your work over all the different domains.

# Example page where this section was found: Kobe_Bryant
SECTION_Playoffs = {
'sectionProperty': '',
'Year': 'Year',
'Team': 'team',
'GamesPlayed': '',
'GamesStarted': ''
}
# Example page where this section was found: Kobe_Bryant
SECTION_Regular_season = {
'sectionProperty': '',
'Year': 'Year',
'Team': 'team',
'GamesPlayed': '',
'GamesStarted': ''
}

# Example page where this section was found: Kobe_Bryant
SECTION_Playoffs = {
'sectionProperty': '',
'Year': 'Year',
'Team': 'team',
'GamesPlayed': '',
'GamesStarted': ''
}
# Example page where this section was found: Kobe_Bryant
SECTION_Regular_season = {
'sectionProperty': '',
}


As you can see above, headers like GamesPlayed and GamesStarted are printed twice in the first file, with -f equal to 1, while in the second file, with -f equal to 2, GamesPlayed and GamesStarted are printed only once. In this way you can write only one mapping rule instead of two.


On this page you can observe the datasets (English and Italian) extracted using the table extractor. Furthermore, you can read the log files created, in order to see all the operations performed to create the RDF triples.

I recommend also seeing this page, which contains the idea behind the project and an example of results extracted from the log files and the .ttl dataset.

Note that the effectiveness of the mapping operation mostly depends on how many rules the user has written in the settings file.

This script, written by Simone Papalini (@github/papalinis) and Federica Baiocchi (@github/Feddie), is useful to know the number of tables or lists contained in the Wikipedia pages of a given topic. It was created in collaboration with Feddie, who is working on the List Extractor. We both used it at the beginning of our projects to choose a domain to start from.

Get started

How to run

python language struct_type topic

  • language : a two-letter prefix representing the desired endpoint/Wikipedia language to search (e.g. en, it, fr ...)
  • struct_type : t for tables, l for lists
  • topic : can be either a where clause of a SPARQL query specifying the requested features of a ?s subject, or one of the following keywords: dir for all DBpedia directors with a Wikipedia page, act for actors, soccer for soccer players, writer for writers
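To give an idea of the counting step, here is a standard-library sketch that counts table elements in a page's already-fetched HTML. The actual script may work differently (e.g. querying the MediaWiki API), so treat this only as an illustration:

```python
from html.parser import HTMLParser

class TableCounter(HTMLParser):
    """Count <table> elements in a chunk of (already fetched) HTML."""
    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables += 1

counter = TableCounter()
counter.feed('<html><body><table></table><table></table></body></html>')
# counter.tables == 2
```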

Usage examples

  • python it t "?s a <>.?s <> ?f"
  • python en l writer

Progress pages

For any questions please feel free to contact:
