Wikipedia Extractor

This is a mirror of the script by Giuseppe Attardi. It contains the history from before the official repository (https://github.com/attardi/wikiextractor) started. The script extracts and cleans text from a Wikipedia database dump and stores the output in a number of files of similar size in a given directory.
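The description above (read a dump, pull out article text, write it to similarly sized files in one directory) can be illustrated with a minimal sketch. This is not the script's actual implementation. The function names, the `part_NNNN.txt` file naming, the `max_bytes` parameter, and the tiny inline sample are all assumptions made for illustration, and the sketch does not clean wiki markup at all.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix: '{namespace}page' becomes 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_stream):
    """Yield (title, raw_text) for each <page> element, streaming the input."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the finished page's children to limit memory use


def write_in_chunks(pages, out_dir, max_bytes):
    """Write pages to numbered files, rotating before a file would exceed max_bytes."""
    os.makedirs(out_dir, exist_ok=True)
    paths, out, written = [], None, 0
    for title, text in pages:
        record = f"{title}\n{text}\n\n".encode("utf-8")
        # Start a new file if none is open, or if this record would overflow
        # a file that already holds at least one record.
        if out is None or (written > 0 and written + len(record) > max_bytes):
            if out is not None:
                out.close()
            path = os.path.join(out_dir, f"part_{len(paths):04d}.txt")  # hypothetical naming
            out = open(path, "wb")
            paths.append(path)
            written = 0
        out.write(record)
        written += len(record)
    if out is not None:
        out.close()
    return paths


# Hypothetical miniature dump; real dumps are far larger and carry an XML namespace.
SAMPLE = b"""<mediawiki>
  <page><title>Alpha</title><revision><text>First article.</text></revision></page>
  <page><title>Beta</title><revision><text>Second article.</text></revision></page>
  <page><title>Gamma</title><revision><text>Third article.</text></revision></page>
</mediawiki>"""

out_dir = tempfile.mkdtemp()
paths = write_in_chunks(iter_pages(io.BytesIO(SAMPLE)), out_dir, max_bytes=50)
print([os.path.basename(p) for p in paths])
```

Capping each file by bytes rather than by article count is what keeps the output files close in size even when article lengths vary widely.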
Alternatives To Wikipedia Extractor
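Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language

Wikiteam
Stars: 661 | Most recent commit: 4 months ago | Open issues: 159 | License: gpl-3.0 | Language: Python
Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2023, WikiTeam has preserved more than 350,000 wikis.

Wikipedia Extractor
Stars: 247 | Most recent commit: 8 years ago | Open issues: 1 | Language: Python
This is a mirror of the script by Giuseppe Attardi. It contains the history from before the official repository (https://github.com/attardi/wikiextractor) started. The script extracts and cleans text from a Wikipedia database dump and stores the output in a number of files of similar size in a given directory.

Json Wikipedia
Stars: 244 | Most recent commit: 2 years ago | Open issues: 6 | License: apache-2.0 | Language: Java
Json Wikipedia contains code to convert the Wikipedia XML dump into a JSON/Avro dump.

Dumpster Dive
Stars: 214 | 12 | Most recent commit: 8 months ago | Total releases: 34 | Latest release: July 04, 2023 | Open issues: 8 | License: other | Language: JavaScript
Roll a Wikipedia dump into Mongo.

Go Xml Parse
Stars: 117 | Most recent commit: 9 years ago | Latest release: November 26, 2023 | Open issues: 2 | Language: Go
Streaming XML parser example in Go.

Annotated Wikiextractor
Stars: 88 | Most recent commit: 13 years ago | License: gpl-3.0 | Language: Python
Simple Wikipedia plain text extractor with article link annotations and Hadoop support.

Xs4s
Stars: 50 | 1 | Most recent commit: 3 years ago | Total releases: 6 | Latest release: July 27, 2021 | Open issues: 1 | License: other | Language: Scala
XML streaming for Scala, including FS2/cats support.

Wikidump
Stars: 41 | Most recent commit: 11 years ago | Total releases: 4 | Latest release: April 10, 2013 | Open issues: 4 | License: gpl-3.0 | Language: Python
Tools to manipulate and extract data from Wikipedia dumps.

Wikihistoryflow
Stars: 39 | Most recent commit: 7 years ago | Open issues: 1 | Language: PHP
Visualise Wikipedia page edits using History Flow.

Wikiforia
Stars: 31 | Most recent commit: 7 years ago | Open issues: 9 | License: gpl-2.0 | Language: Java
A utility library for Wikipedia dumps.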