a graph-based knowledge search engine powered by Wikipedia
The first link in the main body text identifies a hierarchical relationship between articles: banana links to fruit, piano to musical instruments, and so on.
The search engine uses each article's first link to construct a directed graph connecting the 11 million English articles (6 million redirects). For those curious about the network's topology, peek at the research inspiring this project.
| musical instruments | Music box, Violin family, Glass harmonica | Piano Music, Piano music, Grand Piano, Lily Maisky, William Merrigan Daly |
Note: matching titles between the available hourly page view data and the displayed titles is imperfect (see match_views.py).
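The first-link relationship can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual pipeline: the `first_link` sample data, the helper names `build_children` and `neighbourhood`, and the labels "parent", "siblings", and "children" are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample: article title -> title of the first link in its body text.
first_link = {
    "Banana": "Fruit",
    "Piano": "Musical instrument",
    "Violin family": "Musical instrument",
    "Grand Piano": "Piano",
}


def build_children(first_link):
    """Invert the first-link map: parent title -> sorted list of child titles."""
    children = defaultdict(list)
    for article, parent in first_link.items():
        children[parent].append(article)
    return {parent: sorted(kids) for parent, kids in children.items()}


def neighbourhood(title, first_link):
    """Return the parent, siblings, and children of one article in the graph."""
    children = build_children(first_link)
    parent = first_link.get(title)
    # Siblings share the same first link; an article without a parent has none.
    siblings = [t for t in children.get(parent, []) if t != title] if parent else []
    return {
        "parent": parent,
        "siblings": siblings,
        "children": children.get(title, []),
    }


print(neighbourhood("Piano", first_link))
```

Each article has at most one outgoing edge (its first link), so a plain dictionary is enough to hold the directed graph in this sketch; inverting it gives the incoming edges.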
In addition to the graph, the first 2000 characters of the main body text are indexed for fuzzy title searching.
Example: "paper" --> "Pulp (paper)"
The lower-case query "paper" is matched to the correct Wikipedia article title.
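As a rough sketch of what a fuzzy title lookup could look like, the function below builds an Elasticsearch request body using a `match` query with `fuzziness` set to `"AUTO"`. The field name `title`, the function name `fuzzy_title_query`, and the choice of this particular query type are assumptions for illustration; the project's real index mapping and query may differ.

```python
def fuzzy_title_query(term, size=5):
    """Build an Elasticsearch request body for a typo-tolerant title match."""
    return {
        "size": size,  # number of hits to return
        "query": {
            "match": {
                "title": {  # hypothetical field name
                    "query": term,
                    "fuzziness": "AUTO",  # edit distance scales with term length
                }
            }
        },
    }


print(fuzzy_title_query("paper"))
```

The resulting dictionary would be sent as the body of a search request against whichever index holds the article titles.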
To install dependencies:
pip install -r requirements.txt
For distributed computations, the program also requires Spark, Java > 7, and Scala.
The program expects its configuration in configs.py, which sets environment variables for the database and Spark nodes:
import os

os.environ["master_node_dns"] = "ec2-xx-xx-xx.compute-1.amazonaws.com"
os.environ["elasticsearch_node_dns"] = "ec2-xx-xx-xx.compute-1.amazonaws.com"
os.environ["neo4j_pass"] = "xxxx"
os.environ["neo4j_ip"] = "xx.xxx.xx"
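One way to consume these variables elsewhere in the code is to read them once and fail early if any are missing. The helper below is a hypothetical sketch, not part of the project: the name `load_settings` and the `REQUIRED_KEYS` tuple are assumptions, and only the four variable names come from configs.py above.

```python
import os

# Variable names taken from configs.py; the tuple itself is an assumption.
REQUIRED_KEYS = (
    "master_node_dns",
    "elasticsearch_node_dns",
    "neo4j_pass",
    "neo4j_ip",
)


def load_settings(environ=os.environ):
    """Return the required settings as a dict, or raise if any are missing."""
    missing = [key for key in REQUIRED_KEYS if not environ.get(key)]
    if missing:
        raise KeyError("missing configuration: " + ", ".join(missing))
    return {key: environ[key] for key in REQUIRED_KEYS}
```

Calling `load_settings()` after `import configs` would then return all four values, or raise a `KeyError` naming whichever ones were not set.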
Flow based on search term:
The front-end serves this result in a directed D3 graph.