Aspen

Aspen lets you search a large corpus of plain text files via the browser.

  • Powerful search query support through Elasticsearch query string syntax
  • Performs some basic cleanup of plaintext data and can extract document titles
  • Responsive UI that works on mobile
  • Runs in Docker

Getting Started using Docker Compose

1. Collect your documents

Put all your files in one place, like ~/ebooks/:

$ tree ~/ebooks
/Users/ian/ebooks
└── Project\ Gutenberg/
    ├── Beowulf.txt
    ├── Dracula.txt
    ├── Frankenstein.txt

2. Run Aspen & Elasticsearch

$ docker-compose up -d
Creating network "aspen_default" with the default driver
Creating elasticsearch ... done
Creating aspen         ... done

3. Convert any non-plaintext (PDFs, MS Word) documents to plaintext

Use the included convert utility, which wraps Apache Tika, to convert them to plaintext. Pass it a filename relative to your data directory:

$ ls ~/ebooks
Project Gutenberg Test.docx

$ docker-compose run aspen convert Test.docx
Starting elasticsearch ... done
Test.docx doesn't exist, trying /data/Test.docx
Creating /data/Test.txt...
...
OK

$ ls ~/ebooks
Project Gutenberg Test.docx         Test.txt
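
If many documents need converting, it can help to list them first. The following is a minimal sketch and not part of Aspen: the function name list_convertible and the set of file extensions are assumptions. It only prints paths relative to the data directory, which is the form convert expects.

```shell
#!/bin/sh
# Hypothetical helper (not part of Aspen): print non-plaintext documents
# under a data directory as paths relative to that directory.
# The extension list is an assumption; adjust it to your corpus.
list_convertible() {
  data_dir=$1
  (
    # Run in a subshell so the caller's working directory is unchanged.
    cd "$data_dir" || exit 1
    find . -type f \( -iname '*.pdf' -o -iname '*.doc' \
        -o -iname '*.docx' -o -iname '*.rtf' \) |
      sed 's|^\./||' |   # drop the leading "./" that find adds
      LC_ALL=C sort      # byte-order sort for stable output
  )
}
```

Running list_convertible ~/ebooks on the directory above would print Test.docx. Each printed path could then be passed to docker-compose run aspen convert; whether convert accepts paths inside subdirectories is not stated here, so test with a single file first.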

4. Import content into Elasticsearch

Start by resetting Elasticsearch to make sure everything is working:

$ docker-compose run aspen es-reset
Starting elasticsearch ... done
Results from DELETE: { acknowledged: true }
✓ Done.

Now import all .txt documents. The import script will try to figure out the title of the document automatically:

$ docker-compose run aspen import
Starting elasticsearch ... done
→ Base directory is /aspen/static/data
▲ Ignoring non-text path: Test.docx
→ Test.txt → Test Document
→ Project Gutenberg/Beowulf.txt → The Project Gutenberg EBook of Beowulf
→ Project Gutenberg/Dracula.txt → The Project Gutenberg EBook of Dracula, by Bram Stoker
→ Project Gutenberg/Frankenstein.txt → Project Gutenberg's Frankenstein, by Mary Wollstonecraft (Godwin) Shelley
✓ Done!

You can also run import with a directory or file name relative to the data directory. For example, import Project\ Gutenberg or import Project\ Gutenberg/Dracula.txt.

Sometimes plaintext documents act strangely: bin/import can't extract a title, or the search highlights are off. The file might have Windows (CRLF) line endings or a UTF-8 BOM header. Try running dos2unix on your text files to fix them.
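
If dos2unix is not available, the same cleanup can be approximated with standard tools. This is a minimal sketch, assuming a POSIX shell with printf, sed and mktemp. The function name normalize_text is hypothetical and not part of Aspen. It rewrites the file in place, so try it on a copy first.

```shell
#!/bin/sh
# Hypothetical helper (not part of Aspen): strip a leading UTF-8 BOM and
# convert CRLF line endings to LF, rewriting the file in place.
normalize_text() {
  file=$1
  tmp=$(mktemp) || return 1
  bom=$(printf '\357\273\277')   # the three BOM bytes: EF BB BF
  cr=$(printf '\r')              # a literal carriage return
  # LC_ALL=C makes sed treat the input as raw bytes.
  # First expression: remove a BOM at the start of line 1.
  # Second expression: remove a carriage return at the end of each line.
  LC_ALL=C sed -e "1s/^$bom//" -e "s/$cr\$//" "$file" > "$tmp" &&
    cat "$tmp" > "$file"         # overwrite contents, keep permissions
  status=$?
  rm -f "$tmp"
  return $status
}
```

For example, normalize_text ~/ebooks/Test.txt cleans a single file. Re-run the import afterwards so Elasticsearch picks up the cleaned text.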

5. Done!

Go to http://localhost:3000/ and start searching!

Development Setup

1. Install dependencies

It's easiest to use Elasticsearch via Docker.

On macOS you can get Node and Yarn via Homebrew. Otherwise, download Node.js v8.5 or later and run npm install -g yarn to get Yarn.

For document conversion (bin/convert) you'll want:

  1. Apache Tika
  2. UnRTF
  3. Par

On macOS you can brew install node tika unrtf par.

2. Clone the repo

$ git clone git@github.com:statico/aspen.git
$ cd aspen
$ yarn install

3. Set up Elasticsearch and import your data

See steps 1-4 in the "Getting Started using Docker Compose" section above. In short, get your text files together in one place, set up Elasticsearch, and import them with the bin/import command.

4. Start the web app

Aspen is built using Next.js, which is Node + ES6 + Express + React + hot reloading + lots more. Simply run:

$ yarn run dev

...and go to http://localhost:3000

If you are working on server.js and want automatic server restarting, do:

$ yarn global add nodemon
$ nodemon -w server.js -w lib -x yarn -- run dev

Development Notes

  • This started as an Angular 1 + CoffeeScript example. I recently migrated it to use Next.js, ES6 and React. You can view a full diff here.
  • I'm still using Elasticsearch 1.7 because I haven't bothered to learn the newer versions.
