Bitcurator Nlp Gentm

Generate topic models from open text extracted from files in disk images

Generate topic models using open text automatically extracted from various file formats in disk images. This project uses The Sleuth Kit (sleuthkit/sleuthkit) to parse file systems in disk images, textract to extract text from common file formats, gensim to generate topic models, and pyLDAvis (bmabey/pyLDAvis) for visualization.
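To illustrate the kind of input a topic modeler such as gensim works on, the sketch below turns extracted text into bag-of-words documents, that is, lists of (token id, count) pairs. It uses only the Python standard library as a stand-in; the function names and the sample sentences are hypothetical and are not taken from this project's code.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(token_lists):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Convert one tokenized document to sorted (token id, count) pairs."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


# Hypothetical sample text standing in for text extracted from a disk image.
docs = [
    "Disk images hold files.",
    "Files hold text; text feeds topic models.",
]
token_lists = [tokenize(d) for d in docs]
vocab = build_vocabulary(token_lists)
corpus = [to_bow(tokens, vocab) for tokens in token_lists]

for bow in corpus:
    print(bow)
```

A real run would typically also remove stop words and very rare tokens before modeling; that step is omitted here for brevity.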

Setup and Installation

The topic model generation tool depends on a number of external natural language processing and digital forensics libraries. For convenience, we have included a script that installs all the required dependencies on Ubuntu 18.04 LTS. The script builds and installs certain tools (TSK, libewf, and several others) from source.

On an Ubuntu host or in a clean virtual machine, first make sure you have git installed:

  • Open a terminal and install git using apt:
$ sudo apt-get install git

Next, follow these steps:

  • Clone this repository:
$ git clone
  • Change directory into the repository:
$ cd bitcurator-nlp-gentm
  • Run the setup shell script to install and configure the required software (various dependencies, TSK, textract, and gensim). Note that this may take some time (typically 10-15 minutes).
$ sudo ./

Disk Image Selection and Configuration

This repository includes a sample Expert Witness Format disk image (govdocs45sampler.E01) in the disk_images directory. If you do not make any changes to the configuration file, the topic modeler and visualization tool will be run on text extracted from files discovered in this image.

To run the tool against other disk images (EWF or raw), copy those images into the disk_images directory and edit the [image_section] of the configuration file (config.txt) to list them. For example, if you had two images named my-image-name1.E01 and my-image-name2.dd, the section would be modified as follows:

# Disk images to process (the default location can be changed in the following section)
my-image-name1.E01 = 1
my-image-name2.dd = 1
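As a rough illustration of how entries like these can be read, the sketch below parses an INI-style [image_section] with Python's standard configparser module. It assumes that a value of 1 marks an image for processing; the 0 entry, the function name, and the inline sample text are hypothetical and are not taken from this project's code.

```python
import configparser

# Hypothetical configuration text modeled on the example above.
SAMPLE_CONFIG = """\
[image_section]
# Disk images to process
my-image-name1.E01 = 1
my-image-name2.dd = 1
skipped-image.E01 = 0
"""


def enabled_images(config_text):
    """Return the image names in [image_section] whose value is 1."""
    parser = configparser.ConfigParser()
    # configparser lowercases keys by default; keep file names as written.
    parser.optionxform = str
    parser.read_string(config_text)
    return [
        name
        for name, flag in parser.items("image_section")
        if flag.strip() == "1"
    ]


print(enabled_images(SAMPLE_CONFIG))
```

Overriding optionxform matters here because file names such as my-image-name1.E01 would otherwise come back lowercased and no longer match the files on disk on a case-sensitive file system.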

Running the Tool

Run the following command to extract text from the configured file types, start the topic modeling tool, and load the results into a browser window.

$ python
  • Depending on the size of your corpus, this may take some time. You will see a range of log output and (possibly) deprecation warnings related to the operation of gensim and other tools. The tool is operating normally unless it drops back to a terminal prompt with an error.

  • The results based on the text extracted from your specified file types and processed using pyLDAvis will appear automatically in a browser window. When finished viewing, you can terminate the server in the existing terminal by typing "Ctrl-X" followed by "Ctrl-C".

Additional adjustments can be performed with command-line flags.

  • --topics: number of topics (default 10)
  • --tm: topic modeling tool (default gensim). (Graphlab option disabled due to licensing restrictions)
  • --infile: file source. If the --infile option is not used, text is extracted from the disk image(s) listed in the configuration file. Use --infile to specify a directory of files instead.
  • --config: configuration file (default config.txt in main directory) - specify file path for alternate configuration file
$ Usage: python [--topics <10>] [--tm <gensim|graphlab>] [--infile </directory/path>] [--config </path/to/config-file/>] 
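The flags above map naturally onto Python's standard argparse module. The following is a minimal sketch of such a parser using the documented flag names and defaults; it is not the tool's actual argument handling, and the help strings and the build_parser name are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Sketch of the documented topic modeling flags"
    )
    parser.add_argument("--topics", type=int, default=10,
                        help="number of topics")
    parser.add_argument("--tm", choices=["gensim", "graphlab"],
                        default="gensim", help="topic modeling tool")
    parser.add_argument("--infile", default=None,
                        help="directory of files to use instead of disk images")
    parser.add_argument("--config", default="config.txt",
                        help="path to an alternate configuration file")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--topics", "20"])
print(args.topics, args.tm, args.infile, args.config)
```

Leaving --infile as None when it is absent gives the calling code a simple way to fall back to the disk images listed in the configuration file.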


Additional project information can be found on the BitCurator NLP wiki.


The BitCurator logo, BitCurator project documentation, and other non-software products of the BitCurator team are subject to the Creative Commons Attribution 4.0 Generic license (CC BY 4.0).

Unless otherwise indicated, software items in this repository are distributed under the terms of the GNU Lesser General Public License, Version 3. See the text file "COPYING" for further details about the terms of this license.

In addition to software produced by the BitCurator team, BitCurator packages and modifies open source software produced by other developers. Licenses and attributions are retained here where applicable.

Additional Notes

If your Ubuntu VM does not already have a desktop (graphical UI), you will need to install one in order to view the results in a browser:

$ sudo apt-get update
$ sudo apt-get install ubuntu-desktop