Awesome Open Source

Search results for java corpus

98 search results found

Vespa ⭐ 5,115

AI + Data, online. https://vespa.ai

Wiki2vec ⭐ 587

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

Codebuff ⭐ 333

Language-agnostic pretty-printing through machine learning (uh, like, is this possible? YES, apparently).

Javafuzz ⭐ 195

Coverage-guided fuzz testing for Java

Semantic Knowledge Graph ⭐ 180

Pignlproc ⭐ 160

Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.

Scalable Topic Modeling using Variational Inference in MapReduce

Improving topic models LDA and DMM (one-topic-per-document model for short texts) with word embeddings (TACL 2015)

Word2vec Lucene ⭐ 127

This tool extracts word vectors from a Lucene index.

Syntactic ⭐ 102

Lexical categorization engine for large datasets. Good for NLP and Data Mining.

Blacklab ⭐ 97

Linguistic search for large annotated text corpora, based on Apache Lucene

A Java package for the LDA and DMM topic models

Rdrsegmenter ⭐ 67

A Fast and Accurate Vietnamese Word Segmenter (LREC 2018)

ANNIS is an open source, versatile web browser-based search and visualization architecture for complex multilevel linguistic corpora with diverse types of annotation.

Java text categorization system

Image Verification Corpus ⭐ 45

This contains an evolving dataset of fake and real images shared in social media.

Chemspot ⭐ 42

ChemSpot is a named entity recognition tool for identifying mentions of chemicals in natural language texts, including trivial names, drugs, abbreviations, molecular formulas and IUPAC entities. Since the different classes of relevant entities have rather different naming characteristics, ChemSpot uses a hybrid approach combining a Conditional Random Field with a dictionary. ChemSpot is released under the Common Public License 1.0.

Science Result Extractor ⭐ 42

Lucenetutorial ⭐ 39

A simple tutorial of Lucene for LIS 501 Introduction to Text Mining students at the University of Wisconsin-Madison (Fall 2021).

UFSAC is a resource containing all WordNet Sense Annotated Corpora, and a Java library for manipulating them

Dkpro C4corpus ⭐ 32

DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.

Logiclda ⭐ 32

Topic modeling with first-order logic (FOL) domain knowledge

Bigfatlm ⭐ 30

Hadoop MapReduce training of modified Kneser-Ney smoothed language models

Cryptokcodecracker ⭐ 27

Running Key Cipher Decoder + other classic cipher decoders. Automatically discovers likely solutions using an NGram language model.

Hypervec ⭐ 25

Hierarchical Embeddings for Hypernymy Detection and Directionality

Conll Rdf ⭐ 25

Advanced graph rewriting and LLOD publication for CoNLL and other TSV formats

Cocolian Nlp ⭐ 23

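This project aims to build a standardized NLP processing framework that provides enterprise-grade APIs along with various recommended implementations and test packages. Many NLP packages are available in China and abroad, including those from the Chinese Academy of Sciences and Fudan University; by wrapping these commonly used NLP tools, the project can provide enterprises with …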

Jitar HMM part of speech tagger

Rhapsode ⭐ 21

Advanced desktop search/corpus exploration prototype

Dependency Parsing Toolbox ⭐ 19

"Dependency Parsing toolbox" integrates different algorithms related to dependency parsing in one place. This toolbox has been developed by Mojtaba Khallash from Iran University of Science and Technology (IUST).

Openconvert ⭐ 19

Text conversion tool from formats such as Word, HTML, and txt to the corpus formats TEI or FoLiA

Darks Learning ⭐ 18

Darks Learning is a machine learning algorithm library. It contains Word2vec, DBN, RBM, MLP, LSA, PLSA, SDA, Maxent, regression, etc.

A tool for the automatic construction of distributional thesauri

Cc News Tools ⭐ 17

Tools relating to the CC-News-En Collection

Hfututils ⭐ 17

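A collection of utility programs that make everyday data preprocessing easier. Most of it targets text processing, including word segmentation (integrating Zhang Huaping's …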

Ixa Pipe Pos ⭐ 17

IXA pipes Part of Speech tagger and Lemmatizer (http://ixa2.si.ehu.es/ixa-pipes)

Alvisnlp ⭐ 17

ALvisNLP corpus processing engine

Workbench ⭐ 17

Java and Lucene based tools for BitFunnel corpus preparation

NLP Homework (Spring 2013)

A collection of tools for applying word senses to large corpora

Coverage based JVM Fuzz testing tool.

Quootstrap ⭐ 14

Unsupervised method for extracting quotation-speaker pairs from large news corpora.

🔍 A Corpus Data Retrieval Index using Lucene for Look-Ups

Porn_detector ⭐ 13

Porn Detection via Skin Tone

Termolator ⭐ 13

Chinese version of NYU's Termolator terminology extraction system. Also includes source code for the English part-of-speech tagger used in the English version.

Temporal Random Indexing

Text-to-speech alignment software written in Java

Bible Corpus Tools ⭐ 12

A collection of tools for reading/processing the multilingual Bible corpus

Reading the data from OPIEC - an Open Information Extraction corpus

Hexatomic ⭐ 12

Hexatomic is extensible software for deep multi-layer annotation of linguistic corpora

Trombone ⭐ 12

Topic Evolution Analysis - an algorithm for analyzing knowledge flow in text based corpora

Invitationmodel ⭐ 10

Implementation of domain adaptation algorithm based on the paper "Latent Domain Translation Models in Mix-of-Domains Haystack" http://www.aclweb.org/anthology/C14-1182

Geocorpora ⭐ 10

The GeoCorpora project aims at creating corpora of fully geo-annotated texts (in particular microblog texts) and developing tools to support the corpus building process using crowd-sourcing and visual analytics approaches. Created corpora will be made publicly available in this repository. A first corpus of ~6000 geo-annotated tweets will be published here in the near future.

Neulearn Ai_agent ⭐ 10

Conversational agent project - Neulearn

Neural Bon ⭐ 10

code for AAAI-17 paper "Neural Bag-of-Ngrams"

Dwtc Extractor ⭐ 10

Extraction code used to create the Dresden Web Table Corpus

Slovene Named Entity Extractor

Emnlp2016 Empirical Convincingness ⭐ 9

Code and data for EMNLP2016 article "What makes a convincing argument? Empirical analysis and detecting attributes of convincingness in Web argumentation" by Ivan Habernal and Iryna Gurevych

An NLP framework for large scale processing using Hadoop

📝 Translation of query languages to serialized KoralQuery protocol

A sentence simplification system

Java Corpus ⭐ 8

Corpus of runnable, open-source Java 1.5+ programs.

Dwtc Tools ⭐ 8

Dresden Web Table Corpus Java library

Minesstubs ⭐ 8

Hosts our tool for mining simple "stupid" bugs (SStuBs).

Naive Bayes classifier for language detection and spelling correction

Pre-processing, Training and Classification in Embedded GATE

Wordreprs_ner ⭐ 8

Fork of NER code from Turian et al. (2010) and Ratinov et al. (2009)

Linguaview ⭐ 8

A GUI tool to visualize grammar formalisms

Emnlp2017 Cmapsum Corpus ⭐ 8

Accompanying code for our EMNLP 2017 publication "Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps"

Factcheck ⭐ 7

Mdswriter ⭐ 7

Software for manually creating multi-document summarization corpora and a platform for developing complex annotation tasks spanning multiple steps.

Justext Java ⭐ 7

Opiec Pipeline ⭐ 7

Translation without parallel corpora.

Corpuscompression ⭐ 6

Achieve better compression for small objects with a predefined corpus

Tacl2016 Trainingdata4srl ⭐ 6

Code for automated labeling of FrameNet roles in arbitrary text (TACL paper)

Randomwordgenerator ⭐ 6

This Android app generates randomly selected words from large word lists derived from dictionaries and published text corpora.

Naive Bayes Classifier ⭐ 6

A Naive Bayes classifier is a classification algorithm. This project uses the Bernoulli and multinomial Naive Bayes models to classify text documents as ham or spam.

MOTS (MOdular Tool for Summarization) is a summarization system written in Java. It is as modular as possible and is intended to provide an architecture for implementing and testing new summarization methods, as well as to ease comparison with already implemented methods, in a unified framework.

The Java source code and datasets of our ACL2017 paper: Time Expression Analysis and Recognition Using Syntactic Token Types and General Heuristic Rules

An RDF based Information Extraction system

Weiwei's personal tools

An open source contextual dictionary

Iust Htmlchardet ⭐ 6

A Java tool for detecting the charset encoding of HTML web pages

Codeswitchingresearch ⭐ 5

A project for Code-Switching Research

Information Retrieval ⭐ 5

Textual Information Retrieval (IR) and Information Extraction (IE) engine

Exapus is a web application for exploring the usage of APIs within a single project (i.e., project-centric exploration) and across a corpus of projects (i.e., api-centric exploration) along the dimensions of where, how much and in what manner.

Opendata Graph ⭐ 5

Code to crawl the Common Crawl corpus in order to create a graph of French open data websites

Machinelearning ⭐ 5

Text Classification using Machine Learning session at Lancaster Summer Schools in Corpus Linguistics

Learning Typed Entailment Graphs with Global Soft Constraints (TACL 2018)

Shami Corpus ⭐ 5

Shami Dialect Corpus (SDC)

Rslocator ⭐ 5

Thesis source code and data set

NGram Map Reduce Algorithms
