Awesome Open Source
Search results for java corpus
98 search results found
Vespa
⭐
5,115
AI + Data, online. https://vespa.ai
Wiki2vec
⭐
587
Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby
Codebuff
⭐
333
Language-agnostic pretty-printing through machine learning (uh, like, is this possible? YES, apparently).
Javafuzz
⭐
195
coverage guided fuzz testing for java
Semantic Knowledge Graph
⭐
180
Pignlproc
⭐
160
Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.
Mr.lda
⭐
153
Scalable Topic Modeling using Variational Inference in MapReduce
Lftm
⭐
149
Improving topic models LDA and DMM (one-topic-per-document model for short texts) with word embeddings (TACL 2015)
Word2vec Lucene
⭐
127
This tool extracts word vectors from Lucene index.
Syntactic
⭐
102
Lexical categorization engine for large datasets. Good for NLP and Data Mining.
Blacklab
⭐
97
Linguistic search for large annotated text corpora, based on Apache Lucene
Jldadmm
⭐
68
A Java package for the LDA and DMM topic models
Rdrsegmenter
⭐
67
A Fast and Accurate Vietnamese Word Segmenter (LREC 2018)
Annis
⭐
67
ANNIS is an open source, versatile web browser-based search and visualization architecture for complex multilevel linguistic corpora with diverse types of annotation.
Jatecs
⭐
53
Java text categorization system
Image Verification Corpus
⭐
45
This contains an evolving dataset of fake and real images shared in social media.
Chemspot
⭐
42
ChemSpot is a named entity recognition tool for identifying mentions of chemicals in natural language texts, including trivial names, drugs, abbreviations, molecular formulas and IUPAC entities. Since the different classes of relevant entities have rather different naming characteristics, ChemSpot uses a hybrid approach combining a Conditional Random Field with a dictionary. ChemSpot is released under the Common Public License 1.0.
Science Result Extractor
⭐
42
Lucenetutorial
⭐
39
A simple tutorial of Lucene for LIS 501 Introduction to Text Mining students at the University of Wisconsin-Madison (Fall 2021).
Ufsac
⭐
35
UFSAC is a resource containing all WordNet Sense Annotated Corpora, and a Java library for manipulating them
Dkpro C4corpus
⭐
32
DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
Logiclda
⭐
32
Topic modeling with first-order logic (FOL) domain knowledge
Bigfatlm
⭐
30
Hadoop MapReduce training of modified Kneser-Ney smoothed language models
Cryptokcodecracker
⭐
27
Running Key Cipher Decoder + other classic cipher decoders. Automatically discovers likely solutions using an NGram language model.
Hypervec
⭐
25
Hierarchical Embeddings for Hypernymy Detection and Directionality
Conll Rdf
⭐
25
Advanced graph rewriting and LLOD publication for CoNLL and other TSV formats
Cocolian Nlp
⭐
23
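This project aims to build a standardized NLP processing framework that provides enterprise-grade APIs along with various recommended implementations and test packages. There are many NLP packages available in China and abroad, including those from the Chinese Academy of Sciences and Fudan University; by wrapping these commonly used NLP tools, the project can provide enterprises with…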
Jitar
⭐
23
Jitar HMM part of speech tagger
Teneo
⭐
22
Rhapsode
⭐
21
Advanced desktop search/corpus exploration prototype
Dependency Parsing Toolbox
⭐
19
"Dependency Parsing toolbox" integrates different algorithms related to dependency parsing in one place. This toolbox has been developed by Mojtaba Khallash from Iran University of Science and Technology (IUST).
Openconvert
⭐
19
Text conversion tool (from e.g. Word, HTML, txt) to corpus formats (TEI or FoLiA)
Darks Learning
⭐
18
Darks Learning is a machine learning algorithm library. It contains Word2vec, DBN, RBM, MLP, LSA, PLSA, SDA, Maxent, regression, etc.
Byblo
⭐
18
A tool for the automatic construction of distributional thesauri
Cc News Tools
⭐
17
Tools relating to the CC-News-En Collection
Hfututils
⭐
17
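A collection of utility programs for everyday data preprocessing, with an emphasis on text processing. It includes word segmentation (integrating Zhang Huaping's…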
Ixa Pipe Pos
⭐
17
IXA pipes Part of Speech tagger and Lemmatizer (http://ixa2.si.ehu.es/ixa-pipes)
Alvisnlp
⭐
17
ALvisNLP corpus processing engine
Workbench
⭐
17
Java and Lucene based tools for BitFunnel corpus preparation
Nlp
⭐
16
NLP Homework (Spring 2013)
C Cat
⭐
16
A collection of tools for applying word senses to large corpora
Tribble
⭐
15
Coverage based JVM Fuzz testing tool.
Quootstrap
⭐
14
Unsupervised method for extracting quotation-speaker pairs from large news corpora.
Krill
⭐
14
🔍 A Corpus Data Retrieval Index using Lucene for Look-Ups
Porn_detector
⭐
13
Porn Detection via Skin Tone
Termolator
⭐
13
Chinese version of NYU's Termolator terminology extraction system. Also includes source code for the English part-of-speech tagger used in the English version.
Vizlinc
⭐
13
Vizlinc
Tri
⭐
12
Temporal Random Indexing
Jtrans
⭐
12
text-to-speech alignment java software
Bible Corpus Tools
⭐
12
A collection of tools for reading/processing the multilingual Bible corpus
Opiec
⭐
12
Reading the data from OPIEC - an Open Information Extraction corpus
Hexatomic
⭐
12
Hexatomic is extensible software for deep multi-layer annotation of linguistic corpora
Trombone
⭐
12
Teva
⭐
12
Topic Evolution Analysis - an algorithm for analyzing knowledge flow in text based corpora
Invitationmodel
⭐
10
Implementation of domain adaptation algorithm based on the paper "Latent Domain Translation Models in Mix-of-Domains Haystack" http://www.aclweb.org/anthology/C14-1182
Geocorpora
⭐
10
The GeoCorpora project aims at creating corpora of fully geo-annotated texts (in particular microblog texts) and developing tools to support the corpus building process using crowd-sourcing and visual analytics approaches. Created corpora will be made publicly available in this repository. A first corpus of ~6000 geo-annotated tweets will be published here in the near future.
Neulearn Ai_agent
⭐
10
Conversational agent project - Neulearn
Neural Bon
⭐
10
code for AAAI-17 paper "Neural Bag-of-Ngrams"
Dwtc Extractor
⭐
10
Extraction code used to create the Dresden Web Table Corpus
Slner
⭐
10
Slovene Named Entity Extractor
Saga
⭐
10
Emnlp2016 Empirical Convincingness
⭐
9
Code and data for EMNLP2016 article "What makes a convincing argument? Empirical analysis and detecting attributes of convincingness in Web argumentation" by Ivan Habernal and Iryna Gurevych
Koshik
⭐
9
An NLP framework for large scale processing using Hadoop
Koral
⭐
9
📝 Translation of query languages to serialized KoralQuery protocol
Isimp
⭐
8
A sentence simplification system
Java Corpus
⭐
8
Corpus of runnable, open-source Java 1.5+ programs.
Dwtc Tools
⭐
8
Dresden Web Table Corpus Java library
Minesstubs
⭐
8
Hosts our tool for mining simple "stupid" bugs (SStuBs).
Spelling
⭐
8
Naive Bayes classifier for language detection and spelling correction
Gate Ml
⭐
8
Pre-processing, Training and Classification in Embedded GATE
Wordreprs_ner
⭐
8
Fork of NER code from Turian et al. (2010) and Ratinov et al. (2009)
Linguaview
⭐
8
A GUI tool to visualize grammar formalisms
Emnlp2017 Cmapsum Corpus
⭐
8
Accompanying code for our EMNLP 2017 publication "Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps"
Factcheck
⭐
7
Mdswriter
⭐
7
Software for manually creating multi-document summarization corpora and a platform for developing complex annotation tasks spanning multiple steps.
Justext Java
⭐
7
Opiec Pipeline
⭐
7
Babel
⭐
7
Translation without parallel corpora.
Corpuscompression
⭐
6
Achieve better compression for small objects with a predefined corpus
Tacl2016 Trainingdata4srl
⭐
6
Code for automated labeling of FrameNet roles in arbitrary text (TACL paper)
Randomwordgenerator
⭐
6
This Android app generates randomly selected words from large word lists derived from dictionaries and published text corpora.
Naive Bayes Classifier
⭐
6
The Naive Bayes classifier is a classification algorithm. This project uses the Bernoulli and multinomial Naive Bayes models to classify text documents as ham or spam.
Mots
⭐
6
MOTS (MOdular Tool for Summarization) is a summarization system written in Java. It is as modular as possible and is intended to provide an architecture for implementing and testing new summarization methods, as well as to ease comparison with already implemented methods, in a unified framework.
Syntime
⭐
6
The Java source code and datasets of our ACL2017 paper: Time Expression Analysis and Recognition Using Syntactic Token Types and General Heuristic Rules
Scoobie
⭐
6
An RDF based Information Extraction system
Ywwtools
⭐
6
Weiwei's personal tools
Contexto
⭐
6
An open source contextual dictionary
Iust Htmlchardet
⭐
6
A java tool for detecting charset encoding of HTML web pages
Sigir 17
⭐
5
Codeswitchingresearch
⭐
5
A project for Code-Switching Research
Information Retrieval
⭐
5
Textual Information Retrieval (IR) and Information Extraction (IE) engine
Exapus
⭐
5
Exapus is a web application for exploring the usage of APIs within a single project (i.e., project-centric exploration) and across a corpus of projects (i.e., api-centric exploration) along the dimensions of where, how much and in what manner.
Opendata Graph
⭐
5
Code to crawl the Common Crawl corpus in order to create a graph of French open data websites
Machinelearning
⭐
5
Text Classification using Machine Learning session at Lancaster Summer Schools in Corpus Linguistics
Entgraph
⭐
5
Learning Typed Entailment Graphs with Global Soft Constraints (TACL 2018)
Shami Corpus
⭐
5
Shami Dialect Corpus (SDC)
Rslocator
⭐
5
thesis source code and data set
Ngrams
⭐
5
NGram Map Reduce Algorithms