
elasticsearch-jieba-plugin

Jieba analysis plugin for Elasticsearch. Supported versions: 7.7.0, 7.4.2, 7.3.0, 7.0.0, 6.4.0, 6.0.0, 5.4.0, 5.3.0, 5.2.2, 5.2.1, 5.2.0, 5.1.2, 5.1.1

Features

  • Supports adding dictionaries dynamically, without restarting ES.

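Adapting to different ES versions with simple modifications

Click here

Adding dictionaries dynamically, with no ES restart required

Click here

Using jieba_index and jieba_search

Click here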

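New tokenizer support

If you are on ES 6.4.0, use the latest code from the 6.4.0 branch or from the master branch. You can also download the 6.4.1 release. Upgrading is strongly recommended!

The 6.4.1 release fixes the PositionIncrement problem. For details, see "ES分词PositionIncrement解析" (an analysis of PositionIncrement in ES tokenization).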

Version mapping

Branch | Tag | Elasticsearch version | Release link
7.7.0 | tag v7.7.1 | v7.7.0 | Download: v7.7.0
7.4.2 | tag v7.4.2 | v7.4.2 | Download: v7.4.2
7.3.0 | tag v7.3.0 | v7.3.0 | Download: v7.3.0
7.0.0 | tag v7.0.0 | v7.0.0 | Download: v7.0.0
6.4.0 | tag v6.4.1 | v6.4.0 | Download: v6.4.1
6.4.0 | tag v6.4.0 | v6.4.0 | Download: v6.4.0
6.0.0 | tag v6.0.0 | v6.0.0 | Download: v6.0.1
5.4.0 | tag v5.4.0 | v5.4.0 | Download: v5.4.0
5.3.0 | tag v5.3.0 | v5.3.0 | Download: v5.3.0
5.2.2 | tag v5.2.2 | v5.2.2 | Download: v5.2.2
5.2.1 | tag v5.2.1 | v5.2.1 | Download: v5.2.1
5.2 | tag v5.2.0 | v5.2.0 | Download: v5.2.0
5.1.2 | tag v5.1.2 | v5.1.2 | Download: v5.1.2
5.1.1 | tag v5.1.1 | v5.1.1 | Download: v5.1.1

more details

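  • Choose the source code version that matches your Elasticsearch version.
  • Clone the repository and build the plugin:
git clone https://github.com/sing1ee/elasticsearch-jieba-plugin.git --recursive
cd elasticsearch-jieba-plugin
./gradlew clean pz
  • Copy the zip file to the Elasticsearch plugins directory (the file name below is for version 5.1.2):
cp build/distributions/elasticsearch-jieba-plugin-5.1.2.zip ${path.home}/plugins
  • Unzip the archive and remove the zip file:
cd ${path.home}/plugins
unzip elasticsearch-jieba-plugin-5.1.2.zip
rm elasticsearch-jieba-plugin-5.1.2.zip
  • Start Elasticsearch:
cd ${path.home}
./bin/elasticsearch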
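
Once the node is running, one way to confirm the plugin was picked up is to look at the output of Elasticsearch's GET /_cat/plugins endpoint, which by default prints one line per node and plugin with the columns name, component and version. The following is a minimal Python sketch for reading that output. The component name "jieba" is an assumption (guessed from the plugins/jieba directory used elsewhere in this README), and the sample line is hypothetical, not captured from a real cluster.

```python
from typing import List, Tuple


def parse_cat_plugins(cat_output: str) -> List[Tuple[str, str, str]]:
    """Split the text output of GET /_cat/plugins into (node, component, version) rows.

    Assumes the default columns and node names without spaces.
    """
    rows = []
    for line in cat_output.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            rows.append((parts[0], parts[1], parts[2]))
    return rows


def has_plugin(cat_output: str, component: str) -> bool:
    """Return True if any node reports a plugin with the given component name."""
    return any(comp == component for _, comp, _ in parse_cat_plugins(cat_output))


# Hypothetical sample output; "jieba" as the component name is an assumption.
SAMPLE_CAT_OUTPUT = "node-1 jieba 5.1.2\n"

if __name__ == "__main__":
    print(has_plugin(SAMPLE_CAT_OUTPUT, "jieba"))
```

If the plugin does not appear, the Elasticsearch startup log is the next place to look.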

Custom User Dict

Put your dictionary file, with the suffix .dict, into ${path.home}/plugins/jieba/dic. Each line holds a word followed by its frequency, like this:

小清新 3
百搭 3
显瘦 3
隨身碟 100
your_word word_freq
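
Before dropping a dictionary file into the plugin directory, it can help to check that every line follows the word and frequency layout shown above. Here is a minimal Python sketch of such a check. It assumes that each entry is a word, whitespace, and an integer frequency on one line; how strictly the plugin itself parses the file is not stated here, and the function name parse_user_dict is purely illustrative.

```python
from typing import List, Tuple


def parse_user_dict(text: str) -> List[Tuple[str, int]]:
    """Parse user-dictionary text into (word, frequency) pairs.

    Assumed format: one entry per line, "<word> <freq>", blank lines ignored.
    """
    entries = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected '<word> <freq>', got {raw!r}")
        try:
            freq = int(parts[1])
        except ValueError:
            raise ValueError(f"line {line_no}: frequency is not an integer: {parts[1]!r}")
        entries.append((parts[0], freq))
    return entries


# The sample entries from the README.
SAMPLE = "小清新 3\n百搭 3\n显瘦 3\n隨身碟 100\n"

if __name__ == "__main__":
    for word, freq in parse_user_dict(SAMPLE):
        print(word, freq)
```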

Using stopwords

  • Find stopwords.txt in ${path.home}/plugins/jieba/dic.
  • Create a folder named stopwords under ${path.home}/config:
mkdir -p ${path.home}/config/stopwords
  • Copy stopwords.txt into the folder just created:
cp ${path.home}/plugins/jieba/dic/stopwords.txt ${path.home}/config/stopwords
  • Create the index. The jieba_synonym filter below also expects a synonyms/synonyms.txt file, resolved relative to the config directory:
PUT http://localhost:9200/jieba_index
{
  "settings": {
    "analysis": {
      "filter": {
        "jieba_stop": {
          "type":        "stop",
          "stopwords_path": "stopwords/stopwords.txt"
        },
        "jieba_synonym": {
          "type":        "synonym",
          "synonyms_path": "synonyms/synonyms.txt"
        }
      },
      "analyzer": {
        "my_ana": {
          "tokenizer": "jieba_index",
          "filter": [
            "lowercase",
            "jieba_stop",
            "jieba_synonym"
          ]
        }
      }
    }
  }
}
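
The same request can be issued from a script. The sketch below builds the settings body shown above as a Python dictionary and sends it with the standard library only. The helper names build_index_settings and create_index are illustrative, and the base URL and index name are taken from the request above; the call assumes a node is listening on localhost:9200 with the plugin installed.

```python
import json
import urllib.request


def build_index_settings(stopwords_path: str = "stopwords/stopwords.txt",
                         synonyms_path: str = "synonyms/synonyms.txt") -> dict:
    """Return the index settings from the README as a Python dict."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "jieba_stop": {"type": "stop", "stopwords_path": stopwords_path},
                    "jieba_synonym": {"type": "synonym", "synonyms_path": synonyms_path},
                },
                "analyzer": {
                    "my_ana": {
                        "tokenizer": "jieba_index",
                        "filter": ["lowercase", "jieba_stop", "jieba_synonym"],
                    }
                },
            }
        }
    }


def create_index(base_url: str, index: str) -> str:
    """Send PUT <base_url>/<index> with the settings body and return the raw response."""
    body = json.dumps(build_index_settings()).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires a running node; prints the server's JSON reply.
    print(create_index("http://localhost:9200", "jieba_index"))
```

Keeping the body in a builder function makes it easy to point the two path settings at different files without editing JSON by hand.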
  • Test the analyzer:
POST http://localhost:9200/jieba_index/_analyze
{
  "analyzer" : "my_ana",
  "text" : "黄河之水天上来"
}

The response is as follows:

{
    "tokens": [
        {
            "token": "黄河",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "黄河之水天上来",
            "start_offset": 0,
            "end_offset": 7,
            "type": "word",
            "position": 0
        },
        {
            "token": "之水",
            "start_offset": 2,
            "end_offset": 4,
            "type": "word",
            "position": 1
        },
        {
            "token": "天上",
            "start_offset": 4,
            "end_offset": 6,
            "type": "word",
            "position": 2
        },
        {
            "token": "上来",
            "start_offset": 5,
            "end_offset": 7,
            "type": "word",
            "position": 2
        }
    ]
}
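
In this response several tokens share a position: 黄河 and 黄河之水天上来 are both at position 0, and 天上 and 上来 are both at position 2. Grouping the tokens by position makes that overlap easier to see. The following Python sketch does the grouping on the response above, trimmed to the two fields it uses; the function name tokens_by_position is illustrative and this is only a reading aid, not part of the plugin.

```python
from collections import defaultdict
from typing import Dict, List


def tokens_by_position(response: dict) -> Dict[int, List[str]]:
    """Group the tokens of an _analyze response by their "position" field."""
    grouped = defaultdict(list)
    for tok in response["tokens"]:
        grouped[tok["position"]].append(tok["token"])
    return dict(grouped)


# The response from the README, reduced to token text and position.
RESPONSE = {
    "tokens": [
        {"token": "黄河", "position": 0},
        {"token": "黄河之水天上来", "position": 0},
        {"token": "之水", "position": 1},
        {"token": "天上", "position": 2},
        {"token": "上来", "position": 2},
    ]
}

if __name__ == "__main__":
    for position, tokens in sorted(tokens_by_position(RESPONSE).items()):
        print(position, tokens)
```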

NOTE

This plugin was migrated from jieba-solr.

Roadmap

I will add support for more analyzers:

  • Stanford Chinese analyzer
  • Fudan NLP analyzer
  • ...

If you have ideas, please create an issue and we can work on them together.

