Engineering

Using the Arirang Korean Analyzer with Elasticsearch

This post explains how to use a Korean analyzer with Elasticsearch.

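Note (2017): This post covers arirang, the Korean analyzer from krlucene, and how to use it with Lucene and elasticsearch. It walks through writing an analysis plugin and managing the arirang dictionary.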

Elasticsearch Korea user group: https://www.facebook.com/groups/elasticsearch.kr/
Blog: http://jjeong.tistory.com
LinkedIn: https://www.linkedin.com/in/hwjeong/

Elasticsearch analyzers are built on Lucene. This post shows how to package the Lucene Korean Analyzer as an Elasticsearch plugin.

Lucene Analyzer

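First, a quick look at how a lucene analyzer is put together. A Lucene analyzer consists of a tokenizer and filters. Filters come in two kinds, CharFilter and TokenFilter. A CharFilter normalizes the input text before it reaches the tokenizer, and a TokenFilter filters the tokens that the tokenizer emits. Running text through these stages is what is called analysis.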

[Figure: elasticsearch analyzing sequence]
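The three stages can be illustrated with a toy model. The sketch below uses only the Java standard library and is not Lucene's API; the class and method names (ToyAnalyzer, charFilter, tokenize, tokenFilter) are hypothetical, and each stage is deliberately simple (Unicode normalization, whitespace splitting, lowercasing) so that the order of the stages is easy to follow.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model of the CharFilter -> Tokenizer -> TokenFilter sequence.
// This is an illustration only; it does not use or mimic Lucene classes.
public class ToyAnalyzer {

    // CharFilter role: normalize the raw characters before tokenizing.
    static String charFilter(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Tokenizer role: split the normalized text on whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // TokenFilter role: transform each token emitted by the tokenizer.
    static List<String> tokenFilter(List<String> tokens) {
        List<String> filtered = new ArrayList<>();
        for (String token : tokens) {
            filtered.add(token.toLowerCase(Locale.ROOT));
        }
        return filtered;
    }

    static List<String> analyze(String text) {
        return tokenFilter(tokenize(charFilter(text)));
    }

    public static void main(String[] args) {
        // \uFF25 is a full-width "E"; NFKC normalization maps it to ASCII "E".
        System.out.println(analyze("\uFF25lasticsearch  Korean ANALYZER"));
    }
}
```

In a real analyzer each stage is a pluggable component, which is why arirang ships a tokenizer and a filter as separate classes.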

Lucene Korean Analyzer

The source code for the Lucene Korean Analyzer is available from the following repositories.

svn https://lucenekorean.svn.sourceforge.net/svnroot/lucenekorean

github https://github.com/korlucene

The Lucene Korean Analyzer is also known as Arirang.

Arirang consists of two projects.

  • arirang analyzer
  • arirang morph

1. arirang morph

This project performs the Korean morphological analysis itself and contains the dictionaries.

2. arirang analyzer

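This project wraps the morphological analyzer as a lucene analyzer so that it can be used from lucene. It provides the following classes for the Lucene analyzer pipeline: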

  • KoreanAnalyzer
  • KoreanFilter
  • KoreanFilterFactory
  • KoreanToken
  • KoreanTokenizer
  • KoreanTokenizerFactory

The analyzer depends on dictionaries, and these are included in arirang.morph.

1. Dictionary classpath

  org/apache/lucene/analysis/ko/dic

2. Dictionary files

  org/apache/lucene/analysis/ko
    korean.properties
  org/apache/lucene/analysis/ko/dic
    abbreviation.dic
    cj.dic
    compounds.dic
    eomi.dic
    extension.dic
    josa.dic
    mapHanja.dic
    occurrence.dic
    prefix.dic
    suffix.dic
    syllable.dic
    total.dic
    uncompounds.dic

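3. Dictionary roles

The dictionaries you are most likely to work with are described below.

total.dic is the base dictionary that ships with arirang analyzer. Rather than editing it, add words through extension.dic.

extension.dic is the extension dictionary, where you register additional words.

compounds.dic is the compound word dictionary.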

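4. total.dic / extension.dic entry format

noun/verb/other part of speech/하(여)다 verb/되(어)다 verb/noun that can take '내'/NA/NA/NA/irregular conjugation

Each entry is a word, a comma, and ten characters in the order above. The first nine are 1 or 0 flags, and the tenth is the irregular conjugation code.

Example)

A noun with no other flags set, regular conjugation:

엘사,100000000X

Nouns that also form a verb (노래하다, 소리내다):

노래,100100000X

소리,100001000X

The irregular conjugation codes are as follows.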

  B : ㅂ irregular
  H : ㅎ irregular
  L : 르 irregular
  U : ㄹ irregular
  S : ㅅ irregular
  D : ㄷ irregular
  R : 러 irregular
  X : regular

Reference: http://cafe.naver.com/korlucene/135
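To make the flag positions concrete, here is a minimal sketch of a parser for this entry format. It is not part of arirang; the class name DicEntry and the English flag names (noun, hada, nae, and so on) are hypothetical labels chosen for this illustration, and the sketch assumes every entry has exactly ten characters after the last comma.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative parser for "word,FFFFFFFFFC" entries (9 flags + 1 irregular code).
// Not part of arirang; names are hypothetical.
public class DicEntry {
    // Hypothetical English names for the nine flag positions, in file order.
    static final String[] FLAG_NAMES = {
        "noun", "verb", "otherPos", "hada", "doeda", "nae", "na1", "na2", "na3"
    };

    final String word;
    final Map<String, Boolean> flags = new LinkedHashMap<>();
    final char irregular; // B, H, L, U, S, D, R, or X (regular)

    private DicEntry(String word, String code) {
        this.word = word;
        for (int i = 0; i < FLAG_NAMES.length; i++) {
            flags.put(FLAG_NAMES[i], code.charAt(i) == '1');
        }
        this.irregular = code.charAt(FLAG_NAMES.length);
    }

    static DicEntry parse(String line) {
        int comma = line.lastIndexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("missing comma: " + line);
        }
        String code = line.substring(comma + 1).trim();
        if (code.length() != FLAG_NAMES.length + 1) {
            throw new IllegalArgumentException("expected 10 characters after the comma: " + line);
        }
        return new DicEntry(line.substring(0, comma).trim(), code);
    }

    boolean has(String flagName) {
        return flags.getOrDefault(flagName, false);
    }

    public static void main(String[] args) {
        String[] lines = {"엘사,100000000X", "노래,100100000X", "소리,100001000X"};
        for (String line : lines) {
            DicEntry e = parse(line);
            System.out.println(e.word + " noun=" + e.has("noun") + " hada=" + e.has("hada")
                + " nae=" + e.has("nae") + " irregular=" + e.irregular);
        }
    }
}
```

A check like this can be useful before adding words to extension.dic, since a flag in the wrong position changes how the word is analyzed.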

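5. compounds.dic entry format

compound word:part 1,part 2,...,part N:DBXX

The D and B flags state whether the undecomposed word can form a 하(여)다 verb (D) or a 되(어)다 verb (B).

Example)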

  객관화:객관,화:1100

객관화 can form both 객관화하다 and 객관화되다, so D and B are both set to 1.

Reference) http://krdic.naver.com/search.nhn?query=%EA%B0%9D%EA%B4%80%ED%99%94&kind=all
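The compound entry format can be read the same way. The following is a minimal sketch, not arirang code: the class name CompoundEntry and its field names are hypothetical, and it assumes the three fields are separated by ":" and that the first two characters of the last field are the D and B flags.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative parser for "word:part1,part2,...:DBXX" entries.
// Not part of arirang; names are hypothetical.
public class CompoundEntry {
    final String word;        // the undecomposed compound word
    final List<String> parts; // the words it decomposes into
    final boolean hada;       // D flag: can form a 하(여)다 verb
    final boolean doeda;      // B flag: can form a 되(어)다 verb

    private CompoundEntry(String word, List<String> parts, boolean hada, boolean doeda) {
        this.word = word;
        this.parts = parts;
        this.hada = hada;
        this.doeda = doeda;
    }

    static CompoundEntry parse(String line) {
        String[] fields = line.trim().split(":");
        if (fields.length != 3 || fields[2].length() < 2) {
            throw new IllegalArgumentException("expected word:parts:DBXX but got: " + line);
        }
        List<String> parts = Arrays.asList(fields[1].split(","));
        return new CompoundEntry(fields[0], parts,
            fields[2].charAt(0) == '1', fields[2].charAt(1) == '1');
    }

    public static void main(String[] args) {
        CompoundEntry e = parse("객관화:객관,화:1100");
        System.out.println(e.word + " -> " + e.parts + " hada=" + e.hada + " doeda=" + e.doeda);
    }
}
```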

Building the Elasticsearch plugin

Now let's build the Elasticsearch plugin.

Building arirang

1. Clone the source

Clone the master branch of each repository.

$ git clone https://github.com/korlucene/arirang.morph.git
$ git clone https://github.com/korlucene/arirang-analyzer-6.git

2. Maven build

Both are maven projects, so maven must be installed.

maven : https://maven.apache.org/

arirang-analyzer-6 depends on arirang.morph, so build arirang.morph first and then arirang-analyzer-6.

arirang.morph $ mvn clean package
arirang-analyzer-6 $ mvn clean package

3. Test

arirang-analyzer-6 includes test code. See TestKoreanAnalyzer1 under src/test. The test looks like this.

/**
 * Created by SooMyung(soomyung.lee@gmail.com) on 2014. 7. 30.
 */
public class TestKoreanAnalyzer1 extends TestCase {

  public void testKoreanAnalzer() throws Exception {

    String[] sources = new String[]{
      "고려 때 중랑장(中郞將) 이돈수(李敦守)의 12대손이며",
      "이돈수(李敦守)의",
      "K•N의 비극",
      "金靜子敎授",
      "天國의",
      "기술천이",
      "12대손이며",
      "明憲淑敬睿仁正穆弘聖章純貞徽莊昭端禧粹顯懿獻康綏裕寧慈溫恭安孝定王后",
      "홍재룡(洪在龍)의",
      "정식시호는 명헌숙경예인정목홍성장순정휘장소단희수현의헌강수유령자온공안효정왕후(明憲淑敬睿仁正穆弘聖章純貞徽莊昭端禧粹顯懿獻康綏裕寧慈溫恭安孝定王后)이며 돈령부영사(敦寧府領事) 홍재룡(洪在龍)의 딸이다. 1844년, 헌종의 정비(正妃)인 효현왕후가 승하하자 헌종의 계비로써 중궁에 책봉되었으나 5년 뒤인 1849년에 남편 헌종이 승하하고 철종이 즉위하자 19세의 어린 나이로 대비가 되었다. 1857년 시조모 대왕대비 순원왕후가 승하하자 왕대비가 되었다.",
      "노벨상을"
    };

    KoreanAnalyzer analyzer = new KoreanAnalyzer();

    for (String source : sources) {
      TokenStream stream = analyzer.tokenStream("dummy", new StringReader(source));

      CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
      PositionIncrementAttribute posIncrAtt = stream.addAttribute(PositionIncrementAttribute.class);
      PositionLengthAttribute posLenAtt = stream.addAttribute(PositionLengthAttribute.class);
      TypeAttribute typeAtt = stream.addAttribute(TypeAttribute.class);
      OffsetAttribute offsetAtt = stream.addAttribute(OffsetAttribute.class);
      MorphemeAttribute morphAtt = stream.addAttribute(MorphemeAttribute.class);
      stream.reset();

      while (stream.incrementToken()) {
        System.out.println(termAtt.toString() + ":" + posIncrAtt.getPositionIncrement() + "(" + offsetAtt.startOffset() + "," + offsetAtt.endOffset() + ")");
      }
      stream.close();
    }

  }
}

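Writing the Elasticsearch plugin

With arirang built, the next step is to write the elasticsearch plugin. Before starting, read through the elasticsearch plugins documentation.

Elasticsearch Plugins and Integrations : https://www.elastic.co/guide/en/elasticsearch/plugins/5.5/index.html

Elastic also provides an example plugin. https://github.com/elastic/elasticsearch/tree/master/plugins/jvm-example

The official plugins in the elasticsearch source code are good references as well.

An analysis plugin project is laid out as follows.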

1. Project Directory

src/main
  assemblies
    plugin.xml
  java
    org/elasticsearch
      index/analysis
        ${CUSTOM-ANALYZER-NAME}AnalyzerProvider
        ${CUSTOM-ANALYZER-NAME}TokenFilterFactory
        ${CUSTOM-ANALYZER-NAME}TokenizerFactory
      plugin/analysis/arirang
        Analysis${CUSTOM-ANALYZER-NAME}Plugin
  resources
    plugin-descriptor.properties

2. Files and classes

  • plugin.xml is the configuration for the maven assembly plugin.
  • plugin-descriptor.properties is the plugin descriptor described in the plugin authors guide. elasticsearch reference: https://www.elastic.co/guide/en/elasticsearch/plugins/5.5/plugin-authors.html
  • ${CUSTOM-ANALYZER-NAME}AnalyzerProvider provides the custom analyzer.
  • ${CUSTOM-ANALYZER-NAME}TokenFilterFactory creates the custom filter.
  • ${CUSTOM-ANALYZER-NAME}TokenizerFactory creates the custom tokenizer.
  • Analysis${CUSTOM-ANALYZER-NAME}Plugin registers the custom analyzer plugin.

Now let's build the elasticsearch-analysis-arirang plugin.

In addition to the analyzer, this plugin includes a Rest Handler for dynamically reloading the arirang dictionary.

Source) https://github.com/HowookJeong/elasticsearch-analysis-arirang/tree/5.5.0

Step1)
Step2)
  • Create the plugin project structure described above.
Step3)
  • Create a lib directory under the project root path and copy the arirang analyzer jar files built earlier into it.
  • arirang.lucene-analyzer-VERSION.jar
  • arirang-morph-VERSION.jar
Step4)
  • Declare the local jar files as dependencies in pom.xml.

    <dependency>
      <groupId>com.argo</groupId>
      <artifactId>morph</artifactId>
      <version>${morph.version}</version>
      <scope>system</scope>
      <systemPath>${project.basedir}/lib/arirang-morph-${morph.version}.jar</systemPath>
      <optional>false</optional>
    </dependency>
    <dependency>
      <groupId>com.argo</groupId>
      <artifactId>arirang.lucene-analyzer-${lucene.version}</artifactId>
      <version>${morph.version}</version>
      <scope>system</scope>
      <systemPath>${project.basedir}/lib/arirang.lucene-analyzer-${lucene.version}-${morph.version}.jar</systemPath>
      <optional>false</optional>
    </dependency>
Step5)
  • Write the analysis plugin class. It registers the REST handler, token filter, tokenizer, and analyzer.

    @Override
    public List<RestHandler> getRestHandlers(Settings settings,
                                             RestController restController,
                                             ClusterSettings clusterSettings,
                                             IndexScopedSettings indexScopedSettings,
                                             SettingsFilter settingsFilter,
                                             IndexNameExpressionResolver indexNameExpressionResolver,
                                             Supplier<DiscoveryNodes> nodesInCluster) {
      return singletonList(new ArirangAnalyzerRestAction(settings, restController));
    }

    @Override
    public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
      return singletonMap("arirang_filter", ArirangTokenFilterFactory::new);
    }

    @Override
    public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
      Map<String, AnalysisProvider<TokenizerFactory>> extra = new HashMap<>();
      extra.put("arirang_tokenizer", ArirangTokenizerFactory::new);

      return extra;
    }

    @Override
    public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
      return singletonMap("arirang_analyzer", ArirangAnalyzerProvider::new);
    }

Step6)
  • Write the analysis classes: the analyzer provider, the token filter factory, and the tokenizer factory.

    // ArirangAnalyzerProvider
    private final KoreanAnalyzer analyzer;

    public ArirangAnalyzerProvider(IndexSettings indexSettings, Environment env, String name, Settings settings) throws IOException {
      super(indexSettings, name, settings);

      analyzer = new KoreanAnalyzer();
    }

    @Override
    public KoreanAnalyzer get() {
      return this.analyzer;
    }

    // ArirangTokenFilterFactory
    public ArirangTokenFilterFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) {
      super(indexSettings, name, settings);
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
      return new KoreanFilter(tokenStream);
    }

    // ArirangTokenizerFactory
    public ArirangTokenizerFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) {
      super(indexSettings, name, settings);
    }

    @Override
    public Tokenizer create() {
      return new KoreanTokenizer();
    }

Step7)
  • Write the rest action that reloads the dictionary.

    // ArirangAnalyzerRestAction
    @Inject
    public ArirangAnalyzerRestAction(Settings settings, RestController controller) {
      super(settings);

      controller.registerHandler(RestRequest.Method.GET, "/_arirang_dictionary_reload", this);
    }

    @Override
    protected RestChannelConsumer prepareRequest(RestRequest restRequest, NodeClient client) throws IOException {
      try {
        DictionaryUtil.loadDictionary();
      } catch (MorphException me) {
        return channel -> channel.sendResponse(new BytesRestResponse(RestStatus.NOT_ACCEPTABLE, "Failed which reload arirang analyzer dictionary!!"));
      }

      return channel -> channel.sendResponse(new BytesRestResponse(RestStatus.OK, "Reloaded arirang analyzer dictionary!!"));
    }

    // ArirangAnalyzerRestModule
    @Override
    protected void configure() {
      // TODO Auto-generated method stub
      bind(ArirangAnalyzerRestAction.class).asEagerSingleton();
    }

Step8)
  • Write plugin-descriptor.properties.

    classname=org.elasticsearch.plugin.analysis.arirang.AnalysisArirangPlugin
    name=analysis-arirang
    jvm=true
    java.version=1.8
    site=false
    isolated=true
    description=Arirang plugin
    version=${project.version}
    elasticsearch.version=${elasticsearch.version}
    hash=${buildNumber}
    timestamp=${timestamp}

Step9)
  • Write plugin.xml.

    <file>
      <source>lib/arirang.lucene-analyzer-6.5.1-1.1.0.jar</source>
      <outputDirectory>elasticsearch</outputDirectory>
    </file>
    <file>
      <source>lib/arirang-morph-1.1.0.jar</source>
      <outputDirectory>elasticsearch</outputDirectory>
    </file>
    <file>
      <source>target/elasticsearch-analysis-arirang-5.5.0.jar</source>
      <outputDirectory>elasticsearch</outputDirectory>
    </file>
    <file>
      <source>${basedir}/src/main/resources/plugin-descriptor.properties</source>
      <outputDirectory>elasticsearch</outputDirectory>
      <filtered>true</filtered>
    </file>
Step10)
  • Build the plugin.

    $ mvn clean package -DskipTests=true

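The complete source is on github, at the repository linked above.

Next, install the plugin and check that it works.

1. Install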

$ bin/elasticsearch-plugin install --verbose file:///path/elasticsearch-analysis-arirang-5.5.0.zip

2. Verify

  • Call the RESTful endpoint.

Request)

http://localhost:9200/_arirang_dictionary_reload

Response)

Reloaded arirang analyzer dictionary!!
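The same check can be scripted. The sketch below is one way to call the endpoint from a separate Java program using the JDK's built-in HTTP client, which requires Java 11 or later (this applies to the client program, not to the plugin). The class name ReloadDictionary and the default base URL are assumptions for this example; only the /_arirang_dictionary_reload path comes from the plugin.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative client for the dictionary reload endpoint (Java 11+).
public class ReloadDictionary {

    // Builds a GET request for the reload endpoint under the given base URL.
    static HttpRequest buildRequest(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_arirang_dictionary_reload"))
            .GET()
            .build();
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed default; pass another base URL as the first argument if needed.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:9200";
        HttpClient client = HttpClient.newHttpClient();
        try {
            HttpResponse<String> response =
                client.send(buildRequest(baseUrl), HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        } catch (IOException e) {
            // Typically means no node is listening at baseUrl.
            System.out.println("Could not reach " + baseUrl + ": " + e);
        }
    }
}
```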

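That covers building arirang analyzer and the elasticsearch plugin. The last topic is managing the arirang analyzer dictionary.

Start by setting the arirang dictionary path.

1. Register the classpath

  • Add the dictionary directory to the elasticsearch classpath.
  • Edit elasticsearch.in.sh as follows.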

    ES_CLASSPATH="$ES_HOME/lib/elasticsearch-5.5.0.jar:$ES_HOME/lib/*:$ES_CONF_PATH/dictionary"

Example) The dictionary files then go under the following paths.

config/dictionary/org/apache/lucene/analysis/ko
config/dictionary/org/apache/lucene/analysis/ko/dic
  • ES_CONF_PATH is the path configured as path.conf.
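Before restarting or reloading, it can help to confirm that the dictionary files are where the classpath entry expects them. The following is a minimal sketch under stated assumptions: the class name DictionaryLayoutCheck is hypothetical, the default root config/dictionary matches the example paths above, and the file names are taken from the dictionary file list earlier in this post.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative check that the arirang dictionary files exist under a root directory.
public class DictionaryLayoutCheck {
    static final String BASE = "org/apache/lucene/analysis/ko";

    // File names from the "Dictionary files" list earlier in this post.
    static final String[] DIC_FILES = {
        "abbreviation.dic", "cj.dic", "compounds.dic", "eomi.dic", "extension.dic",
        "josa.dic", "mapHanja.dic", "occurrence.dic", "prefix.dic", "suffix.dic",
        "syllable.dic", "total.dic", "uncompounds.dic"
    };

    // Returns the relative paths of expected files that are not present under the root.
    static List<String> missingFiles(Path dictionaryRoot) {
        List<String> expected = new ArrayList<>();
        expected.add(BASE + "/korean.properties");
        for (String name : DIC_FILES) {
            expected.add(BASE + "/dic/" + name);
        }
        List<String> missing = new ArrayList<>();
        for (String relative : expected) {
            if (!Files.isRegularFile(dictionaryRoot.resolve(relative))) {
                missing.add(relative);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed default root; pass the real dictionary directory as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "config/dictionary");
        List<String> missing = missingFiles(root);
        if (missing.isEmpty()) {
            System.out.println("All dictionary files found under " + root);
        } else {
            System.out.println("Missing under " + root + ": " + missing);
        }
    }
}
```

Whether a custom directory needs every file or only the ones you changed depends on how arirang resolves dictionaries from the classpath, so treat a "missing" report as a prompt to check rather than as an error.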

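2. Edit the dictionary

  • Edit the dictionary files under the path registered in step 1.

3. reload

  • Instead of restarting elasticsearch, call the /_arirang_dictionary_reload API to apply the changes.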

This post covered arirang analyzer, the elasticsearch-analysis-arirang plugin, and dictionary management.

References)
http://cafe.naver.com/korlucene
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
https://www.elastic.co/guide/en/elasticsearch/plugins/current/index.html