Published: December 24, 2015

Overview

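I put this material together for new engineers to study from. An awful lot of new hires are still working through training curricula that are more than ten years old. To be fair, the real problem is company training programs that have not kept up with the times.
So I have collected the things I would like you to know if you are going to work as a web (smartphone) engineer from 2015 onward.
This article covers the N-gram method used by search engines.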

Environment

Elasticsearch 2.1.1

What is the N-gram method?

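The N-gram method walks through the text in character order and indexes every run of N consecutive characters as an index term.
N is the number of characters in each extracted piece.
Because it needs no dictionary, it can build index terms independently of the language, which makes it a common choice for languages such as Japanese and Chinese that do not separate words with spaces. Its main strength is that it reduces missed matches.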
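The definition above can be sketched in a few lines of Ruby. This is only an illustration of the idea, not how Elasticsearch implements it; the helper name `ngrams` is made up for this example.

```ruby
# Hypothetical helper (not part of Elasticsearch): returns every run of
# n consecutive characters of text, in the order they appear.
def ngrams(text, n)
  text.each_char.each_cons(n).map(&:join)
end

p ngrams("東京都", 1) # => ["東", "京", "都"]
p ngrams("東京都", 2) # => ["東京", "京都"]
p ngrams("東京都", 3) # => ["東京都"]
```

A text shorter than N yields no terms at all, which is worth keeping in mind for very short queries.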

When to use N-grams

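For example, fields where you want as few missed matches as possible, such as store names and place names, are generally considered good candidates for N-gram indexing.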
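As a rough illustration of why partial names still match, here is a minimal sketch of a bigram inverted index in Ruby. The store names, the `STORES` and `INDEX` constants, and the `search` helper are all made up for this example, and a real search engine does considerably more (for instance, checking term positions).

```ruby
require "set"

# Split text into overlapping 2-character pieces.
def bigrams(text)
  text.each_char.each_cons(2).map(&:join)
end

# Hypothetical store names, invented for illustration.
STORES = ["東京駅前店", "京都駅前店", "新宿東口店"]

# Inverted index: bigram => set of indexes into STORES.
INDEX = Hash.new { |hash, key| hash[key] = Set.new }
STORES.each_with_index do |name, id|
  bigrams(name).each { |gram| INDEX[gram] << id }
end

# A store matches when it contains every bigram of the query.
def search(query)
  ids = bigrams(query).map { |gram| INDEX.fetch(gram, Set.new) }.reduce(:&)
  (ids || Set.new).sort.map { |id| STORES[id] }
end

p search("駅前")   # => ["東京駅前店", "京都駅前店"]
p search("京都駅") # => ["京都駅前店"]
```

Because this sketch ignores where each bigram occurs, it can also return names that contain the query's bigrams in a different arrangement. That extra noise is the usual price of fewer missed matches.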

N-gram examples

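unigram (1 character at a time)
bigram (2 characters at a time)
trigram (3 characters at a time)

In practice, one of these three is considered the realistic choice.
Here we will use the bigram (2 characters at a time) as our example.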

Example 1

東京都の天気は晴れです　→　東京/京都/都の/の天/天気/気は/は晴/晴れ/れで/です

The text is simply cut into 2-character pieces, moving forward one character at a time.
The resulting index terms are therefore as follows.

東京
京都
都の
の天
天気
気は
は晴
晴れ
れで
です

Each term is 2 characters long. Let's check one more example sentence to make sure.

Example 2

天気予報は晴れです　→　天気/気予/予報/報は/は晴/晴れ/れで/です

天気
気予
予報
報は
は晴
晴れ
れで
です
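The two term lists above can be reproduced with a short Ruby sketch (the `bigrams` helper is a name chosen for this example). It also shows which terms the two sentences have in common, which is what lets one query reach both sentences.

```ruby
# Split text into overlapping 2-character pieces.
def bigrams(text)
  text.each_char.each_cons(2).map(&:join)
end

example1 = bigrams("東京都の天気は晴れです")
example2 = bigrams("天気予報は晴れです")

puts example1.join("/")
puts example2.join("/")

# Terms that appear in both sentences.
p example1 & example2 # => ["天気", "は晴", "晴れ", "れで", "です"]
```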

That should make the idea clear. Now let's actually use Elasticsearch, a search engine.

Installing Elasticsearch

Let's try this out with Elasticsearch and Ruby to deepen our understanding.
I recommend preparing the environment with Docker or Vagrant.
The following is the installation procedure on CentOS 6.5 running under Vagrant.

terminal


# Install Java
sudo yum -y install java-1.7.0-openjdk

java -version
java version "1.7.0_91"
OpenJDK Runtime Environment (rhel-2.6.2.2.el6_7-x86_64 u91-b00)
OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)

# Download and install the RPM package
sudo rpm -ivh https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/rpm/elasticsearch/2.1.1/elasticsearch-2.1.1.rpm

### NOT starting on installation, please execute the following statements to configure elasticsearch service to start automatically using chkconfig
 sudo chkconfig --add elasticsearch
### You can start elasticsearch service by executing
 sudo service elasticsearch start
 
# Register the service with chkconfig
sudo chkconfig --add elasticsearch
# Confirm the registration
sudo chkconfig --list elasticsearch
elasticsearch   0:off   1:off   2:on    3:on    4:on    5:on    6:off

# Start the service
sudo service elasticsearch start
Starting elasticsearch:                                    [  OK  ]

Verify that it is running.

terminal


curl http://localhost:9200
{
  "name" : "Orb",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.1.1",
    "build_hash" : "40e2c53a6b6c2972b3d13846e450e66f4375bd71",
    "build_timestamp" : "2015-12-15T13:05:55Z",
    "build_snapshot" : false,
    "lucene_version" : "5.3.1"
  },
  "tagline" : "You Know, for Search"
}

That completes the installation.

Configuring N-grams

To use N-grams in Elasticsearch, configure an ngram tokenizer as the tokenizer of an analyzer.

terminal


curl -XPUT 'http://localhost:9200/sample' -d '
    {
        "settings" : {
            "analysis" : {
                "analyzer" : {
                    "default" : {
                        "tokenizer" : "my_ngram_tokenizer"
                    }
                },
                "tokenizer" : {
                    "my_ngram_tokenizer" : {
                        "type" : "nGram",
                        "min_gram" : "2",
                        "max_gram" : "2",
                        "token_chars": [ "letter", "digit" ]
                    }
                }
            }
        }
    }'

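Since this is just a sample, the tokenizer of the default analyzer is set to my_ngram_tokenizer.
type: nGram is the setting that selects the N-gram tokenizer, and with min_gram and max_gram both set to 2 it produces bigrams.
If the JSON {"acknowledged":true} is returned, the request succeeded.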
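The token_chars setting above lists the character classes that may appear inside a token; Elasticsearch splits the text on any character outside those classes before generating N-grams. The Ruby sketch below approximates that behavior. It is an illustration only: the `ngram_tokens` helper is made up, and mapping "letter" and "digit" to the Unicode categories `\p{L}` and `\p{Nd}` is an assumption of this example.

```ruby
# Rough approximation of an ngram tokenizer with
# token_chars: ["letter", "digit"]. Not Elasticsearch's actual code.
def ngram_tokens(text, min_gram: 2, max_gram: 2)
  # Keep only runs of letters and digits (assumed: \p{L} and \p{Nd}).
  text.scan(/[\p{L}\p{Nd}]+/).flat_map do |run|
    chars = run.chars
    (0...chars.size).flat_map do |start|
      (min_gram..max_gram)
        .select { |len| start + len <= chars.size }
        .map { |len| chars[start, len].join }
    end
  end
end

p ngram_tokens("東京都 天気") # => ["東京", "京都", "天気"]
p ngram_tokens("晴れ、15度")  # => ["晴れ", "15", "5度"]
```

Note that no token spans the space or the comma: each run of letters and digits is cut into bigrams on its own.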

Running the N-gram analysis

Analyze the example sentence "東京都の天気は晴れです" ("The weather in Tokyo is sunny") with the ngram analyzer.

terminal


curl 'http://localhost:9200/sample/_analyze?pretty' -d '東京都の天気は晴れです'
{
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "京都",
    "start_offset" : 1,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "都の",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "の天",
    "start_offset" : 3,
    "end_offset" : 5,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "天気",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "気は",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "は晴",
    "start_offset" : 6,
    "end_offset" : 8,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "晴れ",
    "start_offset" : 7,
    "end_offset" : 9,
    "type" : "word",
    "position" : 7
  }, {
    "token" : "れで",
    "start_offset" : 8,
    "end_offset" : 10,
    "type" : "word",
    "position" : 8
  }, {
    "token" : "です",
    "start_offset" : 9,
    "end_offset" : 11,
    "type" : "word",
    "position" : 9
  } ]
}

Nice. The sentence has been split cleanly into 2-character tokens. Let's try the other example sentence as well.

terminal


curl 'http://localhost:9200/sample/_analyze?pretty' -d '天気予報は晴れです'
{
  "tokens" : [ {
    "token" : "天気",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "気予",
    "start_offset" : 1,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "予報",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "報は",
    "start_offset" : 3,
    "end_offset" : 5,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "は晴",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "晴れ",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "れで",
    "start_offset" : 6,
    "end_offset" : 8,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "です",
    "start_offset" : 7,
    "end_offset" : 9,
    "type" : "word",
    "position" : 7
  } ]
}

It has been analyzed correctly as N-grams (2-grams).
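The offsets and positions in the two responses follow a simple pattern: the i-th bigram starts at character i and ends at character i + 2. A minimal Ruby sketch of that pattern follows; `analyze_bigrams` is a made-up helper, and the pattern is only claimed for input like these two sentences, where every character is a letter.

```ruby
# Hypothetical helper mirroring the token, start_offset, end_offset and
# position fields seen in the _analyze responses for a bigram tokenizer.
def analyze_bigrams(text)
  text.each_char.each_cons(2).each_with_index.map do |pair, i|
    { token: pair.join, start_offset: i, end_offset: i + 2, position: i }
  end
end

tokens = analyze_bigrams("東京都の天気は晴れです")
tokens.each { |token| p token }
```

For the first sentence this yields 10 tokens, from 東京 (offsets 0 to 2) through です (offsets 9 to 11), matching the response shown earlier.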