Elasticsearch Study Notes

Configuration
API
Common Elasticsearch Terms
Document
Index
Mapping
Elasticsearch CRUD
Elasticsearch Query
- Query String
- Query DSL
Elasticsearch Ingest Node
Plugins
- Filter Plugin - dissect
Built-in Analyzers
Chinese Word Segmentation

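Configuration files live in the config directory

elasticsearch.yml: es settings

cluster.name: cluster name; nodes that share a cluster name belong to the same cluster
node.name: node name; distinguishes the nodes within a cluster
network.host/http.port: network address and port used by the http and transport services
path.data: data storage path
path.logs: log storage path

Development vs. Production mode

The mode is decided by whether the transport address is bound to localhost (network.host)
In Development mode, failed configuration checks are reported as warnings at startup
In Production mode, failed configuration checks are reported as errors at startup and the node exits

A second way to change settings

bin/elasticsearch -Esetting_name=setting_value

jvm.options: JVM settings

log4j2.properties: logging settings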

/_cat/nodes

Lists the nodes in the cluster

/_cat/nodes?v

Lists the nodes in detail, with column headers; a * in the master column marks the master node

/_cluster/stats

Shows detailed information about the cluster

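Rest API

REST: REpresentational State Transfer
URL: identifies the resource, such as an Index or a Document
Http Method: indicates the operation on the resource, such as GET to fetch, POST to update, PUT to create, DELETE to delete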
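As a rough illustration of how a URL and an HTTP method combine into one operation, the sketch below builds (but does not send) requests with Python's standard library. The base URL, the helper name build_request and the path /test_index/doc/1 are assumptions made for the example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local single-node cluster


def build_request(method, path, body=None):
    """Build, without sending, an HTTP request for an es REST endpoint."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


# The URL names the resource; the method names the operation on it.
create = build_request("PUT", "/test_index/doc/1", {"username": "alfred", "age": 1})
fetch = build_request("GET", "/test_index/doc/1")
remove = build_request("DELETE", "/test_index/doc/1")

for req in (create, fetch, remove):
    print(req.get_method(), req.full_url)
```

Passing one of these objects to urllib.request.urlopen would actually send it, which requires a running node.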

Index API

es has a dedicated Index API for creating, updating and deleting index configuration

PUT /${index_name} : create an index
GET _cat/indices : list existing indices
DELETE /${index_name} : delete an index

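Document API

Create a document with a specified id

# When a document is created and the index does not exist, es automatically creates the index and type
# request
# index_name/type/id
PUT /test_index/doc/1
{
    "username": "alfred",
    "age": 1
}
# response
{
    "_index": "test_index",
    "_type": "doc",
    "_id": "1",
    "_version": 1,  # incremented by 1 on every operation that changes the document; part of the locking mechanism
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 0,
    "_primary_term": 1
}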

Create a document without specifying an id

# request
POST /test_index/doc
{
    "username": "tom",
    "age": 20
}
# response
{
    "_index": "test_index",
    "_type": "doc",
    "_id": "Mj-H2ABSmWv7ZHR8Oa3",  # generated automatically
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 0,
    "_primary_term": 1
}

Get a document by id

# request
# index_name/type/id
GET /test_index/doc/1
# 200 response
{
    "_index": "test_index",
    "_type": "doc",
    "_id": "1",
    "_version": 1,
    "found": true,
    "_source": {  # original document data
        "username": "alfred",
        "age": 1
    }
}
# 404 response
{
    "_index": "test_index",
    "_type": "doc",
    "_id": "2",  # an id that does not exist
    "found": false
}

Search documents

# request
# use _search and send the query as JSON in the http body
GET /test_index/doc/_search
{
    "query": {
        "term": {  # match the document whose id is 1
            "_id": "1"
        }
    }
}
# response
{
    "took": 0,  # query time in ms
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,  # total number of matching documents
        "max_score": 1,
        "hits": [  # array of matching documents with their details; the first 10 by default
            {
                "_index": "test_index",
                "_type": "doc",
                "_id": "1",
                "_version": 1,
                "_score": 1,  # document score
                "_source": {  # original document data
                    "username": "alfred",
                    "age": 1
                }
            },
            {
                ...
            }
        ]
    }
}

Bulk-write documents

es can create multiple documents in one request, which reduces network overhead and improves write throughput

# request
POST _bulk
# supported action types:
# index: create a document, overwriting it if it already exists
# create: create a document, failing if it already exists
# update: update a document
# delete: delete a document
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"username":"alfred","age":10}
{"delete":{"_index":"test_index","_type":"doc","_id":1}}
{"update":{"_id":"2","_index":"test_index","_type":"doc"}}
{"doc":{"age":"20"}}
# response
{
    "took": 33,  # time taken in ms
    "errors": false,
    "items": [  # result of each bulk operation
        {
            "index": {
                "_index": "test_index",
                "_type": "doc",
                "_id": "3",
                "_version": 1,
                "result": "created",
                "_shards": {
                    "total": 2,
                    "successful": 1,
                    "failed": 0
                },
                "_seq_no": 0,
                "_primary_term": 1,
                "status": 201
            }
        },
        {
            "delete": {
                "_index": "test_index",
                "_type": "doc",
                "_id": "1",
                "_version": 2,
                "result": "deleted",
                "_shards": {
                    "total": 2,
                    "successful": 1,
                    "failed": 0
                },
                "_seq_no": 0,
                "_primary_term": 1,
                "status": 200
            }
        },
        {
            "update": {
                "_index": "test_index",
                "_type": "doc",
                "_id": "2",
                "_version": 2,
                "result": "updated",
                "_shards": {
                    "total": 2,
                    "successful": 1,
                    "failed": 0
                },
                "_seq_no": 0,
                "_primary_term": 1,
                "status": 200
            }
        }
    ]
}
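The bulk body is newline-delimited JSON: one action line, optionally followed by one source line, each on a single line. A minimal sketch of assembling such a body in Python follows; the helper name build_bulk_body and the tuple layout are assumptions made for illustration, and nothing is sent to a cluster.

```python
import json


def build_bulk_body(operations):
    """Serialize (action, metadata, source) triples into a bulk request body."""
    lines = []
    for action, metadata, source in operations:
        lines.append(json.dumps({action: metadata}))
        if source is not None:  # delete has no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body has to end with a newline


operations = [
    ("index", {"_index": "test_index", "_type": "doc", "_id": "3"}, {"username": "alfred", "age": 10}),
    ("delete", {"_index": "test_index", "_type": "doc", "_id": "1"}, None),
    ("update", {"_index": "test_index", "_type": "doc", "_id": "2"}, {"doc": {"age": "20"}}),
]

body = build_bulk_body(operations)
print(body, end="")
```

Keeping each JSON object on its own line matters here, because es uses the line breaks to separate actions from sources.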

Bulk-get documents

# request
GET /_mget
{
    "docs": [
        {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1"
        },
        {
            "_index": "test_index",
            "_type": "doc",
            "_id": 2
        }
    ]
}
# response
{
    "docs": [
        {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "found": false  # not found
        },
        {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_version": 2,
            "found": true,
            "_source": {
                "username": "lee",
                "age": "20"
            }
        }
    ]
}

Analyze API

es provides an api for testing analysis so that you can verify how text is tokenized; the endpoint is _analyze

Test with a specified analyzer

# request
POST _analyze
{
    "analyzer": "standard",  # analyzer
    "text": "hello world!"  # text to analyze
}
# response
{
    "tokens": [
        {
            "token": "hello",  # resulting token
            "start_offset": 0,  # start offset
            "end_offset": 5,  # end offset
            "type": "<ALPHANUM>",
            "position": 0  # token position
        },
        {
            "token": "world",
            "start_offset": 6,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}
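To make the fields of the response concrete, here is a toy Python approximation that splits on runs of word characters and lowercases each token. It is only a sketch for plain ASCII text: the real standard analyzer follows Unicode text segmentation rules, and the function name analyze is an assumption.

```python
import re


def analyze(text):
    """Toy analyzer: one token per run of word characters, lowercased."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "position": position,           # ordinal of the token in the text
        })
    return tokens


for token in analyze("hello world!"):
    print(token)
```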

Test with a field of an index

# request
POST test_index/_analyze
{
    "field": "username",  # field to test
    "text": "hello world!"  # text to analyze
}

Test with a custom analyzer

# request
POST _analyze
{
    "tokenizer": "standard",
    "filter": ["lowercase"],  # custom analyzer
    "text": "Hello World!"
}

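Common Elasticsearch Terms

Document
- a unit of document data, equivalent to a row in mysql
Index
- all Documents are stored in an Index
- consists of a list of documents that share the same fields
- equivalent to a table in mysql
Type: the data type within an index; currently an index may have only one Type, and the Type concept may be removed later
Node
- a running es instance; the building block of a cluster
Cluster
- consists of one or more nodes and serves requests
Field: a property of a document
Query DSL: the query syntax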

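Document

A Json Object made up of fields (Field). Common data types:
- string: text, keyword
- numeric: long, integer, short, byte, double, float, half_float, scaled_float
- boolean: boolean
- date: date
- binary: binary
- range: integer_range, float_range, long_range, double_range, date_range
Every document has a unique id
- specified by the user, or generated automatically by es
Metadata, used to annotate information about the document (Document MetaData)
- _index: name of the index the document belongs to
- _type: name of the type the document belongs to
- _id: unique document id
- _uid: composite id made up of _type and _id (in 6.x _type no longer has any effect, so it is the same as _id)
- _source: the original Json data of the document; the content of every field can be read from here
- _all: combines the content of all fields into this one field; disabled by default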

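Index

Analogous to a table in mysql
An index stores documents (Document) that share the same structure
- every index has its own mapping, which defines field names and types
A cluster can hold many indices, for example:
- nginx-log-2017-01-01
- nginx-log-2017-01-02
- nginx-log-2017-01-03
- nginx logs can be stored in one index per day, named by date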

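Mapping

Similar to a table schema in a database:

Defines the field names under an Index
Defines the field types, such as numeric, string or boolean
Defines inverted-index settings, such as whether a field is indexed and whether positions are recorded

# request
GET /test_index/_mapping
# response
{
    "test_index": {  # index
        "mappings": {
            "doc": {  # type
                "properties": {
                    "age": {
                        "type": "integer"
                    },
                    "username": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}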

Custom Mapping

# request
PUT my_index
{
    "mappings": {  # mappings keyword
        "doc": {  # type
            "properties": {
                "title": {
                    "type": "text"
                },
                "name": {
                    "type": "keyword"
                },
                "age": {
                    "type": "integer"
                }
            }
        }
    }
}
# response
{
    "acknowledged": true,
    "shards_acknowledged": true,
    "index": "my_index"
}

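Once a field's type is set it cannot be changed directly, because the inverted index built by Lucene cannot be modified after it is generated
- to change a type, create a new index and then run a reindex operation
New fields may be added
The dynamic parameter controls how new fields are added
- true (default): new fields are added automatically
- false: new fields are not added to the mapping; the document is still written, but the new fields cannot be queried
- strict: the document is rejected with an error

# request
PUT my_index
{
    "mappings": {
        "my_type": {
            "dynamic": false,
            "properties": {
                "user": {
                    "properties": {
                        "name": {
                            "type": "text"
                        },
                        "social_networks": {
                            "dynamic": true,
                            "properties": {}
                        }
                    }
                }
            }
        }
    }
}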
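The three dynamic settings can be mimicked with a small Python model. This is a deliberately simplified sketch: the function name apply_dynamic is hypothetical, the setting is passed as a string, and only top-level fields are considered.

```python
def apply_dynamic(mapping_fields, document, dynamic="true"):
    """Return (stored_document, searchable_fields) under a simplified dynamic setting."""
    fields = set(mapping_fields)
    unknown = [name for name in document if name not in fields]
    if dynamic == "strict" and unknown:
        # strict: the whole document is rejected
        raise ValueError(f"strict mapping rejects new fields: {unknown}")
    if dynamic == "true":
        fields.update(unknown)  # new fields are added to the mapping
    # with "false" the document is still stored, but unknown fields stay unsearchable
    return dict(document), fields


doc = {"name": "alfred", "nickname": "al"}
for setting in ("true", "false"):
    stored, searchable = apply_dynamic({"name"}, doc, dynamic=setting)
    print(setting, sorted(stored), sorted(searchable))
```

With "strict", the same call raises ValueError instead of returning.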

copy_to

Copies the value of a field to a target field, giving an effect similar to _all
The target field does not appear in _source; it is used only for searching

PUT my_index
{
    "mappings": {
        "doc": {
            "properties": {
                "first_name": {
                    "type": "text",
                    "copy_to": "full_name"
                },
                "last_name": {
                    "type": "text",
                    "copy_to": "full_name"
                },
                "full_name": {
                    "type": "text"
                }
            }
        }
    }
}

PUT my_index/doc/1
{
    "first_name": "John",
    "last_name": "Smith"
}

GET my_index/_search
{
    "query": {
        "match": {
            "full_name": {
                "query": "John Smith",
                "operator": "and"
            }
        }
    }
}

index

Controls whether the field is indexed. The default is true, meaning the field is indexed; false means it is not indexed and therefore cannot be searched

# request
PUT my_index
{
    "mappings": {
        "doc": {
            "properties": {
                "cookie": {
                    "type": "text",
                    "index": false
                }
            }
        }
    }
}

PUT my_index/doc/1
{
    "cookie": "name=alfred"
}

GET my_index/_search
{
    "query": {
        "match": {
            "cookie": "name"
        }
    }
}
# response
{
    "error": {
        "root_cause": [
            ......
            "index": "my_index",
            "caused_by": {
                "type": "illegal_argument_exception",
                "reason": "Cannot search on field [cookie] since it is not indexed"
            }
        ]
    },
    "status": 400
}

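index_options controls what the inverted index records. There are 4 settings:
- docs: records only the doc id
- freqs: records the doc id and term frequencies
- positions: records the doc id, term frequencies and term positions
- offsets: records the doc id, term frequencies, term positions and character offsets
text fields default to positions; all other types default to docs
The more that is recorded, the more space is used

# request
PUT my_index
{
    "mappings": {
        "doc": {
            "properties": {
                "cookie": {
                    "type": "text",
                    "index_options": "offsets"
                }
            }
        }
    }
}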
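To see what each index_options level adds, the following Python sketch builds a toy inverted index and then keeps only the data that the chosen level would record. It is an illustration under simplifying assumptions (a regex tokenizer, in-memory dicts, the hypothetical name build_index), not how Lucene stores postings.

```python
import re
from collections import defaultdict

# What each level keeps per (term, document) posting.
KEPT = {
    "docs": [],
    "freqs": ["freq"],
    "positions": ["freq", "positions"],
    "offsets": ["freq", "positions", "offsets"],
}


def build_index(docs, index_options="positions"):
    """Build a toy inverted index: term -> {doc_id: posting}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, match in enumerate(re.finditer(r"\w+", text.lower())):
            posting = index[match.group()].setdefault(
                doc_id, {"freq": 0, "positions": [], "offsets": []}
            )
            posting["freq"] += 1
            posting["positions"].append(position)
            posting["offsets"].append((match.start(), match.end()))
    keep = KEPT[index_options]
    return {
        term: {doc_id: {key: posting[key] for key in keep} for doc_id, posting in postings.items()}
        for term, postings in index.items()
    }


docs = {1: "name=alfred name", 2: "alfred"}
for level in KEPT:
    print(level, build_index(docs, level)["name"])
```

Each level is a superset of the previous one, which is why recording more costs more space.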

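null_value
- How a null value in the field is handled. The default is null, i.e. an empty value, which es ignores. Setting null_value gives the field a default value instead.

# request
PUT my_index
{
    "mappings": {
        "my_type": {
            "properties": {
                "status_code": {
                    "type": "keyword",
                    "null_value": "NULL"
                }
            }
        }
    }
}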

Core data types
- string: text, keyword
- numeric: long, integer, short, byte, double, float, half_float, scaled_float
- date: date
- boolean: boolean
- binary: binary
- range: integer_range, float_range, long_range, double_range, date_range
Complex data types
- array: array
- object: object
- nested: nested object
Geo data types
- geo_point
- geo_shape
Specialised data types
- ip: stores ip addresses
- completion: provides auto-completion
- token_count: stores the number of tokens
- murmur3: stores a hash of the string
- percolator

Multi-fields

Allows different settings, such as different analyzers, to be applied to the same field. For example, to support pinyin search on a person's name, just add a sub-field named pinyin to the name field

# mapping
{
    "test_index": {
        "mappings": {
            "doc": {
                "properties": {
                    "username": {
                        "type": "text",
                        "fields": {
                            "pinyin": {
                                "type": "text",
                                "analyzer": "pinyin"
                            }
                        }
                    }
                }
            }
        }
    }
}

GET test_index/_search
{
    "query": {
        "match": {
            "username.pinyin": "hanhan"
        }
    }
}

Dynamic Mapping

es can detect the types of document fields automatically, which lowers the cost of getting started. For example:

# request
PUT /test_index/doc/1
{
    "username": "alfred",
    "age": 1
}

GET /test_index/_mapping
# response
# es detects age as long and username as text
{
    "test_index": {
        "mappings": {
            "doc": {
                "properties": {
                    "age": {
                        "type": "long"
                    },
                    "username": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    }
                }
            }
        }
    }
}

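es detects field types from the types of the JSON document's fields. The supported types are:

JSON type	es type
null	ignored
boolean	boolean
floating-point number	float
integer	long
object	object
array	determined by the type of the first non-null value
string	date if it matches a date format (on by default); float or long if it matches a number (off by default); otherwise text with a keyword sub-field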

# request
PUT /test_index/doc/1
{
    "username": "alfred",
    "age": 14,
    "birth": "1988-10-10",
    "married": false,
    "year": "18",
    "tags": ["boy", "fashion"],
    "money": 100.1
}

GET /test_index/_mapping
# response
{
    "test_index": {
        "mappings": {
            "doc": {
                "properties": {
                    "age": {
                        "type": "long"
                    },
                    "birth": {
                        "type": "date"
                    },
                    "married": {
                        "type": "boolean"
                    },
                    "money": {
                        "type": "float"
                    },
                    "tags": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "username": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "year": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    }
                }
            }
        }
    }
}

The date formats used for automatic date detection can be configured to suit different needs
- the default is ["strict_date_optional_time", "yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"]
- strict_date_optional_time is the ISO datetime format; its full form looks like this:
- YYYY-MM-DDThh:mm:ssTZD (e.g. 1997-07-16T19:20:30+01:00)
- dynamic_date_formats sets custom date formats
- date_detection turns automatic date detection off

# request
PUT my_index
{
    "mappings": {
        "my_type": {
            "dynamic_date_formats": ["MM/dd/yyyy"]
        }
    }
}

PUT my_index/my_type/1
{
    "create_date": "09/25/2015"
}

# turn off automatic date detection
PUT my_index
{
    "mappings": {
        "my_type": {
            "date_detection": false
        }
    }
}

A string that contains a number is not detected as a numeric type by default, because numbers appearing in strings are perfectly legitimate
- numeric_detection turns on detection of numbers in strings, as follows:

# request
PUT my_index
{
    "mappings": {
        "my_type": {
            "numeric_detection": true
        }
    }
}

PUT my_index/my_type/1
{
    "my_float": "1.0",
    "my_integer": "1"
}

GET my_index/_mapping
# response
{
    "my_index": {
        "mappings": {
            "my_type": {
                "numeric_detection": true,
                "properties": {
                    "my_float": {
                        "type": "float"
                    },
                    "my_integer": {
                        "type": "long"
                    }
                }
            }
        }
    }
}
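The detection rules above can be summarised in a short Python function. This is a sketch under stated assumptions: the date check is a rough regex stand-in for strict_date_optional_time rather than the real parser, and the name infer_type and its keyword arguments are made up for the example.

```python
import re

# Rough stand-in for strict_date_optional_time; an assumption, not the real parser.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}.*)?$")


def infer_type(value, date_detection=True, numeric_detection=False):
    """Guess the es field type for a parsed JSON value; None means 'ignored'."""
    if value is None:
        return None
    if isinstance(value, bool):  # checked before int because bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        for item in value:  # the first non-null element decides
            if item is not None:
                return infer_type(item, date_detection, numeric_detection)
        return None
    if isinstance(value, str):
        if date_detection and ISO_DATE.match(value):
            return "date"
        if numeric_detection:
            for cast, es_type in ((int, "long"), (float, "float")):
                try:
                    cast(value)
                    return es_type
                except ValueError:
                    pass
        return "text"  # es also adds a keyword sub-field
    raise TypeError(f"unsupported JSON value: {value!r}")


doc = {
    "username": "alfred", "age": 14, "birth": "1988-10-10", "married": False,
    "year": "18", "tags": ["boy", "fashion"], "money": 100.1,
}
mapping = {name: infer_type(value) for name, value in doc.items()}
print(mapping)
```

Calling infer_type("1", numeric_detection=True) shows the effect of turning numeric detection on for a single value.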

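Dynamic Templates

Allow field types to be set dynamically based on the data type es detects, the field name and so on. This makes the following possible:
- map all string fields to keyword, i.e. not analyzed by default
- map all fields whose names start with message to text, i.e. analyzed
- map all fields whose names start with long_ to long
- map all fields detected as double to float, to save space

# request
PUT test_index
{
    "mappings": {
        "doc": {
            "dynamic_templates": [  # array; multiple matching rules can be given
                {
                    "strings": {  # template name
                        "match_mapping_type": "string",  # matching rule
                        "mapping": {  # mapping to apply
                            "type": "keyword"
                        }
                    }
                }
            ]
        }
    }
}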

Matching rule parameters

match_mapping_type: matches the field type detected by es, such as boolean, long or string
match/unmatch: matches the field name
path_match/path_unmatch: matches the path; used for the inner fields of object fields

# map strings to keyword by default
# by default es maps strings to text and adds a keyword sub-field
# request
PUT test_index
{
    "mappings": {
        "doc": {
            "dynamic_templates": [
                {
                    "strings_as_keywords": {
                        "match_mapping_type": "string",
                        "mapping": {
                            "type": "keyword"
                        }
                    }
                }
            ]
        }
    }
}

# map fields whose names start with message to text
# request
PUT test_index
{
    "mappings": {
        "doc": {
            "dynamic_templates": [
                {
                    "message_as_text": {
                        "match_mapping_type": "string",
                        "match": "message*",
                        "mapping": {
                            "type": "text"
                        }
                    }
                }
            ]
        }
    }
}

# map double to float to save space
# request
PUT test_index
{
    "mappings": {
        "doc": {
            "dynamic_templates": [
                {
                    "double_as_float": {
                        "match_mapping_type": "double",
                        "mapping": {
                            "type": "float"
                        }
                    }
                }
            ]
        }
    }
}
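Templates are tried in order and the first one whose rules all match wins. The sketch below models that selection in Python, using fnmatch as an approximation of the wildcard matching on field names; the function name pick_mapping is an assumption, and path_match/path_unmatch are left out for brevity.

```python
from fnmatch import fnmatchcase


def pick_mapping(templates, field_name, detected_type):
    """Return the mapping of the first template whose rules all match, else None."""
    for template in templates:
        for _name, rule in template.items():
            if "match_mapping_type" in rule and rule["match_mapping_type"] != detected_type:
                continue
            if "match" in rule and not fnmatchcase(field_name, rule["match"]):
                continue
            if "unmatch" in rule and fnmatchcase(field_name, rule["unmatch"]):
                continue
            return rule["mapping"]
    return None  # fall back to the default dynamic mapping


templates = [
    {"message_as_text": {"match_mapping_type": "string", "match": "message*", "mapping": {"type": "text"}}},
    {"strings_as_keywords": {"match_mapping_type": "string", "mapping": {"type": "keyword"}}},
    {"double_as_float": {"match_mapping_type": "double", "mapping": {"type": "float"}}},
]

for field, detected in [("message_body", "string"), ("username", "string"), ("price", "double")]:
    print(field, pick_mapping(templates, field, detected))
```

Because order matters, the more specific message* template is listed before the catch-all string template.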

Elasticsearch CRUD

Create

# request  /{Index}/{Type}/{id}
POST /accounts/person/1
{
    "name": "John",
    "lastname": "Doe",
    "job_description": "Systems administrator and Linux specialit"
}
# response
{
    "_index": "accounts",
    "_type": "person",
    "_id": "1",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "created": true
}

Read

Unlike Create, Read uses GET

# request  /{Index}/{Type}/{id}
GET /accounts/person/1
# response
{
    "_index": "accounts",
    "_type": "person",
    "_id": "1",
    "_version": 1,
    "found": true,
    "_source": {
        "name": "John",
        "lastname": "Doe",
        "job_description": "Systems administrator and Linux specialit"
    }
}

Update

# request
POST /accounts/person/1/_update
{
    "doc": {
        "job_description": "Systems administrator and Linux specialist"
    }
}
# response
{
    "_index": "accounts",
    "_type": "person",
    "_id": "1",
    "_version": 2,
    "result": "updated",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    }
}

Delete

# request
# delete one document
DELETE /accounts/person/1
# delete the whole index
DELETE /accounts
# response (to the document delete)
{
    "found": true,
    "_index": "accounts",
    "_type": "person",
    "_id": "1",
    "_version": 3,
    "result": "deleted",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    }
}

Elasticsearch Query

Query String

# request
GET /accounts/person/_search?q=john

Query DSL

# request
GET /accounts/person/_search
{
    "query": {
        "match": {
            "name": "john"
        }
    }
}
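The two query styles differ only in where the query travels: in the URL or in the JSON body. A small Python sketch of building each form is shown below; the helper names query_string_url and match_query are hypothetical, and no request is sent.

```python
import json
from urllib.parse import urlencode


def query_string_url(index, doc_type, q):
    """Query String style: the query goes into the q URL parameter."""
    return f"/{index}/{doc_type}/_search?" + urlencode({"q": q})


def match_query(field, text):
    """Query DSL style: the query is a JSON body sent to _search."""
    return {"query": {"match": {field: text}}}


print(query_string_url("accounts", "person", "john"))
print(json.dumps(match_query("name", "john"), indent=4))
```

urlencode takes care of escaping, so a query containing spaces or other reserved characters stays a valid URL.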

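Elasticsearch Ingest Node

Because filebeat lacks data transformation capabilities, Elastic added a node type, the Elasticsearch Ingest Node, to fill the gap: it transforms data before it is written to es

pipeline api

Filter Plugin - dissect

Parses data by splitting on delimiters, which avoids the heavy cpu cost of grok parsing

%{clientip} %{ident} %{auth} [%{timestamp}] "%{request}" %{response} %{bytes} "%{referrer}" "%{agent}"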
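The delimiter-based idea can be sketched in a few lines of Python: the literal text between the %{...} placeholders is used to cut the line, so no regular expression is run against the log line itself. This is only an illustration of the principle, not the plugin's implementation; the function name dissect, the sample log line and the lack of support for modifiers are all assumptions of the sketch.

```python
import re


def dissect(pattern, line):
    """Split line on the literal delimiters found between %{field} placeholders."""
    parts = re.split(r"%\{(\w+)\}", pattern)  # [prefix, field, literal, field, literal, ...]
    prefix, rest = parts[0], parts[1:]
    if not line.startswith(prefix):
        return None
    pos = len(prefix)
    result = {}
    for i in range(0, len(rest), 2):
        name, delimiter = rest[i], rest[i + 1]
        if delimiter:
            end = line.find(delimiter, pos)  # plain substring search, no backtracking
            if end == -1:
                return None
        else:
            end = len(line)  # a field with no trailing literal takes the rest
        result[name] = line[pos:end]
        pos = end + len(delimiter)
    return result


PATTERN = '%{clientip} %{ident} %{auth} [%{timestamp}] "%{request}" %{response} %{bytes} "%{referrer}" "%{agent}"'
LINE = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://example.com/" "Mozilla/5.0"'

fields = dissect(PATTERN, LINE)
for name, value in fields.items():
    print(name, "=", value)
```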

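Built-in analyzers

standard

The default analyzer

tokenizer:

standard

token filters:

standard
lower case

Splits text into words; supports multiple languages

simple

tokenizer:

lower case

Splits on non-letter characters

whitespace

tokenizer:

whitespace

Splits on whitespace

stop

Removes stop words, i.e. function words and modifiers such as the, an, 的, 这

tokenizer:

lower case

token filters:

stop

Adds stop word handling on top of simple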
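The relationship between the simple and stop analyzers can be approximated in Python as follows. The stop word set here is a tiny illustrative subset chosen for the example, and both function names are assumptions.

```python
import re

STOP_WORDS = {"the", "an", "a", "is"}  # illustrative subset, not the built-in list


def simple_analyze(text):
    """Split on anything that is not a letter and lowercase the pieces."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def stop_analyze(text, stop_words=STOP_WORDS):
    """Same as simple_analyze, then drop stop words."""
    return [token for token in simple_analyze(text) if token not in stop_words]


text = "The quick brown-fox is 2 fast"
print(simple_analyze(text))
print(stop_analyze(text))
```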

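keyword

tokenizer: keyword

Does not tokenize; outputs the whole input as a single term

pattern

tokenizer:

pattern

token filters:

lower case

Uses a regular expression to define the delimiters
The default is \W+, i.e. non-word characters act as delimiters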
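A minimal Python sketch of the same idea: split on a regular expression, drop empty pieces, and lowercase what is left. The function name pattern_analyze is an assumption, and Python's \W is only an approximation of the Java regex class es uses.

```python
import re


def pattern_analyze(text, pattern=r"\W+"):
    """Split text on the delimiter pattern and lowercase each non-empty piece."""
    return [piece.lower() for piece in re.split(pattern, text) if piece]


print(pattern_analyze("Hello, World! foo_bar"))
print(pattern_analyze("A,B,C", pattern=","))
```

Note that the underscore counts as a word character, so foo_bar stays in one piece under the default pattern.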

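language

Provides analyzers for 30+ common languages

Chinese word segmentation

IK

Segments both Chinese and English words; supports modes such as ik_smart and ik_max_word
Supports custom dictionaries and hot updates of the segmentation dictionary

jieba

The most popular word segmentation system in python; supports segmentation and part-of-speech tagging
Supports traditional Chinese segmentation, custom dictionaries, parallel segmentation and more

Hanlp

A java toolkit made up of a series of models and algorithms

THULAC

A Chinese lexical analysis toolkit developed by the Natural Language Processing and Computational Social Science Lab at Tsinghua University, providing Chinese word segmentation and part-of-speech tagging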

Elasticsearch Study Notes - zmh009
