6

kdd&kdi&kdm

 3 years ago
source link: https://www.amazingkoala.com.cn/Lucene_Document/IndexFile/2020/1104/175.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

kdd&kdi&kdm(Lucene 8.6.0)

  The data of Point is encoded in a block KD-tree structure described with three index files:

  • A .kdm file that records metadata about the fields
  • A .kdi file that stores inner nodes of the tree
  • A .kdd file that stores leaf nodes of the tree

  The data structure of .kdd file as below:

Fig.1:

1.png

  LeafNodeData describes the informations of leaf nodes, You can see there is only leaf node data in .kdd file and the default maximum number of point in each LeafNodeData is 512. let's see the details about LeafNodeData.

LeafNodeData

Fig.2:

2.png

Count

   Count describes the number of point in a LeafNodeData.

DocIds

   DocIds describes a set of document IDs which a point in a LeafNodeData belongs to.

PointValues

   PointValues describes a set of point values, each point value is split into two parts: CommonPrefixes and BlockPackedValues.

CommonPrefixes

   CommonPrefixes describes a set of common prefixes with each dimension.

Fig.3:

3.png

Length、Value

   Value describes the common prefix value with byte and Length means the length of common prefix value.

BlockPackedValues

   BlockPackedValues describes the suffix of point value. BlockPackedValues has two types of data structure according to the suffix. Due to the length of this article, it is not appropriate to extend the introduction.

  The data structure of .kdi file as below:

Fig.4:

4.png

PackedIndexValue

  PackedIndexValue describes the informations of all inner nodes. inner node has four types according to the position in KD-tree and its sub node type.

  • RightNodeHasLeafChild: the inner node is the right node of its parent node and its child nodes are leaf node
  • RightNodeHasNotLeafChild: the inner node is the right node of its parent node and its child nodes are not leaf node
  • LeftNodeHasLeafChild: the inner node is the left node of its parent node and its child nodes are leaf node
  • LeftNodeHasNotLeafChild: the inner node is the left node of its parent node and its child nodes are not leaf node

  Root node is a inner node and it belongs to RightNodeHasNotLeafChild.

RightNodeHasNotLeafChild

Fig.5:

5.png

LeftLeafBlockFP

  LeftLeafBlockFP describes a pointer where the position of the leftmost leaf node of current inner node in .kdd file is.

  Code describes the metadata about inner node, such as spilt dimension or bytes per dimension(How many bytes each value in each dimension takes).

SplitValue

  In the stage of constructing a KD-tree, the set of points in a inner node will be split into two sub nodes according to SpiltValue

LeftNumBytes

  LeftNumBytes describes the total size of one inner node's sub tree.

RightNodeHasNotLeafChild

Fig.6:

6.png

  LeftNumBytes, Code and SplitValue describes the same meaning as above.

RightLeafBlockFP

  RightLeafBlockFP describes a pointer where the position of the right leaf node of current inner node in .kdd file is.

LeftNodeHasLeafChild

Fig.7:

7.png

  Code, SplitValue and RightLeafBlockFP describe the same meaning as above.

LeftNodeHasNotLeafChild

Fig.8:

8.png

  Code, SplitValue and LeftNumBytes describe the same meaning as above.

KD-tree and Index File

  Inner node in a KD-tree would be stored in .kid file as blow:

Fig.9:

9.png

High-definition pictures

  Leaf node in a KD-tree would be stored in .kdd file as blow:

Fig.10:

10.png

High-definition pictures

LeftNumBytes

Fig.11:

11.png

High-definition pictures

Fig.12:

12.png

High-definition pictures

  As above said, LeftNumBytes describes the total size of one inner node's sub tree, it makes a reader can skip to the position of inner node's right node in .kid file and its left node is just in the next position.

LeftLeafBlockFP RightLeafBlockFP

Fig.13:

13.png

High-definition pictures

Fig.14:

14.png

  With LeftLeafBlockFP and RightLeafBlockFP, leaf Node can be positioned in .kdd file.

High-definition pictures

  The data structure of .kdm file as below:

Fig.15:

15.png

  -1 is a fix padding value, In the stage of reading .kdm file, it as a mark means all the Fields has been readed.

KdiFileLength KddFileLength

  In the stage of reading index file, KdiFileLength and KddFileLength used to validate .kdi and .kdd files, such as truncated file or file too long.

Fig.16:

16.png

FieldNumber

  FieldNumber uniquely represents a Field.

Index

Fig.17:

17.png

NumDataDims

  NumDataDims describes how many dimensions we are storing at the leaf (data) nodes.

NumIndexDims

  NumIndexDims describes how many dimensions we are indexing in the internal nodes.

CountPerLeaf

  CountPerLeaf describes the maximum number of point in each leaf node.

BytesPerDim

  BytesPerDim describes the number of bytes per dimension.

NumLeaves

  NumLeaves describes the number of leaf node.

MinPackedValue MaxPackedValue

  MinPackedValue describes minimum value for each dimension and MaxPackedValue describes maximum value for each dimension values.

PointCount

  PointCount describes the total number of indexed points across all documents.

DocCount

  DocCount describes the total number of documents that have indexed at least one point.

Length

  Length describes the size of PackedIndexValue in Fig.4.

DataStartFP

  DataStartFP describes a pointer where the beginning position of current field data in .kdd file is.

IndexStartFP

  IndexStartFP describes a pointer where the beginning position of current field data in .kdi file is.

Fig.18:

18.png

Complete data structure

Fig.19:

19.png

Fig.20:

20.png

Fig.21:

21.png


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK