28

Building a Content-Based Search Engine VI: Efficient Query Processing

 5 years ago
source link: https://www.tuicool.com/articles/hit/bUriQzJ
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

The Multi-Step kNN Algorithm , proposed by Seidl et al. in [SK98], utilizes a lower bound for efficient kNN search by iteratively updating the pruning distance $\epsilon_{max}$ while scanning the database.  As shown in [SK98], this algorithm is optimal with respect to the number of performed computations of the utilized distance function $\delta$. Assuming that we have specified multiple lower bounds $\delta_{LB_1}, …,\delta_{LB_m}$, this algorithm reads as follows:

BFZrmqu.png!web

Multi-Step kNN Algorithm

The efficiency of this algorithm highly depends on the utilized lower bounds. A good lower bound should meet the ICES criteria defined by Assent et al. in [AWS06]: It should be indexable such that multidimensional indexing structures like X-Tree or R-Tree can be applied. Furthermore, it should be complete , i.e. no false drops occur, which is guaranteed by the lower-bounding property. Moreover, it should be efficient , i.e. its computational time complexity should be significantly lower than the complexity of the actual distance function. Finally, it should be selective , i.e. it should allow us to exclude as many objects as possible from the actual distance computation, which is achieved by approximating the actual distance as good as possible.

If the lower bounds have different time complexities and selectivities, then the ordering of the lower bounds plays a role for the efficiency of the algorithm. In each iteration of the outer loop, we skip the actual distance computation when one of the lower bounds exceeds the current pruning radius $\epsilon_{max}$. Hence, a reasonable heuristic in order to skip as early as possible is to sort the lower bounds in ascending order of their time complexities.

Lower bounds can be devised by exploiting the inner workings of the utilized distance function. There are, however, some lower bounds that are generic in nature, i.e. they are applicable to a wide variety of distance functions, as long as these distance functions fulfill certain properties. In the following, we present a generic lower bound for metric distance functions and lower bounds for the Earth Mover’s Distance.

Metric Lower Bound

If we are dealing with a metric distance function, we can exploit the fact that the distance fulfills the triangle inequality to devise a lower bound (cf. [ZADB06]). Given a query $q$, a database object $o$ and a set of so-called pivot objects $P$, it follows from the triangle inequality that

$$\delta_{LB-Metric}(q, o) = max_{p \in P} |\delta(q, p) – \delta(p, o)| \leq \delta(q, o)$$

The distances $\delta(p, o)$ between all pivot objects $p \in P$ and all database objects $o \in DB$ can be computed in advance and stored in a so-called pivot table of size $|P| \cdot |DB|$.  When processing a query, we only need to compute the distances $\delta(q, p)$ between the query and all pivot objects $p \in P$ and store those distances in a list. After that, $\delta_{LB-Metric}$ can be computed in $\mathcal{O}(|P|)$ efficiently by looking up the stored distance values. The selectivity of this approach is highly dependent on the choice of pivot objects $P$ and the distribution of the database objects and the query object.

Lower Bounds for the Earth Mover’s Distance

Rubner Lower Bound

As shown by Rubner et al. in [RTG00], when using a norm-induced ground distance, the Earth Mover’s Distance can be lower-bounded by the ground distance between the weighted means of the two signatures.

Let $X$ be a feature signature and $\delta : \mathbb{F} \times \mathbb{F} \rightarrow \mathbb{R}$ be a norm-induced ground distance. Then it holds that

$$Rubner(X, Y) = \delta(\overline{X}, \overline{Y}) \leq EMD_\delta(X, Y)$$

where $\overline{X}$ is the weighted mean of $X$:

$$\overline{X} = \frac{\sum_{f \in R_X} X(f) \cdot f}{\sum_{f \in R_X} X(f)}$$

Independent Minimization Lower Bound

Proposed in [UBSS14], the Independent Minimization Lower Bound for feature signatures (short: IM-Sig) is a lower bound for EMD that corresponds to the EMD when removing the Target constraint and replacing it with the IM-Sig Target constraint defined as follows:

$$\forall g \in R_X, h \in R_Y : f(g, h) \leq Y(h)$$

Intuitively, this modified target constraint allows to distribute the flow optimally for each representative $g \in R_X$ without considering whether the total flow coming into the target representatives exceeds their weights, as long as the flow from $g$ to $h$ does not exceed the weight $Y(h)$ for all target representatives $h \in R_Y$. We use $IMSig_\delta(X,Y)$ to denote the minimum cost flow with respect to the modified target constraint. Since the set of feasible flows for IM-Sig includes the set of feasible flows for EMD, it holds that $IMSig_\delta(X,Y) \leq EMD_\delta(X,Y)$.

An efficient algorithm for computing $IMSig_\delta(X,Y)$ is given in [UBSS14].

We have now learned all the necessary theory to build an efficient content-based multimedia search engine. In the next section, we will see how to glue all the components together into an actual application. You can either subscribe to deep ideas by Emailsubscribe to the Facebook page or follow on Twitter to stay updated.

References

[AWS06] Ira Assent, Marc Wichterich, and Thomas Seidl. Adaptable distance functions  for similarity-based multimedia retrieval . Datenbank-Spektrum, 19:23–31, 2006.

[BS13] Christian Beecks and Thomas Seidl. Distance based similarity models for content-based multimedia retrieval . PhD thesis, Aachen, 2013. Zsfassung in dt. und engl. Sprache; Aachen, Techn. Hochsch., Diss., 2013.

[RTG00] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval . International journal of computer vision, 40(2):99–121, 2000.

[SK98] Thomas Seidl and Hans-Peter Kriegel. Optimal multi-step k-nearest neighbor search . In: ACM SIGMOD Record, volume 27, pages 154–165. ACM, 1998.

[UBSS14] Merih Seran Uysal, Christian Beecks, Jochen Schmücking, and Thomas Seidl. Efficient filter approximation using the earth mover’s distance in very large multimedia databases with feature signatures . In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 979–988. ACM, 2014.

[ZADB06] Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, and Michal Batko. Similarity search: the metric space approach , volume 32. Springer Science & Business Media, 2006.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK