6

Resolving Ambiguity in Text Rewriting

 3 years ago
source link: https://deniskyashif.com/2020/02/22/resolving-ambiguity-in-text-rewriting/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

February 22, 2020 |  Edit

Resolving Ambiguity in Text Rewriting

Text rewriting is the process of replacing parts of a text that matches some rewrite rules with other text. In some cases, however, these parts might overlap so we have to come up with a way to decide which one is to be replaced. In this article, we’ll discuss several conflict resolution strategies and describe their implementations from a formal standpoint.
This article builds on top of Finite-State Transducers for Text-Rewriting so check it out first, before proceeding with this one.

Ambiguity

Consider the following rewrite rule and suppose we have a finite-state transducer for obligatory replacement based on it.

R1=ab|bc→x

and the following input text.

aabcb

If we don’t specify a conflict resolution strategy, the obligatory rewrite transducer for R1 will result in two distinct translations because the matches “ab” and “bc” clearly overlap.

a⋅ab⋅cb⟼R(TR1)axcbaa⋅bc⋅b⟼R(TR1)aaxb

Let’s see another example. A rewrite rule

R2=aa∗b|aa→x

and an input text

aaaaabbaa

that results in even more valid decompositions.

aaaaab⋅b⋅aa⟼R(TR2)xbxaa⋅aaab⋅b⋅aa⟼R(TR2)xxbxaa⋅aa⋅ab⋅b⋅aa⟼R(TR2)xxxbx

A proper conflict resolution strategy would ensure that for each input text, we’ll get a single output following this strategy.

Definitions

Let’s provide some formal definitions that will make it easier for us to reason about the problem.

Given two strings u,v∈Σ∗ we write u<v if u is a prefix of v and u≤v if u is a proper prefix of v.

An infix occurence of an string v in text t is a triple ⟨u,v,w⟩ such that t=u⋅v⋅w. For example in text t=aabcb,⟨a,ab,cb⟩ and ⟨aa,bc,b⟩ infix occurrences of the strings “ab” and “bc” respectively.

Two infix occurrences ⟨u1,v1,w1⟩ and ⟨u2,v2,w2⟩ of a text overlap if u1<u2<u1⋅v1.

infix overlap

Figure 1: Overlapping infix occurrences

A set of infix occurrences A of the text t is said to be non-overlapping if two distinct infix occurrences of A never overlap. Let’s see an example.

t=aabbbccabA={⟨a,ab,bbccab⟩,⟨aabb,bc,cab⟩,⟨aabbbcc,ab,ϵ⟩}

Leftmost-Longest Match

The leftmost-longest match resolution strategy consists of two rules:

  1. For two overlapping texts, we pick the one that starts first in the natural (left-to-right) reading order.
  2. For distinct matches starting at the same place, we pick the longest one.

By applying the leftmost-longest strategy in the example above, we get rid of the ambiguity and, thus, end up with the following translations:

aabcb→RLML(TR1)axcbaaaaabbaa→RLML(TR2)xbx

Formal Description

We’ve developed an intuition for this strategy, let’s define it in formal terms so we can be precise. Given two sets of infix occurrences A,B, we define the following functions:

AFTER(A,B)={⟨u,v,w⟩∈A|∀⟨u′,v′,w′⟩∈B:u′⋅v′≤u∧u′<u}

AFTER selects from all infix occurrences in A the ones where the middle component v starts after all middle components of the set B.

LEFTMOST(A)={⟨u,v,w⟩∈A|∀⟨u′,v′,w′⟩∈A:u≤u′}

LEFTMOST selects the infix occurrences where the middle components have the leftmost position. Note that we might have multiple such components that start from the same position.

LONGEST(A)={⟨u,v,w⟩∈A|∀⟨u′,v′,w′⟩∈A:u≠u′∨v′≤v}

From the infix occurrences with a middle component starting from the same position, LONGEST selects the ones with the longest middle part.

Now we can define the function LML(A) which given a set of infix occurrences is going to return its subset consisting of the leftmost-longest ones.

LML(A):=∞⋃i=0Vi

Where the intermediate sets Vi are defined inductively.

V0=∅Vi+1=Vi∪LONGEST(LEFTMOST(AFTER(A,Vi)))

LML(A) is always finite because A is finite. That is, after a certain induction step i,Vi will remain unchanged and that’s when we break from the loop. At each step, we add at most one element to the resulting set.

Note that also all the infix occurrences in LML(A) are non-overlapping which leads to another useful property, that is, given a rewrite function T:Σ+→Σ∗ the obligatory rewrite relation RLML(T) under the leftmost-longest match strategy is also a function. That is, because our replacement candidates don’t overlap, we rewrite distinct regions in the input text. The original rewrite transducer representing a function means that we do not have the same rule mapping to more than one distinct outputs (e.g. ab→x,ab→y). This guarantees that given any input text, the obligatory rewrite transducer for _leftmost-longest_ match TRLML(T) is always going to produce a single output.

Other Conflict Resolution Strategies

Now we’ve defined the leftmost-longest match, it is trivial to modify the algorithm to obtain different replacement strategies. For example, for leftmost-shortest, we can replace the LONGEST function with

SHORTEST(A)={⟨u,v,w⟩∈A|∀⟨u′,v′,w′⟩∈A:u≠u′∨v≤v′}

We can also pick the rightmost out of several overlapping infix occurrences by replacing LEFTMOST with

RIGHTMOST(A)={⟨u,v,w⟩∈A|∀⟨u′,v′,w′⟩∈A:u′≤u}

By combining those functions, we can define any strategy we want, however, from now on, we’ll focus on the leftmost-longest as it is more applicable in practice. Most of the time, the reading order is from left to right and longer occurrences carry in general more information than the shorter ones.

Conclusion

In this article, we discussed the problem of having overlapping replacement candidates in text rewriting. We defined the leftmost-longest match strategy for resolving such conflicts and learned how to implement it from a formal standpoint. By performing small modifications, we saw how to formally define other kinds of strategies like the leftmost-shortest, rightmost-longest, etc.

The problem for implementing a finite-state transducer, corresponding to the rewrite relation RLML is already well explored. In the references you can find a few papers providing implementations which differ in complexity and performance.

Part 1: Finite-State Transducers for Text Rewriting

References and Further Reading


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK