

Unzipping Zipf’s law
source link: https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0181987

Materials and methods
Zipf’s law follows from the interaction between syntax and semantics; neither of them is sufficient on its own. As for syntax, language makes use of different word classes to build sentences. Whereas these word classes, or parts of speech (POS), are used with a comparable overall frequency, they differ hugely in class size. For example, there are only three articles in English (the, a, an) but probably more than 10,000 nouns. Therefore, an article will be used more frequently than the average noun. Within word classes, some words are used more often than others because of their meaning. As thing is a more general noun than submarine (the set of objects the former can refer to in fact includes the referent set of the latter), it can be expected to be used more often. Words should not be too general, however, as this would lead to ambiguity. In order to become frequent (within a word class), a word should be specific enough to single out its referent in context and general enough to be applied to different referents.
For both of these observations there is independent and well-established evidence. In the next sections, it will first be shown how syntax and semantics can be modeled and that neither of them is sufficient to explain Zipf’s law on its own. Next, I will show that their interaction does produce a near-Zipfian distribution, deviating from the ideal only in the way natural language does.
Syntax
With the present availability of large language corpora that are annotated for POS, it is easy to show that word classes vary in size by orders of magnitude. For present purposes, the exact frequency with which each word class is used is irrelevant; the important point is that all natural languages make use of different word classes, and that the number of items in these classes differs enormously. Table 1 gives an overview of the major POS classes that are recognized in the Corpus of Spoken Dutch (CGN, 8.6M words; [8]), the Brown corpus (1.1M words; [9]), and the Hungarian National Corpus (HNC, 187M words, of which only the Hungarian-press subcorpus is used; [10]). All data used in this paper are available open-access through third parties; cf. S2 File for repositories. As can be seen, in each language the difference in overall class frequency is negligible in comparison with the difference in class size.
Sorted by average expected word frequency. For Hungarian, only the Hungarian-press subcorpus is used. POS abbreviations: ART article, PRO pronoun, CON connective, P adposition, INT interjection, ADV adverbial, NUM numeral, V verb, A adjective, N noun. [8–10].
If word class were the only factor at play, a Zipfian distribution would follow from sampling a number of items from each class that is proportional to the overall class frequency. For example, for Dutch, to simulate a corpus of 100 words, we should randomly draw (with replacement) six articles from a set of five, 18 pronouns from a set of 86, etc. (cf. Table 1). Fig 2 shows the results of this procedure. As can be seen, the different parts of speech, represented by the numbers in the plot (1 for articles, 2 for pronouns; the rest are illegible because of overlap), occupy frequency regions that seem to be of the right order of magnitude. But unlike in natural language, the different frequency bands do not line up. Also, the word classes form distinct groups, whereas in natural language, classes overlap (e.g. the most frequent N outranks the least frequent P by far). In sum, distinguishing between word classes does not suffice to explain Zipf’s law.
To generate these results, the class frequencies and class sizes reported for Dutch in Table 1 are used. Numbers correspond to word classes when ordered by expected frequency.
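This sampling procedure can be sketched as follows. The article and pronoun figures (6 per 100 words from a set of 5, 18 per 100 from a set of 86) follow the text; the remaining class frequencies and sizes are hypothetical stand-ins, since Table 1 is not reproduced here:

```python
import random
from collections import Counter

# (frequency per 100 words, class size); ART and PRO follow the text,
# the other classes are illustrative placeholders for Table 1.
word_classes = {
    "ART": (6, 5),
    "PRO": (18, 86),
    "P":   (10, 100),      # hypothetical
    "V":   (20, 5000),     # hypothetical
    "N":   (25, 50000),    # hypothetical
}

def simulate_corpus(n_words=100_000):
    """Draw items per class proportionally to overall class frequency.

    A larger corpus than the 100-word example is used for stable counts.
    """
    total = sum(freq for freq, _ in word_classes.values())
    counts = Counter()
    for pos, (freq, size) in word_classes.items():
        n_draws = round(n_words * freq / total)
        for _ in range(n_draws):
            # items within a class are drawn uniformly, with replacement
            counts[(pos, random.randrange(size))] += 1
    return counts

counts = simulate_corpus()
ranked = sorted(counts.values(), reverse=True)
# Items within a class cluster around the same frequency, so the
# rank-frequency curve is a staircase of frequency bands rather than
# the straight line through double-log space that Zipf's law predicts.
print(ranked[:5])
```

Each band in the resulting plot corresponds to one word class; the bands do not line up into a single Zipfian slope.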
Semantics
As pointed out above, in order to become frequent, a word should be specific enough to single out its referent in context and general enough to be applied to different referents [11]. A simple way of approximating the degree of specification is by determining the depth of embedding of a word in a word taxonomy such as WordNet [12], assuming that a word inherits all of the specifications of its parent including those that set it apart from its sisters. (Note that this is only used as an initial proxy to show how meaning specificity matters; meaning will be operationalized differently in the remainder.) In WordNet, meanings are organized in synonym sets, groups of words with approximately the same meaning. Various lexical relations are determined between these sets. For our purposes, the most important relation is the super–subordinate or is-a relation. For example, we find 17 subsequent superordinate sets for submarine, starting with submersible and submersible warship, but only two for thing, viz. physical entity ⊂ entity, the top node of the noun taxonomy. If we now look up the total frequency in the Brown corpus for all nouns in the two meaning sets, we find, not unexpectedly, that the latter is more frequent than the former (484 against 178 attestations; in fact, all 178 hits for submarine were due to the synonym sub, which is homonymous and whose frequency is due to its other meaning, substitute). (Note that this procedure does not distinguish within homonymic or polysemic sets, which is not a problem, as the simple word counts it tries to account for, such as the one in Fig 1, also ignore this.) We can check whether the intuition about the relation between meaning specificity/embedding depth and frequency of usage is right in general by doing the same for all nouns in WordNet. The top panel in Fig 3 shows the distribution of two different “specificity” classes over the overall frequency distribution of nouns in the Brown corpus, viz. 
nouns that have an embedding depth between 3 and 9 (medium; red circles), and nouns that are either at the top or towards the lower end of the taxonomy (high/low; blue pluses). Words that were not attested in the corpus were removed. As can be seen, the most frequently used concepts are indeed modestly specified, with a depth of embedding of 3–9; that is, specific enough to be distinctive while general enough to be reusable. In the bottom panel, the distributions of the log rank per specificity class are shown. Words with modest specification have a lower rank (or higher frequency) on average and span the entire range; words with a high/low degree of specification have higher ranks only.
Top panel: distribution over overall distribution of nouns. Degree of meaning specification is approximated by automatically determining the depth of embedding in the WordNet noun taxonomy. Words with lowest ranks are all moderately specified with an embedding of 3–9 (red circles). Bottom panel: boxplots of frequency ranks per specificity class.
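The depth-of-embedding measure can be illustrated with a toy fragment of the WordNet noun hierarchy, encoded as a child-to-parent map. The chain below is abbreviated (in WordNet proper, submarine has 17 superordinate sets), but thing’s two superordinates match the text:

```python
# Toy is-a taxonomy (node -> parent); "entity" is the root.
# An abbreviated, illustrative fragment of WordNet's noun hierarchy.
hypernym = {
    "physical_entity": "entity",
    "thing": "physical_entity",
    "object": "physical_entity",
    "whole": "object",
    "artifact": "whole",
    "instrumentality": "artifact",
    "conveyance": "instrumentality",
    "vehicle": "conveyance",
    "craft": "vehicle",
    "vessel": "craft",
    "ship": "vessel",
    "warship": "ship",
    "submersible_warship": "warship",
    "submarine": "submersible_warship",
}

def embedding_depth(word):
    """Number of superordinate sets above a word in the taxonomy."""
    depth = 0
    while word in hypernym:
        word = hypernym[word]
        depth += 1
    return depth

print(embedding_depth("thing"))      # 2: physical entity, entity
print(embedding_depth("submarine"))  # much deeper in the hierarchy
```

A word inherits the specifications of every node it passes through on the way to the root, so depth serves as a proxy for degree of meaning specification.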
Instead of using embedding as an approximation, the degree of meaning specification of words can also be simulated, by generating an abstract lexicon in which words are specified for a number of meaning dimensions. The first dimension could be taken to represent a property that all concrete objects have and abstract objects do not (i.e., it is activated in the vector representations of concrete objects only), the second dimension represents something animate objects have and inanimate ones do not, etc. (cf. [13–16] for applications). Note that qualitatively, this is very different from the vector-semantics approach used in modern computational linguistics (e.g. [17, 18]), in which vectors represent behavior in texts rather than the underlying semantics that causes this behavior. Rather, the vectors used here should be understood as representations of activation in a neural-network model of the brain [19, 20].
Using this implementation, the usage of words is modeled by randomly generating contexts with a target object and a set of distractors that are fully specified for all meaning dimensions. Next, a word from the lexicon is selected that suffices to single out the target object. For example, we may have two words in our lexicon, the first of which, a, is specified for all three meaning dimensions, with values 0, 0, and 1 respectively, whereas the second, b, is specified for dimensions D1 and D3 only, with values 0 and 1 (cf. Table 2). If the target object is a 0 0 1, both words match in principle. In contexts with distractor objects that all happen to differ from the target on either D1 or D3 (the first four distractors in Table 2), both words a and b can be used; but whenever there is a distractor object that is similar to the target on both D1 and D3 (the fifth distractor), word a is necessary to uniquely refer to the target.
Words are specified for three dimensions or fewer; referential objects are always fully specified. To distinguish the target from the first four distractors, words a and b can both be used; in the presence of the fifth distractor, however, only a suffices.
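The selection logic of this example can be written out directly. This is a minimal sketch: the values of a = (0, 0, 1) and b = (0, _, 1) follow Table 2, while the particular distractor objects are illustrative:

```python
# Words are partial specifications over binary meaning dimensions
# (None = unspecified); referential objects are fully specified.
def matches(word, obj):
    """A word fits an object if all its specified dimensions agree."""
    return all(w is None or w == o for w, o in zip(word, obj))

def usable(word, target, distractors):
    """A word can be used iff it fits the target and no distractor."""
    return matches(word, target) and not any(matches(word, d) for d in distractors)

a = (0, 0, 1)      # specified for all three dimensions
b = (0, None, 1)   # specified for D1 and D3 only
target = (0, 0, 1)

# distractors that each differ from the target on D1 or D3
distractors = [(1, 0, 1), (0, 0, 0), (1, 1, 1), (0, 1, 0)]
print(usable(a, target, distractors), usable(b, target, distractors))  # True True

# add a distractor that agrees with the target on both D1 and D3
distractors.append((0, 1, 1))
print(usable(a, target, distractors), usable(b, target, distractors))  # True False
```

Only the fully specified word a can still single out the target once a distractor shares all of b’s specified dimensions with it.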
In a simulation whose results are shown in Fig 4, a lexicon of 1,000 words with ten meaning dimensions is used, from which words are selected for 10,000 contexts with randomly generated targets and 5 randomly generated distractors. As with the natural-language example in the previous figure, words of moderate specification are used most frequently.
The lexicon consists of 1,000 words with ten optional meaning dimensions, from which words are selected for 10,000 contexts with randomly generated targets and 5 randomly generated distractors. Words with lowest ranks are all moderately specified (2–4 dimensions; red circles). Bottom panel: boxplots of frequency ranks per specificity class.
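A scaled-down version of this simulation can be sketched as follows. The text does not specify how a word is chosen when several suffice, so random choice among sufficient words is assumed here, and the lexicon and context counts are reduced for speed:

```python
import random
from collections import Counter

N_DIMS = 10

def random_word():
    """A word specified for a random subset of the meaning dimensions."""
    k = random.randint(0, N_DIMS)                 # degree of specification
    dims = set(random.sample(range(N_DIMS), k))
    return tuple(random.choice((0, 1)) if i in dims else None
                 for i in range(N_DIMS))

def random_object():
    """Referential objects are always fully specified."""
    return tuple(random.choice((0, 1)) for _ in range(N_DIMS))

def fits(word, obj):
    return all(w is None or w == o for w, o in zip(word, obj))

random.seed(0)
lexicon = [random_word() for _ in range(200)]     # scaled down from 1,000
usage = Counter()
for _ in range(2_000):                            # scaled down from 10,000
    target = random_object()
    distractors = [random_object() for _ in range(5)]
    # sufficient words fit the target but none of the distractors
    sufficient = [w for w in lexicon if fits(w, target)
                  and not any(fits(w, d) for d in distractors)]
    if sufficient:
        usage[random.choice(sufficient)] += 1     # assumption: random choice

def spec(word):
    return sum(v is not None for v in word)

# the most-used words end up moderately specified
print([(spec(w), c) for w, c in usage.most_common(5)])
```

Highly specified words rarely fit a random target, while barely specified words are almost always ruled out by some distractor; words of moderate specification accumulate the most uses.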
Given the match between the results from the combined WordNet/Brown study and the computer simulation, we can go one step further and develop a mathematical model of the dependence of usage frequency on degree of specification. Assuming binary meaning dimensions, the probability pa that a word is applicable in principle is 0.5^nDim, with nDim being the number of meaning dimensions for which that word is specified [21]. As this holds for target and distractor objects alike, the probability that a word can actually be used in context depends on the number of distractor objects n: the probability pd that there is no distractor object to which the word also applies is (1 − pa)^n, hence the probability pu that a word will be used is pa * pd. We can now randomly generate words, assign them a degree of specification (without specifying the meaning dimensions themselves), and calculate the expected usage frequency for a given number of distractor objects. The results are shown in Fig 5. The close similarity with the previous figure strongly suggests that we have successfully modeled the interaction between meaning specification and usage frequency.
The lexicon consists of 1,000 words with ten optional meaning dimensions. Probability of usage depends on degree of specification and number of distractors assumed (here 5). As in the previous figures, words with lowest ranks are all moderately specified (3–6 dimensions; red circles). Bottom panel: boxplots of frequency ranks per specificity class.
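The model can be computed directly per degree of specification; a few lines suffice:

```python
# Expected usage probability as a function of degree of specification,
# following the model in the text:
#   pa = 0.5**n_dim, pd = (1 - pa)**n, pu = pa * pd.
def p_usage(n_dim, n_distractors=5):
    p_a = 0.5 ** n_dim                  # word applies to a random object
    p_d = (1 - p_a) ** n_distractors    # ...and to none of the distractors
    return p_a * p_d

for n_dim in range(1, 11):
    print(n_dim, round(p_usage(n_dim), 4))
```

Setting the derivative of pa(1 − pa)^n to zero gives a maximum at pa = 1/(n + 1); with five distractors this is pa = 1/6, i.e. around three specified dimensions, which is why moderately specified words come out most frequent.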
Importantly, as the results in Fig 3 already showed, semantics alone does not suffice to yield a Zipfian distribution: the frequency distribution within nouns is not the straight line through double-log space that Zipf’s law prescribes.