A Genetic Algorithm with Trusty PMML

Recently, I stumbled upon this interesting article and its companion project about a Genetic Algorithm, and I asked myself whether the features of Trusty PMML could be put to meaningful use in such a context.

I won’t go deep into technical details, but basically, the Genetic Algorithm treats features as "genes"; a set of genes is a "genome", and that represents a unit of evaluation (an individual). A population is made of multiple such individuals. Each individual has a fitness value, based on the values of its genes, and the goal of the algorithm is to select the individual with the highest fitness, or with a fitness equal to the expected one.

While the algorithm runs, new generations are created, where each newborn is the "combination" of two parents. The interesting feature is the "mutation" of such units, where some genes of one are exchanged with genes of another one. At each generation, the fittest individuals replace older ones, and that eventually leads to "convergence", when the expected fitness value is reached.

I’m always fascinated by technologies that mimic biological behavior to obtain their results, and a Genetic Algorithm is meant to mimic genetic evolution through sexual reproduction. In real life, though, genetic evolution is driven by two factors:

genetic recombination (specific to sexual reproduction)
spontaneous genetic mutation.

In my implementation, I wanted to recreate both of them, and I wanted to demonstrate some cool features of Trusty PMML at the same time.
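
To make the two mechanisms concrete, here is a minimal sketch in Kotlin. It is not the code of the forked project: all names (Individual, crossover, mutate, evolve) are hypothetical, genes are plain 0/1 values, and fitness is assumed to be the number of genes set to 1. Recombination is modeled as a single-point crossover of the two fittest individuals, and spontaneous mutation as the flip of one random gene.

```kotlin
import kotlin.random.Random

// Hypothetical names throughout: this is not the code of the forked project.
data class Individual(val genes: List<Int>) {
    // Assumption: fitness is simply the number of genes set to 1.
    val fitness: Int get() = genes.sum()
}

// Genetic recombination: the child takes the head of one parent and the tail of the other.
// Requires at least two genes per individual.
fun crossover(a: Individual, b: Individual, rnd: Random): Individual {
    val point = rnd.nextInt(1, a.genes.size)
    return Individual(a.genes.take(point) + b.genes.drop(point))
}

// Spontaneous mutation: flip one randomly chosen gene.
fun mutate(individual: Individual, rnd: Random): Individual {
    val flipped = rnd.nextInt(individual.genes.size)
    return Individual(individual.genes.mapIndexed { i, g -> if (i == flipped) 1 - g else g })
}

// Evolve until one individual has every gene set to 1, or the generation cap is hit.
fun evolve(populationSize: Int, geneCount: Int, maxGenerations: Int, rnd: Random): Pair<Int, Individual> {
    var population = List(populationSize) { Individual(List(geneCount) { rnd.nextInt(2) }) }
    var generation = 0
    while (generation < maxGenerations && population.maxOf { it.fitness } < geneCount) {
        val sorted = population.sortedByDescending { it.fitness }
        val child = mutate(crossover(sorted[0], sorted[1], rnd), rnd)
        // The child replaces the least fit individual only if it is at least as fit.
        population = if (child.fitness >= sorted.last().fitness) sorted.dropLast(1) + child else sorted
        generation++
    }
    return generation to population.maxByOrNull { it.fitness }!!
}

fun main() {
    val (generation, best) = evolve(populationSize = 5, geneCount = 8, maxGenerations = 10_000, rnd = Random(42))
    println("Stopped at generation $generation with genes ${best.genes} (fitness ${best.fitness})")
}
```

The generation cap is there because nothing guarantees convergence within a given number of steps; a real implementation would probably tune the selection and replacement strategy as well.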

To start with, I forked the original project here and I began to use PMML for gene representation.

The next step was to come up with a concrete, plausible example. During a get-together, a colleague started talking about growing plants and crops at home, and that gave me the idea.

The Story of Gus the Farmer

One day, Gus decided to quit his job and start a farming business. He did not have much land to cultivate, so he decided to grow houseplants in greenhouses. He ran a market survey to find out which plants were the most sought after, and he also gathered historical weather data about the water and light available at the place where he was going to build. Being wise, he also devised a plan to keep his business flexible, so that he could maintain his profit even when customer tastes or environmental conditions changed.

He found out that people mostly liked ferns, roses and cacti.

He built three greenhouses, each with a slightly different configuration, specialized for one kind of plant. He then started a setup phase, with five harvests a week for each greenhouse (these are quite extraordinary greenhouses, actually), and at each harvest he slightly tweaked the configuration of each one so that, in the end, all of them would produce the most requested plant. He then successfully started his business and, whenever people's preferences or the environmental variables change, all he has to do is go through the setup phase again to remain in business.

And the code?

This story is represented inside the Greenhouses branch of my project. It is a standard Maven project written in Kotlin; it makes no claim to be mathematically or theoretically accurate, but it shows how some Trusty PMML features may be applied to a Genetic Algorithm.

The setup phase is represented by the loop inside the SimpleDemoGA class. Farm represents the overall installation, producing harvests at each iteration. Harvest is the product of the greenhouses. And Greenhouse is the execution of a PMML model. The result of this execution may be fern, rose or cactus.

At startup there are three slightly different PMML files: Greenhouse_A.pmml, Greenhouse_B.pmml and Greenhouse_C.pmml.

Each of those PMML files defines a Regression model with three RegressionTables, each targeting a specific plant:

<RegressionModel functionName="classification" modelName="..." targetFieldName="Species">
    <RegressionTable targetCategory="fern" ...>
        ...
    </RegressionTable>
    <RegressionTable targetCategory="rose" ...>
        ...
    </RegressionTable>
    <RegressionTable targetCategory="cactus" ...>
        ...
    </RegressionTable>
</RegressionModel>

Each table has its own intercept value, and inside each of them a different coefficient is given to the water and light variables:

<RegressionTable targetCategory="..." intercept="1">
    <NumericPredictor name="water" exponent="1" coefficient="..."/>
    <NumericPredictor name="light" exponent="1" coefficient="..."/>
</RegressionTable>

and that makes every greenhouse specific to a given plant.

When a greenhouse produces the required plant, its gene is set to "1"; otherwise it is set to "0".

The fitness of a harvest is the number of "1" genes. When all of them are "1", it means that all greenhouses are producing the required plant, and the target is reached.
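
The gene and fitness rules above can be sketched in a few lines of Kotlin. The types below are hypothetical and only illustrate the rule; they are not the Harvest or Greenhouse classes of the project.

```kotlin
// Hypothetical types; the names are illustrative and not taken from the project.
enum class Species { FERN, ROSE, CACTUS }

// A harvest records what each greenhouse produced, keyed by greenhouse name.
data class HarvestSketch(val produced: Map<String, Species>) {
    // One gene per greenhouse: 1 if it produced the required plant, 0 otherwise.
    fun genes(required: Species): List<Int> = produced.values.map { if (it == required) 1 else 0 }

    // The fitness is the number of genes set to 1.
    fun fitness(required: Species): Int = genes(required).sum()
}

fun main() {
    val harvest = HarvestSketch(
        linkedMapOf(
            "Greenhouse_A" to Species.ROSE,
            "Greenhouse_B" to Species.ROSE,
            "Greenhouse_C" to Species.FERN
        )
    )
    println(harvest.genes(Species.ROSE))   // prints [1, 1, 0]
    println(harvest.fitness(Species.ROSE)) // prints 2
}
```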

Trusty PMML in action

The mutation of the greenhouse configuration mimics natural, random genetic mutation. To achieve it, a cool feature of Trusty PMML has been exploited: the ability to change and quickly reload the underlying model at runtime. This is not standard procedure. Usually, predictive models are created, trained and tuned with different tools, and the PMML is generated from them.

This requires an overall company setup similar to the following one:

business analysts gather requirements
data scientists design and tune models
someone exports models in PMML format
developers write code that uses such PMML files (and a PMML engine) to actually return a value based on a given input.

There may be cases, though, where this setup does not fulfill the business requirement. One example could be a daily incremental fine-tuning of an already generated model. Another requirement could be the usage of the PMML model slightly beyond its specification, and I have a concrete example of that.

We received a request about evaluating the probability that a given string is similar to another one, both provided as input by the user. This could be a particular use case of the Levenshtein distance.

PMML defines an expression called TextIndex that actually uses that distance to evaluate whether one string is similar to another. The problem is that maxLevenshteinDistance is a fixed attribute, so TextIndex can tell whether two strings are similar within a specific Levenshtein distance, while the request translates to something like "what is the Levenshtein distance between those two strings?"

Of course it would be possible to write a custom extension to solve that, and possibly even a really contrived PMML based on nested defined functions, replacements, regular expressions and whatnot.

Modifying that maxLevenshteinDistance attribute on the fly could be another solution. I know it would be impractical if requests are too frequent, but this gives you an idea of a possible use case.
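
The gap between "similar within a fixed distance" and "what is the distance" can be illustrated with a small Kotlin sketch. It does not call any PMML engine: matchesWithin is a hypothetical, simplified stand-in for an evaluation with a fixed threshold (the real TextIndex does more than compare two whole strings), and smallestMatchingThreshold mimics the idea of raising the threshold on the fly until the evaluation matches.

```kotlin
// Classic dynamic-programming Levenshtein distance, keeping only two rows of the table.
fun levenshtein(a: String, b: String): Int {
    var previous = IntArray(b.length + 1) { it }
    for (i in 1..a.length) {
        val current = IntArray(b.length + 1)
        current[0] = i
        for (j in 1..b.length) {
            val cost = if (a[i - 1] == b[j - 1]) 0 else 1
            current[j] = minOf(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
        }
        previous = current
    }
    return previous[b.length]
}

// Hypothetical stand-in for an evaluation with a fixed threshold: it only answers yes or no.
fun matchesWithin(a: String, b: String, maxDistance: Int): Boolean =
    levenshtein(a, b) <= maxDistance

// Raising the threshold step by step turns the yes/no answer into the distance itself.
fun smallestMatchingThreshold(a: String, b: String): Int {
    var threshold = 0
    while (!matchesWithin(a, b, threshold)) threshold++
    return threshold
}

fun main() {
    println(levenshtein("kitten", "sitting"))               // prints 3
    println(matchesWithin("kitten", "sitting", 2))          // prints false
    println(smallestMatchingThreshold("kitten", "sitting")) // prints 3
}
```

Each step of the loop corresponds to one model modification and reload, which is why this approach only makes sense when requests are infrequent.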

Back to Gus

At each harvest Gus needs to tweak the greenhouses. In algorithmic terms, a gene mutation has to happen. In the actual implementation, a modification of the PMML is required. This is done inside the PMMLManager.mutatePMMLFile() method:

the latest version of the PMML file is loaded as a PMML instance
the intercept value of the RegressionTable for the required species is increased
the modified PMML is serialized back to the PMML file, overwriting the previous version
for each greenhouse of each harvest, the modified versions of the PMML files are loaded and the actual evaluation is done on top of them
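
The load, modify and overwrite steps can be sketched with plain JDK XML classes. This is not the PMMLManager.mutatePMMLFile() implementation, which works on a PMML instance; the function names, the step of 1.0 and the trimmed-down XML fragment in main are assumptions made for illustration only.

```kotlin
import java.io.File
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.transform.TransformerFactory
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.stream.StreamResult
import org.w3c.dom.Document
import org.w3c.dom.Element

// Hypothetical helpers: they edit the XML directly instead of going through a PMML object model.
fun loadXml(file: File): Document =
    DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file)

fun regressionTables(document: Document): List<Element> {
    val nodes = document.getElementsByTagName("RegressionTable")
    return (0 until nodes.length).map { nodes.item(it) as Element }
}

// Increase the intercept of the table targeting the given species, then overwrite the file.
fun increaseIntercept(pmmlFile: File, species: String, step: Double = 1.0) {
    val document = loadXml(pmmlFile)
    regressionTables(document)
        .filter { it.getAttribute("targetCategory") == species }
        .forEach { it.setAttribute("intercept", (it.getAttribute("intercept").toDouble() + step).toString()) }
    pmmlFile.outputStream().use { out ->
        TransformerFactory.newInstance().newTransformer().transform(DOMSource(document), StreamResult(out))
    }
}

fun readIntercept(pmmlFile: File, species: String): Double =
    regressionTables(loadXml(pmmlFile))
        .first { it.getAttribute("targetCategory") == species }
        .getAttribute("intercept").toDouble()

fun main() {
    // A trimmed-down fragment, not a complete valid PMML document.
    val file = File.createTempFile("Greenhouse_sketch", ".pmml")
    file.writeText(
        """
        <PMML>
          <RegressionModel functionName="classification" targetFieldName="Species">
            <RegressionTable targetCategory="fern" intercept="1"/>
            <RegressionTable targetCategory="rose" intercept="2"/>
          </RegressionModel>
        </PMML>
        """.trimIndent()
    )
    repeat(3) { increaseIntercept(file, "rose") }
    println(readIntercept(file, "rose")) // prints 5.0
    file.delete()
}
```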

At the end of the cycle, when the target is reached (i.e. a harvest with all three genes set to "1"), the target/classes folder will contain the latest versions of the PMML files, while the original ones are kept unmodified inside the src/main/resources folder.

Setup in Action

The following is an excerpt of an actual execution of the program with the following input data:

Required specie: rose Environment values {water=48, light=5}

During execution, the log will show the predicted values for each model while they are mutated:

...
Model Greenhouse_C 
results {Predicted_Species=fern, Probability_rose=66.9, Species=fern, Probability_fern=91.30000000000001, Probability_cactus=-157.20000000000002} 
Model Greenhouse_A 
results {Predicted_Species=fern, Probability_rose=54.0, Species=fern, Probability_fern=65.5, Probability_cactus=-118.5} 
Model Greenhouse_B 
results {Predicted_Species=fern, Probability_rose=51.7, Species=fern, Probability_fern=56.900000000000006, Probability_cactus=-107.60000000000001} 
Model Greenhouse_C 
results {Predicted_Species=fern, Probability_rose=66.9, Species=fern, Probability_fern=91.30000000000001, Probability_cactus=-157.20000000000002} 
Model Greenhouse_A 
results {Predicted_Species=fern, Probability_rose=54.0, Species=fern, Probability_fern=65.5, Probability_cactus=-118.5} 
...

At each cycle, an overall report is printed out, with the genomes of all five harvests:

Population of 5 individual(s).
Required specie: rose Environment values {water=48, light=5}
Generation: 0 Fittest score: 0
==Genetic Pool==
> Harvest 0 | [genes=[0, 0, 0]] | [features= Greenhouse_A: fern, Greenhouse_B: fern, Greenhouse_C: fern |
> Harvest 1 | [genes=[0, 0, 0]] | [features= Greenhouse_A: fern, Greenhouse_B: fern, Greenhouse_C: fern |
> Harvest 2 | [genes=[0, 0, 0]] | [features= Greenhouse_A: fern, Greenhouse_B: fern, Greenhouse_C: fern |
> Harvest 3 | [genes=[0, 0, 0]] | [features= Greenhouse_A: fern, Greenhouse_B: fern, Greenhouse_C: fern |
> Harvest 4 | [genes=[0, 0, 0]] | [features= Greenhouse_A: fern, Greenhouse_B: fern, Greenhouse_C: fern |
================

Progressively, some genes will switch from 0 to 1:

Generation: 38 Fittest score: 2
==Genetic Pool==
> Harvest 0 | [genes=[1, 1, 0]] | [features= Greenhouse_A: rose, Greenhouse_B: rose, Greenhouse_C: fern |
> Harvest 1 | [genes=[1, 1, 0]] | [features= Greenhouse_A: rose, Greenhouse_B: rose, Greenhouse_C: fern |
> Harvest 2 | [genes=[1, 1, 0]] | [features= Greenhouse_A: rose, Greenhouse_B: rose, Greenhouse_C: fern |
> Harvest 3 | [genes=[0, 1, 0]] | [features= Greenhouse_A: fern, Greenhouse_B: rose, Greenhouse_C: fern |
> Harvest 4 | [genes=[0, 1, 0]] | [features= Greenhouse_A: fern, Greenhouse_B: rose, Greenhouse_C: fern |
================

until the conclusion of the setup:

Generation: 75 Fittest score: 3
==Genetic Pool==
> Harvest 0 | [genes=[1, 1, 1]] | [features= Greenhouse_A: rose, Greenhouse_B: rose, Greenhouse_C: rose |
> Harvest 1 | [genes=[1, 1, 0]] | [features= Greenhouse_A: rose, Greenhouse_B: rose, Greenhouse_C: fern |
> Harvest 2 | [genes=[1, 1, 0]] | [features= Greenhouse_A: rose, Greenhouse_B: rose, Greenhouse_C: fern |
> Harvest 3 | [genes=[1, 1, 0]] | [features= Greenhouse_A: rose, Greenhouse_B: rose, Greenhouse_C: fern |
> Harvest 4 | [genes=[1, 1, 0]] | [features= Greenhouse_A: rose, Greenhouse_B: rose, Greenhouse_C: fern |
================

Solution found in generation 75
Fitness: 3
Genes: 111

Process finished with exit code 0

Last, as a comparison, here’s a snippet from the original Greenhouse_B.pmml:

<RegressionTable targetCategory="fern" intercept="1">
   <NumericPredictor name="water" exponent="1" coefficient="1.3"/>
   <NumericPredictor name="light" exponent="1" coefficient="-1.3"/>
</RegressionTable>
<RegressionTable targetCategory="rose" intercept="2">
   <NumericPredictor name="water" exponent="1" coefficient="0.9"/>
   <NumericPredictor name="light" exponent="1" coefficient="1.1"/>
</RegressionTable>
<RegressionTable targetCategory="cactus" intercept="1">
   <NumericPredictor name="water" exponent="1" coefficient="-1.3"/>
   <NumericPredictor name="light" exponent="1" coefficient="1.3"/>
</RegressionTable>

and here’s the mutated one after successful completion of setup:

<RegressionTable intercept="1" targetCategory="fern">
   <NumericPredictor name="water" exponent="1" coefficient="1.3"/>
   <NumericPredictor name="light" exponent="1" coefficient="-1.3"/>
</RegressionTable>
<RegressionTable intercept="29" targetCategory="rose">
   <NumericPredictor name="water" exponent="1" coefficient="0.9"/>
   <NumericPredictor name="light" exponent="1" coefficient="1.1"/>
</RegressionTable>
<RegressionTable intercept="1" targetCategory="cactus">
   <NumericPredictor name="water" exponent="1" coefficient="-1.3"/>
   <NumericPredictor name="light" exponent="1" coefficient="1.3"/>
</RegressionTable>
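
The effect of the mutation can be checked by hand. The following is a minimal sketch, assuming each RegressionTable is evaluated as intercept plus coefficient times input (all exponents are 1) and that the category with the highest value is the predicted one; the Table class and its names are hypothetical and not part of Trusty PMML or the project. With water=48 and light=5, the fern table gives about 56.9, which matches the Probability_fern logged for Greenhouse_B, while the rose table gives about 50.7 with the original intercept and about 77.7 with the mutated one. Cactus is left out on purpose: in the logs above, Probability_cactus equals 1 minus the sum of the other two values rather than a plain linear score.

```kotlin
// One RegressionTable with two numeric predictors, both with exponent 1 (hypothetical class).
data class Table(val intercept: Double, val waterCoefficient: Double, val lightCoefficient: Double) {
    fun score(water: Double, light: Double): Double =
        intercept + waterCoefficient * water + lightCoefficient * light
}

fun main() {
    // Environment values from the execution shown above.
    val water = 48.0
    val light = 5.0

    // Coefficients and intercepts copied from the Greenhouse_B snippets.
    val fern = Table(intercept = 1.0, waterCoefficient = 1.3, lightCoefficient = -1.3)
    val roseOriginal = Table(intercept = 2.0, waterCoefficient = 0.9, lightCoefficient = 1.1)
    val roseMutated = Table(intercept = 29.0, waterCoefficient = 0.9, lightCoefficient = 1.1)

    println("fern score: ${fern.score(water, light)}")
    println("rose score (original): ${roseOriginal.score(water, light)}")
    println("rose score (mutated): ${roseMutated.score(water, light)}")

    // Before the mutation fern wins; after it, rose does.
    println(roseOriginal.score(water, light) > fern.score(water, light)) // prints false
    println(roseMutated.score(water, light) > fern.score(water, light))  // prints true
}
```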

Conclusion

The code used in the example may be retrieved here. The purpose of this article and the companion project is to illustrate a possible usage of Trusty PMML for a Genetic Algorithm implementation. It makes no claim at all to theoretical or mathematical rigor. Any comment or suggestion is more than welcome. I hope you enjoyed it!
