
How We Made Profits Forecasting Wind Energy Production Levels

source link: https://towardsdatascience.com/how-we-made-profits-forecasting-wind-energy-production-levels-b93bd3a7f1ed?gi=ef646f0dcdf9


Jul 27 · 16 min read


Illustration by author.

Competition: AI4Impact Datathon 2020

Team: DeepSleep

Team Members: Cleon Wong, Isabelle Lim Xiao Rong, Lau Tian Wei

Article Outline

  1. The Challenge
  2. Data Exploration and Analysis
  3. Data Preprocessing
  4. Two Metrics to Optimise for
  5. Feature Engineering
  6. The Models
  7. Limitations and Future Improvements

1. The Challenge

As a Wind Energy Trader, we must produce accurate forecasts. The Grid Operator, who administers the energy grid of a particular area, purchases energy from the Wind Energy Producer based on our forecast.

Here are the rules of the game:

  1. We (Wind Energy Trader) are given 10,000,000¢ as cash reserve in case of an over-forecast.
  2. Each kWh of wind energy forecasted is sold at 10¢.
  3. Under-forecast: If the amount of wind energy produced is greater than the forecast, the surplus is wasted (we make no profits from it).
  4. Over-forecast: If the amount of wind energy produced is less than the forecast, we have to purchase energy from the spot market at 20¢ nett per kWh to make up for the deficit. If we have less cash in hand than what is required to cover the shortfall, we are fined 100¢ per kWh for the amount we cannot buy. This negative cash (debt) is cumulative. (A simple simulation of these rules is sketched after this list.)
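To make the payoff structure concrete, here is a minimal sketch of how settlement under these rules could be simulated in Python. The exact settlement mechanics, and the interpretation of "20¢ nett" as a flat 20-cent outflow per kWh of shortfall, are our assumptions; treat this as an illustration rather than the competition's scoring code.

```python
# A hedged sketch of the settlement rules above, not the organisers' exact scoring code.
# `forecasts` and `actuals` are sequences of hourly kWh values.
def simulate_trading(forecasts, actuals, cash=10_000_000):
    for forecast, actual in zip(forecasts, actuals):
        cash += 10 * forecast                      # every forecasted kWh is sold at 10 cents
        deficit = max(forecast - actual, 0)        # over-forecast: shortfall to buy back
        affordable = min(deficit, max(cash, 0) / 20)
        cash -= 20 * affordable                    # spot-market purchase at 20 cents nett per kWh
        cash -= 100 * (deficit - affordable)       # fine for energy we cannot afford to buy
        # Under-forecast: the surplus (actual - forecast) is wasted and earns nothing.
    return cash                                    # can go negative; debt is cumulative
```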

In this challenge, we are tasked to maximise our profits by using artificial neural networks (ANN) to generate accurate T+18 hours forecasts of wind energy produced during the trading window of 23rd July 2020 to 29th July 2020.

In this article, we detail our process, from the raw data to the final model that we used to forecast the wind energy.

2. Data Exploration and Analysis

We have two datasets to work with:

  1. Wind energy production levels from the RTE from 1 Jan 2017 to present. This is the value that we are trying to predict. This 15-minute timescale data was interpolated to a 1-hourly timescale using previous interpolation and a 72-hour window.
  2. Wind speed and direction forecasts for 8 locations in the Ile-de-France region, each of which represents a major wind farm in the region. Wind speed is measured in m/s and direction is measured as a bearing in degrees North. Similarly, these forecasts have also been interpolated using previous interpolation and a 72-hour window to fit the hourly timescale of the wind energy production levels.

Furthermore, the forecasts for wind speed and direction come from two different models (each covering the same 8 wind farm locations). This makes the total number of columns in the dataset:

8 locations * 2 features (speed & direction) * 2 models = 32 columns

2.1 Wind energy production levels


Wind energy production level. Illustration by author.

Looking at the raw data, we found that the wind energy production level looked highly seasonal. Wind plant performance tends to be higher during winter and autumn and lower in spring and summer. We can also see a gradual upward trend in the energy produced over the years. This could be due to an increase in the number of wind farms in operation, which raised the region's energy production capacity.

2.2 Wind speed forecasts

Wind speed forecasts from both forecast models across the 8 locations.

2.3 Wind direction forecasts


Visualising wind directions from both forecast models. Illustration by author.

We see from both forecast models that the wind generally blows in the north-east and south-west directions.

3. Data Preprocessing

For preprocessing, we applied spatial averaging and normalisation to wind speeds and direction.

Given that we have 32 columns of data just for two fundamental features (wind speed and direction), averaging the wind speeds and directions can help reduce the dimensions of our data input, potentially making the data less noisy and also computationally cheaper.

Since wind energy levels, speed and direction are all measured in different units, normalisation is also necessary to make sure that all these values share a common scale, without distorting differences in their respective ranges of values.

3.1 Preprocessing wind speed

Looking at the plots of wind speed forecasts above, we see that each model's forecasts across the 8 locations look very similar to each other. Hence, taking a simple average across the 8 locations makes sense. Doing this for both forecast models reduces the number of wind speed columns from 16 to 2.

Normalising the wind speeds is straightforward using the formula below, and we do the same for the wind energy production levels. Doing this, we center the distributions of wind speeds and energy production levels around 0, and they now share the same scale, i.e. values are expressed in standard deviations, with most falling between [-1, 1].


Normalisation formula. Illustration by author.
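As a concrete illustration, here is a minimal sketch of the spatial averaging and normalisation described above, assuming the wind speed forecasts sit in a pandas DataFrame with one column per location (the column names are hypothetical).

```python
import pandas as pd

# Hypothetical column names: "model1_speed_loc1" ... "model1_speed_loc8", etc.
def average_and_normalise_speed(df: pd.DataFrame, model: str) -> pd.Series:
    cols = [f"{model}_speed_loc{i}" for i in range(1, 9)]
    avg = df[cols].mean(axis=1)              # spatial average across the 8 locations
    return (avg - avg.mean()) / avg.std()    # centre at 0, scale by standard deviation
```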

3.2 Preprocessing wind direction

Since directions are measured in degrees North, averaging and normalising wind direction is a slightly less straightforward and hence interesting problem.

It is inaccurate to simply take the sum of the directions across all 8 locations and divide them by 8 (note that this is acceptable when averaging wind speeds, think of 50km/h being the average of 40km/h and 60km/h). Taking the average between 359° and 1°, (359°+1°)/2, will give us an average of 180°, which is in the complete opposite direction of both 359° and 1°! Hence, we would need another method of averaging directional data.

jmaQb2y.jpg!web

Taking the average of 359° and 1° gives us a direction that points in the almost complete opposite direction. Illustration by author.

Since radians measure an angle using the length of the arc it traces [1], we can address this problem by first converting degrees to radians, then normalising, and finally averaging across the 8 locations. Note that we have to normalise before taking the average, or else we would still face the same issue described above.

To normalise directions measured in radians, we take their sine and cosine to compress the direction to the range [-1, 1]. We need both the sine and the cosine because leaving out either function makes it impossible to tell which quadrant the original angle came from. The diagram below illustrates this: knowing only sin(x) = √2/2 gives two possible values, x = {(1/4)π, (3/4)π}; knowing only cos(x) = √2/2 also gives two possible values, x = {(1/4)π, (7/4)π}. However, knowing that sin(x) = cos(x) = √2/2 tells us that x = (1/4)π.


Using both sine and cosine functions to normalise direction. Illustration by author.
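A minimal sketch of this procedure, assuming the directions arrive as a pandas DataFrame of bearings in degrees with one column per location (the column names are hypothetical): convert to radians, take the sine and cosine, then average each across the 8 locations.

```python
import numpy as np
import pandas as pd

def encode_and_average_direction(df: pd.DataFrame, model: str):
    cols = [f"{model}_dir_loc{i}" for i in range(1, 9)]   # hypothetical column names
    radians = np.deg2rad(df[cols])
    sin_avg = np.sin(radians).mean(axis=1)   # normalise first (sine/cosine) ...
    cos_avg = np.cos(radians).mean(axis=1)   # ... then average across locations
    return sin_avg, cos_avg                  # keeping both preserves the quadrant
```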

Following the averaging and normalisation, we managed to compress the original 33 columns of data down to 7 columns without losing much information.


Visual summary of data preprocessing. Illustration by author.

4. Two Metrics to Optimise for

Before exploring feature engineering and our models, we have to understand the two metrics that serve as our yardsticks when optimising our features and models. They are:

  1. MAE test loss — to be minimised.
  2. Dollar profits as a percentage of total possible profits (profits ratio) — to be maximised.

Note that the persistence loss in MAE terms is 0.6483; this is the loss that we aim to beat with our selected features and models. Section 7 (Limitations) explains why the MAE test loss is used.
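For reference, a persistence forecast simply carries the last known value forward, i.e. it predicts that production at T+18 equals production at T. A minimal sketch of how its MAE could be computed on the (normalised) energy series; the array name is hypothetical.

```python
import numpy as np

def persistence_mae(energy: np.ndarray, horizon: int = 18) -> float:
    # Predict energy[t + horizon] with energy[t] and take the mean absolute error.
    return float(np.mean(np.abs(energy[horizon:] - energy[:-horizon])))
```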

5. Feature Engineering

Averaging and normalising help us preprocess the raw dataset. From this preprocessed dataset, we can use feature engineering to then prepare the actual input dataset that our model will learn from. Feature engineering is about creating new input features from our existing ones (the preprocessed raw dataset).

In general, you can think of data cleaning as a process of subtraction and feature engineering as a process of addition. [2]

5.1 Features that made the cut

Our feature engineering was very much an iterative process that ran alongside our model building (which we dive into in the next section). We went through many features, dropping original features that we (wrongly) thought were good and adding new features that we never thought would work. After a series of trial and error, these are the final features that made the cut.


Our final engineered features.

For our window choices, we started off with small, naive windows of 0:-5 for energy production and 18:0 for wind speed and direction, because starting with too big a window can result in poor learning. We then slowly increased the window sizes, because small windows may exclude important information in the data. We found that windows of 0:-54 (triple the forecast window) for energy production and 18:-18 (double the forecast window) for wind speed and direction gave us the best model performance.
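A minimal sketch of how such lag windows could be assembled, assuming the a:b notation means "from offset a down to offset b relative to the forecast time T" (so 0:-54 collects the past 54 hours of energy and 18:-18 collects wind forecasts from 18 hours ahead back to 18 hours behind); the function and variable names are illustrative.

```python
import numpy as np

def lag_window(series: np.ndarray, t: int, start: int, end: int) -> np.ndarray:
    # Collect values from offset `start` down to offset `end` (inclusive) around time t,
    # e.g. start=0, end=-54 gathers the past 54 hours of energy production.
    return np.array([series[t + offset] for offset in range(start, end - 1, -1)])

# energy_window = lag_window(energy, t, 0, -54)   # triple the 18-hour forecast window
# speed_window  = lag_window(speed, t, 18, -18)   # double the forecast window
```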

5.2 Features that did not make the cut

Other features that we had experimented with include:

  • Cubic values of wind speed: We attempted to infuse domain knowledge using the Wind Power Generation Equation, in which wind speed appears as a cubic term. However, upon testing, we found that using cubic wind speeds performed worse in terms of profit and test loss than using the original wind speed terms.

P = ½ · ρ · A · v³ · Cp, where A = πr²

Wind Power Generation Equation (P: power, ρ: air density, A: swept area for rotor blade length r, v: wind speed, Cp: efficiency factor). Illustration by author.
  • Capturing the interaction between wind speed and direction: We thought that it was intuitive that wind energy is highly dependent on wind speed and direction. Based on our layperson intuition, if we’ve got wind blowing in the right direction but at a low speed (or high speed but wrong direction), that should lead to low levels of energy produced. The converse seems like a sound argument too. Hence, we tried to capture this interaction by multiplying wind speeds and direction. However, we found that these features did not help improve the test loss or profits of the model.
  • Difference, momentum and force: We experimented with difference, momentum and force terms of 2, 6, 9 and 18, and found little to no improvements made to the model. In some cases, the model even performed worse.
  • A window of raw wind speed and direction: We tested using windows of raw wind speed and direction as inputs. A window of data can capture important information like trends and patterns. However, a caveat is that windows with similar speeds and directions can be mapped to vectors that are vastly different from each other (i.e. poor clustering). Our tests showed that models using a collection of features derived from a moving window (mean, standard deviation, maximum, minimum, first-order differential and average second-order differential) consistently outperformed models using a moving window of raw speeds and directions (see the sketch after this list).
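As an illustration of the kind of window-derived features just described, here is a hedged sketch; the exact definitions of the differential features are our assumptions.

```python
import numpy as np

def window_features(window: np.ndarray) -> np.ndarray:
    # Summarise a moving window instead of feeding its raw values to the network.
    first_diff = np.diff(window)             # first-order differential
    second_diff = np.diff(window, n=2)       # second-order differential
    return np.array([
        window.mean(),
        window.std(),
        window.max(),
        window.min(),
        first_diff[-1],                      # latest first-order differential (assumed)
        second_diff.mean(),                  # average second-order differential
    ])
```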

6. The Models

With the preprocessed raw data and an array of engineered features at our disposal, we are now fully equipped to begin experimenting with the architecture of our neural networks, tuning their hyper-parameters and evaluating their performance in order to decide on the most appropriate model to deploy.

Amongst the countless models that we ran, we document one key model from each of the three phases of our experiments, capturing our intentions and the model's results.

6.1 Model V0 — The baseline benchmark


Model V0’s architecture. Illustration by author.

We started with the most basic neural network and the most basic features, as shown in the diagram. Keeping a ⅔ reduction in the number of nodes between layers, we experimented with different configurations of the number of nodes in Layer 1, the loss function and the back-propagation algorithm, as listed below (the selected option is shown in brackets; a minimal sketch of this baseline follows the list). The configuration of [32] nodes in Layer 1, the [MAE] loss function and [Adam] for back-propagation yielded a minimum test loss of 0.7306 and a profits ratio of 0.4218.

This model sets the baseline benchmark that we aim to surpass in subsequent model iterations.

  • Network: 4 layers with ⅔ reduction, starting with 16 / [ 32 ] / 64 nodes in Layer 1.
  • Input features: Value of wind energy at T0, averaged wind speed and wind direction (both cosine and sine) per model at T18 (7 columns)
  • Loss function: RMSE / [MAE] / ASYM (our custom asymmetric loss, defined in Section 7.3)
  • Back-propagation algorithm: [Adam] / SGD
  • Min. test loss: 0.7306
  • Profits / Max. possible profits: 0.4218
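Below is a minimal PyTorch sketch of what this baseline could look like; the activation function, the exact layer widths implied by the ⅔ reduction, and the single output node are our assumptions (the datathon's own tooling may differ).

```python
import torch
import torch.nn as nn

class BaselineNet(nn.Module):
    """Model V0 sketch: 4 fully connected layers with a 2/3 width reduction."""
    def __init__(self, n_inputs=7, first_layer=32):
        super().__init__()
        widths = [first_layer]
        for _ in range(3):                                  # three further layers at 2/3 width
            widths.append(max(1, round(widths[-1] * 2 / 3)))
        layers, prev = [], n_inputs
        for width in widths:
            layers += [nn.Linear(prev, width), nn.ReLU()]   # ReLU is an assumption
            prev = width
        layers.append(nn.Linear(prev, 1))                   # single T+18 forecast (normalised)
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = BaselineNet()
criterion = nn.L1Loss()                                     # MAE, the selected loss
optimizer = torch.optim.Adam(model.parameters())            # Adam, the selected optimiser
```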


6.2 Model V1 — Experimenting with our engineered features

u2mMJ3q.jpg!web

Model V1’s architecture. Illustration by author.

Once we’ve settled on the rough architecture of the neural network, we proceeded to experiment with our engineered features mentioned in Section 4.1. Using all of our engineered features, the configuration of [128] nodes in Layer 1, [MAE] loss function and [Adam] for back-propagation yielded (bolded below) the minimum test loss of 0.4829 and profits of 0.5804. This configuration was the greatest improvement from Model V0’s benchmark.

  • Network: 4 layers with ⅔ reduction, starting with 32 / 64 / [128] nodes in Layer 1.
  • Input features: All engineered features mentioned in Section 5.1 (313 columns)
  • Loss function: RMSE / [MAE] / ASYM
  • Back-propagation algorithm: [Adam] / SGD
  • Min. test loss: 0.4829
  • Profits / Max. possible profits: 0.5804


6.3 Model V2 — Experimenting with sophisticated network features


Model V2’s architecture. Illustration by author.

We learnt from the experiments with Model V1 that our engineered features consistently gave lower test losses and higher profits than the benchmark. In this final round of experimentation, we therefore kept all of our engineered features and experimented with more sophisticated network features such as input scaling, dropout and L2 regularisation. This time, the configuration of input scaling [without] clamping, [128] nodes in Layer 1, a [0.1] dropout probability, an L2 regularisation weight decay of [0.001], the [MAE] loss function and [Adam] for back-propagation (in brackets below) yielded the lowest test loss of 0.4794. A sketch of how these regularisation features could be added appears after the list below.

  • Network: 4 layers with ⅔ reduction, starting with 32 / 64 / [128] nodes in Layer 1.
  • Input scaling: With / [Without] clamping
  • Dropout probability: [0.1] / 0.2 / 0.25
  • L2 regularisation weight decay: [0.001] / 0.0001 / 0.00001 / 0.000001
  • Input features: All engineered features mentioned in Section 5.1 (313 columns)
  • Loss function: RMSE / [MAE] / ASYM
  • Back-propagation algorithm: [Adam] / SGD
  • Min. test loss: 0.4794
  • Profits / Max. possible profits: 0.5912
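Building on the Model V0 sketch above, here is a hedged illustration of how the dropout and L2 regularisation could be wired in with PyTorch; the placement of the dropout layers and the use of Adam's weight_decay argument for L2 regularisation are assumptions.

```python
import torch
import torch.nn as nn

class RegularisedNet(nn.Module):
    """Model V2 sketch: Model V0's layout plus dropout after each hidden layer."""
    def __init__(self, n_inputs=313, first_layer=128, p_drop=0.1):
        super().__init__()
        widths = [first_layer]
        for _ in range(3):
            widths.append(max(1, round(widths[-1] * 2 / 3)))
        layers, prev = [], n_inputs
        for width in widths:
            layers += [nn.Linear(prev, width), nn.ReLU(), nn.Dropout(p_drop)]
            prev = width
        layers.append(nn.Linear(prev, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = RegularisedNet()
criterion = nn.L1Loss()                                               # MAE
optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.001)  # L2 regularisation
```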


6.4 Other experiments / models:

Throughout the three main types of experiments listed above, we also tried the following modifications, none of which improved our models on our two metrics:

  • Adding a 5th Layer: Due to our relatively large training dataset, we wanted to see if our model would improve with extra complexity, and hence tested it with a 5th layer. However, this increased our test loss and resulted in more overfitting instead.
  • Auto-encoder with bottleneck sizes of 1.5, 2, 4 and 8: As our neural networks deal with high-dimensional inputs (>300 columns), we wanted to reduce the dimensions using an autoencoder. However, this instead made our test loss and profits worse.
  • Squared Perceptrons: Squared perceptrons used as a drop-in replacement for ordinary perceptrons may help models learn better, but in our case, models without squared perceptrons performed better.
  • Momentum and Force Losses: Momentum and force losses are usually meant to help reduce lag, but in our case the lag correlations already peak at 0. We experimented with these losses anyway and, as expected, models with the momentum and force losses ended up having worse test losses, because taking differences often results in greater noise.

7. Limitations and Future Improvements

7.1 High test loss, gap between training and test loss remains large

This shows that the model is not generalising well enough on unseen data and could be a warning sign that the network is memorising the training data. Even with feature engineering, fine-tuning of the windows of existing features, and the dropout and weight regularisation added in Model V2, the gap between training and test loss improved only slightly. Two possible reasons for the relatively high test loss and large gap are: (1) the data itself is very noisy, and (2) the data provided is not truly predictive of wind energy production.

A potential way to tackle the problem of non-predictive data is to source more predictive data. In the case of predicting wind energy production, we could include external data on air density, the efficiency factor of the wind farms and the length of the rotor blades (all of which are terms in the wind power generation equation). This may increase the amount of predictive information available and thus improve the neural networks' ability to generalise.

7.2 Consistent under-predictions

From the ‘Actual vs Test Predictions’ graphs, we see that in our models, a larger proportion of the plotted points tend to fall below the 45 degree line. This implies that our model consistently under-predicts wind energy production. This is also visible in the ‘Test Predictions’ graphs where we see more blues in the upper region, indicating the upper values are often under-predicted.

One possible reason for the under-predictions could be the datasets we used to train the models. In the original dataset, we have wind speed and wind direction forecasts for all 8 locations spanning 01 Jan 2017 to the present. While each location represents a major wind farm in the Ile-de-France region, Boissy-la-Rivière was only in operation from August 2017, while Angerville 1 and Angerville 2 (Les Pointes) were only in operation from 02 Jul 2019. This is likely to have affected the total energy production capacity of the region for a given wind forecast (i.e. the same average wind speed would likely lead to higher wind energy production in the later part of 2019 than in the earlier part of 2017). This may introduce noise into the data: variations in wind energy production that are not due to variation in our model inputs.

Therefore, we tested our V2 model on 3 datasets:

  1. The original dataset spanning 01 Jan 2017 to the present
  2. A weighted dataset that multiplies the energy actuals by 50% before 01 Aug 2017, by 80% between 01 Aug 2017 and 02 Jul 2019, and by 100% after 02 Jul 2019 (a sketch of this weighting appears after this list). The weights for the respective periods were estimated using this formula: % Wind Energy Production Capacity in Operation = Total Estimated Nominal Power Output of Farms Currently in Operation ÷ Total Estimated Nominal Power Output of the 8 Major Wind Farms.
  3. A truncated dataset that consists of only data after 02 July 2019 based on the assumption that all 8 farms are in operation and wind energy production capacity does not change significantly from that point.
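A minimal sketch of how the weighted and truncated datasets could be built with pandas, assuming `energy` is a Series of hourly energy actuals indexed by timestamp (the variable name is hypothetical):

```python
import pandas as pd

def weight_energy_actuals(energy: pd.Series) -> pd.Series:
    # Down-weight earlier periods when fewer of the 8 farms were in operation.
    weights = pd.Series(1.0, index=energy.index)
    weights[energy.index < "2017-08-01"] = 0.5                 # before Boissy-la-Rivière
    weights[(energy.index >= "2017-08-01") &
            (energy.index < "2019-07-02")] = 0.8               # before Angerville 1 & 2
    return energy * weights

# truncated = energy[energy.index >= "2019-07-02"]             # option 3: truncated dataset
```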

Here is a comparison of the results:


Comparison of results between the original, weighted and truncated datasets → the original dataset delivers the highest profit ratio.

Though the model trained on the truncated dataset achieved the lowest test loss of 0.38, the model trained on the original dataset achieved the highest ratio of actualised profits to total possible profits. The model trained on the weighted dataset performed the worst on both metrics and even resulted in substantial dollar losses. In the end, we decided to stick with the original dataset because our goal was to maximise profits rather than to minimise test loss.

Using the truncated dataset may be a viable way to improve both our test loss and profits in future when there is a larger amount of data available after 02 July 2019, assuming wind energy production capacity in the Ile-De-France region does not change.

7.3 MAE test loss may not be aligned with goal of profit-maximisation

The MAE test loss we employed penalises each unit of under-prediction and over-prediction to the same extent. However, from the setup of the challenge, we know that 1 kWh of over-prediction is twice as costly as 1 kWh of under-prediction. Hence, the MAE test loss is not a true reflection of the relative cost of our prediction errors and may not be the best metric for optimising our models.

Hence we defined our own loss function:


Our customised loss function, the ASYM loss, penalises over-predictions more heavily than under-predictions.

In this way, we are able to more accurately reflect our challenge setup and penalise over-predictions twice as heavily as under-predictions.
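A hedged sketch of what such an asymmetric loss could look like in PyTorch, assuming an MAE-style form in which each unit of over-prediction is weighted twice as heavily as each unit of under-prediction; our exact formulation here is an assumption and may differ from the one we used in the datathon tooling.

```python
import torch

def asym_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    error = pred - target
    over = torch.clamp(error, min=0)     # over-prediction: forecast above actual
    under = torch.clamp(-error, min=0)   # under-prediction: forecast below actual
    return (2.0 * over + under).mean()   # over-predictions cost twice as much
```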

We tested our model's performance on the original dataset using the two different loss functions:


Actual vs Test Predictions graph for MAE (left) and ASYM (right) losses.


Comparing profit ratio between MAE and ASYM loss functions → MAE outperforms ASYM.

We see that in the ‘Actual vs Test Predictions’ graphs, using the ASYM loss function reduced the number of over-predictions and increased the number of under-predictions compared to using the MAE loss function. This is as expected now that we penalise over-predictions more than under-predictions.

However, when we compare actualised profits to total possible profits, we find that using the MAE test loss helps our model make more profits than the ASYM test loss. Thus, in the end, we chose to use the MAE test loss.

One reason why using ASYM did not improve our profits, even though it was more reflective of the challenge setup, could be the already prevalent under-prediction problem in our model (see Limitation 7.2). Our model already showed a tendency to under-predict rather than over-predict, so using ASYM made the under-prediction problem even more severe while only modestly reducing over-predictions, causing us to forgo substantial potential profits. Thus, the ASYM loss function may become a viable option only once Limitation 7.2 is resolved and we obtain better data.

