source link: https://itnext.io/autoformer-decomposition-transformers-with-auto-correlation-for-long-term-series-forecasting-8f5a8b115430

Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting

It is undeniable that when it comes to time-series forecasting, we need to capture long dependencies for better decision-making, regardless of the industry. Though Transformers are revolutionary in the deep learning era, they have difficulty capturing such long dependencies. As I discussed in the previous article, “Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting”, forecasting sequence lengths of up to 480 already requires algorithms that go beyond the plain Transformer. This article follows the same path as the previous one but targets even longer sequence lengths, which are highly demanded in industry. It introduces Autoformer (Decomposition Transformers with Auto-Correlation), which captures longer dependencies with outstanding performance.

Honestly, the first time I saw this research, it blew my mind, because I am interested in and working on time series, and this is a significant breakthrough. I hope it has the same effect on you. Enjoy it.😉 I tried to keep it light and avoid too much focus on mathematics so it doesn’t get boring.


Before reading on, if you find this article interesting or the topics I write about practical in your work, please do not hesitate to follow me on Medium to reach more of my articles.

Overview

To make Transformers efficient on long dependencies, we have to modify them (for example, with sparse versions of the point-wise self-attention mechanism), which leads to an information utilization bottleneck. To go beyond the plain Transformer, the researchers designed Autoformer as a novel decomposition architecture with an Auto-Correlation mechanism. This design gives Autoformer progressive decomposition capacities. They built the mechanism on series periodicity, inspired by stochastic process theory (you can find plenty of material on stochastic processes).

The results show excellent accuracy, with a 38% relative improvement on six benchmarks covering five applications: energy, traffic, economics, weather, and disease.

Moreover, Autoformer is a Transformer-based model that keeps the residual and encoder-decoder architecture but renovates the Transformer into a decomposition forecasting architecture.

Interesting, what a huge piece of work.🤯 Do you agree?! Well, I am going to make it understandable.🙂 So, don’t worry.🤞😉


Before diving into Autoformer itself, let me give you a few short explanations of the fundamentals:

Decomposition-Based Approach

It is a typical, simple but robust method for modeling and forecasting time series. The main idea is to model the data as a blend of trend, seasonal, and remainder components instead of just trying to capture temporal dependencies and auto-correlation in the data directly, as traditional models such as ARIMA do.

Time-series decomposition is often used as an analysis step before forecasting. We can also use it for forecasting itself, provided we know the prospective structure of the data. The method involves splitting the series into three or four components, each representing one of the more predictable categories.

Autoformer, in contrast, embeds decomposition as an inner block of the deep model, which can progressively decompose the hidden series throughout the whole forecasting process (covering both the past series and the predicted intermediate results).
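To make the idea concrete, here is a minimal sketch of such an inner decomposition block: the trend is estimated with a moving average (average pooling with edge padding), and the seasonal part is the remainder. This follows the moving-average idea described above rather than the authors’ exact implementation, and the kernel size of 25 is just an assumed example value.

```python
import torch
import torch.nn as nn


class SeriesDecomp(nn.Module):
    """Moving-average decomposition: trend = smoothed series, seasonal = remainder."""

    def __init__(self, kernel_size: int = 25):
        super().__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=1)

    def forward(self, x: torch.Tensor):
        # x: (batch, length, channels)
        # Repeat the edge values so the moving average keeps the original length.
        front = x[:, :1, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        end = x[:, -1:, :].repeat(1, self.kernel_size // 2, 1)
        padded = torch.cat([front, x, end], dim=1)
        trend = self.avg(padded.permute(0, 2, 1)).permute(0, 2, 1)
        seasonal = x - trend  # what is left after removing the smooth trend
        return seasonal, trend


# Usage: decompose a random batch of series.
x = torch.randn(8, 96, 7)                # (batch, length, channels)
seasonal, trend = SeriesDecomp(25)(x)
print(seasonal.shape, trend.shape)       # both torch.Size([8, 96, 7])
```

Because this block is cheap, it can be applied again and again inside the network, so trend and seasonal information get separated at every stage rather than only once as a preprocessing step.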

Auto-Correlation Mechanism

In this mechanism, we measure the relationship between the current value of a variable and its past values. Autoformer uses the Auto-Correlation mechanism instead of the attention we are used to. The architecture of Auto-Correlation can be seen in Figure 1.

Figure 1. Auto-Correlation (left) and Time Delay Aggregation (right). We utilize the Fast Fourier Transform to calculate the autocorrelation R(τ), which reflects the time-delay similarities. Then the similar sub-processes are rolled to the same index based on the selected delay τ and aggregated by R̂(τ). [source]

Auto-Correlation discovers the period-based dependencies by (1) computing the series autocorrelation and (2) aggregating similar sub-series by time delay aggregation.

Period-based dependencies:

Eq 1. R_XX(τ) = lim_{L→∞} (1/L) Σ_{t=1}^{L} X_t · X_{t-τ}, which measures the similarity between the series X and its τ-lagged version.
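Since the paper computes R(τ) with the Fast Fourier Transform (via the Wiener–Khinchin relation), here is a hedged sketch of what that estimate can look like in PyTorch; the toy sine series and its period of 24 are just illustrative assumptions.

```python
import torch


def autocorrelation(x: torch.Tensor) -> torch.Tensor:
    """Estimate R(tau) for every lag tau of a real-valued series via FFT.

    Uses the Wiener-Khinchin relation: the inverse FFT of the power spectrum
    gives the (circular) autocorrelation for all lags in O(L log L).
    """
    length = x.size(-1)
    spectrum = torch.fft.rfft(x, dim=-1)            # frequency domain
    power = spectrum * torch.conj(spectrum)         # |F(x)|^2
    r = torch.fft.irfft(power, n=length, dim=-1)    # back to the lag domain
    return r / length                               # average over time steps


# A sine wave with period 24 should have its strongest similarities at lags 24, 48, ...
t = torch.arange(96, dtype=torch.float32)
series = torch.sin(2 * torch.pi * t / 24)
r = autocorrelation(series)
print(torch.topk(r[1:], k=3).indices + 1)           # the most similar time delays
```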

Time delay aggregation:

The position of this block can be seen in Figure 1. This operation differs from the point-wise dot-product aggregation in self-attention (for a better understanding of self-attention, see “Attention Is All You Need”).

For a single head and a time series X of length L, the projections give the query Q, key K, and value V, and the mechanism below is used in place of self-attention. The formulas are shown in Eq 2–4.

Eq 2. τ_1, …, τ_k = argTopk_{τ ∈ {1, …, L}} R_{Q,K}(τ), with k = ⌊c × log L⌋ for a hyper-parameter c

Eq 3. R̂_{Q,K}(τ_1), …, R̂_{Q,K}(τ_k) = SoftMax(R_{Q,K}(τ_1), …, R_{Q,K}(τ_k))

Eq 4. Auto-Correlation(Q, K, V) = Σ_{i=1}^{k} Roll(V, τ_i) · R̂_{Q,K}(τ_i)
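The sketch below puts Eq 2–4 together for a single head: the Q–K correlation for every delay is computed via FFT, the top-k delays are selected and softmax-normalized, and the rolled value series are summed with those weights. It is a simplified, unbatched illustration under my own assumptions (for example c = 2.0 and no Q/K/V projection layers), not the authors’ implementation.

```python
import math
import torch


def time_delay_aggregation(q, k, v, c: float = 2.0):
    """Single-head sketch of Eq 2-4: pick the top-k delays, weight them, roll and sum.

    q, k, v: (batch, length, d_model). `c` controls how many delays are kept,
    k = floor(c * log L); the value 2.0 is an assumed hyper-parameter.
    """
    batch, length, _ = q.shape

    # Q-K correlation for every possible delay, computed via FFT as in Eq 1.
    q_fft = torch.fft.rfft(q, dim=1)
    k_fft = torch.fft.rfft(k, dim=1)
    corr = torch.fft.irfft(q_fft * torch.conj(k_fft), n=length, dim=1)  # (B, L, d)
    corr = corr.mean(dim=-1)                                            # (B, L)

    top_k = int(c * math.log(length))
    weights, delays = torch.topk(corr, top_k, dim=-1)   # Eq 2: argTopk over delays
    weights = torch.softmax(weights, dim=-1)            # Eq 3: normalize to R-hat

    # Eq 4: roll the value series by each selected delay and aggregate.
    out = torch.zeros_like(v)
    for b in range(batch):
        for i in range(top_k):
            rolled = torch.roll(v[b], shifts=-int(delays[b, i]), dims=0)
            out[b] += weights[b, i] * rolled
    return out


# Usage with random single-head projections:
q = k = v = torch.randn(2, 96, 16)
print(time_delay_aggregation(q, k, v).shape)   # torch.Size([2, 96, 16])
```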

Please note that I did not give any illustration of the attention mechanism, so if you are not familiar with it, I highly recommend reading “Attention Is All You Need”.

Let’s have a general look at Autoformer architecture:

Autoformer Overview


There are two challenges for forecasting long dependencies in time series:

  1. Managing complex time-based patterns
  2. Breaking the bottleneck of computation efficiency and information usage
Figure 2. Autoformer architecture. The encoder eliminates the long-term trend-cyclical part by series decomposition blocks (blue blocks) and focuses on seasonal pattern modeling. The past seasonal information from the encoder is utilized by the encoder-decoder Auto-Correlation (the center green block in the decoder). [source]

With a careful inspection of the overall architecture, we can see that the model splits into two sections, i.e., it has an encoder-decoder structure. Let’s dive into the details of each:

Encoder:

Figure 3. Autoformer Encoder Structure [source]

We can note a few points here: the encoder focuses on modeling the seasonal part, which means the output of the encoder contains the past seasonal information. We then use the encoder’s output (as cross information) to help the decoder predict better. The N x at the upper right of Figure 3 indicates the number of encoder layers. Of course, the mathematics (I don’t want to discuss it in depth, so I just reproduce it as in the paper):

Eq 5. X_en^l = Encoder(X_en^{l-1})

Eq 6. S_en^{l,1}, _ = SeriesDecomp(Auto-Correlation(X_en^{l-1}) + X_en^{l-1})

Eq 7. S_en^{l,2}, _ = SeriesDecomp(FeedForward(S_en^{l,1}) + S_en^{l,1})

Here “_” means the trend part returned by the decomposition is discarded.
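As a rough sketch, an encoder layer under these equations could look like the following, reusing the SeriesDecomp and time_delay_aggregation sketches from earlier as stand-ins for the real blocks; the feed-forward width (4 × d_model) and the omission of Q/K/V projections are my own simplifying assumptions.

```python
import torch
import torch.nn as nn


class AutoformerEncoderLayerSketch(nn.Module):
    """Rough sketch of an encoder layer (Eq 5-7): each sub-block is followed by a
    series decomposition that keeps only the seasonal part and discards the trend."""

    def __init__(self, d_model: int, kernel_size: int = 25):
        super().__init__()
        self.decomp1 = SeriesDecomp(kernel_size)   # from the decomposition sketch above
        self.decomp2 = SeriesDecomp(kernel_size)
        self.ff = nn.Sequential(                   # assumed feed-forward width
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Eq 6: Auto-Correlation block (sketched by time_delay_aggregation) + residual.
        s1, _ = self.decomp1(time_delay_aggregation(x, x, x) + x)
        # Eq 7: feed-forward block + residual, again keeping only the seasonal part.
        s2, _ = self.decomp2(self.ff(s1) + s1)
        return s2
```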

Decoder

Figure 4. Autoformer Decoder Structure [source]

Figure 4 shows the architecture of the decoder, which itself includes two sections:

  1. The accumulation structure (for trend-cyclical components)
  2. The stacked Auto-Correlation mechanism (for seasonal components)

We can see that the decoder includes one self Auto-Correlation (which refines the prediction) and an encoder-decoder Auto-Correlation (which uses the past seasonal information). The M x at the lower right of Figure 4 indicates the number of decoder layers. And the formulas:

Eq 8. S_de^{l,1}, T_de^{l,1} = SeriesDecomp(Auto-Correlation(X_de^{l-1}) + X_de^{l-1})

Eq 9. S_de^{l,2}, T_de^{l,2} = SeriesDecomp(Auto-Correlation(S_de^{l,1}, X_en^N) + S_de^{l,1})

Eq 10. S_de^{l,3}, T_de^{l,3} = SeriesDecomp(FeedForward(S_de^{l,2}) + S_de^{l,2})

Eq 11. T_de^l = T_de^{l-1} + W_{l,1} · T_de^{l,1} + W_{l,2} · T_de^{l,2} + W_{l,3} · T_de^{l,3}
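Putting Eq 8–11 into code, a decoder layer might look roughly like this. Again, it reuses the SeriesDecomp and time_delay_aggregation sketches above, the linear projections stand in for the W_{l,i} weights, and the whole thing is a simplified illustration (it also assumes the decoder input and encoder output share the same length), not the reference implementation.

```python
import torch
import torch.nn as nn


class AutoformerDecoderLayerSketch(nn.Module):
    """Rough sketch of a decoder layer (Eq 8-11): the seasonal stream is refined by
    two Auto-Correlation blocks and a feed-forward block, while the trend pieces
    extracted at each decomposition are projected and accumulated."""

    def __init__(self, d_model: int, c_out: int, kernel_size: int = 25):
        super().__init__()
        self.decomp1 = SeriesDecomp(kernel_size)   # sketches from earlier sections
        self.decomp2 = SeriesDecomp(kernel_size)
        self.decomp3 = SeriesDecomp(kernel_size)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Stand-ins for the W_{l,i} projections of Eq 11 (assumed to be linear layers).
        self.proj1 = nn.Linear(d_model, c_out)
        self.proj2 = nn.Linear(d_model, c_out)
        self.proj3 = nn.Linear(d_model, c_out)

    def forward(self, x, enc_out, trend):
        # Eq 8: self Auto-Correlation refines the seasonal prediction.
        s1, t1 = self.decomp1(time_delay_aggregation(x, x, x) + x)
        # Eq 9: encoder-decoder Auto-Correlation uses the past seasonal information
        # (this sketch assumes x and enc_out share the same length).
        s2, t2 = self.decomp2(time_delay_aggregation(s1, enc_out, enc_out) + s1)
        # Eq 10: feed-forward block.
        s3, t3 = self.decomp3(self.ff(s2) + s2)
        # Eq 11: accumulate the trend-cyclical components.
        trend = trend + self.proj1(t1) + self.proj2(t2) + self.proj3(t3)
        return s3, trend
```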

Auto-Correlation vs. Self-Attention Family

As I mentioned, Autoformer uses Auto-Correlation instead of self-attention, even though self-attention has become the standard in state-of-the-art algorithms. Not in this research, though. The researchers therefore ran a comparison between Auto-Correlation and various self-attention constructions. (I am not going to illustrate the self-attention constructions themselves, just their comparison with Auto-Correlation.)

Figure 5. Auto-Correlation vs. self-attention family. Full Attention (a) adopts the full connection among all time points. Sparse Attention (b) selects points based on the proposed similarity metrics. LogSparse Attention (c) chooses points following exponentially increasing intervals. Auto-Correlation (d) focuses on the connections of sub-series among underlying periods. [source]

Although some self-attention variants (LogSparse and Sparse Attention) consider local information, they only use it to help discover point-wise dependencies. In terms of information aggregation, the researchers adopt the time delay block to gather similar sub-series from underlying periods, whereas self-attention performs this operation by dot-product.

Figure 6. Visualization of learned dependencies. For clarity, we select the top-6 time delays τ1, …, τ6 of Auto-Correlation and mark them in the raw series (red lines). For the self-attentions, the top-5 most similar points with respect to the last time step (red stars) are marked by orange points. [source]

Also, to analyze the efficiency of using self-attention versus Auto-Correlation, we can look at Figure 7 below:

Figure 7. Efficiency Analysis. For memory, we replace Auto-Correlation with the self-attention family in Autoformer and record the memory with input 96. For running time, we run Auto-Correlation or self-attention 1000 times to get the execution time per step. The output length increases exponentially. [source]

Model Evaluation

The researchers evaluated the model against other state-of-the-art models on both univariate and multivariate time series data. Six real-world datasets were used in the evaluation, including:

  1. ETT (Electricity Transformer Temperature)
  2. Electricity
  3. Exchange
  4. Traffic
  5. Weather
  6. ILI

Please note that I didn’t provide any information about the datasets. If you want to explore them, you just need to google them.😉

Two metrics (MSE and MAE) are used for the evaluation. The results of this comparison are summarized in two tables (Tables 1 & 2).

Table 1. Multivariate results with different prediction lengths O = {96, 192, 336, 720}. We set the input length I as 36 for ILI and 96 for the others. A lower MSE or MAE indicates a better prediction. [source]

Table 2. Univariate results with different prediction lengths O = {96, 192, 336, 720} on typical datasets. We set the input length I as 96. A lower MSE and MAE indicate a better prediction. [source]

This article is also written as a reference for my future work on time series, so I tried to cover its aspects thoroughly; if you find any error or gap, please let me know so I can fix it immediately. Finally, if you have any questions, feel free to ask; I will respond as soon as possible. Meanwhile, you can contact me directly on Twitter here or LinkedIn here for any reason.

