# Environmental (Spatiotemporal) Data Analysis with Gaussian Processes

source link:https://sandipanweb.wordpress.com/2021/12/19/environmental-spatiotemporal-data-analysis-with-gaussian-processes/

Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

# Environmental (Spatiotemporal) Data Analysis with Gaussian Processes

In this blog, we shall first discuss about how to use simulation techniques and then Gaussian Processes to analyze spatiotemporal data and make predictions. This problem appeared as homework / project in the edX course **MITx 6.419x Data Analysis: Statistical Modeling and Computation in Applications, **everything about the problem description, statement is taken from the course itself.

- The Philippine Archipelago is a fascinating multiscale ocean region. Its geometry is complex, with multiple straits, islands, steep shelf-breaks, and coastal features, leading to partially interconnected seas and basins. In this part, we will be studying, understanding, and navigating through the ocean current flows.

- The data set may be found in OceanFlow. It consists of the ocean flow vectors for time from 1 to 100. The flow in the data set is an averaged flow from the surface to either near the bottom or 400m of depth, whichever is shallower. It is thus a 2D vector field.
- There are two sets of csv data files: the first set contains 100 files (corresponding to the information flow at that time index) for the horizontal components of the vectors and the second set contains another 100 files with the vertical components at different (x,y) coordinates. Here
**zero-indexed**

arrays are used for python. The following code shows how the horizontal component of the velocity vectors corresponding to time index 1 looks like:

`import`

`pandas as pd`

`df `

`=`

`pd.read_csv(`

`'Module5\OceanFlow/1u.csv'`

`, header`

`=`

`None`

`)`

`df.head()`

- There is yet another mask file, which one, if needed, contains a 0-1 matrix identifying land and water. The following code shows how a mask file looks like.

`dfm `

`=`

`pd.read_csv(`

`'Module5\OceanFlow/mask.csv'`

`, header`

`=`

`None`

`)`

`dfm.head()`

**Additional info and units:**The data were collected in January 2009. Flows are gien in kilometers per hour (km/h) units. The time interval between the data snapshots is 3hrs. The first time index ( for zero-indexed, for one-indexed) will correspond in these problems to the time coordinate of hrs. Thus, for example,`1u.csv`

gives data at a time coordinate of hours.

- The grid spacing used is 3 km. The matrix index will correspond in these problems to the coordinate (km,km), or the
**bottom, left**of the plot. For simplicity, we will not be using longitudes and latitudes in this problem.

- The columns of the
`.csv`

files correspond to the**horizontal**direction (-axis), the rows of the`.csv`

files correspond to the**vertical**direction (-axis).

- The data has been provided by the MSEAS research group at MIT (http://mseas.mit.edu/). The flow field is from a data-assimilative multiresolution simulation obtained using their MSEAS primitive-equation ocean modeling system. It simulates tidal flows to larger-scale dynamics in the region, assimilating a varied set of gappy observations.

## Preprocessing

We need to start by preprocessing the spatiotemporal dataset. We can create a 3D numpy array for each of horizontal and vertical flows, where the dimensions represent x, y coordinates and time, respectively. The following code snippet shows, how the preprocessing is done for the vertical flow velocity vectors.

`import`

`numpy as np`

`v `

`=`

`np.zeros((`

`100`

`, `

`504`

`, `

`555`

`))`

`for`

`i `

`in`

`range`

`(`

`1`

`, `

`101`

`):`

`df `

`=`

`pd.read_csv(`

`'Module5\OceanFlow/{}v.csv'`

`.`

`format`

`(i), header`

`=`

`None`

`)`

`v[i`

`-`

`1`

`] `

`=`

`df.values `

`*`

`mask `

`#df.iloc[::-1].values * mask`

## Problem 1: Identifying long-range correlations

In this problem, we will try to identify areas in the Philippine Archipelago with long-range correlations. Your task is to identify two places on the map that are not immediately next to each other but still have some high correlation in their flows. Your response should be the map of the Archipelago with the two areas marked (e.g., circled). You claim that those two areas have correlated flows. Explain how did you found those two areas have correlated flows. Below you will find some hints.

A point is a particular location in the provided data set. Moreover, we have 100 measurements (at 100 different times) of the flow speeds in the -axis and -axis for each point. To compute the correlation between two points: select two points and compute the correlation coefficient along the -direction, or the -direction, or both. That is a correlation between two vectors of dimension 100.

- The provided data set is quite large, with a map of 555×504 points. Thus, computing the correlation between every possible pair of points might be computationally expensive. We shall need to compute (555×504)2 correlations! That is very large. Instead, we can randomly sample pairs of points and compute their correlations.
- Since this is a relatively small area, there might be many correlations between the points, so set the threshold to define “high correlations” sufficiently large. If it is too small, we shall find that most of the points are correlated. If it is too high, no pairs of points will be correlated.
- An area might be a single point correlated with other points. However, maybe there are clusters of points that are correlated with other clusters of points.
- Remember that correlation can be positive or negative. We may want to find positively correlated areas or negatively correlated areas.
- Let’s find a correlation for each direction separately. We can combine these two values using some aggregate measure. Maybe set the correlation between the two points as the maximum directional correlation. Average between the two directional correlations also work. Of course, the minimum as well.

## Solution

The next figure shows 2 locations that are far apart but still have high positive correlation > 0.7 (threshold). In the next figure the blue lines correspond to low correlation, where the red ones correspond to high correlations (> threshold).

The following code snippet shows how the long correlation was found.

`# np.random.seed(7)`

`n `

`=`

`100`

`cor_thresh `

`=`

`0.7`

`dist_thresh `

`=`

`100`

`i1s, j1s `

`=`

`np.random.choice(mask.shape[`

`1`

`], n, replace`

`=`

`True`

`), np.random.choice(mask.shape[`

`0`

`], n, replace`

`=`

`True`

`)`

`i2s, j2s `

`=`

`np.random.choice(mask.shape[`

`1`

`], n, replace`

`=`

`True`

`), np.random.choice(mask.shape[`

`0`

`], n, replace`

`=`

`True`

`)`

`for`

`k `

`in`

`range`

`(n):`

`cor_u `

`=`

`np.corrcoef(u[:,j1s[k],i1s[k]], u[:,j2s[k],i2s[k]])[`

`0`

`,`

`1`

`]`

`cor_v `

`=`

`np.corrcoef(v[:,j1s[k],i1s[k]], v[:,j2s[k],i2s[k]])[`

`0`

`,`

`1`

`]`

`cor `

`=`

`min`

`(cor_u, cor_v)`

`cor_high `

`=`

`cor > cor_thresh`

`far_enough `

`=`

`np.sqrt((i1s[k]`

`-`

`i2s[k])`

`*`

`*`

`2`

`+`

`(j1s[k]`

`-`

`j2s[k])`

`*`

`*`

`2`

`) > dist_thresh`

`if`

`cor_high `

`and`

`far_enough:`

`print`

`(`

`'point1: {}, point2: {}, correlation = {}'`

`.`

`format`

`((i1s[k],j1s[k]), (i2s[k],j2s[k]), cor)) `

**How the correlation was computed: **

- First 100 random pairs were chosen from all possible pair of 555×504 points.
- Correlations between each of the pair of points, for both flow u and v were computed (between 100 dim vectors across the time axis).
- The minimum of correlations of u and v flows were chosen to compute the final correlation.

**Why the two marked locations could be correlated:**

As can be seen from the below figure, comparing the u-flows and v-flows of the two locations, they have similar trends over time and that’s why they are highly correlated (e.g., for v flows, the blue and green lines decrease till 75*3 hrs and then increase).

The following figure shows how the time series data for the horizontal and vertical flows are highly correlated for the pair of points found above.

`plt.figure(figsize`

`=`

`(`

`10`

`,`

`7`

`))`

`plt.plot(`

`range`

`(`

`100`

`), u[:,`

`233`

`, `

`75`

`], label`

`=`

`'u(75, 233)'`

`)`

`plt.plot(`

`range`

`(`

`100`

`), v[:,`

`233`

`, `

`75`

`], label`

`=`

`'u(75, 233)'`

`)`

`plt.plot(`

`range`

`(`

`100`

`), u[:,`

`103`

`,`

`548`

`], label`

`=`

`'v(548, 103)'`

`)`

`plt.plot(`

`range`

`(`

`100`

`), v[:,`

`103`

`,`

`548`

`], label`

`=`

`'v(548, 103)'`

`)`

`plt.legend()`

## Problem 2: Simulating particle movement in flows

In this problem, we shall build a simulator that can track a particle’s movement on a time-varying flow.

- We assume that the velocity of a particle in the ocean, with certain coordinates, will be determined by the corresponding water flow velocity at those coordinates. Let’s implement a procedure to track the position and movement of multiple particles as caused by the time-varying flow given in the data set. Also, let’s explain the procedure, and show that it works by providing examples and plots.

- The above figure shows a simple flow system with four points, shown as white circles. Each point corresponds to a physical location also shown in kilometers. Attached to each point is a flow data point, which is shown as a blue arrow. Recall that flow data is given in the and direction. It is assumed that a particle moving in one of the zones or boxes acquires the velocity given by the flow data. This example shows that at time , a particle is located at a position x(t), shown as a green diamond.
- After some time ε has passed, at time t+ε, the particle is at a new location x(t+ε) shown as a blue diamond. The new location can be computed as the previous location x(t) plus a displacement computed as the time-step multiplied by the velocity at that point v(t) . Note that the velocity is given in kilometers per hour. Thus, the location x(t) must be given in the same distance units, e.g., kilometers, and the time-step should be given in hours.
- The data provides a discretization of the ocean flow. The particles will, however, be moving on a continuous surface. For simplicity, let us assume that the surface is the plane
**R**2. The data can be seen to provide flow information at integer points, namely at (m, n) for m and n integers. Let’s divide the continuous surface into squares in such a way that each square contains a unique data point. One way to achieve this is to assign to every point in the surface the closest data point. For instance, given (x, y) ∈**R**2, this consist of rounding both x and y to the closest integer. We may then suppose that each square has the same flow information as the data point it contains. - Now take a particle at (x, y) in a particular square. The flow in the square will displace it at the corresponding velocity. Once the particle moves out of this square, it is governed by the flow information of the next square that it enters.
- The below figure shows a grid that we should have in mind when building the simulator. For each point, which corresponds to a physical location, there is a flow vector given, for each time step (recall there are 100 time instances).

Draw particle locations uniformly at random across the entire map, let’s not worry if some of them are placed on land. Let’s simulate the particle trajectories for 300 hours and provide a plot of the initial state, a plot of the final state, and two plots at intermediate states of the simulation. We may wish to draw colors at random for our particles in order to help distinguish them.

## Solution

**Explanation **

The following code shows the function that implements the simulation for the evolution of flow velocities. At a point (x,y), the velocity vector at time t is given by the tuple (u(t,y,x), v(t,y,x)), since in u and v matrices, x axis represents the columns and y axis represents the rows (x, y being integers, so that velocities are discretized). We can draw quivers with the position of a location and the directions from the flow matrices and evolve over time t.

`def`

`plot_flows_over_time(u, v, mask):`

`xx, yy `

`=`

`np.meshgrid(`

`range`

`(`

`0`

`, mask.shape[`

`1`

`], `

`10`

`), `

`range`

`(`

`0`

`, mask.shape[`

`0`

`], `

`10`

`))`

`for`

`t `

`in`

`range`

`(`

`100`

`):`

`plt.figure(figsize`

`=`

`(`

`15`

`,`

`15`

`))`

`sns.heatmap(mask, cmap`

`=`

`pal, alpha`

`=`

`0.4`

`, cbar`

`=`

`None`

`)`

`plt.quiver(xx, yy, u[t,yy,xx],v[t,yy,xx], scale`

`=`

`0.1`

`, units`

`=`

`'xy'`

`, angles`

`=`

`'xy'`

`, \`

`color`

`=`

`[(np.random.random(),np.random.random(),np.random.random()) \ `

`for`

`_ `

`in`

`range`

`(np.prod(u[t,yy,xx].shape))]) `

`#color=(uv, uv, uv))`

`plt.title(`

`'T = {} hrs'`

`.`

`format`

`(`

`3`

`*`

`t), size`

`=`

`20`

`)`

`plt.axis(`

`'off'`

`)`

`plt.show()`

The next figures show the initial, final and the intermediate flows.

The following animation shows the simulation of the flows at different point in time

## Problem 3: Guess the location of the debris for a toy plane crash with simulation

A (toy) plane has crashed north of Palawan at T=0. The exact location is unknown, but data suggests that the location of the crash follows a Gaussian distribution with mean (100, 350) (namely (350 km, 1050 km) ) with variance σ2. The debris from the plane has been carried away by the ocean flow.

Let’s say we are about to lead a search expedition for the debris. Where would we expect the parts to be at 48 hrs, 72 hrs, 120 hrs? Study the problem by varying the variance of the Gaussian distribution. Either pick a few variance samples or sweep through the variances if desired. Let’s sample particles and track their evolution.

## Solution

First let’s visualize how the mean location drifts with time, Let’s first create one single plot instead of three to show the location of the plane at different three times (shown as legends), since it’s more insightful and easier to find the trajectory this way.

Now, let’s sample 100 particles around the mean for 3 different variance values from the Gaussian distribution and visualize where the plane could have been at different time points.

The following animation shows how the debris of the toy plane can get carried away by the flows over time for σ=5:

- It’s quite natural to assume a normal distribution of locations around which debris could have been placed. Since this approach provides us a controlled way to increase the search space (just increase σ), it can be efficient, since we have to explore less locations till finding the debris.
- However, because we only have data for 100-time instances, separated 3 hours each, technically, we can only run such simulation for a total of 300 hours. Hence, simulation can’t extend beyond the time for which we don’t have any (flow) data (i.e., larger time scales).
- Also, we can see from the previous exercise that in the debris simulation based on this time scale, the points did not move too much or too far.

## Problem 4: Creating a Gaussian process model for the flow

In this problem, we shall create a Gaussian process model for the flow given the information we have. We shall perform a toy experiment where we can simulate larger time scales. Furthermore, we are going to model flows as a Gaussian process to estimate some unobserved variables.

- Here onward, we are going to assume that the flow data available,
**instead of being given for measurements every 3 hours, is given for measurements****every three days!**. This implies that we can potentially simulate the movement of debris for almost one year. - However, to achieve a sufficiently smooth simulation, having flow data every three days seems quite a stretch. Whereas in the previous problems we assumed that the flow in the ocean remains approximately the same on a period of 3 hours, to assume that this is also the case for three days does not seem correct. For example, we would like to have time steps of 24 hours, 12 hours or possibly even smaller.
**NOTE**: Of course we are not trying to make a perfect simulation of the ocean, nor an exact fluid dynamics simulations, so for the sake of simplicity and the toy experiment, we shall this simplistic scenario.

- Let’s start by picking a location of from the map for which we are given flow data (ideally from a location on the ocean not in the land). Moreover, consider the two vectors containing the flow speed for each direction: we shall end up with two vectors of dimension 100.
- Let’s find the parameters of the
**kernel**function that best describes the data independently for each direction. That means we shall do the following for each direction, and obtain a model for each direction at that particular location, as per the following instructions.

The following figure shows the general ideas for a Gaussian Process model

Let’s recall the steps to estimate the parameters of the **kernel** function (for a **GP**).

- Pick a kernel function: choose the proper kernel function.
- Identify the parameters of the kernel function.
- Find a suitable search space for each of the parameters.

We shall find a good set of parameters via cross-validation. Pick a number k and split the data k-wise defining some training and some testing points. Each vector is of dimension 100, so for k=10 we shall have ten different partitions, where there are 90 points for training and 10 points for testing.

For each possible set of parameters and each data partition.

- Let’s build an estimate for the mean at each possible time instance. For example, we can estimate the mean as all zeros, or take averages – moving or otherwise – over our training data.
- Construct the covariance matrix for the selected kernel functions and the selected set of parameters.
- Compute the conditional mean and variance of the testing data points, given the mean and variance of the training data points and the data itself. Let’s recall the following for GP:

- In this case, X1 are the velocities for the testing data-points, X2 are the velocities for the training data-points. μ1 are the mean velocities for the testing data-points. μ2 are the mean velocities for the training data-points. Σ22 is the covariance of the training data-points. Σ11 is the covariance of the testing data-points. Σ12 is the cross-covariance. x2 are the observed velocities for the training data-points. Finally, τ is the parameter indicating the variance of the noise in the observations. We can pick τ=0.001.
- Compute the log-likelihood performance for the selected parameters. We can select to compute it only on the testing data, or on the complete vector.
- For each possible set of parameters, we shall then have the performance for each of the partitions of our data. Find the parameters that maximize performance. Save the computed cost/performance metric for each choice of parameters, then create a plot over the search space that shows this metric.

## Solution

Here using the same point (100, 350) where the toy plane crashed. The following figure shows the plot of the time series for u-flows and v-flows at the location.

**Choice and the parameters of kernel function **

**Squared exponential** (**RBF**) **kernel **was chosen as kernel function:

It makes sense to assume that the similarity between the flow velocities decrease exponentially with the square of the time difference they were recorded at.

**Search space for each kernel parameter **

The search space is the Cartesian product of the following values of the hyper-parameters l and σ (each of size 10, chosen at uniform gaps in between 0.1 and 5), as shown below, with 100 possible combinations of hyper-parameter values in the search space.

**The number of folds (k) for the cross-validation**

10-fold cross-validation was used (k=10). The following code shows how the average marginal log-likelihood over k folds was computed on the held-out test dataset. The prior mean (μ1) is set to the mean of all the 100 data points in the time series. The following code shows how the hyperparameter tuning is done with 10-fold cross validation and then optimal parameters are chosen by maximizing the marginal log-likelihood on the test dataset.

`def`

`compute_log_marginal_likelihood(y, μ, K):`

`return`

`-`

`(y`

`-`

`μ)[email protected](K, (y`

`-`

`μ))`

`/`

`2`

`-`

`log(`

`round`

`(np.linalg.det(K), `

`3`

`))`

`/`

`2`

`-`

`len`

`(y)`

`*`

`np.log(`

`2`

`*`

`np.pi)`

`/`

`2`

`def`

`get_train_test_fold(z, i, k):`

`sz `

`=`

`len`

`(z) `

`/`

`/`

`k`

`return`

`np.concatenate([np.arange((i`

`-`

`1`

`)`

`*`

`sz), np.arange(i`

`*`

`sz, k`

`*`

`sz)]), \`

`np.concatenate([z[np.arange((i`

`-`

`1`

`)`

`*`

`sz)], z[np.arange(i`

`*`

`sz, k`

`*`

`sz)]]), \`

`np.arange((i`

`-`

`1`

`)`

`*`

`sz, i`

`*`

`sz), z[(i`

`-`

`1`

`)`

`*`

`sz:i`

`*`

`sz]`

`k `

`=`

`10`

`z `

`=`

`u[:,x,y]`

`l `

`=`

`np.linspace(`

`0.1`

`, `

`5`

`, `

`10`

`)`

`σ `

`=`

`np.linspace(`

`0.1`

`, `

`5`

`, `

`10`

`) `

`ll_mean_k_folds `

`=`

`np.zeros((`

`len`

`(l), `

`len`

`(σ)))`

`for`

`i1 `

`in`

`range`

`(`

`1`

`, k`

`+`

`1`

`):`

`x2, y2, x1, y1 `

`=`

`get_train_test_fold(z, i1, k)`

`ll `

`=`

`np.zeros((`

`len`

`(l), `

`len`

`(σ)))`

`for`

`i `

`in`

`range`

`(`

`len`

`(l)):`

`for`

`j `

`in`

`range`

`(`

`len`

`(σ)):`

`Σ`

`22`

`=`

`compute_sigma(x2, x2, σ[j], l[i])`

`Σ`

`12`

`=`

`compute_sigma(x1, x2, σ[j], l[i])`

`Σ`

`21`

`=`

`compute_sigma(x2, x1, σ[j], l[i])`

`Σ`

`11`

`=`

`compute_sigma(x1, x1, σ[j], l[i])`

`Σ`

`1_2`

`=`

`Σ`

`11`

`-`

`Σ`

`12`

`@(np.linalg.solve(Σ`

`22`

`+`

`τ`

`*`

`*`

`2`

`*`

`np.eye(`

`len`

`(y2)), Σ`

`21`

`)) `

`ll[i,j] `

`=`

`compute_log_marginal_likelihood(y1, Σ`

`1_2`

`)`

`ll_mean_k_folds `

`+`

`=`

`ll`

`ll_mean_k_folds `

`/`

`=`

`k`

**Optimal kernel parameters from the search **

The following figures show the plots of the computed cost/performance metric over the search space for the kernel parameters.

For the GP model for the u-flow direction:

l = 1.733, σ=1.733, with corresponding maximum marginal-log-likelihood = -7.89, as shown in the next heatmap.

For the GP model for the v-flow direction: l = 1.733, σ=1.733, with corresponding maximum marginal-log-likelihood = -7.67, as shown in the next heatmap.

Let’s compute the optimal kernel values for three new locations that are different from the last location, as shown below:

- (407,350): optimal GP hyper-parameters obtained with 10-fold CV: l = 1.733, σ = 1.733 for u-flow and l = 1.733, σ = 1.733 for v-flow direction
- (100,33): optimal GP hyper-parameters obtained with 10-fold CV: l = 1.733, σ = 1.733 for u-flow and l = 1.733, σ = 1.733 for v-flow direction
- (8,450): optimal GP hyper-parameters obtained with 10-fold CV: l = 1.733, σ = 1.733 for u-flow and l = 1.733, σ = 1.733 for v-flow direction

For both the hyperparameters, we get maximum average log-likelihood on held-out folds with the same values of l (length-scale) and σ parameters, typically at l = 1.733, σ = 1.733.

**Impact of τ value on the optimal hyperparameter selection**

We have suggested one particular value for τ. Let’s consider other possible values and observe the effects such parameter has on the estimated parameters and the estimation process’s performance.

- (407,350), τ=0.1: optimal GP hyper-parameters obtained with 10-fold CV: l = 2.278 , σ = 3.367 for u-flow and l = 2.278, σ = 3.367 for v-flow direction.

Heatmap for u flow

Heatmap for v flow

- (407,350), τ=1: optimal GP hyper-parameters obtained with 10-fold CV: l = 2.822 , σ = 5.0 for u-flow and l = 2.278, σ = 2.822 for v-flow direction.

Heatmap for u flow

Heatmap for v flow

As can be seen from above, the optimal values for τ=0.1 and τ=1 differ from those found in with τ=0.001, and also they differ from each other too.

## Problem 5: Using with Gpy package in python and comparing with above results

Currently, most of the commonly used languages like Python, R, Matlab, etc., have pre-installed libraries for Gaussian processes. Let’s use one library of your choice, maybe the language or environment, and compare the obtained results.

## Solution

- Optimal kernel parameters as found through the software library (length scale) l = 3.465, σ = 0.367 (as shown in the next figure)
- Details on the library used: Gpy – A Gaussian Process (GP) framework in Python https://gpy.readthedocs.io/
- The following code snippet shows how to perform the hyperparameter tuning and prediction with GP using the Gpy modules’ functions:

`import`

`GPy`

`kernel `

`=`

`GPy.kern.RBF(input_dim`

`=`

`1`

`, variance`

`=`

`1.`

`, lengthscale`

`=`

`1.`

`)`

`model `

`=`

`GPy.models.GPRegression(np.arange(`

`100`

`).reshape(`

`-`

`1`

`,`

`1`

`), z.reshape(`

`-`

`1`

`,`

`1`

`), kernel)`

`model.constrain_positive(`

`"noise"`

`)`

`model.randomize()`

`model.noise_variance `

`=`

`.`

`001`

`model.optimize(messages`

`=`

`True`

`, max_iters`

`=`

`1e5`

`)`

`model.plot(levels`

`=`

`10`

`)`

- The optimal hyperparameters seem to be little different from the ones obtained without using the library. This may be due to the grid search values used, different hyperparameter tuning method is used, and / or different initialization values used for the prior mean and difference in the noise variances used. The next figures show the comparison of the predictions generated (the predictions for the u-flow direction is shown).

## Problem 6: Estimating unobserved flow data.

In the previous problem, we have found a good set of parameters to model the sequence of speeds at one location as a Gaussian process. Recall that we have assumed our 100 observations came at a rate of one every three days. We are going to assume that when we advance to our simulations, we will choose a smaller time step. Thus, we need to interpolate how the flow would look like at some unobserved points.

Let’s say we are given flow information every three days. Let’s pick some time stamps in-between each observation for which to estimate the flow. For example, if we want flows every day, there will be two unknown points between two observations. We could pick only one, or more than two.

- Let’s compute the conditional distribution (mean and covariance) at the time locations selected in the previous solution.
- Use the kernel parameters that you obtained earlier, and use the same location as we did in the previous problem.
- For the initial estimate of the mean at the unknown time locations, we can use zero, use the average of all the observations, or take the average of the two closest observed points.
- Let’s plot the predictions, clearly showing:
- The predicted means.
- The predicted standard deviation as a 3σ band (three standard deviations above and below the mean).
- The observed data points.

## Solution

- The predicted means are shown in as blue continuous line and the observed data points are shown as red stars in the next figure.

- Choice of time-stamps at which to create predictions: Wanted flows every day, so two unknown points were chosen between every two observations, so in total we have 300 datapoints to predict, corresponding to 300 days.
- The prior means were chosen as the mean of the entire (observed) dataset (i.e., u and v velocities corresponding to the 100 points, respectively).
- Predictions for the horizontal velocity components at the chosen location (100, 350) at different points in time is shown in the next figure.

- Predictions for the vertical velocity components at the chosen location at different points in time is shown in the next figure.

The following code snippet used to compute the prediction with GP using our own implementation of GP (also find the code here):

`z `

`=`

`u[:,x,y] `

`#v[:,x,y]`

`μ`

`1`

`=`

`np.mean(z)`

`x `

`=`

`np.arange(`

`0`

`, `

`100`

`)`

`x_ `

`=`

`np.linspace(`

`0`

`, `

`99`

`, `

`300`

`)`

`#μ1 = np.zeros(len(x_))`

`Σ`

`22`

`=`

`compute_sigma(x, x, σ, l)`

`Σ`

`12`

`=`

`compute_sigma(x_, x, σ, l)`

`Σ`

`21`

`=`

`compute_sigma(x, x_, σ, l)`

`Σ`

`11`

`=`

`compute_sigma(x_, x_, σ, l)`

`μ`

`2`

`=`

`np.mean(z)`

`μ`

`1_2`

`=`

`μ`

`1`

`+`

`Σ`

`12`

`@(np.linalg.solve(Σ`

`22`

`+`

`τ`

`*`

`*`

`2`

`*`

`np.eye(`

`len`

`(z)), z `

`-`

`μ`

`2`

`)) `

`Σ`

`1_2`

`=`

`Σ`

`11`

`-`

`Σ`

`12`

`@(np.linalg.solve(Σ`

`22`

`+`

`τ`

`*`

`*`

`2`

`*`

`np.eye(`

`len`

`(z)), Σ`

`21`

`)) `

The next animation shows the effect of the hyperparameter (length-scale and variance for the RBF kernel) values on the fitted GP and their impact on prediction:

## Problem 7: A longer time-scale simulation with Gaussian Process

In the previous problems, we learned to model the flow at one location as a Gaussian process. Thus, we can extend this to estimate the flow at any point in time at that particular location using the kernel function parameters. At a certain point in time, the flow can be computed as the realization of a multivariate Gaussian random variable with parameters given by the conditional distributions given the flow data. At this point, you are asked to simulate a particle moving according to the flow data and using the estimates for times between the original timestamps of the problem.

- Ideally, one would have to estimate the parameters of the flow at every point in the map. However, having to run parameter selection models seems like much computational work. So, here we take a more straightforward approach: use the results from the previous problem to choose a value of our kernel parameters that is generally representative of the points that we tested.

- Let’s modify the simulator that we built earlier to use this new flow estimated flow information. Note that with this new change, we should be able to simulate the flow of particles for 300 days! Regarding data, originally, we have 100 measurements per point, now with this approach, let us say we use the estimates for two extra points to get one flow data per day, so in total, we should have at your disposal 300 descriptions of flow per location.
- Let’s now repeat the simulation problem with GP. This time we shall be simulating flows for 300 days. This allows some debris to arrive on land. Let’s investigate where are some possible places along the coast where one could find debris? Again, to do this, let’s choose a σ, and simulate the movement of particles with initial location sampled from the bivariate Gaussian. Evolve the location of the particles. Some times particles trajectories will terminate on the shore. Continue to keep track of such particles. These points are likely where we could find the debris.
- Provide a plot that includes your initial, final, and at least one intermediate state of your simulation. For the final state, clearly mark one location on land where we would search for debris. Also mark one location over the ocean where we would search for debris.
- Try at least one other value for σ, and create the same three plots.

## Solution

- Plot with the initial / intermediate / final state of the simulation with GP (used 10 sample points from the Gaussian distribution with mean location (100,350) and s.d. 5) is shown in the next figure:

- Marking a location on the coast and another one over the ocean of the final state of the simulation where one should search for debris: as can be seen from next figure, the debris is most likely to land at (112, 380), keeping in view of the smoothness introduced by GP while getting carried away.

- Plots (initial, intermediate, final) for one other choice of σ.

As can be seen the conclusion changes with higher standard deviation, the debris are more dispersed and can potentially land at any of the above two locations.

**To be continued…**

## Recommend

## About Joyk

Aggregate valuable and interesting links.

Joyk means Joy of geeK