# Chapter 13 Multivariate and Spatiotemporal Geostatistics

Building on the simple interpolation methods presented in chapter 12, this chapter works out a case study for spatiotemporal interpolation, using NO\(_2\) air quality data, and populations density as covariate.

## 13.1 Preparing the air quality dataset

The dataset we work with is an air quality dataset obtained from
the European Environmental Agency (EEA). European member states report
air quality measurements to this Agency. So-called *validated*
data are quality controlled by member states, and are reported on a
yearly basis. They form the basis for policy compliancy evaluations.

The EEA’s air quality e-reporting website gives access to the data reported by European member states. We decided to download hourly (time series) data, which is the data primarily measured. A web form helps convert simple selection criteria into an http GET request. The following URL

`https://fme.discomap.eea.europa.eu/fmedatastreaming/AirQualityDownload/AQData_Extract.fmw?CountryCode=DE&CityName=&Pollutant=8&Year_from=2017&Year_to=2017&Station=&Samplingpoint=&Source=E1a&Output=TEXT&UpdateDate=`

was created to select all validated (`Source=E1a`

) \(NO_2\)
(`Pollutant=8`

) data for 2017 (`Year_from`

, `Year_to`

) from Germany
(`CountryCode=DE`

). It returns a text file with a set of URLs to CSV
files, each containing the hourly values for the whole period for a
single measurement station. These files were downloaded and converted
to the right encoding using the `dos2unix`

command line utility.

In the following, we will read all the files into a list,

```
files = list.files("aq", pattern = "*.csv", full.names = TRUE)
r = lapply(files[-1], function(f) read.csv(f))
```

then convert the time variable into a `POSIXct`

variable, and put them in time order by

```
Sys.setenv(TZ = "UTC") # make sure times are not interpreted in local time zone
r = lapply(r, function(f) {
f$t = as.POSIXct(f$DatetimeBegin)
f[order(f$t), ]
}
)
```

We remove smaller subdatasets, which for this dataset have no hourly data:

```
r = r[sapply(r, nrow) > 1000]
names(r) = sapply(r, function(f) unique(f$AirQualityStationEoICode))
length(r) == length(unique(names(r)))
```

and then combine all files using `xts::cbind`

, so that they are matched based on matching times:

A usual further selection for this dataset is to select stations
for which 75% of the hourly values measured are valid, i.e. drop
those with more than 25% hourly values missing.
Knowing that `mean(is.na(x))`

gives the *fraction* of missing values
in a vector `x`

, we can apply this function to the columns (stations):

Next, the station metadata was read
and filtered for rural background stations in Germany (`"DE"`

) by

```
library(tidyverse)
read.csv("aq/AirBase_v8_stations.csv", sep = "\t") %>%
as_tibble() %>%
filter(country_iso_code == "DE", station_type_of_area == "rural",
type_of_station == "Background") -> a2
```

These stations contain coordinates, and an `sf`

object with (static) station metadata is created by

```
library(sf)
a2.sf = st_as_sf(a2, coords = c("station_longitude_deg", "station_latitude_deg"), crs = 'EPSG:4326')
```

We now subset the air quality measurements to include only stations that are of type rural background:

We can compute station means, and join these to stations locations by

```
tb = tibble(NO2 = apply(aqsel, 2, mean, na.rm = TRUE), station_european_code = colnames(aqsel))
crs = "EPSG:32632"
right_join(a2.sf, tb) %>% st_transform(crs) -> no2.sf
```

Station mean NO\(_2\) concentrations, along with country borders, are shown in in figure 12.1.

## 13.2 Multivariable geostatistics

Multivariable geostatics involves the *joint* modelling, prediction
and simulation of multiple variables,
\[Z_1(s) = X_1 \beta_1 + e_1(s)\]
\[...\]
\[Z_n(s) = X_n \beta_n + e_n(s).\]
In addition to having observations, trend models, and variograms
for each variable, the *cross* variogram for each pair of residual
variables, describing the covariance of \(e_i(s), e_j(s+h)\),
is required. If this cross covariance is non-zero, knowledge of
\(e_j(s+h)\) may help predict (or simulate) \(e_i(s)\). This is
especially true if \(Z_j(s)\) is more densely sample than \(Z_i(s)\).
Prediction and simulation under this model are called cokriging
and cosimulation. Examples using gstat are found when running the
demo scripts

and are further illustrated and discussed in Bivand, Pebesma, and Gomez-Rubio (2013).

In case the different variables considered are observed at
the same set of locations, for instance different air quality
parameters, then the statistical *gain* of using cokriging
as opposed to direct (univariable) kriging is often modest,
when not negligible. A gain may however be that the prediction
is truly multivariable: in addition to the prediction vector
\(\hat{Z(s_0)}=(\hat{Z}_1(s_0),...,\hat{Z}_n(s_0))\) we get the full
covariance matrix of the prediction error (Ver Hoef and Cressie 1993).
This means for instance that if we are interested in some
linear combination of \(\hat{Z}(s_0)\), such as \(\hat{Z}_2(s_0) - \hat{Z}_1(s_0)\), that we can get the standard error of that
combination because we have the correlations between the prediction
errors.

Although sets of direct and cross variograms can be computed and fitted automatically, multivariable geostatistical modelling becomes quickly hard to manage when the number of variables gets large, because the number of direct and cross variograms required is \(n(n+1)/2\).

In case different variables refer to the same variable take at
different time steps, one could use a multivariable (cokriging)
prediction approach, but this would not allow for e.g. interpolation
between two time steps. For this, and for handling the case of
having data observed at many time instances, one can also model
its variation as a function of continuous space *and* time, as of
\(Z(s,t)\), which we will do in the next section.

## 13.3 Spatiotemporal geostatistics

Spatiotemporal geostatistical processes are modelled as variables having a value everywhere in space and time, \(Z(s,t)\), with \(s\) and \(t\) the continuously indexed space and time index. Given observations \(Z(s_i,t_j)\) and a variogram (covariance) model \(\gamma(s,t)\) we can predict \(Z(s_0,t_0)\) at arbitrary space/time locations \((s_0,t_0)\) using standard Gaussian process theory.

Several books have been written recently about modern approaches to handling and modelling spatiotemporal geostatistical data, including Wikle, Zammit-Mangion, and Cressie (2019) and Blangiardo and Cameletti (2015). Here, we will use Gräler, Pebesma, and Heuvelink (2016) and give some simple examples using the dataset also used for the previous chapter.

### 13.3.1 A spatiotemporal variogram model

Starting with the spatiotemporal matrix of NO\(_2\) data in `aq`

constructed at the beginning of this chapter, we will first select
the measurements taken at rural background stations:

Then we will select the spatial locations for these stations by

and finally build a `stars`

object with time and station as dimensions:

```
library(stars)
st_as_stars(NO2 = as.matrix(aqx)) %>%
st_set_dimensions(names = c("time", "station")) %>%
st_set_dimensions("time", index(aqx)) %>%
st_set_dimensions("station", sfc) -> no2.st
```

From this, we can compute the spatiotemporal variogram using

```
# Warning: Strategy 'multiprocess' is deprecated in future (>= 1.20.0). Instead,
# explicitly specify either 'multisession' or 'multicore'. In the current R
# session, 'multiprocess' equals 'multicore'.
```

which is shown in figure 13.1.

```
v1 = plot(v.st)
v2 = plot(v.st, map = FALSE)
print(v1, split = c(1,1,2,1), more = TRUE)
print(v2, split = c(2,1,2,1), more = FALSE)
```

To this sample variogram, we can fit a variogram model. One relatively flexible model we try here is the product-sum model (Gräler, Pebesma, and Heuvelink 2016), fitted by

```
# product-sum
prodSumModel <- vgmST("productSum",
space = vgm(150, "Exp", 200, 0),
time = vgm(20, "Sph", 40, 0),
k = 2)
StAni = estiStAni(v.st, c(0,200000))
(fitProdSumModel <- fit.StVariogram(v.st, prodSumModel, fit.method = 7,
stAni = StAni, method = "L-BFGS-B",
control = list(parscale = c(1,10,1,1,0.1,1,10)),
lower = rep(0.0001, 7)))
```

```
# space component:
# model psill range
# 1 Nug 26.3 0
# 2 Exp 140.5 432
# time component:
# model psill range
# 1 Nug 1.21 0.0
# 2 Sph 15.99 40.1
# k: 0.0322469094848839
```

and shown in figure 13.2.

which can also be plotted as wireframes, shown in figure 13.3.

```
plot(v.st, model = fitProdSumModel, wireframe = TRUE, all = TRUE,
scales = list(arrows = FALSE), zlim = c(0, 185))
```

Hints about the fitting strategy and alternative models for spatiotemporal variograms are given in Gräler, Pebesma, and Heuvelink (2016).

With this fitted model, and given the observations, we can carry out kriging or simulation at arbitrary points in space and time. For instance, we could estimate (or simulate) values in the time series that are now missing: this occurs regularly, and in section 12.4 we used means over time series based on simply ignoring up to 25% of the observations: substituting these with estimated or simulated values based on neigbouring (in space and time) observations before computing yearly mean values seems a more reasonable approach.

More in general, we can estimate at arbitrary locations and time
points, and we will illustrate this with predicting time series at
particular locations, and and predicting spatial slices (Gräler, Pebesma, and Heuvelink 2016).
We can create a `stars`

object for two randomly picked spatial points
and all time instances by

```
pt = st_sample(de, 2)
t = st_get_dimension_values(no2.st, 1)
st_as_stars(list(pts = matrix(1, length(t), length(pt)))) %>%
st_set_dimensions(names = c("time", "station")) %>%
st_set_dimensions("time", t) %>%
st_set_dimensions("station", pt) -> new_pt
```

and we obtain the spatiotemporal predictions at these two points using `krigeST`

by

```
no2.st <- st_transform(no2.st, crs)
new_ts <- krigeST(NO2~1, data = no2.st["NO2"], newdata = new_pt,
nmax = 50, stAni = StAni, modelList = fitProdSumModel,
progress = FALSE)
```

where the results are shown in figure 13.4.

Alternatively, we can create spatiotemporal predictions for a set of time-stamped raster maps, evenly spaced over the year 2017, created by

```
t4 = t[(1:4 - 0.5) * (3*24*30)]
d = dim(grd)
st_as_stars(pts = array(1, c(d[1], d[2], time = length(t4)))) %>%
st_set_dimensions("time", t4) %>%
st_set_dimensions("x", st_get_dimension_values(grd, "x")) %>%
st_set_dimensions("y", st_get_dimension_values(grd, "y")) %>%
st_set_crs(crs) -> grd.st
```

and the subsequent predictions are obtained by

```
new_int <- krigeST(NO2~1, data = no2.st["NO2"], newdata = grd.st,
nmax = 200, stAni = StAni, modelList = fitProdSumModel,
progress = FALSE)
names(new_int)[2] = "NO2"
```

```
g + geom_stars(data = new_int, aes(fill = NO2, x = x, y = y)) +
facet_wrap(~as.Date(time)) +
geom_sf(data = st_cast(de, "MULTILINESTRING")) +
geom_sf(data = no2.sf, col = 'grey', cex = .5) +
coord_sf(lims_method = "geometry_bbox")
```

`# Coordinate system already present. Adding new coordinate system, which will replace the existing one.`

and shown in figure 13.5.

A larger value for `nmax`

was needed here to decrease the visible
disturbance (sharp edges) caused by discrete neighbourhood
selections, which are now done in space *and* time.

## 13.4 Exercises

- Which fraction of the stations is removed in section 13.1 when the criterion applied that a station must be 75% complete?
- From the hourly time series in
`no2.st`

, compute daily mean concentrations using`aggregate`

, and compute the spatiotemporal variogram of this. How does it compare to the variogram of hourly values? - Carry out a spatiotemporal interpolation for daily mean values for the days corresponding to those shown in 13.5, and compare the results.
- Following the example in the demo scripts pointed at in section 13.2, carry out a cokriging on the daily mean station data for the four days shown in 13.5. What are the differences of this approach to spatiotemporal kriging?

### References

Bivand, Roger S., Edzer Pebesma, and Virgilio Gomez-Rubio. 2013. *Applied Spatial Data Analysis with R, Second Edition*. Springer, NY. http://www.asdar-book.org/.

Blangiardo, Marta, and Michela Cameletti. 2015. *Spatial and Spatio-Temporal Bayesian Models with R-Inla*. John Wiley & Sons.

Gräler, Benedikt, Edzer Pebesma, and Gerard Heuvelink. 2016. “Spatio-Temporal Interpolation using gstat.” *The R Journal* 8 (1): 204–18. https://doi.org/10.32614/RJ-2016-014.

Ver Hoef, Jay M, and Noel Cressie. 1993. “Multivariable Spatial Prediction.” *Mathematical Geology* 25 (2): 219–40.

Wikle, Christopher K, Andrew Zammit-Mangion, and Noel Cressie. 2019. *Spatio-Temporal Statistics with R*. CRC Press.