stgam
The aim of this vignette is provide an introduction to the
stgam
package and demonstrates how to construct a spatially
vary coefficient (SVC) model and spatial and temporally varying (STVC)
model using the stgam
package. Essentially
stgam
provides a wrapper for constructing GAMs with
Gaussian Process (GP) smooths or splines, that are parameterised with
location for SVCs and with location and time for STVCs. The key ideas
underpinning the development of SVCs with GAMs in the stgam
package and this vignette are:
The approach is presented in outline below, but detail of the SVCs, TVCs and STVCs constructed using GAMs with GP smooths, and the evolution of their application from spatial models to space-time coefficient modelling can be found in Comber, Harris, and Brunsdon (2024) XXX. In this, the concept of a Gaussian Process (Wood 2020) is important in the context of regression modelling. It provides a data model of the likelihood that a given data set is generated given a statistical model involving some unknown parameters and in regression modelling, the unknown parameters are the regression parameters. These are described formally in Comber, Harris, and Brunsdon (2024) and XXX.
One final comment is that in this vignette you will all of the
varying coefficient models you create assume that spatial, temporal or
spatio-temporal dependencies are present in the data. This may not be
the case and these assumptions are examined in detail in the second
vignette in the stgam
package.
In this vignette you will:
stgam
package!You should install the stgam
package either from CRAN or
from GitHub:
And then make sure the required packages and data are loaded:
# load the packages
library(stgam)
library(cols4all) # for nice shading in graphs and maps
library(cowplot) # for managing plots
library(dplyr) # for data manipulation
library(ggplot2) # for plotting and mapping
library(glue) # for model construction
library(mgcv) # for GAMs
library(sf) # for spatial data
library(doParallel) # for parallelising operations
library(purrr) # for model construction
library(tidyr) # for model construction
# load the data
data(productivity)
data(us_data)
The productivity
data is annual economic productivity
data for the 48 contiguous US states (with Washington DC merged into
Maryland), for years 1970 to 1985 (17 years). This was extracted from
the plm
package (Croissant, Millo,
and Tappe 2022). The us_data
is a spatial dataset of
the US states in a USA Contiguous Equidistant Conic projection
(ESRI:102005) from the spData
package (Bivand, Nowosad, and Lovelace 2019). The
productivity
data includes locational information of the
state geometric centroid in the same projection. The code below maps the
X
and Y
locations in productivity
along with the US state areas.
ggplot() + geom_sf(data = us_data, fill = NA) +
geom_point(data = productivity |> filter(year == "1970"), aes(x = X, y = Y)) +
theme_bw() + ylab("") + xlab("")
The data attributes can be examined and a spatial model of
constructed using gam
function from the mgcv
package. The varying coefficient models created below all focus on
Private capital stock (privC
) as the target variable, with
Unemployment (unemp
) and Public capital (pubC
)
covariates, where the coefficient functions are assumed to be
realisation of a Gaussian Process (GP) introduced above.
head(productivity)
#> # A tibble: 6 × 14
#> state GEOID year region pubC hwy water util privC gsp emp unemp X
#> <chr> <chr> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 ALABAMA 01 1970 6 15033. 7326. 1656. 6051. 35794. 28418 1010. 4.7 858146.
#> 2 ALABAMA 01 1971 6 15502. 7526. 1721. 6255. 37300. 29375 1022. 5.2 858146.
#> 3 ALABAMA 01 1972 6 15972. 7765. 1765. 6442. 38670. 31303 1072. 4.7 858146.
#> 4 ALABAMA 01 1973 6 16406. 7908. 1742. 6756. 40084. 33430 1136. 3.9 858146.
#> 5 ALABAMA 01 1974 6 16763. 8026. 1735. 7002. 42057. 33749 1170. 5.5 858146.
#> 6 ALABAMA 01 1975 6 17316. 8158. 1752. 7406. 43972. 33604 1155. 7.7 858146.
#> # ℹ 1 more variable: Y <dbl>
A spatially varying coefficient model will be created using the
productivity
dataset.
The code below defines an intercept column (Intercept
)
in the data. This to allow the intercept to be treated as an addressable
term in the model. It also defines parametric and non-parametric forms
for the intercept and each covariate, so that they can can take a global
form (i.e. as in a standard OLS regression) and a spatially varying form
in the GP smooth.
# define intercept term
productivity <- productivity |> mutate(Intercept = 1)
# create the SVC
svc.gam = gam(privC ~ 0 +
Intercept + s(X, Y, bs = 'gp', by = Intercept) +
unemp + s(X, Y, bs = "gp", by = unemp) +
pubC + s(X, Y, bs = "gp", by = pubC),
data = productivity |> filter(year == "1970"))
Notice the 0 +
in the model. This indicates that the
intercept coefficient is not included implicitly and it is included
explicitly as Intercept
. Also notice the different form of
the splines from those specified in Part 1. Here, for each covariate, a
GP smooth is specified for X
and Y
(the
coordinates in geographic space) and the covariate is included via the
by
parameter. This is to explore the interaction of the
covariate with the target variable over the space defined by
X
and Y
locations, allowing spatially varying
coefficients to be generated. The model has 4 key terms specified in the
same way by
<VAR> + (X, Y, b s= 'gp', by = <VAR>)
:
<VAR>
is the fixed parametric term for the
covariates(...)
defines the smoothbs = 'gp'
states that this is a GP smoothby = <VAR>
suggests the GP should be multiplied
by variableThe model output can be assessed: the k'
and
edf
parameters are quite close, but the p-values are high
and and the k-index
is greater than 1 so this looks OK. The
diagnostic plots are again generated by the gam.check
function as below:
#>
#> Method: GCV Optimizer: magic
#> Smoothing parameter selection converged after 6 iterations by steepest
#> descent step failure.
#> The RMS GCV score gradient at convergence was 2452.878 .
#> The Hessian was not positive definite.
#> Model rank = 8 / 101
#>
#> Basis dimension (k) checking results. Low p-value (k-index<1) may
#> indicate that k is too low, especially if edf is close to k'.
#>
#> k' edf k-index p-value
#> s(X,Y):Intercept 32.00 2.00 0.82 0.070 .
#> s(X,Y):unemp 33.00 2.00 0.82 0.030 *
#> s(X,Y):pubC 33.00 3.67 0.82 0.045 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# model summary
summary(svc.gam)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> privC ~ 0 + Intercept + s(X, Y, bs = "gp", by = Intercept) +
#> unemp + s(X, Y, bs = "gp", by = unemp) + pubC + s(X, Y, bs = "gp",
#> by = pubC)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> Intercept 0.0014245 0.0003432 4.151 0.000168 ***
#> unemp 0.0064540 0.0015548 4.151 0.000168 ***
#> pubC -3.9243042 0.9571325 -4.100 0.000196 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y):Intercept 2.000 2.000 0.453 0.639
#> s(X,Y):unemp 2.000 2.000 0.405 0.676
#> s(X,Y):pubC 3.672 3.715 188.593 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Rank: 8/101
#> R-sq.(adj) = 0.918 Deviance explained = 96.5%
#> GCV = 1.9338e+08 Scale est. = 1.613e+08 n = 48
Here it can be seen that:
edf
are well below k
in the
gam.check()
printed output (the k-index<1
issue is not important because of this).pccap
is locally significant
and spatially varying.The spatially varying coefficient estimates can be extracted using
predict
. To do this a dummy data set is created with the
pubC
term set to 1, and the intersect and
unemp
terms set to zero. The result is that the predicted
values for the coefficient estimate are just a function of β2, the pubC
coefficient estimate at each location.
get_b2<- productivity |> filter(year == "1970") |> mutate(Intercept = 0, unemp = 0, pubC = 1)
res <- productivity |> filter(year == "1970") |>
mutate(b2 = predict(svc.gam, newdata = get_b2))
The resulting data.frame
called res
has a
new variable called b2
which is the spatially varying
coefficient estimate for pubC
. For comparison, we can
generate the spatially varying coefficient estimate for the intercept
(β0) and
unemp
β1 (which were not found
to be significant locally) in the same way by setting the other terms in
the model to zero:
get_b0 <- productivity |> filter(year == "1970") |> mutate(Intercept = 1, unemp = 0, pubC = 0)
res <- res |> mutate(b0 = predict(svc.gam, newdata = get_b0))
get_b1 <- productivity |> filter(year == "1970") |> mutate(Intercept = 0, unemp = 1, pubC = 0)
res <- res |> mutate(b1 = predict(svc.gam, newdata = get_b1))
So res
has the records for the year 1970 and three new
columns for b0
, b1
and b2
The
distribution of the spatially varying coefficient estimates can be
examined:
res |> select(b0, b1, b2) |>
apply(2, summary) |> round(2)
#> b0 b1 b2
#> Min. -19058.71 -2848.15 -0.18
#> 1st Qu. -7882.46 -1199.14 1.29
#> Median 731.22 -156.64 2.11
#> Mean 0.00 0.30 2.08
#> 3rd Qu. 7891.22 1179.57 2.85
#> Max. 20699.80 3362.89 3.80
The stgam
package has a function to extract the varying
coefficients, calculate_vcs
. This takes three arguments,
the GAM varying coefficient model, the model terms, and the data used to
create the model:
terms = c("Intercept", "unemp", "pubC")
res <- calculate_vcs(model = svc.gam,
terms = terms,
input_data = productivity |> filter(year == "1970"))
summary(res[, paste0("b_",terms)])
#> b_Intercept b_unemp b_pubC
#> Min. :-19058.713 Min. :-2848.148 Min. :-0.1822
#> 1st Qu.: -7882.457 1st Qu.:-1199.142 1st Qu.: 1.2941
#> Median : 731.222 Median : -156.638 Median : 2.1100
#> Mean : 0.001 Mean : 0.302 Mean : 2.0825
#> 3rd Qu.: 7891.224 3rd Qu.: 1179.568 3rd Qu.: 2.8474
#> Max. : 20699.803 Max. : 3362.893 Max. : 3.8034
Standard dplyr
and ggplot
approaches can be
used to join and map the coefficient estimates, formally β0, β1 and β2) as in the figure
below. Notice the North-South trend for the Intercept and the East-West
and trend for Unemployment - both insignificant predictors of
privC
- and a much stronger specific spatial pattern
between the target variable and Public capital (pubC
), with
particularity high coefficient estimates in the south.
# join the data
map_results <-
us_data |> left_join(res |> select(GEOID, b_Intercept, b_unemp, b_pubC), by = "GEOID")
# plot the insignificant coefficient estimates
tit =expression(paste(""*beta[0]*""))
p1 <-
ggplot(data = map_results, aes(fill=b_Intercept)) +
geom_sf() +
scale_fill_continuous_c4a_div(palette="brewer.spectral",name=tit) +
coord_sf() +
ggtitle("Intercept: not significant")
tit =expression(paste(""*beta[1]*""))
p2 <-
ggplot(data = map_results, aes(fill=b_unemp)) +
geom_sf() +
scale_fill_continuous_c4a_div(palette="brewer.spectral",name=tit) +
coord_sf() +
ggtitle("Unemployment: not significant")
# plot the significant pubC coefficient estimates
tit =expression(paste(""*beta[2]*" "))
p3 <-
ggplot(data = map_results, aes(fill=b_pubC)) +
geom_sf() +
scale_fill_continuous_c4a_div(palette="brewer.prgn",name=tit) +
coord_sf() +
ggtitle("Public captial: significant")
plot_grid(p1, p2, p3, ncol = 1)
The productivity
data was filtered for 1970 the the SVC
above in which the X-Y
location of the 48 states was used
to parametrise the Gaussian Process smooths. The same structure can be
used to create a temporally vary coefficient model (TVC), with smooths
specified to include the year
parameter, but this time not
restricting the analysis data to records from a single year:
# create the TVC
tvc.gam = gam(privC ~ 0 +
Intercept + s(year, bs = 'gp', by = Intercept) +
unemp + s(year, bs = "gp", by = unemp) +
pubC + s(year, bs = "gp", by = pubC),
data = productivity)
The model can be inspected in the same way to examine the the
k'
and edf
parameters using the
gam.check
function and again there are no concerns:
The model summary indicates that all of the parametric and temporally vary coefficients are significant at the 95% level except the parametric one for Unemployment, but there are some interesting patterns in the residuals, as reflected in the diagnostics plots above and the model fit (R2) value.
The temporally varying coefficients can be extracted in the same way
as the DVC approach, explored using the predict
function
and setting each of the covariates to 1 and the others to zero in
turn:
terms = c("Intercept", "unemp", "pubC")
res <- calculate_vcs(model = tvc.gam,
terms = terms,
input_data = productivity)
summary(res[, paste0("b_",terms)])
#> b_Intercept b_unemp b_pubC
#> Min. :-1057 Min. :-3733.9 Min. :1.462
#> 1st Qu.: 7649 1st Qu.:-2342.3 1st Qu.:1.653
#> Median :16356 Median : -950.6 Median :1.844
#> Mean :16356 Mean : -950.6 Mean :1.844
#> 3rd Qu.:25063 3rd Qu.: 441.0 3rd Qu.:2.035
#> Max. :33769 Max. : 1832.6 Max. :2.226
The variation in coefficient estimates can be inspected over time, remembering that each State has the same coefficient estimate value for each year, so just one state is selected in the code below (you could chose another and the result would be the same). Notice the linear declines due to the linearity of the time covariate in the figure below.
res |>
filter(state == "ARIZONA") |>
select(year, b_Intercept, b_unemp, b_pubC) |>
pivot_longer(-year) |>
ggplot(aes(x = year, y = value)) +
geom_point() + geom_line() +
facet_wrap(~name, scale = "free") +
theme_bw() + xlab("Year") + ylab("")
We can combine space and time in GAM GP splines. But how? We could
use separate smooths for location and for time, or a single, 3D smooth
parameterised with both location and time. There are assumptions
associated with each of these. The code below specifies the interaction
of the covariates within a single space-time GP smooth. You will notice
this takes a few seconds longer to run, and choice of how to specify the
smooths is explored in the second vignette in the stgam
package.
stvc.gam = gam(privC ~ 0 +
Intercept + s(X, Y, year, bs = 'gp', by = Intercept) +
unemp + s(X, Y, year, bs = "gp", by = unemp) +
pubC + s(X, Y, year, bs = "gp", by = pubC),
data = productivity)
The model can be inspected in the usual way using the
gam.check
function (again, no concerns) and the
summary
function with k and edf despite the
concerning k-index
and p-value
values. In this
case all of the parametric and smooth coefficient estimates are
significant at the 95% level, and the model fit (R2) has again increased
over the SVC model.
The coefficients spatial and temporally vary coefficients can be extracted in the same way as before and the variation in coefficient estimates from the STVC-GAM model summarised:
terms = c("Intercept", "unemp", "pubC")
res <- calculate_vcs(model = stvc.gam,
terms = terms,
input_data = productivity)
summary(res[, paste0("b_",terms)])
#> b_Intercept b_unemp b_pubC
#> Min. :-9767.27 Min. :-1581.313 Min. :0.009162
#> 1st Qu.:-3966.23 1st Qu.: -642.095 1st Qu.:1.619839
#> Median : -77.85 Median : -312.564 Median :2.298574
#> Mean : 0.00 Mean : -2.411 Mean :2.308127
#> 3rd Qu.: 3378.97 3rd Qu.: 629.192 3rd Qu.:3.027175
#> Max. :10806.31 Max. : 2195.418 Max. :4.396288
This indicates that a positive relationship between Private capital with with Public capital and mixed positive and negative one with Unemployment and the Intercept.
It is instructive to unpick some of the model coefficients in more detail and the code below summarises variations over time through the median values of each coefficient estimate:
res |>
select(year, b_Intercept, b_unemp, b_pubC) |>
group_by(year) |>
summarise(med_b0 = median(b_Intercept),
med_b1 = median(b_unemp),
med_b2 = median(b_pubC))
#> # A tibble: 17 × 4
#> year med_b0 med_b1 med_b2
#> <int> <dbl> <dbl> <dbl>
#> 1 1970 -77.8 -313. 1.95
#> 2 1971 -77.8 -313. 1.99
#> 3 1972 -77.8 -313. 2.03
#> 4 1973 -77.8 -313. 2.07
#> 5 1974 -77.8 -313. 2.11
#> 6 1975 -77.8 -313. 2.15
#> 7 1976 -77.8 -313. 2.19
#> 8 1977 -77.8 -313. 2.23
#> 9 1978 -77.8 -313. 2.27
#> 10 1979 -77.8 -313. 2.31
#> 11 1980 -77.8 -313. 2.35
#> 12 1981 -77.8 -313. 2.39
#> 13 1982 -77.8 -313. 2.43
#> 14 1983 -77.8 -313. 2.47
#> 15 1984 -77.8 -313. 2.51
#> 16 1985 -77.8 -313. 2.55
#> 17 1986 -77.8 -313. 2.59
It is evident that of the 2 covariates and the intercept used to
model Private capital, only Public capital (b2
) varies
(increases) over time. This increase is shown visually below .
# inputs to plot
res |> select(starts_with("b"), year) |>
mutate(year = "All Years") -> tmp
cols = c(c4a("tableau.red_gold", n = 17, reverse = T), "grey")
tit =expression(paste(""*beta[`Private Capital`]*""))
# plot
res |> select(starts_with("b"), year) |>
rbind(tmp) |>
mutate(year = factor(year)) |>
ggplot(aes(y = year, x = b_pubC, fill = year)) +
geom_boxplot(outlier.alpha = 0.1) +
scale_fill_manual(values=cols, guide = "none") +
theme_bw() + xlab(tit) + ylab("Time")
The spatial pattern of this temporal trend can also be explored as below. This shows that the increasing intensity of the effect of Public capital on Private capital does not vary spatially: the increase in effect is spatially even, with high values in the south.
tit =expression(paste(""*beta[`Public Capital`]*""))
# join the data
map_results <-
us_data |> left_join(res |> select(GEOID, year, b_Intercept, b_unemp, b_pubC), by = "GEOID")
# create the plot
map_results |>
ggplot() + geom_sf(aes(fill = b_pubC), col = NA) +
scale_fill_binned_c4a_seq(palette="scico.lajolla", name = tit) +
facet_wrap(~year) +
theme_bw() + xlab("") + ylab("") +
theme(
strip.background =element_rect(fill="white"),
strip.text = element_text(size = 8, margin = margin()),
legend.position = "inside", legend.position.inside = c(0.7, 0.1),
legend.direction = "horizontal",
legend.key.width = unit(1, "cm"),
axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
The rationale for using GAMs with GP splines for spatially varying coefficient (SVC) or temporally varying coefficient (TVC) models is as follows:
Initial research has demonstrated the formulation and application of a GAM with GP splines calibrated via observation location as a multiscale SVC model: the Geographical Gaussian Process GAM (GGP-GAM) (Comber, Harris, and Brunsdon 2024). The GGP-GAM was compared with the most popular SVC model, Geographically Weighted Regression (GWR) (Brunsdon, Fotheringham, and Charlton 1996) and shown to out-perform Multiscale GWR.
However, when handling space and time, simply plugging all the space time data into specific GAM configuration is to make potentially unreasonable assumptions about how space and time interact in spatially and temporally varying coefficient (STVC) models. To address this workshop has suggested that the full set of models is investigated to identify the best model or the best set of models. Where there is a clear winner, this can be applied. Where these is not, as in the example used then the model coefficients can be combined using Bayesian Model Averaging (in the second vignette).