\documentclass[10pt]{article} %\VignetteIndexEntry{Distributed-lag linear structural equation models in R: the dlsem package} %\VignettePackage{dlsem} %\VignetteKeyword{distributed-lag linear recursive models} %\VignetteKeyword{constrained lag shapes} %\VignetteKeyword{dynamic causal effects} \usepackage{boxedminipage,color,a4wide,url} \usepackage[english]{babel} \usepackage[authoryear]{natbib} \usepackage{url} \usepackage{subfigure} \usepackage{multicol} \usepackage{multirow} \usepackage{rotating} \usepackage{dsfont,mathrsfs} \usepackage{listings,mdframed} \usepackage{caption} \usepackage[T1]{fontenc} \usepackage{epsfig,graphicx,amsbsy,amsfonts,amssymb,pifont,amsmath,mathcomp} \usepackage[latin1]{inputenc} \usepackage{hyperref} \usepackage{Sweave} \usepackage[a4paper,top=2.5cm,bottom=2.5cm,left=2.5cm,right=2.5cm]{geometry} \setkeys{Gin}{width=0.4\textwidth} \def\pkg#1{\texttt{#1}} \let\proglang=\textsf \let\code=\texttt \title{Distributed-lag linear structural equation models in \proglang{R}:\\the \pkg{dlsem} package} \author{Alessandro Magrini\\Dep. Statistics, Computer Science, Applications\\University of Florence, Italy\\} \date{\pkg{dlsem} version 2.4.6 -- 23 March 2020} \begin{document} \SweaveOpts{concordance=FALSE} %%\SweaveInput{Rmarkup.STY} %% ------------------------ %\definecolor{darkred}{rgb}{.7,0,0} %\definecolor{midnightblue}{rgb}{0.098,0.098,0.439} % %\DefineVerbatimEnvironment{Sinput}{Verbatim}{ % fontfamily=tt, % %%fontseries=b, % %% xleftmargin=2em, % formatcom={\color{midnightblue}} %} %\DefineVerbatimEnvironment{Soutput}{Verbatim}{ % fontfamily=tt, % %%fontseries=b, % %% xleftmargin=2em, % formatcom={\color{darkred}} %} %\DefineVerbatimEnvironment{Scode}{Verbatim}{ % fontfamily=tt, % %%fontseries=b, % %% xleftmargin=2em, % formatcom={\color{blue}} %} % %\fvset{listparameters={\setlength{\topsep}{-2pt}}} %\renewenvironment{Schunk}{\linespread{.90}}{} %% ------------------------ \maketitle \tableofcontents \parindent0pt\parskip5pt \section{Introduction} Structural causal models (SCMs, \citealp{pearl00}) are a mathematical framework describing the behaviour of a multivariate system, and represent one of the prevalent methodologies for causal inference in contemporary applied sciences. Markovian SCMs are a special case where the joint probability distribution of the considered variables can be factored according to a directed acyclic graph. % Distributed-lag linear structural equation models (DLSEMs) are Markovian SCMs, where each factor of the joint probability distribution is a distributed-lag linear regression with constrained lag shapes. They were firstly introduced in the context of lag exposure assessment \citep{magrini18}, then applied to impact assessment of research expenditure in Agriculture \citep{magrini19}. % DLSEMs account for temporal delays in the dependence relationships among the variables through a single parameter per covariate, thus allowing to perform dynamic causal inference in a feasible fashion. %As such, they are suitable to investigate the effect of an external impulse on a multivariate system through time. %Econometrics and epidemiology are two of the main fields of application for DLSEMs. Package \pkg{dlsem} implements inference functionalities for DLSEMs with several types of constrained lag shapes \citep{magrini20}. Currently, endpoint-constrained quadratic ('ecq'), quadratic decreasing ('qd'), linearly decreasing ('ld') and gamma ('gam') lag shapes are available. This vignette is structured as follows. In Section \ref{sec:theory}, theory on the DLSEM is presented. In Section \ref{sec:install}, instructions for the installation of the \pkg{dlsem} package are provided. In Section \ref{sec:example}, the practical use of \pkg{dlsem} is illustrated through a simple impact assessment problem. %Section \ref{sec:remarks} includes final remarks and considerations on future development of the package. \section{Theory} \label{sec:theory} \paragraph{Distributed-lag linear regression} % Let $Y$ be a response variable and $X_1,\ldots,X_p$ be the covariates, with $y_t$ and $x_{jt}$, respectively, the value of $Y$ and of $X_j$ at time $t$. Under the hypothesis that time is discrete and that both $Y$ and $X_1,\ldots,X_p$ are stationary time series, lagged instances of one or more covariates may be included in the linear regression model to account for temporal delays in their influence on the response: \begin{equation}\label{eq:uncons_dllm} y_t = \beta_0+\sum_{j=1}^J\sum_{l=0}^{L_j}\beta_{j,l}~x_{j,t-l}+\epsilon_t %\hspace{1cm} \epsilon_t \sim \text{N}(0,\sigma^2) \end{equation} where $x_{j,t-l}$ is the value of the $j$-th covariate at $l$ time lags before $t$, and $\epsilon_t$ is the random error at time $t$ uncorrelated with the covariates and with $\epsilon_k$, $\forall k\not=t$. The set $(\beta_{j,0},\beta_{j,1},\ldots,\beta_{j,L_j})$ is denoted as the \textit{lag shape} of the $j$-th covariate and represents its regression coefficient (in the remainder, simply `coefficient') at different time lags. Least squares can be used to consistently estimate the lag shapes \footnote{ % Note that stationarity of the time series is required to obtain consistent least squares estimation \citep{granger74}. If some of them contains a stochastic trend (unit root), a reasonable procedure is to sequentially apply differencing to all variables until they are made stationary according to a unit root test. Sequential testing is required for interpretation, as it allows to have the same order of differencing for all variables. }, but, since time series data are likely to be serially correlated and heteroskedastic, a good practice is to apply the Heteroskedasticity and Autocorrelation Consistent (HAC) correction for the covariance matrix of least squares estimators \citep{newey78}. The model in Formula \ref{eq:uncons_dllm} has the disadvantage that a parameter is required for each lagged instance of a covariate, and lagged instances of the same covariate tend to be highly correlated. The Almon's polynomial lag shape \citep{almon65} overcomes these limitations by forcing the coefficients for lagged instances of the same covariate to follow a polynomial of order $Q$: \begin{equation}\label{eq:almon} \beta_{j,l} = \begin{cases} \phi_{j,0} & l=0\\ \sum_{q=0}^Q \phi_{j,q} l^q & \text{otherwise} \end{cases} \end{equation} For instance, for $q=2$ we have that $\beta_{j,l} =\phi_{j,0}+\phi_{j,1} l+\phi_{j,2} l^2$. Unfortunately, the Almon's polynomial lag shape may show multiple modes and coefficients with different signs, thus entailing problems of interpretation. Type II constrained lag shapes \citep{magrini20} overcome this issue. They include %Package \pkg{dlsem} includes the \textit{endpoint-constrained quadratic} lag shape: \begin{equation}\label{eq:quec} \beta_{j,l} = \begin{cases} \theta_j \left[-\frac{4}{(b_j-a_j+2)^2} l^2+\frac{4(a_j+b_j)}{(b_j-a_j+2)^2} l-\frac{4(a_j-1)(b_j+1)}{(b_j-a_j+2)^2} \right] & a_j \leq l \leq b_j\\ 0 & \text{otherwise} \end{cases} \end{equation} the \textit{quadratic decreasing} lag shape: \begin{equation}\label{eq:qdec} \beta_{j,l} = \begin{cases} \theta_j \frac{l^2-2(b_j+1)l+(b_j+1)^2}{(b_j-a_j+1)^2} & a_j \leq l \leq b_j\\ 0 & \text{otherwise} \end{cases} \end{equation} the \textit{linearly decreasing} lag shape: \begin{equation}\label{eq:ldec} \beta_{j,l} = \begin{cases} \theta_j \frac{b_j+1-l}{b_j+1-a_j} & a_j \leq l \leq b_j\\ 0 & \text{otherwise} \end{cases} \end{equation} and the \textit{gamma} lag shape: \begin{equation}\label{eq:gamma} \begin{gathered} \beta_{j,l} = \theta_j (l+1)^\frac{a_j}{1-a_j}b_j^l \left[\left(\frac{a_j}{(a_j-1)\hspace{0.04cm}\text{log}b_j}\right)^\frac{a_j}{1-a_j}b_j^{\frac{a_j}{(a_j-1)\hspace{0.02cm}\text{log}b_j}-1}\right]^{-1}\\ 0b_j$, and symmetric with mode equal to $\theta_j$ at lag $(a_j+b_j)/2$. The quadratic decreasing lag shape decreases from value $\theta_j$ at lag $a_j$ to value $0$ at lag $b_j+1$ according to a quadratic function. The linearly decreasing lag shape is a linear version of the quadratic one. The gamma lag shape is positively skewed with mode equal to $\theta_j$ at lag $\frac{a_j}{(a_j-1)\hspace{0.04cm}\text{log}b_j}$. For the endpoint-constrained quadratic, quadratic decreasing and linearly decreasing lag shapes, $a_j$ represents the \textit{gestation lag}, $b_j$ the \textit{lead lag}, and $b_j-a_j$ the \textit{lag width} (a static lag shape is obtained if $a_j=b_j=0$). Gestation lag, lead lag and lag width are not explicit in a gamma lag shape, but they can be approximated numerically from parameters $a_j$ and $b_j$. % For these constrained lag shapes, it holds: \begin{equation}\label{eq:sign} \begin{gathered} \beta_{j,l}>0 \Longleftrightarrow \theta_j>0\\ \beta_{j,l}<0 \Longleftrightarrow \theta_j<0 \end{gathered} \hspace{1.2cm} \forall l:~a_j \leq l \leq b_j \end{equation} and we refer to the \textit{lag sign} as the sign of parameter $\theta_j$. A linear regression with these constrained lag shapes is linear in parameters $\beta_0,\theta_1,\ldots,$ $\theta_J$, provided that the values of $a_1,\ldots,a_J,b_1,\ldots,b_J$ are known. Thus, one may fit %use ordinary least squares to estimate parameters $\beta_0,\theta_1,\ldots,\theta_J$ for several regressions with different values of $a_1,\ldots,a_J,b_1,\ldots,b_J$, and select the one with the minimum residual sum of squares (see \citealp{magrini20} for details). \paragraph{Structural causal models} % Structural causal models (SCMs) were developed by \cite{pearl00} in the context of causal inference. They are rooted to path analysis \citep{wright34} and simultaneous equation models \citep{haavelmo43,koopmans50}. A SCM consists of a tuple $\{\boldsymbol{V},\boldsymbol{U},\Omega_{\boldsymbol{V}},\Omega_{\boldsymbol{U}}, \boldsymbol{f},\mathbb{P}_{\boldsymbol{U}}\}$, where: \begin{itemize} \item $\boldsymbol{V}=\{V_1,\ldots,V_J\}$ is a set of endogenous variables; \item $\Omega_{\boldsymbol{V}}=\Omega_{V_1}\times\ldots\times\Omega_{V_J}$ is the cartesian product of the domains of variables in $\boldsymbol{V}$; \item $\boldsymbol{U}=\{U_1,\ldots,U_K\}$ is a set of unobserved variables; \item $\Omega_{\boldsymbol{U}}=\Omega_{U_1}\times\ldots\times\Omega_{U_K}$ is the cartesian product of the domains of variables in $\boldsymbol{U}$; \item $\boldsymbol{f}:~\Omega_{\boldsymbol{V}}\times\Omega_{\boldsymbol{U}}\longrightarrow\Omega_{\boldsymbol{V}}$ is a measurable function; \item $\mathbb{P}_{\boldsymbol{U}}$ is a probability measure on $\Omega_{\boldsymbol{U}}$. \end{itemize} Markovian SCMs \citep[Chapter 3]{pearl00} are a special case where $\boldsymbol{f}$ is acyclic and variables in $\boldsymbol{U}$ are each other independent. In a Markovian SCM, the following factorization of the joint probability distribution of variables in $\boldsymbol{V}$ holds: \begin{equation}\label{eq:markov} p(v_1,\dots,v_J)=\prod_{j=1}^J p(v_j \mid \Pi_j=\pi_j) \end{equation} where $\Pi_j$ is the set of variables in $\boldsymbol{V}$ such that, for $j>1$, $V_j$ is independent of variables in $\{V_1,\ldots,V_{j-1}\}\setminus\Pi_j$, given variables in $\Pi_j$. This means that the joint probability distribution of variables in $\boldsymbol{V}$ can be factored according to conditional independence relationships holding among them disregarding variables in $\boldsymbol{U}$. \citet[pages 12 and following]{pearl00} shows that these conditional independence relationships are encoded into a directed acyclic graph (DAG) such that $\Pi_j$ is the parent set of $V_j$, $\forall~j=1,\ldots,J$. For example, in the Markovian SCM associated to the DAG in Figure \ref{fig:dagsam}, it holds: \begin{equation}\label{eq:dag} p(v_1,v_2,v_3,v_4)=p(v_1)~p(v_2 \mid v_1)~p(v_3 \mid v_1)~p(v_4 \mid v_2,v_3) \end{equation} and, for example, $V_4$ is independent of $V_1$ given $V_2$ and $V_3$. \begin{figure}[!h] \centering \includegraphics[width=0.3\textwidth]{figures/sampledag.png} \caption{An example of directed acyclic graph.} \label{fig:dagsam} \end{figure} Let $\text{do}(V_i=v_i)$ denote an intervention setting the value of $V_i$ to $v_i$. Then, in a Markovian SCM it holds: \begin{equation}\label{eq:trunc} p(v_1,\dots,v_J \mid \text{do}(V_i=v_i))=\prod_{j\not=i} p(v_j \mid \pi_j)\left.\right|_{V_i=v_i} \end{equation} where $\left.\right|_{V_i=v_i}$ indicates that $p(v_i \mid \pi_i)$ is replaced by value $v_i$. This formula, called \textit{truncated factorization} \citep[Section 3.2]{pearl00}, allows to compute the effect of an intervention from the (pre-intervention) distribution in Formula \ref{eq:markov}, that is to predict such effect from non-experimental (observational) data. In a Markovian SCM, the effect of $\text{do}(V_i=v_i)$ on $V_j$, called \textit{causal effect} of $V_i$ on $V_j$, is given by the following expression (see \citealp[page 70 and following]{pearl00}): \begin{equation}\label{eq:markov_ce} p(V_j=v_j \mid \text{do}(V_i=v_i))=\sum_{\pi_i} p(V_j=v_j \mid V_i=v_i,\Pi_i=\pi_i)p(\Pi_i=\pi_i) \end{equation} where $\Pi_i$ is the parent set of $V_i$. In a linear parametric formulation of SCMs (linear Markovian SCMs), each factor $p(v_j \mid \pi_j)$ of the joint probability distribution in Formula \ref{eq:markov} is the linear regression where $V_j$ is the response variable and variables in $\Pi_j$ are the covariates. For example, in the linear Markovian SCM associated to the DAG in Figure \ref{fig:dagsam}, $p(v_4 \mid v_2,v_3)$ is the linear regression where $V_4$ is the response variable and $V_2$ and $V_3$ are the covariates. As shown by \citealp{magrini18}, the computation of causal effects in a linear Markovian SCM involves the coefficients of the regressions only, without the need of Formula \ref{eq:markov_ce}. Let $\text{do}(a V_i=1)$ be an intervention changing the value of $V_i$ by a unit. Under such intervention: \begin{itemize} \item if $V_i$ is parent of $V_j$, the \textit{direct} causal effect of $V_i$ on $V_j$ is equal to the coefficient of $V_i$ in the regression of $V_j$; \item the causal effect of $V_i$ on $V_j$ through a multi-edge directed path $$ connecting $V_i$ to $V_j$, called \textit{indirect} causal effect of $V_i$ on $V_j$ through $$, is equal to (see, for example, \citealp{wright34}): \begin{equation} e()=\prod_{k:~V_k \in \wedge k\not=i} \beta_{k|k-1} \end{equation} where $\beta_{k|k-1}$ is the coefficient of $V_{k-1}$ in the regression of $V_k$. The direct causal effect of $V_i$ on $V_j$ and all the indirect causal effects of $V_i$ on $V_j$ are denoted as \textit{pathwise} causal effects of $V_i$ on $V_j$; \item the \textit{overall} causal effect of $V_i$ on $V_j$ is equal to the sum of all the pathwise causal effects of $V_i$ on $V_j$. \end{itemize} For example, in the linear Markovian SCM associated to the DAG in Figure \ref{fig:dagsam}, there are two directed paths connecting $V_1$ to $V_4$: $$ with pathwise causal effect $\beta_{2|1}\cdot \beta_{4|2}$, and $$ with pathwise causal effect $\beta_{3|1}\cdot \beta_{4|3}$. Thus, the overall causal effect of $V_1$ on $V_4$ is equal to $\beta_{2|1}\cdot \beta_{4|2}+\beta_{3|1}\cdot \beta_{4|3}$. \paragraph{Distributed-lag linear structural equation models} % Distributed-lag linear structural equation models (DLSEMs) are Markovian SCMs where each factor of the joint probability distribution in Formula \ref{eq:markov} is a distributed-lag linear regression with constrained lag shapes. They were firstly introduced in the context of lag exposure assessment \citep{magrini18}, then applied to impact assessment of research expenditure in Agriculture \citep{magrini19}. % The DAG of a DLSEM would involve all the possible temporal instances of each variable in $\boldsymbol{V}$. Here, for simplicity, a static DAG is still used for a DLSEM, where the edge $$ exists if and only if there exists at least one time lag where the coefficient of variable $V_i$ in the regression of variable $V_j$ is non-zero. Causal effects at different time lags in a DLSEM are defined as follows: \begin{itemize} \item if $V_i$ is parent of $V_j$, the \textit{direct} causal effect of $V_i$ on $V_j$ at lag $l$ is equal to the coefficient of $V_i$ at lag $l$ in the regression of $V_j$; \item Let $$, $d_0=i$ and $d_m=j$, be a directed path composed of $m$ edges connecting $V_i$ to $V_j$, and $Q_m^{(l)}$ be the set of all the possible ordered $m$-uples of time lags such that their sum is equal to $l$. The \textit{indirect} causal effect of $V_i$ on $V_j$ through such path at lag $l$ is equal to: \begin{equation}\label{eq:dlsem_indirect} e(;d_0=i,d_m=j)=\sum_{(q_1,\ldots,q_m)\in Q_m^{(l)}}\hspace{0.3cm}\prod_{k=1}^m b_{d_k|d_{k-1},q_k} \end{equation} where $b_{d_k|d_{k-1},q_k}$ is the coefficient of $V_{d_{k-1}}$ at lag $q_k$ in the regression of $V_{d_k}$; \item the \textit{overall} causal effect of $V_i$ on $V_j$ at lag $l$ is equal to the sum of all the pathwise causal effects of $V_i$ on $V_j$ at lag $l$. \end{itemize} A \textit{pathwise causal lag shape} is the set of causal effects associated to a path at different time lags. An \textit{overall causal lag shape} is the set of the overall causal effects of a variable on another one at different time lags. \section{Installation} \label{sec:install} Before installing \pkg{dlsem}, you must have installed \proglang{R} version 3.5.0 or higher, which is freely available at \url{http://www.r-project.org/}. To install the \pkg{dlsem} package, type the following in the \proglang{R} command prompt: @ <>= install.packages("dlsem") @ and \proglang{R} will automatically install the package to your system from CRAN. In order to keep your copy of \pkg{dlsem} up to date, use the command: @ <>= update.packages("dlsem") @ The latest version of \pkg{dlsem} is 2.4.6. \section{Illustrative example} \label{sec:example} The practical use of package \pkg{dlsem} is illustrated through a simple impact assessment problem denoted as ``industrial development problem''. The objective is to test whether the influence through time of the number job positions in industry (proxy of the industrial development) on the amount of greenhouse gas emissions (proxy of pollution) is direct and/or mediated by the amount of private consumption. The DAG for the industrial development problem is shown in Figure \ref{fig:dag0}. The analysis will be conducted on the dataset \code{industry}, containing simulated data for $10$ imaginary regions in the period 1983-2015. \begin{figure}[!h] \centering \includegraphics[width=0.4\textwidth]{figures/dag0.pdf} \caption{The DAG for the industrial development problem. `Job': number of job positions in industry. `Consum': private consumption index. `Pollution': amount of greenhouse gas emissions.} \label{fig:dag0} \end{figure} @ <>= require(dlsem) @ \small @ <>= data(industry) summary(industry) @ \normalsize \subsection{Specification of the model code} \label{sub:example1} The first step to build a DLSEM with the \pkg{dlsem} package is the definition of the model code, which includes the formal specification of the regressions. The variables for which a regression is specified are called \textit{endogenous} variables. The other variables are referred as \textit{exogenous} variables (not to be confused with the unobserved disturbances). The model code must be a list of formulas, one for each regression. In each formula, the response and the covariates must be quantitative variables\footnote{ % Qualitative variables may be included only as exogenous variables, as described in Subsection \ref{sub:example3}. % }, and operators \code{ecq}$(\cdot)$, \code{qd}$(\cdot)$, \code{ld}$(\cdot)$ and \code{gam}$(\cdot)$\footnote{ % The operators \code{ecq}$(\cdot)$, \code{qd}$(\cdot)$ and \code{gam}$(\cdot)$ replace the old operators \code{quec.lag}$(\cdot)$, \code{qdec.lag}$(\cdot)$ and \code{gamma.lag}$(\cdot)$. If an old operator is employed, it is automatically replaced by the new one and a warning is returned. % } may be employed to specify, respectively, an endpoint-constrained quadratic, a quadratic decreasing, a linearly decreasing or a gamma lag shape. Operators \code{ecq}$(\cdot)$, \code{qd}$(\cdot)$, \code{ld}$(\cdot)$ and \code{gam}$(\cdot)$ have three arguments: the name of the covariate to which the lag shape is applied, and the two shape parameters $a_j$ and $b_j$ (see \citealp{magrini20} for details). If none of these two operators is applied to a covariate, it is assumed that its coefficient is equal to $0$ for time lags greater than $0$ (no lag shape). The group factor and exogenous variables must not appear in the model code (see Subsection \ref{sub:example3} for the way to include them). The specification of regressions with no endogenous covariates may be omitted from the model code (for example, one could avoid to specify the regression for the number of job positions). In this problem, all lag shapes are assumed to be endpoint-constrained quadratic lag shapes between 0 and 15 time lags: @ <>= indus.code <- list( Job ~ 1, Consum~ecq(Job,0,15), Pollution~ecq(Job,0,15)+ecq(Consum,0,15) ) @ \subsection{Specification of control options} \label{sub:example2} The second step to build a DLSEM with the \pkg{dlsem} package is the specification of control options. Control options are distinguished into global (applied to all the regressions) and local (regression-specific) options. Global control options must be a named list with one or more of the following components: \begin{itemize} \item \code{adapt}: a logical value indicating if adaptation of lag shapes must be performed, that is parameters of lag shapes must be chosen on the basis of fit to data. Default is \code{FALSE}, meaning no adaptation; \item \code{min.gestation}: the minimum gestation lag for all lag shapes. If not provided, it is taken as equal to 0; \item \code{max.gestation}: the maximum gestation lag for all lag shapes. If not provided, it is taken as equal to \code{max.lead} (see below); \item \code{max.lead}: the maximum lead lag for all lag shapes. If not provided, it is computed accordingly to the sample size; \item \code{min.width}: the minimum lag width for all lag shapes. It cannot be greater than \code{max.lead}. If not provided, it is taken as 0; \item \code{sign}: the lag sign for all lag shapes, that may be either \code{'+'} for positive or \code{'-'} for negative. If not provided, adaptation will disregard the lag sign. \end{itemize} Local control options must be a named list containing one or more among the following components: \begin{itemize} \item \code{adapt}: a named vector of logical values, where each component must have the name of one endogenous variable and indicate if adaptation of lag shapes must be performed for the regression of that variable; \item \code{min.gestation}: a named list. Each component of the list must have the name of one endogenous variable and be a named vector. Each component of the named vector must have the name of one covariate in the regression of the endogenous variable above and include the minimum gestation lag for its lag shape; \item \code{max.gestation}: the same as \code{min.gestation}, with the exception that the named vector must include the maximum gestation lag; \item \code{max.lead}: the same as \code{min.gestation}, with the exception that the named vector must include the maximum lead lag; \item \code{min.width}: the same as \code{min.gestation}, with the exception that the named vector must include the minimum lag width; \item \code{sign}: the same as \code{min.gestation}, with the exception that the named vector must include the lag sign (either \code{'+'} for positive or \code{'-'} for negative). \end{itemize} Local control options have no default values, and global ones are applied in their absence. If some local control options conflict with global ones, only the former are applied. Suppose that one wants to perform adaptation with the following constraints for all lag shapes: (i) maximum gestation lag of 3 years, (ii) maximum lead lag of 15 years, (iii) minimum lag width of 5 years, (iv) positive lag sign. Control options for these constraints may be expressed in several ways. The most simple solution is to specify only global control options, as the constraints hold for all the regressions: @ <>= indus.global <- list(adapt=T,max.gestation=3,max.lead=15,min.width=5,sign="+") indus.local <- list() @ In alternative, one may specify only local control options, by repeating them for each regression: @ <>= indus.global <- list() indus.local <- list( adapt=c(Consum=T,Pollution=T), max.gestation=list(Consum=c(Job=3),Pollution=c(Job=3,Consum=3)), max.lead=list(Consum=c(Job=15),Pollution=c(Job=15,Consum=15)), min.width=list(Consum=c(Job=5),Pollution=c(Job=5,Consum=5)), sign=list(Consum=c(Job="+"),Pollution=c(Job="+",Consum="+")) ) @ or both local and global control options: <>= indus.global <- list(adapt=T,min.width=5) indus.local <- list( max.gestation=list(Consum=c(Job=3),Pollution=c(Job=3,Consum=3)), max.lead=list(Consum=c(Job=15),Pollution=c(Job=15,Consum=15)), sign=list(Consum=c(Job="+"),Pollution=c(Job="+",Consum="+")) ) @ \subsection{Parameter estimation} \label{sub:example3} Once the model code and control options are specified, parameter estimation can be performed using the command \code{dlsem}$(\cdot)$. The user may indicate a single group factor (just one) to argument \code{group} and one or more exogenous variables to argument \code{exogenous}. By indicating the group factor, one intercept for each level of the group factor will be estimated in each regression, in order to explain the variability due to differences between groups. By indicating exogenous variables, they will be included as non-lagged covariates in each regression, in order to eliminate cross-sectional spurious effects. Each exogenous variable may be either qualitative or quantitative and its coefficient in each regression is $0$ for time lags greater than $0$ (no lag shape). The user may decide to apply the logarithmic transformation to all strictly positive quantitative variables by setting argument \code{log} to \code{TRUE}, in order to interpret each coefficient as an elasticity (percentage increase in the value of the response variable for $1\%$ increase in the value of a covariate). Before parameter estimation, differencing is sequentially applied until all the time series are made stationary. By default, the Kwiatkowski-Phillips-Schmidt-Shin (KPSS, \citealp{kwiatkowski92}) test is performed if the number of periods is less than 100, otherwise the Augmented Dickey-Fuller (ADF, \citealp{dickey81}) test is used. \footnote{ % If the group factor is specified, group-specific p-values are combined according to the method proposed by \citep{demetrescu06}. % }, and each missing value is replaced by its conditional mean computed through the Expectation-Maximization algorithm \citep{dempster77}\footnote{ % %The computation is performed after eventual logarithmic transformation and differencing, assuming group-specific means and time-invariant covariance matrix. Qualitative variables cannot contain missing values. %Each quantitative variable must have at least 3 observed values if the group factor is not specified, otherwise it must have at least 3 observed values per group. % }. The HAC correction of the covariance matrix of least squares estimators is applied by default (it can be disabled by setting argument \code{hac} to \code{FALSE}). In this problem, the region is indicated as the group factor, while population and gross domestic product are indicated as exogenous variables. Also, the logarithmic transformation is requested, and global and local control options are provided to arguments \code{global.control} and \code{local.control},respectively: @ <>= indus.mod <- dlsem(indus.code,group="Region",exogenous=c("Population","GDP"), data=industry,global.control=indus.global,local.control=indus.local,log=T) @ The results of command \code{dlsem}($\cdot$) is an object of class \code{dlsem}. Among the components of \code{dlsem} objects, we found: \begin{itemize} \item \code{estimate}: a list of objects of class \code{lm}, one for each regression; \item \code{call}: a list containing the call for each regression after eventual adaptation of lag shapes; \item \code{model.code}: the model code after eventual adaptation of lag shapes; \item \code{data}: data after eventual logarithmic transformation and differencing, which were used in the estimation. \end{itemize} The \code{summary} method for class \code{dlsem} returns the summary of the estimation: \small @ <>= summary(indus.mod) @ \normalsize ~\\ We see that the number of job positions in industry (\code{Job}) significantly influences, on one hand, the amount of private consumption (\code{Consum}) from 0 to 4 time lags and, on the other hand, the amount of greenhouse gas emissions (\code{Pollution}) from 2 to 6 time lags, while the amount of private consumption (\code{Consum}) significantly influences the amount of greenhouse gas emissions (\code{Pollution}) from 1 to 5 time lags. This result provides evidence that the influence of industrial development on pollution is both direct and mediated by private consumption. The \code{plot} method for class \code{dlsem} displays the DAG of the model where each edge is coloured with respect to the sign of the estimated causal effect (green: positive, red: negative, light gray: not statistically significant): @ <>= plot(indus.mod) @ \begin{figure}[!h] \centering \includegraphics[width=0.4\textwidth]{figures/dag1.pdf} \caption{The DAG where each edge is coloured with respect to the sign of the estimated causal effect. Green: positive causal effect. Red: negative causal effect. Grey: not statistically significant causal effect (no such edges here).} \label{fig:dag1} \end{figure} The result is shown in Figure \ref{fig:dag1}. Note that the DAG includes only the endogenous variables. \subsection{Assessment of causal effects} \label{sub:example4} After parameter estimation is performed by means of command \code{dlsem}$(\cdot)$, the command \code{causalEff}$(\cdot)$ can be used on the resulting object of class \code{dlsem} to compute all the pathwise causal lag shapes and the overall one connecting two variables. The main arguments of command \code{causalEff}$(\cdot)$ include the name of one or more variables generating the causal effect (argument \code{from}), and the name of the variable receiving the causal effect (argument \code{to}). Optionally, specific time to which computation should be focused may be provided to argument \code{lag}, otherwise the whole lag shapes will be considered. Cumulative causal effects may be returned by setting the argument \code{cumul} to \code{TRUE}. % Only exogenous variables can be indicated as starting or ending variables. Note that, due to the properties of the multiple linear regression model, causal effects are net of the influence of the group factor and exogenous variables. The cumulative causal effect of the number of job positions on the amount of greenhouse gas emissions may be obtained by means of the following code: \small @ <>= causalEff(indus.mod,from="Job",to="Pollution",cumul=T) @ \normalsize ~\\ The output of command \code{causalEff}$(\cdot)$ is a list of matrices including point estimates and asymptotic confidence intervals for all the pathwise causal lag shapes and the overall one connecting the starting variables to the ending variable. % Since the logarithmic transformation was applied to all quantitative variables, the resulting causal effects are interpreted as elasticities, that is, for a $1\%$ of job positions more, greenhouse gas emissions are expected to grow by $0.61\%$ after 5 years and by $1.11\%$ after 10 years. The influence ends after 11 years, as the cumulative causal effects at 11 and 12 years are equal. ~\\ A pathwise or an overall causal lag shape can be displayed using the command \code{lagPlot}$(\cdot)$. For instance, one may display the causal lag shape associated to each path connecting the number of job positions to the amount of greenhouse gas emissions: @ <>= lagPlot(indus.mod,path="Job*Pollution") lagPlot(indus.mod,path="Job*Consum*Pollution") @ or the overall causal lag shape of the number of job positions on the amount of greenhouse gas emissions: @ <>= lagPlot(indus.mod,from="Job",to="Pollution") @ The resulting graphics are shown in Figure \ref{fig:lagsh}. Note that a multi-edge pathwise causal lag shape is a mixture of different lag shapes, thus it may show an irregular aspect, like it is the case of the overall causal lag shape displayed in the lower panel of Figure \ref{fig:lagsh}. \begin{figure}[!h] \centering {\includegraphics[width=0.48\textwidth]{figures/lagsh1a.pdf}}\quad {\includegraphics[width=0.48\textwidth]{figures/lagsh1b.pdf}}\\ {\includegraphics[width=0.48\textwidth]{figures/lagsh1c.pdf}} \caption{The pathwise causal lag shapes (upper panels) and the overall one (lower panel) connecting the number of job positions to the amount of greenhouse gas emissions. $95\%$ asymptotic confidence intervals are shown in grey.} \label{fig:lagsh} \end{figure} \subsection{Comparison among alternative models} \label{sub:example5} We now fit two alternative models for the industrial development problem, such that all lag shapes are quadratic decreasing and gamma lag shapes, respectively. \small @ <>= # model 2: quadratic decreasing lag shapes indus.code_2 <- list( Job ~ 1, Consum~qd(Job,0,15), Pollution~qd(Job,0,15)+qd(Consum,0,15) ) indus.mod_2 <- dlsem(indus.code_2,group="Region",exogenous=c("Population","GDP"), data=industry,global.control=indus.global,local.control=indus.local,log=T,quiet=T) summary(indus.mod_2)$endogenous # model 3: linearly decreasing lag shapes indus.code_3 <- list( Job ~ 1, Consum~ld(Job,0.5,0.5), Pollution~ld(Job,0.5,0.5)+ld(Consum,0.5,0.5) ) indus.mod_3 <- dlsem(indus.code_3,group="Region",exogenous=c("Population","GDP"), data=industry,global.control=indus.global,local.control=indus.local,log=T,quiet=T) summary(indus.mod_3)$endogenous # model 4: gamma lag shapes indus.code_4 <- list( Job ~ 1, Consum~gam(Job,0.5,0.5), Pollution~gam(Job,0.5,0.5)+gam(Consum,0.5,0.5) ) indus.mod_4 <- dlsem(indus.code_4,group="Region",exogenous=c("Population","GDP"), data=industry,global.control=indus.global,local.control=indus.local,log=T,quiet=T) summary(indus.mod_4)$endogenous @ \normalsize Here the option \code{quiet} was set to \code{TRUE} to suppress messages on the estimation progress. We see that the four models provide different results. Method \code{compareModels} can be used to compare them according to information criteria: \small @ <>= compareModels(list(indus.mod,indus.mod_2,indus.mod_3, indus.mod_4)) @ \normalsize The model with endpoint-constrained quadratic lag shapes has the lowest value of each information criterion, and thus the best fit to data. Note that information criteria for variable \code{Job} are the same in each model because it has no endogenous covariates. \section{Final remarks} \label{sec:remarks} Lag shapes included in the package may represent a large number of real-world lag structures, nevertheless new lag shapes with further specific features may be added in the future. The same holds for further functionalities for linear models for time series data. Please, do not hesitate to contact me for questions, feedbacks or bug reports. \pdfbookmark{References}{sec:refs} \bibstyle{plain} \begin{thebibliography}{9} %\bibitem[Akaike(1974)]{akaike74} H. Akaike (1974). A New Look at the Statistical Identification Model. %\textit{IEEE Transactions on Automatic Control}, 19: 716-723. \bibitem[Almon(1965)]{almon65} S. Almon (1965). The Distributed Lag between Capital Appropriations and Net Expenditures. \textit{Econometrica}, 33, 178-196. \bibitem[Demetrescu \textit{et al.}(2006)]{demetrescu06} M. Demetrescu, U. Hassler, and A. Tarcolea. Combining Significance of Correlated Statistics with Application to Panel Data. \textit{Oxford Bulletin of Economics and Statistics}, 68(5), 647-663. \bibitem[Dempster \textit{et al.}(1977)]{dempster77} A. P. Dempster, N. M. Laird, and D. B. Rubin (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. \textit{Journal of the Royal Statistical Society}, Series B, 39(1): 1-38. \bibitem[Dickey and Fuller(1981)]{dickey81} D. A. Dickey, and W. A. Fuller (1981). Likelihood Ratio Statistics for Autoregressive Time Series with a Unit Root. \textit{Econometrica},49: 1057-1072. \bibitem[Granger and Newbold(1974)]{granger74} C. W. J. Granger, and P. Newbold (1974). Spurious Regressions in Econometrics. \textit{Journal of Econometrics}, 2(2), 111-120. \bibitem[Haavelmo(1943)]{haavelmo43} T. Haavelmo (1943). The Statistical Implications of a System of Simultaneous Equations. \textit{Econometrica}, 11(1): 1-12. \bibitem[Koopmans \textit{et al.}(1950)]{koopmans50} T. C. Koopmans, H. Rubin, and R. B. Leipnik (1950). Measuring the Equation Systems of Dynamic Economics. In: T. C. Koopmans (ed.), Statistical Inference in Dynamic Economic Models, pages 53-237. John Wiley \& Sons, New York, US-NY. \bibitem[Kwiatkowski \textit{et al.}(1950)]{kwiatkowski92} D. Kwiatkowski, P. C. B. Phillips, P. Schmidt and Y. Shin (1992). Testing the Null Hypothesis of Stationarity against the Alternative of a Unit Root. \textit{Journal of Econometrics}, 54(1-3): 159-178. \bibitem[Magrini (2020)]{magrini20} A. Magrini (2020). A Family of Theory-Based Lag Shapes for Distributed-Lag Linear Regression. To be appeared on \emph{Italian Journal of Applied Statistics}. \bibitem[Magrini \textit{et al.}(2019)]{magrini19} A. Magrini, F. Bartolini, A. Coli, B. Pacini (2019). A Structural Equation Model to Assess the Impact of Agricultural Research Expenditure on Multiple Dimensions. \emph{Quality and Quantity}, 53(4): 2063-2080. \bibitem[Magrini(2018)]{magrini18} A. Magrini (2018). Linear Markovian Models for Lag Exposure Assessment. \textit{Biometrical Letters}, 55(2): 179-195. \bibitem[Newey \& West(1978)]{newey78} W. K. Newey, and K. D. West (1978). A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. \textit{Econometrica}, 55(3), 703-708. \bibitem[Pearl(2000)]{pearl00} J. Pearl (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press. Cambridge, UK. %\bibitem[Schwarz(1978)]{schwarz78} G. Schwarz (1978). %Estimating the Dimension of a Model. %\textit{Annals of Statistics}, 6, 461-464. \bibitem[Wright(1934)]{wright34} S. Wright (1934). The Method of Path Coefficients. Annals of Mathematical Statistics, 5(3): 161-215. \end{thebibliography} \end{document}