Suppose we are interested in the effect of declining unionization on the distribution of wages. One might be tempted to simply compare the distribution of wages of union- and non-union members in order to learn about this effect. This is problematic, however, since these two groups might be quite different in terms of their distribution of age, education, gender, ethnicity, sector of the economy, state of residence, etc. A better approach would thus compare people who look similar along all these dimensions and differ only in terms of their union membership. This is the basic idea behind distributional decompositions, as pioneered by DiNardo et al. (1996), underlying the discussion in Fortin and Lemieux (1997), and reviewed in Firpo et al. (2011). Distributional decompositions provide the answer to hypothetical questions such as the following: what if (i) the distribution of demographic covariates (age, gender,...) had stayed the same, (ii) the distribution of wages given demographics and union membership status had stayed the same, but (iii) we consider actual historical changes of union membership for different demographic groups — how, in this hypothetical scenario, would the distribution of wages have changed? Intuitively, such distributional decompositions provide an answer to the question: To what extent is de-unionization responsible for the rise in inequality?
Suppose we observe repeated cross-sections with i.i.d. draws from the time \(t\) distributions \(P^t\) of the variables \((Y,D,X)\). Here \(X\) denotes covariates such as age, education, and location. As in CHAPTER 5, \(D\) is a binary "treatment" variable such as union membership. The variable \(Y\) denotes an outcome such as real income.
We are interested in isolating the effect of historical changes in the prevalence of union membership \(D\) on the distribution \(P(Y)\) of incomes \(Y\), and in the effect of these historical changes on statistics of the income distribution, \(\nu(P(Y))\). Possible choices for the statistics \(\nu\) include the mean, the variance, the share below the poverty line, quantiles or the Gini coefficient.
Let \(P^1(Y, D, X)\) denote the joint distribution of \((Y,D,X)\) in period 1 (the year at the end of the historical period which we are considering), and \(P^0(Y, D, X)\) the corresponding distribution in period 0 (the year at the beginning of the historical period). Our goal is to identify the counterfactual distribution \(P^*\) of \(Y\) in which the effect of changing \(D\) is "undone," while holding constant the current (period 1) distribution of covariates \(X\) as well as the distribution of income \(Y\) given \(X\) and \(D\). The change from \(P^*\) to \(P^1\) will be interpreted as the causal effect of changing \(D\) on the income distribution. Formally, define \(P^*\) as
\[
P^*(Y \leq y) = \int \sum_{d} P^1(Y \leq y \mid X = x, D = d)\, P^0(D = d \mid X = x)\, dP^1(x). \tag{1}
\]
This expression asks us to consider the following scenario: take the share \(P^1(X)\) in the population of period 1 of 40 year old women with a high school degree living in the Midwest as given, and similarly for all other demographic groups. For each of these groups, however, suppose that their prevalence of union membership was the same as in period 0, \(P^0(D|X)\). Consider finally the distribution of incomes for each demographic group and each union-membership status of period 1, again, \(P^1(Y\leq y|X,D)\). Putting all groups together (formally: integrating out the distributions of \(X\) and \(D\)), we get the counterfactual income distribution \(P^*(Y\leq y)\).
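To illustrate with made-up numbers, suppose (for a single demographic group, so that \(X\) can be ignored) that union membership was \(30\%\) in period 0 and \(20\%\) in period 1. Then equation (1) mixes the period 1 conditional income distributions using the period 0 membership shares:
\[
P^*(Y \leq y) = 0.3 \cdot P^1(Y \leq y \mid D = 1) + 0.7 \cdot P^1(Y \leq y \mid D = 0).
\]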
We can rewrite the distribution \(P^*\) defined in equation (1) in a useful way, as follows. First, multiply and divide the integrand by \(P^1(D|X)\). Second, rewrite the probability \(P^1(Y\leq y|X,D)\) as an expectation, \(E^1[\mathbf{1}(Y\leq y)|X,D]\). Third, give the fraction \({P^0(D|X)}/{P^1(D|X)}\) a new name, \(\theta(D,X)\). Finally, pull \(\theta\) into the conditional expectation, and use the "law of iterated expectations" to get an unconditional expectation. Executing these steps yields
\[
P^*(Y \leq y) = E^1\!\left[\theta(D,X)\, \mathbf{1}(Y \leq y)\right], \tag{2}
\]
where
\[
\theta(D,X) = \frac{P^0(D \mid X)}{P^1(D \mid X)}. \tag{3}
\]
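For completeness, the algebra described in the preceding paragraph can be written out as
\[
\begin{aligned}
P^*(Y \leq y)
&= \int \sum_{d} E^1[\mathbf{1}(Y \leq y) \mid X = x, D = d]\, \frac{P^0(D = d \mid X = x)}{P^1(D = d \mid X = x)}\, P^1(D = d \mid X = x)\, dP^1(x) \\
&= \int E^1[\theta(D,X)\, \mathbf{1}(Y \leq y) \mid X = x]\, dP^1(x)
= E^1[\theta(D,X)\, \mathbf{1}(Y \leq y)].
\end{aligned}
\]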
In the exercises, you will be asked to do these calculations in some simple examples, to see that nothing very complicated is going on. Equation (2) states that \(P^*\) is a reweighted version of the current distribution, \(P^1\). Any counterfactual distributional characteristic \(\nu(P^*)\) can be estimated based on estimates of \(P^*\). Estimating \(P^*\) requires estimation of the ratio in equation (3).
Consider an individual who is a union member, \(D=1\), and has covariate values \(X=x\). Suppose that for this value of \(X\), union membership was more likely in period 0 than in period 1, so that \(P^0(D=1|X=x) > P^1(D=1|X=x)\). This implies that \(\theta(D,X) >1\) for this person: we should upweight that person's income to get the counterfactual income distribution \(P^*\), where union membership probabilities are assumed not to have changed over time. Consider another individual, who is also a union member, of a demographic group \(X=x\) for which union membership did increase over time. For this individual, equation (2) tells us to downweight her income to get the counterfactual distribution.
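Continuing the made-up numbers from above (membership of \(30\%\) in period 0 and \(20\%\) in period 1 for a given demographic group), equation (3) gives
\[
\theta(1, x) = \frac{0.3}{0.2} = 1.5, \qquad \theta(0, x) = \frac{0.7}{0.8} = 0.875,
\]
so union members in this group are upweighted and non-members are downweighted when forming \(P^*\).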
Our definition and discussion of counterfactual distributions was purely statistical, and did not make any reference to potential outcomes or causality. The counterfactual distribution \(P^*\) can however be interpreted causally under an assumption of conditional independence. Denote by \(Y^d\) the potential outcome (e.g., wage or income) of a person with treatment status \(D=d\); this is the same kind of potential outcome that we encountered in CHAPTER 5. In CHAPTER 5 we showed that differences in means can be interpreted as average treatment effects if treatment is independent of potential outcomes. A slightly more involved argument applies in the present context. Assume that treatment and potential outcomes are independent conditional on \(X\), that is,
\[
(Y^0, Y^1) \perp D \mid X \quad \text{under the period 1 distribution } P^1.
\]
Under this assumption, \(P^1(Y \leq y \mid X = x, D = d) = P^1(Y^d \leq y \mid X = x, D = d) = P^1(Y^d \leq y \mid X = x)\), so that equation (1) can be written as
\[
P^*(Y \leq y) = \int \sum_{d} P^1(Y^d \leq y \mid X = x)\, P^0(D = d \mid X = x)\, dP^1(x).
\]
In words, \(P^*\) is the distribution of incomes that would have prevailed in period 1 if treatment \(D\) had been assigned according to the period 0 conditional probabilities \(P^0(D|X)\), holding fixed the period 1 distribution of covariates and of potential outcomes given covariates.
Implementing an estimator of \(\nu(P^*)\) based on equation (2) involves two steps. First, we need to estimate the weight function \(\theta\). Second, we need to calculate \(\nu\) for the distribution \(P^1\) reweighted by the estimated \(\theta\).
In order to estimate \(\theta\), we need to estimate the ratio between \(P^1(D|X)\) and \(P^0(D|X)\), corresponding to the change in the prevalence of \(D\) within demographic groups defined by \(X\). Suppose first that \(X\) takes on only finitely many values \(x\). Then we can directly estimate these conditional probabilities by the corresponding shares in our sample,
\[
\hat P^t(D = d \mid X = x) = P^t_n(D = d \mid X = x) = \frac{P^t_n(D = d,\, X = x)}{P^t_n(X = x)},
\]
where we use the subscript \(n\) to denote sample shares.
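As an illustration, here is a minimal sketch in Python of this finite-\(X\) case: it estimates \(P^t_n(D|X)\) by sample shares, forms the weights \(\hat\theta\), and evaluates the counterfactual distribution of equation (2). The data frames df0 and df1 and the column names y, d, x are hypothetical placeholders for the period 0 and period 1 cross-sections.
\begin{verbatim}
import numpy as np
import pandas as pd

def dfl_weights_discrete(df0, df1):
    """Estimate theta(d, x) = P^0_n(d | x) / P^1_n(d | x) for discrete x."""
    p0 = df0.groupby("x")["d"].mean()   # P^0_n(D = 1 | X = x)
    p1 = df1.groupby("x")["d"].mean()   # P^1_n(D = 1 | X = x)
    p0_i = df1["x"].map(p0)             # shares evaluated at period 1 observations
    p1_i = df1["x"].map(p1)
    # theta for D = 1 is p0/p1; for D = 0 it is (1 - p0)/(1 - p1)
    return np.where(df1["d"] == 1, p0_i / p1_i, (1 - p0_i) / (1 - p1_i))

def counterfactual_cdf(df1, theta, y):
    """P^*(Y <= y) = E^1[ theta * 1(Y <= y) ], cf. equation (2)."""
    return np.mean(theta * (df1["y"] <= y))
\end{verbatim}
A counterfactual quantile, variance, or Gini coefficient can be computed analogously, by treating \(\hat\theta\) as weights on the period 1 observations.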
If \(X\) takes on many values, or has continuous components, we cannot do this anymore. In that case, however, we can use a logit model for the distribution of \(D\) given \(X\), with parameters \(\beta^t\) changing over time:
\[
P^t(D = 1 \mid X = x) = \frac{\exp(x' \beta^t)}{1 + \exp(x' \beta^t)}.
\]
Based on estimates \(\hat\beta^t\) of the parameters \(\beta^t\), we can estimate the weights \(\theta\) by
\[
\hat\theta(d, x) = \frac{\hat P^0(D = d \mid X = x)}{\hat P^1(D = d \mid X = x)},
\]
where \(\hat P^t(D = 1 \mid X = x) = \exp(x' \hat\beta^t) / (1 + \exp(x' \hat\beta^t))\) and \(\hat P^t(D = 0 \mid X = x) = 1 - \hat P^t(D = 1 \mid X = x)\).
The parameters \(\beta^t\) can be estimated using maximum likelihood, similarly to the estimators for the Pareto parameter that we came up with in CHAPTER 3:
\[
\hat\beta^t = \operatorname*{argmax}_{b} \; \prod_{i} \left( \frac{\exp(X_i' b)}{1 + \exp(X_i' b)} \right)^{D_i} \left( \frac{1}{1 + \exp(X_i' b)} \right)^{1 - D_i},
\]
where the product is taken over all time \(t\) observations. Implementations of this logit estimator are readily available in most statistical software packages.
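To make the full procedure concrete, here is a minimal sketch in Python using the logit estimator in statsmodels. As above, the data frames df0 and df1 and the column names are hypothetical; covariates is a list of numeric covariate columns (with any categorical variables already converted to dummies).
\begin{verbatim}
import numpy as np
import statsmodels.api as sm

def dfl_weights_logit(df0, df1, covariates):
    """theta(d, x) = P^0(d | x) / P^1(d | x), with each P^t(D = 1 | X) a logit."""
    X0 = sm.add_constant(df0[covariates])
    X1 = sm.add_constant(df1[covariates])
    fit0 = sm.Logit(df0["d"], X0).fit(disp=0)   # beta^0 by maximum likelihood
    fit1 = sm.Logit(df1["d"], X1).fit(disp=0)   # beta^1 by maximum likelihood
    # Fitted P^t(D = 1 | X), evaluated at the period 1 observations
    p0 = fit0.predict(X1)
    p1 = fit1.predict(X1)
    return np.where(df1["d"] == 1, p0 / p1, (1 - p0) / (1 - p1))

def weighted_variance(y, w):
    """Example statistic nu: variance of the reweighted (counterfactual) sample."""
    mean = np.average(y, weights=w)
    return np.average((y - mean) ** 2, weights=w)

# Usage sketch: counterfactual variance of income
# theta = dfl_weights_logit(df0, df1, ["age", "educ", "female"])
# nu_star = weighted_variance(df1["y"].to_numpy(), theta)
\end{verbatim}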