When Can You Get both Fairness and Efficiency?

A commonly used criterion for what makes for a good allocation of resources in economics is whether it’s Pareto efficient. That is, are there any other possible allocations that would make someone better off without making anyone worse off. But since this fails to capture notions of fairness (giving all of the goods to one person may be efficient), several fairness criteria have been proposed. One rather strong one is envy-freeness which says that no one prefers the bundle of goods someone else got over their own (so nobody has any envy).

There are many situations such as assigning courses, housing, schedules, etc. where it may be useful to have an explicit allocation mechanism and in those cases it would be nice if we could guarantee that it would produce an allocation that satisfies both the efficiency and fairness criteria mentioned above. Unfortunately, this can’t be guaranteed in general. One simple reason why this may not be possible arises when the goods being assigned are indivisible. For example, take the simple case where there is only one indivisible good and several people who want it. Clearly, it’s inefficient to assign the good to no one because if you were to assign it to someone instead you’d make them better off without harming anyone else. But at the same time, assigning it to no one is the only envy-free option since we’re not able to divide it up and assigning it all to one person creates envy.

But what about when goods are perfectly divisible, so we can divide them up as finely as we like? (For now we’ll assume that each good is homogeneous, so when dividing up a good people don’t care which part they receive just how big that part is). Now can we make any guarantees? Well if don’t make any restrictions on what people’s preferences look like then we still can’t. For example, you could take the previous case and make the good divisible but if people only got positive value from receiving the the whole thing and not from only receiving a fraction of it then we’d run into the exact same problems as before.

So what conditions on people’s preferences will be sufficient to guarantee that there will be a Pareto efficient envy-free (PEEF) allocation? One fairly simple requirement that is sufficient is that everyone’s preferences must be convex and strongly monotone. Strongly monotone means that, all else equal, they always strictly prefer more of each good to less of it. If we’re thinking about representing the bundle of goods someone gets as a vector, having convex preferences means that they like convex combinations of any two bundles at least as much as those two bundles. More intuitively, this can be thought of as having a taste for variety since averages are preferred to extremes.

But why does this guarantee a PEEF allocation? Well it’s actually because it guarantees us a competitive equilibrium! We can image each person is assigned a budget and each good is assigned a price and then every person chooses how much of each good to buy to maximally satisfy their preferences given their budget constraint. If when everyone does this the market clears, i.e. the total amount demanded of each good equals the total amount available, we say those prices and the resulting allocation is a competitive equilibrium (CE). Assuming that preferences are strongly monotone is enough for the First Fundamental Welfare Theorem to hold. In other words, it guarantees us that any CE will be Pareto efficient. But what about envy-freeness? Well we get to choose how big everyone’s budget is so we can choose to make them all the same size. Then since everyone can afford the same bundles and everyone chooses the best bundle they can afford it’s impossible for someone to prefer someone else’s bundle to their own. Therefore, any competitive equilibrium with equal incomes (CEEI) will be PEEF!

So now in order to guarantee the existence of a PEEF allocation we need to make sure a CEEI will exist. This is where the convexity assumption is required. Similar to some other equilibrium existence results you can prove a competitive equilibrium will exist using a fixed point theorem. In this case it’s Kakutani fixed-point theorem which deals with a correspondence defined on convex set mapping back onto subsets of that set which is why convexity is important.

This is a cool result but unfortunately there aren’t fast algorithms for computing CE and it would be nice to have an method that wasn’t quite so all or nothing. A very different approach people sometimes take to find good allocations is to have some global welfare function we’re maximizing. This can be done if we can express everyone preferences using utility functions and then make a global function that takes all of these as inputs. For example, we could have a kind of utilitarian welfare function:

\sum_{n} u_{n}(x_{n})

But one that will be particularly interesting for our purposes is the Nash welfare function:

\sum_{n} log(u_{n}(x_{n}))

One nice thing about both the utilitarian and Nash welfare functions is that since they are strictly increasing in all of the individual utility functions whatever maximizes them will be Pareto efficient. But the maximal Nash welfare allocation, sometimes called the Nash optimal allocation, additionally tends to be fair in the sense of being close to being envy-free. One way to see what the Nash optimal allocation is doing is to see that it’s solving the following optimization problem:

\max \sum_{n=1}^{N} log(u_{n}(x_{n,1},... ,x_{n,J})) + \sum_{j=1}^{J} \lambda_{j}(G_{j} - \sum_{n=1}^{N} x_{n,j})

Where there’s N people and J goods and G_{j} is the total amount available of each good. Taking the derivative with respect x_{n,j} will give us the condition that the marginal growth rates of each person’s utility for receiving more of a particular good will be equal, at least among everyone who is already receiving some of it. This is different from the utilitarian optimal allocation which would be equating marginal utilities across people rather than marginal growth rates. It seems somewhat intuitive that equating growth rates in utility across people would lead to what seem like fairer outcomes but is it ever guaranteed to give us something envy-free?

Yes! If the utility functions are all homogeneous. More precisely, this means that for each the utility function there exist a constant k such that:

\forall \alpha \geq 0,\;\; u(\alpha x) = \alpha^{k}u(x)

For example, all linear functions as well as functions that are just minimums or maximums of finite things are homogeneous. But stating the result this way is underselling its coolness. What’s really going on is when everyone has homogeneous utility functions maximizing the Nash welfare finds a CEEI! Specifically the allocation will be the same as the allocation in some CEEI. Furthermore, the Lagrange multiplier from our optimization problem’s constraint for each good will be equal to the price of that good in that CEEI (if we normalize everyone’s budget and the supply of each good to be 1). This leads to an interesting interpretation of the equilibrium prices of each good as being about the growth rate of the marginal buyers utility if the supply were increased slightly. Additionally, there are other cases where this connection holds. For example, in the realm of “cake-cutting” problems where we have one heterogeneous good to divide up among people, we can define a variation on the concept a CE to fit the setting. In this case it would essentially mean finding a function that maps from subsets of the good to a price such that when everyone spends their fake income “the market clears” in the sense that some partition of the whole good is bought i.e. no subset doesn’t get sold and no intersecting subsets are bought. And in this case a partitioning up of the good is a CEEI allocation if and only if it maximizes the Nash welfare!

Perhaps these special cases don’t actually tell us much about more general cases but it would be quite interesting if they did. It gives a very nice interpretation of how competitive markets tend to work when people have similar levels of wealth and even beyond that I personally just find it neat when things as seemingly decentralized as competitive markets can actually be thought of as maximizing some global function.


Sponsored Post Learn from the experts: Create a successful blog with our brand new courseThe WordPress.com Blog

WordPress.com is excited to announce our newest offering: a course just for beginning bloggers where you’ll learn everything you need to know about blogging from the most trusted experts in the industry. We have helped millions of blogs get up and running, we know what works, and we want you to to know everything we know. This course provides all the fundamental skills and inspiration you need to get your blog started, an interactive community forum, and content updated annually.

The Benefits of Non-Gaussian Noise

Trying to do causal inference with observational data is quite difficult in general but one thing that helps make it more feasible is when there are no unobserved confounding variables. If we have some set of observable variables, we can represent different potential causal relationships between them with directed (and we’ll assume acyclic) graphs. The value of each variable i is determined by some function f_{i}(PA(i),\epsilon_{i}) where PA(i) are all of the parents of i in the graph and \epsilon_{i} is a noise term for i. Assuming no unobserved confounding, sometimes called Causal Sufficiency, means that all of the noise terms will be independent of each other.

This quite a strong assumption and it fails all the time, but even when it’s true it’s not necessarily the case that given enough observations for our set of variables that we will be able to uniquely pick out the true causal relationship. For example, say you have two variables x and y that are correlated and you’re able to see exactly what the joint distribution of the two variables is. You won’t necessarily be able to say whether this correlation is coming from x causing y or y causing x without making some further assumptions about the form of the causal function or what the noise distributions look like. So what kind of assumptions are useful to make? Well, some common assumptions to use historically were that the causal relationships are linear and the noise terms were all Gaussian. But despite this making the math nice in some ways these assumptions don’t actually allow you to pick out the unique causal graph in general. Luckily, people found sets of assumptions do work. One particularly surprising example of this involves simply taking the classical assumptions of linearity and Gaussian noise and switching it to assuming linearity and anything but Gaussian noise. Remarkably this incredibly weak assumption about what the noise looks like works for guaranteeing that you can uniquely pick out the correct causal graph. You can look at the paper for the LiNGAM algorithm if you want to see exactly how this can be done but here I want to just provide some intuition for what’s so special about normal distributions that makes this true.

Let’s start with understanding the less surprising part: that assuming linearity and Gaussian noise it’s good enough. One interesting property of normal distributions is that linear combinations of independent normal random variables are also normal. If all the random variables can be expressed as linear functions of their parents and their noise terms, then they can all also be expressed as linear functions of all of the noise terms. Therefore, if the noise terms are Gaussian and independent (Causal Sufficiency), then every random variable in the graph will be Gaussian as well so the joint distribution for the whole graph will an n-dimensional Gaussian random variable. So this seems nice in the sense that the distribution for the graph can be so easily described but in a certain sense that’s part of the problem. If we know a distribution is Gaussian, we can uniquely pin it down using just its mean and variance. Specifically for an n-dimensional Gaussian this will be an n by 1 vector of the means and an n by n covariance matrix. Essentially this means that all of the information there is for us to estimate causal relationships is captured in the covariance matrix. Thinking about it this way, it’s perhaps unsurprising that we cannot do better than looking at conditional and unconditional correlations between the random variables. This will help us narrow down the possibly graphs but in general it won’t uniquely pick one out (like in the simple case of two correlated random variables).

With this in mind let’s move on to thinking about the more surprising result that assuming non-Gaussian noise allows us to accomplish what assuming Gaussian noise couldn’t. We can do so by thinking about the following theorem:

Darmois–Skitovich Theorem: Let \epsilon_{i} for i = 1, 2,  … , n where n \geq 2 be independent random variables. Let a_{i} and b_{i} be non-zero constants. If L_{1} = \sum_{i=1}^{n} a_{i}\epsilon_{i} and L_{2} = \sum_{i=1}^{n} b_{i}\epsilon_{i} are independent of each other then the random variables \epsilon_{i} for i = 1,2,…,n must all be Gaussian.

The proof of this theorem relies on the idea of characteristic functions which are complex functions obtained by taking the Fourier transform of random variables. It’s a useful proof technique because you can prove things about a random variable by proving things about it’s corresponding characteristic function, but unfortunately it doesn’t give me much intuition for why this result is true. The theorem does however add some interesting evidence (along with other results like Central Limit Theorems) for the special relationship between linear functions and normal distributions.

In order to see how this result is useful for causal inference let’s assume we have Causal Sufficiency and linearity of the causal relationships and let’s think about the simple case of just two observable random variables again. We can call them x and y with there corresponding noise terms \epsilon_{x} and \epsilon_{y} . Now let’s think about if we could run into the same problem that we did in the Gaussian case where we accidentally get the causal relationship backwards. So we’ll assume the true model is:

x = \epsilon_{x}

y = \beta x + \epsilon_{y}

But let’s think about when we try to fit the backwards model:

x = \alpha y + \epsilon_{x}^{*}

y = \epsilon_{y}^{*}

If we rearrange the equation for the backward model to solve for our estimated value of \epsilon_{x}^{*} we’ll get that it’s equal to x - \alpha y and then plugging in our equations from the true model gives:

\epsilon_{x}^{*} = (1 - \alpha \beta) \epsilon_{x} - \alpha \epsilon_{y}

So what? Well we’ve shown that \epsilon_{x}^{*} will actually be a linear combination of \epsilon_{x} and \epsilon_{y} just like y actually is. Since we’re assuming Causal Sufficiency it’s going to actually be the case that \epsilon_{x} is independent of \epsilon_{y} but we’re going to be expecting it to be the case that \epsilon_{x}^{*} is independent of \epsilon_{y}^{*} and therefore independent of y. And this is where the theorem above comes in! Because we have two linear combinations of independent random variables and in order be able to fit the incorrect model while still satisfying our assumptions we would need those linear combinations to be independent as well. But based on the theorem this can only happen when the noise terms are Gaussian so if we correctly assume they are anything but Gaussian then this can’t be the case!

The above argument can be generalized to arbitrary acyclic causal graphs, it’s just the substituting in of equations to get the desired linear combinations of noise terms will of course get quite a bit more complicated.

Does Uncertainty Make Risk Averse People Save More?

I’d like to go through and analyze a question which seems straightforward but my initial thinking about it was actually incorrect. The question was: when the future becomes uncertain, how do we tell if people will want to save more or less?

The Basics of Risk Preferences

A key idea when talking about risk preferences is risk aversion. We general assume that people prefer consuming more to consuming less and if we can represent their preferences with a differentiable utility function this corresponds to u'(c) > 0 . Someone is risk averse if when you offer them a gamble with uncertain outcomes they would prefer getting the expected value of the gamble for sure over the gamble itself. It’s not too hard to show that this will correspond to their utility function being concave (and having a negative second derivative if it exists u''(c) < 0 ).  This is directly connected to Jensen’s Inequality which states that if a function f is convex then

E[f(x)] \geq f(E[x])

Or equivalently, if f is concave then

E[f(x)] \leq f(E[x])

What Happens When the Future Becomes Riskier?

At least for me when I was first thinking about it, it seemed intuitive that if someone is risk averse then this means they’ll save more when the future becomes uncertain but in fact this is not necessarily the case! If a consumer is risk averse and they would save more when the future becomes riskier then in economics we say they are prudent. It turns out that in order to tell if a consumer is prudent we have to not only look at the first and second derivatives of their utility function but also the third derivative.

In order to get some intuition for what’s going on here, let’s think about a simple model where of a consumer exists for two time periods. In the first time period they earn some income y_{1} which they can either spend on consumption that period or save so c_{1} + s = y_{1} . In the second period they spend the income they earn plus what they’ve saved so c_{2} = y_{2} + s(1+r) where 1 + r is the interest rate. Combining these constraints we can write their problem as:

max_{c_{1},c_{2}} u(c_{1}) + \beta u(c_{2})

s.t. \; c_{1}(1+r) + c_{2} = y_{1}(1+r) + y_{2}

where \beta is how much they discount the future.

Then plugging in the constraint for c_{2} :

max_{c_{1}} u(c_{1}) + \beta u(y_{1}(1+r) + y_{2} - c_{1}(1+r))

Taking the first derivative and setting it equal to zero, we can get:

u'(y_{1}(1+r) + y_{2} - c_{1}(1+r)) = u'(c_{1}) \frac{1}{\beta (1+r)}

Or plugging c_{2} back in:

u'(c_{2}) = u'(c_{1}) \frac{1}{\beta (1+r)}

Intuitively, this is just saying that the amount of utility they would get by spending a little more now is equal to the amount that their discounted future utility would increase by saving a little more. Now let’s try adding in some uncertainty to the future by adding some shock to their future income. So now in the second period they will receive y_{2} + X where X is a random variable with mean zero. Then their new problem is:

max_{c_{1},c_{2}} u(c_{1}) + \beta E[u(c_{2})]

s.t. \; c_{1}(1+r) + c_{2} = y_{1}(1+r) + y_{2} + X

So going through the same process as before we’ll get:

E[u'(y_{1}(1+r) + y_{2} - c_{1}(1+r) + X)] = u'(c_{1}) \frac{1}{\beta (1+r)}

Or equivalently,

E[u'(c_{2})] = u'(c_{1}) \frac{1}{\beta (1+r)}

If the left hand side is greater than it was before (for the same savings decisions), this would mean we would need to decrease the expected marginal utility of future consumption and increase the marginal utility of current consumption to make the equality hold. If were dealing with someone whose risk averse (so u''(c) < 0 ) this means we need to decrease current consumption and increase expected future consumption. In other words, if the left hand side is greater than before then they’ll want to save more.

As was mentioned above with Jensen’s Inequality, if u'(c) is convex, then

E[u'(y_{1}(1+r) + y_{2} - c_{1} + X)] \geq u'(E[y_{1}(1+r) + y_{2} - c_{1} + X]) = u'(y_{1}(1+r) + y_{2} - c_{1})

Since we’re thinking about the first derivative being convex this will correspond to the third derivative being positive. This is the intuition for why while the second derivative tells us about whether someone likes risk, we need to know about the third derivative to tell how their savings decisions will respond to it.


An Introduction to Duality Theory

An insight from optimization theory that is both extremely useful and conceptually interesting is that every optimization problem has what’s referred to as a dual problem. This is a closely related problem with one constraint for each variable in the original and one variable for each constraint in the original and it can be used to bound or even find the original’s optimal solution. Before we start exploring this concept though there are some conventions that it will be useful to set up. First off, we’ll assume the original optimization problem, referred to as the primal problem, is expressed in the form:

z^{*} = min_{x} f(x)

s.t. \; g(x) \leq 0

where f(x): \mathbb{R}^{n} \rightarrow \mathbb{R} and g(x): \mathbb{R}^{n} \rightarrow \mathbb{R}^{m} . Notice that this is very general in the sense that we can rewrite any maximization problem as a minimization problem and we can express any equality or inequality constraints in this form. One other convention we’ll be using is that having an infinite penalty for the variables taking on certain values is equivalent to constraining them to not be able to take on those values. In other words, the above optimization problem is equivalent to the problem:

min_{x} h(x)

where h(x) = f(x) when g(x) \leq 0 and h(x) = \infty otherwise.

What is a Dual Problem?

There are 3 steps to constructing the dual from the primal.

Step 1: Create the Lagrangian

Using our optimization problem above, we can construct a new function known as its Lagrangian.

L(x, u) := f(x) + u^{T} g(x)

These u variables are sometimes known as Lagrange multipliers and they will become the variables of our dual problem.

Step 2: Create the Dual Function

L^{*}(u) := inf_{x} f(x) + u^{T} g(x)

Step 3: Create the Dual Problem

v^{*} = max_{u} L^{*}(u)

s.t. \; u \geq 0

It’s worth noting that in general the dual of the dual may not being equivalent to the primal.

Weak Duality

Any point that satisfies the constraints of an optimization problem is known as a feasible solution. One interesting feature of the relationship between the primal and the dual is that for any feasible solution x' to the primal and any feasible solution u' to the dual,  f(x') \geq L^{*}(u') . The proof is very simple. If x' and u' are feasible solutions then,

f(x') \geq f(x') + u'^{T}g(x') \geq min_{x} f(x) + u'^{T}g(x) = L^{*}(u')

What makes this so useful is that it’s the problem that’s trying to minimize something that’s always taking larger values than the problem trying to maximize something. This means that by finding feasible solutions to the dual we can bound how good of a solution to the primal we can obtain and better solutions to the dual will provide us with better bounds. This also means that if the primal and dual have optimal solutions then z^{*} \geq v^{*} . The difference between these optimal solutions is known as the duality gap. It also naturally follows from that if you can do unboundedly well on one of the problems then the other has no feasible solutions.

One last thing that’s useful about weak duality is that it gives us a sufficient condition for an optimal solution. If we can find a feasible solution to the primal x' and feasible solution to the dual u' such that f(x') = L^{*}(u') , then we know that x' and u' must actually be optimal solutions. This property of having the duality gap equal zero is known as strong duality and while optimization problems don’t have it in general there are useful special cases that do.

An Example: Linear Programming

It’s helpful to look at an example and an interesting special case is when our objective function and constraints are linear. In that case we can express the primal problem as:

min_{x} c^{T}x
s.t. \; Ax \geq b

Taking advantage of the linearity, we can rewrite its Lagrangian.

L(x, u) = c^{T}x + u^{T}(b - Ax) = u^{T}b + (c - A^{T}u)^{T}x

Then the dual function will be:

L^{*}(u) = u^{T}b if A^{T}u = c and it will be equal to negative infinity if A^{T}u \neq c .

So therefore following the convention of infinite penalties being the same as constraints, the dual problem will be

max_{u} u^{T}b

s.t. \; A^{T}u = c, u \geq 0

This means that the dual problem will also be linear and is easily computable from the primal. In fact in this case the dual of the dual will be equivalent to the primal. However, what’s really nice about linear programming is that not only does weak duality hold but we are guaranteed that when the primal and dual have optimal solutions strong duality will hold as well.

Saddle Points and the KKT Conditions

Even outside of special cases like linear programming this notion of a Lagrangian and duality can be used to provide some necessary or even sufficient conditions for an optimal solution. First we’ll need to introduce the idea of a saddle point of the Lagrangian. This is a point (x',u') such that u' \geq 0 and

\forall x \;\forall u \geq 0, \;\; L(x', u) \leq L(x', u') \leq L(x, u')

It turns out (x',u') is a saddle point if and only if x' and u' are optimal solutions to the primal and dual problems respectively and the duality gap is zero. This observation is extremely useful because it is also the case that (x',u') is a saddle point if and only if all four of the following conditions hold:

1. L(x',u') = L^{*}(u')

2. g(x') \leq 0

3. u'^{T}g(x') = 0

4. u' \geq 0

Let’s go through and quickly prove this! Let’s start by assuming (x',u') is saddle point. It follows directly from the definition of a saddle point that (1) and (4) will be true. Also, \forall u \geq 0,\; f(x') + u'^{T}g(x') \geq f(x') + u^{T}g(x') holds. Therefore, g(x') \leq 0 since otherwise there would be some u \geq 0 that violates that inequality, so (2) holds. By letting u = 0 we get u'^{T}g(x') \geq 0 but by (2) and (4) u'^{T}g(x') \leq 0 so u'^{T}g(x') = 0 giving us (3). Now notice that (2) and (4) mean that x' and u' are feasible solutions to the primal and dual and (1) and (3) mean that f(x') = L(x', u') = L^{*}(u') which means based on the stuff we talk about with weak duality that x' and u' are optimal solutions and the duality gap is zero.

Now let’s suppose we have (x',u') that satisfy conditions 1-4. Then (1) directly gives that \forall x,\; L(x', u') \leq L(x, u') and (2) and (3) together give us \forall u \geq 0,\; L(x', u') = f(x') \geq L(x', u) . So therefore (x',u') is a saddle point. Lastly, let’s assume x' and u' are optimal solutions and there’s no duality gap. The feasibility of x' and u' directly gives us (2) and (4) and since L^{*}(x') = f(x') + u'^{T}g(x') = f(x') , (1) and (3) must hold. Thus all three of these things are equivalent.

This result, in addition to helping us understand when strong duality will hold, also can be used to give us a set of conditions, 1-4, that are collectively sufficient for having an optimal solution. Conditions 2-4 are very easy to check, the only one that presents some difficulty is (1). Luckily though we’ve still managed to reduce thinking about constrained optimization to thinking about unconstrained optimization. In the special case where f(x), g_{1}(x), \dots, g_{m}(x) are differentiable and convex then setting the first derivative of the Lagrangian with respect to x equal to zero is sufficient to satisfy (1). Therefore, these four conditions together will be sufficient for an optimal solution in that case:

1. \nabla f(x') + u'^{T} \nabla g(x') = 0

2. g(x') \leq 0

3. u'^{T}g(x') = 0

4. u' \geq 0

These are known as the Karush–Kuhn–Tucker (KKT) conditions. Perhaps unsurprisingly, not only do they provide sufficient conditions for a constrained global minimum when things are convex, but they can provide necessary conditions for an optimal solution given some fairly weak “regularity” conditions. I won’t get into any more of the details of the different versions the KKT results here but hopeful you can see the important role duality plays in understanding them.

Elections vs. Sortition

When most people think of representative democracy, they usually think of a system in which people vote on who they want their representatives to be and then those representatives create and vote on policies. However, an alternative system that makes use of representatives and is, at least under some definitions, democratic is sortition (also known as demarchy or lottocracy). Sortition is a system in which the representatives that craft and vote on policy are selected at random from the population. In this post I’ll be focusing on comparing the use of these two systems for the legislative branch (not executive or judicial positions). I have a fair amount of skepticism about the prospects of sortition but I’ve recently become more interested as I’ve thought more about the flaws of election-based representative systems.

If you’ve seen some of the other posts on this blog, you could probably guess that I think more consideration should be given to how elections are conducted in representative democracies because it does seem like many of design details have the potential to significantly impact the performance of our institutions. To take one example, changing what voting methods are used seems like it could change how much people’s votes reflect their true preferences, how powerful political parties are, and how proportional the representatives selected end up being of the overall population along various dimensions. However, there does seem to be at least one major problem that any election-based representative system will run into, which is that as the population gets large the probability of any individual vote being pivotal rapidly approaches zero. It seems like it will always be challenging for this not to lead to problems of rational ignorance. That is, if the probability that your vote will affect the outcome is negligible, then someone could reasonable conduct a cost-benefit analysis and decide it’s not worth the effort to become well informed about all of the candidates’ positions and the effects of policies. Of course, some people enjoy thinking about and engaging with politics but even then there may be the potential for “rational irrationality” in the sense that if the main consequences of holding certain political views are primarily social or how they get to think about themselves then they may be biased to approach issues differently than if the main effects were actually changing policy.

This seems like one area where sortition does have an advantage, since while there would still be some voting among the randomly selected subset, perhaps it could be small enough to have the expected value of each member’s decision be sizable in general. Any kind of pure sortition based political institution would be so different from are current system that it’s hard to be able to anticipate all of the potential problems or benefits but here are a few others that seem like they are worth considering:

More Representative?

As I indicated before, election-based representative democracies can vary significantly in terms of how well the elected officials end up representing the population ideologically or along other dimensions. It seems like sortition could offer stronger representativeness guarantees, since with a reasonably sized legislative group (say several hundred people) a basic law of large numbers argument could guarantee that with a high probability the proportion of people who, for example, support higher taxes or are under 55 would be approximately the same as in the population they were sampled from. It would be reasonable to have some rules restricting who could be selected (e.g. not young children) and this choice would have important consequences for the system similar to the choice of who can vote in the election-based version.

If we assume that the randomly selected people are given the option to decline then we would likely want to make sure that people are not financially or legally penalized for participating. Part of this would plausibly be that they would paid well during whatever time period they’re working as part of the legislator. However, there would still likely be some selection bias in terms of who accepts and by paying them well this would then change them to make them unrepresentative of the population in a pretty important way (at least while in office). Of course, this same unrepresentativeness also exists, usually being even worse, with the election-based system but one could argue that since the poor can still vote those representatives will have more of an incentive to care about their problems.

Less Accountable?

This leads to one of the main concerns people have about sortition, which is that since the representatives don’t have to worry about re-election they are less accountable to the general public. This positive incentive for elected politicians certainly does exist to some extent but it seems like the positive effects here can be considerably undermined by the rational ignorance/rational irrationality effect mentioned earlier. Additionally, it may also lead to perverse incentives (e.g. when politicians can be more successful by prioritizing short-term political gain over tackling long-term problems). Also, if the reason why we care about this kind of accountability is the we want the politician to reflect the values of the voters then we should only be critical of sortition on this front in so far as it fails to achieve this end through other means (with the unrepresentativeness of incomes being one potential case of this like mentioned above).

Some people have suggested that there should be a community consultation phase built into the process where after learning about the issue at hand and putting together an initial draft of the legislation, the selected members return to their communities and host town halls or other events where the selected members get to explain their decisions and what they’ve learned to their communities and community members can offer feedback.  I’m uncertain of how useful these would be in practice and the selected members would be in no way forced to follow the feedback they receive but perhaps it would help to have a built-in mechanism by which the selected members interact with their communities while they’re in the legislature.

Less Vulnerable to Corruption/Political Capture or More?

Once we understand that, usually, the goal of the selected members in sortition isn’t to try to choose what they think people in general think is best but rather what they themselves think is best, it’s easier to understand why some advocates of sortition actually suggest that the selected members’ votes should be anonymous. One benefit of this is that it makes the members harder to bribe for the same reason that making the voting of people in election-based representative systems anonymous helps prevent bribery. Additionally, many election-based systems require candidates to raise money to run for office and it may be easier to evaluate when there is quid quo pro between the representatives and the special interests in sortition than in those systems.

However, depending upon how the system is set up the biggest concerns about political capture may not be about the capture of selected members themselves. Many of the proponents of sortition also suggest having some way to have experts present to and advise the selected members when they are working on policy issues related to their field. This could be very helpful to inform the members about the issues they’re dealing with and help people without any prior experience crafting legislation do it for the first time but it also opens up the opportunity for the influence of special interests if they are able to influence the experts or how the experts are selected. There are a number of approaches to try to protect against this such as requiring experts to be accredited by international recognized institutions and having strong disclosure requirements. We could also have the expert selection process be conducted at least in part by an accredited community of experts (like what the American Bar Association does in the U.S. by giving ratings for proposed United States Supreme Court nominees). Certainly, none of these methods would be foolproof though. It is worth keeping in mind that expert capture is also going to be a concern in any election-based alternative but perhaps it would be an even bigger concern in a system where experts planned a more explicit role in the process and were dealing with novices.

No Standards for Competence, Interest, Values, etc.?

This leads to another one of the largest concerns that people have about sortition which is that there are absolutely no restrictions (or extremely mild restrictions) on who can be become a representative. In other words, people are worried about there being almost no minimum standards of intelligence, interest in the issues, moral character, *insert trait you value in your leaders here*, in order to obtain a significant amount of power. A key point here though is to consider what exactly do these standards look like, or what could they look like, in the election-based alternative. It certainly isn’t hard to come up with anecdotes of when the standards for who could gain power in election-based representative democracies weren’t nearly as high as some people would have hoped. However, I do certainly get where people are coming from with this concern. Currently, an important crux of this issue for me is how well the expert information sessions and advising could function. One thing that might make this process of trying to raise the competence of those selected easier is to have single-issue legislators. This could significantly lesson the burden of what the selected members are expected to learn about and understand and would probably be more feasible than in an election-based system where you could be dramatically increasing the number of candidates people have to know about. However, it certainly raises its own questions, like how do you properly partition up the policy space and how do these single-issue legislators coordinate with each other. Additionally, there may be some benefit to having representatives that have to think about things holistically.

Empowering Ordinary People?

One thing some people like about sortition deals with the flip-side of the previous concern. Another potential benefit over system where candidates have to raise money to run for office is that it greatly broadens who can be a political leader. This is because the selection procedure does not favor those who have pre-existing political connections, wealth, and fame which, although it could be affected by design choices like whether campaigns are privately funded, seems likely to be an issue in most election-based systems. On the other hand, some people worry that not having elections could decrease political engagement and result in an even less political informed and active populous.

Dis-Empowering Political Parties?

Once again, this is dimension that election-based representative systems already vary along considerably with, for example, voting methods like single-winner plurality-rule seeming to encourage a two-party system and penalizing independents more than others. However, a sortition system seems like it might lead to the loss of political parties all together. Currently at least this seems like a strong upside to sortition for me.

Not Viewed as Illegitimate?

Another concern about sortition is that the representatives may not be viewed as legitimate leaders since the general public didn’t play any role in selecting them. If people feel like elections are necessary for them to give consent to be governed over then they will feel like the randomly selected people don’t have the consent of the governed. For me personally, this way of thinking doesn’t hold too much sway since a system that will almost certainly do the same thing regardless of what you say hardly feels like allowing individuals to give consent in any way in which we would usually think about the concept. Additionally, for me, pretty much all related procedural concerns about sortition reduce to consequentialist concerns about how good the outcomes it leads to will be. But again that’s only how I personally think about the legitimacy of government and if others did view sortition as less legitimate and this led to viewing the laws it produced as less legitimate and more political instability then that seems like it’s worth taking into account. This gets at a larger issue which is that the performance of political and economic institutions is inevitably tied up in how social norms evolve around them which seem very hard to predict in general.

Design Choices

One point that I’ve been trying to make salient throughout this is that these two options are each a large category of institutions among which there may be a significant amount of variation in terms of performance based on specific design choices. I’ve also alluded to several choices that arise when designing a sortition legislature. Here are those ones as well as a few others:

  • Who exactly is in the pool the candidates are selected from?
  • Are there multiple legislative bodies and if so are certain bodies targeted towards specific issues?
  • How long are terms and are members for a chamber selected all at once or in a staggered way?
  • How well are the selected members compensated and what rules are in place to prevent them from being penalized for accepting the position? Are there any penalties for not accepting the position when selected?
  • Is there some sort of community consultation phase or other mechanisms for the general population to offer feedback and be involved?
  • Is voting on proposed legislation anonymous?
  • What kind of disclosure requirements and ways of monitoring for corruption are there?
  • Is there some sort of “learning phase” where members hear from experts in fields related to the policy issues they’re dealing with, if so how are these experts vetted and how involved in the process are they?

A Blending of Systems

So far I’ve been discussing things as a choice between a purely election-based and a purely random selection-based representative system but there are also a wide range of institutions in between (as well as ones that combine other elements such as those of direct democracies or more technocratic aspects). A simple way to combine these two systems would be to have some representatives randomly selected and others elected such as having a bicameral legislature like in the U.S. but with one chamber randomly selected and the other having elected officials. It could also be more specialized such as having the elected representatives create legislation and having the randomly selected members be responsible for voting on it or having single-issue randomly legislators like mentioned above and having the elected officials play a role in setting the agendas of and creating these single-issue groups.

An interesting variant of this kind of combination is to have the ratio of elected legislators to randomly selected legislators be directly determined by voter turnout. So for example if there was 65% voter turnout then 35% of the legislator would be assigned by random selection. In a sense then each absent voter is implicitly voting for a randomly selected person over the available candidates.

There are also a number of other ways to combine these systems such as randomly selecting the people eligible to be candidates in each election or only using sortition for certain kinds of decisions or only under certain political conditions (e.g. to resolve a legislative stalemate or institute certain political reforms).

While it can be hard to tell how institutions very different from what we can currently observe would work in practice, in some cases it may be possible to experiment with some of these alternatives on a small scale with ways to scale them up if they work well. Like I said, this is something I’ve only recently started considering so I’m interested to hear more points that I haven’t thought about and pin down more cruxes of this debate.

Incentivizing Honesty without Access to the Ground Truth

There are many settings in which we would like to elicit some information from people that we will be able to directly verify the accuracy of later. For example, we may wish to elicit predictions from meteorologists about the probability of it raining next Friday even though we will of course be able to see for ourselves whether it actually does rain later. In a context like this, a natural way to try to incentivize truthfulness is to have a way to calculate the accuracy of their prediction and reward them more for more accurate predictions. For example, if the thing we are trying to elicit is a probability distribution, this can be done with a function that calculates how much reward (e.g. money) they should receive based on the probabilities they assigned to each potential outcome and which outcome actually occurred. Such a function is known as a scoring rule and a scoring rule is called strictly proper if their expected score is always made strictly largest by reporting their true beliefs.

For example, the scoring rule shown below, known as the quadratic scoring rule, is strictly proper:

S(q, i)  = q_{i} - \frac{1}{2}\sum_{j \in M}q_{j}^{2}

where M is the set of potential outcomes and q_{i} is your reported probability of outcome i.

However, often times eliciting truthful information from others is most valuable precisely when you won’t have access to any kind of ground truth that their reports can be compared with. This may occur either because the ground truth is too costly to access, such as eliciting labels from people for large data sets for supervised learning, or because the information is necessarily subjective such as asking people to rate their enjoyment of a movie. In the latter case it’s not clear there’s much that can be done to “verify” the truthfulness of the reports but when dealing with the former case one approach that has been analyzed a lot in recently years is to reward people for their reports based on how they compare to the reports of others.

The general idea is that if we are in a setting where people’s beliefs are positively correlated with the truth and this is common knowledge then we may be able to get them to report what they believe is the truth by rewarding them more for giving answers more similar to others. Here’s one way we might try to model this:

Let’s assume we’re dealing with a setting where we have N people, or players, we want to answer some multiple choice question (e.g. selecting a label for some surprised learning task). Then call person i’s truthful answer Si. We’ll model people coming up with their answers by there being some N-dimensional joint distribution D of  answers that is sampled from and we’ll assume this distribution is common knowledge. Each person i then reports an answer xand there is some mechanism that gives them a payment Pi(x1, … , xN) to each person. We will also assume that D is symmetric, meaning that zooming in on two players i and j the joint distribution of their answers can be represented by a symmetric matrix and this symmetric matrix is the same for every pair (i, j) of distinct players. These symmetry requirements essentially amount to assuming that the players have no evidence that specific players are more biased to give particular answers than other players.

For example, for two players answering a yes or no question the distribution D may look like:

 s_{2} = 0 \; s_{2} = 1

s_{1} = 0   \;\;\;\;\;   0.3  \;\;\;\;\;    0.1

s_{2} = 1  \;\;\;\;\;    0.1   \;\;\;\;\;   0.5

A natural first idea for a mechanism that builds on the intuition to reward output agreement is for each player i, the mechanism randomly picks some other player j and gives player i some reward if their answers agree but nothing otherwise. If you’ve played the Schelling point game, this mechanism is essentially doing the same thing. But does it incentivize truthfulness, in the sense that if all other players are reporting truthfully, then truthful reporting is the unique best response for a player? Well actually it depends on what the distribution D looks like. I’d recommend stopping to think about how this mechanism might fail and what restrictions we would need to be put on D.

The main intuition for why this mechanism fails to be truthful is that players can want to lie when they come up with especially uncommon answers. More specifically, this mechanism is truthful if and only if the most likely outcome for everyone given their true answer is a matching of answers, that is, if and only if Pr[s_{2} = x | s_{1} = x] > Pr[s_{2} = y | s_{1} = x] for every x and y ≠ x.

This is very restrictive and indeed we can do better.

The Peer Prediction mechanism can incentivize truthful answers without the restrictions on D mentioned above. It does come at a cost though in that it requires that the mechanism know D. The clever idea behind this mechanism is that if we have access to D just like the players do then we can interpret an answer given by player i as really a report of a probability distribution over player j’s answer. When D is the two-dimensional distribution of i’s and j’s answer, we can do this simply by conditioning D on i’s report being true (i.e. looking at the ith row or column of D). And now, if we take j’s answer as true, then we have a reported probability distribution over a set of outcomes from i and which of those was realized from j, so we can reward i use a strictly proper scoring rule!

Then since strictly proper scoring rules guarantee truthful reporting of probability distributions and i’s true probability distribution of j’s true answer will be based on i’s true answer, this mechanism will have truthful reporting as a Nash equilibrium (and a strict Nash equilibrium if every truth answer leads to a different distribution over the other person’s truthful answer). This mechanism does still have at least two flaws:

The first is that truthful reporting is only a Nash equilibrium and in general multiple other Nash equilibria will exist. It’s easiest to see with the first output agreement mechanism where players could receive the maximum collective payoff by just all reporting the first option no matter what so if they are able to collude, especially if finding their actual best guess takes some effort, they may try to coordinate on an equilibrium like this. Similar uninformative equilibrium can exist with Peer Prediction and they may similarly even Pareto dominate the truthful equilibrium for the players. In fact, players coordinating on these high reward uninformative equilibria has been observed in laboratory experiments. And to make things even more embarrassing for Peer Prediction, using a mechanism of this kind sometimes led to less truthful reporting than just giving the participants a fixed reward. This may be partially because using a more complicated mechanism nudged participants to start thinking strategically. People have done work to fix this by developing methods to tweak the mechanism, partially by carefully selecting the strictly proper scoring rule, so that the Nash equilibrium with the highest rewards for all players is the truthful Nash equilibrium.

The other potential problem is that in some settings expecting the mechanism to know the distribution D won’t be realistic. People have proposed a variety of approaches to address this but one that attempts to handle this problem as well as the previous one is a mechanism known as Bayesian Truth Serum. The idea here is to have everyone report their guess of what the overall distribution of answers will look like in addition to reporting their own answer. Players are then rewarded for giving predictions of the distribution of answers that are accurate and for their own answer being surprisingly common. The intuition behind rewarding surprisingly common answers is that people’s best guess of the distribution of answers will be made conditioning on their own so if people’s answers are positively correlated with one another they should each expect their own to be the most likely to be over-represented. This component of the BTS score is called the information score.

As for incentivizing them to honestly report their beliefs about the distribution of answers, this can be done by penalizing them based on the “distance” (more precisely the KL-divergence) from their reported distribution to the true distribution and this is called their prediction score. So let (x_{1}^{i}, \dots, x_{m}^{i}) be person i’s reported answer (e.g. (0,1,0,0,0,0)), (y_{1}^{i}, \dots, y_{m}^{i}) be person i’s reported distribution, \overline x_{k} be the fraction of players reporting answer k, and \overline y_{k} be the geometric average of the predictions of the fraction that would report k. The score will be:

BTS Score = Information Score + Prediction Score

BTS^{i} = \sum_{k=1}^{m}x_{k}^{i}log\frac{\overline x_{k}}{\overline y_{k}} + \sum_{k=1}^{m} \overline x_{k}log\frac{y_{k}^{i}}{\overline x_{k}}

So people will be incentivized to give truthful predictions of the distribution and being truthful maximizes their information score across all Nash equilibria as long as the number of players participating is sufficiently large. This requirement for there to be many players essentially comes from the fact that this truthfulness guarantee relies on each player being able to ignoring the effect that their answer has on the overall distribution.

This is still a very active area of research and I have a feeling there’s still a lot of interesting terrain among this region of mechanisms to be explored.

Understanding Stable Matching with Lattice Theory: Part 2

So as I discussed in the last post, it turns out that we can use some ideas from Lattice Theory to better understand the structure of stable matchings for some instance of the Stable Marriage Problem. For simplicity throughout this post I’ll refer to one side of the matching as men and the other as women.

As I mentioned previously, there will always be a stable matching that all of the men weakly view as the best and women weakly view as the worst and conversely a stable matching that all men weakly view as the worst and all women weakly view as the best. We can build upon this by arbitrarily picking a side, say the women, and then expressing the set of stable matchings as a partial ordering where the relation is defined so that for some stable matchings M and M’, M ≥ M’ if all women weakly prefer their partner in M to their partner in M’. The fact that this relation forms a partial ordering is not that surprising and would happen any we try to create a relation by aggregating individual preferences that are complete orderings like this, but what is more surprising is that this ordering is also a lattice!

The easiest way to understand this is to see what the methods for finding the join and the meet of any pair of stable matchings are. Take any pair of stable matchings M and M’ and for every woman match them with their more preferred partner between M and M’. It is not immediately obvious that this procedure will even yield a one-to-one matching but it does and this matching is also stable! Given that this process does give us a new stable matching it is not to hard to see that this will be the join of M an M’ (M ∨ M’). And we can take a similar approach to find the meet of M and M’ (M ∧ M’) by matching every woman with their least preferred partner between M and M’ which will again give us a new stable matching!

Also if you think about how these meet and join operations work we can see that they are distributive. For example, if we have three stable matching M1, M2, and M3 and some woman is paired with men m1, m2, and m3 in them respectively then the man she is paired with in M1 ∧ (M2 ∨ M3) will be Min{m1, Max{m2, m3}} and in (M1 ∧ M2) ∨ (M1 ∧ M3) will be Max{Min{M1, M2}, Min{M2, M3}} (where the Min and Max are done based on her preference list). So since Min{,} and Max{,} are distributive, these join and meet operations will be too. Thus we have a finite distributive lattice. In fact, this result goes in the other direction as well. That is, not only does every set of stable matchings from an instance of the Stable Marriage Problem form a finite distributive lattice, but for every finite distributive lattice there exist some instance of the Stable Marriage Problem such that it is isomorphic to the partial ordering formed by the set of stable matchings.

Since we have a finite distributive lattice, we know by Birkhoff’s representation theorem that there must be some partial ordering such that the lattice it generates is isomorphic to the lattice of stable matchings. Not only does this partial ordering exist though, but its elements can naturally be interpreted as a certain way we’re able to move between stable matchings! In order to see this though we’ll need one last concept.

Assume that we’ve used the Deferred Acceptance Algorithm to find the male-optimal and female-optimal stable matchings. Then we can use these to narrow down the possibilities for who each man and woman could be paired with in the rest of the stable matchings. Imagine then that for a given man we just crossed out every woman on his list above the one he’s paired with in the male-optimal stable matching and below the one he’s paired with in the female-optimal (i.e. male-pessimal) stable matching. We could also cross out that man from each of those woman’s list since we know he won’t be paired with them. We could then repeated this process for the rest of the men and do an equivalent one for all the women and we would then obtain what is sometimes referred to as the male and female shortlists. If everyone only has one person one their list then there must be only one stable matching and there really isn’t much else to say but if we assume this isn’t true then we can do the following procedure:

Pick a woman with at least two men still on her shortlist, then find a second woman whose top choice on her shortlist is the second choice on the first woman’s shortlist. Then find another woman whose top choice on her shortlist is the second choice on the second woman’s shortlist and so on. Notice when we do this we’ll never end up at a woman with only one man on her shortlist since that man must have already been removed from all of the other women’s shortlist. Therefore, since the women’s top choices are a matching we’ll eventually cycle back around to the first woman. If we now take this subset of woman and assigned them to be paired with their second choices we call this a rotation (since we’re rotating partners within this subset).

stable matching rotation

Now here’s the important part: when we applying one of these rotations we end up at a new stable matching! Additionally, if we cross out the old partners from the shortlists of the people involved in the rotation, we can then pick another woman with at least two people on her shortlist to construct another rotation that can get us to another stable matching. Each time we do this we do this we will move to a stable matching weakly better for all the men and weakly worst for all the women until we’ve removed all but one person from everyone’s shortlist which indicates we’ve moved all the way from the female-optimal matching to the male-optimal matching.

But can we reach any stable matching this way by picking the right rotations in the right order? Yes, and this is where the connection to Birkhoff’s representation theorem comes in. It turns out that there exist a partial ordering of these rotations such that the lattice generated by it is isomorphic to the lattice of stable matchings! More specifically, the elements of the lattice generated by this partial ordering L(P) will be a set of rotations and when we apply these rotations to the female-optimal stable matching in any order consistent with the partial ordering we will always arrive at a specific new stable matching and for any stable matching there’s a specific element of L(P) that will get us there from the female-optimal stable matching. For example, consider the lattice of stable matchings below:

Maches f.d.l.

Then the set of rotations must have some partial ordering such as this:

rotations p.o.

So that we can generate a lattice of sets isomorphic to the lattice of stable matchings like this:

lattice generated from rotations

Which then corresponds to being able to move between the stable matchings like this:


While in general the number of stable matchings may be exponential in the number of people, it turns out that any specific couple (m, w) can be used in only a single rotation so the number of rotations will be O(n^2). Furthermore, the set of rotations and their partial ordering can be computed in polynomial time.

An Application

Now we can begin to see how this can be helpful for finding good stable matchings. While this has many potential applications to finding a stable matching with good properties or even applying these ideas to other versions of the problem, I’ll focus on one example here. Let’s say we want to find what’s referred to as the egalitarian stable matching. We’ll assume that each person would get some cardinal value from each potential partner (which will induce an ordinal ranking that we can use to define stability the same as before). Our goal is to select from the set of stable matchings the one that maximizes the sum of the value people receive.

We can start off by computing the female-optimal stable matching and the partial ordering of rotations. Then for each of the rotations we can easily compute and assign it the change in total value that applying it causes just by summing up the changes for each person involved in the rotation. Now notice that the total value achieved in any stable matching is equal to the total value achieved in the female-optimal stable matching plus the sum of the value of each rotation that was applied to reach it. Thus finding the maximal value stable matching can be reduced to choosing among the subsets of the partial ordering of rotations that are closed under predecessor the one whose elements sum to the largest value. This can be done with a dynamic programming approach using a method fairly similar to backwards induction.

Start by examining the elements in the partial ordering with no successors. It is pretty clear that if these elements are negative then the optimal subset won’t include them since it could be improved just by removing them. Then look at each unexamined element that doesn’t have any successors that haven’t been examined yet and assign it the sum of its own value and the maximum value of its successors (or 0 if it has none). This allows you to determine the maximum value by adding these elements. This process can repeated of looking at unexamined elements who don’t have any unexamined successors and computing what is optimal value that could be gained by choosing to add them to the kind of subset we’re interested in. Once we’ve examined all of the elements we can see what the optimal subset of the partial ordering is and from there we simply have to apply those rotations in a way consistent with the ordering to the female-optimal stable matching. You may notice all of the algorithm’s parts here are polynomial time so the algorithm overall will be as well.

Hopefully you can now start to see how much elegant hidden structure this problem has and that understanding it is practically useful. I apologize for claiming so many things here without showing the proofs. For a more in depth look, I would recommend this and this.

Understanding Stable Matching with Lattice Theory: Part 1

There’s a famous problem commonly referred to as the Stable Marriage Problem. Imagine there are two groups of people that are both of size N that each contain people looking to get paired up with exactly one person in the other group and they have an ordinal preferences over who they get paired with (traditionally framed as men and woman looking to get married). One property that you might want a matching of the individuals on the two sides (hence forth just men and women) to have is that a man and woman that are not paired up would not want to leave their partners to be together. In other words, for any pair of couples in the matching (M, W) and (M’, W’) it is not the case that M prefers W’ to W and W’ prefers M to M’. If a matching satisfies this property it is called stable since there is a lack of incentive for this kind of pairwise deviation. The classical problem then is to find a stable matching given the preferences of the men and women.

A surprising result discovered in the 1960s was that a stable matching always exist and there is an efficient algorithm that finds one. This algorithm is sometimes referred to as the Deferred Acceptance Algorithm because it involves repeatedly having all of the currently unmatched people on one side propose to their favorite choice that hasn’t rejected them yet and having the other side provisionally accept the best offer they have received so far and reject any other offers. This algorithm is elegant in many ways but it also has the both interesting and unfortunate property that the stable matching that it finds is always the weakly best stable matching for everyone on the proposing side and the weakly worst stable matching for everyone on the accepting side. This is quite surprising because it’s not obvious that all of the people on each side should even agree with each other about which stable matching is best and worst (they are in some sense competing with each other after all), let alone that what’s best for one side is worst for the other. It also seems unfortunate in the sense that it seems like we can’t use this algorithm without biases the outcome towards one side of the market.

It turns out though the this result is only hinting at something deeper about the structure of the set of stable matchings for a given instance of the problem and we can actually use this hidden structure to search through the set of stable matchings to find matchings that satisfy nicer properties.

In order to understand this structure though we first have to understand some math related to Order Theory and Abstract Algebra called Lattice Theory. The rest of this post will be introducing this math in order to be able to apply it to the matching setting.

The first concept to know is a partial ordering (S, ≥). This is just binary relation ≥ defined on a set S such that ≥ is reflexive (x ≥ x), anti-symmetric (x ≠ y, x ≥ y ⇒ ¬(y ≥ x)), and transitive (x ≥ y, y ≥ z ⇒ x ≥ z). Some examples are the set of real numbers with the usual less-than-or-equal ordering, the set of  natural numbers with the relation of “is divisibility by”, and the set of vertices of a directed acyclic graph with the relation being “is reachable from.”

A lattice is a special kind of partial ordering where every pair of elements x, y has a join (i.e. least upper bound: z ≥ x, z ≥ y, and ∀ z’, z’ ≥ x, z’ ≥ y ⇒ z’ ≥ z) and a meet (i.e. greatest lower bound: z ≤ x, z ≤ y, and ∀ z’, z’ ≤ x, z’ ≤ y ⇒ z’ ≤ z).  Hence we can think of a Lattice as having two binary operations join(x, y) (represented with x ∨ y) that returns the join of x and y  and meet(x, y) (represented with x ^ y) that returns the meet of x and y .  In fact, there is an alternative way to define a lattice where we can view it as a set with two binary operations that it is closed under that satisfy:

  1. Commutative: a ∨ b = b ∨ aa ∧ b = b ∧ a
  2. Associative: a ∨ (b ∨ c) = (a ∨ b) ∨ ca ∧ (b ∧ c) = (a ∧ b) ∧ c
  3.  Absorption: a ∨ (a ∧ b) = aa ∧ (a ∨ b) = a

Some also satisfy a fourth property:

Distributivity: a ∨ (b ∧ c) = (a ∨ b) ∧ (a ∨ c), a ∧ (b ∨ c) = (a ∧ b) ∨ (a ∧ c)

A simple example of a non-distributive lattice is:

non-distributive lattice (2)

A good example of a distributive lattice is a set of sets that are closed under union and intersection where the relation is ⊇, join is union, and meet is intersection. In fact, an interesting representation theorem is that a lattice is distributive if and only if it is isomorphic to a set of sets like this. Another important thing to notice is that we can always generate such a sets of sets from a finite partial ordering. Say we have some partial ordering P pictured by the directed acyclic graph below:


Then we can think about the subsets of the vertices of P that are closed under predecessor, meaning if x is in the subset and z ≥ x then z is in that subset.

lattice genreted from a partial ordering

If you think about it you will realize that the collection of these subsets must be closed under intersection and union and we can describe them as their own partial ordering using ⊇ as the relation.


Thus this set of sets is a distributive lattice, so in other words we can generate a finite distributive lattice from any finite partial order. In fact, a result called Birkhoff’s representation theorem allows us to go the opposite direction as well! It states that for every finite distributive lattice L, there exist some partial ordering P such that the lattice generated by P, in the manner described above, is isomorphic to L.

At this point you’re probably understandably wondering what all of this could possibly have to with stable matching, and that will be the topic of the next post!



A Liberal Radicalism Mechanism for Public Goods

In a recent paper, Vitalik Buterin, Zoë Hitzig, and E. Glen Weyl propose what they refer to as a Liberal Radicalism mechanism for handling the provision of public goods.

The difference between a public good and a private goods has to do with rivalry and excitability. Private goods are rivalrous and excludable while public goods are non-rivalrous and non-excludable. A good is rivalrous if one person consuming it inhibits another person’s ability to consume it. For example, a sandwich is highly rivalrous because if someone eats it then no one else gets to eat it but information is non-rival because one person learning something doesn’t inhibit someone else’s ability to learn that thing. A good is excludable if there are effective ways of preventing people from consuming it. For example, most of your personal belongings are hopefully excludable in the sense that you can effectively control who gets to use them but a fireworks show is relatively non-excludable because it’s not really feasible for the people who put on the display to selectively choose who gets to see it. Of course, all goods really exist on a spectrum for both of these traits so there is no line in the sand at which something becomes a public good but, in general, the more something behaves like a public good the harder it is for traditional markets to handle.

This has to do with the fact that these markets rely on property rights. The less rivalrous something is, the more the benefits of it will come to people besides the person who paid for it and thus less of the good’s actual benefit will be taken into account by the market. The less excludable a good is, the harder it will be to define and enforce property rights in the first place. The combination of these two effects leads to what is commonly known as the free rider problem where individuals are incentivized to free-ride off of others contributions to public goods rather than contributing themselves which leads to the goods being under-provided overall.

There are variety of methods that have been used to try to handle this problem such as developing new technologies to make the good excludable (e.g. copyrights for information), having the good be provided or subsidized by the state, assurance contracts (in which people agree to contribute conditional on others contributing), or using moral, cultural, or social motives to compel individuals to contribute. One naive approach you might try if you’re a government or an influential philanthropist is to ask people how much they value the good and then look at the difference between the total amount that people value it and the current amount that people are paying for it and supply the difference. The problem with this is that people won’t be properly incentivized to tell you how much they really value the good. In fact, in such a situation people would most likely be incentivized to say they valued the good as much as possible since there’s no cost to giving a higher stated value.

One attempt to try to deal with this is matching programs where organizations agree to match individual contributions at some ratio such as 1:1 (this is usually only done up to some amount for each individual). While it seems reasonable to try to infer some information about how much people value the good from their contributions and compensate based on that, there is still arbitrariness in choosing the matching ratio and cap and there is no way to guarantee that the optimal amount is being compensated. Ideally, we would like to set up the system so that we can infer how much individuals value the good based on how much they contribute so we can know what the socially optimal amount is. The new mechanism being proposed in the paper attempts to do exactly this.

They do this by having a mechanism where each individual makes a contribution and then the organization implementing it makes an additional contribution so that the total amount contributed will equal the following:

CT = (C11/2 + C21/2 + C31/2 + … + CN1/2)2

(Where CT is the total contribution and Ci is the contribution of the ith individual out of N total people)

A couple of things to note initially about this system:

  1. The total contribution will always be greater than or equal to the sum of the individual contributions
  2. If there is only one individual then this behaves like a private good because the total contribution is equal to their individual contribution
  3. The additional contribution made by the organization will be monotonic in all of the individual contributions (which means someone will never cause the total to decrease by contributing more themselves)
  4. If each individual multiples their contribution by X then the total will be multiplied by X as well (this means the mechanism will be unaffected by irrelevant factors such as the currency used, how goods are split or combined, and how often the mechanism is run)

These all seem like desirable properties but does this mechanism really guarantee that it will pick the socially optimal allocation?  It achieves this in the sense that the Nash Equilibrium strategy for each individual involves them paying amounts such that the total amount of the public good that is provided is the amount at which the sum of everyone’s marginal benefit is equal to the marginal cost of providing the good.

To see why let’s look at an example.

Let’s imagine we have a good x that is perfectly non-rival and costs $60 to produce per unit along with two individuals whose true willingness to pay can be described using the following functions:

B1(x) = 50x – (x2)/4                               B2(x) = 50x – x2

(So, for example, the second individual would be indifferent between losing $50 and gaining 1 unit of the good). We can get the marginal benefit each person receives for an additional unit of the good by taking the derivatives of these functions:

MB1(x) = 50 – x/2                                  MB2(x) = 50 – 2x

Because we are assuming that the good is perfectly non-rival, one of the individuals using it does not impair the other’s ability to use it at all so the total marginal benefit will simply be the sum of the individuals’ marginal benefits:

MBT(x) = 100 – 5x/2

We said that each unit costs $60, so the marginal costs will just be $60:

C(x) = 60x                                               MC(x) = 60

If we set the total marginal benefit of the good equal to the marginal cost we will get x=16. This is significantly greater than what the market equilibrium quantity would be (which in this case is actually x=0!). But now let’s think about what would happen under this new mechanism. Since the equation for the mechanism determines how much of the good both people get we can get new functions for their benefit by substituting the equation for the mechanism into their old benefit functions:

B1(x) = 50(x11/2 + x21/2)2 – (x11/2 + x21/2)4/4

B2(x) = 50(x11/2 + x21/2)2 – (x11/2 + x21/2)4

Getting the marginal benefits and setting them equal to the marginal cost again, we get the following two equations:

50(C11/2 + C21/2)/C11/2 – 0.5(C11/2 + C21/2)3/C11/2 = 60

50(C11/2 + C21/2)/C21/2 – 2(C11/2 + C21/2)3/C21/2 = 60

Each of these equations tells us how each individual can best respond to the other’s contribution, so by solving the system of equations we can find how much they will contribute when they are best responding to the each other’s contributions (i.e. the Nash equilibrium). Doing this we find that C1 = 196/25 and C2 = 36/25. Plugging these back into the equation for the mechanism we find that the total quantity will be 16, which is exactly the optimal amount!

Up until this point I’ve been assuming that the good in question was perfectly non-rival and this was why the LR mechanism was achieving the efficient allocation. However, as mentioned previously, rivalry comes on a spectrum and most things that fall into the fuzzy category of public goods are not perfectly non-rivalrous. Luckily, this can be accommodated by making the mechanism provide an amount of the good that is a weighted average of what the LR mechanism and a traditional market would provide:

CT = α(C11/2 + C21/2 + C31/2 + … + CN1/2)+ (1 – α)(C1 + C2 + C3 + … + CN) where (0 ≤ α ≤ 1)

They also offer some alternative versions of the mechanism to deal with budget caps and negative contributions in the paper. There are some rather standard concerns someone might have about such a mechanism including:

  1. Is social welfare (in the sense of the sum of people’s willingness to pay) what we should be optimizing?
  2. Will real people using this mechanism actually tend to move towards the Nash equilibrium strategy?
  3. How can we ensure that people aren’t able to collude and how robust is the mechanism to collusion?

Additionally, it’s worth noting that if this mechanism were being used by a state and funded through taxation then it would need to be the case that increasing the amount the government is contributing doesn’t directly lead to the people in the mechanism being taxed more as this would change the incentives. I’ll be excited to see it tested out (starting out either in the lab or small-scale real world settings presumably) and I’m happy to take it as a reminder that there is still progress to be made in designing mechanisms to improve cooperation and collective outcomes.

The Mother of All Big Data

An excerpt from the book Scale by Geoffrey West on the Large Hadron Collider and the importance of incorporating prior knowledge when using machine learning techniques:

The entire project represents an unprecedented engineering achievement whose output is the mother of all big data-nothing comes close. There are about 600 million collisions per second monitored by about 150 million individual sensors in each detector. This produces about 150 million petabytes of data a year or 150 exabytes a day (a byte being the basic unit of information). Let me give you a sense of what the scale of this means. The Word document containing this entire book, including all of its illustrations, is less than 20 megabytes (20MB, meaning 20 million bytes). This MacBook Air can store 8 gigabytes of data (8GB, meaning 8 billion bytes). All of the films stored by Netflix amount to less than 4 petabytes-which is 4 million GB, or about half a million times larger than the capacity of this laptop. Now to the big one: each day the total amount of data produced by all of the world’s computers and other IT devices taken together amounts to 2.5 exabytes; an exabytes is 1018 bytes, or a billion gigabytes (GB).

This is awesome and is often touted as a measure of the big data revolution. But here’s what’s truly awesome: it pales in comparison to the amount of data produced by the LHC. If every one of the 600 million collisions occurring each second were recorded it would amount to about 150 exabytes a day, which is about sixty times greater than the entire amount of data produced by all of the computational devices in the world added together. Obviously this means that the strategy of naively letting the data speak for itself by devising machine-learning algorithms to search for correlations that would eventually lead to the discovery of the Higgs mechanism is futile. Even if the machine produced a million times less data it is extremely unlikely this strategy could succeed. How, then, did physicists discover the proverbial needle in this mammoth haystack?

The point is that we have a well-developed, well-understood, well-tested conceptual framework and mathematical theory that guides us in where to look. It tells us that almost all of the debris resulting from almost all of the collisions are actually uninteresting or irrelevant as far as searching for the Higgs particle is concerned. In fact, it tells us that out of the approximately 600 million collisions occurring every second, only about 100 are of interest, representing only about 0.00001 percent of the entire data stream. It was by devising sophisticated algorithms for focusing on only this very special tiny subset of the data that the Higgs was eventually discovered.