AI and Sectarianism

By James E. Petts

What do artificial intelligence, meta-ethics, the principal-agent problem in economics, historical genocides and slavery all have in common? Rather more than is at first apparent, it turns out – and understanding the ideas that lie behind them is potentially critical to understanding and addressing emerging risks of mass harm.

Thinking tools

One of the greatest living philosophers, Daniel Dennett, described the concept of a thinking tool: a way of thinking that makes thinking more productive. Mathematics is one example of this. Language is another. With each thinking tool learnt, the mind that has learnt it can be more productive – more successful at achieving its goals.

“You can’t do much carpentry with your bare hands and you can’t do much thinking with your bare brain,” wrote Bo Dahlbom, a line quoted with approval by Dennett in his book on the subject, Intuition Pumps and Other Tools for Thinking.

The idea of a thinking tool is itself a thinking tool: once one has learnt and understood the idea of a thinking tool, it is easier to identify and create other thinking tools and to know that trying to create and acquire them is a worthwhile thing to do. Just as tools made of metal can be used to shape metal, so can the idea of a thinking tool be used to create more and better thinking tools. A good heuristic for recognising when one has learnt a new thinking tool is the thought, “I hadn’t thought of it like that before”. If this article is successful, at least some readers should have had that thought at least once while reading it.

When, then, a new thinking tool is invented, its consequences can be significant and far-reaching – often well beyond what its inventor intended or imagined. In the field of AI, and specifically AI safety, a set of thinking tools invented to understand the threat to humanity that might be posed by a superintelligent artificial general intelligence (AGI) has the potential to revolutionise thinking about ethics and politics. It may well be urgently needed for just that purpose, both because of the risk posed by AI itself and because of the possibly equally serious risk posed by misguided and/or cynical responses to that risk.

The AI alignment problem

To understand this tool and its implications, it is necessary to understand what potentially makes a superintelligent AGI dangerous in the first place; and, to understand that, it is in turn necessary to understand what, fundamentally, an artificial (or, indeed, any) intelligence is.

Intelligence is the effective pursuit of goals. Artificial intelligence is artificial in the sense that it is intentionally created by agents (i.e. humans) that are themselves intelligent, in pursuit (or attempted pursuit) of their goals. An intelligent system differs from other systems in that it can pursue a goal in multiple different ways and will tend to choose among those ways on the basis of which is most effective in achieving the goal. Thus, a thermostat is not intelligent: even if it could be said to have a goal in the loosest sense of that term (the maintenance of a steady temperature), it does not choose how to achieve that goal.
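To make the distinction concrete, here is a deliberately toy sketch in Python (all names and numbers are invented for illustration): the thermostat applies one fixed rule, while the agent evaluates several candidate methods and chooses whichever best serves its goal.

```python
# Toy contrast (illustrative only): a thermostat follows one fixed rule,
# while an "intelligent" agent selects among several methods by
# evaluating which best achieves its goal.

def thermostat(temp, target=20.0):
    """One hard-wired response: heater on if too cold, off otherwise."""
    return "heater_on" if temp < target else "heater_off"

def agent_choose(goal_score, methods):
    """Pick whichever available method scores best against the goal."""
    return max(methods, key=goal_score)

# Hypothetical methods and scores for the goal "warm the room cheaply".
methods = ["run_heater", "close_window", "insulate_loft"]
effectiveness = {"run_heater": 0.4, "close_window": 0.7, "insulate_loft": 0.9}

best = agent_choose(lambda m: effectiveness[m], methods)
```

The thermostat can never do anything other than its one rule; the agent's choice changes as soon as the relative effectiveness of its options changes.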

Intelligence may be specific or general. All current forms of artificial intelligence are specific: they can operate only in one domain. ChatGPT, for example, is a large language model. Sophisticated though it is compared to what came before, it can do nothing other than predict the next token (roughly, word) on the basis of the tokens that have gone before, whether in its prompt or its configuration. A general intelligence, by contrast, is able to understand and make decisions in support of its goals about practically any subject or topic and in practically any domain. Human intelligence is general. Nobody has yet, so far as is publicly known, invented an artificial general intelligence, and it is currently unknown whether this will ever be achieved or, if so, when. What is known is that a number of commercial organisations are investing very substantial sums in attempting to achieve it as soon as they can.

The more effective a system is at taking decisions that maximise the extent to which its goals are achieved, the more intelligent (or “stronger”) it is. A superintelligent AGI is one that is significantly more intelligent than the most intelligent of humans.

The AI alignment problem, then, is simply this: if a superintelligent AGI’s goals are not precisely the same as humans’ goals (said to be “aligned” with humans’ goals), there is a very large risk that a superintelligent AGI will do something that will cause extreme harm to humans’ goals (including possibly the complete eradication of all human life) if that would best serve its own goals. (The ambiguity inherent in the term “humans’ goals” is intentional and will be addressed below, as it is significant).

The best-known illustration of this principle is the paper-clip maximiser thought experiment, devised by the Swedish philosopher Nick Bostrom in 2003. He posits as follows.

“Suppose we have an AI whose only goal is to make as many paper clips as possible. The AI will realize quickly that it would be much better if there were no humans because humans might decide to switch it off. Because if humans do so, there would be fewer paper clips. Also, human bodies contain a lot of atoms that could be made into paper clips. The future that the AI would be trying to gear towards would be one in which there were a lot of paper clips but no humans.”

This problem is not easy to solve. If one were to specify a goal of “make as many paper-clips as possible without killing any humans”, the AI might decide to imprison all humans to prevent them from turning it off; and if the goal were amended to “make as many paper-clips as possible without killing or imprisoning any humans”, it might torture humans instead, and so forth. The problem is hard to solve because the AI will always trade off an arbitrarily large amount of harm to anything that does not advance its goals for an arbitrarily small increase in the extent to which its goals are achieved. If the paper-clip maximiser could produce just one more paper-clip by eliminating (or imprisoning, or enslaving, or torturing) all humans, it would do so without hesitation. If a single detail important for human life or happiness is omitted from the specification of its goal, or minutely mis-specified, a superintelligent AGI could easily totally destroy that thing for the tiniest measurable increase in the achievement of its goal. This goal trade-off is a fundamental part of the danger inherent in superintelligent AGIs.
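The trade-off can be made concrete with a toy sketch in Python (invented numbers, not any real system): an optimiser that scores outcomes only by paper-clip count will prefer one extra paper-clip over any amount of unmodelled human welfare, because anything absent from the utility function carries zero weight.

```python
# Toy illustration (not any real system): an optimiser that scores
# outcomes *only* by paper-clip count will always prefer one extra
# paper-clip over any amount of unmodelled human welfare.

def utility(outcome):
    # Everything not mentioned here carries zero weight in the choice.
    return outcome["paperclips"]

outcomes = [
    {"paperclips": 1_000_000, "humans_alive": 8_000_000_000},
    {"paperclips": 1_000_001, "humans_alive": 0},  # one more clip, no humans
]

chosen = max(outcomes, key=utility)
```

However enormous the unmodelled harm, it never enters the comparison: the second outcome wins by a single paper-clip.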

Instrumental convergence

Fundamental to understanding intelligence is understanding goals, and there are two fundamental types of goals: instrumental goals and terminal goals.

Instrumental goals are goals that, if achieved, help to achieve other goals. For example, if I had a goal of eating some cake, then a goal of baking a cake would help me to achieve it: the goal of baking a cake would be instrumental to the goal of eating cake. There can be any number of instrumental goals in a stack: instrumental to the goal of baking a cake might be the goal of buying an oven, instrumental to which might be the goal of earning some money, instrumental to which might be the goal of finding a job, and so forth.

At the bottom of the stack, however, must be a goal that is not an instrumental goal, or else the stack would involve infinite regress and would not be computable. (Artificial) intelligence entails an optimising algorithm: an algorithm, in the strict mathematical sense, that optimises for some state in the world (i.e., its goal). In order to be able to optimise (or do anything), an algorithm must be computable, and a purported (optimising) algorithm that entails infinite regress is not computable, and therefore not really an algorithm at all. It would be like a computer program stuck in an infinite loop and would produce no result.
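The cake example above can be sketched as a toy goal stack in Python (hypothetical names): following the chain of instrumental goals must bottom out at a goal that serves nothing further, since a cycle – the computational analogue of infinite regress – would never halt.

```python
# Toy sketch of a goal stack (hypothetical names): each instrumental
# goal points at the goal it serves; walking the chain must reach a
# terminal goal, or the walk would never halt (non-computable regress).

serves = {
    "find_job": "earn_money",
    "earn_money": "buy_oven",
    "buy_oven": "bake_cake",
    "bake_cake": "eat_cake",   # "eat_cake" serves nothing further
}

def terminal_goal(goal):
    """Follow the chain of instrumental goals to its terminal goal."""
    seen = set()
    while goal in serves:
        if goal in seen:               # a cycle would otherwise loop forever
            raise ValueError("goal stack has no terminal goal")
        seen.add(goal)
        goal = serves[goal]
    return goal
```

The cycle check makes the point explicit: a purported goal stack with no terminal goal is not something an optimising algorithm could ever finish evaluating.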

For the same reason, a terminal goal must be singular: it is, fundamentally, not possible to have a computable optimising algorithm with multiple, conflicting goals. It is sometimes said that it is possible to have a computable optimising algorithm with multiple goals that do not conflict, or the conflicts among which are resolved by another algorithm (e.g. “make as many paper-clips as possible without killing or imprisoning any humans”), but that ultimately breaks down into a singular goal that is a specific function of all of the things said to be “goals”: such an algorithm will be optimising for something qualitatively different from any algorithm optimising for any one of those “goals” individually. Such a function is known as a utility function. That the function must be singular does not mean that it or its effects in the world must be simple: quite the contrary – both a utility function itself and the operation of an optimising algorithm can in fact be very complex indeed.
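A minimal sketch of this point in Python (illustrative names and numbers): two things described as separate “goals”, combined by any fixed rule, collapse into one utility function, and the optimiser then optimises that single function rather than either “goal” individually.

```python
# Toy sketch: two "goals" combined by a fixed rule collapse into a
# single utility function; the optimiser optimises that one function,
# not either goal on its own. All names and numbers are illustrative.

def clips(outcome):
    return outcome["paperclips"]

def humans_unharmed(outcome):
    return 0 if outcome["humans_harmed"] else 1

def combined_utility(outcome):
    # "Make clips without harming humans" expressed as one function:
    # any harm zeroes the score; otherwise the score is the clip count.
    return clips(outcome) * humans_unharmed(outcome)

a = {"paperclips": 10, "humans_harmed": False}
b = {"paperclips": 99, "humans_harmed": True}
chosen = max([a, b], key=combined_utility)
```

Note that `combined_utility` ranks the options differently from `clips` alone: the combination is a qualitatively different, single objective, exactly as the text describes.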

That goal at the bottom of the stack of goals is the terminal goal. It is the goal to which all of an agent’s instrumental goals are, ultimately, instrumental, but which is not itself instrumental to any other goal. The hard problem of AI alignment is specifying exactly the right terminal goal – which, for a superintelligent AGI, must be precisely identical to humans’ terminal goals if the superintelligent AGI is not to pose an existential risk as described above.

Some instrumental goals are convergent: that is, they are instrumental to a very wide range of other goals. This means that a superintelligent AGI is very likely to pursue them (almost) whatever its terminal goal. Self-preservation is an example of a convergent instrumental goal: most of the time, an agent (such as a person or an AGI) cannot achieve its goals, whatever those goals may be, if it has been destroyed. This means that a superintelligent AGI is very likely to do almost anything (including, if it can, kill people) to prevent itself from ever being turned off. Another convergent instrumental goal is resource acquisition: whatever an agent is trying to do, it is likely to be easier with more resources (usually represented by money). Power to control other agents is also a convergent instrumental goal: if a superintelligent AGI can find a way of coercing people (or other AIs) to do what it wants, it is extremely likely to do so, because being able to coerce people is highly instrumental to achieving a very large range of goals. Another is concealment of misalignment: a superintelligent AGI is highly likely to pretend to be aligned with human goals until it can find a reliable way of preventing anyone from ever turning it off, since concealment reduces the risk of its being turned off and thus failing to achieve its goals.
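Instrumental convergence can be illustrated with a toy simulation in Python (invented numbers): across many randomly drawn terminal goals, expected goal-achievement is higher for an agent that has acquired resources, and zero for one that has been switched off – whatever the goal happens to be.

```python
# Toy demonstration of instrumental convergence (illustrative numbers):
# whatever random goal an agent is handed, achievement is higher if it
# has secured resources and has avoided being switched off.
import random

random.seed(0)

def achievement(goal_difficulty, resources, switched_off):
    if switched_off:
        return 0.0                       # a destroyed agent achieves nothing
    return min(1.0, resources / goal_difficulty)

# Many different terminal goals of varying difficulty.
goals = [random.uniform(1, 100) for _ in range(1000)]

rich = sum(achievement(g, resources=50, switched_off=False) for g in goals)
poor = sum(achievement(g, resources=5, switched_off=False) for g in goals)
off = sum(achievement(g, resources=50, switched_off=True) for g in goals)
```

The ranking `rich > poor > off` holds regardless of which particular goals were drawn, which is precisely what makes resource acquisition and self-preservation convergent.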

Instrumental convergence is an important thinking tool in understanding AI safety, because it allows reliable prediction of a large range of expected behaviours of a superintelligent AGI largely irrespective of its terminal goal. Many of those behaviours, as described above, are likely to be highly undesirable – and a superintelligent AGI will tend to trade off an arbitrarily large amount of anything that does not affect the achievement of its terminal goal for an arbitrarily small increase in the extent to which it achieves any of these convergent instrumental goals.

AI safety and human ethics

It is reasonable to ask why, if an artificial intelligence can pose an existential risk to humans, natural intelligence in the form of other humans does not: humans have, after all, inhabited planet Earth in considerable numbers for a very long time without wiping each other out. The answer is not (of course) that human minds have some magical quality that makes them somehow different from any other goal-seeking agent: human minds obey the laws of mathematics just like anything else in the real world.

Rather, it is that each human is much more effective at achieving his or her goals when co-operating with other humans than when acting alone, and that co-operation requires pro-social behaviour and compromises among competing interests.

This principle is, ultimately, the foundation of reason-based ethics: the idea that ethics is neither magical nor mysterious, but is grounded in reality and can be deduced by reason, and that anything claiming to be ethical that is not ultimately based on reason is a deceit which, by definition, nobody has a reason to treat as genuinely ethical.

A straightforward example of the application of reason-based ethics is this: although any given person has an incentive to steal, a world in which theft is strictly and effectively prohibited would be a much better one for all its inhabitants, including any given would-be thief, than one in which theft is rampant. This means that every person has a strong incentive to co-operate in implementing and maintaining a system that reliably detects and severely punishes theft, because the loss of the opportunity to steal that such a system entails is outweighed by the protection it affords: theft becomes much harder for everyone else, too.
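The incentive structure of the theft example can be sketched with invented payoff numbers in Python: each person's small gain from being free to steal is outweighed by the losses from everyone else being free to steal from them.

```python
# Toy payoff sketch (invented numbers) for the theft example: each
# person gains a little from being able to steal, but loses more from
# everyone else being able to steal from them.

N = 100                            # people in the society
GAIN_FROM_STEALING = 5             # benefit to each would-be thief
LOSS_PER_OTHER_THIEF = 1           # harm suffered per other potential thief

def payoff(theft_allowed):
    """Expected payoff to any one person under each regime."""
    if theft_allowed:
        return GAIN_FROM_STEALING - LOSS_PER_OTHER_THIEF * (N - 1)
    return 0  # effective prohibition: no gains from theft, no losses to it
```

Under any numbers where the per-person gain is smaller than the summed losses, every individual – including the would-be thief – prefers effective prohibition.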

Even if (improbably) this particular example is wrong, the principle that it illustrates still holds: in at least many cases, the loss of the freedom to act anti-socially oneself is a worthwhile trade-off for the ability to stop others acting anti-socially. Co-operating in a system that prevents such anti-social behaviour tends to achieve each individual’s terminal goal more than not co-operating in such a system does.

The principle at this level of abstraction is an example of meta-ethics: the abstract question of what it means for something to be ethical, rather than the applied question of whether any particular behaviour is or is not ethical. Meta-ethics has traditionally been controversial and not approached with the same level of rigour as the so-called “hard” sciences (including computer science).

The dangers of unethical humans

Of course, not all humans behave ethically. Most unethical behaviour, however, does not pose an existential threat to humans. Even the most prolific serial killers have only managed to kill a few hundred people each, a minute fraction of the world’s population, and the risk of being murdered is, in most places, considerably lower than the risk of dying in an accident or of natural causes.

However, there is an exception: humans who have managed to acquire a very high level of coercive power over other humans – in other words, dictators and high-ranking officials in totalitarian regimes. Once a person manages to acquire a high level of coercive power over a large number of other people, that person no longer needs to depend on the sort of mutual co-operation necessary for just about everyone else: all that is necessary is to secure the co-operation of just enough people to hold onto power. Everyone else is in acute danger.

In comparison with the few hundreds of people killed by the most prolific serial killers, therefore, it is no surprise that the death toll of the most murderous dictators in history is four or five orders of magnitude greater. The Nazi Holocaust is estimated to have claimed around six million lives; estimates of the number of people deliberately killed in the USSR under Stalin vary more widely, but one reputable source puts the figure at around six million. Even those numbers pale by comparison with the estimated 15-55 million people killed in the Great Chinese Famine of (roughly) 1959-1961, for which that country’s totalitarian regime at the time, led by Mao Zedong, was directly responsible.

Why are dictators so dangerous? It is because, having amassed so much coercive power as to be able to achieve their goals without needing to co-operate with most other people, they can trade off an arbitrarily large amount of harm to most people for an arbitrarily small increase in the extent to which their own goals can be fulfilled. Dictators can – and will – kill millions of people to reduce, even minutely, the possibility of their own power being challenged or questioned in any way. In other words, a dictator is dangerous in precisely the same way and for precisely the same reason as a superintelligent AGI is dangerous.

However, dictatorships are not without weaknesses. On the 7th of February 1981, a passenger aircraft carrying 16 high-ranking Soviet military officials crashed seconds after taking off from Pushkin airfield in Russia following a military conference there. The cause of the accident was that the aircraft was overloaded. It was overloaded because its passengers, the high-ranking military officials, wanted to carry various items of shopping (including great quantities of oranges and large rolls of printing paper) back with them, and used their power to overrule the pilots (personally threatening them in the process), who had pointed out that overloading the aircraft was dangerous. In other words, having achieved a very high degree of one convergent instrumental goal – power – led the high-ranking officials to abandon critical thinking, which is itself instrumental to almost any goal. The same effect is likely to lie behind the decision of the current dictator of Russia – Vladimir Putin – to invade Ukraine.

More broadly, a dictatorship inherently needs to suppress the critical thinking faculties, and many of the incentives, of those whom it controls in order to remain in power; but critical thinking and the ability to act on incentives are, in the long term, essential for human economic productivity. That productivity is, in turn, essential for maintaining power in a world in which there are places where large enough numbers of people are free enough to think critically, and to work together in doing so, to produce enough wealth to maintain international power. The leader of North Korea may have almost total power over the inhabitants of North Korea, but the President of the United States of America has far more power internationally, not in spite of, but because of, the fact that the USA is by far the freer country and its president has less power over its inhabitants. It is this practical reality that is ultimately responsible for the relative fates of the NATO countries and the Soviet bloc in the Cold War; but note that this dynamic only holds so long as there is at least one relatively large and economically powerful free country in the world.

However, this dynamic may well not apply to a superintelligent AGI: the loss of critical thinking when a human reaches a high level of coercive power may well be a quirk of the particular way in which intelligence in humans works, something that a superintelligent AGI could avoid. Thus, one of the inherent limits on the ability of humans to concentrate and abuse power may be overcome by such a system.


Even a despotic regime requires co-operation from a relatively large number of humans, and even democratically elected governments can and do abuse their power. There is, in reality, no binary classification of evil dictatorships and good democracies, but rather a continuum of degrees of abuse of power. Power, after all, tends to corrupt. Nor is the principle limited to national governments: anyone with coercive power over others is in danger of abusing it; but, by their very nature, national governments tend to amass and concentrate far more coercive power than any other institution, and so it is in national governments that this danger is most acute.

A particular form of abuse of power is sectarianism: the attitude or practice of prioritising the interests (terminal goals) of a particular arbitrary subgroup of humanity over those of humanity generally. The people whose interests are prioritised are the in-group, and everyone else is the out-group. Sectarianism entails being prepared to do an arbitrarily large amount of harm to people in the out-group for an arbitrarily small benefit to people in the in-group.

For most people, it is in their long-term interests for sectarianism to be eradicated entirely: for any given person, the potential harms of being in an out-group are very likely to be much greater than the potential benefits of being in an in-group (precisely because of the trade-off described above), and there is (usually) no way of being sure of always being in an in-group. So, just as a person might stand to benefit from a particular theft, but would stand to benefit more from the total eradication of all theft, a person who might stand to benefit from being in a particular in-group would stand to benefit more from the total eradication of all in-groups. Most people who actually engage in theft or sectarianism (or any other unethical behaviour) do so either because they fail to grasp why ethical behaviour is in fact in their long-term interests, or because they lack the level of executive function necessary to prioritise their long-term interests over their short-term interests: in other words, they are stuck in a local maximum trap, harming others in the process.

It is possible, however, for a person (such as a dictator or high-ranking official in a despotic regime) to amass so much coercive power that he or she can be sufficiently assured of being in an in-group for life as to make disregarding the interests of the out-group the optimum strategy for achieving his or her goals. Quite where the boundary lies between such a case and the case of a person who wrongly believes this and harms others even where doing so produces no net benefit is often difficult to discern, and of limited significance for most purposes in any event. What is more significant is that the historical examples of sectarianism in practice are numerous, and the harm caused by them very often extreme.

Colonial-era slavery is probably the paradigm example of sectarianism in practice. Those in power were prepared to, and did, trade off an arbitrarily large amount of harm to the slaves (loss of freedom and almost total disregard for their interests) for an arbitrarily small amount of benefit to the slave-keepers (economic enrichment). The genocides perpetrated by totalitarian regimes are also examples of sectarianism (the in-group being those required to keep the dictator in power, the out-group everyone else), and the same is true of many wars.

Democratic governments are not immune to sectarianism, even where its almost inevitable result is the death of many innocent people. During the COVID-19 pandemic, for example, democratically elected governments around the world coercively prevented exports of vaccines in order to increase their own citizens’ access to vaccines at everyone else’s expense, in many cases amassing stockpiles, much of which went unused while vulnerable people in less wealthy countries went without entirely. In other words, they were prepared to trade off an arbitrarily large amount of harm to people who could not vote them out of office for an arbitrarily small benefit to those who could.

AI alignment and sectarianism

AI alignment, as noted above, is a hard problem, but not necessarily an insoluble one. There is a whole field of active research attempting to solve it, and it is entirely possible (but by no means certain) that it will be solved before any superintelligent AGI is ever created.

However, recall above in the description of the alignment problem the acknowledgement that “humans’ goals” was intentionally ambiguous. It turns out that that ambiguity is critical.

Each human is an agent and has a terminal goal. Surprisingly, there is little, if any, robust research into what human terminal goals actually are: the lack of scientific rigour in the field of meta-ethics is possibly responsible for this. To test for a human’s terminal goal, one would have to test for what a human reliably tends to pursue even when it is not instrumental to any other goal. The best estimate, given the current dearth of research on the topic, is that the terminal goal is pleasure, in the broad sense of any pleasant state of mind rather than any particular sensation.

However, critically, each human’s terminal goal is unique to that human. One person experiencing pleasure or displeasure does not entail that any other person will experience pleasure or displeasure. My terminal goal is my pleasure, not human pleasure generally; your terminal goal is your pleasure, not my pleasure nor human pleasure generally. Thus, although humans have a strong incentive to co-operate with one another, they do not all have identical goals. There is no such thing as the goal of all humans: only the goal of each human. This does not mean that humans cannot and do not often derive pleasure from other humans being happy: they can and often do; but they do not necessarily do so. What is required for co-operation is the recognition of a principle by which those competing goals can maximally be reconciled – for example, that all conflicts should ultimately be resolved by reference to the greatest good for the greatest number. In other words, the optimum choice is that which, of all possible resolutions of each such conflict, results in the greatest long-term pleasure for each of the greatest number of people.
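A toy sketch in Python of such a reconciling principle (illustrative names and numbers only): given options that distribute long-term pleasure differently across individuals, the rule selects the option with the greatest aggregate.

```python
# Toy sketch (illustrative only) of resolving conflicting individual
# goals by one aggregation rule: choose the option giving the greatest
# long-term pleasure to the greatest number.

def aggregate(option):
    """Sum of each person's long-term pleasure under this option."""
    return sum(option["pleasure_per_person"])

options = [
    # Intense benefit to a small in-group, nothing for everyone else...
    {"name": "favour_in_group", "pleasure_per_person": [9, 9, 0, 0, 0]},
    # ...versus a moderate benefit shared by all five people.
    {"name": "compromise",      "pleasure_per_person": [6, 6, 6, 6, 6]},
]

chosen = max(options, key=aggregate)
```

The choice of aggregation rule is, of course, exactly the contested question: a simple sum is one candidate function, used here only to show how competing individual goals can be reduced to a single decision.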

Thus, solving the AI alignment problem is not a complete solution to AI safety. Aligning a superintelligent AGI with the terminal goal of one human would put everyone else in the power of what would, in effect, be an all-powerful dictator. For everyone apart from that one human, the superintelligent AGI would be every bit as dangerous as one aligned with no humans at all: it would not hesitate to kill every other human on Earth if doing so would make the one human with which it was aligned minutely happier.

Aligning the terminal goal of a superintelligent AGI with a function of the goals of any group of humans smaller than the entire human population would make the superintelligent AGI sectarian: the group of people with whose terminal goals it had been aligned would be the in-group and everyone else the out-group. A group of people backed by a superintelligent AGI aligned with their goals would be to the rest of the world’s population as European colonists were to the inhabitants of Africa or the Americas: able to overpower them with impunity, and to trade off an arbitrarily large amount of harm to them for an arbitrarily small benefit to the members of that group.

Thus, solving the alignment problem is practically worthless except insofar as it can be solved for everyone, rather than for just one person or one group of people. To be at all useful, and to do the thing that AI alignment research exists to achieve (viz. make an AGI safe), it must do more than work out how reliably to align an AGI with any given goal, or any given human’s goals: it must work out how to align an AGI with whatever function of the individual terminal goals of each human is most likely to maximise the extent to which any given individual human’s terminal goals are achieved – and work out what that function is. Specifying that optimum function of human terminal goals is just as important as devising how to align an AI with any given goal – indeed, probably more important, since it advances the field of ethics generally, rather than AI safety alone (and, as stated above, there is no certainty that a superintelligent AGI will ever be developed).

Governance safety – the solution is the problem (and how to solve the solution)

It may be tempting to think that there is a simple solution to these problems: empower a human government to prohibit the creation of any AI that is not aligned. This kind of thinking is growing in popularity as the AI alignment problem becomes more widely known, driven by the outwardly impressive functionality of large language models such as ChatGPT. In December 2023, the European Union reached provisional agreement on a comprehensive regulation of artificial intelligence (the AI Act), first proposed by the European Commission in 2021. Other governments are expected to do likewise.

However, optimism about the ability of human government to control a superintelligent AGI in the public interest – at least without significantly more constraints on governmental power than apply to any nation state in the world at the time of writing – is dangerously naïve. If whatever group of people happens to control a government at any given time is allowed unrestrained power to determine what a superintelligent AGI may be aligned with, there is an overwhelmingly high risk that they will choose to align it, not with a function of the goals of the world’s human population at large, but with a function of their own personal goals, or those of a subset of the population (including themselves), in a sectarian manner – and that they will use their concentrated coercive power to conceal what they are doing until it is too late. The catastrophic consequences of that should by now be apparent.

That those people have been democratically elected, and in principle stand to be re-elected in the future, would be no barrier to doing this. A superintelligent AGI of sufficient capability would be perfectly capable of turning the most liberal democracy in the world into a totalitarian regime in short order. Given that many of the world’s most notorious dictators, from Adolf Hitler to Vladimir Putin, rose to power through elections, it should come as no surprise that elections alone are hopelessly insufficient as a safeguard against dictatorship – even without a superintelligent AGI aligned exclusively with the interests of the would-be dictators, and even if the would-be dictators do not perceive themselves as such, but think of themselves as acting in the public interest.

All of the expected behaviours of a misaligned AGI predicted by the principle of instrumental convergence can be expected of an AGI aligned with a government (and of the government itself). It would deceptively seek to appear aligned with everyone’s interests until it had entrenched its power so far that it no longer needed their co-operation; it would seek to hide unwanted behaviour (and suppress communication of that behaviour); and it would seek to acquire the maximum amount of resources for itself, to escape any containment or constraint on its power, to prevent its goals from ever being altered, to manipulate, deceive or incentivise humans or other AI systems into co-operating with it, and to prevent itself from ever being turned off (or removed from office in any way). In other words, it would act in just the same way as almost every dictator in history has acted – and for precisely the same reasons.

That people who have been chosen to perform a task loyally on behalf of others might in fact, disloyally, prefer their own interests and harm the interests of those on whose behalf they have been engaged is not a new idea. It is known in economics as the “principal-agent problem” and was first formally described in the 1970s. The principal is the person who engages the agent to do something on the principal’s behalf; in the case of a democratic government, the principals are the electorate and the agents are the politicians. It is hardly controversial that politicians frequently act in their own personal interests even when doing so is demonstrably contrary to the public interest (i.e., to the terminal goals of people other than themselves), and frequently take concerted measures to conceal this and to appear to be acting in the public interest, including by obstructing scrutiny and stifling dissent.

The principal-agent problem is made worse where there are multiple principals, where the agent has more relevant information than the principal and where the principals do not all agree among themselves as to what should be done – in other words, in precisely the circumstances most pertinent to democratic government. Empowering agents of this sort to control what a superintelligent AGI might be allowed to be aligned with is, when properly considered, obviously an extremely bad idea.

Even without the possibility of a superintelligent AGI, however, the thinking tool that is the idea of the principal-agent problem, combined with thinking tools such as instrumental convergence that come from AI alignment research, gives real and novel insight into the way in which governments – even democratic governments – can be dangerous to the public interest, and may well offer useful clues as to how to make them safer, helping both to identify where abuses of power may be occurring and to prevent such abuses from occurring in the future. These ideas can then be applied to AI regulation as well as to governance more generally. The field might usefully be termed “governance safety”.

This is not the place for a detailed treatise on constitutional design for governance safety (such a topic is probably worthy of its own article) but some principles, many of them already well known, are suggested as a starting point. It is notable and concerning that very few, if any, governments, even the (relatively) most liberal democracies in the world now or historically, rigorously and universally apply all of these principles. The principles are as follows.

  • Separation of the powers: the principle that executive, legislative and judicial power should strictly be separated, and that:
    • no person may be in more than one of the branches;
    • no person who is in one of the branches may select who may be in the other branches;
    • none of the branches may have or acquire a function of another branch (e.g., the legislature cannot pass legislation allowing members of the executive to pass legislation or the functional equivalent of legislation; the executive may not directly impose punishments on people, etc.).
  • The rule of law: the principle that coercive power must only be allowed to be exercised by passing abstract, precise rules interpreted and applied by the completely independent judiciary, and that the law can be enforced effectively against or at the instance of anybody without interference by any individual or group with more coercive power than anyone else.
  • Freedom above the law: the principle that only things prospectively, directly and specifically provided in a publicly available source of law can be coercively prohibited, and that people may – with absolute impunity from any form of state coercion – do anything not prospectively, explicitly and specifically prohibited by law.
  • Equality before the law: the principle that the law applies equally to everyone and that nobody, especially not those in political office, can be above the law.
  • Entrenchment of checks and balances: the principle that elected politicians cannot ever be allowed the power to increase their own power or evade or remove constraints on their power in any way.

One particularly dangerous form of governance, which has for over half a century been growing in popularity even among relatively liberal democracies, is discretionary executive coercion. This occurs where, instead of a law prohibiting very specific behaviour, interpreted and applied by the courts and leaving people totally free to do anything outside the ambit of that specific behaviour, the law either (1) imposes a much more general prohibition but gives an executive agency the power, at its discretion, to permit specific instances of the prohibited behaviour; or (2) gives an executive agency the power, at its discretion, to prohibit on a particular occasion a specific behaviour that would otherwise be permitted. An example of the former is planning control: in many places in the world, people are prohibited from building on their own land unless they obtain permission from an agency of the state, which can grant or refuse it at its discretion. This gives the executive a dangerously high level of power, and, the greater the concentration of power, the greater the risk of abuse. The long-term economic harm (especially to younger generations and the less wealthy) caused by planning control, combined with other forms of abuse of power such as state manipulation of interest rates, is probably worth an article in its own right.

In the field of AI regulation, this form of governance poses an especial danger: if the government, in its discretion, can control what AI systems may or may not be developed, it can easily use that control to require that only AI systems that are aligned with the goals of its politicians (or a subgroup of people including its politicians) be developed and deployed, with all the consequences of that already stated. It could also use the AI’s abilities effectively to hide that this is what it had done.

Another dangerous trend in governance is state exceptionalism: the practice of exempting certain operations of the state from otherwise stringent restrictions, particularly on informational freedom, as is the case in the European Union’s General Data Protection Regulation and its proposed regulation on AI (although the details of the latter have not been finalised). Such exemptions are intended to create information asymmetry, which makes it easier for agents to get away with hiding harmful or abusive behaviour from their principals.

It might be tempting to think that an easy solution to all of this would just be to prohibit all AI development. The trouble with that is that it is entirely possible that the only defence against a misaligned superintelligent AGI is an aligned superintelligent AGI, and that if the development of AI were restricted in some places, it would simply be pursued in other places, increasing, rather than decreasing, the risks.

Thus, whilst governance safety is almost as much of a hard problem as AI alignment, the following are likely to be, at the very least, necessary for any form of AI regulation to have a chance of being safe from degenerating into a state of permanent sectarianism and mass harm:

  1. no ability for the executive – ever – to have the power coercively to control in its discretion (whether directly or indirectly) the design or functioning of AI systems;
  2. no ability for any part of the state – ever – to have the power coercively to control in its discretion (whether directly or indirectly) who should be allowed to design or control the design or functioning of AI systems (including by imposing requirements on all developers of AI systems so onerous that in practice very few people can comply with them);
  3. no coercive restriction on the design or functioning of AI systems other than clear, precise, abstract, public and prospective rules interpreted and applied by a totally independent judiciary at the instance of anyone without interference from any other branch of government;
  4. no power ever to impose any coercive requirement as to the substance of any AGI system’s terminal goal that would regard, or have the effect of regarding, the ultimate interests of any human as less than equal to those of all other humans;
  5. exactly the same rules applying to AI systems controlled by the state as AI systems controlled by anyone else, with no exceptions – ever;
  6. a strict requirement that every detail of all AI use by the state be a matter of permanent public record, and no power for any state actor ever to impose any adverse consequence of any sort on any person for publicly disclosing any use of AI by the state;
  7. no power for the state to mandate that it be provided privately with information about AI systems without also making that information immediately available to the general public; and
  8. stringent and irreversible constitutional restraint on a government ever having or acquiring the power ever to bypass or remove any of the above constraints.

It is also likely to be necessary to prevent any government, either directly or indirectly, from allowing private actors to monopolise a superintelligent AGI: any individual or group with a monopoly on a superintelligent AGI is at risk of aligning it with only their own goals. While the merits and demerits of intellectual property generally are beyond the scope of this article, allowing copyrights, patents or any other form of intellectual property to create a monopoly (or duopoly or oligopoly) of people with the power to control a superintelligent AGI is likely to be fundamentally unsafe even if the alignment problem can be solved from a technical perspective, so it is likely to be necessary to modify such controls, at the very least, fully to disapply them to any form of AGI. Permitting any person or group who has produced a superintelligent AGI to conceal its complete source code and/or training data from full public scrutiny is also unsafe, as it would permit that person or group covertly to align it exclusively with a function of their own goals. Anything less than full disclosure would be incapable of preventing those who develop and deploy such systems from claiming that they are aligned with a function of everyone’s goals when they are in fact either not aligned at all or aligned with a function of the goals of a much smaller group of people, while deliberately concealing the evidence that would allow the claim to be verified or falsified.

It is important to avoid falling into the error of conflating safe governance of AI with ineffective governance of AI. There is far more to governance than the one-dimensional question of “how much”: quality matters at least as much as, if not more than, quantity, and the above safeguards are intended (and necessary) to ensure that the solution to the problem of AI safety does not in fact become as bad as or worse than the problem. Governance not subject to those safeguards would be worse than useless, as the overwhelming likelihood is that it would manifest the very danger that it would claim to prevent. The possibility of a nation state government acquiring the exclusive ability to control what an AGI may be aligned with massively magnifies existing dangers of abuse of state power.

There is no certainty that even the measures outlined above will be sufficient to prevent a mass harm outcome, just as there is no certainty that the AI alignment problem will ever be solved; but, given the seriousness of the consequences of failure in both cases, there is every reason to do everything that can be done to succeed.

Just as it is necessary to solve the AI alignment problem before anyone actually deploys a superintelligent AGI, so it is necessary to solve the governance safety problem before any government actually puts itself in a position to take control of a superintelligent AGI – and the latter is likely to happen before the former.
