Finally, after weeks of work, the position paper is written! It contains a hopefully clear presentation of my ideas about AGI and a tentative outline how to build it. Franz (2015) Artificial general intelligence through recursive data compression and grounded reasoning
This post is addressed at weak/agnostic atheists and makes an argument for strong atheism. If you believe in God, stop right here. The post will be useless to you, since you are probably not being rational about your beliefs anyway. If you are honestly seeking the truth though, read on.
This post has had a long time to mature in my mind so that, finally, I feel that I have the ability to express the idea clearly. Let me summarize the idea right away: the probability of an almighty God existing is zero, due to Occam’s razor, which is a mathematical necessity. Now, usually, atheists take the position of weak atheism meaning that they merely lack the belief in God and find the probability of him existing to be low, a 6 out of 7, as Richard Dawkins would say, where 7 would be certainty that he doesn’t exist. I would like to take the final step from 6 to 7 and have the audacity to think to be well-grounded in taking that step. Bear with me, dear reader, since we will have to delve into the mathematical areas of probabilistic inference and algorithmic probability in order to arrive at a clear understanding of the argument.
Burdens of proof
Strong atheism, also known as “gnostic atheism”, has a much more difficult stance than weak atheism. The former carries the full burden of proof that God doesn’t exist, whereas the latter merely has to rebut theist claims of God’s existence, which, to be honest, is child’s play compared to the strong position. It merely suffices to point out that the theist can not prove the existence of God in order to remain unconvinced and to settle into the comfortable position of “I don’t know”, of agnostic / weak atheism.
That being said, the reader can take some pop corn and curiously wait for me to provide evidence for God’s non-existence. But I will not do so and instead conclude his non-existence without any evidence for it, while still making a completely rational step of reasoning. However, in order to understand it, I have to give a little introduction into probabilistic reasoning.
The mathematical discipline of probability theory provides us with a precise framework for making inferences. First, I will introduce it and then explain to it works intuitively. Let’s say, we have some (non necessarily finite) set of hypotheses . Maybe is the hypothesis that the world has been created by an almighty God. Further, let be the set of data or evidence that we have received about the world.
Let be a probability distribution that encodes our prior beliefs in one of the hypotheses. For example, suppose there are only two hypotheses, =earth is flat and =earth is round. Then expresses what we believe before we receive any evidence for either of the hypotheses, e.g. we might not have any preference for any of the two hypotheses, which is why it is called prior distribution. This would make and . Note the rule that probabilities have to sum to 1, which encodes our belief that some hypothesis has to be true, i.e. the earth has to be either flat or round. In a similar way, the evidence is distributed according to .
In order to make inferences, we have to connect the hypotheses with the evidence, since hypotheses explain evidence more or less well, which make various hypotheses more or less likely. This is expressed by the so-called likelihood , read as probability of data given hypothesis . For example, might be the observation that when we go out into space, we see a round earth; let’s abbreviate this observation data with ‘space’. Then, we have to to get encode our intuition that observing a round earth from space making the hypothesis of a round earth more likely. Hence we would write and , since if the earth is round, it is likely to look round and if it is flat, it is unlikely that it looks round. But it still may if we only had a single snapshot from above and the earth is a flat with round borders, like a plate. Hence, the we wrote instead of .
Now we finally want to make conclusions which are computed by the so-called Bayes rule:
The posterior probability of the hypothesis given data can be computed from the likelihood the prior and the evidence . The posterior encodes what you should believe after having seen the evidence, if you make a rational inference. Keep in mind that this is all very basic probability theory which can be read upon in any book on Bayesian inference.
The point of this exercise is to sharpen the readers attention to the prior . Several comments are important. First, there is no inference without priors. This is very important. You can not arrive at conclusions without having made any assumptions. You can assume that every hypothesis is a priori equally likely or any other distribution but any choice is an assumption. Second, the prior is in the nominator which makes the conclusion probability of a hypothesis proportional to the prior . This reflects the fact that if you are biased toward some hypothesis a priori, you will also conclude its truth more likely after seeing the evidence a posteriori. The evidence may weaken the hypothesis but it will weaken it less so, if the prior belief in it was strong.
Here is important observation to be made: if you are a priori convinced that some hypothesis can not be true, i.e. , then the posterior of that hypothesis will also be zero. The same happens, if you are to 100% certain about a belief: . Usually, this is what we call dogmatism: if you hold on to a belief with absolute certainty, no amount of contrary evidence can make you think differently. Usually, this is a sign of bad reasoning since one can usually not justify such a strong prior. After all, our prior beliefs, if we reason rationally, come from previous conclusions, i.e. , and have therefore been formed through likelihood and previous priors. If previous priors have not been zero or one, then previous conclusions are neither and neither our current prior. The bottom line is, usually, when we start at not knowing anything and make rational inferences, we never arrive at dogmatic positions from which can’t be shattered by any amount of evidence. A rational person always keeps his belief distributions slightly above zero and slightly below one, so that he is still open to change his mind in a rational way, if the evidence demands so.
Let’s consider the God hypothesis: the existence of an almighty creator God. Bayesian inference tells us, that there are only two ways of arriving at the non-existence of God, i.e. of . Either the likelihood is zero, or the prior is zero. Now the likelihood can not be zero since if we assume God’s existence, everything can be perfectly explained due to his interventions, wonders and creations. The only way to arrive at zero posterior is to set the prior to zero. But would that not amount to dogmatism as just argued? If I assume that God doesn’t exist then it is easy to prove that he doesn’t exist. We merely get a circular argument of a dogmatic atheist. However, as I will argue, this is not the case, since the laws of mathematics themselves do not allow to assign God a non-zero prior. Read on.
A core property of probability distributions is that probabilities must contain values between zero and one (inclusively) and also have to sum to one. Something has to be true. This means that if the number of hypotheses is large, they all have to squeeze themselves into the interval [0,1]. For example, if you toss a coin, the prior of seeing “head” might be 0.5, and for “tail” also 0.5. But when you throw dice, there are not 2 but 6 possible outcomes, which makes you assign a prior of 1/6 = 0.1666.. to any particular outcome. The more hypotheses there are, the more you have to smear out the probability mass among the hypotheses and the less probability each one of them will receive. Now, of course, the prior is somewhat arbitrary and nobody forces us to assume a uniform distribution. We could still set the prior of throwing a “3” to be 0.9 and give the other five values a probability of 0.02 each. Then it will still sum to 1 (0.02 + 0.02 + 0.9 + 0.02 + 0.02 + 0.02 = 1.0). Thus, the mere presence of a higher number of hypotheses does not force us to choose as uniform prior. However, in order to avoid being biased toward any particular hypothesis before considering the evidence, we should choose a “fair” distribution, such that we can remain maximally open for evidence.
To summarize, if we have hypotheses, then the uniformly distributed prior would assign the probability of to each hypothesis. But what if we have got an infinite number of hypotheses? After all, there can be an infinite number of possible explanations for a given observation. Then, one might want to assign the probability of to each one, but that is zero, and an infinite amount of zeros don’t sum to 1, this is mathematically a not defined operation. Therefore, we can’t help but assign a different prior to different hypotheses. For example, we might enumerate all hypotheses somehow and call the th hypothesis . Then we might define for example . Then, the sum will converge to a finite number and is therefore normalizable to 1 (that’s a theorem in calculus; don’t bother, if you don’t get this particular point)
We conclude that in the case of an infinite number of hypotheses, we have to assign them different prior probabilities, so that their sum equals 1. But on what basis should we do that? At that point we must further define, what we mean by hypothesis. It might be an almighty God, or the process of evolution or some other process or cause. In any case, hypotheses need to have a description expressed in some language. Some hypotheses are simple, which means that their description is short. Other hypotheses require long descriptions. Formally, in the so-called algorithmic information theory, descriptions are computer programs that are executed and lead to an output — the data or observation, that is explained/computed in this way. The length of the description is often referred to as its complexity.
This can be expressed much more precisely, but I don’t want to burden this post with too much mathematical detail. The point is this: the more complex a hypothesis is (the longer its description) the larger is the number of hypotheses of the same complexity. For example, when we talk about binary strings, the number of strings of length is . This is also intuitively obvious, if there are 10 possible murderers there are many more combinations of how a crime could have happened than if there was only one murderer. The number of hypotheses grows exponentially with their complexity.
Now, given that the sum of probabilities for each hypothesis has to be 1, it means that we have to assign lower probabilities to complex hypotheses. This is what is called Occam’s razor: assume a priori, that a complex explanation is less likely. It is a mathematical necessity since there are so many complex explanations and their probabilities have to sum to a number below 1. Let’s now take the final step: the mathematical limit. What about hypotheses of infinite complexity? There are infinitely many hypotheses with infinite complexity. If we assign a non-zero probability of each of them, the sum will also be infinite. Therefore, their probability has to be zero, if we want to treat them equally.
An almighty God is a hypothesis of infinite complexity. Why? Because otherwise, we could think of a world that he couldn’t create. Since infinitely complex hypotheses have zero prior probability, hence a priori false, an almighty God does not exist. That’s it.
Now, one could think that out of all infinitely complex hypotheses one could assign a non-zero prior probability to an almighty God. Mathematically, nothing speaks against it. However, this is what it means to be biased. If we want to make unbiased, rational inferences, we have to treat all hypotheses of equal complexity equally.
This argument could be formalized in much more detail and is essentially a consequence of Solomonoff’s theory of inductive inference which has been proven to find optimal and predictive explanations for data.
One could also choose a more intuitive way to express the thought: since there are an infinite number of possible ways to explain things and an almighty God is merely one of them, the probability that God is the correct explanation approaches zero in the limit (= equals zero), if one tries to be rational and unbiased.
Atheists usually understand this point intuitively, since most atheists do not assign a probability of 50% to God’s existence. After all, not having evidence in favor of God, should lead one to some position of indifference at 50%, right? Wrong. Most atheists understand the point made above intuitively, which is why they often say that they find it quite unlikely that God exsits (a 6.9 out of 7). The reason why they don’t go all the way to the 7 is simply that humans are bad at probabilistic reasoning especially when it comes to taking mathematical limits. Humans have difficulties discerning small probabilities from very small ones although there can be orders of magnitude between them. Similarly, we have troubles discriminiting infinity from a very large number. However, if one thinks clearly about these matters with the help of mathematics, one can’t help but taking the limit: God does not exist.
Another reason why atheists are reluctant to go all the way is because they want to practice openness and readiness to change your mind if evidence in favor of God is given. Richard Dawkins said at some point, if the stars rearranged and made up the words “I am your God” (something like that), then he would believe. However, someone else said that any sufficiently advanced technology must appear as magic to a society that is not as advanced. Hence, chances are that there are aliens possessing the ability to rearrange the stars and to make fun of us by arranging them into such messages. Since powerful, advanced and humorous aliens definitely could exist with non-zero probability, the likelihood that its actually aliens making jokes is much higher than that it is an almighty God. In fact, it is infinitely more likely (since any finite number divided by zero gives infinity in the limit). Thus again, our inability to discern small numbers from very small ones or zero is the reason for wrong conclusions. A similar argument can be put forward for any finite observation, even if it is mind boggling. Thus no finite amount of evidence can ever point to God.
Of course, it is difficult to say these things in public, since people who did not even arrive at weak atheism, will not understand these things and assume that one is simply dogmatic and doxastically closed, just the same way as believers are often accused to be. However, as this post has hopefully clarified, there are good reasons for believing in God’s nonexistence a priori.
It just dawned on me. When we want to compress, we have to do it in one or the other incremental fashion, arriving in description lengths of intermediate length during the process. Also, there will be different paths that we can pursue during compression leading to different competing programs that aspire to explain our data. This ambiguity stems from the fact that in practice we don’t have immediate guarantees that the next greedy step is the correct one. Therefore, one keeps various half-built versions of programs in memory in parallel. Further, in order to do proper induction, not only the shortest program is of interest but several other short programs as well. This leads to the problem of having to keep track of an array of programs and at each step decide, which part of which program is going to be compressed further.
This decision has to be made based on an internal estimate of likelihood that a particular program is likely to be easy enough to be compressed further. In other words, optimal compression will choose that program that on the one hand, offers much yet uncompressed data and on the other hand is sufficiently easy to compress further. This is exactly the search for “intermediate complexity” tasks that cognitive forms of attention employ! These attention mechanisms focus on tasks that are neither too easy nor too complex. We see that this property automatically emerges from the compression requirement.
Having read this long paper by James P. Crutchfield (1994) “Calculi of emergence”, I have to admit, that it is very inspiring. Let’s think about the implications.
The first achievement of the paper is the definition of statistical complexity. Note that Kolmogorov complexity is a misnomer, since it rather measures randomness than complexity. Complexity on the other hand, seems to be somewhere between simplicity and randomness. Crutchfield measures it by taking a string and mapping each entry on a set of states where two entries have the same state and the predict the same distribution of futures conditioned on the past. Given that definition, the string is mapped essentially on a Markov chain. Since Markov chains possess a marginal distribution of states after running for a long time, the entropy of this distribution is the definition of statistical complexity.
However, this definition is only appropriate, if we get a finite description. In many cases however, a Markov chain or finite automaton, is not sufficient. Essentially, when trying to increase modeling precision, such as looking at ever longer past histories and future predictions, the number of states grows without bounds. When it grows to large, the system is pressed to “invent” a new computational class by grouping the states together to a new automaton, e.g. one with a register or other forms of memory. And, it is suggested to do that until a finite description is achieved. The statistical complexity is defined as the Shannon entropy of the state distribution in this highest description (lowest level finite description to be precise).
The cool part is that it somehow achieves filtering out the random part, somehow even including the pseudo-random part. After all, a theme running through the paper is also the following. The dichotomy between simplicity and randomness is somehow paralleled by the dichotomy between order and chaos. This parallel has been intriguing me for a long time already. The last time was when I asked myself, which strings are compressible, but are not incrementally compressible, i.e. that don’t have any properties. Another reason is that throughout the experience in machine learning, computation seems to happen best at the edge between order and chaos. The failure of Kolmogorov “complexity”, let’s call it (algorithmic) entropy, is evident here as well: chaotic sequences often have a low algorithmic entropy, even though they appear random. They appear so that convincingly that we use chaotic sequences in order to generate pseudo-random numbers in our computers. Somehow statistical complexity gets rid of those complications, slices away the random part and uncovers the “true” computational aspects of the system.
Another aspect of the discussion is the interesting interplay between different perspectives: an observer modeling a natural system / data set, the AI aspiration of automatizing this very modeling as a means to implement universal induction and nature’s creation of living beings that model their environments while getting ever more complex. The latter aspect is quite mesmerizing because of the interplay between order and chaos in nature, since nature seems to be a processes of self-organized criticality which means that the edge/border between order and chaos is an attractor, i.e. a stable state. If that state is exactly where innovation happens then it could explain why living beings become increasingly complex. And here is another contribution of that paper to that aspect: it is at the edge of chaos where the number of states tends to diverge into infinity. After all, both orderly/regular and chaotic regimes seem to lead to finite descriptions. But it is the divergence of state numbers that brings the modeling to its resource limits, forcing it to switch to a higher representation, a more powerful machine, a higher complexity level.
A yet another inspiring aspect of the paper is the parallel between “state grouping”, which is referred to as some processing that models the symmetries of the data, and detecting features in my incremental compression (IC) theory. I mentioned above that I suspect that it may be precisely the complex strings that are incrementally compressible. Given that hierarchical epsilon machine reconstruction is declared as a process the incrementally detects symmetries, the parallel is fairly strong. Here, in the questions about incremental/hierarchical construction, in the interplay between order and chaos, is a deep mystery to be uncovered, about evolution, the origin of complexity in nature and the essence of intelligence. I am quite confident about this.
There seems to be a deep parallel between the evolutionary incremental innovation process and the incremental process of induction. In the paper, innovation seems to consist of two parts. One is state grouping, which is a well-defined procedure, independently of the data. The other part is the “true novelty” which is found by random search in a small space, which results from the state grouping. In IC-theory no such random search is necessary once the regularity/feature/state grouping is found. Another aspect that comes to my mind is that this hierarchical epsilon machine process will often lead to a high hierarchy for natural data, since there is structure on all scales, hence there we have this open-ended hierarchical construction process. This reveals the role of the modeling precision: the more precisely we model, the more we sort-of zoom in on the data, the higher our hierarchies will have to become. Hence, it is not just the living beings that become ever more complex during evolution, but it is the structure of matter itself that organizes itself in such a way that it can only be modeled by a high hierarchy. The author views living beings as “dual” to their environment (matter), which is what it means to fill a “niche” in the environment. Thus it seems that this self-organizing criticality (SOC) process of nature forces matter into complex states which forces living beings to increase in complexity as well if they want to model those states properly in order to predict their environment and survive. Unfortunately, SOC is not sufficient for complexity, as evidenced by the simplicity of the showcase SOC model: the sand pile. Therefore, we don’t really know how nature forces matter into complexity.
Another interesting aspect of the paper is the following. The process of hierarchical epsilon machine reconstruction demands to build a machine by grouping those histories that entail the same future distributions. By increasing the precision, i.e. decreasing epsilon, the length of the look-back and look-ahead sequence increases which increases the number of reconstructed states. If this number grows without bounds as precision increases then it is advised to switch to a more powerful machine by grouping the new states again or some other innovation and do so repeatingly until a finite representation is reached.
This process reminds me of the “universal approximation” property of neural networks, see previous post on my ideas on that. The number of neurons generally also grows without bounds with increasing precision. Could it also be due to the simple fact that the present representation is not powerful enough? Maybe the particular neural network has to switch to some more powerful computation class? After all, my observations in that post implied that universal approximation is not the same as algorithmic completeness. Crutchfield’s paper helps to flesh out the difference: a simple representation can often approximate the data arbitrarily precisely, but at the expense of a large number of parameters, while more powerful computational classes can find a finite representation. The paper suggests a way of doing this, but I’m afraid that this “state grouping” procedure is too specific. Can we formally and more precisely describe what happens here?
Now that I think about it, there are several ways, in which hierarchical epsilon machine reconstruction (HEMR) differs from IC theory. First, although the author claims that the goal of state grouping is to find symmetries (features) of the data, this is not what happens. Rather, the attempt is to find a full description of the data in a single step in the hierarchy by using a not-so-powerful machine. After all, the goal of HEMR is to separate the random part from the “intrinsic computation” part, i.e. signal from noise. In IC theory, however, one tries to decompose the signal part into various features one of which will turn out to be noise which is carried through the parameters up to the highest layer since it doesn’t have features and is not compressible. Thus, IC decomposes the signal into crisp separate parts using functions/features from a Turing-complete set right at the beginning, while HEMR tries to model the whole signal at once but uses a not-so-powerful machine and steps up in the computation hierarchy if the machine doesn’t do it. If we combine IC and HEMR in a smart way, we might get a procedure that decomposes a string into parts which can be looked for by increasingly powerful machines. This would alleviate the IC-problem of having to search for features in a Turing complete space and the HEMR-problem of having to find a full description in a single step. Could we posit HEMR on the firm grounds of algorithmic information theory?
And what is the fundamental reason for the statistical complexity to peak at intermediate entropies? It seems that this peak corresponds to the most complex intrinsic structure which has to be modeled by the more powerful machines.
A simple but deep question. Why do we want to know this?
In order to speed up compression we have mainly considered two ideas, exploiting the compositionality of functions that generate data (incremental compression) and exploiting power laws in natural data (hierarchical compression). When looking at the human visual perception hierarchy we realize that in the lowest level of the hierarchy a very narrow set of functions is used. We mainly have edge detectors for various orientations and moving speeds and a few others like centre-surround cells, end-stopped cells, grid cells etc., in order words, a set of functions far too narrow for processing a wide range of data.
What’s the trick? Having a narrow set of functions definitely increases the ability to reduce data dimension if it is possible to generate the data by those functions. Let’s say, we only have edge detectors and we have to perceive a circle. How do we do that? It looks like our visual system is not particularly prepared to recognize arcs of various curvature. Then only way that comes to mind is to approximate the circle by a set of lines, which form a regular polygon. However, every such polygon has corners. So how do we know that a circle doesn’t have corners when all we can perceive are things made up of straight lines? After all, we know the difference between a regular 100-gon and a circle, although our representations can not detect a difference.
The reason is probably that there are many ways to approximate a circle with a polygon. The existence of edges and their position and quantity directly depend on the choice of approximation. They are a property of the approximation, not of the world. In practice, the system would try to fit a polygon and will realize that there are many possibilities to fit a polygon with arbitrary numbers of corners and lengths of the connecting lines. Also, the system will realize that those are not arbitrary polygons but quite regular ones, whose corners always have the same distance to the centre, which is an achievement of compression. Nevertheless, it is not exactly clear to me, how the “cornerlessness” property is inferred.
Solving this class of problems seems important to me since it would allow to achieve quite general induction abilities while using a narrow set of detectors are lower levels of processing that can reduce the dimensionality of the input significantly.
Finally, it has dawned on me. A problem that I had troubles conceptualizing is the following. On the one hand, for the purposes of universal induction, it is necessary to search in an algorithmically complete space. This is currently not possible because of the huge number of possible programs to search through. My mind then automatically wandered toward algorithms that don’t go through the trouble of searching through the whole search space available to them. Any parametrized machine learning algorithm, be it neural networks or linear regression, finds solutions by gradient descent or even simpler, but none bothers to loop through all weight combinations.
The next thought was that this kind of efficient search is only possible within a fixed algorithmically incomplete representation. However, it is well-known that neural networks are universal function approximators, which sounds like algorithmic completeness, does it not? After all, Turing machines merely compute a integer functions. Why then would it be so hard for neural networks to represent some structure not optimally suitable for them? For example, a circle can be represented by the equation , which is difficult to represent with, say, sigmoidal neurons. I attributed this difficulty to the fact that there are different reference machines after all and some things are easy to represent with one machine but difficult in another and vice versa.
But now I realize that there is something fundamentally different between universal function approximators and algorithmically complete languages. After all, polynoms are also universal approximators, too! But ever tried to fit a polynom to an exponential function? It will severely overfit and never generalize.
So, what is the difference between an algorithmically incomplete universal function approximator and a sub-optimal algorithmically complete reference machine? While the latter differs from another alg. complete machine merely by the constant length needed to encode a compiler between those machines, the former requires ever more resources when the approximation precision is increased. For example, every alg. complete language will be able to represent the exponential function exactly; some languages will require longer, some shorter expressions, but the representation will be exact. And the additional length in some languages is merely due to the fact that some functions are cumbersome to define in them, but once done, your function will be represented exactly. Conversely, a universal non-complete approximator will require ever longer representations when you increase precision and goes even to infinity when precision requirements are highest. More precisely, if is the maximum approximation error that you allow and is the size of the network increases as decreases. My point is that it should not depend on . For example, if you can represent the data perfectly in some representation with bits, then your representation should take bits where the constant does not depend on (i.e. const is just the length of the compiler that translates between the two representations). And this is just not the case for neural networks. And that is also why the Taylor expansion of the exponential function is an infinite degree polynomial — polynomials are simply not suited to represent the exponential function.
But why should precision matter? Why do we care about this effect? Should we not rather care about the generalization abilities? For example, imagine a family of exponentials , where is the width of the exponential. If I fit a polynom to it, I would need a different set of parameters of the polynom for each value of ! Therefore, polynoms don’t really capture the structure of the function. On the other hand, an alg. complete language will always have such a single parameter available and once the representation is constructed, the whole family is covered. That’s actually the point. In order to generalize, it is necessary to recognize the sample as an example of a family that it belongs to and to parametrize that family. And neither neural networks nor polynoms can parametrize the exponential family, without picking completely different infinite (not single!) parameter sets for the respective polynomials. An alg. complete language, however, can do this.
I think, essentially, the criterion is: if your data is drawn from a class which is parametrizable with parameters in some Turing complete representation, then your representation should be able to express that same data with not (much) more than n parameters. Otherwise, it is “just an approximation” which may fit the data well, but will certainly overfit, since it will not capture the structure of the data. This follows from universal induction. If your data is computed by a program , , then taking a longer program that computes some prefix of , , will hardly make able to predict the rest of : will not equal . And it is well known, that the shortest program computing x is the best predictor with high probability, so that every bit wasted in program length reduces your prediction abilities exponentially, due to the terms in the Solomonoff Prior!
Another point is that algorithmic completeness may not even be necessary, if the data created by Nature comes from an incomplete representation (if the world consisted only of circles, then you only need to parametrize circles in order to represent the data and generalize from them perfectly). That this is the case is not even improbable, since physics imposes locality and criticality (power law distributed mutual information in natural data), which probably makes the data compressible by hierarchies (as reflected by my theory of incremental compression. However, this class is still quite large, and we should be able to capture its structure appropriately.
I have presented this paper in the AGI conference in New York this year.
It presents a general way of speeding up the search of short descriptions of data that is made up of features – which is what our world is made of. This may well lead to a tractable solution of the problem of universal inductive inference.
The whole field of machine learning, and artificial intelligence in general, is plagued by a particular problem: the well known curse of dimensionality. In a nutshell, this curse means that whenever we try to increase the dimension of our search spaces in order to make our models more expressive we run into an exponentially increasing number of search points to visit until a satisfactory solution is found.
In practice, people are then forced to use smaller models with manageable parameter spaces to search through which sets up the whole problem of “narrow AI”: we end up being able to solve merely narrowly defined specific tasks.
We should know better though than searching more or less blindly through vast search spaces. Both decades of practice in machine learning, our philosophical thinking, scientific theories and the theory of computer science have demonstrated undeniably the validity of a well-known principle, Occam’s razor: “Entities must not be multiplied beyond necessity” (Non sunt multiplicanda entia sine necessitate). Especially in recent decades, Solomonoff’s theory of universal induction has shown that it is supremely important to find simple solutions to problems, i.e. that simple solutions are a priori exponentially more probable than complex ones.
Of course, we have known that before already and have been sincerely trying to reduce the size of our models and the number of their parameters. We have introduced various information criteria like the Akaike Information Criterion, the Bayesian Information Criterion, various penalty terms for model sizes. However, this is by far not good enough, as I will argue and will suggest a solution to it.
The problem with introducing a simplicity bias comes from the fact that we don’t know how to compute simplicity or “complexity” of any data set. Fortunately, we do have a formal definition of “complexity” – the Kolmogorov complexity, which I will call algorithmic entropy (since it actually is more related to entropy than to complexity, after all randomness is usually not considered complex in the usual way we use the word). The algorithmic entropy of a data string is given by
which is the length of the shortest program being able to compute that data set on a universal Turing machine . In practice, in order to know the algorithmic entropy of a data set, we’d have to find the shortest program being able to generate it and measure its length (the number of bits it takes to specify the program).
However, finding a short description is the reason why we started this whole thing in the first place: we need a the algorithmic entropy in order to evaluate the appropriateness of candidate solutions but we need the solutions in order to compute their entropy – a hen-egg problem. And as most hen-egg problems in computer science, this one may also be solvable in an iterative way.
Consider a typical set up of a machine learning problem. We have chosen some model and want to fit its parameters to the data using some cost function. The size of the model is defined by the number of parameters which we can vary. For example, a the size of a neural network can be regulated by the number of neurons involved which affects the number of weights that need to be fitted.
Ideally, we do not want to go through the parameter space blindly but to try first “simple”, low entropy, parameter settings. For example, a convolutional layer of a neural network consists of batches of a small number of weights per neuron where each neuron has got the same set of weights. “Simple” means that the program describing the weights will be short since it only needs to specify the small number of weights of a single neuron. By going through the simple first we ensure to fulfil the Occam’s razor principle and stop as soon as we find a solution that fits the data well.
The problem is, however, the following. For a fixed size of the parameter set, which can be represented as a binary string of a fixed length, the fraction of simple strings is very low. After all, a simple counting argument shows that there are at most programs of length at most , i.e. that are able to compress the string by at least bits. Since there are bits of length , the fraction of programs that compress a string by bits goes like . There are two things to learn from this. First, the fraction of simple parameter descriptions does not depend on the length of the parameter set , i.e. the number of parameters. And second, the fraction drops very fast, exponentially with every compressed bit. Hence, the number of simple descriptions is very low.
Therefore, if a successful search of the parameter space with appropriate bias toward simple parameter sets is to be performed, we’d rather find a way to fish out those very few simple sets out of a vast number of random ones. Here, we also note that merely reducing the number of parameters does not help much to find the simple ones: the fraction of simple ones remains constant for any number of parameters.
To get an intuition about the problem consider a parameter set in the form of a 100 bit string. Searching through all is tedious, since it is a large number. We would like to go through all strings with an algorithmic entropy less than 50 bits. Running up to 50 bit programs that print 100-bit strings is now manageable and we majorly reduce our parameter search space while giving priority to simpler parameter sets. However, there are two problems: more complex parameter sets cannot be represented that way and the 50 bit programs are themselves not ordered by their entropy. Thus, only a small fraction of them is simple and we end up searching blindly through a mostly random program space – it is the same problem as with the parameters. Now, if we represent the programs by a model with its parameters, we can proceed the same way as before: build another model on top of it that generates simple parameter sets. In essence, in order to achieve a simplicity biased search through the parameter space, it seems like a good idea to build a meta-model and a meta-meta-model and so on – an indefinite regress of models!
One may ask oneself why building a 50 bit meta-model to print 100 bit parameters of the first-level model to explain data? Why not taking simply 50 bit parameters without a meta-model which corresponds to the same entropy of the whole model pyramid? Since 100 bit parameters are simply more expressive, even though a small fraction, only sets are considered.
How could this Occam biased search be performed? Here is an outline:
- Take model with a variable set of parameters and the data set to be generated. Set .
- Set , and define a new meta-model that estimates the input:
- Use to compute an estimate of the input and use your favourite search algorithm to minimize the objective function by searching through the -dimensional parameter space ().
- If a termination criterion satisfied, then break and return the best parameter set.
- Else, set . If , go to (2), else go to (3).
This get’s the main idea across. The limit number of parameters should be chosen fairly small, pretty much as soon as there is a significant difference between simple and complex parameter sets.
Obviously, this approach requires a model that can describe itself, its own parameters. And at each step in the hierarchy has to compress its input somewhat. This approach is reminiscent of deep neural networks – where the model is simple, a sigmoid neuron – but the parameter space can be huge, which is the space where the weights live. Inputs are described by hidden neurons which are in turn described by yet other hidden neurons etc. However, note that there is a significant restriction usually in deep networks: usually the activation of hidden neurons is taken as representation/description of the input while the entropy in the weights is neglected. In order for this approach to work well, both the entropy in the hidden units and in the weights has to be taken into account.
Further note, that the meta-models will tend to be ever smaller and that the overall description length of the models is limited by the entropy of the input (there is no point in creating a hierarchy of models with higher entropy than the inputs, since it won’t generalize). Overall, the approach resembles a growing pyramid of models stacked onto each other as we go from simpler to more complex descriptions.
What can we hope to gain from that approach? Consider again the 100-bit parameter string example. What if the solution is a simple 100-bit parameter string? In such a case, the situation has three properties:
- The search space is too large to be searched through – exhaustively or any other way. That’s the curse of dimensionality.
- The solution is however contained in that space.
- The solution is algorithmically simple since it has a high prior probability of actually occurring.
If you think about it, this is a quite common situation. The classical solution to that problem was to consider narrow tasks where the search space is small enough to afford an non-Occam-biased search. Or a simple model has been picked such as linear regression, where a strong assumption of linearity allows an efficient search in a high-dimensional parameter space.
Here, I presented a way to go beyond those restrictions. Who knows, maybe a not-too-bad search would thus be possible in a Turing compete search space. It looks like this is what you have to do, in order to perform an efficient and Occam-bias-optimal induction. Interestingly, it entails models described by meta-models which are themselves described by other and yet other meta-models – in an indefinitely deep regress. Doing this in a Turing complete language requires homoiconicity – the property of being able to use programs as data, of which LISP is such a prominent example.
Does this idea finally break out of the narrow AI paradigm? I have come up with a simple test whether a system is narrow or not. Let a model have a set of parameters each taking bits to specify. Can I think of an input best fitted by a set of parameters whose description is smaller than ? Then we have found an input whose parameters can be described in a simple way albeit not in the present representation even though it allows the required entropy. For example, consider the space of 3-dimensional polynomials trying to fit points along an 6-dimensional polynomial of the form: . Of course, this is not possible. Let each parameter take integer values from -8 to +7, i.e. we need 4 bits to specify it. Then specifying a 3-dimensional polynomial takes bits. However, it takes much less information to specify that particular 6-dimensional polynomial since it is so regular. As of now, I have not met a single practical machine learning technique or model, for which a simple example could not be found that is outside the scope of the model.
How is my indefinite regress perform in that respect? If polynomials, limiting the entropy to 12 bits will include simple high-degree polynomials like the 6-dimensional above since it takes a 1-dim polynomial, a line, to specify the decline of numbers 6,5,4,3,2,1,0 and the 1-dim polynomial is easily in the scope of the 12 bit description. Thus, we seem to be holding an approach in our hands, that has the potential to be truly general and to break the curse of narrow AI.
On a more general note, the approach is reminiscent of the infinite regress of thought, where we can become aware or a thought, then become aware of that awareness and of that awareness etc. It suggests the thought provoking hypothesis that it may be the necessity of doing proper Occam-bias-optimal inference that has made awareness and self-awareness possible during the course of the history of human mind.