How I’d build an intelligent machine: a position paper

Finally, after weeks of work, the position paper is written! It contains a hopefully clear presentation of my ideas about AGI and a tentative outline of how to build it: Franz (2015), “Artificial general intelligence through recursive data compression and grounded reasoning”.

Universal approximators vs. algorithmic completeness

Finally, it has dawned on me. A problem that I had trouble conceptualizing is the following. On the one hand, for the purposes of universal induction, it is necessary to search in an algorithmically complete space. This is currently not possible because of the huge number of possible programs to search through. My mind then automatically wandered toward algorithms that don’t go through the trouble of searching through the whole search space available to them. Any parametrized machine learning algorithm, be it a neural network or linear regression, finds solutions by gradient descent or something even simpler, and none bothers to loop through all weight combinations.

The next thought was that this kind of efficient search is only possible within a fixed, algorithmically incomplete representation. However, it is well known that neural networks are universal function approximators, which sounds like algorithmic completeness, does it not? After all, Turing machines merely compute integer functions. Why then would it be so hard for neural networks to represent some structure that is not optimally suited to them? For example, a circle can be represented by the equation x^2+y^2=const, which is difficult to represent with, say, sigmoidal neurons. I attributed this difficulty to the fact that there are different reference machines after all, and some things are easy to represent with one machine but difficult with another, and vice versa.

But now I realize that there is something fundamentally different between universal function approximators and algorithmically complete languages. After all, polynomials are universal approximators, too! But have you ever tried to fit a polynomial to an exponential function? It will severely overfit and never generalize.

So, what is the difference between an algorithmically incomplete universal function approximator and a sub-optimal algorithmically complete reference machine? While the latter differs from another alg. complete machine merely by the constant length needed to encode a compiler between those machines, the former requires ever more resources as the approximation precision is increased. For example, every alg. complete language will be able to represent the exponential function exactly; some languages will require longer, some shorter expressions, but the representation will be exact. The additional length in some languages is merely due to the fact that some functions are cumbersome to define in them, but once that is done, the function is represented exactly. Conversely, a universal non-complete approximator requires ever longer representations as you increase the precision, and the length diverges in the limit of perfect precision. More precisely, if \epsilon is the maximum approximation error that you allow and N is the size of the network, then N(\epsilon) increases as \epsilon decreases. My point is that N should not depend on \epsilon. For example, if you can represent the data x perfectly in some representation with M bits, then your representation should take N = M + \mbox{const.} bits, where the constant does not depend on x (i.e. the constant is just the length of the compiler that translates between the two representations). This is just not the case for neural networks. That is also why the Taylor expansion of the exponential function is a polynomial of infinite degree: polynomials are simply not suited to represent the exponential function.
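
To make the dependence of N on \epsilon concrete, here is a minimal sketch. It assumes truncated Taylor polynomials as the approximator and the interval [0, 1] as the domain; both are my choices for illustration, and the helper names taylor_exp and min_degree are made up. A best-approximation polynomial would need somewhat lower degrees, but the degree would still grow as \epsilon shrinks.

```python
import math

def taylor_exp(x, degree):
    """Taylor polynomial of exp around 0, truncated at the given degree."""
    return sum(x**k / math.factorial(k) for k in range(degree + 1))

def min_degree(eps, grid=200):
    """Smallest degree whose worst-case error on a grid over [0, 1] is at most eps."""
    xs = [i / grid for i in range(grid + 1)]
    degree = 0
    while max(abs(math.exp(x) - taylor_exp(x, degree)) for x in xs) > eps:
        degree += 1
    return degree

# The required degree (a stand-in for the representation size N) grows as eps shrinks.
degrees = [min_degree(eps) for eps in (1e-2, 1e-4, 1e-8)]
print(degrees)
```

In an algorithmically complete language, by contrast, the expression for exp has a fixed length, whatever precision is requested.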

But why should precision matter? Why do we care about this effect? Should we not rather care about the ability to generalize? For example, imagine a family of exponentials e^{-x/d}, where d is the width of the exponential. If I fit a polynomial to it, I would need a different set of polynomial coefficients for each value of d! Therefore, polynomials don’t really capture the structure of the function. On the other hand, an alg. complete language will always have such a single parameter available, and once the representation is constructed, the whole family is covered. That’s actually the point. In order to generalize, it is necessary to recognize the sample as an example of the family that it belongs to and to parametrize that family. Neither neural networks nor polynomials can parametrize the exponential family without picking a completely different infinite (not single!) parameter set for each member. An alg. complete language, however, can do this.
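
A small numerical sketch of this point, using NumPy. The degree 5, the fitting interval [0, 2] and the test point x = 6 are arbitrary assumptions, and fit_and_extrapolate is a made-up helper name. The polynomial needs a new coefficient set for every d and extrapolates poorly, while a fit of the single parameter d covers the whole family.

```python
import numpy as np

def fit_and_extrapolate(d, degree=5, x_test=6.0):
    """Fit e^{-x/d} on [0, 2] in two ways and compare the errors at x_test."""
    x = np.linspace(0.0, 2.0, 50)
    y = np.exp(-x / d)

    # Polynomial model: degree + 1 coefficients, refitted for every d.
    coeffs = np.polyfit(x, y, degree)
    poly_err = abs(np.polyval(coeffs, x_test) - np.exp(-x_test / d))

    # One-parameter model: log y = -x/d is a line, so d follows from its slope.
    slope = np.polyfit(x, np.log(y), 1)[0]
    d_hat = -1.0 / slope
    exp_err = abs(np.exp(-x_test / d_hat) - np.exp(-x_test / d))
    return coeffs, poly_err, d_hat, exp_err

for d in (1.0, 2.0):
    coeffs, poly_err, d_hat, exp_err = fit_and_extrapolate(d)
    print(f"d={d}: polynomial error {poly_err:.3g}, one-parameter error {exp_err:.3g}")
```

Both models fit the training interval well; the difference only shows up outside of it, which is where generalization is tested.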

I think, essentially, the criterion is: if your data is drawn from a class which is parametrizable with n parameters in some Turing complete representation, then your representation should be able to express that same data with not (much) more than n parameters. Otherwise, it is “just an approximation” which may fit the data well, but will certainly overfit, since it will not capture the structure of the data. This follows from universal induction. If your data x=yz is computed by a program p, U(p)=x, then taking a longer program q that computes some prefix y of x, U(q)=yw, will hardly make q able to predict the rest of x: w will not equal z. It is well known that the shortest program computing x is the best predictor with high probability, so that every bit wasted in program length reduces your prediction abilities exponentially, due to the 2^{-l(p)} terms in the Solomonoff prior!

Another point is that algorithmic completeness may not even be necessary if the data created by Nature comes from an incomplete representation (if the world consisted only of circles, then you would only need to parametrize circles in order to represent the data and generalize from them perfectly). That this is the case is not even improbable, since physics imposes locality and criticality (power law distributed mutual information in natural data), which probably makes the data compressible by hierarchies (as reflected by my theory of incremental compression). However, this class is still quite large, and we should be able to capture its structure appropriately.

My best paper

I presented this paper at the AGI conference in New York this year.

Some theorems on incremental compression

It presents a general way of speeding up the search for short descriptions of data that is made up of features, which is what our world is made of. This may well lead to a tractable solution of the problem of universal inductive inference.

The merits of indefinite regress

The whole field of machine learning, and artificial intelligence in general, is plagued by a particular problem: the well-known curse of dimensionality. In a nutshell, this curse means that whenever we try to increase the dimension of our search spaces in order to make our models more expressive, we run into an exponentially increasing number of search points to visit until a satisfactory solution is found.

In practice, people are then forced to use smaller models with manageable parameter spaces to search through, which sets up the whole problem of “narrow AI”: we end up being able to solve merely narrowly defined, specific tasks.

We should know better, though, than to search more or less blindly through vast search spaces. Decades of practice in machine learning, our philosophical thinking, scientific theories and the theory of computer science have all demonstrated the validity of a well-known principle, Occam’s razor: “Entities must not be multiplied beyond necessity” (Non sunt multiplicanda entia sine necessitate). Especially in recent decades, Solomonoff’s theory of universal induction has shown that it is supremely important to find simple solutions to problems, i.e. that simple solutions are a priori exponentially more probable than complex ones.

Of course, we have known that before and have been sincerely trying to reduce the size of our models and the number of their parameters. We have introduced various information criteria like the Akaike Information Criterion and the Bayesian Information Criterion, and various penalty terms for model sizes. However, this is by far not good enough, as I will argue below, and I will suggest a solution.

The problem with introducing a simplicity bias comes from the fact that we don’t know how to compute the simplicity or “complexity” of a data set. Fortunately, we do have a formal definition of “complexity”: the Kolmogorov complexity, which I will call algorithmic entropy (since it is actually more related to entropy than to complexity; after all, randomness is usually not considered complex in the usual sense of the word). The algorithmic entropy of a data string x is given by

C(x)=\min\{l(p):\; U(p)=x\}

which is the length l(p) of the shortest program p that computes the data set on a universal Turing machine U. In practice, in order to know the algorithmic entropy of a data set, we’d have to find the shortest program that generates it and measure its length (the number of bits it takes to specify the program).
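
C(x) itself is not computable, but any compressor gives a computable upper bound on it, up to the constant size of the decompressor. The following minimal sketch uses Python’s zlib as such a stand-in; this is an assumption made for illustration, and a general-purpose compressor is only a crude proxy for the shortest program. The name compressed_bits is made up.

```python
import random
import zlib

def compressed_bits(data: bytes) -> int:
    """Upper bound (up to an additive constant) on the algorithmic entropy, in bits."""
    return 8 * len(zlib.compress(data, 9))

regular = b"ab" * 500                      # 1000 bytes with an obvious short description
rng = random.Random(0)                     # fixed seed so the run is repeatable
noisy = bytes(rng.getrandbits(8) for _ in range(1000))

print("regular:", compressed_bits(regular), "bits")
print("noisy:  ", compressed_bits(noisy), "bits")
```

The regular string shrinks to a small fraction of its 8000 bits, while the pseudo-random one does not shrink at all. Note that the pseudo-random string does have a short description (the seed plus the generator), which zlib cannot find; that gap is exactly why finding short programs is the hard part.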

However, finding a short description is the reason why we started this whole thing in the first place: we need the algorithmic entropy in order to evaluate the appropriateness of candidate solutions, but we need the solutions in order to compute their entropy. It is a chicken-and-egg problem. And like most chicken-and-egg problems in computer science, this one may also be solvable in an iterative way.

Consider a typical setup of a machine learning problem. We have chosen some model and want to fit its parameters to the data using some cost function. The size of the model is defined by the number of parameters which we can vary. For example, the size of a neural network can be regulated by the number of neurons involved, which affects the number of weights that need to be fitted.

Ideally, we do not want to go through the parameter space blindly but to try “simple”, low-entropy parameter settings first. For example, a convolutional layer of a neural network consists of batches of a small number of weights per neuron, where each neuron has the same set of weights. “Simple” means that the program describing the weights will be short, since it only needs to specify the small number of weights of a single neuron. By going through the simple settings first, we make sure to follow Occam’s razor, and we stop as soon as we find a solution that fits the data well.

The problem is, however, the following. For a fixed size of the parameter set, which can be represented as a binary string of a fixed length n, the fraction of simple strings is very low. After all, a simple counting argument shows that there are at most 2^{n-c+1}-1 programs of length at most n-c, i.e. programs that are able to compress a string by at least c bits. Since there are 2^n strings of length n, the fraction of strings that can be compressed by at least c bits is at most about 2^{-c+1}. There are two things to learn from this. First, this bound on the fraction of simple parameter descriptions does not depend on the length n of the parameter set, i.e. on the number of parameters. And second, the fraction drops very fast, exponentially with every compressed bit. Hence, the number of simple descriptions is very low.
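
The counting argument can be checked with exact arithmetic. This is a minimal sketch; the function name max_fraction_compressible is made up, and the bound counts all binary programs of length at most n-c, including the empty one.

```python
from fractions import Fraction

def max_fraction_compressible(n: int, c: int) -> Fraction:
    """Upper bound on the fraction of n-bit strings compressible by at least c bits."""
    programs = 2 ** (n - c + 1) - 1   # number of binary programs of length <= n - c
    strings = 2 ** n                  # number of binary strings of length n
    return Fraction(programs, strings)

for n in (20, 100, 1000):
    bound = max_fraction_compressible(n, c=10)
    print(n, float(bound))            # stays just below 2**-9 for every n
```

For c=10 the bound stays just below 2^{-9}, whether n is 20 or 1000: adding or removing parameters does not change the fraction of simple parameter sets.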

Therefore, if a successful search of the parameter space with an appropriate bias toward simple parameter sets is to be performed, we had better find a way to fish those very few simple sets out of a vast number of random ones. Here, we also note that merely reducing the number of parameters does not help much in finding the simple ones: the fraction of simple ones remains constant for any number of parameters.

To get an intuition about the problem, consider a parameter set in the form of a 100-bit string. Searching through all 2^{100} strings is infeasible, since the number is so large. Instead, we would like to go through all strings with an algorithmic entropy of less than 50 bits. Running up to 2^{50} 50-bit programs that print 100-bit strings is now manageable, and we greatly reduce our parameter search space while giving priority to simpler parameter sets. However, there are two problems: more complex parameter sets cannot be represented that way, and the 50-bit programs are themselves not ordered by their entropy. Thus, only a small fraction of them is simple and we end up searching blindly through a mostly random program space. It is the same problem as with the parameters. Now, if we represent the programs by a model with its parameters, we can proceed the same way as before: build another model on top of it that generates simple parameter sets. In essence, in order to achieve a simplicity-biased search through the parameter space, it seems like a good idea to build a meta-model and a meta-meta-model and so on: an indefinite regress of models!

One may ask: why build a 50-bit meta-model to print the 100-bit parameters of the first-level model that explains the data? Why not simply take 50-bit parameters without a meta-model, which corresponds to the same entropy as the whole model pyramid? The answer is that 100-bit parameters are simply more expressive, even though only a small fraction of them, 2^{50} sets, is considered.

How could this Occam biased search be performed? Here is an outline:

  1. Take model M with a variable set of parameters and the data set \vec{p}_0=\vec{x} to be generated. Set l=0.
  2. Set l\leftarrow l+1, n_l=1 and define a new meta-model that estimates the input: \hat{\vec{p}}_{l-1}=M(\vec{p}_l)
  3. Use \vec{p}_l to compute an estimate of the input \hat{\vec{x}}=M(\vec{p}_1)=M(M(\vec{p}_2))=M(\cdots M(\vec{p}_l)) and use your favourite search algorithm to minimize the objective function E(\vec{p}_l, \vec{x}) by searching through the n_l-dimensional parameter space (n_l=\mbox{dim}(\vec{p}_l)).
  4. If a termination criterion is satisfied, then break and return the best parameter set.
  5. Else, set n_l\leftarrow n_l+1. If n_l>N, go to (2), else go to (3).

This gets the main idea across. The limit on the number of parameters, N, should be chosen fairly small: pretty much as soon as there is a significant difference between simple and complex parameter sets.
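
The outline can be sketched in a few lines of Python. This is only an illustration under strong assumptions that are not part of the outline itself: M is taken to be polynomial evaluation at the integers 0, 1, 2, ..., every intermediate parameter vector has a fixed width W, parameters are integers from -8 to 7, the search is exhaustive, and the termination criterion is an exact reproduction of the data. The names M, unfold and occam_search are made up for this sketch.

```python
from itertools import product

def M(params, out_len):
    """Assumed model: evaluate the polynomial with coefficients `params`
    (lowest degree first) at the points 0, 1, ..., out_len - 1."""
    return [sum(c * t**k for k, c in enumerate(params)) for t in range(out_len)]

def unfold(top_params, widths):
    """Apply M repeatedly from the top level down to the data level.
    widths[0] is len(x); the remaining entries are intermediate widths."""
    p = list(top_params)
    for w in reversed(widths):
        p = M(p, w)
    return p

def occam_search(x, N=2, W=7, max_levels=3, values=range(-8, 8)):
    """Try few parameters first (n = 1..N); if that fails, add a meta-level."""
    for level in range(1, max_levels + 1):
        widths = [len(x)] + [W] * (level - 1)
        for n in range(1, N + 1):
            for cand in product(values, repeat=n):
                if unfold(cand, widths) == list(x):   # exact fit as termination criterion
                    return level, cand
    return None

# Data generated by a degree-6 polynomial whose coefficient of t^k is k.
x = M([0, 1, 2, 3, 4, 5, 6], 10)
print(occam_search(x))   # a two-parameter line at level 2 regenerates all 7 coefficients
```

With at most two parameters per level, the direct fit at level 1 fails, but at level 2 the line with offset 0 and slope 1 generates the coefficients 0, 1, ..., 6, which in turn generate the data. The whole search visits a few hundred candidates instead of the 16^7 coefficient sets of a direct search.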

Obviously, this approach requires a model that can describe itself, i.e. its own parameters. And each step in the hierarchy has to compress its input somewhat. This approach is reminiscent of deep neural networks, where the model is simple (a sigmoid neuron) but the parameter space, the space where the weights live, can be huge. Inputs are described by hidden neurons, which are in turn described by yet other hidden neurons, etc. However, note that deep networks usually come with a significant restriction: the activation of the hidden neurons is taken as the representation/description of the input, while the entropy in the weights is neglected. In order for this approach to work well, both the entropy in the hidden units and the entropy in the weights have to be taken into account.

Further note that the meta-models will tend to be ever smaller and that the overall description length of the models is limited by the entropy of the input (there is no point in creating a hierarchy of models with higher entropy than the inputs, since it won’t generalize). Overall, the approach resembles a growing pyramid of models stacked onto each other as we go from simpler to more complex descriptions.

What can we hope to gain from that approach? Consider again the 100-bit parameter string example. What if the solution is a simple 100-bit parameter string? In such a case, the situation has three properties:

  1. The search space 2^{100} is too large to be searched through – exhaustively or any other way. That’s the curse of dimensionality.
  2. The solution is however contained in that space.
  3. The solution is algorithmically simple since it has a high prior probability of actually occurring.

If you think about it, this is quite a common situation. The classical solution to that problem has been to consider narrow tasks where the search space is small enough to afford a non-Occam-biased search. Or a simple model has been picked, such as linear regression, where a strong assumption of linearity allows an efficient search in a high-dimensional parameter space.

Here, I have presented a way to go beyond those restrictions. Who knows, maybe a not-too-bad search would thus be possible in a Turing complete search space. It looks like this is what you have to do in order to perform an efficient and Occam-bias-optimal induction. Interestingly, it entails models described by meta-models which are themselves described by other and yet other meta-models, in an indefinitely deep regress. Doing this in a Turing complete language requires homoiconicity, the property of being able to use programs as data, of which LISP is a prominent example.

Does this idea finally break out of the narrow AI paradigm? I have come up with a simple test of whether a system is narrow or not. Let a model have a set of n parameters, each taking D bits to specify. Can I think of an input that is best fitted by a set of m>n parameters whose description is shorter than n\;D bits? If so, we have found an input whose parameters can be described in a simple way, albeit not in the present representation, even though that representation allows the required entropy. For example, consider the space of polynomials with 3 coefficients trying to fit points along a degree-6 polynomial of the form 6x^6+5x^5+4x^4+3x^3+2x^2+1x^1+0x^0. Of course, this is not possible. Let each coefficient take integer values from -8 to +7, i.e. we need 4 bits to specify it. Then specifying a 3-coefficient polynomial takes 3\times 4=12 bits. However, it takes much less information to specify that particular degree-6 polynomial, since it is so regular. As of now, I have not met a single practical machine learning technique or model for which a simple example outside the scope of the model could not be found.

How does my indefinite regress perform in that respect? If M is the class of polynomials, limiting the entropy to 12 bits will include simple high-degree polynomials like the degree-6 one above, since it takes a degree-1 polynomial, a line, to specify the sequence of coefficients 6,5,4,3,2,1,0, and that line is easily within the scope of the 12-bit description. Thus, we seem to be holding an approach in our hands that has the potential to be truly general and to break the curse of narrow AI.

On a more general note, the approach is reminiscent of the infinite regress of thought, where we can become aware of a thought, then become aware of that awareness, and of that awareness, etc. It suggests the thought-provoking hypothesis that it may be the necessity of doing proper Occam-bias-optimal inference that has made awareness and self-awareness possible during the course of the history of the human mind.

Posted in compression, Uncategorized | 7 Comments

Using features for the specialization of algorithms

A widespread sickness of present “narrow AI” approaches is the almost irresistible urge to set up rigid algorithms that find solutions in as large a search space as possible. This always leads to a narrow search space containing very complex solutions. As I have pointed out many times, we need an algorithm that covers a wide range of simple solutions, rather than going deep into complexity.

This approach, however, requires the ability to construct new algorithms that are appropriate for the given task on the fly, since general intelligence means being able to solve problems without knowing them in advance when the system is built. This on-the-fly construction is called specialization. I have referred to a similar process as data-dependent search space expansion. After all, if the algorithm at hand covers a narrow search space and the solution lies outside of it, the algorithm has to be modified such that its search space expands toward the solution and possibly narrows down in the opposite direction.

I suggest using features. After all, enumerating an object’s features leads to an incremental narrowing of the set of objects that still fall under the description. For example, if I say that the object is yellow, I have narrowed down the set of possible objects significantly, but there are still quite a lot of them. If I add that the object is the size of a cat, the set is reduced to cat-sized yellow objects, and so on. Similarly, the set can be expanded by releasing some feature constraints.

Imagine a high-dimensional search space in which the solution is part of a small subspace. For example, in probabilistic generative models, a constraint is often given by the data, such as X+Y=10, and a solution may be X=3, Y=7. To find the solution, it is much more efficient to parametrize the line represented by the constraint and search through the one-dimensional space rather than the two-dimensional one.

When the way we travel through the search space is not fixed, it seems like we land in the field of policy learning, as in reinforcement learning. Together with feature induction, this is very reminiscent of AIXI.

The setting is the following. Let a generative model in the form of a function f(X,Y)=X+Y be given, together with some loss function L(X,Y)=\left(f(X,Y)-10\right)^2. We can always sample from X and Y. Of course, we could now perform gradient descent or some other stupid, wannabe-general optimization algorithm. Instead, we want the system to construct its own algorithm that is well suited for this particular function. And it shall construct a different algorithm for a different function, for example g(X,Y)=\sqrt{X^2+Y^2}. The first algorithm would ideally parametrize a straight line, the second a circle. This is what is meant by specialization.
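
To show what the two specialized algorithms would look like once found, here is a hand-written sketch. It does not construct them automatically, which is the actual open problem. The loss for g with target radius 10 is my assumption, chosen by analogy with L, and the names line_point and circle_point are made up.

```python
import math

def loss_f(x, y):
    """L(X,Y) = (f(X,Y) - 10)^2 with f(X,Y) = X + Y."""
    return (x + y - 10) ** 2

def loss_g(x, y, r=10.0):
    """Assumed analogue for g(X,Y) = sqrt(X^2 + Y^2): (g(X,Y) - r)^2."""
    return (math.hypot(x, y) - r) ** 2

def line_point(t):
    """Specialized for f: one parameter t runs along the line X + Y = 10."""
    return t, 10 - t

def circle_point(theta, r=10.0):
    """Specialized for g: one parameter theta runs along the circle of radius r."""
    return r * math.cos(theta), r * math.sin(theta)

# Every value of the single parameter is already a zero-loss solution,
# so no search through the two-dimensional (X, Y) space is needed.
print(loss_f(*line_point(3.0)), loss_g(*circle_point(1.0)))
```

Each parametrization turns the two-dimensional search into a one-dimensional walk along the solution set, but the parametrization that works for f is useless for g and vice versa.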

Now, the point is not to do function inversion, e.g. the analytic or numeric construction of f^{-1} or g^{-1}, since, first, it is difficult and, second, it is generally intractable. After all, humans seem to be able to move efficiently around search spaces that are not easily representable in an analytic way. Or am I too confident about not using inverse functions? After all, they are the ones that exhibit the highest compression, and it is compression that I want to be guided by. Further, my incremental compression scheme involves the search for inverse functions. Let’s keep that question open.

Here is how features can come into play. Suppose I have found a few samples with zero or close to zero loss, for example X=2,4,9 and Y=8,6,1, which form a sequence. Then one could search for an inverse function 10-X to compute the Y values from the parameters, while having f as the feature function. However, this is a bad example: the blind search for a feature inverse is just as hard as inverting the whole function, since this problem seems to have a single feature.

This looks much more difficult than it seemed at first glance.


The physics of structure formation

The entropy in equilibrium thermodynamics is defined as S \sim \ln(\Omega), which never decreases in closed systems. It is clearly a special case of Shannon entropy H = -\sum_i p_i \ln(p_i): if the probabilities are uniform, p_i = 1/\Omega = const, then Shannon entropy reduces to \ln(\Omega), the thermodynamic entropy.
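
A quick numerical check of this special case, as a minimal sketch. The number of states \Omega=8 and the skewed example distribution are arbitrary choices, and shannon_entropy is a made-up helper name.

```python
import math

def shannon_entropy(probs):
    """H = -sum_i p_i ln(p_i), in nats; states with p_i = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

omega = 8
uniform = [1 / omega] * omega
skewed = [0.65] + [0.05] * 7      # some states are predicted with higher probability

print(shannon_entropy(uniform), math.log(omega))   # the two values agree
print(shannon_entropy(skewed))                     # lower than ln(omega)
```

The uniform distribution reproduces \ln(\Omega), and any distribution that prefers some states over others has lower entropy.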

A system can be called structured if some states are predicted with higher probability than others, which leads to lower entropy. As I have argued in a different post, the loss of the ability to predict the system is diagnosed by increasing entropy. In the extreme case, if microstates transition randomly with equal probability, the chances of getting to an unstructured state are higher than the chances of getting to a structured one.

In order to describe both the structure and the transition laws, the concept of algorithmic complexity is needed. If the microstate i is described by a set of numbers, say the speeds and positions of N particles, then this set of numbers can be written as a sequence. A Kolmogorov complexity K(i) can then be assigned to state i. Since K(i) is quite high for most sequences, starting out with low K and transitioning randomly will increase K. Therefore, the Kolmogorov complexity of the system will increase with time.
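
This tendency can be illustrated with a toy simulation. Two assumptions are mine: the microstate is a byte string that starts out all zeros, and the zlib-compressed length stands in for K, which is only a crude computable upper bound. The names proxy_complexity and random_walk are made up.

```python
import random
import zlib

def proxy_complexity(state: bytearray) -> int:
    """Compressed length in bytes, a rough upper-bound proxy for K."""
    return len(zlib.compress(bytes(state), 9))

def random_walk(n_bytes=4096, steps=4000, record_every=1000, seed=0):
    """Start from a highly regular state and apply random local transitions."""
    rng = random.Random(seed)
    state = bytearray(n_bytes)                 # all zeros: a low-complexity start
    trace = [proxy_complexity(state)]
    for step in range(1, steps + 1):
        state[rng.randrange(n_bytes)] = rng.getrandbits(8)
        if step % record_every == 0:
            trace.append(proxy_complexity(state))
    return trace

trace = random_walk()
print(trace)   # the proxy complexity ends far above its starting value
```

Random transitions almost never lead back to a compressible state, simply because compressible states are so rare.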

Interestingly, a low K allows a better (spatial) prediction of the sequence. However, one may have a model, a “law” if you wish, that governs the time evolution of the system. One need not restrict oneself to a microscopic law; the law may concern some emergent variables of the system, even macroscopic ones such as those of classical thermodynamics. The “length” of such laws is, in some sense, the Kolmogorov complexity of the system, neglecting all the micro- and mesoscopic details.

What happens when structures such as living systems or galaxies occur? The system evolves into a low complexity state. Moreover, it looks like there is scale-free structure in the universe. Why is it not possible nowadays to plug in the laws of physics and see chemistry and biology evolve? Because, even with today’s computers, things would become too complicated.


Interestingly, the minimal entropy of a system corresponds roughly to the algorithmic complexity. Vitanyi and Li write on p. 187: “the interpretation is that H(X) bits are on average sufficient to describe an outcome x. Algorithmic complexity says that an object x has complexity, or algorithmic information, C(x) equal to the minimum length of a binary program for x.”

What would be the holy grail?

The holy grail would be to extend the theory of statistical physics to phenomena that create low complexity, i.e. structures: a theory of self-organization, really. If we could state the conditions under which the complexity of a subsystem will drop below some bound, it would be a great thing. This may even become a theory of life.

The even holier grail would be to explain the emergence of intelligent life. To me, it seems sufficient to explain the emergence of subsystems that can develop compressed representations of some of their surroundings. For example, a frog that predicts the flight trajectory of a fly has achieved some degree of intelligence, since it compresses the trajectory in its “mind”, which enables prediction in the first place.

What does that mean in physical terms? What is representation? In what sense is representation “about” something else? Somehow, it is the ability to create, to unpack, our representation into an image of the represented object, which is what it means to imagine something. Even if the frog does not imagine the future trajectory, it acts as if it knew how it will continue. Essentially, successful goal-directed action is possible only if you predict, that is only if you compress. However, actions and goals are still not part of physical vocabulary.

It will turn out that in order to maintain a low complexity state, the system will have to be open and exchange energy with the environment. After all, in a closed system, entropy and therefore complexity must increase. In other words, the animal has to eat and to shit 🙂

If you extract the energy from your environment in such a way that you maintain your simple state, does it not imply intelligent action already? Don’t you have to be fairly selective about what kind of energy you take and what you reject? If a large stone flies toward you, you may want to avoid the collision: that type of energy transfer is not welcome, since it does not help to maintain a low complexity state. However, a crystal also maintains its simplicity, probably because it becomes so firm that after a while it just does not disintegrate under the influence of the environment. In any case, it does not represent anything about its surroundings. It does not react to the environment either.


From Jeremy England (2013).

Irreversible systems increase the entropy (and the heat) of the heat bath.

\mbox{Heat production} + \mbox{internal entropy change} - \mbox{irreversibility} \ge 0  (8)

If we want internal structure (dS shall be negative) and high irreversibility, then a lot of heat has to be released into the bath. Can this result be transformed into an expression in terms of algorithmic complexity? If yes, and if we figure out how to construct a system that does this, then we have figured out how to create structure.

We can also increase beta, which is done by lowering the temperature. Thus, unsurprisingly, freezing leads to structure formation. But that’s not enough for life. Freezing is also fairly irreversible. So, maybe, structure formation is not enough. What we need is structure representation! What does it mean to represent and to predict in physical terms?

Let’s say a particle travels along a straight line. If a living organism can predict it, it means that it has somehow internally found a short program that, when executed, can create the trajectory and also extend it further in time. It can compute the points and moments in the future where the particle is going to be. It is the birth of intentionality, of “aboutness”. If there is an ensemble of particles with all their microstates, how can they be “about” some other external particle?

The funny thing is, you need such representations in order to decrease entropy. After all, the more you compress, the fewer degrees of freedom remain; hence the state space is reduced and the entropy decreases. There can also be hierarchical representations within a system, which means that there is an “internal aboutness” as well. Thus the internal entropy decreases once an ensemble of particles at one level is held in a macrostate determined by a higher level ensemble! Hence, predicting the “outer” world may simply be a special case of predicting the “inner” world, your own macrostates. Thus, in order to decrease the state space in such a way, a few high level macrostates have to physically determine all the microstates at lower levels. For example, in an autoencoder the hidden layer compresses and recreates the inputs at the input layer. In the nervous system, neurons become active or inactive and therefore take up a large part of the entropy of the brain. The physical determination happens through the propagation of an electric potential through the axons and dendrites of neurons. But, especially at the beginning of life, things have to work out without a nervous system.

Instead of thinking about a practical implementation, I could think of a theoretical description, such as formula (8). I imagine all microstates of a level being partitioned, with a macrostate assigned to each subset of the partition. Those macrostates would correspond to the microstates of the level above. Now, there is a highly structured outside world, which means that the entropy is low and there are far fewer states than are theoretically possible. If you have got sensors, some part of the outer world activates them. Their probability distribution is the one to be compressed. This means that if the world has been created by running a short program, your job as a living being is to find that program. And why should that happen? It would be very cool to show that our world is made in such a way that subsystems will emerge that try to represent it and ultimately find out the way it has been made. Predicting food trajectories may be a start to doing exactly that.

So, the goal is not just to decrease internal entropy, but to do it in such a way that it represents the outer world, the entropy of which is already decreased by the laws of nature. And what does “represent” mean in that sense? In the internal sense, it means to physically determine one’s own internal states. And for the outer world, it means to have sensors such that the states of the outer world are reflected by the states of your sensors. So, we can imagine the lowest layer/level to encode the states of the outer world, or at least a part of it. And it does so in a non-compressing way; hence it is a one-to-one map of a part of the world. Can we show that under such circumstances, compression in terms of algorithmic complexity is the best thing to be done? Basically, it means that some part of the animal is driven by outer influences and therefore cannot be changed, hence contributes substantially to the entropy of the animal. In order to decrease the entropy nevertheless, the animal has to find a way to recreate those same inputs.

Now, I have to clarify, what probability distributions are meant, when we compute the entropy. In a deterministic world, probabilities are always the reflection of our – the scientists’ – lack of knowledge. Those probabilities are different from the probabilities assigned by the animal: those are the animal’s knowledge. We should treat them as the same.

A way to reduce internal entropy is to couple all remaining internal states to the sensor inputs. Which does not necessarily mean to compress. Well, it does decrease the entropy, since only the sensor entropy remains, but it does not decrease it even further! In order to decrease it even further, the sensor entropy has to depend on internal states, which should be fewer in number. They have to GENERATE the microstates of the sensors.
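To make the three cases tangible, here is a minimal numerical sketch. It assumes a toy animal with a single sensor variable and a single internal variable; the state counts and the variable names are made up for illustration. It compares the joint entropy when the internal state is independent of the sensor, when it is coupled to the sensor, and when a few internal states generate the sensor states.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(samples):
    """Shannon entropy in bits of the empirical distribution of the samples."""
    counts = Counter(samples)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

sensor_states = range(8)    # hypothetical: 8 equally likely sensor microstates
internal_states = range(4)  # hypothetical: 4 internal microstates

# Case 1: the internal state is independent of the sensor, all pairs occur.
independent = list(product(sensor_states, internal_states))
# Case 2: the internal state is coupled to (a function of) the sensor.
coupled = [(s, s % 4) for s in sensor_states]
# Case 3: two internal states generate the sensor state, so only two
# joint states ever occur (assumes the sensor input is that structured).
generated = [(g * 7, g) for g in range(2)]

h_independent = entropy(independent)  # sensor entropy plus internal entropy
h_coupled = entropy(coupled)          # only the sensor entropy remains
h_generated = entropy(generated)      # bounded by the few generating states
```

In this toy setting the joint entropy drops from 5 bits to 3 bits through coupling alone, and only the generative case gets below the sensor entropy.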


Scientific progress and incremental compression

Why is scientific progress incremental? Clearly, the construction of increasingly unified theories in physics and elsewhere is an example of incremental compression of experimental data, of the description of our world.

On the other hand, we know that the compression problem, the problem of finding the shortest program given data, is optimally solved by Levin search. However, this search procedure is not incremental, but an all-or-none search where all programs are executed in a dovetailed fashion until a program is found that generates the required sequence.
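The dovetailing schedule can be sketched in a few lines. This is only a toy, not Levin search on a universal machine: `toy_machine` is a hypothetical reference machine invented for the example, which reads its program as a binary number n and prints n zeros at one step per zero. In phase i, every program p with l(p) ≤ i gets a budget of 2^{i-l(p)} steps.

```python
from itertools import product

def toy_machine(program, max_steps):
    """Hypothetical machine: read '1'+program as binary n, print n zeros.

    Printing takes n steps; return None if the step budget is too small.
    """
    n = int("1" + program, 2)
    if n > max_steps:
        return None
    return "0" * n

def levin_search(target, max_phase=20):
    """Return (program, phase) for the first program that prints target."""
    for phase in range(1, max_phase + 1):
        for length in range(0, phase + 1):
            budget = 2 ** (phase - length)  # shorter programs get more time
            for bits in product("01", repeat=length):
                program = "".join(bits)
                if toy_machine(program, budget) == target:
                    return program, phase
    return None

result = levin_search("00000")
```

Note that nothing found in earlier phases is reused: a program either prints the whole target or contributes nothing, which is the all-or-none character mentioned above.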

One explanation could be that our compression scheme in science is not optimal. However, progress in science is fairly fast and seems to be much faster than Levin search. Imagine, for example, the present quantum field theory encoded as a bitstring, and enumerate all strings of that length. It certainly would take a gazillion years even with our fastest computers.

What resolves that paradox then? This line of thought seems to show that the data that we are typically dealing with, when we try to explain our world, is drawn from a subset of all possible sequences, a subset for which a much faster compression scheme exists than Levin search. I suspect that those are sequences with high logical depth.


Incremental compression

A problem of the incremental approach is obviously local minima in compression. Is it possible that the probability to end up in a local minimum decreases if the first compression step is large? It would be very cool, if that could be proved. What if greediness is even the optimal thing to do in this context? That would be sheerly amazing. What does “local minimum” actually mean in this context? Let’s say, we have an encoding y (with U(y)=x) whose length is between the optimal and the original: K(x)<l(y)<l(x). A local minimum would be present if you cannot compress the encoding further. But what is K(y)? Can it be random, i.e. K(y)\ge l(y)? No, after all, if the encoding is invertible, we can get y from x: y=U^{-1}(x). And since we can get x from the optimal program p with l(p)=K(x), it follows that K(y)=K(U^{-1}(x))=l(p), thus p can generate y as well. However, if only such a detour is possible, is it not what is meant by a local minimum? That the decoding path from p to x does not pass through y but instead goes via x and inverse encodings? Yes, that is exactly what is meant.

Thus, the question is whether there are suboptimal codes such that they cannot be compressed further without going back to x. Of course, abundantly. Imagine a string of n zeros x=000\ldots0 and a suboptimal code that splits it into two blocks by index 0<m<n, filling each with zeros. The optimal code takes about 1+\log(n) while the suboptimal one takes 2+\log(m)+\log(n-m). Since m is arbitrary, there is no way it can be compressed further: a truly random number can be taken around m\approx n/2 and the loss is maximal.
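A quick numerical check of these two code lengths, as a sketch that takes the formulas above at face value (logarithms base 2, all constants ignored; the function names are mine):

```python
from math import log2

def one_block_cost(n):
    """Approximate cost of the optimal code: 1 + log(n)."""
    return 1 + log2(n)

def two_block_cost(n, m):
    """Approximate cost of the split code: 2 + log(m) + log(n - m)."""
    return 2 + log2(m) + log2(n - m)

n = 1024
costs = {m: two_block_cost(n, m) for m in range(1, n)}
worst_m = max(costs, key=costs.get)        # split index with the largest cost
loss = costs[worst_m] - one_block_cost(n)  # extra bits paid for the split
```

For n=1024 the worst split is at m=512, where the split code costs 20 bits against 11 bits for the single block.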

Thus, there are plenty of local minima. And that is maybe already the hint: if a compression step is suboptimal then it looks more random than the optimal compression step. But is this really true? Can we not find examples where finding a small and a big regularity are exchangeable? There are many such examples, exactly when we talk about orthogonal features that reflect compression steps of different size. For example, if the points are on a circle of fixed radius and the angle goes from 0° to 90°, we can first define the quadrant (upper left or so) and then define the angle or vice versa. Fully interchangeable. But we are not talking about features or partial compression. One step in the hierarchy is a fully generative model of the whole data set, not of a partial aspect or feature of it.

It is somehow like this: if you make a suboptimal compression step, you could have captured more regularity, but you did not do so, thereby introducing some randomness into the code, which will have to remain uncompressed. On the other hand, consider the sequence 1,2,3,…,n,1,2,3,…,n. Representing this as two concatenated incremental functions leads to (1,1,n), (1,1,n) which takes around 2\log(n), and we can keep going. Representing it as constants (i,2) defined on i=1,\ldots,n leads in a first step to \log(n!)+n\log(2) which is much larger. But ultimately both can be compressed as successfully. We need an example that would run into a local minimum, a dead end. This “introducing randomness” concept needs clarification. Why, given a sequence 1,\ldots,n, do we introduce randomness if we split it into 1,\ldots,n-1 and n? This has actually happened in the current compressor version. Trying a random node on a sequence leads to such separations. Then we’d have to go back and try something different. There are many ways a suboptimal path could be taken.

Maybe the line of argument should be that the longer the string is the more probable it is to get into a dead end? Therefore, choose to reduce the length of the string as much as possible? The number of different partitions definitely increases heavily with the string length (Bell number). But how should the probability of a dead end be computed? Via the number of programs able to print the sequence? Via the probability to get a random number? That is a number with low compressibility? Another argument may be that the number of random numbers is much greater for longer strings, since most are random anyway. But they are not truly random, they are just dead ends. How shall dead ends be defined?
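To get a feeling for how heavily the number of partitions grows, the Bell numbers can be computed with the Bell triangle. This is a standard construction, shown here as a short sketch:

```python
def bell_numbers(count):
    """Return the first `count` Bell numbers B_0, B_1, ... via the Bell triangle."""
    row = [1]
    result = []
    for _ in range(count):
        result.append(row[0])
        # The next row starts with the last entry of the current row;
        # each further entry adds the entry above-left to its left neighbour.
        next_row = [row[-1]]
        for value in row:
            next_row.append(next_row[-1] + value)
        row = next_row
    return result

first_bell_numbers = bell_numbers(7)
```

B_n counts the ways to partition a set of n positions, so the number of candidate partitions quickly outgrows anything that can be enumerated.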

I could still be in a dead end if I partly reconstruct the original. Hence, the criterion of incompressibility without going back to the original is not appropriate. A dead end could be if the only thing you can do is to unpack at least part of the data again in order to compress things further. Hence, in order to represent a dead end, one has to break down compression into a set of programs. Let p_{0}=x be the original sequence to be compressed. Incremental compression is defined as the process of finding a list of programs p_{1},\ldots,p_{n} such that for i=1,\ldots,n:
U(p_{i})=p_{i-1}
l(p_{i})\le l(p_{i-1})

The compression is optimal if l(p_{n})=C(p_{0}). What is a dead end? If the length of some program p_{k} has to increase again temporarily. However, this temporary operation can always be subsumed into a single operation. Also, the second condition can always be made true by subsuming non-decreasing program lengths into larger blocks.

This boils down to a more basic question: why would one use all those steps in the first place? Because it seems much simpler to find a partial compression than the optimal one. Why is this so? Is it the nature of things? When doing practical compression, often one considers only a small part of the sequence and tries to find regularities there and tries to extend them to a larger part of the sequence, partitioning it on the way if necessary. Finding a compressing representation of a small part is much easier than for the whole sequence. However, then using universal induction, it is fairly probable that the sequence can be predicted at least to some extent and one gets more parts “for free”. And universal induction can predict sequences optimally! I think that something can come out of it.

The line of argument could be the following. Since it is easier to compress small subsequences of a sequence, it is reasonable to partition the sequence into such easily compressible subsets. The respective programs can then be concatenated and form a new sequence to be compressed. Doing so recursively may substantially decrease the time complexity of the compression / inversion algorithm. Let’s try some numbers. Assume a sequence x of length n to be divided into n/k subsets of length k. Finding an optimal program p_{i} for the ith subset that minimizes Levin complexity Kt takes 2^{l(p_{i})}t. Since we want to take small and simple subsets l(p_{i}) may be very small making that search tractable. Ignoring the encoding of the subset positions, we continue. The new sequence will consist of concatenated programs p_{1}\ldots p_{n/k} of length L=\sum_{i}l(p_{i}), with the time to find them being t\sum_{i}2^{l(p_{i})}. We keep going like this recursively until L=C(x). Assume that at every recursion step, the length decreases by factor \alpha. Then the number of recursion steps needed is restricted by n\alpha^{r}=C(x), thus r=\log\left(\frac{C(x)}{n}\right)/\log\left(\alpha\right). The total computation time then amounts to
T=t\sum_{j=0}^{r}\sum_{i=1}^{n\alpha^{j}/k}2^{l(p_{i})}
This can be approximated further since l(p_{i})\approx k\alpha^{j}. Thus we get
T\approx t\sum_{j=0}^{r}\frac{n\alpha^{j}}{k}2^{k\alpha^{j}}<\frac{tn}{k}2^{k}\sum_{j=0}^{r}\alpha^{j}<\frac{tn(r+1)}{k}2^{k}

Thus, it is obvious that this approach is much better since k can be picked fairly small in practice. Yeah, that’s what my intuition told me: this approach should be exponential in the size of the small problems and otherwise grow linearly with the number of recursion levels, string length and execution time. This is much, much, much faster than Levin search.
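Let’s try some actual numbers, as a rough sketch of the estimate above. The values of n, k, \alpha and C(x) are arbitrary choices for illustration, the per-step time t is set to 1, and the crude upper bound counts all r+1 levels.

```python
from math import ceil, log

def incremental_time(n, k, alpha, c, t=1.0):
    """Approximate T = t * sum_{j=0}^{r} (n alpha^j / k) * 2^(k alpha^j).

    r is the number of recursion steps needed until n * alpha^r reaches c = C(x).
    """
    r = ceil(log(c / n) / log(alpha))
    total = sum((n * alpha**j / k) * 2 ** (k * alpha**j) for j in range(r + 1))
    return t * total, r

def levin_time(c, t=1.0):
    """Cost of finding the shortest program directly: t * 2^C(x)."""
    return t * 2.0 ** c

n, k, alpha, c = 1000, 10, 0.5, 125       # hypothetical values
total, r = incremental_time(n, k, alpha, c)
upper_bound = n * (r + 1) / k * 2 ** k    # crude bound over r + 1 levels
direct = levin_time(c)
```

With these numbers the incremental estimate is on the order of 10^5 steps, while the direct search needs about 2^125 steps.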

This could be generalized fairly easily. The crux of the problem, however, is to show that such sets of shorter programs exist. In particular, it is important to show that there always exists a partition of the sequence such that each subsequence can be compressed. But for that I will need universal induction, I guess. And I have to learn the theory much more thoroughly.

Let’s collect what we may need.

Definition 4.5.3 says: Monotone machines compute partial functions \psi : \{0,1\}^{*}\rightarrow S_{B} such that for all p,q\in\{0,1\}^{*} we have that \psi(p) is a prefix of \psi(pq).

Consider a sequence z=xy, consisting of subsequences x and y. Then, the subadditive property of prefix complexity dictates
K(z)\le K(x)+K(y)+O(1)
by Example 3.1.2. However, we need a partition where we have roughly equality. But equality is only reached in Theorem 3.9.1:
K(x,y)=K(x)+K(y|x,K(x))+O(1)
It is clear that the reason is the K-complexity of information in x about y (Definition 3.9.1), which is basically the algorithmic version of mutual information, except that it is not symmetric:
I(x:y)=K(y)-K(y|x)

If it is zero, we get equality. Hence, if we want incremental compression of subsequences, we have to find partitions with minimal mutual information. But we have to take the complexity of the position sets into account as well. We may define a subsequence exactly as one minimizing mutual information between it and the rest plus the complexity of the position set on which it is defined. However, even if this leads to a subsequence not identical to the whole sequence, there is no guarantee that that kind of compression may lead to further incremental compression.
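Since K is not computable, one can only play with a crude stand-in. The sketch below uses the compressed length under zlib as a proxy for K, and C(xy)-C(x) as a proxy for the conditional complexity of y given x. Both substitutions are assumptions of this example and not part of the theory; the byte strings are generated with a fixed seed.

```python
import random
import zlib

def compressed_len(data: bytes) -> int:
    """Length of the zlib-compressed data, a rough upper-bound proxy for K."""
    return len(zlib.compress(data, 9))

def info_about(x: bytes, y: bytes) -> int:
    """Proxy for the information in x about y: C(y) - (C(xy) - C(x))."""
    return compressed_len(y) - (compressed_len(x + y) - compressed_len(x))

rng = random.Random(0)
x = bytes(rng.getrandbits(8) for _ in range(2000))
y_independent = bytes(rng.getrandbits(8) for _ in range(2000))
y_copy = x  # shares everything with x

shared_with_copy = info_about(x, y_copy)
shared_with_independent = info_about(x, y_independent)
```

For the independent part the proxy stays close to zero, so the compressed lengths roughly add up, which is the situation wanted for a good partition. For the copy, almost all of y is already contained in x.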

Could it be that the fractal and self-similar nature of the world may be exactly the sort of data that is incrementally compressible? Maybe the complexity of the world has been built up in “slices”?

Why is the math of SOC systems so horrendously complex? Maybe, because there is no short description of those phenomena?! If mathematics is a description then only simple and regular things can be described mathematically, otherwise, we don’t get our heads around it. On the other hand, there is chaos, which is used to model randomness. Why are chaotic systems good generators of pseudo-random numbers? After all, the law is often very simple, hence the true complexity is fairly low. Thus we get numbers that seem very random / complex, but are in fact very simple.

It looks like the term “complexity” is not really captured by Kolmogorov’s definition. A random number is not complex. Let’s read about “logical depth”. “Both gases and crystals are structurally trivial” (p. 589) is fairly revealing. But their Kolmogorov complexity is fairly different. It is about structure. “A deep object is something really simple but disguised by complicated manipulations of nature or computation by computer.”

That reflects my intuition. We need to restrict ourselves to the “deep” subset of strings with low algorithmic complexity. Deep and simple strings.

One way of relating incremental compression to string depth is to acknowledge that it takes time to recursively unpack a representation. There is a lot of recursive reuse of computation output.

I started to consider partitions but partitions are only one way of identifying different components of a string. For example, ICA identifies that a data set is the sum of several components. This is also a type of compression, of course. Just thinking of partitions is too restrictive. However, it could be useful for our function network approach to define the scope of the representation and to derive an expression for the time complexity of the algorithm.

It’s funny, ICA just identifies a stimulus as the sum of simple stimuli. Let’s say, we ignore that and research interval partitions first, then general partitions. What should be done is to investigate what fraction of all sequences are covered this way.

It depends on the number of levels m. If m=1 then the program p_{0} generates the output x=p_{1} directly. This corresponds to all sequences anyway. If m=2, the one intermediate level p_{1} is necessary to generate the output x=p_{2}. Here, we impose K(x)=l(p_{0})<l(p_{1})<l(p_{2})=l(x). What does that mean? Obviously, it means that the reference machine U is able to create the output x from p_{0}. In the context of output reuse, one could think of the number of times that a square of a single-tape machine has been read and rewritten before the machine halts. We can restrict that to stage-wise computation by requiring that a square is not read-written the n+1st time before all other squares have been read-written the nth time. This is how stage-wise computation can be defined.

We can define the “usage” of a square if it is read for the first time after it has been written.

Actually, the computation process of any deep string can be expressed as stage-wise computation. After all, before writing to a square the n+1st time, the machine can read the content of all squares at stage n-1 and write the very same content into them (that is, without changing them). This way, they arrive at stage n. This proves that long computation time is equivalent to a computation with many stages!

What else would I like to prove? That compressing deep sequences is much faster than Levin search.


Let’s pose the question differently. Let’s say the shortest program generating x is p. And let q be an intermediate program, l(p)<l(q)<l(x), with U(p)=q and U(q)=x. Usually, Levin search allows finding p in time 2^{l(p)}t. The crucial question is whether the existence of intermediate stage q allows finding p faster.

The intuition is as follows. Levin search is so slow, because partial progress does not help in any way to achieve further progress. This is the case since p is random being the shortest program. Therefore, after having found, say, a partial program p_{1} with p=p_{1}p_{2}, the search for p_{2} is in no way easier, since p_{1} does not contain any information about p_{2}, otherwise p would not be random.

However, what if knowing q made things faster? That would require that the knowledge of q could in any way accelerate finding p. How so? After all, knowing x does not accelerate finding p through Levin search. Levin search for q is also deadly, since 2^{l(q)}\gg2^{l(p)}. What if we split q=q_{1}q_{2} with l(q_{1})\ll l(p) and that q_{1} generates x_{1} which is part of x: x=x_{1}x_{2}, on a monotone machine. In that case, Levin search will find q_{1} fairly quickly. Now, q is not random and can be predicted given a program generating q_{1}. The correct program would be p, of course. Hmm…

How can p ever be synthesized, if it is random? It has to depend on q. If p=p_{1}p_{2} could be concatenated from two independently searchable programs, then the cost of finding p would reduce drastically. Why can it not be done on a monotone machine? Is this not always the case on a monotone machine? No. p_{2} cannot generate q_{2} unless p_{1} has been run before. This temporal contingency is mediated by the work tape. That’s why. No, the reason is different. Based on q_{1} there is no guarantee that we will find p_{1}. We will likely find a shorter program p_{1}', which is not a prefix of p. Thus, it won’t find p ever that way unless it tries all 2^{l(p)} strings.

That doesn’t help either. In my demonstrator, it does come to mind, that higher levels do get increasingly simpler, the entropy decreases. Maybe, q is easier to crack, since it is “simpler” than x? That does not make sense since q and x have got the same algorithmic complexity, which is l(p)! But in the demonstrator, the numbers tend to get smaller and the intervals narrower.

Let’s turn back to predicting q_{1}. Finding p from scratch is too difficult, since it takes 2^{l(p)} combinations. But one could find a smaller program p_{1} that not only explains q_{1} but also a part of q_{2}. That’s the whole point. The hierarchy is split up like a tree: every program part generates several parts. Therefore, the notation has to be different: x=x_{1}\ldots x_{8}, q=q_{1}\ldots q_{4} and p=p_{1}p_{2}. Of course, U(p)=q and U(q)=x. But it is also true that U(p_{i})=q_{2i-1}q_{2i}, U(q_{i})=x_{2i-1}x_{2i}. What follows? We can use x_{1}x_{2} to find q_{1}. From q_{1} we can already hypothesize about p_{1} and potentially find it. Or find a different program that extends q_{1} to q_{1}q_{2} correctly. The point is that q_{2} comes for free given p_{1}. And x_{3}x_{4} are then predicted readily. Even x_{1} could be enough to find x_{2}x_{3}x_{4}. There has to be a synergy between those layers! Just like in the hierarchical Bayesian approach. And here is the synergy. In the bottom-up direction, although x_{1}x_{2} is necessary to find q_{1}, x_{1} narrows down the set of possible programs q_{1} generating x_{1}x_{2}. In the top-down direction, given p_{1} we can generate q_{2}, even if we only know p_{1}'.

The problem is, even if x_{1} narrows down the set of possible programs q_{1}, does it really relieve us from having to loop through all 2^{l(q_{1})} combinations? No, certainly not in the beginning. And we also don’t get around looping through 2^{l(p_{1})} or just 2^{l(p_{1}')}<2^{l(p_{1})} combinations. But how does it help to find p_{2}? Or even p_{1}, given that we found p_{1}' first? It would be already very helpful if we could search for p_{2} independently. And if it does it should do so only because of the presence of the intermediate layer.


The crux of the problem is that Levin search does not “use” the information in the sequence that it tries to compress. It only loops through all programs until one of them generates the sequence. That’s the most stupid thing you can do. What the demonstrator does is to detect regularities and use them to compress the sequence incrementally. If detecting the regularities is sufficiently cheap, then this might lead to a substantial decrease in the time complexity of the algorithm. Cheapness is exactly what follows from the large number of stages. We want to have those stages exactly because each such stage transition is much cheaper than finding the final shortest program in one step.

Can we use the formalization of the P-test in order to measure the partial regularities at each stage? According to Definition 2.4.1 we require for the P-test \delta
\sum_{l(x)=n}\left\{ P(x):\delta(x)\ge m\right\} \le2^{-m}
for all n. For a uniform test, we have
d\left\{ x:\delta(x)\ge m,l(x)=n\right\} \le2^{n-m}

The statement being, for a sequence x of length n drawn randomly from a uniform distribution, any feature occurs more than m times with probability less than 2^{-m}.
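The uniform condition can be checked by brute force for small n. In the sketch below, `leading_zeros` is a hypothetical choice of \delta that I picked because it meets the bound with equality, and counting all zeros serves as a counterexample that is not a valid test.

```python
from itertools import product

def leading_zeros(x):
    """Candidate test delta: the number of zeros at the start of x."""
    return len(x) - len(x.lstrip("0"))

def satisfies_uniform_test(delta, n):
    """Check d{x : delta(x) >= m, l(x) = n} <= 2^(n-m) for every m."""
    strings = ["".join(bits) for bits in product("01", repeat=n)]
    for m in range(0, n + 2):
        count = sum(1 for x in strings if delta(x) >= m)
        if count > 2 ** (n - m):
            return False
    return True

ok_leading = satisfies_uniform_test(leading_zeros, 8)
ok_counting = satisfies_uniform_test(lambda x: x.count("0"), 8)
```

Exactly 2^{n-m} strings of length n start with m zeros, so the first function is a valid test, whereas almost every string contains at least one zero, so the second one violates the bound already at m=1.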

If we want a universal test for randomness, then we have to consider a \delta_{0} dominating all such \delta’s. However, if we fix a particular sequence x, then only a subset \{\delta_{1},\ldots,\delta_{s}\} is enough to enumerate all nonrandom features and a much smaller \delta_{0}' is enough to measure randomness. What if at each stage a different nonrandom feature is eliminated until a fully random shortest program is reached? The problem is that the \delta’s are just tests for randomness and not full-fledged representations. I have to show somehow that I can eliminate one nonrandom feature at a time.


I somehow have to formulate the intuition that it is much simpler to make an incremental compression step than to find the shortest program immediately. Why is this so anyway? Because the compression goes along the lines of the unpacking, the generating of the sequence. It is just the reverse trajectory that is passed. Returning to the previous reasoning, the reason why it is simpler is that one can take a fraction of sequence x, say x_{1}, and use it to find large parts of q, say q_{1}. It would be much harder to find any part of p, because of the intricacy of the computation. But is exhaustive universal search not what we do for small fractions like x_{1}? Is that not the process of detecting regularities? Consider a sequence x=1,5,7,11,13,17,19,20,22,25,29,34,40,\ldots. Taking differences of neighbors leads to q=4,2,4,2,4,2,1,2,3,4,5,6,\ldots, which in turn leads to p=4,2,1,1 implementing an alternation function and an incremental one, neglecting lengths. In my demonstrator, the criterion for adapting q is the decrease in entropy. The hunch is that sequences with lower entropy are “easier” to predict. But is this true? After all, the algorithmic complexity has remained the same. The word “ease” is used here in the sense of the extent to which a small part of the sequence can be used to predict large parts of it. In our case, it is quite often the case. After all, 4,2,4 is much easier to predict than 1,5,7,11. Which means that a fairly small program comes to mind quickly in the first case. Does it mean that small parts of low entropy sequences have lower complexity than equally small parts of high entropy ones?
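The two steps of this example are easy to replay. The sketch below is not the demonstrator, just a hand-written version of the example: `differences` maps x to q, `integrate` inverts it given the first entry, and `generate_q` is a hypothetical reading of p=4,2,1,1 in which the lengths of the two segments are passed in separately since the text neglects them.

```python
def differences(seq):
    """Differences of neighbouring entries: maps x to q."""
    return [b - a for a, b in zip(seq, seq[1:])]

def integrate(first, diffs):
    """Inverse of differences, given the first entry of the sequence."""
    out = [first]
    for d in diffs:
        out.append(out[-1] + d)
    return out

def generate_q(p, alternations, increments):
    """Unpack p = (a, b, start, step): alternate a, b, then count upwards."""
    a, b, start, step = p
    return [a, b] * alternations + [start + step * i for i in range(increments)]

x = [1, 5, 7, 11, 13, 17, 19, 20, 22, 25, 29, 34, 40]
q = differences(x)
q_from_p = generate_q((4, 2, 1, 1), alternations=3, increments=6)
x_from_q = integrate(x[0], q_from_p)
```

Running the chain backwards from p reproduces x exactly, so nothing is lost along the way apart from the first entry and the neglected segment lengths.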

That’s a very interesting question. If it is true, then one could set up a function f(k|x)=K(x_{1:k}) to characterize the situation. For low level sequences such as our x, f(k) would increase quickly to the complexity of the full sequence: f(k)\approx K(x)=l(p) for small k already, while for q it takes longer. What does it mean? After all, the complexity is equal: K(x)=K(q). After all, easy to predict means that the initial complexity of the prefix is low. Or does it just mean that the depth of the sequence goes down? Probably the latter. After all, the high level sequences like q are more shallow than low level ones like x. But that’s true by definition. “Easier” means that it takes less time to find p given q than given x. After all, the only way to find p from x is via q, unless one uses universal search. But why? It must be because trying to explain a small fraction of q leads to finding p more probably than the corresponding fraction of x.

I have confused something. The program generating q is of course simpler and shorter than the one generating x via q! After all, it has to encode how to unpack q after having generated it! Probably, one can just concatenate those programs. Thus, we have U(p)=q with p=p_{1}p_{2} and U(p_{1})=q_{1}, U(p_{1}p_{2})=q_{1}q_{2} on a monotone machine. p does not generate x directly. It requires an additional program that tells it to execute whatever it outputs recursively.

Let’s imagine things concretely. Let’s say we have a universal monotone machine with one-directional input and output tapes I and O, respectively, and two read-write bidirectional work tapes W_{1} and W_{2}. A program enters the input tape I and it includes a fixed subprogram s_{k} which tells the machine to execute a program recursively k times. Thus, when we write U(s_{k}p)=x, at each counter value i the machine takes the current program from W_{2} (or from I if i=1), copies it to W_{1} and treats it as input. It executes it, writes the output to W_{2} and increases the counter i. When i=k the machine executes the program on W_{1}, writes the output to O and halts. This procedure is encoded in s_{k}.

Thus, in our 2-stage case, we have U(s_{2}p)=x, while U(s_{1}p)=q and U(s_{1}q)=x. That’s the real relationship. Therefore, x=U(s_{2}p)=U(s_{1}U(s_{1}p)). And generally, U(s_{k}p)=U(s_{1}U(s_{1}\ldots U(s_{1}p))). But l(s_{k}) is fairly small and basically constant, thus the complexities of q and x are still roughly equal: K(q)\simeq K(x).

How about the following example: x=1,1,4,4,7,7,10,10,13,13,\ldots. Detecting the regularity that two neighbors are always equal makes things simpler and reduces us to q=1,4,7,10,13,\ldots plus a short program telling to copy every entry, which we are going to neglect. This is much easier to compress to p=1,3 than to do that with x directly. But should we really neglect the copying program? Is it not exactly the one making things easier? Is it not the one stripping away complexity? It is. It is exactly one of the decomposing factors of the sequence that is represented incrementally.

Let’s choose a different representation then and say that a program consists of an operator o and a parameter r, where the parameters are the part of the program being compressed further. In the above example, p=o_{q}o_{p}r_{p} consists of o_{p} that is the incremental function that uses r_{p}=1,3 to create q=o_{q}r_{q}, where o_{q} is simply copied! Then, o_{q} is the copying/constant function/operator and r_{q}=1,4,7,10,13. Together, they create x. Remarkably, the operators are probably not compressed any further but simply copied into the level above. In that case o_{q} is a part of p: p=o_{q}o_{p}r_{p}. It is the parameters r that devour most of the entropy / program length. Thus, the general rule is to compress like r_{n}\rightarrow o_{n-1}r_{n-1}, until we arrive at the highest compression level p=o_{n-1}\ldots o_{0}r_{0}. If p is decomposable in such a way, then it seems plausible that a compression algorithm tries to unravel this nested computation.
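The operator/parameter picture for this example can be written down directly. In the sketch below, `duplicate` plays the role of o_{q} and `increment` the role of o_{p}; the function names, the helper `unpack` and the fixed length n=5 are my assumptions, since the text neglects lengths.

```python
def duplicate(r):
    """o_q: copy every entry of the parameter list twice."""
    return [value for value in r for _ in range(2)]

def increment(r, n=5):
    """o_p: incremental function, r = (start, step); n is the neglected length."""
    start, step = r
    return [start + step * i for i in range(n)]

def unpack(operators, r):
    """Evaluate p = o_{n-1} ... o_0 r_0 by applying the rightmost operator first."""
    for operator in reversed(operators):
        r = operator(r)
    return r

r_p = (1, 3)
q = unpack([increment], r_p)             # intermediate level r_q
x = unpack([duplicate, increment], r_p)  # p = o_q o_p r_p
```

Only the parameters change from level to level, while the operators are carried along unchanged, which is the sense in which o_{q} is simply copied into p.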

The idea was to somehow argue that a single compression step r_{n}\rightarrow o_{n-1}r_{n-1} is “easier” in terms of time complexity than finding the shortest program p in a brute-force way, which requires 2^{l(p)}t time steps. Could it mean that it is easier to find features of the data than a whole feature basis? Maybe that is the relation to features. Are features the correct term? Usually, those mean partial descriptions of data. I really mean full bases / descriptions stacked on top of each other.

An important aspect of the demonstrator is that, in the previous notation, q_{1} is inferred from x_{1} together with a function \psi computing q_{1} from x_{1}. Given that, all the rest of q can be computed from x. That’s what makes things easy in uncovering the intermediate layer. And the only criterion that I have for such a function is that it creates a shorter description than the original one: \psi(x)=q, with l(q)<l(x). Therefore, already fairly simple functions can do the job. Assume we can find q_{1} through Levin search such that U(q_{1})=x_{1} and invert the computation to get U^{-1}(x_{1})=\psi(x_{1})=q_{1}, with q_{1} being sufficiently short, even much shorter than l(p), so that this is possible fairly quickly. Then apply this function to other parts of x, if possible. This will lead to a shorter description q. If that’s possible, then we can be essentially as fast as n2^{l(q_{1})}t or so, where n is the number of levels.

Li and Vitányi write on page 403 “we identify a property of elements of S with a subset of S consisting of all elements having the property”. That means that if A\subset S is such a property, then the statement that x has property A simply means that x\in A. Of course, the description length is immediately bounded not just by \log\; d(S), but now by \log\; d(A). The subsequent discovery of such properties with each level means that x has many properties A_{i} and is in their intersection: x\in\cap_{i}A_{i}. Typically, the size of each such subset is exponentially smaller than the size of S. Hence, the central question is how quickly we can find a property. Let b_{i} be defined by d(A_{i})=d(S)2^{-b_{i}}, which basically means that l(q_{i})=l(x)-b_{i}. The idea is that a small part of x can be used to find A_{i}. Why so? Because it is improbable that a property holding for x_{1} does not hold for any other part of x. The intriguing part is that those other parts of q can be computed directly from those other parts of x without reverting to p! Why is it possible? But there is no guarantee that it is possible. What one can try is to extend the current explanation as far as possible and, for the rest of the sequence, start afresh. I think the significant compression is achieved by being able to extend the current little explanation q_{1} to other parts of x, like x_{2}, such that q_{1} can generate x_{1}x_{2}, while having been found using x_{1} only. Why is this possible? Or probable? Well, because of universal induction. If some compression is achieved, chances are that it is predictive. Yes.
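A tiny enumeration makes the bookkeeping with properties concrete. In this sketch S is the set of binary strings of length 10, and the two properties `pairs_equal` and `starts_with_11` are hypothetical choices of A_{1} and A_{2} made up for the example.

```python
from itertools import product
from math import log2

n = 10
S = ["".join(bits) for bits in product("01", repeat=n)]

def pairs_equal(x):
    """Hypothetical property A_1: every entry is followed by a copy of itself."""
    return all(x[i] == x[i + 1] for i in range(0, len(x), 2))

def starts_with_11(x):
    """Hypothetical property A_2: the string starts with two ones."""
    return x.startswith("11")

def bits_saved(prop):
    """b_i defined by d(A_i) = d(S) * 2^(-b_i)."""
    size = sum(1 for x in S if prop(x))
    return log2(len(S)) - log2(size)

b_1 = bits_saved(pairs_equal)
b_2 = bits_saved(starts_with_11)
intersection = [x for x in S if pairs_equal(x) and starts_with_11(x)]
remaining_bits = log2(len(intersection))  # bound on the description length
```

Here the first property saves 5 bits, the second saves 2 bits, and a string in the intersection needs only 4 bits. The savings do not simply add up because the two properties overlap.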

And why is this not possible directly with p? By definition. Since x is deep, the only way to generate it from p is through q or other intermediate stages! We could, for a start, restrict the set of sequences to hierarchies, with U(p_{i}^{l})=q_{i}^{l+1} and p^{l}=p_{1}^{l}\ldots p_{n_{l}}^{l}=q_{1}^{l}\ldots q_{m_{l}}^{l} and l(p^{l})<l(p^{l+1}), thus rendering the different parts of a sequence independent. Notice that the p’s are mapped to q’s which may be different partitions than those used for further computation. Hence, those p_{i}^{l} will have to be looked for independently, taking up 2^{l(p_{i}^{l})}. In total for a level, we get \sum_{i}2^{l(p_{i}^{l})}, which is much, much less than 2^{l(p^{l})}. However, this is a trivial result, given the restriction. Hence, I get just what I have put in. Damn it.
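Trivial or not, the size of the gap is worth seeing once. A short sketch with made-up part lengths:

```python
def independent_cost(lengths):
    """Searching each part on its own: sum_i 2^{l(p_i)}."""
    return sum(2 ** length for length in lengths)

def joint_cost(lengths):
    """Searching the concatenation as a whole: 2^{l(p)}, l(p) = sum_i l(p_i)."""
    return 2 ** sum(lengths)

lengths = [10, 10, 10, 10]  # hypothetical lengths of four independent parts
separate = independent_cost(lengths)
together = joint_cost(lengths)
```

Four independent parts of 10 bits each cost 4096 program trials when searched separately, against 2^40 when searched as one program of 40 bits.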

Maybe I can just learn from it that if independent parts of a sequence occur, then things become easy very quickly.

It has to be related to those damn feature bases as a direct mapping from the data x to a slightly compressed description q, the parameters of which can be compressed further.
