## Scientific progress and incremental compression

Why is scientific progress incremental? Clearly, the construction of increasingly unified theories in physics and elsewhere is an example incremental compression of experimental data, of the description of our world.

On the other hand, we know that the compression problem, the problem of finding the shortest program given data is optimally solved by Levin search. However, this search procedure is not incremental, but an all-or-none search where all programs are executed in a dovetailed fashion until a programs is found that generates the required sequence.

One explanation could be that our compression scheme in science is not optimal. However, progress in science is fairly fast and seems to be much faster than Levin search. Imagine, for example, the present quantum field theory incoded as a bitstring, and permutate all strings of that length. It certainly would take a gazillion years even with our fastest comupters.

What resolves that paradox then? This line of thought seem to show that the data that we are typically dealing with, when we try to explain our world, is drawn from a subset of all possible sequences, a subset, for which a much faster compression scheme exists than Levin search. I suspect that those are sequences with high logical depth.

## Incremental compression

A problem of the incremental approach is obviously local minima in compression. Is it possible that the probability to end up in a local minimum decreases if the first compression step is large? It would be very cool, if that could be proved. What if greediness is even the optimal thing to do in this context? That would be sheerly amazing. What does “local minimum” actually mean in this context? Let’s say, we have an encoding $y$ (with $U(y)=x)$ those length is between the optimal and the original: $K(x). A local minimum would be present if you can not compress the encoding further. But what is $K(y)$? Can it be random, i.e. $K(y)\ge l(y)$? No, after all, if the encoding is invertible, we can get $y$ from $x:$$y=U^{-1}(x)$. And since we can get $x$ from the optimal program $p$ with $l(p)=K(x)$, it follows that $K(y)=K(U^{-1}(x))=l(p)$, thus $p$ can generate $y$ as well. However, if only such a crook is possible, is it not what is mean by a local minimum? That the decoding path from $p$ to $x$ does not pass through $y$ but instead goes via $x$ and inverse encodings? Yes, that is exactly what is meant.

Thus, the question is whether there are suboptimal codes such that they cannot be compressed further without going back to $x.$ Of course, abundantly. Imagine a string of $n$ zeros $x=000\ldots0$ and a suboptimal code that splits it into two blocks by index $0, filling each with zeros. The optimal code takes about $1+\log(n)$ while the suboptimal one takes $2+\log(m)+\log(n-m)$. Since $m$ is arbitrary, there is not way it can be compressed further: a truly random number can be taken around $m\approx n/2$ and the loss is maximal.

Thus, there are plenty of local minima. And that is maybe already the hint: if a compression step is suboptimal then it looks more random than the optimal compression step. But is this really true? Can we not find examples where finding a small and a big regularity are exchangeable? There are many such examples, exactly when we talk about orthogonal features that reflect compression steps of different size. For example, if the points are on a circle of fixed radius and the angle goes from 0° to 90°, we can first define the quadrant (upper left or so) and the define the angle or vice versa. Fully interchangeable. But we are not talking about features or partial compression. One step in the hierarchy is a fully generative model of the whole data set, not of a partial aspect or feature of it.

It is somehow like this: if you make a suboptimal compression step, you could have captured more regularity, but you did not do so, thereby introducing some randomness in our code, which will have to remain uncompressed. On the other hand, consider the sequence 1,2,3,…,n,1,2,3,…,n. Representing this as two concatenated incremental functions leads to (1,1,n), (1,1,n) which takes around $2\log(n)$, and we can keep going. Representing it as constants $(i,2)$ defined on $i=1,\ldots,n$ leads in a first step to $\log(n!)+n\log(2)$ which is much larger. But ultimately both can be compressed as successfully. We need an example that would run into a local minimum, a dead end. This “introducing randomness” concept needs clarification. Why given a sequence $1,\ldots,n$ we introduce randomness if we split it into $1,\ldots,n-1$ and $n$? This has actually happened in the current compressor version. Trying a random node on a sequence leads to such separations. Then we’d have to go back and try something different. There are many ways, a suboptimal path could be taken.

Maybe the line of argument should be that the longer the string is the more probable it is to get into a dead end? Therefore, choose to reduce the length of the string as much as possible? The number of different partitions definitely increases heavily with the string length (Bell number). But how should the probability of a dead end be computed? Via the number of programs able to print the sequence? Via the probability to get a random number? That is a number with low compressibility? Another argument may be that the number of random numbers is much greater for longer strings, since most are random anyway. But they are not truly random, they are just dead ends. How shall dead ends be defined?

I could still be in a dead end if I partly reconstruct the original. Hence, the criterion of incompressibility without going back to the original is not appropriate. A dead end could be if the only thing you can do is to unpack at least part of the data again in order to compress things further. Hence, in order to represent dead end, one has to break down compression into a set of programs. Let $p_{0}=x$ be the original sequence to be compressed. Incremental compression is defined as the process of finding a list of programs $p_{1},\ldots,p_{n}$ such that for $i=1,\ldots,n$:
$U(p_{i})=p_{i-1}$
and
$l(p_{i})\le l(p_{i-1})$

The compression is optimal if $l(p_{n})=C(p_{0})$. What is a dead end? If the length of some program $p_{k}$ has to increase again temporarily. However, this temporary operation can always be subsumed into a single operation. Also the second condition can be always made true by subsuming non-decreasing program lengths into larger blocks.

This boils down into a more basic question: why would one use all those steps in the first place? Because it seems much simpler to find partial compression than the optimal one. Why is this so? Is it the nature of things? When doing practical compression, often one considers only a small part of the sequence and tries to find regularities there and tries to extend them to a larger part of the sequence, partitioning it on the way if necessary. Finding a compressing representation of a small part is much easier than for the whole sequence. However, then using universal induction, it is fairly probable that the sequence can be predicted at least to some extent and one gets more parts “for free”. And universal induction can predict sequences optimally! I think that something can come out of it.

The line of argument could be the following. Since it is easier to compress small subsequences of a sequence, it is reasonable to partition the sequence into such easily compressible subsets. The respective programs can then be concatenated and form a new sequence to be compressed. Doing so recursively may substantially decrease the time complexity of the compression / inversion algorithm. Let’s try some numbers. Assume a sequence $x$ of length $n$ to be divided into $n/k$ subsets of length $k$. Finding an optimal program $p_{i}$ for the $i$th subset that minimizes Levin complexity $Kt$ takes $2^{l(p_{i})}t$. Since we want to take small and simple subsets $l(p_{i})$ may be very small making that search tractable. Ignoring the encoding of the subset positions, we continue. The new sequence will consist of concatenated programs $p_{1}\ldots p_{n/k}$ of length $L=\sum_{i}l(p_{i})$, with the time to find them being $t\sum_{i}2^{l(p_{i})}$. We keep going like this recursively until $L=C(x)$. Assume that at every recursion step, the length decreases by factor $\alpha$. Then the number of recursion steps needed is restricted by $n\alpha^{r}=C(x)$, thus $r=\log\left(\frac{C(x)}{n}\right)/\log\left(\alpha\right)$. The total computation time then amounts to
$T=t\sum_{j=0}^{r}\sum_{i=1}^{n\alpha^{j}/k}2^{l(p_{i})}$
This can be approximates further since $l(p_{i})\approx k\alpha^{j}$. Thus we get
$T\approx t\sum_{j=0}^{r}\frac{n\alpha^{j}}{k}2^{k\alpha^{j}}<\frac{tn}{k}2^{k}\sum_{j}^{r}\alpha^{j}<\frac{tnr}{k}2^{k}$

Thus, it is obvious that this approach is much better since $k$ can be picked fairly small in practice. Yeah, that’s what my intuition told me: this approach should be exponential in the size of the small problems and otherwise grow linearly with the number of recursion levels, string length and execution time. This is much, much, much faster than Levin search.

This could be generalized fairly easily. The crux of the problem however, is to show that such shorter sets programs exist. In particular it is important to show that there always exists a partition of the sequence such that each subsequence can be compressed. But for that I will need universal induction, I guess. And I have to learn the theory much more thoroughly.

#### Let’s collect what we may need.

Definition 4.5.3 says: Monotone machines compute partial functions $\psi$ : $\{0,1\}^{*}\rightarrow S_{B}$ such that for all $p,q\in\{0,1\}^{*}$ we have that $\psi(p)$ is a prefix of $\psi(pq)$.

Consider a sequence $z=xy$, consisting of subsequences $x$ and $y$. Then, the subadditive property of prefix complexity dictates
$K(z)\le K(x)+K(y)+O(1)$
by Example 3.1.2. However, we need a partition where we have roughly equality. But equality is only reached in Theorem 3.9.1:
$K(z)=K(x)+K(y|x,K(x))+O(1)$
It is clear that the reason is the K-complexity of information in $x$ about $y$ (Definition 3.9.1), which is basically the algorithmic version of mutual information, except that it is not symmetric:
$I(x:y)=K(y)-K(y|x)$

If it is zero, we get equality. Hence, if we want incremental compression of subsequences, we have to find partitions with minimal mutual information. But we have to take the complexity of the position sets into account as well. We may define a subsequence exactly as one minimizing mutual information between it and the rest plus the complexity of the position set on which it is defined. However, even if this leads to a subsequence not identical to the whole sequence, there is no guarantee that that kind of compression may lead to further incremental compression.

Could it be that the fractal and self-similar nature of the world may be exactly the sort of data that is incrementally compressible? Maybe the complexity of the world has been built up in “slices”?

Why is the math of SOC systems so horrendously complex? Maybe, because there is no short description of those phenomena?! If mathematics is a description then only simple and regular things can be described mathematically, otherwise, we don’t get our heads around it. On the other hand, there is chaos, which is used to model randomness. Why are chaotic systems good generators of pseudo-random numbers? After all, the law is often very simple, hence the true complexity is fairly low. Thus we get numbers that seems very random / complex, but are in fact very simple.

It looks like the term “complexity” is not really captured by Kolmogorov’s definition. A random number is not complex. Let’s read about “logical depth”. “Both gases and crystals are structurally trivial” (p. 589) is fairly revealing. But their Kolmogorov complexity is fairly different. It is about structure. “A deep object is something really simple but disguised by complicated manipulations of nature or computation by computer.”

That reflects my intuition. We need to restrict ourselves to the “deep” subset of strings with low algorithmic complexity. Deep and simple strings.

One way of relating incremental compression to string depth is to acknowledge that it takes time to recursively unpack a representation. There is a lot of recursive reuse of computation output.

I started to consider partitions but partitions are only one way of identifying different components of a string. For example, ICA identifies that a data set is the sum of several components. This is also a type of compression, of course. Just thinking of partitions is too restrictive. However, it could be useful for our function network approach to define the scope of the representation and to derive an expression for the time complexity of the algorithm.

It’s funny, ICA just identifies a stimulus as the sum of simple stimuli. Let’s say, we ignore that and research interval partitions first, then general partitions. What should be done is to investigate what fraction of all sequences are covered this way.

It depends on the number of levels $m$. If $m=1$ then the program $p_{0}$ generates the output $x=p_{1}$ directly. This corresponds to all sequences anyway. If $m=2$, the one intermediate level $p_{1}$ is necessary to generate the output $x=p_{2}$. Here, we impose $K(x)=l(p_{0}). What does that mean? Obviously, it means that the reference machine $U$ is able to create the output $x$ from $p_{0}$. In the context of output reuse, one could think of the number of times that a square of a single-tape machine has been read and rewritten before the machine halts. We can restrict that to stage-wise computation by requiring that a square is not read-written the $n+1$st time before all other square have not been read-written the $n$th time. This is how stage-wise computation can be defined.

We can define the “usage” of a square if it is read for the first time after it has been written.

Actually, the computation process of any deep string can be expressed as stage-wise computation. After all, if before writing to a square the $n+1$st time, read the content of all square at stage $n-1$ and write the very same content into them (that is without changing them). This way, they arrive at stage $n$. This proves that long computation time is equivalent to a computation with many stages!

What else would I like to prove? That compressing deep sequences is much faster than Levin search.

***

Let’s pose the question differently. Let’s say the shortest program generating $x$ is $p$. And let $q$ be an intermediate program, $l(p), with $U(p)=q$ and $U(q)=x$. Usually, Levin search allows finding $p$ in time $2^{l(p)}t$. The crucial question is whether the existence of intermediate stage $q$ allows finding $p$ faster.

The intuition is as follows. Levin search is so slow, because partial progress does not help in any way to achieve further progress. This is the case since $p$ is random being the shortest program. Therefore, after having found, say, a partial program $p_{1}$ with $p=p_{1}p_{2}$, the search for $p_{2}$ is in no way easier, since $p_{1}$ does not contain any information about $p_{2}$, otherwise $p$ would not be random.

However, what if knowing $q$ made things faster? That would require that the knowledge of $q$ could in any way accelerate finding $p$. How so? After all, knowing $x$ does not accelerate finding $p$ through Levin search. Levin search for $q$ is also deadly, since $2^{l(q)}\gg2^{l(p)}$. What if we split $q=q_{1}q_{2}$ with $l(q_{1})\ll l(p)$ and that $q_{1}$ generates $x_{1}$ which is part of $x$: $x=x_{1}x_{2}$, on a monotone machine. In that case, Levin search will find $q_{1}$ fairly quickly. Now, $q$ is not random and can be predicted given a program generating $q_{1}$. The correct program would be $p$, of course. Hmm…

How can $p$ ever be synthesized, if it is random? It has to depends on $q$. If $p=p_{1}p_{2}$ could be concatenated from two independently searchable programs, then the cost of finding $p$ would reduce drastically. Why can it not be done on a monotone machine? Is this not always the case on a monotone machine? No. $p_{2}$ can not generate $q_{2}$ unless $p_{1}$ has been run before. This temporal contingency is mediated by the work tape. That’s why. No, the reason is different. Based on $q_{1}$ there is no guarantee that we will find $p_{1}$. We will likely find a shorter program $p_{1}'$, which is not a prefix of $p$. Thus, it won’t find $p$ ever that way unless it tries all $2^{l(p)}$ strings.

That doesn’t help either. In my demonstrator, it does come to mind, that higher levels do get increasingly simpler, the entropy decreases. Maybe, $q$ is easier to crack, since it is “simpler” than $x$? That does not make sense since $q$ and $x$ have got the same algorithmic complexity, which is $l(p)$! But in the demonstrator, the numbers tend to get smaller and the intervals narrower.

Let’s turn back to predicting $q_{1}$. Finding $p$ from scratch is too difficult, since it takes $2^{l(p)}$ combinations. But one could find a smaller program $p_{1}$ that not only explains $q_{1}$ but also a part of $q_{2}$. That’s the whole point. The hierarchy is split up like a tree: every program part generates several parts. Therefore, the notation has to be different: $x=x_{1}\ldots x_{8}$, $q=q_{1}\ldots q_{4}$ and $p=p_{1}p_{2}$. Of course, $U(p)=q$ and $U(q)=x$. But it is also true that $U(p_{i})=q_{2i-1}q_{2i}$, $U(q_{i})=x_{2i-1}x_{2i}$. What follows? We can use $x_{1}x_{2}$ to find $q_{1}$. From $q_{1}$ we can already hypothesized about $p_{1}$ and potentially find it. Or find a different program that extends $q_{1}$ to $q_{1}q_{2}$ correctly. The point is, that $q_{2}$ comes for free given $p_{1}$. And $x_{3}x_{4}$ are then predicted readily. Even $x_{1}$ could be enough to find $x_{2}x_{3}x_{4}$. There has to be a synergy between those layers! Just like in the hierarchical Bayesian approach. And here is the synergy. In the bottom-up direction, although $x_{1}x_{2}$ is necessary to find $q_{1}$, $x_{1}$ narrows down the set of possible programs $q_{1}$ generating $x_{1}x_{2}$. In the top-down direction, given $p_{1}$ we can generate $q_{2}$, even if we only know $p_{1}'$.

The problem is, even if $x_{1}$ narrows down the set of possible programs $q_{1}$, does it really relieve us from having to loop through all $2^{l(q_{1})}$ combinations? No, certainly not in the beginning. And we also don’t get around looping through $2^{l(p_{1})}$ or just $2^{l(p_{1}')}<2^{l(p_{1})}$ combinations. But how does it help to find $p_{2}$? Or even $p_{1}$, given that we found $p_{1}'$ first? It would be already very helpful if we could search for $p_{2}$ independently. And if it does it should do so only because of the presence of the intermediate layer.

***

The crux of the problem is that Levin search does not “use” the information in the sequence that it tries to compress. It only loops through all programs until one of it generates the sequence. That’s the most stupid thing you can do. What the demonstrator does is to detect regularities and uses the to compress the sequence incrementally. If detecting the regularities is sufficiently cheap, then this might lead to a substantial decrease in the time complexity of the algorithm. Cheapness is exactly what follows from the large amount of stages. We want to have those stages exactly because each such stage transition is much cheaper than finding the final shortest program at one step.

Can we use the formalization of the $P$-test in order to measure the partial regularities at each stage? According to Definition 2.4.1 we require for the $P$-test $\delta$
$\sum_{l(x)=n}\left\{ P(x):\delta(x)\ge m\right\} \le2^{-m}$
for all $n$. For a uniform test, we have
$latex d\left\{ x:\delta(x)\ge m,l(x)=n\right\} \le2^{n-m}$

The statement being, for a sequence $x$ of length $n$ drawn randomly from a uniform distribution, any feature occurs more than $m$ times with probability less than $2^{-m}$.

If we want a universal test for randomness, then we have to consider a $\delta_{0}$ dominating \textit{all} such $\delta$‘s. However, if we fix a particular sequence $x$, then only a subset $\{\delta_{1},\ldots,\delta_{\text{s}}\}$ is enough to enumerate all nonrandom features and a much smaller $\delta_{0}'$ is enough to measure randomness. What if at each stage a different nonrandom feature is eliminated until a fully random shortest program is reached? The problem is that the $\delta$‘s are just tests for randomness and not full fledged representations. I have to show somehow that I can eliminate one nonrandom feature at a time.

***

I somehow have to formulate the intuition that it is much simpler to make an incremental compression step than to find the shortest program immediately. Why is this so anyway? Because the compression goes along the lines of the unpacking, generating of the sequence. It is just the reverse trajectory that is passed. Returning to the previous reasoning, the reason why it is simpler, is that one can take a fraction of sequence $x$, say $x_{1}$ and use it to find large parts of $q$, say $q_{1}$. It would be much harder to find any part of $p$, because of the intricacy of the computation. But is exhaustive universal search not what we do for small fractions like $x_{1}$? Is that not the process of detecting regularities? Consider a sequence $x=1,5,7,11,13,17,19,20,22,25,29,34,40,\ldots$. Taking differences of neighbors leads to $q=4,2,4,2,4,2,1,2,3,4,5,6,\ldots$, which in turn leads to $p=4,2,1,1$ implementing an alternation function and a incremental one, neglecting lengths. In my demonstrator, the criterion for adapting $q$ is the decrease in entropy. The hunch is that sequences with lower entropy are “easier” to predict. But is this true? After all, the algorithmic complexity has remained the same. The word “ease” is used here in the sense of the extent to which a small part of the sequence can be used to predict large parts of it. In our case, it is quite often the case. After all, $4,2,4$ is much easier to predict than $1,5,7,11$. Which means that a fairly small program comes to mind quickly in the first case. Does it mean that small parts of low entropy sequences have lower complexity than equally small parts of high entropy ones?

That’s a very interesting question. If it is true, then one could set up a function $f(k|x)=K(x_{1:k})$ to characterize the situation. For low level sequences such as our $x$, $f(k)$ would increase quickly to the complexity of the full sequence: $f(k)\approx K(x)=l(p)$ for small $k$ already, while for $q$ it takes longer. What does it mean? After all, the complexity is equal: $K(x)=K(q)$. After all, easy to predict means that the initial complexity of the prefix is low. Or does it just mean that the depth of the sequence goes down? Probably the latter. After all, the high level sequences like $q$ are more shallow than low level ones like $x$. But that’s true by definition. “Easier” means that it takes less time to find $p$ given $q$ than given $x$. After all, the only way to find $p$ from $x$ is via $q$, unless one uses universal search. But why? It must be because trying to explain a small fraction of $q$ leads to finding $p$ more probably than the corresponding fraction of $x$.

I have confused something. The program generating $q$ is of course simpler and shorter than the one generating $x$ via $q$! After all, it has to encode how to unpack $q$ after having generated it! Probably, one can just concatenate those programs. Thus, we have $U(p)=q$ with $p=p_{1}p_{2}$ and $U(p_{1})=q_{1}$, $U(p_{1}p_{2})=q_{1}q_{2}$ on a monotone machine. $p$ does not generate $x$ directly. It requires an additional program that tells it to execute whatever it outputs recursively.

Let’s imagine things concretely. Let’s say we have a universal monotone machine with one-directional, input and output tapes $I$ and $O$, respectively, and to read and write bi-directional work tapes $W_{1}$ and $W_{2}$. A program enters the input tape $I$ and it includes a fixed subprogram $s_{k}$ which tells the machine to execute a program recursively $k$ times. Thus, when we write $U(s_{k}p)=x$, at each counter value $i$ the machine takes the current program from $W_{2}$ (or from $I$ if $i=1$), copies it to $W_{1}$ and treats it as input. It executes it, writes it to $W_{2}$ and increases the counter $i$. When $i=k$ the machine executes the program on $W_{1}$, writes it to $O$ and halts. This procedure is encoded in $s_{k}$.

Thus, in our 2-stage case, we have $U(s_{2}p)=x$, while $U(s_{1}p)=q$ and $U(s_{1}q)=x$. That’s the real relationship. Therefore, $x=U(s_{2}p)=U(s_{1}U(s_{1}p))$. And generally, $U(s_{k}p)=U(s_{1}U(s_{1}\ldots U(s_{1}p)))$. But $l(s_{k})$ is fairly small and basically constant, thus the complexities of are still roughly equal: $K(q)\simeq K(x)$.

How about the following example: $x=1,1,4,4,7,7,10,10,13,13,\ldots$. Detecting the regularity that two neighbors are always equal makes things simpler and reduces us to $q=1,4,7,10,13,\ldots$ plus a short program telling to copy every entry, which we are going to neglect. This is much easier to compress to $p=1,3$ than to do that with $x$ directly. But should we really neglect the copying program? Is it not exactly the one making things easier? Is it not the one striping away complexity? It is. It is exactly one of the decomposing factors of the sequence that is represented incrementally.

Let’s chose a different representation then and say that a program consists of an operator $o$ and a parameter $r$, where the parameters are the part of the program being compressed further. In the above example, $p=o_{q}o_{p}r_{p}$ consists of $o_{p}$ that is the incremental function that uses $r_{p}=1,3$ to create $q=o_{q}r_{q}$, where $o_{q}$ is simply copied! Then, $o_{q}$ is the copying/constant function/operator and $r_{q}=1,4,7,10,13$. Together, they create $x$. Remarkably, the operators are probably not compressed any simply copied into the level above. In that case $o_{q}$ is a part of $p$: $p=o_{q}o_{p}r_{p}$. It is the parameters $r$ that devour most of the entropy / program length. Thus, the general rule is to compress like $r_{n}\rightarrow o_{n-1}r_{n-1}$, until we arrive at the highest compression level $p=o_{n-1}\ldots o_{0}r_{0}$. If $p$ is decomposable in such a way, then it seems plausible that a compression algorithm tries to unravel this nested computation.

The idea was to somehow argue that a single compression step $r_{n}\rightarrow o_{n-1}r_{n-1}$

is “easier” in terms of time complexity that to find the shortest program $p$ in a brute force way, requiring $2^{l(p)}t$ time steps. Could it mean that it is easier to find features of the data than a whole feature basis? Maybe that is the relation to features. Are features the correct term? Usually, those mean partial descriptions of data. I really mean full bases/ descriptions stacked on top of each other.

An important aspect of the demonstrator is that, in the previous notation, $q_{1}$ is inferred from $x_{1}$ together with a function $\psi$ computing $q_{1}$ from $x_{1}$. Given that, all the rest of $q$ can be computed from $x$. That’s what makes things easy in uncovering the intermediate layer. And the only criterion that I have for such a function is that it create a shorter description than the original one: $\psi(x)=q$, with $l(q). Therefore, already fairly simple functions can do the job. Assume, we can find $q_{1}$ through Levin search such that $U(q_{1})=x_{1}$ and invert the computation to get $U^{-1}(x_{1})=\psi(x_{1})=q_{1}$, with $q_{1}$ being sufficiently short, even much shorter than $l(p)$, that it is possible fairly quickly. Then apply this function to other parts of $x$, if possible. This will lead to a shorter description $q$. If that’s possible, then we can be essentially as fast as $n2^{l(q_{1})}t$ or so, where $n$ is the number of levels.

Li and Vitányi write on page 403 “we identify a property of elements of $S$ with a subset of $S$ consisting of all elements having the property”. That means that if $A\subset S$ is such a property, then the statement that $x$ has property $A$ simply means that $x\in A$. Of course, the description length is immediately bounded not just by $\log\; d(S)$, but now by $\log\; d(A)$. The subsequent discovery of such properties with each level means that $x$ has many properties $A_{i}$ and is in their intersection: $x\in\cap_{i}A_{i}$. Typically, the size of each such subset is exponentially smaller than the size of $S$. Hence, the central question is how quickly we can find a property. Let $b_{i}$ be defined by $d(A_{i})=d(S)2^{-b_{i}}$, which basically means that $l(q_{i})=l(x)-b_{i}$. The idea is that a small part of $x$ can be used to find $A_{i}$. Why so? Because it is improbable that a property holding for $x_{1}$ does not hold for any other part of $x$. The intriguing part is that those other parts of $q$ can be computed directly from those other parts of $x$ without reverting to $p$! Why is it possible? But there is no guarantee that it is possible. What one can try is to extend the current explanation as far as possible and for the rest of the sequence and start afresh. I think, the significant compression is achieved by being able to extend the current little explanation $q_{1}$ to other parts of $x$, like $x_{2}$, such that $q_{1}$ can generate $x_{1}x_{2}$, while having been found using $x_{1}$ only. Why is this possible? Or probable? Well, because of universal induction. If some compression is achieved, chances are, that it is predictive. Yes.

And why is this not possible directly with $p$? By definition. Since $x$ is deep, the only way to generate it from $p$ is through $q$ or other intermediate stages! We could, for a start, restrict the set of sequences to hierarchies, with $U(p_{i}^{l})=q_{i}^{l+1}$ and $p^{l}=p_{1}^{l}\ldots p_{n_{l}}^{l}=q_{1}^{l}\ldots q_{m_{l}}^{l}$ and $l(p^{l}), thus rendering the different parts of a sequence independent. Notice, that the $p$‘s are mapped to $q$‘s which may be different partitions than those used for further computation. Hence, those $p_{i}^{l}$ will have to be looked for independently, taking up $2^{l(p_{i}^{l})}$. In total for a level, we get $\sum_{i}2^{l(p_{i}^{l})}$, which is much, much less than $2^{l(p^{l})}$. However, this is a trivial result, given the restriction. Hence, I get just what I have put in. Damn it.

Maybe I can just learn from it that if independent parts of a sequence occur, then things become easy very quickly.

It has to be related to those damn feature bases as a direct mapping from the data $x$ to a slightly compressed description $q$, the parameters of which can be compressed further.

## Learning spatial relations

The big demonstrator that I have in mind is the ability to talk about some line drawing scene, after having extracted various objects from it through compression. That is very ambitious indeed. One project that comes to mind is some kind of object learning with children. You point to an object, say it’s name and the system learns about it. Or it can ask you about objects. Then you can give commands and say “put the plate on the table”, “is there something in the plate?”, “cut away a leg of the chair”, “divide the line in half”. It would really be an interactive scene comprehension test. Or geometrical relations. “If a line intersects with two parallel lines, then the angles will be equal.” This is the assessment of the truth value of a statement. But also word learning would be possible. How about relations? This will be important, like spatial relations for example. ABOVE, LEFT, IN, ON, AT, ATTACHED. The simplest relation in one dimensional case is probably LEFT-OF and RIGHT-OF. When we have identified two objects on the tape, one is on the left of the right one. However, this relation will probably be difficult to learn, since it is not salient. But if the objects are attached then it is much more salient. For example, if all are zero, but one object is a 111111 and the other is 22222 then it is quite salient if you see ..00011111122222000… somewhere in the midst of zeros. The relation becomes an additional but optional constraint.

If I could only reproduce Jean Mandler’s “concept primitives”, it would be hilarious!

Hence, the AT(TACHED) relation is simple, since it is a compressing constraint. What about LEFT, RIGHT and BETWEEN? LEFT and RIGHT could be relations that occur as orthogonal relations to the AT relation, since they determine independent aspects of the spatial relation. Hence, a spatial relation is simple the determination of the spatial position relatively to an object. AT is a fairly compressing relations since it has a binary value but significantly narrows the set of residual positions. BETWEEN is simply a combination of left and right in the 1D case. In 2D things are more complicated. The object in between two objects is somewhere on a line between them. Hence it is not far away from the set of points defining that line. The details will kill me but so be it. The close-to-line scheme is fairly compressing and hence a useful determinant of a spatial relation. The ON relation is simply an AT plus ABOVE.

Now comes the really hard one: IN and OUT. It seems to require an understanding that an object defines a subset of the whole space. If your position is part of that subspace, then you are inside that object. And it is fairly compressing. Somehow, a closed path must be imagined along the border of the object. What is the inside of a closed path? It is also related to the concept of whether there is a path leading to the outside. Those are complex thoughts. What any path can or can not do is a type of statement whose truth value would be hard to assess. However, those any-path-statements have to be cracked somehow anyway. But let’s skip them for now.

The main lesson from those considerations is the need for features. The extensions of the present function network that are mentioned in the previous post are not enough. About features, see next post…

## Extending the function network compression algorithm

So, what are the next steps? I should expand my attempts with the function network. That, it seems, is the best path. One thing that I could estimate is the complexity of that line drawing world that I would like to analyse. Yes! Since, after all, I know the level of complexity that I am at now. Currently, I am at 20 bits for 2 state machines and 33 bits for 3 state machines. The next are 47 bits and 62 bits. For a line drawing of a house we need 1 bit for the background color and 5 points, which are 10 numbers in an N x N image. Lets say, N=128, so we need 7 bits for one number. Hence, we get 35 bits for the house. For more complex objects there may be some more bits necessary. Say 50-60. In any case, I am at the level where this sort of compression is well possible! It only has to be general enough! On the other hand, a single line would need 2 points, ergo 4 numbers, ergo 28 bits. It could be sufficient already now to represent lines. However, for this we have to implement exceptions, and positions, since lines are regularities in the position. But that’s about it! Then we can already proceed!

So, let’s summarize the drawbacks of the current state of the system.

1. Partition selection, interpretation rivalry.

2. Remove assumption of the concatenation of inputs.

3. Dealing with exceptions

4. Dealing with the position variable

5. Resolve primitives into arithmetic operations

6. Remove layers / execution levels

7. Interaction / regularities between inputs

8. Repeating strings

9. Function building

The most important restructuring is the introduction of the position variable and the successive execution that it requires. It could be achieved with a recursive node with two inputs one of which receives the output. How is this related to the position variable? Also, since we are removing concatenation, the system has to know somehow, where to write its outputs, which is also related to position. The positions could be generated as well. They should be, since they are just data, after all. Hence, we only need a new connection type, “entry taking”. Every output needs a position sequence to write into. But a different node may be computing it. Hence, those nodes should be paired somehow. No, not the nodes should be paired, but the final output strings. No. The position sequence could be complex itself. E.g. one may want it to write on positions 1, 4, 3, 8, 5, 12, 7, 16 which are interwoven themselves. But ultimately, a sequence has to be produced which has to be paired with the output of a node. Executing input tuples successively can be kept. We can just introduce a sequence class which contains a list of position-list / content-list pairs. The pairs are executed successively. To execute a pair, just write the content at the specified positions. This way, the representation of exceptions can be implemented, since exception pairs can be executed after the regular pairs. One such sequence feeds into one input of the child. Or a node just saves the pointers to sequences where it gets its inputs from.

And how about recursive sequences? Especially, what if two or more previous inputs are used in the computation such as in the Fibonacci sequence? Fuck it, let’s just skip that. I have found such a nice representation through the recursive node. What would be the alternative? Several previous sequence entries would have to be present as inputs. No, I don’t want to deal with that and it would also be useless for my purposes, I guess.

It looks fairly clear to me now, how to solve points 2, 3, 4 and 6. How would the primitives be resolved into arithmetic nodes? Let’s say, we have a PLUS node. It takes two inputs, adds them and gives one output. The special number is just a specialization of the PLUS node with a zero as a summand. The constant is the same as a special number, but just with n inputs fed in. I think, the constant needs to be a primitive though, since parametrizing n outputs with a single number is the seed of a 1-to-n map. Then combining a constant nodes (d, n) with a recursive plus node taking d in one input and its own output at the second input, with c being a starting value.

Thinking through the limitations of the previous model

So, resolving the incremental sequence is possible, albeit not so necessary. What about the alternating sequence? It is just a special case of repeating a subsequence. Or writing a constant every two positions. Hence, it could be a constant written on positions, whose sequence is described by an incremental sequence. This is something, that really should be dissolved. Or maybe not. I could first keep it, try to learn new functions and then dissolve it. It’s not essential anyway.

Point 8. A repeating sequence is just a generalization of the alternating sequence but with larger position increments. An entry then does not repeat every second position, but every n-th position. That’s all doable once positions are handled.

The concatenation of various outputs becomes non-trivial now and one has to compute positions. Let’s say, there are two nodes feeding into the same sequence in a concatenating way. The second one has to pay attention to where the first one ended, in order to compress optimally. As a first approximation, the first may write to positions 0-13 and the second to 13-37 and the third to 37-52, having to save 5 large numbers (13, 13, 37, 37 and 52). The position sequences will be incremental, thus being represented by (0, 1, 13), (13, 1, 37) and (37, 1, 52). Those would be listed into the same incremental node, which would create those position intervals. Therefore, we would get the input lists: [0, 13, 37], [1, 1, 1] and [13, 37, 52]. That’s where it’d be time to correlate different sequences. Hence, we conclude, if different sequences / inputs (point 7) can be solved, then the concatenation story is solved as well.

Point 7. The type of interaction that appeared here was is copy with a shift. One sequence can serve as a source. Then, we just need the specific number node. Which takes the first sequence as input and produces the second sequence as output. Its funny, that input and output are not at different levels now, but are both inputs to a single node! It’s just using the first input to compute the second one. Since we have got positions now, the SN node now also needs to know the position where to write its output. Therefore, I need both the position (0, 1, 2) and the content (0, 13, 37) of the source sequence. The positions are subjected to a subtraction of 1 while the content is copied. Negative positions are suppressed, so we get positions (0, 1) with the content (13, 37) which explains the first two numbers of the second input. The number 52 is then just invented by another SN. Yes!!! The representation capacity is strong enough to support that!

Point 9. Function building is fairly clear how to do that. I have worked on it already.

Point 1 is left. How to choose simple partitions. Partitions are nothing but the division of positions into subsets. The interval partition has been explained and represented above already. Similarly, the partition into (0, 2, 4, 6) and (1, 3, 5, 7) can be explained by two incremental sequences with parameters [[0, 1], [2, 2], [4, 4]] which can be compressed further.

Sounds good. The representation is coherent and consistent.

## Thinking about the next steps

Finding regularities is in fact not only something that leads to compression, it is also something that leads to the establishment of representations. After all, the residual variance is nothing but the variance within a fixed representation.

The question that I am heading for is how to join the present function network approach with the feature expansion story developed in the position paper. First, specialization seems simpler that generalization. This corresponds to finding constant features of the data set, regularities in other words. Hence, the present approach actually finds those features by compressing parts of the sequence or of the child inputs. That is fairly cool, however no quite general yet. One question is how to find generalizations? The other one is the following. Suppose we have got a narrow algorithm, such as the present one, that compresses all sequences up to some level (2 state TMs in our case) and some fraction of the sequences at the next level (80% of 3 state TMs in our case). What needs to be modified so that it gets the other 20% as well? Generally. Obviously, the narrow approach makes some assumptions that it is not supposed to make. Then the strategy is to loosen those assumptions and we don’t have a guarantee that we will get a tractable algorithm. Arguably, the present one only works because the sequences are fairly simple.

One could of course just keep going with the present approach. However, at least some theoretical guarantees would be very welcome. I have still not found the answer to the first question: how to find generalizations of a present representation? How to expand in the right direction? Suppose, one would like to expand the constant (c, n) toward the incremental sequence (c, d, n). Or the specific number (c) to a constant sequence (c, n)? In both cases the first representation is a special case of a more general representation. If the sequence is 7, 9, 11, … then the 7 can be fitted with a constant – 7. The others cause a discrepancy. Trying the difference between the constant and 9 leads to 2. Between 11 and the constant leads to 4. In any case, we would have to play around with differences between current and present symbols in order to discover the new regularity. It seems that the constant sequence does not help us in any way with this discovery. Why does it occur at all then that the constant becomes a special case of the new representation?

It doesn’t have to. Suppose the new representation is not a general incremental sequence, but a special one, the one starting with 0, i.e. it is the specialization (0, d, n) of the general one (c, d, n). The constant is the specialization (c, 0, n). Such sequences are 0, 4, 8, 12,… or 0, 3, 6, 9,… or 0, 11, 22, 33,… etc. They all start with zero. Thus, the constant 7, 7, 7,.. is never part of it, since it does not start with 0! Nevertheless, the same strategy of taking subsequent differences will discover it. The constant does not play an role here.

Is this the reason why narrow AI can not be generalized and one is rather advised to start from scratch?