Learning spatial relations

The big demonstrator that I have in mind is the ability to talk about some line drawing scene, after having extracted various objects from it through compression. That is very ambitious indeed. One project that comes to mind is the kind of object learning done with children. You point to an object, say its name, and the system learns about it. Or it can ask you about objects. Then you can give commands such as "put the plate on the table", "is there something in the plate?", "cut away a leg of the chair", "divide the line in half". It would really be an interactive scene comprehension test. Or geometrical relations: "If a line intersects with two parallel lines, then the angles will be equal." This is the assessment of the truth value of a statement. But word learning would also be possible.

How about relations? These will be important, spatial relations for example: ABOVE, LEFT, IN, ON, AT, ATTACHED. The simplest relations in the one-dimensional case are probably LEFT-OF and RIGHT-OF. When we have identified two objects on the tape, one is to the left of the other. However, this relation will probably be difficult to learn, since it is not salient. But if the objects are attached, it is much more salient. For example, if the background is all zeros, one object is 111111 and the other is 22222, then it is quite salient to see …00011111122222000… somewhere in the midst of zeros. The relation becomes an additional but optional constraint.
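Here is a minimal sketch of how such a 1D tape situation could be parsed into objects and relations; all function and variable names are hypothetical illustrations, not part of any existing system:

```python
# Minimal sketch: extract objects (runs of identical non-background symbols)
# from a 1D tape and read off LEFT-OF / ATTACHED relations.
# All names are illustrative; this is not part of an existing system.

def find_objects(tape, background=0):
    """Return a list of (symbol, start, end) runs, end exclusive."""
    objects = []
    i = 0
    while i < len(tape):
        if tape[i] != background:
            j = i
            while j < len(tape) and tape[j] == tape[i]:
                j += 1
            objects.append((tape[i], i, j))
            i = j
        else:
            i += 1
    return objects

def relations(objects):
    """Enumerate pairwise LEFT-OF and ATTACHED relations."""
    rels = []
    for a in objects:
        for b in objects:
            if a is b:
                continue
            if a[2] <= b[1]:
                rels.append(("LEFT-OF", a[0], b[0]))
            if a[2] == b[1] or b[2] == a[1]:
                rels.append(("ATTACHED", a[0], b[0]))
    return rels

tape = [0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0, 0]
objs = find_objects(tape)
print(objs)            # [(1, 3, 9), (2, 9, 14)]
print(relations(objs)) # LEFT-OF(1, 2) and ATTACHED in both directions
```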

If I could only reproduce Jean Mandler's "conceptual primitives", it would be hilarious!

Hence, the AT(TACHED) relation is simple, since it is a compressing constraint. What about LEFT, RIGHT and BETWEEN? LEFT and RIGHT could occur as relations orthogonal to the AT relation, since they determine independent aspects of the spatial relation. Hence, a spatial relation is simply the determination of the spatial position relative to another object. AT is a fairly compressing relation, since it has a binary value but significantly narrows the set of residual positions. BETWEEN is simply a combination of LEFT and RIGHT in the 1D case. In 2D, things are more complicated. The object in between two objects is somewhere on a line between them; hence it is not far away from the set of points defining that line. The details will kill me, but so be it. The close-to-line scheme is fairly compressing and hence a useful determinant of a spatial relation. The ON relation is simply AT plus ABOVE.
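A minimal sketch of the close-to-line idea for BETWEEN in 2D, assuming a distance threshold as a stand-in for "not far away" (all names are illustrative):

```python
# Sketch of the "close to the line between A and B" idea for BETWEEN in 2D.
# Purely illustrative; the threshold and names are assumptions.
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def between(p, a, b, tolerance=1.0):
    """p counts as BETWEEN a and b if it lies close to the segment a-b."""
    return point_segment_distance(p, a, b) <= tolerance

print(between((5, 5.4), (0, 5), (10, 5)))   # True: near the segment
print(between((5, 9.0), (0, 5), (10, 5)))   # False: far from the segment
```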

Now come the really hard ones: IN and OUT. They seem to require an understanding that an object defines a subset of the whole space. If your position is part of that subspace, then you are inside that object. And it is fairly compressing. Somehow, a closed path must be imagined along the border of the object. What is the inside of a closed path? It is also related to the concept of whether there is a path leading to the outside. Those are complex thoughts. What any path can or cannot do is a type of statement whose truth value would be hard to assess. However, those any-path statements have to be cracked somehow anyway. But let's skip them for now.
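For the geometric core, at least, a standard ray-casting test captures what "inside a closed path" means; this is a generic sketch, not the mechanism the system would have to learn on its own:

```python
# Minimal sketch of the IN relation: an object's border is a closed path
# (polygon), and a position is IN the object if it lies inside that path.
# Standard ray-casting test; names are illustrative only.

def point_in_closed_path(p, path):
    """Return True if point p is inside the closed polygon given by `path`."""
    x, y = p
    inside = False
    n = len(path)
    for i in range(n):
        x1, y1 = path[i]
        x2, y2 = path[(i + 1) % n]
        # Does the edge (x1,y1)-(x2,y2) cross the horizontal ray from p?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_closed_path((2, 2), square))  # True: inside the square
print(point_in_closed_path((5, 2), square))  # False: outside
```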

The main lesson from those considerations is the need for features. The extensions of the present function network that are mentioned in the previous post are not enough. About features, see next post…


Extending the function network compression algorithm

So, what are the next steps? I should expand my attempts with the function network. That, it seems, is the best path. One thing that I can estimate is the complexity of the line drawing world that I would like to analyse. Yes! After all, I know the level of complexity that I am at now. Currently, I am at 20 bits for 2-state machines and 33 bits for 3-state machines. The next levels are 47 bits and 62 bits. For a line drawing of a house we need 1 bit for the background color and 5 points, which are 10 numbers in an N x N image. Let's say N=128, so we need 7 bits for one number. Hence, we get about 70 bits for the house, plus the background bit. For more complex objects some more bits will be necessary. In any case, I am approaching the level where this sort of compression is possible! It only has to be general enough! On the other hand, a single line would need 2 points, ergo 4 numbers, ergo 28 bits. That could be sufficient already now to represent lines. However, for this we have to implement exceptions and positions, since lines are regularities in the positions. But that's about it! Then we can already proceed!
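Just to make the arithmetic explicit, here is the toy calculation behind these estimates (the only assumption is that each coordinate in an N x N image costs log2(N) bits):

```python
# Toy back-of-the-envelope calculation of the description length estimates above.
# Assumption: each coordinate in an N x N image costs log2(N) bits.
import math

N = 128
bits_per_coordinate = math.log2(N)            # 7 bits

def drawing_cost(num_points, background_bit=True):
    """Cost of naming num_points (x, y) points, plus one background-color bit."""
    return 2 * num_points * bits_per_coordinate + (1 if background_bit else 0)

print(drawing_cost(5))                        # house: 10 numbers * 7 bits + 1 = 71 bits
print(drawing_cost(2, background_bit=False))  # a single line: 4 * 7 = 28 bits
```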

So, let’s summarize the drawbacks of the current state of the system.

1. Partition selection, interpretation rivalry.

2. Remove assumption of the concatenation of inputs.

3. Dealing with exceptions.

4. Dealing with the position variable.

5. Resolve primitives into arithmetic operations.

6. Remove layers / execution levels.

7. Interaction / regularities between inputs.

8. Repeating strings.

9. Function building.

The most important restructuring is the introduction of the position variable and the successive execution that it requires. It could be achieved with a recursive node with two inputs, one of which receives the node's own output. How is this related to the position variable? Also, since we are removing concatenation, the system somehow has to know where to write its outputs, which is also related to position. The positions could be generated as well. They should be, since they are just data, after all. Hence, we only need a new connection type, "entry taking". Every output needs a position sequence to write into. But a different node may be computing it. Hence, those nodes should be paired somehow. No, it is not the nodes that should be paired, but the final output strings. And even that is not quite it: the position sequence could be complex itself. E.g., one may want to write to positions 1, 4, 3, 8, 5, 12, 7, 16, which are themselves two interwoven sequences. But ultimately, a position sequence has to be produced and paired with the output of a node.

Executing input tuples successively can be kept. We can just introduce a sequence class which contains a list of position-list / content-list pairs. The pairs are executed successively; to execute a pair, just write the content at the specified positions. This way, the representation of exceptions can be implemented, since exception pairs can be executed after the regular pairs. One such sequence feeds into one input of the child node. Or a node just saves pointers to the sequences from which it gets its inputs.
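A minimal sketch of that sequence class, assuming the simplest possible semantics (later pairs overwrite earlier ones), with all names hypothetical:

```python
# A minimal sketch of the sequence class described above: a list of
# (position-list, content-list) pairs that are executed in order, so that
# exception pairs can simply be appended after the regular pairs.

class Sequence:
    def __init__(self, length, fill=None):
        self.cells = [fill] * length
        self.pairs = []                      # list of (positions, contents)

    def add_pair(self, positions, contents):
        assert len(positions) == len(contents)
        self.pairs.append((positions, contents))

    def execute(self):
        """Write each pair in order; later pairs overwrite earlier ones."""
        for positions, contents in self.pairs:
            for pos, value in zip(positions, contents):
                self.cells[pos] = value
        return self.cells

seq = Sequence(8)
seq.add_pair(list(range(8)), [0] * 8)     # regular pair: background zeros
seq.add_pair([1, 3, 5, 7], [2, 2, 2, 2])  # regular pair: a constant on odd positions
seq.add_pair([5], [9])                    # exception pair, executed last
print(seq.execute())                      # [0, 2, 0, 2, 0, 9, 0, 2]
```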

And how about recursive sequences? Especially, what if two or more previous inputs are used in the computation such as in the Fibonacci sequence? Fuck it, let’s just skip that. I have found such a nice representation through the recursive node. What would be the alternative? Several previous sequence entries would have to be present as inputs. No, I don’t want to deal with that and it would also be useless for my purposes, I guess.

It looks fairly clear to me now how to solve points 2, 3, 4 and 6. How would the primitives be resolved into arithmetic nodes? Let's say we have a PLUS node. It takes two inputs, adds them and gives one output. The special number is just a specialization of the PLUS node with zero as one summand. The constant is the same as a special number, but with n inputs fed in. I think the constant needs to be a primitive though, since parametrizing n outputs with a single number is the seed of a 1-to-n map. Then, combining a constant node (d, n) with a recursive PLUS node, which takes d at one input and its own output at the second input, with c as the starting value, yields the incremental sequence.
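A sketch of that combination, with c=7, d=2, n=5 as a made-up example (the node functions are illustrative stand-ins, not an existing API):

```python
# Sketch of resolving the incremental sequence (c, d, n) into arithmetic nodes:
# a constant node emits d n times, and a recursive PLUS node adds each d to its
# own previous output, starting from c.

def constant_node(d, n):
    """The constant primitive: one number parametrizes n outputs (1-to-n map)."""
    return [d] * n

def recursive_plus_node(increments, start):
    """PLUS node whose second input is fed back from its own output."""
    out = []
    acc = start
    for d in increments:
        out.append(acc)
        acc = acc + d
    return out

# Incremental sequence with c=7, d=2, n=5:
print(recursive_plus_node(constant_node(2, 5), start=7))  # [7, 9, 11, 13, 15]
```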

Thinking through the limitations of the previous model

So, resolving the incremental sequence is possible, albeit not so necessary. What about the alternating sequence? It is just a special case of repeating a subsequence. Or of writing a constant at every second position. Hence, it could be a constant written at positions that are themselves described by an incremental sequence. This is something that really should be dissolved. Or maybe not. I could first keep it, try to learn new functions and then dissolve it. It's not essential anyway.

Point 8. A repeating sequence is just a generalization of the alternating sequence but with larger position increments. An entry then does not repeat every second position, but every n-th position. That’s all doable once positions are handled.
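A sketch of both cases, the alternating sequence above and the general repeating string, as constants written at strided positions (illustrative names only):

```python
# Sketch: an alternating sequence is a constant written at every second
# position, a repeating string a constant written at every n-th position.

def strided_positions(start, step, count):
    """Incremental position sequence (c, d, n)."""
    return [start + i * step for i in range(count)]

def write_at(cells, positions, value):
    for p in positions:
        cells[p] = value
    return cells

# Alternating sequence 3, 8, 3, 8, ... as two constants on strided positions:
cells = [None] * 8
write_at(cells, strided_positions(0, 2, 4), 3)
write_at(cells, strided_positions(1, 2, 4), 8)
print(cells)  # [3, 8, 3, 8, 3, 8, 3, 8]

# A repeating string 3, 8, 5, 3, 8, 5, ... with period 3 works the same way:
cells = [None] * 9
for offset, value in enumerate([3, 8, 5]):
    write_at(cells, strided_positions(offset, 3, 3), value)
print(cells)  # [3, 8, 5, 3, 8, 5, 3, 8, 5]
```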

The concatenation of various outputs becomes non-trivial now, and one has to compute positions. Let's say there are several nodes feeding into the same sequence in a concatenating way. The second one has to pay attention to where the first one ended, in order to compress optimally. As a first approximation, the first may write to positions 0-13, the second to 13-37 and the third to 37-52, which requires saving 5 large numbers (13, 13, 37, 37 and 52). The position sequences will be incremental, thus being represented by (0, 1, 13), (13, 1, 37) and (37, 1, 52). Those would be listed into the same incremental node, which would create those position intervals. Therefore, we would get the input lists [0, 13, 37], [1, 1, 1] and [13, 37, 52]. That's where it'd be time to correlate different sequences. Hence, we conclude: if different sequences / inputs (point 7) can be handled, then the concatenation story is solved as well.
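The same example, worked out as a few lines of code (purely illustrative):

```python
# Worked version of the concatenation example above: three writers cover the
# intervals 0-13, 13-37, 37-52, each as an incremental position sequence
# (start, increment, end). The incremental node's parameters then become three
# correlated input lists.

intervals = [(0, 1, 13), (13, 1, 37), (37, 1, 52)]

def expand(start, step, end):
    return list(range(start, end, step))

positions = [expand(*iv) for iv in intervals]
print([(p[0], p[-1]) for p in positions])   # [(0, 12), (13, 36), (37, 51)]

# The parameter lists fed into the shared incremental node:
starts = [iv[0] for iv in intervals]   # [0, 13, 37]
steps  = [iv[1] for iv in intervals]   # [1, 1, 1]
ends   = [iv[2] for iv in intervals]   # [13, 37, 52]
print(starts, steps, ends)
```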

Point 7. The type of interaction that appeared here is a copy with a shift. One sequence can serve as a source. Then we just need the specific number node, which takes the first sequence as input and produces the second sequence as output. It's funny that input and output are not at different levels now, but are both inputs to a single node! It's just using the first input to compute the second one. Since we have positions now, the SN node also needs to know the positions where to write its output. Therefore, I need both the positions (0, 1, 2) and the content (0, 13, 37) of the source sequence. The positions are subjected to a subtraction of 1 while the content is copied. Negative positions are suppressed, so we get positions (0, 1) with the content (13, 37), which explains the first two numbers of the second input. The number 52 is then just invented by another SN. Yes!!! The representation capacity is strong enough to support that!
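A sketch of that shift-copy interaction on the example numbers above (function names are made up for illustration):

```python
# Sketch of the "copy with a shift" interaction: a specific-number-style node
# takes the (positions, contents) of a source sequence and writes a shifted
# copy, suppressing negative positions.

def shifted_copy(positions, contents, position_shift=-1, value_shift=0):
    out_pos, out_val = [], []
    for p, v in zip(positions, contents):
        q = p + position_shift
        if q >= 0:                      # negative positions are suppressed
            out_pos.append(q)
            out_val.append(v + value_shift)
    return out_pos, out_val

# Source: the starts list above (positions 0, 1, 2 with contents 0, 13, 37).
src_positions, src_contents = [0, 1, 2], [0, 13, 37]
print(shifted_copy(src_positions, src_contents))
# ([0, 1], [13, 37]) -- explains the first two entries of the ends list;
# the remaining 52 has to be encoded separately.
```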

Point 9. It is fairly clear how to do function building; I have worked on it already.

Point 1 is left: how to choose simple partitions. Partitions are nothing but the division of positions into subsets. The interval partition has been explained and represented above already. Similarly, the partition into (0, 2, 4, 6) and (1, 3, 5, 7) can be explained by two incremental sequences with the parameter lists [0, 1] (starting values), [2, 2] (increments) and [4, 4] (lengths), which can be compressed further.

Sounds good. The representation is coherent and consistent.


Thinking about the next steps

Finding regularities is in fact not only something that leads to compression, it is also something that leads to the establishment of representations. After all, the residual variance is nothing but the variance within a fixed representation.

The question that I am heading for is how to join the present function network approach with the feature expansion story developed in the position paper. First, specialization seems simpler than generalization. It corresponds to finding constant features of the data set, regularities in other words. Hence, the present approach actually finds those features by compressing parts of the sequence or of the child inputs. That is fairly cool, however not quite general yet. One question is how to find generalizations. The other one is the following. Suppose we have a narrow algorithm, such as the present one, that compresses all sequences up to some level (2-state TMs in our case) and some fraction of the sequences at the next level (80% of 3-state TMs in our case). What needs to be modified so that it gets the other 20% as well, in general? Obviously, the narrow approach makes some assumptions that it is not supposed to make. The strategy is then to loosen those assumptions, and there is no guarantee that we will end up with a tractable algorithm. Arguably, the present one only works because the sequences are fairly simple.

One could of course just keep going with the present approach. However, at least some theoretical guarantees would be very welcome. I have still not found the answer to the first question: how to find generalizations of a present representation? How to expand in the right direction? Suppose one would like to expand the constant (c, n) toward the incremental sequence (c, d, n), or the specific number (c) toward the constant sequence (c, n). In both cases the first representation is a special case of a more general representation. If the sequence is 7, 9, 11, …, then the 7 can be fitted with a constant, 7. The others cause a discrepancy. Taking the difference between the constant and 9 leads to 2; between 11 and the constant leads to 4. In any case, we would have to play around with differences between current and previous symbols in order to discover the new regularity. It seems that the constant sequence does not help us in any way with this discovery. Why does it occur at all, then, that the constant becomes a special case of the new representation?

It doesn't have to. Suppose the new representation is not the general incremental sequence, but a special one, the one starting with 0, i.e. the specialization (0, d, n) of the general (c, d, n). The constant is the specialization (c, 0, n). Such sequences are 0, 4, 8, 12, … or 0, 3, 6, 9, … or 0, 11, 22, 33, … etc. They all start with zero. Thus, the constant 7, 7, 7, … is never part of this family, since it does not start with 0! Nevertheless, the same strategy of taking subsequent differences will discover it. The constant does not play any role here.
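A sketch of that difference-taking strategy (an illustrative function, not the actual algorithm):

```python
# Sketch of the difference-taking strategy: fit a sequence with the
# specialization (0, d, n) or the general (c, d, n) by checking whether the
# successive differences are constant.

def fit_incremental(seq):
    """Return (c, d, n) if the sequence is incremental, else None."""
    if len(seq) < 2:
        return None
    diffs = [b - a for a, b in zip(seq, seq[1:])]
    if all(d == diffs[0] for d in diffs):
        return (seq[0], diffs[0], len(seq))
    return None

print(fit_incremental([7, 9, 11, 13]))    # (7, 2, 4): general incremental
print(fit_incremental([0, 11, 22, 33]))   # (0, 11, 4): starts with zero
print(fit_incremental([7, 7, 7]))         # (7, 0, 3): the constant as (c, 0, n)
print(fit_incremental([7, 9, 12]))        # None: not incremental
```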

Is this the reason why narrow AI cannot be generalized and one is better advised to start from scratch?


AGI conference

I have submitted a paper to this year’s Conference on Artificial General Intelligence and it was fortunately accepted. The paper is available here: Franz (2015) Toward tractable universal induction through recursive program learning


How I’d build an intelligent machine: a position paper

Finally, after weeks of work, the position paper is written! It contains a hopefully clear presentation of my ideas about AGI and a tentative outline of how to build it. Franz (2015) Artificial general intelligence through recursive data compression and grounded reasoning


Artificial intelligence, quantum computation, path integrals and aliens

Some thoughts of a drunk physicist and AI fan…

As a matter of fact, intelligent brains emerged by natural processes in the universe. Therefore, obviously, the laws of physics allow such an emergence. One of my main assumptions is that a general compressor of data is the main part of an intelligent system. Taken together, this leads to the question whether we could think of physical systems that self-organize to become general data compressors. Matter structured in such a way that it represents and encodes information about other matter. Of course, we do not really want to deal with intentionality. There are no arrows pointing from one piece of matter to a different one; there is no real "aboutness". What I mean is that one piece of matter encodes a program that can be run to generate the way some other piece of matter is structured.

For example, the brain has compact representations of an object. That means that if that representation becomes neurally active, it gets projected onto the retina in such a way that the resulting activation structure of the retina is similar to its activation when the object is actually present in the visual field – imagination.

What if one could circumvent all the trouble with living systems, cells, brains etc. and build a physical system that self-organizes to compress data? Of course, we are not able to do so now, since we don't have a general (engineered, programmed) compressor. But assume we had one. Assume the compression rate is present as an objective function to be maximized. The crucial idea is then that physical systems are, in general, energy minimizers. After all, minimization of energy is the physical reason why anything at all happens in the world. Hence, if we manage to build a system whose energy corresponds to the negative/inverse compression rate, then minimizing the energy will maximize the compression rate!

A current example of this is the D-Wave quantum computer. If I understand correctly, it is built to solve large quadratic optimization problems. In other words, there is an objective function – a quadratic form – that has to be minimized. A quantum system has then been built in such a way that its energy potential equals this objective function.

The interesting part about quantum systems is that in the path integral formulation one can view them as exploring all different paths from A to B simultaneously, therefore potentially finding the energy minimum more effectively than classical systems. Also, local minima are less of a problem, since quantum systems can tunnel through energy barriers. Path integrals, loosely speaking, represent depth-first and breadth-first search at the same time, i.e. parallel search.

How cool would it be to build a quantum computer that searches for maximal compression of a data set in such a way!

Since intelligent brains occurred in nature, maybe the universe is somehow deeply structured in such a way that in some part of it, intelligent systems emerge? Maybe the grand unified theory already entails the occurrence of intelligent parts of the universe and the way it happened on earth is just one of many ways? Maybe discovering this theory leads to ways to look for aliens?


On powerful sequence compressors

When I think about it, the very thing that I am trying to construct is a list of short programs that can compress a sequence, which brings me quite close to Marcus Hutter’s work. What is missing is a really cool compressor. A recursive meta-compressor.

The crucial question is how to enumerate all programs ordered according to their length, shortest programs first. Levin search or other sorts of universal search come to mind. What those methods do not take into account is that short substrings of a structured sequence can be used to compress the whole sequence to some extent. This compression can then be used to make predictions, and the remaining discrepancies can be used for further learning. The idea is to reduce the search space online, on the fly, instead of searching through the whole space and halting only after one has found the final solution. If a sequence x has Kolmogorov complexity K(x), that means that the length of the shortest program p printing x is K(x). Usually, the strings we are interested in are the structured ones, the compressible ones, which are exponentially small in number. Those have a small complexity. But does that not imply that a short substring could already recover most of the program p, thereby capturing, in some sense, the largest principal components of the sequence? This really seems to be one of the things that my current sequence predictor does. It has all the crucial components, but is just not general enough. But would it not be a really cool proof if I could show that programs of compressible sequences can be constructed incrementally, becoming increasingly longer as they approximate the sequence better and better?

Now I can clearly see what my current compressor does, since I have programmed it. I could also formalize the class of sequences that it can predict. Hence, maybe the best way to arrive at that general compressor is simply to try to program it. Even more so, since I now have a quite good idea of how to do that.

Interestingly, the assumption that any part of the sequence is sufficiently representative of the whole sequence is called ergodicity, if I remember correctly. Hutter expressed it as "the ability to recover from errors". Wikipedia says: "In econometrics and signal processing, a stochastic process is said to be ergodic if its statistical properties (such as its mean and variance) can be deduced from a single, sufficiently long sample (realization) of the process." Generally, one could sloppily say that a sequence is ergodic if its largest "principal components" – its "essence" – can be extracted from a single sample of the sequence. After all, if the sequence were not ergodic, a non-representative piece might occur at some large position of the sequence. But in order for a program to print the whole sequence, the position of that piece would have to be encoded, which is a large number. Hence, the program will not be that short. Very informally/intuitively speaking.
