5/6/12

Discussions with Ben Goertzel on the AGI list


Some of my replies are edited for clarity:


Ben Goertzel: [agi] Finding the "Right" Computational Model to Support
Occam's Razor

Just some speculations about possible theoretical computer science I'd
do if I had the time ;p

http://multiverseaccordingtoben.blogspot.com/2012/08/finding-right-computational-model-to.html

On Mon, Aug 27, 2012 at 12:37 PM, Boris Kazachenko:

Maybe you should spend your time more wisely :).

All this confusion results from using languages designed for irrelevant
tasks. For GI, the language must be generated by the compressive Occam's
Razor algorithm itself, with incremental syntax produced by its past
iterations. Notice the POV difference: you don't have some fixed & final
complexity for GI to reduce. Rather, you go through an indefinite number of
complexity accumulation / compression cycles. In my terms, that's a
current-level search / next-level evaluation cycle, iterated for as long
as you keep accumulating the data.

Ben Goertzel:

Yes, the iteration you mention is important... so at each stage in its
life, the AGI is doing Occam's Razor according to the "programming
language" implicit in what it's learned so far...

This is a matter of learning "computational models / programming
languages" on the fly, that are specifically adapted to the
environment of a given intelligent system, and the internal structures
emergent within that system as it learns/grows/develops...

And it may be that the "base", the initial measure of simplicity used
at the start of this iteration, is not all that relevant to the end
result -- so long as the base is something reasonable, and not totally
out of synch with the environment, goals and architecture of the
system...

However, that doesn't eliminate the theoretical/mathematical/philosophical question
I posted in that blog post...

Please note: I don't think that theoretical/mathematical/philosophical
question needs to be solved in order to create AGI. OpenCog already
embodies its own choice regarding the base language for measuring
simplicity, the Atom language...

I just think it's an interesting question, anyway...

And it may be relevant ultimately to creating a good general theory of
general intelligence, which we lack now...

However, I strongly suspect we can succeed at building powerful AGI
without needing to have a good general theory of general intelligence
beforehand..

Ben,

> Yes, the iteration you mention is important... so at each stage in its
> life, the AGI is doing Occam's Razor according to the "programming
> language" implicit in what it's learned so far...

"Stage of life" sounds very coarse. I meant a level of generalization, with a
great many present simultaneously & data passing through each in
microseconds.

> And it may be that the "base", the initial measure of simplicity used
> at the start of this iteration, is not all that relevant to the end result --
> so long as the base is something reasonable, and not totally out of synch
> with the environment, goals and architecture of the system...

I'd say a measure of *complexity*, simplicity is a reduction thereof. And
that measure should be consistent across all levels, for global resource
allocation.

> However, that doesn't eliminate the theoretical/mathematical/philosophical question I
> posted in that blog post...

I think it does, - you measure complexity as data + syntax, & simplicity as
the difference between initial & compressed complexity. Of course, there are
also Schmidhuber's time (speed) & space (memory) components of complexity, but
I deal with them by averaging the past productivity of each component. I then
use them as a measure of "opportunity cost" for the data & syntax components
of complexity.

> OpenCog already embodies its own choice regarding the base language for
> measuring simplicity, the Atom language...

That's not basic enough. AtomSpace already assumes some initially known
structure in the data (hypergraphs), while I think the most basic data
should be simply a sequence of integers.

> However, I strongly suspect we can succeed at building powerful AGI
> without needing to have a good general theory of general intelligence
> beforehand..

I think that's wishful thinking :). Well, except by slavishly copying the
brain, but neither of us is doing that.

Boris,

Here's something that perplexes me about your approach...
Now, OpenCog is quite complicated. So, even if the design is workable, there's a good reason for it not being finished yet, and not doing anything amazing yet.
However, your approach seems very simple. So, if it's so simple and
it's workable, then why doesn't it work yet? You're a very smart guy
and have been working on it a while... ;p
> I think it does, - you measure complexity as data + syntax, & simplicity
> as a difference between initial & compressed complexity.

Yes, but you are making a choice about what language to use in the role
of "syntax", and what language to use to express compressed entities..

Ben,
> Here's something that perplexes me about your approach...
> Now, OpenCog is quite complicated. So, even if the design is workable, there's a good
> reason for it not being finished yet, and not doing anything amazing yet.
You have no good reason to think that it is workable.

> However, your approach seems very simple. So, if it's so simple and
> it's workable, then why doesn't it work yet? You're a very smart guy
> and have been working on it a while... ;p

That's your last recourse, rhetorical questions.
I am trying to explain why your "complexity" is superfluous.
That doesn't mean I don't have problems, but how am I supposed to explain
them when you don't even get my solutions?

>> I think it does, - you measure complexity as data + syntax, & simplicity
>> as a difference between initial & compressed complexity.

> Yes, but you are making a choice about what language to use in the role
> of "syntax", and what language to use to express compressed entities..

Once again, I don't need any extraneous "language". I define operations: comparison, evaluation, & higher-order evaluations, which generate incremental syntax. These operations can be implemented directly in hardware, without any intermediate language. And they aren't so simple, if they are to be effective (scalable). I think I've got comparison, though I may have to extend it. Reasonably complete evaluation is a lot trickier; that's what I am working on now. I would be glad to discuss that, but I've probably already lost you here.


Monday, May 14, 2012 1:09 PM

Ben: Actually the use of context to help guide visual recognition is a key idea guiding the DeSTIN architecture (and probably Boris's architecture as well)...

Boris: In my approach, "context" is simply higher derivation (more distant association) levels of each pattern. This starts from the most basic match / miss differentiation for single-integer comparands.
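The basic match / miss differentiation for single-integer comparands, and "context" as a higher derivation level, might be sketched as below. The min() reading of match and the function names are assumptions for this sketch, not Boris's published definitions:

```python
def compare(a: int, b: int) -> tuple[int, int]:
    """Compare two single-integer inputs.

    miss:  signed difference between the comparands
    match: overlap of their magnitudes, taken here as min(a, b) --
           one reading of "match is complementary to difference"
           (an assumption for this sketch).
    """
    miss = b - a
    match = min(a, b)
    return match, miss

def second_derivation(seq: list[int]) -> list[tuple[int, int]]:
    """'Context' as a higher derivation level: compare the misses
    produced by first-level comparisons among themselves."""
    misses = [compare(a, b)[1] for a, b in zip(seq, seq[1:])]
    return [compare(a, b) for a, b in zip(misses, misses[1:])]
```

For example, compare(3, 5) yields match 3 & miss 2; feeding the misses back through compare gives the next (more distant) association level.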

Ben: (by "semi-specialized" I mean "biased to certain sorts of patterns, but still with the capability to overcome these biases given enough time & experience...)

Boris: Even in humans, such specialization exists only in primary (vs. higher association) cortices. It's not necessary for general intelligence; we have it for three reasons:
- it's an artifact of phylogeny,
- necessary for postnatal survival,
- a speed-up (at the expense of efficiency & scalability) to compensate for ridiculously slow wetware.

Ben:

My view is that "semi-specialization" (biasing) as I called it, is going to be necessary for any human-level general intelligence to run on feasible computational resources. In the human brain it compensates for the weaknesses of wetware, but in a computer system it will be needed to compensate for the weaknesses of the hardware.... Each real-world physical infrastructure for intelligence has its own weaknesses that need to be compensated..

Boris:

Our current hardware has the opposite weakness, - it's designed for fast sequential operations, which would favor a slower & more analytical algorithm. But it doesn't really matter; hardware will ultimately be designed for the algorithm (while evolution does it the other way around).
You didn't respond to my main point: this initial speed-up must come at the expense of ultimate scalability, - the system will then have to *overcome* these biases. Such a trade-off makes no sense for AGI because its immediate survival is not at risk.
In any case, we need to design a scalable core: a general pattern discovery algorithm, *before* adding specialized biases to it. In practical terms, this is a moot question.

4/26/2012:
Boris,

On that very high level of description, your model sounds similar to my proposed translational/scale/rotation invariant DeSTIN...

However, the "trick" of DeSTIN is supposed to be the processing that occurs inside a node, to arrive at a belief state (a probability distribution over the centroids that constitute the pattern-library associated with the node). This processing makes use of states from nodes above and below in the hierarchy.

I note that Itamar's philosophy is more like yours than mine, in that he thinks it's a good direction to build up this sort of hierarchical pattern recognition into a whole AGI. Whereas I'm currently more interested in using a hierarchical pattern recognition system as a "perception lobe" for a more diverse cognitive architecture aimed at AGI.... So he and I can both collaborate on immediate DeSTIN work, but with different long-term aims...

-- Ben
Ben,

> On that very high level of description, your model sounds similar to my proposed translational/scale/rotation invariant DeSTIN...
> However, the "trick" of DeSTIN is supposed to be the processing that occurs inside a node, to arrive at a belief state (a probability distribution over the centroids that constitute the pattern-library associated with the node). This processing makes use of states from nodes above and below in the hierarchy.

In my model, this occurs inside a template pattern: his node, which searches across a queue: his level. In terms of bottom-up fuzzy clustering, centroids are my output patterns, & their weight is a projected match, = probability. But this weight is not known until the search across the queue is complete, & the pattern is displaced out of it. So, a continuously updated probability distribution based on these outputs will represent the next (not current) queue. See, there is a delay between computing individual & aggregate probability: the former must be evaluated on the basis of the latter *for the prior* search span. My lower-level states are input patterns, current-level are template patterns, & higher-level feedback is aggregate selected output patterns.

> I note that Itamar's philosophy is more like yours than mine, in that he thinks it's a good direction to build up this sort of hierarchical pattern recognition into a whole AGI.

Right, but mine is a lot more incremental, = analytic :). In my hierarchy, a level is a 1D queue of inputs, in his a 2D layer of nodes. This makes all the difference in the world, - I form explicit Cartesian coordinates incrementally. That means 1D Cs of 1D patterns are compared across 2D C on the next level, where angles (& then rotation) are computed as a ratio of 1D C difference over 2D C difference for each pattern. I don’t need a separate algorithm for that, it’s the same comparison process that I use for all other variables of a pattern.

“Scale” also has an incremental dimensionality: 1D length, 2D area, 3D volume, TD trajectory. All lower-level variables are aggregated within each level of such “scale”. Comparison by division of both “scale” & “content” variables across patterns will compute ratios, & for scale-invariant patterns, the ratios of the “scale” component & “content” components will match within a pattern.
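Under the assumption that "content" is aggregated within each "scale" level, the comparison-by-division test described above might look like this minimal sketch (the dict keys & tolerance are hypothetical):

```python
def scale_invariant(p1: dict, p2: dict, tol: float = 0.05) -> bool:
    """Compare two patterns by division: if the ratio of their 'scale'
    variables (e.g. 1D length) matches the ratio of their 'content'
    variables (lower-level values aggregated over that scale), the
    pair is scale-invariant. Keys & tolerance are assumptions."""
    rs = p2["scale"] / p1["scale"]      # scale ratio across patterns
    rc = p2["content"] / p1["content"]  # content ratio across patterns
    return abs(rc - rs) <= tol * rs     # ratios match within tolerance
```

So a pattern twice as long with twice the aggregated content passes, while one twice as long with triple the content does not.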

I call it pattern *discovery*, vs. recognition, to stress that it’s unsupervised learning.

> Whereas I'm currently more interested in using a hierarchical pattern recognition system as a "perception lobe" for a more diverse cognitive architecture aimed at AGI.... So he and I can both collaborate on immediate DeSTIN work, but with different long-term aims...

But we all agree that this should be a starting point anyway :).

In my model, the patterns are inherently invariant to scale, angle, & a bunch of other things. That's because each of those things is automatically encoded as a separate variable. You simply quantify partial match & difference for each, & top-down feedback determines a threshold for overall match: output to a higher level. And that output pattern is incrementally complex, because it includes (selectively) these newly aggregated partial matches & differences.

On Fri, Apr 27, 2012 at 11:54 PM, Boris Kazachenko wrote:
I am trying to find something compatible with my approach, would greatly appreciate suggestions.
As I was explaining, my core principle is incremental complexity of inputs, generated by just as incremental range of comparison among them.
That complexity is layers of syntax representing partial matches & differences from prior comparisons. The purpose of this syntactic expansion is predictive search pruning: each increment is selected by its contribution to projected match. That should make for a scalable search.
Projected match is calculated as past match adjusted by vectors: past differences projected onto future coordinates. It is selected for elevation by subtracting average_projected_match * redundancy_of_representation on the higher level. And so on.
So, any compatible approach must start with some form of fuzzy clustering over dense arrays of single-variable elements: a scan line of mono-pixels across an image. It should then cluster / compare selected (increasingly sparse) 1D results / patterns over 2D frame, 3D focus, 4D stream, & then discontinuous search expansion.
Does anyone know of something like that? I despair.

Boris.
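The scan-line clustering asked for above might be sketched as follows: compare consecutive mono-pixels, then cluster contiguous spans whose match deviation keeps the same sign. The fixed threshold `ave` stands in for top-down feedback, & the min()-as-match reading is an assumption of this sketch:

```python
def form_patterns(pixels: list[int], ave: int = 10) -> list[dict]:
    """Cluster a scan line of mono-pixels into 1D patterns.

    Each pattern is a contiguous span of pixel pairs whose match
    deviation (min(a, b) - ave) has the same sign; 'ave' stands in
    for a top-down feedback threshold (assumed fixed here).
    """
    patterns, current = [], None
    for a, b in zip(pixels, pixels[1:]):
        match, miss = min(a, b), b - a
        sign = match - ave > 0
        if current is None or sign != current["sign"]:
            if current:
                patterns.append(current)  # sign change terminates pattern
            current = {"sign": sign, "M": 0, "D": 0, "span": 0}
        current["M"] += match - ave  # accumulated match deviation
        current["D"] += miss         # accumulated difference
        current["span"] += 1
    if current:
        patterns.append(current)
    return patterns
```

The output patterns (sign, summed match deviation M, summed difference D, span) would then be the single-variable inputs to the next, sparser level of comparison.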

Sat, Apr 28, 2012
Ben,

> I can't quite grok exactly what function the algorithm you're looking for should fulfill...

The function is prediction: "The purpose of this syntactic expansion is predictive search pruning: each increment is selected by its contribution to projected match. That should make for a scalable search..."
I guess it's hard to understand because the problem is *general*. That can't be helped: this is about AGI.

> I know what incremental clustering (online clustering) is ....

"Clustering" is just a mainstream term for any unsupervised pattern discovery.

> Are you "just" looking for a clustering method that will work on k-dimensional patterns, for small k?

I am looking for a consistent application of principles I just outlined. Any number of methods apply some of them, somehow, that doesn't really help: the whole point is "pruning".
The initial dimensionality I was talking about is spatio-temporal: the external order in which inputs are compared. I guess your k is internal dimensions: variables that make up a pattern. That should be a product of initial comparisons, for any consistently incremental approach.

> If so, what is the mathematical form of the "patterns" you're considering?

"Mathematical form" is incrementally complex syntax that consists of variables generated by prior comparisons...
Sorry to be a pain, Ben, but I can't explain it better if I don't know what part of that is not clear.
Ben,

> "Clustering" usually refers specifically to unsupervised learning methods that are
> oriented toward learning categories of items, where all the items in each category C
> are explicitly chosen to have a low "distance" between them according to some metric.

That's my definition of a pattern: a set of matching inputs. Match is complementary to difference, or "distance" in your terms. I use "distance" to mean difference of S-T coordinates only, for clarity.

> This is different from general unsupervised learning.

No, it's not. "Learning" is different from simple recording precisely in that it's selective for "patterns", & exclusive of "noise".

> Anyway, I think if I were doing to try to build an AGI system along the lines you describe, I would
> 1) Choose to represent patterns as programs in some simple language, probably a LISP or combinatory logic variant (hmm, maybe MOSES's representation language, Combo...)

My approach is *all about* its own language: incrementally built-up syntax. It can't be any simpler at the start, & gets as complex as the subject requires. In any truly *general* method, there should be no distinction between language & algorithm.

> 2) Do pattern recognition using MOSES (or some other probabilistic evolutionary program learning method), using

I *hate* random variation in evolutionary algorithms, - it's the antithesis of intelligence. Variation in my approach is always directional: I increase | reduce complexity of comparison depending on the search results (match vs. costs) at the prior level of complexity. The equivalent of reproduction here is the number of corresponding-complexity comparisons, defined as an incremental search span on a higher level.

> A) for some programs, a fitness function involving: An Occam's Razor term (to bias toward simple programs), a term based on predictive accuracy,

That's what my algorithm is doing: "Projected match is calculated as past match adjusted by vectors: past differences projected onto future coordinates."
Occam's Razor: "It is selected for elevation by subtracting average_projected_match * redundancy_of_representation on the higher level. And so on."
Redundancy is the first variable representing cost (vs. simplicity), & higher types of cost (complexity) are added incrementally.
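Read literally, the two quoted formulas might be sketched as below. The linear projection & the sign convention (accumulated variation erodes projected similarity over distance) are assumptions of this sketch, as are all the names:

```python
def projected_match(past_match: float, past_diff: float, distance: float) -> float:
    """Past match adjusted by past differences projected onto future
    coordinates; linear projection over 'distance' is assumed."""
    return past_match - abs(past_diff) * distance

def select_for_elevation(p_match: float, ave_p_match: float, redundancy: float) -> float:
    """Occam term: subtract average projected match weighted by the
    redundancy of representation on the higher level. A positive
    result selects the pattern for elevation."""
    return p_match - ave_p_match * redundancy
```

So a pattern with strong past match but large accumulated differences projects weakly, & even a strong projection is suppressed when its higher-level representation is redundant.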

> and a term based on interaction information or the partial information lattice (nice
> mathematical measures of surprisingness).

That's "novelty": selection by subtracting higher-level locally-projected match, in my intro part 7.

> B) for other programs, a fitness function involving Occam's Razor plus terms based on predictive accuracy and pattern frequency

"Frequency" is simply an incrementally higher-level pattern, not a distinct category.
To summarize, I am already doing most of these things, but *incrementally*, which is my core principle. The others, I have specific high-level reasons not to do.

****
The function is prediction: "The purpose of this syntactic expansion is predictive search pruning: each increment is selected by its contribution to projected match. That should make for a scalable search..."
***

> the clause "each increment is selected by its contribution to projected match" worries me a bit. That sounds like you're advocating some sort of greedy search, where the fitness of each incremental change is gauged on its own. But greedy search may be too limiting to recognize the complex patterns in the real world...,

Just because I want to quantify predictive value on each step doesn't mean that these predictions must be exclusive. That evaluation only assigns weights, which then select *potentially multiple* alternative representations for higher-level search.

> Greedy search methods are faster, but may not be adequate for the problem at hand...

My approach is actually the opposite: I initially "invest" in expensive incrementally complex syntax & redundant representations. That'll make it slow & uncompetitive with conventional methods at the start, but should ultimately produce more accurate predictions to scale the search in the long run.

> or in an agent's goal-relevant behavior in the real world.

That's a "narrow AI" attitude :).

I still don't think you understand what I mean by "incremental". And this is closely related to a "man-with-a-hammer syndrome": you (& most mathematicians & programmers) think in terms of tools: methods that are proven to work, while I think in terms of a high-level definition of general intelligence. The problem is, none of the current methods scale to do anything interesting, precisely because they're not consistently (incrementally) derived from such a definition.

I appreciate that you're trying to help, Ben, but we seem to be going in circles. There's got to be a better way.

Boris.

Alex,

>Would you formulate your problems in contradiction form? In such manner as "I work on system X. It provides function Y but with some negative effects. I know how to fix them, but it generates new problems."
Right now, I am looking for approaches that share my starting point & general principles, below.
I am having a tough time explaining these, & it would be so much tougher to explain my current problems :).
For example, I am trying to find a common formalism for comparison & evaluation functions. Does that help? :).

Boris.

Boris to Ben:

> But of course finding an efficient search algorithm is the holy grail of AI since nearly the start, right?

Yes, but there never was a universal criterion to measure that efficiency on a single-input level. A key point of my approach is that it's strictly bottom-up, which means that I can quantify & translate *projected match* incrementally across all levels. All complex patterns (eventually equivalent to semantic concepts) are derived incrementally, from original single-variable inputs. In addition to layers of higher syntax, they all share lower-level syntax, so there's a common language across the hierarchy.

> Why would you want to deal with 1D queues??

Because it's the most incremental (potentially selective) way to expand search. Not necessarily in terms of the number of inputs, but in terms of the additional variables you need to introduce. When you expand search in 2Ds at once, you also need to add derivatives in 2Ds, - that's an extra cost for search (this goes back to my point that we need explicit coordinates & derivatives to form predictive vectors). On the other hand, adding 1D at a time allows selecting inputs for each additional layer of syntax, reducing the costs overall.
That 1D is a temporal sequence of input patterns, the queue is FIFO. The patterns themselves can have an arbitrary number of dimensions, but these dimensions are also sequenced, to represent a macro-to-micro hierarchy, such as POV-normalized length ( width ( depth of 3D objects. Comparison between these complex hierarchical patterns is also selective in detail, - comparison between their lower-D (& potentially higher-derivation) levels is conditional upon the results of comparison between their higher levels.

Do you see a pattern here, - incremental -> selective, = efficient?
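One way to read the FIFO-queue search described above: each new input is compared against the templates in a fixed-length queue, & deeper comparison is conditional on the higher-level match. The variable names, queue length, & threshold below are all hypothetical:

```python
from collections import deque

def queue_search(inputs: list[dict], maxlen: int = 4, ave: int = 5) -> list[int]:
    """Search each new input against a FIFO queue of prior inputs.

    Comparison of the lower-level 'detail' variable is conditional on
    the 'summary' match clearing a threshold -- incremental selection.
    """
    queue = deque(maxlen=maxlen)
    results = []
    for inp in inputs:
        for template in queue:
            match = min(inp["summary"], template["summary"])
            if match > ave:  # only then compare lower-level detail
                match += min(inp["detail"], template["detail"])
            results.append(match)
        queue.append(inp)  # oldest template is displaced when full
    return results
```

The point of the sketch is the conditional deepening: most template comparisons stop at the summary level, so added syntactic detail is only paid for where the higher-level match justifies it.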

> but it's an awkward way to represent many kinds of patterns, isn't it?

It's awkward to visualize, but so is a tangle of neurons in our brain, right?

> and certainly isn't brain-like ;p

We don't really know what a brain is like functionally, otherwise we'd be out of a job. And evolution had some horrible design constraints that we don't have to deal with.

Boris to Ben:

> As you say, this is Jeff Hawkins' basic philosophy, but his
> implementation thereof is badly inadequate...

Not quite, I think his basic philosophy is badly flawed too. The most important question is what you select for, & he is suggesting both invariance (match) & novelty (miss): intro part 5:
"He also suggests selection by novelty, but that would be mutually exclusive with generality. The scope of discovered generality is limited by the span of experience searched by an input, & the longer it searches (especially over the following inputs) the less "novel" it becomes. Any pattern is defined by some sort of repetition, so prioritizing novelty per se would simply select for random noise..."

Such basic contradictions mean that he can't really implement his "theory", & he tinkers with a rather conventional ANN instead. And the problem with the whole "network" metaphor is that it puts hardware first, & the algorithm becomes an afterthought. To make theoretical assumptions explicit & consistent, the algorithm must be designed for a single node; parallelizing it is a secondary issue. And beyond that, expanding search by 2D layers is inherently coarse, or non-incremental, intro part 3:

"For those with an ANN background, I want to make it clear that each level of search here is a 1D queue, not a 2D layer of an ANN. Adding a single coordinate at a time allows for incremental selection to compensate for the syntactic cost of its representation, & correspondingly more efficient scaling of search. Of course, ANNs don't even represent coordinates as explicit variables, which means that they can't be compared to form vectors. And without vectors any prediction must remain extremely crude.
So, while internally represented dimensionality of input patterns is increasing, their external order remains a 1D sequence. Also, dimensions are added in the order of decreasing rate of change therein. This means that spatial dimensions (with controllable rate) must be scanned first, while comparison over purely temporal sequence is delayed until accumulated variation within it justifies search for additional compression."

> This approach seems *so* simple

No one knows just how complex it must be to start "self-improving" on reasonable hardware. More complex short-cuts are necessary to increase speed; we already know how to do it the slow way, - evolution. I don't think it must be very complex, because very little of our math (which we developed with roughly incremental complexity) seems to be implemented biologically. Even if it is simple, that doesn't mean it's easy to figure out. E = m*c^2 was a bitch to discover.

> that I wonder, if it's workable, why
> one of you guys hasn't gotten impressive results from it already?

It doesn't scale because this iterative search-level algorithm is not efficient enough yet, part 1: "A general intelligence must have an indefinite range of search. Even a minor inefficiency of predictive input selection is multiplied at each search expansion, soon resulting in a combinatorial explosion of junk comparisons."
And it's not efficient enough because no one seems to care for theoretical subtleties. Intermediate simulations won't do any good if we don't know how to interpret them.

> Or do you think the issue is that it just needs much more massive scale
> to work properly?

That's a typical mindset, - just brute-force it. I think whatever advantage we can get from better hardware is insignificant compared to that of better algorithms.

Ben:
Boris,

I spend most of my time focusing on "real AGI issues", but I don't consider this list the best place to do that.

Focusing on real AGI issues is best done within some particular paradigm and approach, within a community of people who have provisionally agreed to work within that approach to see where it leads. This list is so heterogeneous in nature, that it's not really possible to pursue in-depth AGI conversations here -- because as soon as you get started discussing a set of detailed ideas meaningful within one broad approach to AGI, the discussion gets sidetracked into foundational discussions with folks who don't like that broad approach.

I tried to resolve this problem a few years ago by starting an AGI forum site, but pretty much nobody came, so I killed it after a while...

So I've found private discussions on deep AGI issues much more productive... though this list is still useful as a generic "meeting ground" for various random AGI-interested people...

Anyway, it would be a big mistake to judge the level of overall discussions btw AGI researchers in the world, based on the discussions on this list ;p

... ben g


On Sun, Dec 18, 2011 at 8:08 PM, Boris Kazachenko wrote:

Typical.
A thread is hijacked & turned into a pissmatch because no one here has an attention span to focus on real issues.
In terms of science in general, I definitely agree with Ben, - an ability to work alone is a plus, but other things are more important.
But in terms of formalizing general intelligence, it's not a plus, it's an AND. One must work alone because no one else is working.

Boris:
Ben,

I appreciate your efforts (including this list) & didn't mean to blame you for sidetracking the thread. Heck, if no one wants to talk business, why not... Like you said, this list is only useful for introducing an approach & updates thereto.

> I spend most of my time focusing on "real AGI issues", but I don't consider this list the best place to do that.

The best place to focus is one's own website... & I don't see much focus on your blog.

> So I've found private discussions on deep AGI issues much more productive...

You only get private discussions *after* you introduced people to your approach. And you restrict your audience down to nothing unless that introduction is public.

> Anyway, it would be a big mistake to judge the level of overall discussions btw AGI researchers in the world, based on the discussions on this list ;p

That's not how I judge it. I follow links & do searches on *unavoidable* keyword combinations. The level of private discussions can't be much higher than that of public introductions.

From: Ben Goertzel
Sent: Monday, December 19, 2011 10:12 AM
To: AGI
Subject: Re: [agi] Intelligence as a cognitive algorithm.


Boris wrote,

> The best place to focus is one's own website... & I don't see much focus on your blog.

OpenCog has its own website, which is not updated frequently enough, but does focus on OpenCog ;)

My personal blog is more wide-ranging, as you've noted. I spend a majority but not 100% of my time on AGI -- partly because I need to earn a living, and partly because that's just the way my mind works ... I guess we all need to strike our own balance between purposeful focus on one thing, and broad-ranging exploration...

>> So I've found private discussions on deep AGI issues much more productive...
> You only get private discussions *after* you introduced people to your approach. And you restrict your audience down to nothing unless that introduction is public.

Sure, and this list is good for those sorts of introductions...

> The level of private discussions can't be much higher than that of public introductions.


Well, private discussions can get much more in-depth both conceptually and technically.

But I guess it's true that, if you reject someone's approach based on a rough description (because it doesn't agree with your own intuition), you would probably still reject it after hearing more of the conceptual and technical details. Maybe you mean something like that...


Boris:
Ben,

> But I guess it's true that, if you reject someone's approach based on a rough description (because it doesn't agree with your own intuition),
> you would probably still reject it after hearing more of the conceptual and technical details. Maybe you mean something like that...

From OpenCog "Theory" section: "OpenCog is a diverse assemblage of cognitive algorithms, each embodying their own innovations — but what makes the overall architecture powerful is its careful adherence to the principle of cognitive synergy."

There's nothing for me to reject. You only know what's "synergetic" after experimentation, so your overall "theory" is trial & error. That took evolution >3B years on a planet-size quantum mechanical "computer".

From: Ben Goertzel
Sent: Monday, December 19, 2011 12:03 PM
To: AGI
Subject: Re: [agi] Intelligence as a cognitive algorithm.


No... in OpenCog we're trying to engineer synergy between a specific collection of cognitive processes, architected according to specific principles, and there's a lot of theory underlying each of these processes and their interactions.

There is a certain amount of trial and error involved but also a lot of specialized theory...

ben

Boris:
But no overall theory.


Ben:
Hmmm...

Well, there is a high-level overall theory underlying OpenCog, which I wrote about at length during 1993-2006 in various books, e.g. The Hidden Pattern which gives a summary of many aspects (only semi-technically)

Then there is a lot of detailed theory underlying the different cognitive processes in the OpenCog design, and their interactions

However, while this detailed theory appears to be **compatible with** the high-level theory, it's not **derived from** the high-level theory.... This is a shortcoming.

However, I prefer to accept this shortcoming, than to adopt an alternate approach whose underlying theory appears to me fundamentally conceptually inadequate (which is my current reaction to your knol, though I must temper that with the comment that it's obviously a very compacted representation of your ideas, so there may be way more to your thinking and approach than I limned from that page...). I just don't buy the idea that hierarchical pattern recognition is the whole story, or even 40% of the story, for human-level AGI...

One of my ongoing compromises is how to divide time between building theoretical bridges between the high-level and detailed theory of my approach, versus guiding the practical implementation. I enjoy the theoretical aspect more, but feel the practical work is probably more valuable at this stage...


Boris:

> However, while this detailed theory appears to be **compatible with** the high-level theory, it's not **derived from** the high-level theory.... This is a shortcoming.

Right, you can't "derive" much from that hand-waving :). You need a formal definition of a "pattern", & I have it.

> However, I prefer to accept this shortcoming, than to adopt an alternate approach whose underlying theory appears to me

> fundamentally conceptually inadequate (which is my current reaction to your knol, though I must temper that with the comment

> that it's obviously a very compacted representation of your ideas, so there may be way more to your thinking and approach than I limned from that page...).

Use your imagination :). I should have an expanded edit soon, along with moving back to Blogger.

> I just don't buy the idea that hierarchical pattern recognition is the whole story, or even 40% of the story, for human-level AGI...

I think you're confusing general intelligence with a bunch of other things that clog the human mind, as well as forgetting about modulatory / motor feedback.
 
From: Ben Goertzel
Sent: Monday, December 19, 2011 1:56 PM
To:
AGI
Subject: Re: [agi] Intelligence as a cognitive algorithm.

On Mon, Dec 19, 2011 at 1:37 PM, Boris Kazachenko wrote:

> Hmm, I had a formal definition of a "pattern" in 1990 or so, that's not the hard part ;p ...
OK, a "compressed representation" is rather obvious, I meant that compression must be defined as an incremental & selective process.

That is also rather obvious ...
And I don't mean "AIT-incremental", - a pattern must be discovered among empirical inputs. Generating predictions independently from inputs is anti-compressive when you consider both combined.  

On Mon, Dec 19, 2011 at 2:59 PM, Boris Kazachenko wrote:
> That is also rather obvious.
Show me where it's explained.


From: Ben Goertzel
Sent: Monday, December 19, 2011 3:04 PM
To: AGI
Subject: Re: [agi] Intelligence as a cognitive algorithm.



Essentially every proto-AGI architecture contains some component that does compression in an incremental and selective way, e.g. DeSTIN and MOSES certainly do... those are broad constraints that don't really say that much about how to do compression or pattern recognition....

In 1993 I wrote about the internal network of a mind as a dynamic "dual network" with linked (and co-evolving) hierarchical and heterarchical structures. The hierarchical network provides incremental pattern composition, the heterarchical network provides associational selection. Each network must be associated with appropriate learning algorithms. That was a long time ago and my AGI design is much more sophisticated now, but it's a similar principle...


Boris:

> Essentially every proto-AGI architecture contains some component that does compression in an incremental and selective way, e.g. DeSTIN and MOSES certainly do...

> those are broad constraints that don't really say that much about how to do compression or pattern recognition...

These are broad constraints for a "broadly" incremental approach. My approach is strictly incremental, - that's not a "constraint", it's a direct determinant of what is being compared (=compressed) & how. There's only one place to start: pixels, & only one way to go from there: compare them in 1D, & iterate from there. Well, it actually starts from binary inputs & digitization, but that's harder to relate to the rest of the algorithm.

Any less incremental, & you lose opportunity for intermediate selection, which leads to less efficient search & then combinatorial explosion.
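A minimal sketch of this first 1D step, assuming grayscale pixel intensities as a list (all names and the same-sign-grouping rule are illustrative, not Boris's actual algorithm): adjacent pixels are compared, runs of same-sign difference are grouped into patterns, and only patterns whose accumulated match clears a threshold are passed up - the "intermediate selection" he describes.

```python
# Illustrative sketch: strictly incremental 1D comparison over pixels.
# Adjacent pairs are compared first; runs of same-sign difference are
# grouped into patterns; only patterns whose accumulated match clears
# a threshold survive as inputs to the next (2D) level.

def scan_1d(pixels, threshold):
    patterns = []                      # each pattern: (match, length, diffs)
    match, length, diffs = 0, 0, []
    prev_sign = None
    for a, b in zip(pixels, pixels[1:]):
        d = b - a                      # derivative: difference between pixels
        m = min(a, b)                  # match: magnitude shared by both pixels
        sign = d >= 0
        if prev_sign is not None and sign != prev_sign:
            patterns.append((match, length, diffs))   # close current pattern
            match, length, diffs = 0, 0, []
        match += m
        length += 1
        diffs.append(d)
        prev_sign = sign
    if length:
        patterns.append((match, length, diffs))
    # intermediate selection: only strong-enough patterns feed the next level
    return [p for p in patterns if p[0] >= threshold]

scan_1d([1, 2, 3, 2, 1], 0)   # → [(3, 2, [1, 1]), (3, 2, [-1, -1])]
```

Dropping the threshold to zero keeps everything; raising it prunes weak patterns before the next level ever sees them, which is the point of doing selection at every increment.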

Boris:
> I think the learning/teaching approach is, to some extent, a separate issue from the system architecture and algorithms.

You can make it separate, but that would be a waste.

> ...DeSTIN and also Itamar's proprietary HDRN system are already applied in that manner...

Right, I keep hearing that. It's supposed to be top secret, but half of what I've seen is incompatible with my approach, & your difficulties understanding the latter suggest that so is much of the rest.

> I'm unconvinced that this is the best way to have one's AGI system learn.

> But, I do think one should build an AGI system **capable** of learning in such a

> manner, even if for practical expediency reasons one chooses a different sort of world-

> interfacing approach...

It is conceptually the simplest & the most fundamental way, any short-cuts should be an add-on.

> Some folks, like my friend Itamar Arel, seem to think all of abstract cognition can be gotten to emerge
> from this sort of perception / action / reinforcement focused architecture.

> According to my best understanding, you and Boris K share this general perspective

The fact that he needs additional action & reinforcement hierarchies suggests that his perceptual hierarchy is fundamentally deficient.

> I'm not so sure, I suspect other stuff may be required too...

Well, let's start from the basics, you can't avoid that anyway, right?

> I don't mean to be dismissive when I refer to "details" --- getting
> the details right is going to be critical to making a thinking machine work.

> And there may be many different ways of getting the details right...

In my approach, there's only one right way to get every detail, that's why I call it a theory.

Boris:

>> In my approach, there's only one right way to get every detail, that's
>> why I call it a theory.
>> Boris.
>
> Strange statement...
> For example, aerodynamics is a real theory (better established than
> anyone's theory of AGI), yet it admits multiple possible ways of
> creating flying machines... with rather large differences between them
> !!

All analogies are flawed, - it's a lazy way to think.

> (There may be one optimal way of creating a flying machine, given a
> certain set of well-specified constraints on the flying machine, where
> optimality is defined as minimum-energy or some such. But,
> aerodynamic theory as yet gives us no way to find this kind of optimal
> flying machine design...)
> So why should a theory of intelligence admit only one possible design
> for a thinking machine?

I didn't say possible, I said right way: directly derived from the way you
define a problem. A theory of intelligence is different because it's
supposed to be general, - context-free. At least the very core of it, which
should be a starting point anyway. Again, you can add environmentally &
application-specific adaptations later, but the core algorithm must, in
principle, be able to learn them on its own.
That's the very meaning of "general", as opposed to any empirically-specific
theory.

> I don't grok your theory of theories ;p
> ... ben g

It's a meta-theory, not a garden-variety kind :).


Boris:

Ben,

> The reference to 1D seems strange, since the physical world is generally understood as 3D,


It's 4D, but that's our physics. General intelligence must be able to operate in any-dimensional space. I start with 1D because dimensionality (as well as any other form of complexity) must be *incremental*. Search in higher dimensions adds syntactic cost, & we need to *select* inputs capable of bearing that extra cost.
> and retinas are generally approximately understood as 2D arrays ... care to clarify?


It's 2D, but that's not the first level of processing. Eye tremor makes each rod | cone *see* & interrelate a largely horizontal scan line of inputs. Interaction among these cells can be interpreted as subsequent integration of these scan lines in 2D. But, as anywhere else in biology, there's a lot of redundancy & many evolutionary artifacts in the retina, so I don't see it as a model.
> Are you just saying that the same algorithm would apply to 1D retinas as to 2D retinas, so you want to test it in the simpler 1D case first?


Every level is a simpler "test" before the next level, but primarily for inputs rather than algorithms. In my model, 1D (horizontal scan line) search generates 1D patterns, which are selected & then compared in 2D (vertically) on a higher level, forming 2D patterns. The comparison algorithm is largely the same, but 1D patterns have multiple additional variables (length & derivatives). Each variable is compared independently, & the results are then integrated within a pattern. And so on in higher dimensions: each level selectively generates the next from its results. Well, it actually starts from binary inputs & digitization thereof, but it's harder to see how this relates to the rest of the algorithm. Colors & so on, as well as spatial dimensions & hardware details, are not part of a core algorithm, - those are sensor / empirically specific & learnable, though you can add short-cuts manually. We need to understand the core algorithm *before* we can develop useful add-ons.
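The second-level step above - comparing 1D patterns variable by variable, then integrating the per-variable matches - can be sketched roughly as follows (the variable names and the use of `min` as match are illustrative assumptions, not Boris's actual definitions):

```python
# Illustrative: second-level comparison of 1D patterns from vertically
# adjacent scan lines. Each variable of a pattern (e.g. summed intensity,
# length, summed difference) is compared independently; the per-variable
# matches are then integrated into one match for the candidate 2D pattern.

def compare_patterns(p, q):
    # p, q: dicts of variables from two vertically adjacent 1D patterns
    per_var_match = {k: min(p[k], q[k]) for k in p}   # match per variable
    per_var_diff = {k: q[k] - p[k] for k in p}        # new (2nd-level) derivatives
    total_match = sum(per_var_match.values())         # integration within pattern
    return total_match, per_var_diff

p = {"intensity": 10, "length": 3, "diff": 2}
q = {"intensity": 8,  "length": 4, "diff": 1}
m, d = compare_patterns(p, q)   # m == 12; d records inter-pattern derivatives
```

The structural point is that the higher level reuses the same compare-and-integrate operation, but over a richer syntax: the derivatives produced at level 1 become ordinary input variables at level 2.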

> Boris: Any less incremental, & you lose opportunity for intermediate selection, which leads to less efficient search.

Ben G: Of course the cost of doing the selection intelligently must always be balanced against the cost of having more possibilities survive the selection.... But indeed it's important to have the potential for selection at all levels in the perception processing hierarchy, as needed...


Boris: Selection is what intelligence is all about.

Ben G: As a couple of examples of things that confused me in your knol ...:
"
This may seem similar to Levin Search, but the latter selects among randomly generated algorithms (of incremental complexity) that happen to solve a problem | compress a bitstring. My approach, on the other hand, is to search for patterns within environmental input flow. Hard distinction between input patterns & algorithms exists only for special-purpose programs.
"
Ben G: but it's not clear to me from the preceding paragraphs how your proposed system can recognize arbitrary computable patterns (as Levin search obviously can),

Boris:

I don’t like “arbitrary”, but if a given location is projected to be important enough (per “hierarchical feedback” part), all its outputs are elevated losslessly & eventually compared in all possible combinations.

Ben G: or if so what representation language it uses.

Boris: There’s no fixed “language”, the algorithm generates incrementally complex syntax on every level of generalization. I described first steps of this process in “syntactic expansion” part.

Ben G: My immediate impression is that your method would be limited to primitive recursive functions (which can be built up via composition from elementary functions), but the description isn't detailed enough for me to tell.

Boris:
It doesn’t need to be detailed, - selective (pruned) recursion & combinatorics are the only method we have for generating functions of any complexity. But that’s math, empirical pattern discovery comes way before that.

"
On the next level of search, the derivatives are also selectively compared between patterns. This generates secondary derivatives over discontinuity, &|or over different types of coordinates. Such syntactic expansion is pruned by selected representation of variable types at a given resolution of position & magnitude, with partially redundant aggregation at a lower resolution.
Beyond that, a higher-order syntax is formed by comparisons across current syntax, analogous to, but far more complex than comparison across initial coordinates.
"
Ben G: but I don't understand what is your method for choosing which expressions in this "higher order syntax" to evaluate against sensory data (and the lower-level patterns computed therein)

Boris:
All the patterns are compared to all the lower-level outputs within a given range of search. Selection (evaluation for elevation) of potential outputs is done at a lower level, according to projected match of the former. Projected match (predictive correspondence) is the quantitative criterion of intelligence that I was talking about. It is computed by adjusting accumulated match of a pattern (defined in part 2) by hierarchical feedback (described in part 4), – average match, redundancy, contrast, expected match...
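A minimal sketch of this elevation criterion, assuming scalar stand-ins for the feedback terms (the names `avg_match` and `redundancy`, and the specific adjustment formula, are illustrative assumptions, not Boris's actual definitions):

```python
# Illustrative sketch of "projected match": a pattern's accumulated match,
# adjusted by hierarchical feedback before deciding whether to elevate it.
# avg_match and redundancy stand in for the fed-back terms listed above
# (average match, redundancy, contrast, expected match).

def projected_match(accumulated_match, avg_match, redundancy):
    # surplus over the fed-back average, discounted for redundancy
    return (accumulated_match - avg_match) * (1.0 - redundancy)

def select_for_elevation(matches, avg_match, redundancy):
    # elevate only patterns whose projected match is positive
    return [m for m in matches
            if projected_match(m, avg_match, redundancy) > 0]

select_for_elevation([5, 2, 9], avg_match=4, redundancy=0.5)   # → [5, 9]
```

The feedback terms come from above, so the lower level never needs a global view: it just compares each pattern's accumulated match against what the higher level reports as average.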

Ben G:
I don't really understand how you would handle
-- generating composite actions
-- episodic memory
-- assignment of credit thru the hierarchy once reinforcement is received
-- lots and lots of other stuff

Boris:
Neither I nor anyone else can explain higher levels explicitly, they build on a gazillion lower-level choices. What I have is general principles that guide this process, derived from my definition of intelligence. This definition is the highest-level (meta) generalization one can make. It can't be "proven" without a life-time worth of examples, one simply has to work up to it through introspection. People get increasingly sloppy with elevation, hence the ludicrous mess that passes for philosophy & AGI.

Ben G: I sort of understand how you want to do perceptual pattern recognition, but not really how you want to leverage perceptual pattern recognition for control of an embodied agent doing stuff in a world over an extended time period...

Boris: Action is an adjustment of coordinates for sensors & actuators (they’re always combined), which is a direct extension of downward feedback within a representational hierarchy.

Ben G: I would be curious if other list members find the knol more transparent than I do...
Boris: They probably don’t, it’s something you need to work on, to the exclusion of everything else.
Sep 21, 2011 10:35 AM

Ben G: About predictive accuracy as an intelligence measure.... What matters is if the system is good at predicting which sequences of its action will lead to achievement of its goals in the contexts relevant to its life. This is different than, though related to, general predictive capability.

Boris: For a purely cognitive system the only goal is maximizing its predictive correspondence (accuracy * range). That goes for both internal information processing & external action (see the end of part 4). We can adopt it for our goals by pre-selecting inputs.

Ben G: "Hard distinction between input patterns & algorithms exists only for special-purpose programs. "
Hmmm, well clearly in the brain there's a distinction between its input patterns and the algorithms implicit in its wiring, no?

Boris: I think you're talking about a distinction between innate & acquired wiring patterns. Yes, some level of algorithms must be built-in to initiate learning at an acceptable speed, but the cut-off is not a qualitative distinction. The cognitive algorithm can then be refined & extended indefinitely. All of our math is such an extension.

Ben G: "... if a given location is projected to be important enough (per “hierarchical feedback” part), all its outputs are elevated losslessly & eventually compared in all possible combinations. "
How can you take all possible combinations, in reality?

Boris: You can't, not if you include combinations of subsequent derivatives, I was talking "in the limit".

Ben G: Don't you need to prune the space of possible combinations? How is this done?

Boris: I already described evaluation by projected match for empirical inputs & patterns. In pure math (on much higher levels), the criterion is reduction (as in equations), which is an equivalent of match. You prune the expressions with a below-average results-per-operations ratio.
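That pruning rule - keep only expressions whose results-per-operations ratio is at least the running average - can be sketched as follows (a hypothetical illustration; the `(result, op_count)` encoding is an assumption):

```python
# Illustrative: prune candidate expressions whose results-per-operation
# ratio falls below the average ratio of the current candidate pool.

def prune(candidates):
    # candidates: list of (result_value, op_count) pairs
    ratios = [r / ops for r, ops in candidates]
    avg = sum(ratios) / len(ratios)
    return [c for c, rt in zip(candidates, ratios) if rt >= avg]

prune([(10, 2), (1, 5), (6, 3)])   # ratios 5.0, 0.2, 2.0; avg 2.4 → [(10, 2)]
```

Because the cutoff is the pool's own average rather than a fixed constant, the criterion stays meaningful as the candidate expressions grow more complex at higher levels.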

Ben G: Are the following simple points correct?

-- you're building a hierarchical pattern recognition network,
Boris: Obviously.

-- it's substantially aligned with the spatiotemporal structure of the perceived world
Boris: Initially, but a comparison sequence can be re-ordered on higher levels according to more compressive "coordinates".

-- higher levels of the network embody more abstract patterns, combining the outputs of the lower levels
Boris: Yes, but "network" implies lateral data transfers, while in my model the primary data transfer is vertical: across levels.

-- perception and action are carried out in the same network, so that action control and planning are part of the same process as top-down perceptual feedback
Boris: Right.

-- the system's goal is accurate prediction,
Boris: More like a "predicted... prediction", actual confirmation is not always necessary.

-- and somehow those patterns that lead to accurate predictions are going to be rewarded and have more likelihood of surviving and being used again
Boris: They're selected as inputs to higher levels, which have a longer search cycle, thus slower content "recycling".

Ben G: I don't think I'll be able to fully grok your design and algorithms in detail without putting us both through more QA than we want to do,
Boris: No problem. BTW, I may post (a constructive part of) this discussion as comment on the knol, do you mind? Might save me a few questions in the future.
> Boris: For a purely cognitive system the only goal is maximizing its
> predictive correspondence (accuracy * range).

Ben G: Hmmmm.. in that case I don't think "purely cognitive systems" are
going to be very useful in reality. Given limited compute resources,
they will be massively outperformed by systems that are oriented
toward maximizing *useful* predictive correspondence...

Boris: You don't know what's "useful" till you have a big chunk of
"correspondence" in the first place. And you can't define "useful" in general terms anyway,
so this is just another one of your excuses for lazy thinking.

> Boris: I already described evaluation by projected match for empirical
> inputs & patterns. In pure math (on much higher levels), the criterion is
> reduction (as in equations), which is an equivalent of match. You prune
> the expressions with below-average results-per-operations ratio.

Ben G: Yeah, but there are very many expressions due to combinatorial
explosion -- you can't just produce them all then prune the bad ones.
Is your approach to incrementally build up complex expressions
compositionally from simpler ones?

Boris: I think I have "incremental" in just about every paragraph in the knol, starting from the 1st. You seem to have too many things on your mind to keep track of this discussion.

Ben G: If so, why don't you run into the same problems as greedy learning
systems? Presumably because the
top-down feedback from the existing complex expressions guides the
formation of new complex ones from simpler components, I suppose. But
this is a key point and it's not very clear to me how your system does
it...

Boris: Maybe you can point out what part of the process I already described
is not clear to you.
> Boris: You don't know what's "useful" till you have a big chunk of
> "correspondence" in the first place. And you can't define "useful" anyway,

Ben G: It seems a baby learns pretty quickly that getting milk from the tit
is useful, whereas the specific pattern of wrinkles on its blanket is
irrelevant -- and this learning then helps focus its ongoing learning
activity (which then leads to further focusing, etc.)

Boris: I am working on a scalable intelligence, not just another animal with primitive drives.

Ben G:
> so this is just another of your excuses for lazy thinking.

I wonder what purpose you think is served by throwing insults like
that into a conversation?
I hope at least you find it entertaining; to me it's rather dull ;p

Boris: I have nothing to lose, - either I shock you into actually working, or save
myself a distraction. Guess it's the latter.

Ben G: Boris,

I'm not shocked by being insulted, not even nontrivially annoyed --
it's par for the course on unmoderated email lists.

And I **am** already working on AGI, inasmuch as my personal economic
situation permits (meaning, around 50% of my working time, i.e. about
30-35 hours/week).... I happen not to be working on it according to
the precise approach you prefer, however...

> Boris: Maybe you can point out what part of the process I already described
> is not clear to you.

Ben G: Too many parts are unclear to me, and we're both busy, so I guess we
should drop off the conversation here.

As I said, I'll be eager to read a detailed description of your ideas
if/when you choose to publish one. The knol is evocative but has a
high density of obscurities as compared to, say, a typical research
paper.
