5/14/13

Cognitive Algorithm: a step of hierarchical generalization


I define intelligence or cognitive ability as a capacity for prediction & planning. Technically, planning is a prediction that includes oneself. And the only way to predict is by discovering & projecting input patterns. Pattern is a set of matching inputs, or lower-level patterns. Match & miss are produced by comparisons among inputs (part 2). I see no meaningful alternative to these definitions, & my approach is to uniquely derive from them specific operations on the input flow. Hence my use of a singular in “cognitive algorithm“.

The above, & this intro in general, seems to be impossibly abstract for most readers. I believe, however, that such dry & sparse presentation is essential to develop a mindset suitable for formalizing general intelligence. “Real life” examples are always conceptually dirty, & usually distract more than explain (let me know of any good exceptions). For an excellent more contextual introduction, see
“On Intelligence” by Jeff Hawkins or, not as good, “How to Create a Mind“ by Ray Kurzweil. But, neither is nearly coherent enough for me.

Also, I try to avoid specialized terminology, such as used in Machine Learning or even Cognitive Science.
My subject is the most general of all, & so are my terms. Which is a major confounder, because people crave specific context, but… generalization is decontextualization. I am actually pretty good at explaining normal things, but cognitive algorithm must be a generalization of our entire experience. That can’t be done in a brief intro, I assume deeply analytic introspection as a prerequisite. So, back to my definitions & derivations:

Autonomous cognition must start by cross-comparing quantized sensory input flow within a search range (initial inputs must be sensory because symbolic data could only be produced by prior cognitive process). Bottom-up approach is also the easiest to implement: any representations beyond sensory inputs, such as natural language terms & expressions, are implicitly encoded. Thus, they must be decoded before being cross-compared to form patterns, & the difficulty of decoding is exponential with the level of encoding.

To discover & predict anything complex at a feasible cost, both search (multiple comparisons) & resulting patterns themselves must be hierarchical. That is, only selected patterns search over higher orders of distance, & only selected levels of differentiation (past partial misses) are compared for each pattern.
Hence, cross-comparison discovers initial patterns, which are then selectively forwarded to the next level of search. Recursive input comparison & output selection on successive levels of search discovers increasingly general & complex patterns, initially “physical” & then conceptual. Higher levels also generate downward feedback, ultimately a motor action, to focus on lower-level sources with above-average predictive value.

Hierarchical approaches are very common in Machine Learning, & most do some sort of pattern recognition. But, none that I know of implements a strictly incremental growth in scope & complexity of discoverable patterns. This is critical for efficient scalability: each increment allows for predictive input selection, without which there is a combinatorial explosion in potential search space.
That space is the number of patterns, * number of variables within each. Number of variables per pattern is incremental, starting from single-variable inputs, such as pixels in vision. New variables added per search cycle are cumulative match & miss for each compared variable of an input pattern.

So, number of variables multiplies at each level, new variables are aggregated within search span to a critical value, then evaluated to select their pattern for extended search. Such incremental syntax also extends the algorithm of comparison: each variable of selected input pattern is evaluated for comparison vs. aggregation. Basic comparison is by subtraction, aggregation is by summation, & evaluation is by higher-level feedback values of selection criteria, which I derive from the above definition of intelligence. This feedback can be incrementally deep & complex, & I am currently working on a mechanism that could scale in both ways.

Levels of search are grouped into super-levels, defined by incremental
order of distance among comparands: - Comparison of adjacent inputs, forming continuous patterns of incremental dimensionality (distance = 1): line segments: 1D, blobs: 2D, objects: 3D, & processes: 4D. This is similar to connected-component analysis.
- Cross-comparison across a whole queue of inputs, forming discontinuous patterns per input.(distance->n). These patterns are
fuzzy, & their overlap is compensated by selection among inputs to a next level.
- Comparison across a hierarchy of short-cuts to higher-level queues, generated by feedback (distance -> nn).

In conventional terms, my approach is a form of
deep learning. But, current DL is based on ANNs, which learn probabilistically via some version of Hebbian “fire together, wire together” coincidence reinforcement. That is, weighted inputs are summed into output, which is then combined with input per synapse to adjust weight of the latter. It’s a very crude form of comparison: match is determined only for combination of all inputs, & no derivatives or coordinates of individual inputs are recorded. Such coarse recognition results in ridiculously inefficient learning. Another label that fits my approach is hierarchical clustering, but, as with ANNs, no other version proceeds with incremental dimensionality, or generates incremental syntax.


Contents:


1. Cognition vs evolution, incremental complexity vs symbolic data.
2. Comparison: quantifying match & miss per input.
3. Search: incremental range & derivation of comparisons.
4. Cartesian dimensions & sensory modalities.
5. Patterns: syntactic expansion & re-integration of inputs.
6. Feedback: selection by higher-level aggregate representations.
7. Hierarchy: selection by feedback of multi-level shortcuts.
8. Notes on clustering, ANNs, the brain, & probabilistic inference.
9. Notes on a working mindset & a prize for ideas.


1. Cognition vs evolution, incremental complexity vs symbolic data.


We know one mechanism that produced a human-level intelligence: biological evolution. Algorithmically simple, evolution alters heritable traits at random & selects those with above-average reproductive fitness. But this process is horribly inefficient because selection is coarse: on the level of a whole genome rather than individual traits, & also because intelligence is only one of many factors for reproductive fitness. And random variation in evolutionary algorithms, as well as in “generative” deep learning, is antithetical to intelligence.

Intelligent variation must be directional: range & complexity of search should increase (decrease) if current-level predictive benefits exceed the costs of search. Hence, variation in cognitive algorithm is driven by data: environmental stimuli, incrementally generalized into patterns, & then into algorithms. These categories differ only in the degree of discovered generality.
Second element of genetic algorithms is selection: conditional reproduction of variations. My analogue of reproduction is simply extended search span: the number of additional comparisons for selected input. And the criteria for selection are directly derived from the very definition of intelligence.

A popular attitude in AI is that intelligence can be recognized but not defined. That seems absurd to me: recognition *is* a match between observation & definition. Some researchers agree with my definition, but none that I know of apply it as a criterion for consistently bottom-up pattern discovery process. I believe this lack of theoretical integrity is a main reason for the failure to scale in all AI/ AGI attempts to date.
Intelligence is usually defined in terms of solving problems, but the only general “problem” is efficiency of a search for solutions. That efficiency is predictive selection of comparands, vs. brute-force all-to-all search. And the only way to predict is by projecting patterns, which gets us back to my definition of intelligence:

If intelligence is ability to predict, then cognitive fitness function is predictive correspondence of recorded inputs (memory): their cumulative match to future inputs. Of course, future correspondence can only be estimated by criteria correlated with it in the past. The most basic of such criteria is past cumulative match itself, & more complex multi-variable prediction criteria can be discovered on higher levels of search. These criteria are sent as feedback to filter inputs on lower levels, “self-improving” hierarchical selection process.

Such selection is a must for scalable search, or roughly constant match discovered per cost of comparison:
Given a non-random environment, average match will decline with the distance between comparands. So, search expansion can scale only when better-than-average inputs are selected. Any lag & inaccuracy of such selection is multiplied at each expansion, soon resulting in a combinatorial explosion of junk comparisons.
To avoid this, my approach is strictly incremental: search must start with minimal-complexity inputs, & expand by minimal-complexity increments in scope & syntax for selected outputs. There is always only one best next step: a search complexity increment expected to discover the greatest additive predictive value.

Most approaches to AGI are loosely incremental, but coarse & arbitrary increments miss the opportunity for intermediate selection, exponentially decreasing efficiency of search. I guess people who aim at human-level intelligence are impatient with small increments & dumb sensory data: both make them feel too distant from their goal. Yet, this is the most theoretical problem ever, calling for the longest delay in gratification.

The most obvious example of this impatience is the fact that most of current AI projects are designed to use linguistic / symbolic data as initial inputs. To me, this betrays profound misunderstanding of our cognitive process. Even children, predisposed to learn language, only become fluent after years of directly observing things their parents talk about. All our knowledge is ultimately derived from senses: the only source of presumably context-free inputs. But our sensory processing is unconscious, only generalized concepts make it into human
declarative memory, where they are assigned symbols (words) to facilitate communication.

Symbolic data is by definition contextual: encoded with distant cross-references that are hardly ever explicit outside of a brain. Incremental (efficient) search should exploit proximity first, but this won’t be productive for such contextually compressed data: it looks random unless you know the code. And that code is the way we subconsciously generalize pixels into objects into concepts, which is what I am trying to formalize here.
NL text does contain more valuable information, but sensory data is exponentially more suitable for local & lossy (hence selective & scalable) general cognitive algorithm. So, I believe our conscious symbolic bias, especially strong in artificial intelligentsia, sacrifices scalability for impressive short-term results.

Hence a dire scarcity of references here. Everything ever written is related to my subject, but… generalization is a reduction. And cognitive algorithm is a meta-generalization, demanding utterly ruthless reduction.
My approach is presented strictly bottom-up (in parts 2 - 7), thus can be understood without references.
However, this requires a clean context, - hopefully cleaned out by reader‘s own introspective generalization.
Other (insufficiently) related approaches are addressed in part 8.
I also have a far more advanced work-in-progress, but will need a meaningful feedback to elaborate.


2. Comparison: quantifying match & miss per input.


The purpose of cognition is to predict, & prediction must be quantified. Algorithmic information theory defines predictability or order as compressibility of representations, which is perfectly fine. However, AIT, as well as all statistical / probabilistic models, quantifies compression only for whole sequences of inputs.
To enable far more incremental selection (& correspondingly more scalable search), I start by quantifying match in comparisons between individual inputs. Partial match adds a micro-dimension to binary same | different analysis of statistical inference. This is analogous to the way statistical inference improved on classical logic by quantifying partial probability, vs. binary true | false values of statements.

In a non-random environment, initial & the most basic predictions are simply past inputs, & predictive value is their cumulative match to prior & subsequent inputs within a range of search.
Individual match is compression of magnitude by replacing larger comparand with its difference relative to a smaller comparand. In other words, match is a complementary of a miss, & is initially equal to the smaller comparand. The ultimate criterion is recorded magnitude, rather than record space (number of bits) it occupies after compression, because the former represents physical impact that we want to predict.

This definition is tautological: smaller comparand = sum of Boolean AND between uncompressed (unary code) representations of both comparands, = partial identity of these comparands. Some may object that identity also includes the case when both comparands or bits thereof equal zero, but that identity also equals zero. Again, the purpose here is prediction, which is a representational equivalent of conservation in physics. We’re predicting some potential impact on the observer, represented by an input. Zero input ultimately means zero impact, which has no conservable physical value (inertia), thus no intrinsic predictive value.

Given incremental complexity of representation, initial inputs should have binary resolution. However, average binary match won’t justify the cost of comparison: syntactic overhead of representing new match & miss between positionally distinct inputs. Rather, these binary inputs are compressed by digitization within a position (coordinate): substitution of every two lower-order bits with one higher-order bit within an integer. Resolution of that coordinate (input aggregation span) is adjusted to form integers sufficiently large to produce (when compared) average match that exceeds above-mentioned costs of comparison. These are “opportunity costs“: a longer-range average match discovered by equivalent computational resources.

So, the next order of compression is comparison across coordinates, initially defined with binary resolution as before | after input. Any comparison is an inverse arithmetic operation of incremental power: Boolean AND, subtraction, division, logarithm, & so on. Actually, since digitization already compressed inputs by AND, comparison of that power won’t further compress resulting integers. In general, match is *additive* compression, achieved only by comparison of a higher power than that which produced the comparands. Thus, initial comparison between integers is by subtraction, which compresses miss from !AND to difference by cancelling opposite-sign bits, & increases match because it’s a complimentary of that reduced difference.

Division will further reduce magnitude of miss by converting it from difference to ratio, which can then be reduced again by converting it to logarithm, & so on. By reducing miss, higher power of comparison will also increase complimentary match. But the costs may grow even faster, for both operations & incremental syntax to record incidental sign, fraction, irrational fraction. The power of comparison is increased if current-power match plus miss predict an improvement, as indicated by higher-order comparison between the results from different powers of comparison. Such “meta-comparison” can discover algorithms, or meta-patterns.


3. Search: incremental range & derivation of comparisons.


Each level of search in my model is a queue of templates: Q(t), compared to a sequence of inputs. Assuming that environment is not random, average match will decline with the distance. To maintain proximity of templates to inputs, the queue is FIFO: each input is compared to all templates in the queue, then added to the queue as a new template, while the oldest template is outputted & evaluated as an input to the next level.

Inputs to the first level are integers: pixels of 1D scan line across an image, or equivalents from other senses.
Comparison of each input-template pixel pair forms two new variables: scalar match (m) & signed miss (d). Both are then buffered into Q(m) & Q(d), as well as aggregated across the search span of Q(t) into M & D. Differences are also aggregated within lengths of their contiguous sign: L’s, which are separately buffered into Q(L,D). So, full syntax of a 1st level template is: i (original input), M, Q(m), D, Q( L,D ( Q(d) ).

Outputted template is evaluated for search on a higher-level queue, which must be increasingly selective to avoid combinatorial explosion. Evaluation is a comparison of selection criteria in the output to those in aggregate representation of a target queue, the latter formed by adding Q’ inputs & subtracting the outputs. The most basic criterion is match, so it is compared to higher queue’ average match to form relative match: Q(m) >> m; r = m - am. Just like differences, r’s are aggregated within contiguous-sign L‘s, forming Q(L,R). Positive L,R indicates contiguous pattern: 1D segment, 2D blob, 3D object, or 4D temporal persistence.

If these contiguous patterns are long enough (another selection criterion), discrete d’s within corresponding span of Q(d) are also cross-compared to form second derivatives: Q(m) >> m; r = m - am; Q(d) >> d;
while (r_sign = 1){ md = min(d, td); MD+=md; dd = d-td; while (td_sign = d_sign){ L++; DD+=dd;} td=d;}
This is similar to cross-comparing original inputs / templates within Q(t), but md’s & dd’s are not buffered, - there’s no option to cross-compare them. That’s because even evaluation for cross-comparison incurs costs that are no longer justified by lesser average magnitude (thus potential match) of second derivatives.

When evaluation of Q(m) & partial cross-comparison of Q(d) is complete, aggregate match is incremented by aggregate match across differences to form i+d aggregate match: idM = M + MD, & compared to average idM of a higher queue: r_idM = idM - a_idM. If r_idM is positive, individual L,D’s & L,R’s are also cross-compared within, correspondingly, Q(L,D) & Q(L,R), forming mL’s -> ML & dL’s -> DL for each.
D_ML & R_ML then contribute to total aggregate match of the output: TM = idM + D_ML + R_ML;

Three paragraphs above cover internal consolidation of an output: pre-evaluation & cross-comparison within queues of higher derivatives. Subsequent evaluation of consolidated output for higher-level search must be adjusted for cumulative overlap / partial redundancy of derived variables: M, Q(m), D, Q( L( Q(d)), & so on, to corresponding variables in previously selected cross-compared inputs.
So, final output evaluation is rTM = TM - aTM * overlap. Proximate templates have greater overlap, so adjusting for it “sparsifys” higher queue’ inputs: outputs with positive unique relative total match.
Increasingly complex selection criteria are covered in part 6: Feedback, & part 7: Hierarchy.

Not-selected outputs are aggregated within a minimal span: between selected outputs, or a maximal span, which increases aggregated magnitude to a value that forms an average match in a higher-level comparison. Match is limited by (& crudely correlates with) magnitude of comparands. Thus, a higher queue will consist of selected discrete templates, interlaced with aggregated templates. Even though the magnitude of the later is increased, any higher-resolution matches within aggregation span are lost, & variable span makes cross-comparison more difficult. Partial aggregation extends coordinate span for a queue of a fixed length.

Not-selected outputs can also be stored in longer-term buffers, to delay the loss of detail. Such buffers are implemented in slower & cheaper media (tape vs. RAM), with possible compression by non-selective transforms, & multiple stages for parallel access. The buffers are accessed only if higher-level patterns that represent buffered templates at a lower resolution become strong enough to justify increased resolution, or if new inputs get relatively “close” to the buffered templates again.

Along with declining match, continuous search span (number of templates in a queue) is also limited by exponential increase in redundancy. Basically, number_of_comparisons = number_of_comparands ^ 2. Match & miss per comparison are represented as discrete &| aggregate variables in both comparands, which is a major overlap. Such overlap is a cost-multiplier, so the length of a queue is the distance at which average difference between these costs & projected match per comparison exceeds the cost of individually evaluating such difference. This may extend across the queues: derivatives of selected outputs partly overlap with those of cross-compared templates still in the lower queue, which adds to the cost of the latter in future evaluation.


4. Cartesian dimensions & sensory modalities.


For those with ANN background, I want to stress that a level of search in my approach is one-dimensional queue of inputs / templates, not a layer of nodes (more in part 8). Even when ANN layer is 1D, it’s still orthogonal to the direction of inputs, so the hierarchy is at least 2D. These inputs are combined within a node regardless of laminar distance (a proto-coordinate) between source & target nodes. Coordinates are essential because the value of prediction = precision of what * precision of where. Since coordinates & co-derived differences are not represented in ANNs as explicit variables, they can't be compared to calculate Euclidian vectors. And without such vectors any prediction (thus selection) must remain extremely crude.

The direction of inputs to my queues is in the same dimension as the queue itself, which means that my core algorithm is by default 1D. A search hierarchy of 1D queues is the most incremental way to expand search: we can add | extend only one coordinate at a time. This allows algorithm to select inputs that are predictive enough to justify additional cost of representing that coordinate & corresponding derivatives. Again, such incremental syntax expansion is my core principle because it enables selective (thus scalable) search.

A common objection is that images are “naturally” 2D, & our space-time is 4D. Of course, these empirical facts are practically universal in our environment, but a core cognitive algorithm must not be empirically specific. Additional dimensions of proximity can be specified as add-ons to the core algorithm, or discovered as a general periodicity in the input flow. Basically, the distances between frequent input matches are compared, match between these distances indicates a period (lower dimension), & multiple periods make up higher dimension / coordinate.
Initial coordinate resolution is binary: before | after sign. Integer coordinates are only needed to represent variable aggregation span, or discontinuity between inputs, produced by selective pruning on higher levels.

So, while internally represented dimensionality of patterns is increasing on higher levels, their external order remains 1D sequence of inputs. Also, dimensions are added in the order of decreasing rate of change therein. This means spatial dimensions are scanned first, - the rate of change there can be sped-up, while comparison over purely temporal sequence is delayed until accumulated variation within it justifies search for additional patterns. Actually, temporal sequence is the original dimension, but it is mapped onto incremental spatial dimensions until spatial continuum is exhausted: 0D ) 1D ) 2D ) 3D ) TD.

Even higher levels should process search & integration across sensory modalities, & then over increasing spatio-temporal distance between multi-modal & multi-dimensional patterns. Besides S-T dimensions, search can be re-ordered by any derived variable with above-average correlation between its match & match of all other variables, - which is a defining property of any useful coordinate. This match of a match is calculated by comparison of new matches across input variables within a pattern, then aggregated for same-type variables between patterns, & finally compared to higher-level average costs of reordering all inputs.

Sequential increase of dimensionality can also be traced in most biological senses. In vision, original stimuli are aggregated into 0D “pixels” of brightness, represented by each rod or cone cell in the retina. Output here is a spike train produced by accumulate- &- fire, which in functional terms performs digitization. All of these cells "scan" in the same direction of eye tremor or a saccade. That means they receive (& likely compare) a 1D queue of inputs within their soma & axon. 2D integration would be across 1D outputs of sensor cells in the ganglion layer of retina, then in LGN & primary visual cortex. 3D is formed by integration across ocular dominance columns in V1|V2, colors are integrated in V4, temporal sequences probably in V5 & beyond.


5. Patterns: syntactic expansion & re-integration of inputs.


A pattern is a compressed representation of multiple inputs, the same concept as centroid of a
fuzzy cluster in terms of machine learning. A measure of compression in my model is cumulative match, &, to a lesser degree, cumulative difference of a given input. All three: i, m, d, indirectly represent cross-compared inputs & make up a basic layer of syntax in a multi-variable pattern. Technically, all cross-compared inputs form a pattern, but only those with an above-average compression are forwarded to higher levels for extended search.

More exclusive type of a pattern is a recurrent match itself: a subset of each input that is common across a set of inputs. This is actually a higher-derivation pattern, definable as above-average match of a match. Just like above-average match selects an input for a higher-level search, this higher-order match selects a common subset for a higher integration vs. differentiation level within a pattern itself. That subset also has a priority for extended search. The most basic hierarchical sub-differentiation within a pattern is by match of a binary sign in relative match, forming continuous segments of above | below average match across input queue.

Selective elevation increases both predictive value & potential syntactic complexity of patterns: the number of different variables within it. That’s because comparison of same-type variables between patterns repeats the process I described in part 3 for single-variable patterns, but this time for each variable within a pattern.
New types of derivatives also form in comparisons across higher spatio-temporal or derived coordinates.

Such syntactic expansion is pruned by selective comparison vs. aggregation of individual variable types within input patterns, over each coordinate type or resolution. As with templates, minimal aggregation span is resolution of individual inputs, & maximal span is determined by average magnitude (thus match) of new derivatives on a higher level. Hence, a basic comparison cycle generates queues of interlaced individual & aggregated derivatives at each template variable, & conditional higher derivatives on each of the former. Combined with the distances, derivatives can form “motor” feedback vectors that select sources (location & resolution of future inputs) projected to maximize predictive correspondence of a target.

Syntax is identification of different variable types by their position (syntactic coordinate) within a pattern. This is analogous to recognizing parts of speech by their position within a sentence. Search algorithm must be guided by syntax to compare or aggregate same-type variables between input & template patterns. In addition to variable types that represent comparisons & evaluations that formed them (as above & in part 3), syntax also includes empirically-specific variable types that indicate different sensory modalities & incrementally higher spatio-temporal dimensions (see part 4: “Dimensions & modalities”).

Sufficiently complex syntax will justify comparing variables across “syntactic“ coordinates within a pattern, analogous to comparison across external coordinates. In fact, that’s what higher-power comparisons do. For example, division is an iterative comparison between difference & match, - across derivation order & within a pattern. Beyond syntactically continuous comparisons, there could be comparisons across syntactic gaps. For example, comparison between lengths of different dimensions within a multi-D pattern will form potentially recurrent dimensional proportions.
Internal comparisons can further compress a pattern, but at the cost of adding a higher-order syntax, which means that they must be increasingly selective. This selection will increase “discontinuity” over syntactic coordinates: operations necessary to convert the variables before comparison.

At some point, these operators will become “large” enough to merit direct comparison / search among them. This will produce purely algebraic equations, where the match (compression) is a reduction in the number of operations needed to produce a result.
The first such short-cut is probably some version of Pythagorean theorem, to be discovered during search in 2D as a way compute cosines. Cosines are necessary to normalize lengths: L(m) & L(d), initially in 1D, to a value they would have in an orthogonal “subjective” angle of 1D scan lines. Basically, 2D-adjacent 1D Ls will be compared by division, & over 1D distance & derivatives: an angle. This comparison will discover relatively constant ratio (cosine) between the ratio of 1D Ls & a 2nd derivative of 1D distance.

Such normalization for a POV angle is similar to
dimensionality reduction in Machine Learning, but is much more efficient because it is secondary to selective dimensionality expansion. And “reduction” is misleading, - dimensionality here is prioritized rather than reduced. That is, the dimension of a main axis (pattern|cluster - specific) is maximized, & dimensions sequentially orthogonal to higher axes are correspondingly minimized. The process of discovering these axes is so basic that it might be genetically encoded in animals.

The patterns I described here are not qualitatively different from our semantic concepts, which are either generalized empirical patterns: objects & processes, or are strictly relational. There is no other meaningful way to define a “concept“, all input data is either selected / remembered as patterns, or filtered out as noise.
Given sufficient computational resources & discoverable mathematical shortcuts, search over incrementally complex syntax will discover patterns / concepts on & beyond the level of natural language.


6. Feedback: selection by higher-level aggregate representations.


This part will expand on the principles of selecting inputs for the next level of search by feedback. The most basic feedback is aggregated representation of selection criteria in feedforward inputs to the next level queue. Whole-queue representation may also contain averages of other input variables, for preliminary comparison, but that will have lower value than selection of an input as a whole.
As explained above, the outputs are initially selected for extended search by their past cumulative match, minus average higher-level match multiplied by the redundancy of output’s prior representation there.

This will select for generality, or invariance in terms of Jeff Hawkins (in
“On Intelligence"). He also suggests selection for novelty, but that would be complementary/ mutually exclusive with generality. The scope of discovered generality is limited by the span of experience searched by an input, & the longer it searches (especially over the following inputs) the less “novel” it becomes. Any pattern is defined by some sort of repetition, so prioritizing novelty per se would simply select for random noise. On the other hand, selection for both invariance (match) & novelty (miss) is no selection at all: they exhaust all possibilities.

Generality must be the ultimate criterion, but incrementally more abstract types of correspondence (which defines generality) are “novel” relative to the lower ones. The most basic form of such apparent novelty-seeking, that actually increases predictive power, is simply a preference for more recent inputs.
Recent inputs are relatively more predictive than the older ones because of their temporal proximity to future inputs. Thus, proximity should determine the span of search within a level of generality. But it can’t select for hierarchical elevation: search expansion increases the distance among comparands. I implement selection for proximity via FIFO design of template queues, explained in “Search” part.

A higher form of such coincidental novelty is change, or the difference between S-T proximate inputs. Difference is primarily an alternative representation of a comparand, but its association with match may make it stronger than the original one. Strong enough difference is compared to a queue of other selected or aggregate differences between the same template & other comparands, & then evaluated as an output. Evaluation is subtraction of target queue’s average difference, which indicates the trend for subsequent inputs. This subtraction forms relative difference, similar to relative match.

Secondary value of difference is as a contrast, projection of which competes with the projection of a pattern it interrupts. This value is negative, borrowed from positive predictive value of adjacent pattern. Basically, it’s valuable to know when a pattern is no longer predictive, to better utilize its resources elsewhere. Differences (change) within noise have no such value, because they don’t interrupt any pattern.
Value of output template in projected for a target queue by “recombining” it with co-derived difference, which is also projected: multiplied by its relative distance to an average coordinate of a target queue. Projection of a difference “competes” with that of a template, so both must be adjusted proportionally: projected_template = template + average_distance * relative_difference * aggregate difference / match.

So, the two most basic selection criteria are unique relative match & projected relative difference. These are initial derivatives (match & difference), minus corresponding averages (normalized aggregates) of a higher level. What makes selection more complex than simple comparison is the necessity to adjust (multiply) the averages by newly formed parameters: redundancy for match & distance for difference. These new parameters are products of prior input selection, which increases redundancy of a potential new input, or prior input aggregation, which increases coordinate span of a target queue.


7. Hierarchy: selection by feedback of multi-level shortcuts.


In addition to selection by representation of the next level, the output may also be filtered by direct feedback from the levels beyond the next. Such higher-level feedback is what we call “expectations”. Just like feedback from immediately higher level, these short-cuts select inputs for their additive generality. The difference here is greater discontinuity in reference, as well as in range & potentially higher derivation of the expectations. These short-cuts provide a mechanism for periodic sampling, vs. continuous monitoring of lower levels:

A common objection to predictive correspondence as a fitness function of intelligence is that it would keep you “staring at a wall”: lock into predictable environments. We tend to do the opposite, - skip over too predictable locations, thereby reducing match of new inputs to known templates. I think that’s because we maximize higher level‘s projected, rather than confirmed, correspondence. Projection is re-integration of incrementally “abstract” vectors / derivatives into original templates, in proportion to their relative value & distance to a target level The value of expectation is negative for potential inputs because confirmation is redundant, - predictive value of higher-level templates will only increase (decrease) to the extent that the match (miss) is unexpected. Thus, inputs are redundant to a higher level to the extent they are “expected”.

Downward suppression of locations with expected inputs will result in preference for exploration & discovery of new patterns vs. confirmation of the old ones. This is the opposite of upward selection for stronger patterns, but this “sign reversal” in selection criteria is a basic feature of feedback, - this also holds for subtracting the average, redundancy, & projected differences, as explained in the previous chapter.

Focus on unknown locations is different from selection of unexpected, as suggested by Hawkins, because all outputs of a given location are equally suppressed by expectations. In relative terms, stronger patterns within a location (continuous search span) still win over the noise (weaker patterns). Expectations can’t be pattern-specific on a lower level because individual patterns would be “out of range” there. Basically, a range of search is what defines a level & its syntax. Thus, lower levels simply don’t have the syntax to represent the origin of higher-level patterns, & will “mistake” them for the local ones.
Note that my interpretation is in agreement with the consensus in neuroscience that cortical layer-1 feedback is modulatory in nature, rather than containing “driving“ inputs as Hawkins proposes.

Of course, human curiosity is not purely cognitive, it is biased toward a survival value of information. Proximate & changing objects are more likely to affect a subject in a short term, thus attracting far more attention than they would for their immediate contribution to predictive power. That should also be the case for a purely cognitive system with enough introspection: it will seek impacts | materials that could build-up its predictive capacity, & avoid those that threaten it. Maximizing cognitive capacity will increase predictive correspondence of self-representation, rather than that of external representation. But longer term, it will increase external predictive correspondence more than direct pursuit thereof.

“Seeking” in the above describes both modulatory feedback downward a hierarchy of representations, & a motor feedback as a modulation of sensors / actuators “below” that hierarchy. Modulation is adjustment of scope & resolution for both coordinates & values of lower-level sources. The scope (temporal & spatial) is defined as resolution of corresponding coordinates, reduced in proportion to variation in local input flow.
The “motor” part is the feedback to initial coordinates: the start of a search span, that are beyond the scope &|or resolution covered by internal representations (memory). That part must be translated into action by an equivalent of neuro-muscular junction, which should be built-in as an intrinsic element of the system.

Higher types of curiosity maximize more abstract forms of representation: projected matches of higher derivatives. Any representation is a partial & mediated form of reproduction, & such “abstraction” is what distinguishes any cognitive process from biomorphic evolution that maximizes complete reproduction.


8. Notes on clustering, ANNs, the brain, & probabilistic inference.


In terms of conventional machine learning, my approach is a form of hierarchical
fuzzy clustering. Cluster is simply a different term for pattern: a set of matching inputs. Each set is represented by a centroid: an input with below-threshold combined “distance” to other inputs of the set. The equivalent of centroid in my model is an input with above-average match (a complimentary of a distance) to other inputs within its search span. Such inputs are selected to search next-level queue, & indirectly represent other cross-compared, but not selected, inputs via their discrete or aggregate derivatives relative to the selected one.

Crucial differences here is that conventional clustering methods initialize centroids with arbitrary random weights, while I use matches (& so on) from past comparisons. And the weights are usually defined in terms of one variable, while I select higher-level inputs based on a combination of all variables per pattern, the number of which increases with the level of search.
Current methods in unsupervised learning were developed / accepted because they solved specific problems with reasonable resources. But, they aren’t even comparable to human learning in scalability. I believe that requires an upfront investment in incremental complexity of representation: a syntactic overhead that makes such representation uncompetitive at short runs, but is necessary to predictively prune longer-range search.

The most basic example here is my use of explicit coordinates, & of input differences at their distances. I haven’t seen that in other low-level approaches, yet they are absolutely necessary to form Euclidian vectors. Explicit coordinates is why I start image processing with 1D scan lines, - another thing that no one else does. Images seem to be “naturally” 2D, but expanding search in 2Ds at once adds the extra cost of two sets of new coordinates & derivatives. On the other hand, adding 1D at a time allows to select inputs for each additional layer of syntax, reducing overall (number of variables * number of inputs) costs of search on the next level.

The same coordinate-blind mindset pervades ANNs. The obsession with parallelism leads to indiscriminate aggregation of inputs & very coarse comparison & selection process (as explained in part 4). The distance is only vertical: number of layers, & learning is probabilistic: match is determined as a coincidence of multiple inputs. Again, no derivatives (0D miss), or coordinates (1-4D miss) per comparison are recorded. Without them, there is no Euclidian vectors, thus pattern prediction must remain extremely crude. And we can record these parameters only with sequential comparisons of individual inputs, parallelizing is a secondary issue.

More generally, there is critical difference in the way patterns are represented in the brain, vs. my hierarchy of queues. Brain consists of neurons, each likely representing a variable of a given magnitude & modality. These variables are shared among multiple patterns, or co-activated networks (“cognits” in terms of J. Fuster, see
Cortex & Mind). This is conceptually perverse: relatively constant value of specific variables defines a pattern, but there’s no reason that value should be shared among different patterns. The brain is forced to share variables because it has a fixed number of neurons, but a fluid & far greater number of their networks. This leads to mental artifacts, such as our arbitrary taxonomy of large, medium & small, vs. specific numbers.

We don’t have to slavishly copy these constraints. Logically, a co-occurring set of variables (of a given value & type/modality) should be localized into pattern, rather than localizing variables into values & modalities, & then sharing them among distant patterns. Put another way, the brain has relatively fixed variables, that are combined into fluid patterns. This is oxymoronic: “variable“ is what varies, & “pattern” is what doesn’t. So, it’s not surprising to me that our minds are largely irrational, even leaving aside all the sub-cortical nonsense. To refresh, patterns in my model are also grouped by modality on the lowest levels of cognitive hierarchy, & by proximity on all levels, but each pattern contains an internal (not shared) hierarchy of derived variables.


Probabilistic inference: AIT, Bayesian logic, & Markov chains.



A good introduction to Algorithmic information theory is
Philosophical Treatise of Universal Induction by Rathmanner & Hutter. The criterion is same as mine: compression & prediction. But, while a progress vs. static frequentist probability calculus, both AIT & Bayesian inference still assume a prior, which simply doesn’t belong in a consistently inductive approach. In my approach, priors / models are simply past inputs & correlations among them. Environmentally specific priors could speed-up learning, but unsupervised pattern discovery algorithm must be the core on which such short-cuts are added or removed from.

More importantly, as with most contemporary approaches, Bayesian learning is statistical or probabilistic. Probability is estimated for occurrence or co-occurrence of inputs | states, which I think is way too coarse. Inputs hardly ever match or miss precisely, so their similarity should be quantified. This would add a whole new dimension of micro-grayscale: partial match, as in my approach, vs. binary incidences in probabilistic inference. It should improve accuracy to the same extent that probabilistic inference improved on classical logic, by adding a macro-grayscale of partial probability vs. binary true | false values of the former.

And resolution of my inputs is always greater than that of their coordinates, while Bayesian inference & AIT typically start with the reverse: strings of 1bit inputs. These inputs, binary confirmations / disconfirmations, are extremely crude way to represent events & estimate their probability. Besides, the events are assumed to be high-level concepts: the kind that occupy our conscious minds & are derived from senses by subconscious cognitive processes, which must be built into general algorithm. Their choice as initial inputs for BI & AIT already shows a total lack of discipline in incrementing complexity: a fatal fault for any attempt at scalability.

As an attempt at general intelligence, Solomonoff introduced “universal prior“: a class of all models. That class is a priori infinite, which means that he hits combinatorial explosion even *before* receiving actual inputs. It‘s a “solution” that only a mathematician may find interesting. Marginally practical implementation of AIT is
Levin Search, which randomly generates models / algorithms of incremental complexity, & selects those that happen to solve a problem or compress a bitstring.
Again, I think starting with prior models is putting a cart before a horse: cognition must start with empirical data, math doesn’t become cost-efficient till much higher levels of abstraction. And the distinction between input patterns & algorithms (short-cuts to pattern discovery) is only valid within a level, - these algorithms are embedded in patterns they produce, hence are compared on higher levels & form “algorithmic patterns“.


9. Notes on a working mindset & a prize for ideas.


Many researches believe that AGI is a mathematical problem, which I think is a "man with a hammer" syndrome. Cognition is an incremental problem, - intelligence is a matter of degree. Higher math can accelerate learning on higher levels of generalization, but is excessively complex for initial levels.
It certainly isn’t innate in humans on any level, - cavemen didn’t do calculus.
Most of the rest think it’s an engineering problem. As with math, it depends on how you define engineering. But, there is an obvious trade-off between theoretical scope aspired by artificial intelligence, & practical certainty required of any conventional engineering.

To even begin designing an algorithm, one needs a working definition of intelligence. Which is an ultimate meta-generalization, - lossy in proportion to its vast scope. Such inductive lossiness is an anathema to precision & deduction- heavy culture of math, computer science & artificial intelligence.
That implies a major mismatch between a typical mindset in these fields & the nature of a problem they are trying to solve. I think this mismatch is largely responsible for barely detectable progress in theoretical understanding of cognition, hence current ML focus on narrow tasks & wasteful, shallow, statistical learning.

Cognitive psychology is a lot more intuitive in the way it develops explanations. But to be taken seriously, those theories must be confirmed by high-level experimentation, which way too coarse & slow.
It seems obvious to me that AGI is a meta-scientific problem, requiring more theoretical (detached from immediate verification) attitude than what’s acceptable in any established field. Well, except for philosophy, which really has no other proper subject of study. But, recognized philosophers have always been utterly dysfunctional, - not surprisingly as their judges & target audience are freshmen & other clueless highbrows.

Lots of people think that a major problem in AI is a lack of funding. I disagree, Einstein didn't need to be paid to work on his theory. Real science, especially theoretical, & even more so meta-science, must be driven mainly by curiosity. Anyway, I made a few bucks on my investments & want to put the money to work.
If you're interested in scalable complexity pattern discovery, here is an incentive:

I offer prizes from $100 to $100,000 for the ideas that refute, correct, or further develop principles explained above. $100 could be a "consolation prize" for ideas that I already incorporated but did not fully explain here, or that are largely cosmetic in nature. I‘ll also pay for good questions or pointers to relevant approaches. $100,000 is for an alternative / complementary theory, comparable in scope to the one introduced here.
A winner will have an option to convert his prize into an interest in all commercial applications of a final algorithm, at the rate of $10,000 per 1% share. This would be informal & likely unprofitable, - mine is not a commercial enterprise. Again, I don't believe money can be primary motivation here, but it may help a bit.

One winner so far: Todor Arnaudov, thank you very much! His idea seems obvious in retrospect (they all do), - simple buffering of old inputs after search. The buffer is accessed if coordinates of its contents get relatively “close” again. It occurred to me more than once before, but I rejected it as redundant to potential elevation of these inputs. Now that he made me think about it again (& stuck to it), I realized that partial redundancy can be justified because buffering delays loss of detail at much lower cost than elevation. It didn’t feel right because neocortex has no substrate for passive storage, but that should not matter here.
The prize was $600, would've been higher if it wasn’t for
noise-to-signal ratio:) (May 2010). Todor also won $400 consolation prize for understanding some ideas that weren’t clearly explained here. (September 2011).


I am deeply convinced that main challenge we face in formalizing GI is our specie-wide ADHD. We didn’t evolve for sustained focus on this level of generalization, that would cause extinction long before any tangible results. Our survival is no longer at risk, & we have computers to implement deeply recursive algorithms.
But, cognitive psychology always lags a light year behind technology. That psychology is addicted to mental crutches of authority, examples, & experimentation, while theoretical integrity is neglected & abused.

It’s pretty obvious that AGI is by far the most important problem now. Yet, not one out of 7B people pays it his full attention. To me, that’s a stark proof that a dressed-up ape needs some serious therapy.
I’ve experimented with various methods to focus on my meta-theory, with subjective success. For those who like the results (above), I posted suggestions on my other blog:
Cultivating focus on abstractions.

A handful of people claim to work on GI, but they either lack or ignore functional definition of intelligence. That is, their theory is always just a vague inspiration, which doesn‘t uniquely translate into implementation. And that’s not because they lack of raw intelligence or top-down focus. Rather, their “top” doesn‘t seem to be high enough. I came up with these neurological conditions for developing abnormal intellectual integrity:

- This level of generalization likely requires very
sparse cortical architecture, not terribly practical otherwise.
- Given scarcity of related material, one must be emotionally detached & asocial enough to start a new field.
- Meta generalization is extremely data-intensive, so this should begin before puberty, while prefrontal cortex is still unmyelinated & plastic. Of course, working alone is the hardest when one is young.