10/25/17

Cognitive Algorithm: hierarchical search and pattern encoding



Intelligence is ability to predict and plan (self-predict), which can only be done by discovering and projecting patterns. This definition is well established: pattern recognition is a core of any IQ test. The problem is, there was no general and constructive definition of either pattern or recognition (similarity). So, I came up with my own definitions, which directly translate into algorithm introduced in the next section.
For excellent popular introduction to cognition-as-prediction thesis see “On Intelligence” by Jeff Hawkins and “How to Create a Mind“ by Ray Kurzweil. But on a technical level, both of them and most current researchers  implement pattern discovery via artificial neural networks, which operate in a very coarse fashion:
ANN learns via some version of Hebbian “fire together, wire together” coincidence reinforcement. Normally, “neuron’s” inputs are weighed at synapses, then summed and thresholded into output. Output also back- propagates to synapses and is combined with their last input to adjust the weights. This weight adjustment is learning. But prior summation degrades initial resolution of inputs, precluding any comparison between them (the inputs are recoverable, but why degrade and then spend a ton of computation restoring them).
I believe resolution of inputs should be adjusted according to their predictive value, not reduced by default. Predictive value is input’s match to future inputs, projected from its match to past inputs. I define match as lossless compression per input (part 1, compare to existing measures of similarity), from cross-comparing inputs at initial resolution. Then inputs should be cross-compared over extended distance within patterns: spans of above-average match, and summed within spans of both above- and below- average match.
Statistical methods work in reverse order: inputs are summed within arbitrary samples, then the sums are evaluated as feedback to redefine weights and future samples into some meaningful shape. Neural Nets are all the rage now, and they obviously work, especially in a deep learning hierarchy. But they require very long training, thus don’t scale without supervision or task-specific reinforcement, and also recognize things that aren’t there. These problems are caused by aforementioned coarseness, inherent in statistical learning.
CNN, currently the most successful method, defines match as a dot product of input and template (kernel of weights). Product does overweigh similar vs. distant input-kernel pairs, for the same value of input + kernel. But it’s just a loose proxy, there is no conceptual reason to identify it with similarity.
Also, both input and kernel are arrays. This is far more coarse, thus less selective and efficient, than one-to-one comparison of adjacent inputs (pixels, then higher-level patterns) in my algorithm.

Even more coarse is training, which forms weights. It has to be coarse because CNN doesn’t compute difference (input - kernel) on each layer. I think the lack of fine-grained difference contributes to initialization bias in ANN, similar to confirmation bias in humans. In my approach, both difference and partial match are computed at the same time by comparison: my equivalent of inference. And feedback of minimal accumulated difference updates my filters (equivalent of weights), which directly ties training to inference.

Inspiration by the brain kept ANN research going for decades before they became useful. But their “neurons” are mere stick figures for real ones. Of course, most of complexity in a neuron is due to constraints of biology. Ironically, weighted summation in ANN may also be a no-longer needed compensation for such constraint:
neural memory requires dedicated connections (synapses), which makes individual input representation and comparison prohibitively expensive. But not anymore, we now have dirt-cheap random access memory.
Another biological constraint is very low speed of neurons, compounded by the imperative of fast reaction for survival in the wild. Both favor fast though crude summation, vs. slower one-to-one comparison. But this fast inference comes at the cost of glacial training. Reaction speed became relatively less important: modern society is quite secure, and continuous learning speed is far more important because of accelerating progress.
Yet another constraint is noise: neurons often fire at random, so spikes must be summed to get a stable input. But there is no good reason to degrade resolution of far more precise electronic circuits.   

Current Machine Learning and related theories (AIT, Bayesian inference, etc.) are largely statistical also because they were developed primarily for symbolic data. Such data, pre-compressed and pre-selected by humans, is far more valuable than sensory inputs it was ultimately derived from. But due to this selection and compression, proximate symbols are not likely to match, and partial match between them is hard to quantify. Hence, symbolic data is a misleading target for developing conceptually consistent algorithm.
This content is published under the Creative Commons Attribution 4.0 International License. You are encouraged to republish and rewrite it in any way you see fit, as long as you provide attribution and a link.



outline of my approach



Proposed algorithm is a clean design for deep learning: non-neuromorphic, sub-statistical, comparison-first.
It’s a hierarchical search for patterns, similar to hierarchical clustering. But conventional clustering defines match as inverted difference between inputs, which is wrong. Match is a subset common for both comparands, distinct from and complementary to their difference. Also, I don’t know of any clustering that implements incremental syntax: number of variables per input, and incremental spatio-temporal dimensionality (part 2).

I define pattern as a span of matching inputs, and match as partial overlap between them. Search should be strictly incremental, to allow for fine-grained input selection for extended search. Within a level of search, incremental is the distance between inputs and their derivation (part 4, level 1). Between levels, incremental is compositional scope and derivation of patterns formed on lower levels. “Strictly incremental” means that there must be a unique set of operations per level, hence a singular in “cognitive algorithm“.
Autonomous cognition must start with analog inputs, such as video or audio. All symbolic data, including natural languages, is encoded by some cognitive process. Thus, it should be decoded before being cross- compared to search for meaningful patterns. And the difficulty of decoding is exponential with the level of encoding, so search that starts with raw sensory input is by far the easiest to implement (part i).
Hence, my initial inputs are pixels, and higher-level inputs are patterns formed by lower-level search.
I quantify match and miss by cross-comparing inputs over selectively extended range of search. Basic comparison is inverse arithmetic operation between two single-variable inputs, starting with adjacent pixels.
Specific match and miss is determined by power of comparison: Boolean match is AND and miss is XOR, comparison by subtraction increases match to a smaller comparand and reduces miss to a difference,
comparison by division increases match to a multiple and reduces miss to a fraction, and so on (part 1).
To discover anything complex at “polynomial” cost, both search and resulting patterns must be hierarchical:
comparisons group inputs into patterns, which are compared on the next level. Comparison is also selective per level of differentiation (prior miss) within these hierarchically composed patterns. Such compression and selection reduces search time, relative to comparing original inputs. Higher levels form increasingly long-range spatio-temporal and then conceptual patterns. They also send feedback: filters and then motor action, to select lower-level inputs and locations with above-average additive predictive value (part 3).

Hierarchical approaches are common in unsupervised learning, and all do some sort of pattern recognition. But none that I know of implement strictly incremental growth in scope and complexity of discoverable patterns. This is critical for scalability: each increment allows for predictive input selection, to avoid combinatorial explosion in search space. However, this selection is more expensive upfront and won’t pay in simple problems, which are the usual test cases. Thus, my approach is not suitable for immediate experimentation, which is probably why no one else seems to be working on anything sufficiently similar.
Search space is number of patterns * number of their variables. Number of variables per pattern is incremental, starting from single-variable pixels in video. Each level of search adds new variables: match and miss, to every compared variable of an input pattern. So, the variables multiply at every level, new variables are summed within a pattern, then are combined to evaluate it for expanded search. Feedback of new variables extends lower-level algorithm by adding operations that filter future inputs, and then by adjusting the filters.



Contents:

i.  Cognition vs. evolution, analog vs. symbolic initial input
1.  Comparison: quantifying match and miss between two variables
2. Forward search and patterns, incremental space-time dimensionality
3. Feedback of filters, attentional input selection, imagination, motor action
4. Initial levels of search, corresponding orders of feedback, and resulting patterns:
level 1: comparison to past inputs, forming difference and relative match patterns
level 2: additional evaluation of resulting patterns for feedback, forming filter patterns
level 3: additional evaluation of projected filter patterns, forming updated-input patterns

5. Comparison between variable types within a pattern
6. Cartesian dimensions and sensory modalities
7. Notes on clustering, ANNs, and probabilistic inference
8. Notes on working mindset, “hiring” and a prize for contributions

i. Cognition vs. evolution, analog vs. symbolic initial input



Some say intelligence can be recognized but not defined. I think that’s absurd: we recognize some implicit definition. Others define intelligence as a problem-solving ability, but the only general problem is efficient search for solutions. Efficiency is a function of selection among inputs, vs. brute-force all-to-all search. This selection is by predicted value of the inputs, and prediction is interactive projection of their patterns. Some agree that intelligence is all about pattern discovery, but define pattern as a crude statistical coincidence.
Of course, the only mechanism known to produce human-level intelligence is even cruder, and that shows in haphazard construction of our brains. Algorithmically simple, biological evolution alters heritable traits at random and selects those with above-average reproductive fitness. But this process requires almost inconceivable computing power because selection is very coarse: on the level of whole genome rather than individual traits, and also because intelligence is only one of many factors in reproductive fitness.
Random variation in evolutionary algorithms, generative RBMs, and so on, is antithetical to intelligence. Intelligent variation must be driven by feedback within cognitive hierarchy: higher levels are presumably “smarter” than lower ones. That is, higher-level inputs represent operations that formed them, and are evaluated to alter future lower-level operations. Basic operations are comparison and summation among inputs, defined by their range and resolution, analogous to reproduction in genetic algorithms.
Range of comparison per conserved-resolution input should increase if projected match (cognitive fitness function) exceeds average match per comparison. In any non-random environment, average match declines with the distance between comparands. Thus, search over increasing distance requires selection of above- average comparands. Any delay, coarseness, and inaccuracy of such selection is multiplied at each search expansion, soon resulting in combinatorial explosion of unproductive (low additive match) comparisons.

Hence, my model is strictly incremental: search starts with minimal-complexity inputs and expands with minimal increments in their range and complexity (syntax). At each level, there is only one best increment, projected to discover the greatest additive match. No other AGI approach follows this principle.
I guess people who aim for human-level intelligence are impatient with small increments and simple sensory data. Yet, this is the most theoretical problem ever, demanding the longest delay in gratification.



symbolic obsession and its discontents
Most obvious example of this impatience is the use of symbolic data as initial inputs in cognitive projects. Such a start betrays profound misunderstanding of cognition. Even children, predisposed to learn language, only become fluent after years of directly observing things their parents talk about. Words are mere labels for concepts, the most important of which are generalized spatio-temporal patterns, derived from multi-modal sensory experience. Top-down reconstruction of such patterns solely from correlations among their labels is exponentially more difficult than their bottom-up construction from sensory inputs.    
All our knowledge is ultimately derived from senses, but lower levels of human perception are unconscious. Only generalized concepts make it into our consciousness, AKA declarative memory, where we assign them symbols (words) to facilitate communication. This brain-specific constraint creates heavy symbolic vs. sub-symbolic bias, especially strong in artificial intelligentsia. Which is like putting a cart in front of a horse: most words are meaningless unless coupled with representations of sensory patterns.
To be incrementally selective, cognitive algorithm must exploit proximity first, which is only productive for continuous  and loss-tolerant raw sensory data. Symbolic data is already compressed: consecutive characters and words in text won’t match. It’s also encoded with distant cross-references, that are hardly ever explicit outside of a brain. Text looks quite random unless you know the code: operations that generalized pixels into patterns (objects, processes, concepts). That means an algorithm designed specifically for text will not be consistently incremental in the range of search, which will impair its scalability.
In Machine Learning, input is string, frame, or video sequence of a defined length, with artificial separation between training and inference. In my approach, learning is continuous and interactive. Initial inputs are streamed pixels of maximal resolution, and higher-level inputs are multi-variate patterns formed by comparing lower-level inputs. Spatio-temporal range of inputs, and selective search across them, is extended indefinitely. This expansion is directed by higher-level feedback, just as it is in human learning.
Everything ever written is related to my subject, but nothing is close enough: not other method is meant to be fully consistent. Hence a dire scarcity of references here. My approach is presented bottom-up (parts 1 - 6), thus can be understood without references. But that requires a clean context, - hopefully cleaned out by reader‘s own introspective generalization. Other (insufficiently) related approaches are addressed above and in part 7. I also have a more advanced work-in-progress, but will need a meaningful feedback to elaborate.

1. Comparison: quantifying match and miss between two variables



First of all, we must quantify predictive value. Algorithmic information theory defines it as compressibility of representation, which is perfectly fine. But current implementations only define compression for whole sequences of inputs. To enable far more incremental selection (thus more scalable search), I start by quantifying match (= compression) between individual inputs. Partial match is a finer dimension of analysis, vs. binary same | different inputs in probabilistic inference. This is similar to the way probabilistic inference improved on classical logic, by quantifying partial probability vs. binary true | false value of statements.

I define match as a complementary of miss. Basic miss is a difference between comparands, hence match is a smaller comparand. In other words, match is a compression of larger comparand’s magnitude by replacing it with difference relative to smaller comparand. Ultimate criterion is recorded magnitude, rather than bits of memory it occupies, because the former represents physical impact that we want to predict. The volume of memory used to record that magnitude depends on prior compression, which is not an objective parameter.

This definition is tautological: smaller input is a common subset of both inputs, = sum of AND between their uncompressed (unary code) representations. Some may object that match includes the case when both inputs equal zero, but that match also equals zero. The purpose is prediction, which is a representational equivalent of conservation in physics. Ultimately, we’re predicting some potential impact on observer, represented by input. Zero input means zero impact, which has no conservable inertia, thus no intrinsic predictive value.

Given incremental complexity of representation, initial inputs should have binary resolution. However, these binary inputs are more efficiently compressed by digitization within a coordinate: substitution of every two lower-order bits of an integer with one higher-order bit. Resolution of coordinate (input summation span) is adjusted to form large enough integers, to produce average match that exceeds the cost of comparison.
All costs here are opportunity costs: average compression by equivalent computational resources.

Next order of compression is by comparing integers across coordinates, initially only between consecutive coordinates. Basic comparison is inverse arithmetic operation of incremental power: AND, subtraction, division, logarithm, and so on. Additive match is achieved by comparison of a higher power than that which produced comparands: comparison by AND will not further compress integers previously digitized by AND.
Rather, initial comparison between integers is by subtraction, miss is resulting difference and absolute match is a smaller input. For example, if inputs are 4 and 7, then miss is 3, and their match or common subset is 4. Difference is smaller than XOR (complementary of AND) because XOR includes opposite-sign bit pairs 0,1 and 1,0, which will be cancelled-out by subtraction.
Comparison by division forms ratio, which is a compressed difference. Compression is explicit in long division: match is accumulated over iterative subtraction of smaller comparand from remaining difference. In other words, this is also a comparison by subtraction, but between different orders of derivation. New match is smaller comparand * integer part of ratio, and miss is final reminder or fractional part of ratio.
Ratio can be further compressed by converting it to radix or logarithm, and so on.
By reducing miss, higher-power comparison increases complementary match (match = larger input - miss):
total per order:   1) larger input   2) XOR                       3) difference: combined current-order match & miss
additive match:       AND                opposite-sign XOR      multiple of a smaller input within a difference
remaining miss:      XOR                difference                       fraction: complementary to multiple within a ratio
But the costs of operations and record of signs, fractions, irrational fractions, etc. may grow even faster. Thus, the power of comparison is increased only if higher-level “meta-comparison” between results from different powers of comparison finds a pattern of improving (compression - costs) with increasing power.

The above covers absolute match, which must be projected on potential search location before evaluation. Projection is recombination of co-derived matches and misses at a distant location:
projected match = match + ((difference * (summed match / summed input)) * distance) / 2  

This match projection from results of a single comparison is not implemented in the algorithm. That’s because projection is a feedback, which adjusts lower-level filters and is represented in the output. Incurred costs must be justified by the value of feedback, which is low per comparison but accumulates over multiple comparisons.
So, actual feedback is *average* projected match in higher-level patterns, formed by multiple vs. individual comparisons. This projection is by recombining overlapping match patterns and difference patterns in proportion to their predictive value (vs. recombining individual matches and differences above).

But ultimate selection criterion is relative match: current-level match - past current-level match that co-occurs with average higher-level projected match. In other words, relative match is positive or negative predictive value, which determines input inclusion into positive or negative pattern.



2. Forward search and patterns, incremental space-time dimensionality



Pattern is a contiguous span of inputs that form above-average matches, similar to conventional cluster.
As explained above, matches and misses (derivatives) are produced by comparing consecutive inputs. These derivatives are summed within a pattern and then compared between patterns on the next level of search, adding new derivatives to a higher pattern. Patterns are defined contiguously for each level, but are necessarily followed by a span of below-average match (gap or negative pattern), thus next-level inputs are discontinuous.
Negative patterns represent contrast or discontinuity between positive patterns, which is a one- or higher- dimensional equivalent of difference between zero-dimensional pixels. As with differences, projection of a negative pattern competes with projection of adjacent positive pattern. But match and difference is derived from the same input pair, while positive and negative patterns represent separate spans of inputs.

Negative pattern is not predictive on its own, but is valuable for allocation: computational resources of no-longer predictive pattern should be used elsewhere. Hence, the value of negative pattern is borrowed from predictive value of co-projected positive pattern, as long as combined additive match remains above average.
Consecutive positive and negative patterns project over same future input span, and these projections partly cancel each other. So, they should be combined to form feedback, as explained in part 3.
Initial match is evaluated for inclusion into higher positive or negative pattern. The value is summed until its sign changes, and if positive, evaluated again for cross-comparison among constituent inputs over increased distance. Second evaluation is necessary because the cost of incremental syntax, produced by cross-comparing, is per pattern rather than per input. Pattern is terminated and outputted to the next level when value sign changes. On the next level, it is compared to previous patterns of the same compositional order.
Initial inputs are pixels of video, or equivalent limit of positional resolution in other modalities. Hierarchical search on higher levels should discover patterns representing empirical objects and processes, and then relational logical and mathematical shortcuts, eventually exceeding generality of our semantic concepts. Which are also patterns in cognitive terms: all information is either learned as patterns or filtered out as noise. For online learning, all levels should receive inputs from lower levels and feedback from higher levels in parallel.

space-time dimensionality and initial implementation
Any prediction has two components: what and where. We must have both: value of prediction = precision of what * precision of where. That “where” is currently neglected: statistical ML methods represent S-T dimensions with a significant lag, much more coarsely than the inputs themselves. Hence, precision of where (spans of and distances between patterns) is severely degraded, and so is predictive value of combined representations. There is no such immediate degradation of positional information in my method.
My core algorithm is 1D: time only (part 4). Our space-time is 4D, but each of  these dimensions can be mapped on one level of search. This way, levels can select input patterns that are strong enough to justify the cost of representing additional dimension, as well as derivatives (matches and differences) in that dimension.
Initial 4D cycle of search would compare contiguous inputs, similarly to connected-component analysis:
level 1 compares consecutive 0D pixels within horizontal scan line, forming 1D patterns: line segments.
level 2 compares contiguous 1D patterns between consecutive lines in a frame, forming 2D patterns: blobs.
level 3 compares contiguous 2D patterns between incremental-depth frames, forming 3D patterns: objects.
level 4 compares contiguous 3D patterns in temporal sequence, forming 4D patterns: processes.

Subsequent cycles would compare 4D input patterns over increasing distance in each dimension, forming longer-range discontinuous patterns. These cycles can be coded as implementation shortcut, or discovered by core algorithm itself, which can adapt to inputs of any dimensionality. “Dimension” here is a parameter that defines external sequence and distance among inputs. This is different from conventional clustering, which treats both external and internal parameters as dimensions. More in part 6.
However, average match in our space-time is presumably equal over all four dimensions. That means patterns defined in fewer dimensions will be biased by the angle of scanning, introducing artifacts. Hence, initial pixel comparison and inclusion into patterns should also be over 4D, or at least over 2D for still images.
This will be a universe-specific extension of my core algorithm.
 
I am currently working on implementation of core algorithm adapted to process images and then video, but it should be extensible to any type and scope of data. My work-in-progress Python code is posted on github, suggestions and collaboration are most welcome, see the last part here on prizes.




3. Feedback of filters, attentional input selection, imagination, motor action
(needs work)

After evaluation for inclusion into higher-level pattern, input is also evaluated as feedback to lower levels. Feedback is update to filters that evaluate forward (Λ) and feedback (V), as described above but on lower level.
Basic filter is average value of input’s projected match that co-occurs with (thus predicts) average higher-level match within a positive (above-average) pattern. Both values are represented in resulting patterns.
Feedback value = forward value - value of derivatives /2, both of an input pattern. In turn, forward value is determined by higher-level feedback, and so on. Thus, all higher levels affect selection on lower-level inputs. This is because the span of each pattern approximates, hence projects over, combined span of all lower levels.
Indirect feedback propagates level-sequentially, more expensive shortcut feedback is sent to selected levels.
Negative derivatives project increasing match: match to subsequent inputs is greater than to previous inputs.
Such feedback will reduce lower-level filter. If filter is zero, all inputs are cross-compared, and if filter is negative, it is applied to cancel subsequent filters for incrementally longer-range cross-comparison.
There is one filter for each compared variable within input pattern, initialized at 0 and updated by feedback.
novelty vs. generality
Any integrated system must have a common selection criterion. Two obvious cognitive criteria are novelty and generality: miss and match, But we can’t select for both, they exhaust all possibilities. Novelty can’t be primary criterion: it would select for noise and filter out all patterns, which are defined by match. On the other hand, to maximize match of inputs to memory we can “stare at a wall” and lock into predictable environments. But of course, natural curiosity actively avoids predictable locations, thus reducing the match.

This dilemma is resolved if we maximize predictive power: projected match, rather than actual match, of inputs to records. To the extent that new match was projected by past inputs, it doesn’t add to their projected match. But neither does noise: novelty (difference to past inputs) that is not projected to persist (match) in the future.
Additive projected match = downward (V) match to subsequent inputs - upward (Λ) match to previous inputs.

This ΛV asymmetry comes from projecting difference or value (both are signed) of input relative to feedback.
Λ match is initially projected from match between input and its higher-level average. Such averages may be included in feedback, along with average match (filter). Each average is compared to same-type variable of an input, and resulting match (redundancy to a higher level) is subtracted from input match before evaluation.
Averages represent past vs. current higher level and should be projected over feedback delay:
average += average difference * (delay / average span) /2. Value of input-to-average match is reduced by higher-level match rate: rM= match / input, so additive match = input match - input-to-average match * rM.
Adjustment by input-to-average match is done if rM is above-average for incurred cost of processing.
So, basic selection for novelty is subtraction of adjusted Λ match-to-filter. V selection for generality, on the other hand, is a function of back-projected difference of match, which is a higher-derivation feedback. Projected increase of match adds to, and decrease of match subtracts from, predictive value of future inputs.
Higher-order selection for novelty should skip (or avoid processing) future input spans predicted with high certainty. This is a selection of projected vs. actual inputs, covered in part 4, level 3.  
More generally, ΛV projection asymmetry is expressed by differences of incremental representation order:
difference in magnitude of initial inputs: projected next input = last input + difference/2,
difference in input match, a subset of magnitude: projected next match = last match + match difference/2,
difference in match of match, a sub-subset of magnitude, projected correspondingly, and so on.
Ultimate criterion is top order of match on a top level of search: the most predictive parameter in a system.

imagination, planning, action
Imagination is not actually original, it can only be formalized as interactive projection of known patterns. Strong patterns send positive feedback to lower-level locations where they are projected to re-occur at higher resolution, to increase represented detail. When originally distant patterns are projected to re-occur in the same location, their projections interact and combine. This is a generative (vs. reductive) mechanism.
Interaction is comparison and summation between same-type variables of co-projected patterns: direct feedback from multiple higher levels to the same location. Location is a span of search within a lower level, which receives feedback with matching projected coordinates. Evaluation of actual inputs is delayed until additive match projected in their location exceeds the value of combined filters.
Comparison forms patterns of filters to compress their representations and selectively project their match, same as with original inputs. Summation of their differences cancels or reinforces corresponding variables, forming combined filter, including secondary patterns of repulsion or attraction. These patterns per variable are discovered on a higher level, by comparing past inputs with their multiple-feedback projections.
Combined filter is pre-evaluated: projected value of positive patterns is compared to projected cost of evaluating all inputs, both within a filtered location. Resulting prevalue (value of evaluation) is negative when the latter exceeds the former: projected inputs are not worth evaluating and their span / location is skipped.
Iterative skipping of input spans is the most basic motor feedback: it increments coordinates and differences of the filters. Cross-filter search then proceeds over multiple “imagined” locations, before finding one with projected above-average additive match. That’s where skipping stops and actual inputs are received.
Cognitive component of action is planning. It is technically identical to imagination, except that projected patterns also represent the system itself. Feedback of such self-patterns eventually reaches the bottom of representational hierarchy: sensors and actuators, adjusting their sensitivity and coordinates. Such environmental interface is part of any cognitive system, although actuators are optional.

4. Initial levels of search, corresponding orders of feedback and resulting patterns
 
This part recapitulates and expands on my core algorithm, which operates in one dimension: time only. Spatial and derived dimensions are covered in part 6. Even within 1D, search is hierarchical in scope, containing any number of levels. New level is added when current top level terminates and outputs the pattern it formed.
Higher-level patterns are fed back to select future inputs on lower levels. Feedback is sent to all lower levels because span of each pattern approximates combined span of inputs within whole hierarchy below it.
So, deeper hierarchy forms higher orders of feedback, with increasing elevation and scope relative to its target: same-level prior input, higher-level match average, beyond-the-next-level match value average, etc.

These orders of feedback represent corresponding order of input compression: input, match between inputs, match between matches, etc. Such compression is produced by comparing inputs to feedback of all orders. Comparisons form patterns, the type of which corresponds to relative span of compared feedback:

1: prior inputs are compared to the following ones on the same level, forming difference patterns dPs,
2: feedback of higher-level match is used to evaluate match between inputs, forming value patterns vPs,
3: trans-level feedback re-evaluates positive values of match, forming more selective shortcut patterns sPs

Feedback of 2nd order consists of input filters (if) defining value patterns, and coordinate filters (Cf) defining positional resolution and relative distance to future inputs.
Feedback of 3rd order is shortcut filters for beyond-the-next level. These filters, sent to a location defined by attached coordinate filters, form higher-order value patterns for deeper internal and distant-level comparison.

Higher-order patterns are more selective: difference is as likely to be positive as negative, while value is far more likely to be negative, because positive patterns add costs of re-evaluation for extended cross-comparison among their inputs. And so on, with selection and re-evaluation for each higher order of positive patterns. Negative patterns are still compared as a whole: their weak match is compensated by greater span.

All orders of patterns formed on the same level are redundant representations of the same inputs. Patterns contain representation of match between their inputs, which are compared by higher-order operations. Such operations increase overall match by combining results of lower-order comparisons across pattern’s variables:

0Le: AND of bit inputs to form digitized integers, containing multiple powers of two
1Le: SUB  of integers to form patterns, over additional external dimensions = pattern length L
2Le: DIV  of multiples (L) to form ratio patterns, over additional distances = negative pattern length LL
3Le: LOG of powers (LLs), etc.  Starting from second level, comparison is selective per element of an input.
Such power increase also applies in comparison to higher-order feedback, with a lag of one level per order.
Power of coordinate filters also lags the power of input filters by one level:
  
1Le fb: binary sensor resolution: minimal and maximal detectable input value and coordinate increments   
2Le fb: integer-valued average match and relative initial coordinate (skipping intermediate coordinates)
3Le fb: rational-valued coefficient per variable and multiple skipped coordinate range
4Le fb: real-valued coefficients and multiple coordinate-range skip

I am defining initial levels to find recurring increments in operations per level, which could then be applied to generate higher levels recursively, by incrementing syntax of output patterns and of feedback filters per level.

operations per generic level  (out of date)
Level 0 digitizes inputs, filtered by minimal detectable magnitude: least significant bit (i LSB). These bits are AND- compared, then their matches are AND- compared again, and so on, forming integer outputs. This is identical to iterative summation and bit-filtering by sequentially doubled i LSB.
Level 1 compares consecutive integers, forming ± difference patterns (dP s). dP s are then evaluated to cross-compare their individual differences, and so on, selectively increasing derivation of patterns.
Evaluation: dP M (summed match) - dP aM (dP M per average match between differences in level 2 inputs).
Integers are limited by the number of digits (#b), and input span: least significant bit of coordinate (C LSB).
No 1st level feedback: fL cost is additive to dP cost, thus must be justified by the value of dP (and coincident difference in value of patterns filtered by adjusted i LSB), which is not known till dP is outputted to 2nd level.
Level 2 evaluates match within dP s | bf L (dP) s, forming ± value patterns: vP s | vP (bf L) s. +vP s are evaluated for cross-comparison of their dP s, then of resulting derivatives, then of inputted derivation levels. +vP (bf L) s are evaluated to cross-compare bf L s, then dP s, adjusted by the difference between their bit filters, and so on.
dP variables are compared by subtraction, then resulting matches are combined with dP M (match within dP) to evaluate these variables for cross-comparison by division, to normalize for the difference in their span.
// match filter is also normalized by span ratio before evaluation, same-power evaluation and comparison?
Feedback: input dP s | bf L (dP) are back-projected and resulting magnitude is evaluated to increment or decrement 0th level i LSB. Such increments terminate bit-filter span ( bf L (dP)), output it to 2nd level, and initiate a new i LSB span to filter future inputs. // bf L (dP) representation: bf , #dP, Σ dP, Q (dP).
Level 3 evaluates match in input vP s or f L (vP) s, forming ± evaluation-value patterns: eP s | eP (fL) s. Positive eP s are evaluated for cross-comparison of their vP s ( dP s ( derivatives ( derivation levels ( lower search-level sources: buffered or external locations (selected sources may directly specify strong 3rd level sub-patterns).
Feedback: input vP is back-projected, resulting match is compared to 2nd level filter, and the difference is evaluated vs. filter-update filter. If update value is positive, the difference is added to 2nd level filter, and filter span is terminated. Same for adjustment of previously covered bit filters and 2nd level filter-update filters?
This is similar to 2nd level operations, but input vP s are separated by skipped-input spans. These spans are a filter of coordinate (Cf, higher-order than f for 2nd level inputs), produced by pre-valuation of future inputs:
projected novel match = projected magnitude * average match per magnitude - projected-input match?
Pre-value is then evaluated vs. 3rd level evaluation filter + lower-level processing cost, and negative prevalue-value input span (= span of back-projecting input) is skipped: its inputs are not processed on lower levels.
// no prevaluation on 2nd level: the cost is higher than potential savings of only 1st level processing costs?
As distinct from input filters, Cf is defined individually rather than per filter span. This is because the cost of Cf update: span representation and interruption of processing on all lower levels, is minor compared to the value of represented contents? ±eP = ±Cf: individual skip evaluation, no flushing?
or interruption is predetermined, as with Cb, fixed C f within C f L: a span of sampling across fixed-L gaps?
alternating signed Cf s are averaged ±vP s?
Division: between L s, also inputs within minimal-depth continuous d-sign or m-order derivation hierarchy?
tentative generalizations and extrapolations
So, filter resolution is increased per level, first for i filters and then for C filters: level 0 has input bit filter,
level 1 adds coordinate bit filter, level 2 adds input integer filter, level 3 adds coordinate integer filter.
// coordinate filters (Cb, Cf) are not input-specific, patterns are formed by comparing their contents.
Level 4 adds input multiple filter: eP match and its derivatives, applied in parallel to corresponding variables of input pattern. Variable-values are multiplied and evaluated to form pattern-value, for inclusion into next-level ±pattern // if separately evaluated, input-variable value = deviation from average: sign-reversed match?
Level 5 adds coordinate multiple filter: a sequence of skipped-input spans by iteratively projected patterns, as described in imagination section of part 3. Alternatively, negative coordinate filters implement cross-level shortcuts, described in level 3 sub-part, which select for projected match-associated novelty.
Additional variables in positive patterns increase cost, which decreases positive vs. negative span proportion.
Increased difference in sign, syntax, span, etc., also reduces match between positive and negative patterns. So, comparison, evaluation, pre-valuation... on higher levels is primarily for same-sign patterns.
Consecutive different-sign patterns are compared due to their proximity, forming ratios of their span and other variables. These ratios are applied to project match across different-sign gap or contrast pattern:
projected match += (projected match - intervening negative match) * (negative value / positive value) / 2?
ΛV selection is incremented by induction: forward and feedback of actual inputs, or by deduction: algebraic compression of input syntax, to find computational shortcuts. Deduction is faster, but actual inputs also carry  empirical information. Relative value of additive information vs. computational shortcuts is set by feedback.
Following sub-parts cover three initial levels of search in more detail, though out of date:
Level 1: comparison to past inputs, forming difference patterns and match patterns


Inputs to the 1st level of search are single integers, representing pixels of 1D scan line across an image, or equivalents from other modalities. Consecutive inputs are compared to form differences, difference patterns, matches, relative match patterns. This comparison may be extended, forming higher and distant derivatives:
resulting variables per input: *=2 derivatives (d,m) per comp, + conditional *=2 (xd, xi) per extended comp:
    8 derivatives   // ddd, mdd, dd_i, md_i, + 1-input-distant dxd, mxd, + 2-input-distant d_ii, m_ii,
          /        \
     4 der   4 der     // 2 consecutive: dd, md, + 2 derivatives between 1-input-distant inputs: d_i and m_i,
     /     \    /    \
  d,m d,m   d,m    // d, m: derivatives from default comparison between consecutive inputs,
 /    \   /    \  /     \
i  >>  i  >>  i  >>  i    // i: single-variable inputs.
This is explained / implemented in my draft python code:  level_1_working. That first level is for generic 1D cognitive algorithm,  its adaptation for image and then video recognition algorithm will be natively 2D.
That’s what I spend most of my time on, the rest of this intro is significantly out of date.
bit-filtering and digitization   
1st level inputs are filtered by the value of most and least significant bits: maximal and minimal detectable magnitude of inputs. Maximum is a magnitude that co-occurs with average 1st level match, projected by outputted dP s. Least significant bit value is determined by maximal value and number of bits per variable.
This bit filter is initially adjusted by overflow in 1st level inputs, or by a set number of consecutive overflows.
It’s also adjusted by feedback of higher-level patterns, if they project over- or under- flow of 1st level inputs that exceeds the cost of adjustment. Underflow is average number of 0 bits above top 1 bit.
Original input resolution may be increased by projecting analog magnification, by impact or by distance.
Iterative bit-filtering is digitization: bit is doubled per higher digit, and exceeding summed input is transferred to next digit. A digit can be larger than binary if the cost of such filtering requires larger carry.
Digitization is the most basic way of compressing inputs, followed by comparison between resulting integers.
hypothetical: comparable magnitude filter, to form minimal-magnitude patterns
Initial magnitude justifies basic comparison, and summation of below-average inputs only compensates for their lower magnitude, not for the cost of conversion. Conversion involves higher-power comparison, which must be justified by higher order of match, to be discovered on higher levels. Or by average neg. mag. span?
iP min mag span conversion cost and comparison match would be on 2nd level, but it’s not justified by 1st level match, unlike D span conversion cost and comparison match, so it is effectively the 1st level of comparison?
possible +iP span evaluation: double evaluation + span representation cost < additional lower-bits match?
The inputs may be normalized by subtracting feedback of average magnitude, forming ± deviation, then by dividing it by next+1 level feedback, forming a multiple of average absolute deviation, and so on. Additive value of input is a combination of all deviation orders, starting with 0th or absolute magnitude.
Initial input evaluation if any filter: cost < gain: projected negative-value (comparison cost - positive value):
by minimal magnitude > ± relative magnitude patterns (iP s), and + iP s are evaluated or cross-compared?
or by average magnitude > ± deviations, then by co-average deviation: ultimate bit filter?
Summation *may* compensate for conversion if its span is greater than average per magnitude spectrum?!
Summation on higher levels also increases span order, but within-order conversion is the same, and  between-order comparison is intra-pattern only. bf spans overlap vP span, -> filter conversion costs?
Level 2: additional evaluation of input patterns for feedback, forming filter patterns
(out of date).

Inputs to 2nd level of search are patterns derived on 1st level: dP ( L, I, D, V, Q (d)) s or t dP ( tV, tD, dP ()) s.
These inputs are evaluated for feedback to update 0th level i LSB, terminating same-filter span.
Feedback increment of LSB is evaluated by deviation () of magnitude, to avoid input overflow or underflow:
+= I/ L - LSB a; |∆| > ff? while (|∆| > LSB a){ LSB ±; |∆| -= LSB a; LSB a *2};
LSB a is average input (* V/ L?) per LSB value, and ff is average deviation per positive-value increment;
Σ () before evaluation: no V patterns? #b++ and C LSB-- are more expensive, evaluated on 3rd level?
They are also compared to previously inputted patterns, forming difference patterns dPs and value patterns vPs per input variable, then combined into dPP s and vPP s per input pattern.
L * sign of consecutive dP s is a known miss, and match of dP variables is correlated by common derivation.
Hence, projected match of other +dP and -dP variables = amk * (1 - L / dP). On the other hand, same-sign dP s are distant by L, reducing projected match by amk * L, which is equal to reduction by miss of L?
So, dP evaluation is for two comparisons of equal value: cross-sign, then cross- L same-sign (1 dP evaluation is blocked by feedback of discovered or defined alternating sign and co-variable match projection).
Both of last dP s will be compared to the next one, thus past match per dP (dP M) is summed for three dP s:
dP M ( Σ ( last 3 dP s L+M)) - a dP M (average of 4Le +vP dP M) -> v, vs;; evaluation / 3 dP s -> value, sign / 1 dP.
while (vs = ovs){ ovs = vs; V+=v; vL++; vP (L, I, M, D) += dP (L, I, M, D);; default vP - wide sum, select preserv.
vs > 0? comp (3 dP s){ DIV (L, I, M, D) -> N, ( n, f, m, d); vP (N, F, M, D) += n, f, m, d;; sum: der / variable, n / input?  
vr = v+ N? SUB (nf) -> nf m; vd = vr+ nf m, vds = vd - a;; ratios are too small for DIV?
while (vds = ovds){ ovds = vds; Vd+=vd; vdL++; vdP() += Q (d | ddP);; default Q (d | ddP) sum., select. preserv.
vds > 0? comp (1st x lst d | ddP s of Q (d) s);; splicing Q (d) s of matching dP s, cont. only: no comp ( Σ Q (d | ddP)?
Σ vP ( Σ vd P eval: primary for -P, redundant to individual dP s ( d s  for +P, cost *2, same for +P' I and -P' M,D?
no Σ V | Vd evaluation of cont. comp per variable or division: cost + vL = comp cost? Σ V per fb: no vL, #comp;
- L, I, M, D: same value per mag, power / compression, but I | M, D redund = mag, +vP: I - 2a, - vP: M, D - 2a?
- no variable eval: cost (sub + vL + filter) > comp cost, but match value must be adjusted for redundancy?
- normalization for comparison: min (I, M, D) * rL, SUB (I, M, D)? Σ L (pat) vs C: more general but interrupted?
variable-length DIV: while (i > a){ while (i> m){ SUB (i, m) -> d; n++; i=d;}; m/=2; t=m;  SUB (d, t); f+= d;}?
additive compression per d vs. m*d: > length cost?
tdP ( tM, tD, dP(), ddP Σ ( dMΣ (Q (dM)), dDΣ (Q (dD)), ddLΣ (Q (ddL)), Q (ddP))); // last d and D are within dP()?
Input filter is a higher-level average, while filter update is accumulated over multiple higher-level spans until it exceeds filter-update filter. So, filter update is 2nd order feedback relative to filter, as is filter relative to match.
But the same filter update is 3rd  order of feedback when used to evaluate input value for inclusion into pattern defined by a previous filter: update span is two orders higher than value span.

Higher-level comparison between patterns formed by different filters is mediated, vs. immediate continuation of current-level comparison across filter update (mediated cont.: splicing between different-filter patterns by vertical specification of match, although it includes lateral cross-comparison of skip-distant specifications).
However, filter update feedback is periodic, so it doesn’t form continuous cross-filter comparison patterns xPs.


adjustment of forward evaluation by optional feedback of projected input
More precisely, additive value or novel magnitude of an input is its deviation from higher-level average. Deviation = input - expectation: (higher-level summed input - summed difference /2) * rL (L / hL).
Inputs are compared to last input to form difference, and to past average to form deviation or novelty.
But last input is more predictive of the next one than a more distant average, thus the latter is compared on higher level than the former. So, input variable is compared sequentially and summed within resulting patterns. On the next level, the sum is compared vertically: to next-next-level average of the same variable.
Resulting vertical match defines novel value for higher-level sequential comparison:
novel value = past match - (vertical match * higher-level match rate) - average novel match:
nv = L+M - (m (I, (hI * rL)) * hM / hL) - hnM * rL; more precise than initial value: v = L+M - hM * rL;
Novelty evaluation is done if higher-level match > cost of feedback and operations, separately for I and D P s:
I, M ( D, M feedback, vertical SUB (I, nM ( D, ndM));
Impact on ambient sensor is separate from novelty and is predicted by representational-value patterns?
- next-input prediction: seq match + vert match * relative rate, but predictive selection is per level, not input.
- higher-order expectation is relative match per variable: pMd = D * rM, M/D, or D * rMd: Md/D,
- if rM | rMd are derived by intra-pattern comparison, when average M | Md > average per division?
one-input search extension within cross-compared patterns


Match decreases with distance, so initial comparison is between consecutive inputs. Resulting match is evaluated, forming ±vP s. Positive P s are then evaluated for expanded internal search: cross-comparison among 1-input-distant inputs within a pattern (on same level, higher-level search is between new patterns).
This cycle repeats to evaluate cross-comparison among 2-input-distant inputs, 3-input-distant inputs, etc., when summed current-distance match exceeds the average per evaluation.  
So, patterns of longer cross-comparison range are nested within selected positive patterns of shorter range. This is similar to 1st level ddP s being nested within dP s.
Same input is re-evaluated for comparison at increased distance because match will decay: projected match = last match * match rate (mr), * (higher-level mr / current-level mr) * (higher-level distance / next distance)?
Or = input * average match rate for that specific distance, including projected match within negative patterns.
It is re-evaluated also because projected match is adjusted by past match: mr *= past mr / past projected mr?
Also, multiple comparisons per input form overlapping and redundant patterns (similar to fuzzy clusters),
and must be evaluated vs. filter * number of prior comparisons, reducing value of projected match.

Instead of directly comparing incrementally distant input pairs, we can calculate their difference by adding intermediate differences. This would obviate multiple access to the same inputs during cross-comparison.
These differences are also subtracted (compared), forming higher derivatives and matches:
ddd, x1dd, x2d  ( ddd: 3rd derivative,  x1dd: d of 2-input-distant d s,  x2d: d of 2-input-distant inputs)
     /       \
dd, x1d  dd, x1d  ( dd: 2nd derivative, x1d = d+d = difference between 1-input-distant inputs)
     / \   / \
  d   d d      ( d: difference between consecutive inputs)
 /   \   /    \   /    \
i   i       i   i ( i: initial inputs)
As always, match is a smaller input, cached or restored, selected by the sign of a difference.
Comparison of both types is between all same-type variable pairs from different inputs.
Total match includes match of all its derivation orders, which will overlap for proximate inputs.
Incremental cost of cross-comparison is the same for all derivation orders. If projected match is equal to projected miss, then additive value for different orders of the same inputs is also the same: reduction in projected magnitude of differences will be equal to reduction in projected match between distant inputs?

multi-input search extension, evaluation of selection per input: tentative


On the next level, average match from expansion is compared to that from shorter-distance comparison, and resulting difference is decay of average match with distance. Again, this decay drives re-evaluation per expansion: selection of inputs with projected decayed match above average per comparison cost.
Projected match is also adjusted by prior match (if local decay?) and redundancy (symmetrical if no decay?)
Slower decay will reduce value of selection per expansion because fewer positive inputs will turn negative:
Value of selection = Σ |comp cost of neg-value inputs| - selection cost (average saved cost or relative delay?)
This value is summed between higher-level inputs, into average value of selection per increment of distance. Increments with negative value of selection should be compared without re-evaluation, adding to minimal number of comparisons per selection, which is evaluated for feedback as a comparison-depth filter:
Σ (selection value per increment) -> average selection value;; for negative patterns of each depth, | >1 only?
depth adjustment value = average selection value; while (|average selection value| > selection cost){
depth adjustment ±±; depth adjustment value -= selection value per increment (depth-specific?); };
depth adjustment > minimal per feedback? >> lower-level depth filter;; additive depth = adjustment value?
- match filter is summed and evaluated per current comparison depth?
- selected positive relative matches don’t reduce the benefit of pruning-out negative ones.
- skip if negative selection value: selected positive matches < selection cost: average value or relative delay?
Each input forms a queue of matches and misses relative to templates within comparison depth filter. These derivatives, both discrete and summed, overlap for inputs within each other’s search span. But representations of discrete derivatives can be reused, redundancy is only necessary for parallel comparison.
Assuming that environment is not random, similarity between inputs declines with spatio-temporal distance. To maintain proximity, a n-input search is FIFO: input is compared to all templates up to maximal distance, then added to the queue as a new template, while the oldest template is outputted into pattern-wide queue.

value-proportional combination of patterns: tentative  
Summation of +dP and -dP is weighted by their value: L (summed d-sign match) + M (summed i match).
Such relative probability of +dP vs. - dP is indicated by corresponding ratios: rL = +L/-L, and rM = +M/-M.
(Ls and Ms are compared by division: comparison power should be higher for more predictive variables).
But weighting complementation incurs costs, which must be justified by value of ratio. So, division should be of variable length, continued while the ratio is above average. This is shown below for Ls, also applies to Ms:
dL = +L - -L, mL = min (+L, -L); nL =0; fL=0; efL=1; // nL: L multiple, fL: L fraction, efL: extended fraction.
while (dL > adL){ dL = |dL|; // all Ls are positive; dL is evaluated for long division by adL: average dL.
while (dL > 0){ dL -= mL; nL++;} dL -= mL/2; dL >0? fL+= efL; efL/=2;} // ratio: rL= nL + fL.
Ms’ long-division evaluation is weighted by rL: projected rM value = dM * nL (reduced-resolution rL) - adM.
Ms are then combined: cM = +M + -M * rL; // rL is relative probability of -M across iterated cL.
Ms are not projected (M+= D * rcL * rM D (MD/cD) /2): precision of higher-level rM D is below that of rM?
Prior ratios are combination rates: rL is probability of -M, and combined rL and rM (cr) is probability of -D.
If rM < arM, cr = rL, else: cr = (+L + +M) / (-L + -M) // cr = √(rL * rM) would lose L vs. M weighting.
cr predicts match of weighted cD between cdPs, where negative-dP variable is multiplied by above-average match ratio before combination: cD = +D + -D * cr. // after un-weighted comparison between Ds?
Averages: arL, arM, acr, are feedback of ratios that co-occur with above-average match of span-normalized variables, vs. input variables. Another feedback is averages that evaluate long division: adL, adM, adD.
Both are feedback of positive C pattern, which represents these variables, inputted & evaluated on 3rd level.
; or 4th level: value of dPs * ratio is compared to value of dPs, & the difference is multiplied by cL / hLe cL?
Comparison of opposite-sign Ds forms negative match = smaller |D|, and positive difference dD = +D+ |-D|.
dD magnitude predicts its match, not further combination. Single comparison is cheaper than its evaluation.
Comparison is by division if larger |D| co-occurs with hLe nD of above-average predictive value (division is sign-neutral & reductive). But average nD value is below the cost of evaluation, except if positive feedback?
So, default operations for L, M, D of complementary dPs are comparison by long division and combination.
D combination: +D -D*cr, vs. - cD * cr: +D vs. -D weighting is lost, meaningless if cD=0?
Combination by division is predictive if the ratio is matching on higher level (hLe) & acr is fed back as filter?
Resulting variables: cL, rL, cM, rM, cr, cD, dD, form top level of cdP: complemented dP.

Level 3: prevaluation of projected filter patterns, forming updated-input patterns
(out of date)

3rd level inputs are ± V patterns, combined into complemented V patterns. Positive V patterns include derivatives of 1st level match, which project match within future inputs (D patterns only represent and project derivatives of magnitude). Such projected-inputs-match is pre-valuated, negative prevalue-span inputs are summed or skipped (reloaded), and positive prevalue-span inputs are evaluated or even directly compared.
Initial upward (Λ) prevaluation by E filter selects for evaluation of V patterns, within resulting ± E patterns. Resulting prevalue is also projected downward (V), to select future input spans for evaluation, vs. summation or skipping. The span is of projecting V pattern, same as of lower hierarchy. Prevaluation is then iterated over multiple projected-input spans, as long as last |prevalue| remains above average for the cost of prevaluation.
Additional interference of iterated negative projection is stronger than positive projection of lower levels, and should flush them out of pipeline. This flushing need not be final, spans of negative projected value may be stored in buffers, to delay the loss. Buffers are implemented in slower and cheaper media (tape vs. RAM) and accessed if associated patterns match on a higher level, thus project above-average match among their inputs.

Iterative back-projection is evaluated starting from 3rd level: to be projectable the input must represent derivatives of value, which are formed starting from 2nd level. Compare this to 2nd level evaluation:
Λ for input, V for V filter, iterated within V pattern. Similar sub-iteration in E pattern?
Evaluation value = projected-inputs-match - E filter: average input match that co-occurs with average  higher-level match per evaluation (thus accounting for evaluation costs + selected comparison costs). Compare this to V filter that selects for 2nd level comparison: average input match that co-occurs with average higher-level match per comparison (thus accounting for costs of default cross-comparison only).
E filter feedback starts from 4th level of search, because its inputs represent pre-valuated lower-level inputs.
4th level also pre-pre-valuates vs. prevaluation filter, forming pre-prevalue that determines prevaluation vs. summation of next input span. And so on: the order of evaluation increases with the level of search.
Higher levels are increasingly selective in their inputs, because they additionally select by higher orders derived on these levels: magnitude ) match and difference of magnitude ) match and difference of match, etc.
Feedback of prevaluation is ± pre-filter: binary evaluation-value sign that determines evaluating vs. skipping initial inputs within projected span, and flushing those already pipelined within lower levels.
Negative feedback may be iterated, forming a skip span.
Parallel lower hierarchies & skip spans may be assigned to different external sources or their internal buffers.
Filter update feedback is level-sequential, but pre-filter feedback is sent to all lower levels at once.
Pre-filter is defined per input, and then sequentially translated into pre-filters of higher derivation levels:
prior value += prior match -> value sign: next-level pre-filter. If there are multiple pre-filters of different evaluation orders from corresponding levels, they AND & define infra-patterns: sign ( input ( derivatives.
filter update evaluation and feedback

Negative evaluation-value blocks input evaluation (thus comparison) and filter updating on all lower levels. Not-evaluated input spans (gaps) are also outputted, which will increase coordinate range per contents of both higher-level inputs and lower-level feedback. Gaps represent negative projected-match value, which must be combined with positive value of subsequent span to evaluate comparison across the gap on a higher level. This is similar to evaluation of combined positive + negative relative match spans, explained above.
Blocking locations with expected inputs will result in preference for exploration & discovery of new patterns, vs. confirmation of the old ones. It is the opposite of upward selection for stronger patterns, but sign reversal in selection criteria is basic feature of any feedback, starting with average match & derivatives.
Positive evaluation-value input spans are evaluated by lower-level filter, & this filter is evaluated for update:
combined update = (output update + output filter update / (same-filter span (fL) / output span)) /2.
both updates: -= last feedback, equal-weighted because higher-level distance is compensated by range: fL?
update value = combined update - update filter: average update per average higher-level additive match.
also differential costs of feedback transfer across locations (vs. delay) + representation + filter conversion?
If update value is negative: fL += new inputs, subdivided by their positive or negative predictive value spans.
If update value is positive: lower-level filter += combined update, new fL (with new filter representation) is initialized on a current level, while current-level part of old fL is outputted and evaluated as next-level input.
In turn, the filter gets updates from higher-level outputs, included in higher-higher-level positive patterns by that level’s filter. Hence, each filter represents combined span-normalized feedback from all higher levels, of exponentially growing span and reduced update frequency.
Deeper hierarchy should block greater proportion of inputs. At the same time, increasing number of levels contribute to projected additive match, which may justify deeper search within selected spans.
Higher-level outputs are more distant from current input due to elevation delay, but their projection range is also greater. So, outputs of all levels have the same relative distance (distance/range) to a next input, and are equal-weighted in combined update. But if input span is skipped, relative distance of skip-initiating pattern to next input span will increase, and its predictive value will decrease. Hence, that pattern should be flushed or at least combined with a higher-level one:

combined V prevalue = higher-level V prevalue + ((current-level V prevalue - higher-level V prevalue) / ((current-level span / distance) / (higher-level span / distance)) /2. // the difference between current-level and higher-level prevalues is reduced by the ratio of their relative distances.
To speed up selection, filter updates can be sent to all lower levels in parallel. Multiple direct filter updates are span-normalized and compared at a target level, and the differences are summed in combined update. This combination is equal-weighted because all levels have the same span-per-distance to next input, where the distance is the delay of feedback during elevation. // this happens automatically in level-sequential feedback?

combined update = filter update + distance-normalized difference between output & filter updates:
((output update - filter update) / (output relative distance / higher-output relative distance)) /2.
This combination method is accurate for post-skipped input spans, as well as next input span.

- filter can also be replaced by output + higher-level filter /2, but value of such feedback is not known.
- possible fixed-rate sampling, to save on feedback evaluation if slow decay, ~ deep feedforward search?
- selection can be by patterns, derivation orders, sub-patterns within an order, or individual variables?
- match across distance also projects across distance: additive match = relative match * skipped distance?
cross-level shortcuts: higher-level sub-filters and symbols
After individual input comparison, if match of a current scale (length-of-a-length…) projects positive relative match of input lower-scale / higher-derivation level, then the later is also cross-compared between the inputs.
Lower scale levels of a pattern represent old lower levels of a search hierarchy (current or buffered inputs).
So, feedback of lower scale levels goes down to corresponding search levels, forming shortcuts to preserve detail for higher levels. Feedback is generally negative: expectations are redundant to inputs. But specifying feedback may be positive: lower-level details are novel to a pattern, & projected to match with it in the future.
Higher-span comparison power is increased if lower-span comparison match is below average:
variable subtraction ) span division ) super-span logarithm?
Shortcuts to individual higher-level inputs form a queue of sub-filters on a lower level, possibly represented by a queue-wide pre-filter. So, a level has one filter per parallel higher level, and sub-filter for each specified sub-pattern. Sub-filters of incrementally distant inputs are redundant to all previous ones.
Corresponding input value = match - sub-filter value * rate of match to sub-filter * redundancy?  
Shortcut to a whole level won’t speed-up search: higher-level search delay > lower-hierarchy search delay.
Resolution and parameter range may also increase through interaction of co-located counter-projections?
Symbols, for communication among systems that have common high-level concepts but no direct interface, are “co-author identification” shortcuts: their recognition and interpretation is performed on different levels.
Higher-level patterns have increasing number of derivation levels, that represent corresponding lower search levels, and project across multiple higher search levels, each evaluated separately?
Match across discontinuity may be due to additional dimensions or internal gaps within patterns.
Search depth may also be increased by cross-comparison between levels of scale within a pattern: match across multiple scale levels also projects over multiple higher- and lower- scale levels? Such comparison between variable types within a pattern would be of a higher order:

5. Comparison between variable types within a pattern (tentative)



To reiterate, elevation increases syntactic complexity of patterns: the number of different variable types within them. Syntax is identification of these types by their position (syntactic coordinate) within a pattern. This is analogous to recognizing parts of speech by their position within a sentence.
Syntax “synchronizes” same-type variables for comparison | aggregation between input patterns. Access is hierarchical, starting from sign->value levels within each variable of difference and relative match: sign is compared first, forming + and - segments, which are then evaluated for comparison of their values.

Syntactic expansion is pruned by selective comparison vs. aggregation of individual variable types within input patterns, over each coordinate type or resolution. As with templates, minimal aggregation span is resolution of individual inputs, & maximal span is determined by average magnitude (thus match) of new derivatives on a higher level. Hence, a basic comparison cycle generates queues of interlaced individual & aggregate derivatives at each template variable, and conditional higher derivatives on each of the former.

Sufficiently complex syntax or predictive variables will justify comparing across “syntactic“ coordinates within a pattern, analogous to comparison across external coordinates. In fact, that’s what higher-power comparisons do. For example, division is an iterative comparison between difference & match: within a pattern (external coordinate), but across derivation (syntactic coordinate).
Also cross-variable is comparison between orders of match in a pattern: magnitude, match, match-of-match... This starts from comparison between match & magnitude: match rate (mr) = match / magnitude. Match rate can then be used to project match from magnitude: match = magnitude * output mr * filter mr.
In this manner, mr of each match order adjusts intra-order-derived sequentially higher-order match:
match *= lower inter-order mr. Additive match is then projected from adjusted matches & their derivatives.
This inter-order projection continues up to the top order of match within a pattern, which is the ultimate selection criterion because that’s what’s left matching on the top level of search.
Inter-order vectors are ΛV symmetrical, but ΛV derivatives from lower order of match are also projected for higher-order match, at the same rate as the match itself?
Also possible is comparison across syntactic gaps: ΛY comparison -> difference, filter feedback VY hierarchy. For example, comparison between dimensions of a multi-D pattern will form possibly recurrent proportions.

Internal comparisons can further compress a pattern, but at the cost of adding a higher-order syntax, which means that they must be increasingly selective. This selection will increase “discontinuity” over syntactic coordinates: operations necessary to convert the variables before comparison. Eventually, such operators will become large enough to merit direct comparisons among them. This will produce algebraic equations, where the match (compression) is a reduction in the number of operations needed to produce a result.

The first such short-cut would be a version of Pythagorean theorem, discovered during search in 2D (part 6) to compute cosines. If we compare 2D-adjacent 1D Ls by division, over 1D distance and derivatives (an angle), partly matching ratio between the ratio of 1D Ls and a 2nd derivative of 1D distance will be a cosine.
Cosines are necessary to normalize all derivatives and lengths (Ls) to a value they have when orthogonal to 1D scan lines (more in part 6).

Such normalization for a POV angle is similar to dimensionality reduction in Machine Learning, but is much more efficient because it is secondary to selective dimensionality expansion. It’s not really “reduction”: dimensionality is prioritized rather than reduced. That is, the dimension of pattern’s main axis is maximized, and dimensions sequentially orthogonal to higher axes are correspondingly minimized. The process of discovering these axes is so basic that it might be hard-wired in animals.


6. Cartesian dimensions and sensory modalities



This is a recapitulation and expansion on incremental dimensionality introduced in part 2.
Term “dimension” here is reserved for a parameter that defines sequence and distance among inputs, initially Cartesian dimensions + Time. This is different from terminology of combinatorial search, where dimension is any parameter of an input, and their external order and distance don’t matter. My term for that is “variable“, external dimensions become types of a variable only after being encoded within input patterns.

For those with ANN background, I want to stress that a level of search in my approach is 1D queue of inputs, not a layer of nodes. The inputs to a node are combined regardless of difference and distance between them (the distance is the difference between laminar coordinates of source “neurons”).
These derivatives are essential because value of any prediction = precision of what * precision of where. Coordinates and co-derived differences are not represented in ANNs, so they can't be used to calculate Euclidean vectors. Without such vectors, prediction and selection of where must remain extremely crude.

Also, layers in ANN are orthogonal to the direction of input flow, so hierarchy is at least 2D. The direction of inputs to my queues is in the same dimension as the queue itself, which means that my core algorithm is 1D. A hierarchy of 1D queues is the most incremental way to expand search: we can add or extend only one coordinate at a time. This allows algorithm to select inputs that are predictive enough to justify the cost of representing additional coordinate and corresponding derivatives. Again, such incremental syntax expansion is my core principle, because it enables selective (thus scalable) search.

A common objection is that images are “naturally” 2D, and our space-time is 4D. Of course, these empirical facts are practically universal in our environment. But, a core cognitive algorithm must be able to discover and forget any empirical specifics on its own. Additional dimensions can be discovered as some general periodicity in the input flow: distances between matching inputs are compared, match between these distances indicates a period of lower dimension, and recurring periods form higher-dimension coordinate.

But as a practical shortcut to expensive dimension-discovery process, initial levels should be designed to specialize in sequentially higher spatial dimensions: 1D scan lines, 2D frames, 3D set of confocal “eyes“, 4D temporal sequence. These levels discover contiguous (positive match) patterns of increasing dimensionality:
1D line segments, 2D blobs, 3D objects, 4D processes. Higher 4D cycles form hierarchy of multi-dimensional orders of scale, integrated over time or distributed sensors. These higher cycles compare discontinuous patterns. Corresponding dimensions may not be aligned across cycles of different scale order.

Explicit coordinates and incremental dimensionality are unconventional. But the key for scalable search is input selection, which must be guided by cost-benefit analysis. Benefit is projected match of patterns, and cost is representational complexity per pattern. Any increase in complexity must be justified by corresponding increase in discovered and projected match of selected patterns. Initial inputs have no known match, thus must have minimal complexity: single-variable “what”, such as brightness of a grey-scale pixel, and single-variable “where”: pixel’s coordinate in one Cartesian dimension.
Single coordinate means that comparison between pixels must be contained within 1D (horizontal) scan line, otherwise their coordinates are not comparable and can’t be used to select locations for extended search. Selection for contiguous or proximate search across scan lines requires second (vertical) coordinate. That increases costs, thus must be selective according to projected match, discovered by past comparisons within 1D scan line. So, comparison across scan lines must be done on 2nd level of search. And so on.
Dimensions are added in the order of decreasing rate of change. This means spatial dimensions are scanned first: their rate of change can be sped-up by moving sensors. Comparison over purely temporal sequence is delayed until accumulated change / variation justifies search for additional patterns. Temporal sequence is the original dimension, but it is mapped on spatial dimensions until spatial continuum is exhausted. Dimensionality represented by patterns is increasing on higher levels, but each level is 1D queue of patterns.
Also independently discoverable are derived coordinates: any variable with cumulative match that correlates with combined cumulative match of all other variables in a pattern. Such correlation makes a variable useful for sequencing patterns before cross-comparison.
It is discovered by summing matches for same-type variables between input patterns, then cross-comparing summed matches between all variables of a pattern. Variable with the highest resulting match of match (mm) is a candidate coordinate. That mm is then compared to mm of current coordinate. If the difference is greater than cost of reordering future inputs, sequencing feedback is sent to lower levels or sensors.

Another type of empirically distinct variables is different sensory modalities: colors, sound and pitch, and so on, including artificial senses. Each modality is processed separately, up a level where match between patterns of different modalities but same scope exceeds match between unimodal patterns across increased distance. Subsequent search will form multi-modal patterns within common S-T frame of reference.

As with external dimensions, difference between modalities can be pre-defined or discovered. If the latter, inputs of different modalities are initially mixed, then segregated by feedback. Also as with dimensions, my core algorithm only assumes single-modal inputs, pre-defining multiple modalities would be an add-on.



7. Notes on clustering, ANNs, and probabilistic inference



In terms of conventional machine learning, my approach is a form of hierarchical fuzzy clustering. Cluster is simply a different term for pattern: a set of matching inputs. Each set is represented by a centroid: an input with below-threshold combined “distance” to other inputs of the set. The equivalent of centroid in my model is an input with above-average match (a complementary of a distance) to other inputs within its search span. Such inputs are selected to search next-level queue and to indirectly represent other cross-compared but not selected inputs, via their discrete or aggregate derivatives relative to the selected one.

Crucial differences here is that conventional clustering methods initialize centroids with arbitrary random weights, while I use matches (and so on) from past comparisons. And the weights are usually defined in terms of one variable, while I select higher-level inputs based on a combination of all variables per pattern, the number of which increases with the level of search.
Current methods in unsupervised learning were developed / accepted because they solved specific problems with reasonable resources. But, they aren’t comparable to human learning in scalability. I believe that requires an upfront investment in incremental complexity of representation: a syntactic overhead that makes such representation uncompetitive at short runs, but is necessary to predictively prune longer-range search.

The most basic example here is my use of explicit coordinates, and of input differences at their distances. I haven’t seen that in other low-level approaches, yet they are absolutely necessary to form Euclidean vectors. Explicit coordinates is why I start image processing with 1D scan lines, - another thing that no one else does. Images seem to be “naturally” 2D, but expanding search in 2Ds at once adds the extra cost of two sets of new coordinates & derivatives. On the other hand, adding 1D at a time allows to select inputs for each additional layer of syntax, reducing overall (number of variables * number of inputs) costs of search on the next level.


Artificial Neural Networks


The same coordinate-blind mindset pervades ANNs. Their learning is probabilistic: match is determined as an aggregate of multiple weighted inputs. Again, no derivatives (0D miss), or coordinates (1-4D miss) per individual input pair are recorded. Without them, there is no Euclidean vectors, thus pattern prediction must remain extremely crude. I use aggregation extensively, but this degradation of resolution is conditional on results of prior comparison between inputs. Neurons simply don’t have the capacity for primary comparison.

This creates a crucial difference in the way patterns are represented in the brain vs. my hierarchy of queues. Brain consists of neurons, each likely representing a variable of a given magnitude and modality. These variables are shared among multiple patterns, or co-activated networks (“cognits” in terms of J. Fuster).
This is conceptually perverse: relatively constant values of specific variables define a pattern, but there’s no reason that same values should be shared among different patterns. The brain is forced to share variables because it has fixed number of neurons, but a fluid and far greater number of their networks.
I think this is responsible for our crude taxonomy, such as using large, medium, small instead of specific numbers. So, it’s not surprising that our minds are largely irrational, even leaving aside all the sub-cortical nonsense. We don’t have to slavishly copy these constraints. Logically, a co-occurring set of variables should be localized within a pattern. This requires more local memory for redundant representations, but will reduce the need for interconnect and transfers to access global shared memory, which is far more expensive.
More broadly, neural network-centric mindset itself is detrimental, any function must be initially conceptualized as a sequential algorithm, parallelization is an optional superstructure.


Probabilistic inference: AIT, Bayesian logic, Markov models


A good introduction to Algorithmic information theory is Philosophical Treatise of Universal Induction by Rathmanner and Hutter. The criterion is same as mine: compression and prediction. But, while a progress vs. frequentist probability calculus, both AIT and Bayesian inference still assume a prior, which doesn’t belong in a consistently inductive approach. In my approach, priors or models are simply past inputs and their patterns. Subject-specific priors could speed-up learning, but unsupervised pattern discovery algorithm must be the core on which such shortcuts are added or removed from.

More importantly, as with most contemporary approaches, Bayesian learning is statistical and probabilistic. Probability is estimated from simple incidence of events, which I think is way too coarse. These events hardly ever match or miss precisely, so their similarity should be quantified. This would add a whole new dimension of micro-grayscale: partial match, as in my approach, vs. binary incidence in probabilistic inference. It should improve accuracy to the same extent that probabilistic inference improved on classical logic, by adding a macro-grayscale of partial probability vs. binary true | false values of the former.

Resolution of my inputs is always greater than that of their coordinates, while Bayesian inference and AIT typically start with the reverse: strings of 1bit inputs. These inputs, binary confirmations / disconfirmations, are extremely crude way to represent “events” or inputs. Besides, the events are assumed to be high-level concepts: the kind that occupy our conscious minds and are derived from senses by subconscious cognitive processes, which must be built into general algorithm. Such choice of initial inputs in BI and AIT demonstrates a lack of discipline in incrementing complexity.

To attempt a general intelligence, Solomonoff introduced “universal prior“: a class of all models. That class is a priori infinite, which means that he hits combinatorial explosion even *before* receiving actual inputs. It‘s a solution that only a mathematician may find interesting. Marginally practical implementation of AIT is Levin Search, which randomly generates models / algorithms of incremental complexity and selects those that happen to solve a problem or compress a bit string.

Again, I think starting with prior models is putting a cart before a horse: cognition must start with raw data, complex math only becomes cost-efficient on much higher levels of selection and generalization. And this distinction between input patterns and pattern discovery process is only valid within a level: algorithm is embedded in resulting patterns, hence is also compared on higher levels, forming “algorithmic patterns“.



8. Notes on working mindset and a prize for contributions



My terminology is as general as the subject itself. It’s a major confounder, - people crave context, but generalization is decontextualization. And cognitive algorithm is a meta-generalization: the only thing in common for everything we learn. This introduction is very compressed, because much of it is work in progress. But I think it also reflects and cultivates ruthlessly reductionist mindset required for such subject.
My math is very simple, because algorithmic complexity must be incremental. Advanced math can accelerate learning on higher levels of generalization, but it’s too expensive for initial levels. And general learning algorithm must be able to discover computational shortcuts (= math) on it’s own, just like we do. Complex math definitely isn’t innate in humans on any level: cavemen didn’t do calculus.

This work may seem too speculative, but any degree of generalization must be correspondingly lossy. Which is contrary to deduction-heavy culture of math and computer science. Hence, current Machine Learning is mostly experimental, with constrained tasks and supervised techniques. A handful of people aspire to work on AGI, but they either lack or neglect functional definition of intelligence, their theories are only vague inspiration.
I think working on this level demands greater delay of experimental verification than is acceptable in any established field. Except for philosophy, which has nothing else real to study. But established philosophers have always been dysfunctional fluffers, not surprisingly as their paying customers are college freshmen.

Our main challenge in formalizing GI is a specie-wide ADHD. We didn’t evolve for sustained focus on this level of generalization, that would cause extinction long before any tangible results. Which is no longer a risk, GI is the most important problem conceivable, and we have plenty of computing power for anything better than brute-force algorithms. But our psychology lags a light-year behind technology: we still hobble on mental crutches of irrelevant authority and peer support, flawed analogies and needless experimentation.



Prize for contributions
I offer prizes up to a total of $500K for debugging, optimizing, and extending this algorithm: github.
Contributions must fit into incremental-complexity hierarchy outlined here. Unless you find a flaw in my reasoning, which would be even more valuable. I can also pay monthly, but there must be a track record.
Winners will have an option to convert the awards into an interest in all commercial applications of a final algorithm, at the rate of $10K per 1% share. This would be informal and likely unprofitable, mine is not a commercial enterprise. Money can’t be primary motivation here, but it saves time.

Winners so far:

2010: Todor Arnaudov, $600 for suggestion to buffer old inputs after search. This occurred to me more than once before, but I rejected it as redundant to potential elevation of these inputs. Now that he made me think about it again, I realized that partial redundancy can preserve the detail at much lower cost than elevation.
The buffer is accessed if coordinates of contents get relatively close to projected inputs (that and justification is mine). It didn’t feel right because brain has no substrate for passive memory, but we do now.

2011: Todor, $400 consolation prize for understanding some ideas that were not clearly explained here.
2014: Dan He, $600 for pushing me to be more specific and to compare my algorithm with others.
2016: Todor, $500 for multiple suggestions on implementing the algorithm, as well as for the effort.
Kieran Greer, $375 for an attempt to implement my level 1 pseudo code in C#

2017: Alexander Loschilov, $2800 for help in converting my level 1 pseudo code into Python, consulting on PyCharm and SciPy, and for insistence on 2D clustering, February-April.
Todor: $2000 for help in optimizing level_1_2D, June-July.
Kapil Kashyap: $ 2000 for stimulation and effort, help with Python and level_1_2D, September-October

84 comments:

Aaron Hosford said...

I wish you had illustrations to make this easier to follow. I think what I'm missing is sufficient context to follow your initial development of your ideas.

Boris Kazachenko said...

I am not sure about illustrations Aaron, they may mislead rather than explain. The only macro "architecture" I have is a hierarchy of generalization. There are micro- hierarchies of derivation within each level, but they have variable / incremental depth & dimensionality. I don't think you're missing context, - there isn't any. Generalization is decontextualization, & nothing's more general than GI. It comes with the territory, - you pay for an altitude with a lack of air:). But I'll think about illustrations, & please let me know if you have specific questions.

Boris Kazachenko said...

Alper E:

I need an example (such as character recognition, or pattern recognition) in relation to which I can think on the algorithm. You already mention "pixels across a 1 dimensional scan line". Do you have a more elaborate example and a description of the problem you are trying to tackle with this alogrithm?

Thanks,
Alper

Boris Kazachenko said...

All of this is pattern recognition, Alper, according to my definition of a pattern (haven't seen any constructive alternative to that). It's just that for me a pattern can be of any level of complexity, & you are probably thinking of something specific. For example, a character is a contiguous 2D pattern. So in my scheme it would be formed on the second level, out of 1D segments (even of length = 1) contiguously matching in 2nd D (across scan lines). "Contiguous" means that minimal d_coordinate between any constituent pixels of the segments = 1, in both dimensions. These 2D patterns (characters) will be automatically compared & recognized on higher levels: 3D & higher contiguous or 2D & higher discontinuous search. This comparison will be between same-type variables of both comparands (multi-variate patterns). These variables will include presumably-identical brightness & line-width (angle (d_1D_coordinate) -normalized 1D segments' length), & character-specific length (angle-normalized 2nd D height). This 2D length will mediate constituent sub-patterns: 2D-contiguous sequence of angles & 2D lines, which will be compared across patterns if their macro-patterns (above) match. Overall match will be a sum of matches between individual variables & sub-patterns, each possibly weighted by feedback.
This is a very rough description, I don't design this algorithm for character recognition, or any specific task. It should simply recognize anything recognizable, on corresponding level of search & with sufficient resources.
My "problem" is efficient search, where efficiency is resource allocation in proportion to predictive value, on all levels of representation. Predictive value is past_match - system-average_match, both projected by co-derived partial misses: feed-forward & feed-back vectors.

Boris Kazachenko said...

Alper E said:

It now becomes a little bit more clear (but, only a little bit :)

Do you have a description of your algorithm in a formal language (you had mentioned a pseudo language). I really need a formal description of it, because many of the terms you use do not look clear to me. Moreover, it'd be fine if we stick to a simple example, like character recognition.

How would you propose to proceed?

Alper

Boris Kazachenko said...

That pseudo is my own very high-level language, I sent you a sample. I tried more formal descriptions, but it's very awkward to think in these terms, I need more rhyme & (lossy) compression. None of that pseudo is about 2D, though. I currently work on the core algorithm, which is 1D: time-only. Spatial dimensions are empirically-specific add-ons to that core algorithm. Still worth thinking about, might be suggestive beyond spatial specifics. So, we can talk in terms of character recognition. But email is a low through-put medium, especially as I am a slow typist. For me to initially explain my terms, it should be question-answer in real-time, that's why I suggested skype or conference.

SeanVN said...

I think since so little is know about the algorithmic structure of the human brain you had better leave it to evolutionary algorithms to forge an AI. It would seem from the evolutionary time-line here on earth that creating higher intelligence is a trivial matter compared to perfecting the biochemical basis of life. I think you give an evolutionary algorithm as facile a medium to work with as possible and then, simply let it work. The medium of which I speak is probably some form of patterned arithmetic structure where where mutations do not generally have catastrophic results. And where there is an extremely high level of malleability with regard to what is constructed.

Boris Kazachenko said...

>I think since so little is know about the algorithmic structure of the human brain

I know enough to realize that it's a misshapen kludge,- a worthy product of dumb & blind evolution.

> you had better leave it to evolutionary algorithms to forge an AI.

We did, for 3B years.
Now that we have brains, why not use them?

> creating higher intelligence is a trivial matter compared to perfecting the biochemical basis of life.

That "biochemical basis" is all "arithmetic" too. If you design an algorithm to maximize its own reproduction, then you're starting from 3B years ago.
And if you want to maximize intelligence, then you need to define it functionally, which is where I start in this intro.

> The medium of which I speak ...

You can't get a *general* intelligence to evolve in a severely *constrained* medium.

Boris Kazachenko said...

Laszlo Abraham:

Hi, interesting "project". Read your article (and the indicated literature). In my opinion, your approach to this domain is too philosophical. You set a so high level of generality that keeps you out from any concrete (usable) discovery / solution .

The problem is not to understand your ideas. Any interested scholar, with a sufficient culture in the domain, at a moment, obviously arrives to these conclusions about intelligence and pattern matching. The problems are starting only from here.

If you want to say something new (usable for others too) you must dive in the details. You must split the general problem in small, maybe banal and trivial parts, until the solution is feasible and achievable. This is what Plato names diaeresis. From your level of generality this is impossible. You are blocked there. You want a too big junk with one bite. Philosophy will never solve any concrete problem.

Keep in mind what Henry Ford said: "Nothing is impossible if we could disassemble it in small (simple, solvable) parts". Every serious researcher in the domain is doing this. do not disdain them. This not means that they all miss the general view you just discovered. This means they want more: DO something, even if they could not solve alone all the problems. This is a cooperative game.

Of course we must see the forest, not only the trees. It is important, it will guide us. But, in the same time, never could make a fire with the "forest", only with woods.

Told you all these because (it seems) you traversed the same stages as I did years ago. My critics want to be constructive, want to help you. Don't be offended.

I am working in this domain from more then 20 years. Have some ideas, what and how to do. Never had enough time (money) to work only on these ideas. Would be very happy if could realize something together, but ONLY if you don't want the Moon from the sky in one step, and we have a clear (partial) target every moment.

Waiting for your reply,

Boris Kazachenko said...

I think it's a solvable problem, Laszlo: we have a solution between the ears. And I did define it, from the very beginning. It's true that no solution is final, - there is no limit to how efficient / scalable a search can be.
But there must be a strictly incremental-complexity progression of improving scalability without losing generality, as I outlined. Small steps, as you say.

The difference is that I want these steps to implement first principles, & you want them to solve specific problems. That's a common dilemma in such work, - scalability vs. quick fixes.
You say any reasonable scholar would agree with my conclusions, - name one? "Indicated literature" in my intro is precisely to explain fundamental differences between my approach & anything else I came across.

If you can show me one work that follows roughly the same progression as mine, - that's a big award right there. Even something as basic as defining a match: absolute, projected, relative, - as in part 2? Or *one* approach to clustering that proceeds with incremental spatio-temporal dimensionality, even if their definition of "distance" is not fully consistent with mine. Hello?
This is not philosophy, never mind Plato or Ford.

Re "partial targets, in my sense, I am currently trying to better define mean match, as a criterion for extended search.
Basic definition is: value of current-level projected match that co-occurs with (thus predicts) a mean projected higher-level match.
But I think that mean current-level projected match must be adjusted by projected consequences of its own feedback?

Would be interested in your ideas, of course. Can't promise to accept them, but if I don't, I will tell you exactly why.

Boris Kazachenko said...

Laszlo Abraham said:

Hi Boris,

Unfortunately, you feel offended. Sorry for this. Have no time to make a thorough (statement level) analysis of your theory. I think I understood the main message. Sorry, but I don't saw any original idea in it. You are using other words for the known things. This is not improving our knowledge in the domain. Additionally, the problem you "defined" is existing, at least, from 1676 (see "Characteristica universalis" of Leibnitz).

Since this domain (AI) existed long before we started to study it, and many smart people contributed to it with concrete and relevant results, it is an elementary expectation to use the already consecrated terminology when we are discussing about these things. For example, your "incremental spatial-temporal dimensionality" is the well known phase space in physics. Searching patterns in it is the identification problem in the dynamic systems theory, and the analog problem in the abstract notions world is modeled very well with the causal (knowledge) graphs. If you want to introduce a new terminology, you must first argument its necessity. It is very good if you want to discover some new things, but this is not a linguistic problem: renaming the notions and methods is not helping us at all. Let's speak the same language!

When you try to build a new theory, you must know almost everything about what others done in this direction, and respect their results. In other case, you will prove only your ignorance, if some will take the time to decrypt what you want to say. Also, when you expose your new theory, you must offer links to the already known and generally accepted facts, and your article must contain relevant citations. In academic circles this is an elementary expectation too. At least, you must explain why your theory is necessary, what improvements it brings, and what are YOUR ORIGINAL RESULTS (as Einstein did in 1904). If you want to advocate for a new method (theory), first you must have results, Maybe you're right, maybe your idea is genial, anyhow nobody will believe you till you prove it. So, let's start with the small problems!

Regarding the examples you ask: at a popular science level Kurzweil's, Hofstadter's and Mero's books contain all your ideas about the link between intelligence and pattern matching (and much more). At a scientific level, conforming to my knowledge, Schmidhuber's and Hutter's works are the examples which are working with the same notions / methods, and HAVE SPECTACULAR RESULTS. For example, google for the Hutter Price. This is a very good example how an award promising theoretical problem most be exposed.

Additionally, you could take a look on Judea Pearl's books, and the domains of knowledge graphs, chat robots ontology and the Loebner Price. In my opinion, you rediscovered the contextuality of the knowledge. Bravo, how we go further?

Of course, you are free to publish anything on your website, even on Elance. The risk is only to receive unpleasant comments from old dinosaurs as I am. Again, sorry if offended you, don't bother you more. Except if you really want to put at work the contextual model of the world.

Thank you for your attention,

Laszlo

Boris Kazachenko said...

As you admitted Laszlo, you didn’t get very far in my intro. Just recognized some basic points you already knew, & ignored parts you didn’t understand, - most of it.
Which means that you can’ judge whether it’s original. For example “phase space” has nothing to do with my incremental dimensionality, see part 3.
And other suggestions are pure fluff: “philosophy” as you say.

Except for two: I don’t use accepted terminology & build on other people’s work.
Well, there are several fields involved, all significantly off of my focus. And each has its own terminology, usually incoherent & mind-polluting.
Like I said, generalization is a reduction. You can spend many life-times reading related work, only to realize that it’s not related enough.
Or, you can define & describe the problem in general English terms & phrases that fit it best. Then you google them, & see that my site is on top of the list.
I don’t think the problem here is my terminology, it simply takes a while to acquire appropriate context (in my case, ~ 27 years).
I am dully impressed that there are plenty of smart people in the world, many smarter that I am. But, as with anything else, intelligence is not what you got, it’s how you use it.
And almost all use it mainly to succeed in life, according to societal standards. Yes, even in science: it’s all about peer review. Which is great, but only if you’re lucky enough to have peers.
Anyway, thanks for trying.

Boris Kazachenko said...

Dan He said:

I briefly read your blog, very interesting work. I have the following thoughts.

The proposed method is contrary to deep learning. In deep learning, the number of features in the model presentation is reduced layer by layer, starting from the full set of features. In your model, the number of features increases layer by layer, starting from single features. However, your model also allows cross layer communications from top layers down to the bottom layers, which is similar to deep learning. I think this is indeed how the brain works: when we learn something, we often jump forwards and backwards to connect different levels of details and information. I think this process should be developed in a more systematic way.

Human brains learn new knowledge, or patterns, differently. Someone may pay attention to the big picture first, someone may pay attention to details first. Someone may pay attention to shape first, someone may pay attention to color first. Someone may pay attention to red color first, someone may pay attention to blue color first. Thus there should not be a single "best" order of the dimensions to be selected, alternatively speaking, the incremental growth could start from different dimensions. Instead of fixing the order of the dimensions to be selected as proposed by the blog, we could use a merge-based strategy: we first select a set of size-1 dimensions that are significant enough, assuming we select dimensions A,B,C,D. Then we merge these significant dimensions into size-2 dimensions, AB,AC,AD,BC,BD,CD and then select from them the significant size-2 dimensions. Assuming we select AB,AC,BC,BD,CD. Then we merge these significant size-2 dimensions into size-3 dimensions, ABC,BCD. We repeat the process till no more dimensions can be merged. The advantage of the merge-based strategy is we allow different start points in the search space in a systematic way, which models the multiple possibilities of the brain functions.

One question I have is: the dimensions for the incremental selection seems to be different from the features of the data. For example, an image may have thousands of pixels and each one is considered as a feature. But it seems your dimension here is the 2-D space. Then when you make comparisons, how exactly you handle the features of the data, which is critical for the scalability of the method?

Finally, I would suggest you have a section of "concept definition" and "table of annotations". The current writing is kind of hard to follow as the definitions of some concepts and annotations are missing, or at least not easy to locate.

These are all my brief ideas, please let me know if you think we can talk more.

Thanks
Dan

Boris Kazachenko said...

Thanks for your comments Dan,

> In deep learning, the number of features in the model presentation is reduced layer by layer, starting from the full set of features...

"Features" in NNs correspond to two different things in my model:

1) Positionally distinct inputs / patterns, ordered along S-T dimensions within input flow: 1D length ) 2D width ) 3D depth ) 4D duration. These dimensions are defined by explicit coordinates in my model, but are not easily represented & compared in a NN, which I think is one of their basic flaws. The number of such patterns is reduced with elevation, just as it is in deep NN, because input patterns to higher levels are S-T contiguous spans of lower-level patterns.

2) Variables internal to these patterns, such as different original input modalities, & matches / differences derived from them by lower-level comparisons. Again, there is no such internal variables in NNs, instead, they are distributed across network / brain, & shared among multiple "patterns" or "cognits" in terms of J. Fuster (which are comprised of long-range connections to such variables). This is another basic flaw of NNs because such long-range connections are very expensive, & must be activated all at once. The number of these variables per pattern is increasing with elevation in my model because they are necessary to represent increasingly complex relationships discovered during lower levels of search. This is also true for neocortex, as the input flow proceeds from primary to association cortices. Again, in my model such incremental syntax is localised within a pattern, but is distributed across "neocortex" in neural models.

I tried to explain these things in parts 4 & 8, don't know if you made it that far.

> Someone may pay attention to the big picture first, someone may pay attention to details first...

"Attention" is a feedback from higher levels, you don't initially have it because higher levels are "empty", - they fill-in with generalized outputs of lower levels. A designer determines the order of feedforward, which can latter be altered by self-generated feedback.

> One question I have is: the dimensions for the incremental selection seems to be different from the features of the data. For example, an image may have thousands of pixels and each one is considered as a feature. But it seems your dimension here is the 2-D space.

It's 4D space-time, see above.

> Then when you make comparisons, how exactly you handle the features of the data, which is critical for the scalability of the method?

I compare same-type variables between different patterns: their syntax must "synchronized" before comparison.

> Finally, I would suggest you have a section of "concept definition" and "table of annotations". The current writing is kind of hard to follow as the definitions of some concepts and annotations are missing, or at least not easy to locate.

This would depend on reader's background, which I don't know. That's why I use general English terms, usually in their original meaning. I know it's confusing for a specialist, but my subject is *general* intelligence :).

Thanks again Dan, hopefully we can get into procedural details.

Boris Kazachenko said...

Daniel Michulle said:
Hi there,

I am reading your description on the web site and will put down some of my thoughts:

1. "Incremental": I am not sure whether incremental approaches to pattern discovery work since this requires sequential processing. That is, unless you mean 1a or 1b.
1a: Attention: Attention is a sequential process that looks for "interesting things". This would mean that you need a guidance system for attention that I presume should have the same structure as the cognitive apparatus you intend to create. This can be just another level in the hierarchy.
1b: Hierarchical increments which you already included by stating you want a hierarchic system.

2. Feedback: Feedback must be two way in your solution.
2a Either you are able to do this fully parallely or
2b you define an artificial discrete state and update it in artificial sequentially without any input/output occurring during this time or
2c or you select just some of the processing units in your cognitive apparatus to update and try to stay real-time

Anyhow, control flow seems a critical design decision.

PS: Found this today, nothing scientific but maybe a useful analogy: http://nautil.us/issue/12/feedback/ants-swarm-like-brains-think

3. Homoiconicity
In order to be efficient at implementing your Cognitive Apparatus (CA), you would do well in restricting input to some sort of data structure representable by modern day PCs. Bit-wise is an option but w.r.t. efficiency I presume that "array of double" is a better format. You can downscale later.

Also, in order to be able to create your CA in a finite amount of time, I presume it is much easier to define the output of any functional unit in your CA to be of the type of your selected input. In this way you can compose your CA of different units.

Still, you could later substitute some units by others without sacrificing composability.

Given these thoughts, its no wonder why often people like deep learning based on ANNs because they are homoiconic and composable.

4. Parameters

Your system will have a huge number of parameters and you will need to search for optimal parameters. You don't want to do this by hand so you'll need a meta-algorithm.
Up to now I just found one "universal algorithm" that works context free: Evolutionary algorithms where competition leads to improvement (you dont know how good your CA will be but if you have 10 you at least have a measure of how good it is compared to peers)

Moreover, if you like challenges (and it certinaly seems so), you can use genetic programming to find components. Also, you will obtain a measure for the overall cost of the system which may prove useful in finding:

5. local computation cost
Since you mention search cost, you will need an intrinsic measure of cost. I have no idea how you could implement that but brains do this via its neurotransmitters and vesicles. These are only slowly recharged and you cant spend more than you have. The disadvantage is that you need to model time in this case.

6. Background: I am a big fan of LISP. I encourage you to realize any of your ideas in Clojure because of homoiconicity and representation of logic (core.logic) and the huge amount of libraries in Clojure and Java.

I myself worked on two different AI "leaps":
6.1. A General Game Playing agent: An agent that is able to play any game (deterministic games with complete information, e.g. Chess, Checkers, Connect Four, Tic-Tac-Toe, Rock-Paper-Scissors)
6.2 A generic time series forecasting mechanism that takes any kind of time series data and is able to extract all the info for forecasting, including selecting its internal model, time sieres transformations, outlier detection , ...
I founded a company in 2012 for promoting the system and we have modest success right now.

That being said, I am a big fan of this kind of ideas but I lack time.

Hope you find this interesting!

BR, Daniel

Boris Kazachenko said...

Thanks for your comments, Daniel,

>1. "Incremental": I am not sure whether incremental approaches to pattern discovery work since this requires sequential processing.

Any pattern discovery is an iterative comparison to other inputs within a search range. I tried to explain how search should incrementally / hierarchically expand, let me know what's unclear.

> 1a: Attention: Attention is a sequential process that looks for "interesting things". This would mean that you need a guidance system for attention that I presume should have the same structure as the cognitive apparatus you intend to create. This can be just another level in the hierarchy.

Attention is simply another term for feedback down the hierarchy, which I also tried to explain.

> 2. Feedback: Feedback must be two way in your solution.
2a Either you are able to do this fully parallely

All levels should process in parallel, both feedforward & feedback.

> Anyhow, control flow seems a critical design decision.

"Control" is also another term for feedback.

> 3. Homoiconicity: In order to be efficient at implementing your Cognitive Apparatus (CA), you would do well in restricting input to some sort of data structure representable by modern day PCs. Bit-wise is an option but w.r.t. efficiency I presume that "array of double" is a better format. You can downscale later.

All I need is bit-mapped images.

> I presume it is much easier to define the output of any functional unit in your CA to be of the type of your selected input.

Basic units in my approach are levels, syntactic complexity of their inputs & outputs is incremental with elevation, as I tried to explain.

> 4. Parameters: Your system will have a huge number of parameters and you will need to search for optimal parameters. You don't want to do this by hand so you'll need a meta-algorithm. Up to now I just found one "universal algorithm" that works context free: Evolutionary algorithms where competition leads to improvement (you dont know how good your CA will be but if you have 10 you at least have a measure of how good it is compared to peers)

Input parameters are also altered by feedback, not at random as in GAs, see part 1.

> Since you mention search cost, you will need an intrinsic measure of cost. I have no idea how you could implement that

I use "opportunity cost": average projected match per search step. This is the most basic form of feedback, see part 2.

> The disadvantage is that you need to model time in this case.

I use explicit 4D space-time coordinates, see part 4.

> I encourage you to realize any of your ideas in Clojure because of homoiconicity and representation of logic (core.logic) and the huge amount of libraries in Clojure and Java.

My approach is expressed in its own language. The core algorithm is supposed to be self-contained, I don't plan to use any external libraries for it.

Sorry for being brisk. It's just that I've been repeating these things forever, don't know how to make it any clearer...

Boris Kazachenko said...

Dan He said:
Hi, Boris:
Sorry for the late response. I was traveling the last couple of weeks.
SO your pattern is positionally distinct patterns. What about other types of patterns? Is your model generic to all types of patterns or only to this type of specific patterns?
Also if the dimension is only 4-D, does incremental really help? I am
not sure if my understanding is correct, but it seems you will only have 4 layers?

Thanks
Dan

Boris Kazachenko said...

No problem Dan.

"Positionally" includes temporally distinct, - each level gets a temporal sequence of inputs, each with distinct temporal coordinate. In that sense, any type of pattern is positionally distinct. Initial "patterns" are pixels in 1D scan line, - temporal sequence that is mapped into spatial coordinate on a higher level.
On the other hand, all constituent variables co-occur in a pattern, thus are not positionally distinct. On a given level of differentiation, of course. A pattern may contain multiple levels of partial differences / derivatives relative to other patterns, which are positionally distinct.
Incremental dimensionality is: 1D 1st level, 2D 2nd level, 3D 3rd level,TD 4rth level.
This 4-level cycle repeats indefinitely, with patterns of hierarchically greater span / scale.
Dimensions in cycles of different order of scale are not necessarily aligned with each other.
Anyway, our 3D space is empirically specific & this dimensionality should be learnable.
My core algorithm is 1D, - time only.
The point of having incremental dimensionality is that it lets you filter-out weaker lower-D patterns before adding the costs of higher-D coordinates.

Boris Kazachenko said...

Dan He said:

If the level could be infinite, what do you think that we don't fix the order of the incremental search, instead we increase the levels by merging the lower levels? I think this way we would still consider all possible
combination of levels while pruning the search space effectively.

Dan

Boris Kazachenko said...

Not sure I understand you, Dan.
"Level" is a span of search before selection = elevation.
(It's actually more complex, see part 3, somewhat out of date)
Such span is "increment" in incremental search.
That span ends when the benefit of selection exceeds its cost.
The benefit of selection is (cost - benefit) of comparing sub-critical
(cost > benefit) patterns, which should be pruned-out.
This point is reached because the cost of comparison is fixed,
but the benefit (average match) declines with the distance
between comparands (if the data set is not random).

That point doesn't depend on total memory of the system,
so we fix the number of inputs per level, not the number of levels.
I am glad you are thinking about it Dan, but these ideas must be justified
:).

Boris Kazachenko said...

Dan He said:

Do you know Apriori algorithm? The idea is justified there. It's just a
matter whether it can be applied
here.

Dan

Boris Kazachenko said...

Looking at wiki, I don't see a justification.
Anyway, in breadth-first all data is available at once.
In real-time learning, inputs are searched in the order
they come in, & data set is never complete.
It's neither breadth- nor depth-, but proximity- first.

Boris Kazachenko said...

Dan He said:

Why we have to follow the input order? I mean we could cache the input
within a certain short amount of time window and the apriori algorithm can
then be applied, right?

Dan

Boris Kazachenko said...

We have to follow proximity between inputs because it's
the most basic predictor of similarity / match.
This may not sound obvious to you because you are used to
dealing with heavily pre-processed data. Compressive
pre-processing (starting with image-compression transforms)
already exploited this proximity->similarity correlation,
& you can't do that again.
I start from raw images because the algorithm has to scale
down in order to scale up. Generality / scalability means that the same principles should apply on all levels, only the syntax of input data is getting more complex.

"Time window" is my search level, & yes, it's breadth-first
within it. But apriori algorithm searches for binary match between
combinations among adjacent inputs, & I start by searching for
gray-scale match between individual inputs.
Again, this is because I start from the beginning, while the algorithms you are used to are designed to deal with preprocessed / human-generated symbolic data. That's why they don't scale.

Boris Kazachenko said...

Dan He said:

I don't think Apriori can only deal with binary match. You can use any matching criteria to determine if a candidate should be selected or not.
Also when you include a new dimension, it's either include or not, right?
That's the same thing for Apriori. Even if you allow partially include, you could still do the same thing in Aprior, with partial membership, something similar to fuzzy logic. I just don't see why you can not apply Aprior here.

Dan

Boris Kazachenko said...

I guess you can morph any algorithm into any other, if definitions are fuzzy enough.
That's why I work on the level of operations, rather than wholesale algorithms. What exactly do you suggest to change / add to my algorithm?

Boris Kazachenko said...

Dan He said:

Well, my suggestion is:

Instead of sequential incremental search, we might be able to do parallel incremental search from multiple dimensions, where we increase the dimensions in parallel by merging the dimensions of the current level to generate the candidates for the next level. At each level, the candidates are selected using any user specified rule.

Dan

Boris Kazachenko said...

With incremental dimensionality, a 2nd D (vertical) level
receives inputs (1D patterns) that were compressed &
selected on a lower level. And so on.
In your version, multiple levels would process in parallel
the same uncompressed & unselected inputs, which is
a horrible waste.

Incremental selection is my core principle: it's what
should make the algorithm scalable. Nice try :).

> At each level, the candidates are selected using any user
specified rule.

The algorithm is supposed to be general: autonomous &
unsupervised. Selection criteria is what it's all about, there's
no user in the loop.

BTW, I'd like to have this discussion in comments on my
intro, someone else may find it useful. I'll copy old messages
there, If you don't mind.

Boris Kazachenko said...

Dan He said:

Only on 1D the inputs will be uncompressed & unselected. Since 2D, we will be merging the dimensions from the previous level so it won't be uncompressed. At each level we make selections and we only merge the dimensions that are "significant".

The selection is generic. It doesn't need to be user defined. Whatever method you used to select a dimension can be used here.

Yeah, you can publish these discussions.

Dan

Boris Kazachenko said...

You said:

> parallel incremental search from multiple dimensions,
> where we increase the dimensions in parallel by merging
> the dimensions of the current level to generate the
> candidates for the next level.

Do you mean expanding search in multiple additional
dimensions (& adding corresponding coordinates) per
single additional level, then?

If so, I already answered that in May 11nth message:

"The point of having incremental dimensionality is that it
lets you filter-out weaker lower-D patterns before adding
the costs of higher-D coordinates."
This is an instance of "incremental syntax", - another core
principle in my approach.

I don't know what you mean by "merging" dimensions?
In my model, lower-level variables are selectively
represented or merged with related variables on a
higher level, see part 5.
Although that part is a bit messy & out of date, sorry.

Boris Kazachenko said...

Dan He said:

If you have ABCD 4 dimensions, and at current level, say 2 dimension level, you have AB and BC, both are "significant", then the next level is 3 dimension, you merge AB and BC and you obtain ABC, which is 3 dimension.
That's what I mean by "merging". You merge two n-dimensions to obtain one n+1-dimension, where the two n-dimensions overlap by n-1 dimensions.
Apriori algorithm then guarantees that you won't miss any significant dimensions and you don't need to do exhaustive search.

Dan

Boris Kazachenko said...

I thought we agreed to use "dimension" to mean Cartesian
coordinates + time |the order of input, for clarity?
You are talking about dimensions of combinatorial search,
while mine is proximity-ordered. As in, proximity between
comparands, rather than proximity in expanding the number
of constituent variables within a comparand.
I have multivariate patterns, but their variables aren't included from outside, they are *derived* through lower-level comparisons to a fixed syntax.
When I compare these patterns, I don't compare their variables in all possible combinations, I only compare variables of the same type between different patterns.

So, the difference is that variables in my patterns ("dimensions" in your terms) are added to represent derivatives from comparing & compressing lower-level patterns. While in apriori algorithm, they *are* lower-level patterns, combined into higher-level ones if they co-occur.
My model is finer-grained, thus should be more compressive.

Boris Kazachenko said...

Dan He said:

For proximity-order, you can still cache the input for a certain amount of time, right? Then we can search in parallel? If not, what do you exactly mean "proximity order"?

So why Apriori can't be applied on the variables in your patterns? And also when your patterns grow to higher-level, eventually Apriori can be applied,right?

Dan

Boris Kazachenko said...

> For proximity-order, you can still cache the input for a certain
> amount of time, right? Then we can search in parallel?

Yes, that;s how you get fuzzy patterns (~fuzzy clusters).
Each input member is compared to a sequence of old inputs
(template), possibly in parallel, forming a sequence of
derivatives that overlaps those of adjacent inputs.
This overlapping / redundant representation is far more
expensive than unique representations produced by strictly
sequential search.

> why Apriori can't be applied on the variables in your patterns?

Because individual variable's match is "merged" with those of
co-derived variables, - a group is defined by derivation rather
than simple proximity. That's one of the difference between
variables & patterns: the former are ordered / merged
by derivation syntax rather than input sequence.

Boris Kazachenko said...

Dan He said:

What's your definition of "derivation" and "derivatives"? You compare the
inputs with the templates,
and you call those overlapped ones as "derivatives"?

I still don't get the point that you eventually need to increase the
dimensionality or whatever, and whatever you need to increase, can be feed
into the Apriori algorithm, right? Even if the derivatives depend on the
sequence of inputs, if we can cache the inputs, we could obtain the
sequence of derivatives, right?

Dan

Boris Kazachenko said...

Derivatives are not inputs, they are matches & misses
produced by comparing inputs. They replace original
inputs to achieve lossless compression. I described
this process in part 3, but it's messy & out of date.
I am re-writing it, will let you know when it's done.

Boris Kazachenko said...

Dan He said:

Ok, so please let me know when you have it updated. Also if you could
provide a concrete running-example, that would be very helpful. The current
description contains lots of definitions and it's hard to map them into
some math models.

Dan

Boris Kazachenko said...

Hi Dan, sorry it took me so long.

I was working on something else, & also,
it’s hard to explain a work-in-progress.
Anyway, I just posted a bunch of updates,
especially in part 3, take a look.
Obviously, I am not done yet.

So, I can’t select | exclude variables depending on
their co-occurrence, as in apriority algorithm,
because then syntactic order will be lost.
The syntax is fixed for each search & differentiation
level, & variable types can be deleted / excluded only
by a level of syntax: a group of variables with a
common root (variable from which they are derived).

Such as dP: i: last input, L: same-d-sign length,
D: summed ds, M: summed ms, Q(d): discrete ds.
See part 3.

So, match is combined within a group only.
If it is below average for a corresponding root
variable, the group is deleted, & derivation flag
at the root variable is set to = 0.

> The current description contains lots of definitions
> and it's hard to map them into some math models.

It’s hard because they don’t fit, at least those I know
(not very many). That’s why I start from the scratch.
Terminology & methods must fit the problem,
not the other way around.

Anyway, it’s good that you pushed me to get into
details, here is $200 as a token of my appreciation :).

Boris Kazachenko said...

Dan He said:

Thanks

Also if I fully understand your method, I believe there should be more I can contribute.

Dan

Boris Kazachenko said...

Right, I hope you will understand it better, let me know if any questions.

Kristian Grozdanovski said...

Regarding the third paragraph - in other words our mind understands the experiences in our life by remembering patterns and later associating them with different patterns, right? I mean, this is what I have been theorizing myself and you have reached conclusions very similar to my own.
For example let's take an apple. When you read that word, you suddenly remember the smell, look, taste and feel of this entity - patterns from different sensory inputs.

Boris Kazachenko said...

Yes, but the key is to formally define patterns, and the algorithm to discover them. Do you have any specific ideas or comments on my algorithm? Thanks.

Todor "Tosh" Arnaudov said...

Hi, Boris, a few comments from a partial reading:


1. Noise IS a pattern and IS predictable - that's why it's boring

2. Higher levels are not always smarter

3. Lower levels of cognition are not unconscius

4. Causality, causes and correlations


...

1. Noise IS a pattern and IS predictable - that's why it's boring

Boris: "Novelty can’t be a primary criterion: it would select for noise and filter out all patterns, which are defined by match."

Todor:

I think that's a common, partly superficial, cliche. Noise, if having a stable or predictable distribution, is also a pattern, actually a very simple one, and it is recognized as some kind of distribution or remembered/classified otherwise (with a generative capabilities), depending on the system.

Noise's representations are "boring" (for watching, like TV static) not because they are too "unpredictable", but the opposite. Respectively it is not true that noise is "uncompressible".

It is compressible - to a noise distribution which is precise enough, since or if the cognitive system cannot remember the specific details, thus it wouldn't care or notice in real time if the specific low level values are slightly altered to the original.

If one tries to compress noise as specific values exactly, then she treats that input not as noise.

So in a series of frames with TV static only the specific values of the lowest level representation are "unpredictable", but the distribution is "the same", given the resolution of perception - in abstract sense of resolution, - that goes also for other details such as the same screen/dynamic area (physical coordinates), the same average brightness - applied in many levels/resolution of aggregation (and matching between each other) etc.. Also noise is locally self-predictive, self-matching after General Processing** or adjacent comparison, which makes it boring too.

Noise matches the predictions too match and watching the "static" on a TV is somewhat similar to watching the wall.

That goes for audio noise as well, it immediately starts to sound like a boring periodic pattern and different spectra of noise are recognized - white, pink, low frequency, middle frequency, high frequency* and if one is given synthesizing tools to play with, she can recreate the noise that she has heard, i.e. it is a pattern.***

* These are "generalized", higher level labels, but they are just pointers to the representations. People with good musical ear or with "absolutist" audio sense could remember and recreate or even tell the frequencies, just like they could replay music by ear. See 3. below.
** ...
*** Another theory is that the apparent local unpredictability overdrives brain processing, the system suffers cognitive overload due the unsuccessful attempts to compress, but I think that's more likely true only for the "garbage input", confusing ones, ones with too many recognizable details, like pictures of dumping grounds, general "disorder"; that is, something IS recognized, some part are "meaningful", but there's an overload in making sense of the whole picture. In the case of noise, the properties are highly local and homogeneous.

Todor "Tosh" Arnaudov said...

2. Higher levels are not always smarter

Boris: "... Intelligent variation must be driven by feedback within cognitive hierarchy: higher levels are presumably “smarter” than lower ones. That is, higher-level inputs represent operations that formed them, and are evaluated to alter future lower-level operations. ..."

Todor:

I don't think that's always true (only "presumably" and sometimes), the "cleverness" of the higher level depends on the specific environment and the history of the experience and the biases, for how long the higher level was relatively "stable", how "stable", how "stability" is measured, how stable are/were the lower levels etc. In a more dynamic environment the lower levels might be "smarter", because they would adapt and react first and higher level plans might go completely wrong, with (correct) prediction value = 0, like with the branch prediction in the CPUs or the cache misses.

There's also something that Kant and Schopenhauer call "facculty of judgment", which is the glue between different levels of generalization and the capability to recognize and classify correctly the lowest sensory input to correct respective higher level representations.

The higher level is given higher *power* in longer term, because it may push the agent to this or that path for a bigger spatio-temporal radius, however whether this is more "predictive" in the long run depends on the whole Universe of things. "Smart" leaders often send their followers to hell and apparently wise maxims have kept people in darkness, and the world is occupied with other agents of which there are limitations for predictability. Their intentions are both partially hidden/unattainable or their goals and behavior is flexible and may change in reaction to your predictions/actions.

Brain is not strictly cognitive and the prefrontal cortex integrates too many inputs, but the world is also not just cognitive, and the top-down tyranny in general is prone to and causes corruption, conservatism, political tyranny, pseudo-democracy.

Todor "Tosh" Arnaudov said...

3. Lower levels of cognition are not unconscius

Boris: "All our knowledge is ultimately derived from senses, but lower levels of perception are unconscious. Only generalized concepts make it into human consciousness, AKA declarative memory, where we assign them symbols (words) to facilitate communication."

Todor:

The artists of all arts and the athletes, especially in technical sports are supposed to operate masterfully with "lowest level" stuff, that is simple actions or perceptions. Making a concept verbal might be just labeling, a quick addressing. A talented artist knows and operate these concepts (patterns) even without giving them a verbal name. Lower levels are "unconsious" only for the ones who do not understand/recognize or cannot control their actions with an appropriate/sufficient resolution of causality and who lack introspective capabilities. There are degrees in that.

If consciousness is the ability to know, to alter, to select, to address etc. these lower level components, then one who does is conscious about them.

Sure, you may call these "general representations" as well, but for example the process of drawing and painting the spectrum of all different specific basic trajectories/forms/shapes/gradients etc. is as specific (low level) as possible for a motion/action/transformations performed with the respective tools of the trade, and the artist can control each pixel, color, shade etc.

In comments to a book (200, 210 of http://razumir.twenkid.com/kakvomu_notes.html#210, http://razumir.twenkid.com/kakvomu_notes.html#200) I discuss "The Hypothesis of the Deeper Consciousness" of mine (in Bulgarian).

Jeff Hawkins' hypothesis about the motion of the representations lower in the hierarchy through training is related to that.

For example, there's a common confusion regarding creativity. A guitar player who's improvising at a high speed might appear as "unconscious" to the audience ("instinctive"), but she's not. She is not "unconscious" about the future tonal possibilities, respectively her motor actions (predictions), she knows what would sound well, respectively what motions she could and want to apply etc., thus she chooses this or that, and she has to control the motions at the maximum possible resolution. She expresses her predictions and what she feels/cognizes through these motions, because speech is inappropriate and too slow, but that doesn't mean that she doesn't understand (cognize) her actions and their very components.

Todor "Tosh" Arnaudov said...

4. Causality, causes and correlations (comment to a G+ topic on the "causes")

Causality as a concept is useful for benchmarking the system in a simulated or in the real world. The causes are defined there as physical laws or others, the correlations are found by the system and compared. They are marks of the maximum or the target resolution of perception and causation (alteration, change), achievable in the given environment. The "God" of the cognitive algorithm aims at making his pupil reach to that maximum possible resolution or to the wanted one, according to given point of view and premises.

Yes, causality is a matter of degree and POV, especially when looking at the world not literally at a full resolution (in that case it's a literal record) and when asking questions like "Why this car hit that man, who was crossing the street?" or "Why that car has parked here and not there?". The "Why" question requires target resolution/ limitations/ bandwidth/ agents/ attention[span]/ radius[space-time]/ motivation ... and other premises to be given, before answered. If all is run in the same pool, the premises are also based on the primary causes, but are at different levels and require some kind of Understanding that selects, encodes and decodes different types and levels of "causes". The physical laws do not understand life or human thoughts.

Regarding the claim that we don't know more of "correlations" - that's somewhat Locke-Hume-Mach-style of thought, - in opposition to Kant-Schopenhauer-Marx-Engels-Lenin. Usually that's said when the philosophers forget about the existence of their bodies. Through the experience we see too clearly that our bodies are built of a big chunk of these same "correlations" of the physical laws and are result of series/networks of them, therefore our thought and action processes are expressions of these very "correlations".

The physical laws are implanted in us and we are directly connected to the causal chain with a concept/something that was once called "Will" by Schopenhauer, related to "The Thing in Itself" of Kant and to the "Will To Power" of Nietsche.

Sure, technically, for the implementation, it doesn't matter whether you call it "strong patterns" or parts of the "causal network".

Boris Kazachenko said...

Todor,

1). You should know by now that by “pattern” and “noise” I mean components (not categories) of real inputs. You are saying that all noisy inputs have *some* sort of pattern in them. That’s nothing new, I am talking about majority-patterns: spans of above-average match, and majority-noise: spans of below-average match.

2). We've had this discussion before. The only feasible “glue” between levels of generalization is even higher level of generalization.

3). Yes, so what?

4). "Sure, technically, for the implementation, it doesn't matter whether you call it "strong patterns" or parts of the "causal network"."
It does matter, “pattern” is constructive and “causal” is distracting.

Boris Kazachenko said...

Todor,

In general, I don’t want to argue about part i, it is mostly suggestive. I know you love to go on a tangent, but it serves no constructive purpose. What I really need to figure is the process of forming incrementally higher orders of feedback: part 4. Feedback is the most fundamental concept in my approach. I have three basic orders so far: templates: past inputs for comparison, filters: averages of past inputs for evaluation, and expectations: back-projected past inputs for prevaluation (which allows for skipping whole spans of future inputs, rather than filtering them one-by-one).

The first two are straightforward: templates are not modified at all, and filter = higher-level cumulative match / (higher-level span / lower-level span).
But back-projection involves re-integrating co-derived differences, which doesn’t fit the “pattern” of simple span-normalization formed by the first two orders. And I need a common “meta-pattern” to derive subsequent orders of feedback.
Also, averages of past inputs should be compared, and input-to-average match should contribute to ultimate filtering by cumulative past match.
Any ideas? BTW, this is getting to be too much for comment section here. Let me know if you plan to continue, I will start a new “discussion” post.

Todor "Tosh" Arnaudov said...

Yes, I do plan to give a try of the new material.

1. OK, I mean also that the above-average match sounds a bit dogmatic to me for the earliest stages, the "child" machine may explore more and collect more low level data.

Also, is your system supposed to have external/additional/"lower"/interface drives to push its development, to help it avoid local minima, obstacles, lack of immediate predictive value?

That's both "emotions" and a "teacher/mentor", and is related also to early navigational/exploration behaviors, like babies scanning the area with their eyes when put in the dark (search for contrast, novelty seeking), or changing the direction of search/scan/coordinate adjustments when reaching the end of the coordinate space/dimension/... These may emerge from the principles of the cognitive algorithm, but may be partially "external", driven in case the environment is not "good"/stimulating enough or that kind of knowledge is inappropriate/inefficient to learn through "free" exploration.

Boris Kazachenko said...

Yes, selectivity will increase with experience, via feedback from higher levels, which are added incrementally. All filters (averages) are initialized at 0. I just wish you get over your holistic, analogic, qualitative, anthropomorphic thinking, - it's not constructive.

Supervision and reinforcement can be added at any point, but that's trivial, not an interesting part. Real work is in designing scalable unsupervised learning algorithm.
I should post an update soon, mostly on parts 3 and 4.

keghn feem said...

I have a preliminary compete AGI theory:

https://groups.google.com/forum/#!topic/artificial-general-intelligence/UVUZ93Zep6Y

https://groups.google.com/forum/#!forum/artificial-general-intelligence

Boris Kazachenko said...

Thanks Keghn. You don't go into detail, but rely on ANN. I stated specific objections to using ANNs in the first section.

keghn feem said...

Yes. I do unsupervised learning with Kolmogorov complexity and Frieze logic
patterns. I fist did clustering with this logic in a non ANN way. And then
later found out a way to do image clustering with NN, with the above logic. And
then temporal clustering.

http://www.scientificcomputing.com/articles/2015/09/patterns-are-math-we-love-look
https://en.wikipedia.org/wiki/Frieze_group

The AGI NN Brain is a cascading LSTM:
https://www.reddit.com/r/alife/comments/3dkcdv/agi_brain_a_cascading_rnn/


For me it is easier to do it in math and programming in openCV then NN. Then later on i would make the conversion to nn.

So if you do not like NN i will not talk about it any more, from here on out.

My system is take raw input into a giant ring buffer and look for repeating patterns. Then hit a pattern with a output. In hope of improving the pattern.
This is done in evolutionary way. Staring from a micro movements then up.

My engram snap shots are jpg images of edge detectors, and the bottom half of
the image is meta data of arm, leg position. Sound track and other info. like
kalman filters
Each snap shot is taken every 1/20 of a second. I use differential compression.
The fist image is the reference and any and only changes will be in the next
snap shots.

These engram snap shot are formed into graph theory maps of the world:
https://en.wikipedia.org/wiki/Graph_theory

My theory is very detailed. I only post little details at a time.
So if there is any thing that resonates with you then i will talk more about it.
Because everyone want to do AI in their style. Bill Gate can fund AGI and have
it next year. But he want to have one of his kids to develop it. So he will
fund them whether they have a working theory or not.




Boris Kazachenko said...

It's good that you are working on it Keghn, but I don't think you have a theory yet. Otherwise you would have a coherent write-up, one simply doesn't happen without another.

keghn feem said...

Well, if these brief description of my work does not resonate with you then
we are different ball parks. Your work does not resonate with me.
An thus it is not in my interest to give the full description of my work to
some one who will find it uninteresting.

Todor "Tosh" Arnaudov said...

Filter is used many times in different contexts and it somewhat sounds as something abstract/undefined.

You mention "average" and it's supposed to be a "selection criteria", however it'd be more clear for beginners if it's emphasized that "f." is just a value/scalar? to which input/variables (values) are compared.

Depending on the outcome, the evaluated item is either selected or skipped, i.e. the f. are borderlines/limits.

Boris Kazachenko said...

That's because filtering (feedback) is incrementally complex, starting from bit filtering by input and coordinate LSB and MSB.
Average is the second-order: integer filter. But I start explanation with average because it's easier to understand and bit filtering is pointless with canned images.

It's actually dual hierarchy: of filter resolution ) complexity, and of feedback range ) depth. I cover it in part 4, but the only way to make it fully explicit is code.

BTW, it's not necessarily selected or skipped, rather included into positive or negative pattern. What is happens to these patterns then depends on the level of processing.

Todor "Tosh" Arnaudov said...

BK: "However, average match in our space-time is presumably equal over all four dimensions."

TA: Is it? It does in a mathematical random environment or may be assumed because of the spherical propagation of gravity and other forces? IMO initially it's more likely equally *unknown* to an Alg. Also, in real environments, it doesn't feel equal in *all* dimensions to me.

It's equal if you define match as some average that covers all four dimensions and treat the samples as in one common bucket, thus they are defined as equal, but that requires initial expectation of these dimensions.

Also aver.match is very coarse and not meaningful. What's the average match (to what average) for a starry sky - or the Universe taken as a "photo"?

Within the area of a star or moon, there is continuity of high intensity. Within the dark areas there's continuity of lacking intensity. The global average match to some "average" is perhaps ~ 0,000...01, because the most is dark.

That's one of the early inputs humans have systematically searched for patterns.

Besides the 4D (time) is different. The CogAlg is not supposed to think in relativistic terms, but the relativistic 4D is "space-time", where time is not an independent dimension.
In Newtonian (and probable practical) treatment, the outcomes in "time" are generally irreversible and "unique", while the spatial are "more reversible" and repeatable/"randomly" accessible (from the POV of the operating mind).

Furthermore, "time" is more "introspective" and "abstract" than the others, unless it's built-in - thus not discovered, but just mapped to specific inputs.

It's either built-in as "different" (or assumed as a position in the input queues), or it's discovered as some kind of pattern within the other spaces (dimensions), in the memory, and that pattern (model) is called "time" and is then also not independent from the other dimensions.

While on the other hand the positional dimensions in a simple virtual implementation are directly mapped to some input/or feedback,coordinates/variables.

Boris Kazachenko said...

Yes, space-time is an add-on, not part of my core algorithm. As are dimensions in general, core algorithm is 1D only. I have 4D built-in as an implementation shortcut, discovering it may not be easy.

Whatever specific anisotropy there is among individual dimensions is another layer of add-ons, this one to be learned by the algorithm itself.

As for averages, they are selected feedback of individual higher-level patterns, as fine or coarse as the patterns themselves.

keghn feem said...

Hi
I have completed my AGI theory. Well pretty much.

Todor "Tosh" Arnaudov said...

BK: "Crucial differences here is that conventional clustering methods initialize centroids with arbitrary random weights, while I use matches (and so on) from past comparisons."

T: Is it so crucial, I think it doesn't matter - at least in the simple K-means, the center should converge after scanning all patterns, the "past comparisons" are the differences ("the distance") to other patterns within the set which adjust the center. The number of buckets/sets matter, though + the specific empirical values, they define how the borderline cases are spread within flat/one-stage (non-hierarchical) clustering.