This intro is out of date, please see Readme on GitHub

I am designing this algorithm for strictly bottom-up hierarchical clustering, from pixels to eternity, draft code: agg_recursion. The process is derived from the definition of intelligence as the ability to predict from prior or proximate input. Practically, that means forming and tracing causal links in segmented graphs (connectivity-based clusters or patterns), same as much-ballyhooed reasoning and planning. Any prediction is interactive projection of known patterns, hence the primary process must be pattern discovery. AKA unsupervised learning: an obfuscating negation-first term. This perspective is not novel, pattern recognition is a core of any IQ test, and main focus of ML.

But in statistical ML, generalization / pattern discovery is a side-effect of randomization, not an explicit objective. Such methods are not a rational design, they rely on brute-force fitting. Especially Neural Nets, neuromorphic or artificial:

- Hebbian Learning is a local weight adjustment by output / input coincidence: a binary version of direct similarity between input and normalized sum of weighted inputs. So each neuron becomes a fuzzy centroid-based cluster of its inputs.

- In Deep Learning, the weights are adjusted by backprop of decomposed error: inverse similarity to the target layer. Basic ANN: multi-layer perceptron or KAN performs lossy stochastic chain-rule curve fitting. The logic is basically prototype-based clustering, clusters being the features represented by hidden-layer nodes. But it's non-linear and fuzzy: fully connected in MLP, with multi-layer summation / distribution in backprop. All this fitting is a vestige of supervised learning, that's why they call it "self-supervised", it doesn't belong in purely unsupervised learning.

Modern ANNs combine such vertical training with lateral cross-correlation, within an input vector. CNN filters are designed to converge on edge-detection in initial layers. Edge detection means computing lateral gradient, by weighted pixel cross-comparison within kernels. Graph NNs embed lateral edges, representing similarity or/and difference between nodes, also produced by their cross-comparison. Popular transformers can be seen as a variation of Graph NN. Their first step is self-attention: computing dot product between query and key (QK) vectors within context window of an input. This is a form of cross-comparison because dot product serves as a measure of similarity, although an unprincipled one.
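
To make the cross-comparison reading of self-attention concrete, here is a minimal sketch in plain Python. It assumes toy 2D token vectors used directly as both queries and keys (no learned projections), and the names `dot` and `attention_weights` are illustrative only, not taken from any library.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attention_weights(queries, keys):
    """Softmax over scaled query-key dot products, one row of weights per query."""
    scale = math.sqrt(len(keys[0]))
    rows = []
    for q in queries:
        scores = [dot(q, k) / scale for k in keys]
        peak = max(scores)                        # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# Toy token vectors; using them as both queries and keys is a simplifying assumption.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights = attention_weights(tokens, tokens)
```

Each row of `weights` ranks the other tokens by their dot product with the query. Note that the dot product grows with vector magnitude: `dot([1, 0], [2, 0])` is larger than `dot([1, 0], [1, 0])`, so a key twice as long outscores an identical one, which is one sense in which it is a loose measure of similarity.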

So basic operation in both trained CNN and self-attention is what I call cross-comparison, but the former selects for variance and the latter for similarity. I think the difference is due to relative rarity of each in respective target data: mostly low gradients in raw images and sparse similarities in compressed text. This rarity or surprise determines information content of the input. But almost all text ultimately describes generalized images and objects therein, so there should be a gradual transition between the two. In my scheme higher-level cross-comparison computes both variance and similarity, for differential correlation clustering.

GNN, transformers, and Hinton's Capsule Networks all use positional embeddings, as I use explicit coordinates. But they are still trained through destructive backprop: randomized summation first, meaningful output-to-template comparison last. This primary summation degrades resolution of the whole learning process, exponentially with the number of layers. Hence, a ridiculous number of backprop cycles is needed to fit hidden layers into generalized representations of the input. Most practitioners agree that this process is not very smart, that noise-worship alone is the definition of stupidity. I think it's just a low-hanging fruit for terminally lazy evolution, and slightly more disciplined human coding. And it's easy to parallelize, which is crucial for glacially slow cell-based biology.

I propose a reversed sequence: cross-comparison of original inputs, followed by summing into match-defined clusters. That's a lateral proximity-constrained connectivity-based clustering, vs. stochastic vertical fitting in NN. This cross-comp and clustering should be recursively hierarchical, forming patterns of patterns, and so on. Initial connectivity is in space-time, but feedback should re-order input along all sufficiently predictive derived dimensions (eigenvectors). This is similar to spectral clustering, but the actual clustering is still connectivity-based, within a new frame of reference. Feedback will only adjust hyperparameters to filter future inputs: no top-down training, just bottom-up learning.

Connectivity likely represents local interactions, which may form both similarity clusters and their high-variance boundaries. Such boundaries reflect stability (resilience to external impact) of the core similarity cluster. The most basic example is image contours, which are initially more informative than the flat areas they demarcate. Cross-similarity is not likely to continue immediately beyond such contours, they represent separability of the core cluster. The next cross-comp should be between new higher-composition and higher-derivation cluster + contour representations. This "complemented" clustering is inherently discontinuous and much more complex, due to the layer of new derivatives added in cross-comparison. So we need to alternate generative and compressive phases in each level:

- cross-comp of incremental range and derivation: a generative stage, followed by two compressive clustering stages:

- exemplar selection and centroid clustering, reducing multiple similar patterns to a single template / type,

- connectivity clustering of compared nodes and correlation clustering of resulting links,

- feedback to scale filters and input coordinates, potentially reframing to consolidate connections and projecting to simulate.

This is vaguely similar to alternating self-attention and feed-forward in transformer layers, but with explicit clustering and pruning, both strictly laminar and vastly different in detail.

Connectivity clustering is among the oldest methods in ML, but I think my scheme should be uniquely scalable in complexity of discoverable patterns:

- Links are valued by both similarity and variance between the nodes.

- Similarity is defined as compression (vs. dot product), direct or inverse. Variance has intrinsically negative value, borrowed from co-projected similarity.

- This projected similarity is a general cognitive objective / fitness function, which can be used to optimize the process automatically.

- Comparison derivatives parameterize clusters for higher-order cross-comp, incremental in range, derivation, and composition (of both nodes and their param sets).

The process is self-contained, there is no preprocessing, and all operations are derived from the above principles. This conceptual integrity provides the confidence to design an indefinitely nested param set per node, to be compared on a higher composition level. Such compressive encoding should be far more meaningful and interpretable than huge flat weight matrices in ANNs. But it’s very complex to consistently design and parallelize, precluding immediate trial and error that dominates ML. Which is probably why I can't find anything that's close enough.

Below I describe it in more detail, then extend comparisons to ANN and BNN. This is an open project: CogAlg, we need help with design and implementation, in Python. I offer awards for contributions, or monthly payment if there is a track record, see the last part here.

This content is published under Creative Commons Attribution 4.0 International License.

Outline of my approach

Initial clustering levels, positional resolution (macro) lags value resolution (micro) by one quantization order:

Inputs	Comparison	Positional resolution	Outputs	Conventionally known as
unary intensity	AND	none, all in the same coordinates	pixels of intensity	digitization
integer pixels	SUB	binary: direction of comparison	blobs of gradient	edge detection, flood fill
float: averaged params of blobs	DIV: comp ave params	integer: distance between blob centers	graphs of blobs	connectivity-based clustering
complex: norm. params of graphs	LOG: params hierarchy	float: distance between graph centers	hierarchical graphs	agglomerative clustering

And so on, higher levels should be added recursively. Such a process is very complex and deeply structured, there is no way it could evolve naturally. Since the code is supposed to be recursive, testing before it is complete is useless. Which is probably why no one seems to work on such methods. But once the design is done, there is no need for interminable, glacial, and opaque training: my feedback only adjusts hyperparameters.

So, a pattern is a cluster of matching input elements, where match is compression achieved by encoding the input as derivatives, see “Comparison” section below. Some define pattern as a recurring item or group, in my terms these are pattern elements. If the items co-vary: don't match but their derivatives do, then they form higher-derivation pattern, where the elements are derivatives.

But lower-derivation and shorter-range cross-comp must be done first, starting with consecutive atomic inputs. That means sensory input at the limit of resolution: adjacent pixels of video or equivalents in other modalities. All primary modalities form dense array of such inputs in Cartesian dimensions, symbolic data is subsequent encoding. To discover meaningful patterns, the symbols must be decoded, which is exponentially more difficult with the level of encoding. Thus a start with raw sensory input is by far the easiest to implement (part 0).

This low-level process, directly translated into my code, seems like quite a jump from the generalities above. But it really isn’t, internally consistent pattern discovery must be strictly bottom-up, in complexity of both inputs and operations. And there is no ambiguity at the bottom: initial predictive value that defines patterns is a match from cross-comparison among their elements, starting with pixels. So, I think my process is uniquely consistent with high-level definitions, please let me know if you see any discrepancy in either.

Comparison, more in part 1:

Basic comparison is inverse arithmetic operation between single-variable comparands, of incremental power: Boolean, subtraction, division, etc. Each order of comparison forms miss or variance: XOR, difference, ratio, etc., and match or similarity, which can be defined directly or as inverse deviation of miss. Direct match is compression of represented magnitude by replacing larger input with corresponding miss between the inputs: Boolean AND, the smaller input in comp by subtraction, integer part of ratio in comp by division, etc.
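
The three orders of comparison above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the sign convention of the difference is arbitrary, and division is restricted to positive integers so that the fractional part stays exact.

```python
from fractions import Fraction

def comp_bool(a, b):
    """Boolean comparison of two bits: match = AND, miss = XOR."""
    return a & b, a ^ b

def comp_sub(a, b):
    """Comparison by subtraction: match = smaller comparand, miss = difference.
    The sign convention (a - b) is an assumption of this sketch."""
    return min(a, b), a - b

def comp_div(a, b):
    """Comparison by division of positive integers:
    match = smaller comparand * integer part of the ratio,
    miss = fractional part of the ratio."""
    small, large = sorted((a, b))
    multiple, remainder = divmod(large, small)
    return small * multiple, Fraction(remainder, small)
```

For example, `comp_sub(7, 2)` gives match 2 and miss 5, while `comp_div(7, 2)` gives match 6 (2 * 3) and miss 1/2: each higher order of comparison increases match and reduces miss for the same pair of inputs.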

These direct similarity measures work if input intensity corresponds to some measure of stability of an object: mass, energy, hardness. This is the case in tactile but not in visual input: brightness doesn’t correlate with inertia or invariance, dark objects are just as stable as bright ones. Thus, initial match in vision should be defined indirectly, as inverse deviation of variation in intensity. 1D variation is difference, ratio, etc., while multi-D comparison has to combine them into Euclidean distance and gradient, as in common edge detectors.
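
A minimal sketch of this indirect match for vision, assuming a grayscale image given as a list of rows, forward differences to the pixel below and to the right, and a hypothetical filter `AVE` (an average gradient chosen here by hand). The project code may use different kernels and filter values.

```python
import math

AVE = 15  # hypothetical filter: the gradient value that counts as average

def comp_pixels(image):
    """Cross-compare each pixel with its lower and right neighbors.
    Returns rows of (intensity, dy, dx, gradient, match) tuples."""
    derts = []
    for y in range(len(image) - 1):
        row = []
        for x in range(len(image[0]) - 1):
            p = image[y][x]
            dy = image[y + 1][x] - p     # vertical difference
            dx = image[y][x + 1] - p     # horizontal difference
            g = math.hypot(dy, dx)       # Euclidean gradient magnitude
            m = AVE - g                  # inverse match: deviation of variation from the filter
            row.append((p, dy, dx, g, m))
        derts.append(row)
    return derts
```

On a flat area the gradient is zero and match equals `AVE`; across a sharp edge the gradient exceeds `AVE` and match turns negative, regardless of how bright or dark the pixels are.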

Patterns, more in part 2:

Cross-comparison among patterns forms match and miss per parameter, as well as dimensions and distances: external match and miss (these are separate parameters: value = precision of what * precision of where). Comparison is limited by max. distance between patterns. Overall hierarchy has incremental dimensionality: search levels ( param levels ( pattern levels)).., and pattern comparison is selectively incremental per such level. This is hard to explain in NL, please see the code, starting with line_Ps and line_PPs.

Resulting matches and misses are summed into lateral match and miss per pattern. Proximate input patterns with above-average match to their nearest neighbors are clustered into higher-level patterns. This adds two pattern levels: of composition and derivation, per level of search. Conditional cross-comp over incremental range and derivation, among the same inputs, may also add sub-levels in selected newly formed patterns. On a pixel level, incremental range is using larger kernels, and incremental derivation starts with using Laplacian.

Feedback, more in part 3 (needs editing):

Average match is the first order of value filter, computed on higher levels. There are also positional filters, starting with pixel size and kernel size, which determine external dimensions of the input. Quantization (bit, integer, float..) of internal and external filters corresponds to the order of comparison. The filters are similar to hyperparameters in Neural Nets, with values updated by feedback. But I have no equivalent of weight matrix: my learning is connectivity clustering, vs. vertical clustering via backprop or Hebbian learning.

All filter types represent co-averages to a higher-level average value, locally projected by higher-level patterns. Clustering on a filtered level is by the sign of deviation from those filters (cross-input-element-match - filter), so using averages balances positive and negative patterns: spans of above- and below-average cross-match in future inputs. Resulting positive patterns contain input elements that are both novel: exceeding expectations of higher levels, and similar to each other: making them predictive of future input.
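
Clustering by the sign of deviation from a filter can be illustrated in 1D. The sketch below assumes direct match (the smaller comparand) between consecutive inputs and a hand-picked filter `ave`; the name `form_patterns` and the dictionary keys are hypothetical, chosen for this example only.

```python
from itertools import groupby

def form_patterns(inputs, ave):
    """Cross-compare consecutive inputs, then cluster the results into spans
    of same-sign deviation of match from the filter `ave`."""
    # cross-comp: match = smaller comparand, miss = difference
    derts = [(min(a, b), b - a) for a, b in zip(inputs, inputs[1:])]
    patterns = []
    for sign, span in groupby(derts, key=lambda dert: dert[0] > ave):
        span = list(span)
        patterns.append({
            "sign": sign,                       # True: above-average match
            "L": len(span),                     # number of comparisons in the span
            "M": sum(m for m, _ in span),       # summed match
            "D": sum(d for _, d in span),       # summed difference
        })
    return patterns

patterns = form_patterns([10, 12, 11, 2, 1, 3, 12, 13], ave=5)
```

With these inputs the sequence splits into a positive pattern, a negative pattern, and another positive pattern, each carrying summed match and difference that can be compared on the next level.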

Hierarchy, part 4 but out of date:

There is a single global hierarchy: feedforward inputs and feedback filters pass through the same levels of search and composition. Each higher level is a nested hierarchy, with depth proportional to elevation, but sub-hierarchies are unfolded sequentially. That’s why I don’t have many diagrams: they are good at showing relations in 2D, but I have a simple 1D sequence of levels. Nested sub-hierarchies are generated by the process itself, depending on elevation in a higher-order hierarchy. That means I can’t show them in a generic diagram.

Brain-inspired schemes have separate sensory and motor hierarchies, in mine they are combined into one. The equivalent of motor patterns in my scheme are positional filter patterns, which ultimately move the sensor. The first level is co-located sensors: targets of input filters, and more coarse actuators: targets of positional filters. I can think of two reasons they are separated in the brain: neurons and axons are unidirectional, and training process has to take the whole hierarchy off-line. Neither constraint applies to my scheme.

Final algorithm will consist of first-level operations + recursive increment in operations per level. The latter is a meta-algorithm that extends working level-algorithm, to handle derivatives added to current inputs. So, the levels are: 1st level: G(x), 2nd level: F(G)(x), 3rd level: F(F(G))(x).., where F() is the recursive code increment.
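
The G(x), F(G)(x), F(F(G))(x) sequence can be written as a higher-order function. This is a toy sketch only: it assumes that the code increment F simply re-applies cross-comparison to the derivatives formed by the level below, with `comp` recursing into nested tuples. The real increment is far more selective; all names here are illustrative.

```python
def comp(a, b):
    """Compare two scalars, or two equally nested tuples element by element.
    Scalar match ignores sign (min of absolute values), a simplification."""
    if isinstance(a, tuple):
        return tuple(comp(x, y) for x, y in zip(a, b))
    return (min(abs(a), abs(b)), b - a)

def G(inputs):
    """First-level operation: cross-compare consecutive inputs."""
    return [comp(a, b) for a, b in zip(inputs, inputs[1:])]

def F(level_op):
    """Code increment: extend a level operation to also compare its own outputs."""
    def extended(inputs):
        return G(level_op(inputs))
    return extended

x = [1, 3, 6, 10]
level_1 = G(x)          # (match, difference) per pair of inputs
level_2 = F(G)(x)       # match and difference per first-level derivative
level_3 = F(F(G))(x)    # one more order of nesting
```

Each application of `F` adds one layer of derivatives per element, so the parameter set doubles in size per level while the number of elements shrinks.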

Resulting hierarchy is a pipeline: patterns are outputted to the next level, forming a new level if there is none. Given novel inputs, higher levels will discover longer-range spatio-temporal and then conceptual patterns.
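
A minimal sketch of such a pipeline, assuming (purely for illustration) that a pattern is complete after a fixed number of inputs. In the scheme described here termination is by criterion sign change, not by count; the `Pipeline` class and its `span` parameter are hypothetical.

```python
class Pipeline:
    """Levels are created on demand: a completed pattern is passed to the
    next level, which is appended if it does not exist yet."""

    def __init__(self, span=2):
        self.levels = []    # one input buffer per level
        self.span = span    # hypothetical: inputs per completed pattern

    def feed(self, item, depth=0):
        if depth == len(self.levels):
            self.levels.append([])       # form a new level if there is none
        buffer = self.levels[depth]
        buffer.append(item)
        if len(buffer) == self.span:
            pattern = tuple(buffer)      # stand-in for cross-comp and clustering
            buffer.clear()
            self.feed(pattern, depth + 1)

pipe = Pipeline(span=2)
for pixel in [1, 2, 3, 4]:
    pipe.feed(pixel)
```

After four inputs the pipeline has grown to three levels, with the top level holding a single pattern composed of two lower-level patterns.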

Some notes:

- There should be a unique set of operations added per level, hence a singular in “cognitive algorithm”.

- Core design must be done theoretically: generality requires large upfront investment in process complexity, which makes it a huge overkill for any specific task. That’s one reason why such schemes are not explored.

- Many readers note disconnect between abstractions in this outline, and the amount of detail in current code. That’s because we are in space-time continuum: search must follow proximity in each dimension, which requires specific processing. It’s not specific to vision, the process is mostly the same for all raw modalities.

- Another complaint is that I don't use mathematical notation, but it simply doesn't have the flexibility to express deeply conditional process, with recursively increasing complexity.

- Most people who aspire to work on AGI think in terms of behavior and robotics. I think this is far too coarse to make progress, the most significant mechanisms are on the level of perception. Feedforward (perception) must drive feedback (action), not the other way around.

- Other distractions are supervision and reinforcement. These are optional task-specific add-ons, core cognitive process is unsupervised pattern discovery, and main problem here is scaling in complexity.

- Don’t even get me started on chatbots.

Comparison to artificial and biological neural networks

All unsupervised learning is some form of pattern discovery, by input comparison and clustering. I do both laterally: among inputs within a level, while in statistical learning they are vertical: between layers of weighted summation. Weight adjustment from error in final comparison is a soft clustering: modulated inclusion or exclusion of subsequent inputs into next output. So, vertical weighted summation is primary to comparison, which makes the comparands distant. This is a conceptual flaw: comparison must follow proximity.

Neural Nets is a version of statistical learning, I think it is best understood as centroid clustering (centroid doesn’t have to be a single value, fitted line in linear regression can be considered a one-dimensional centroid). Basic ANN is a multi-layer perceptron: each node weighs the inputs at synapses, then sums and thresholds them into output. This normalized sum of inputs is their centroid. Output of the top layer is compared to some template, forming an error. Stochastic Gradient Descent then backpropagates the error, training initially random weights into transformations (reversed vertical derivatives) that reduce future error.

That usually means training CNN to perform some sort of edge-detection or cross-correlation (same as my comparison but the former terms lose meaning on higher levels of search). But CNN operations are initially random, while my process is designed for cross-comp from the start. This is why it can be refined by my feedback, updating the filters, which is far more subtle and selective than training by backprop.

So, I have several problems with basic process in ANN:

- Vertical learning (via feedback of error) takes tens of thousands of cycles to form accurate representations. That's because summation per layer degrades positional input resolution. With each added layer, the output that ultimately drives learning contains exponentially smaller fraction of original information. Cross-comp and clustering is far more complex per level, but the output contains all information of the input. Lossy selection is only done on the next level, after evaluation per pattern (vs. before evaluation in statistical methods).

- Both initial weights and sampling that feeds SGD are randomized. Also driven by random variation are RBMs, GANs, VAEs, etc. But randomization is antithetical to intelligence, it's only useful in statistical methods because they merge inputs with weights irreversibly. Thus, any non-random initialization and variation will introduce bias. All input modification in my scheme is via hyper-parameters, stored separately and then used to normalize (remove bias) inputs for comparison to inputs formed with different-value hyper-parameters.

- SGD minimizes error (top-layer miss), which is quantitatively different from maximizing match: compression. And that error is w.r.t. some specific template, while my match is summed over all past input / experience. The “error” here is plural: lateral misses (differences, ratios, etc.), computed by cross-comparison within a level. All inputs represent environment and have positive value. But then they are packed (compressed) into patterns, which have different range and precision, thus different relative value per relatively fixed record cost.

- Representation in ANN is fully distributed, similar to the brain. But the brain has no alternative: there is no substrate for local memory or program in neurons. Computers have RAM, so parallelization is a simple speed vs. efficiency trade-off, useful only for complex semantically isolated nodes. Such nodes are patterns, encapsulating a set of co-derived “what” and “where” parameters. This is similar to neural ensemble, but parameters that are compared together should be localized in memory, not distributed across a network.

More basic neural learning mechanism is Hebbian, though it is rarely used in ML. Conventional spiking version is that weight is increased if the synapse often receives a spike just before the node fires, else the weight is decreased. But input and output don't have to be binary, the same logic can be applied to scalar values: the weight is increased / decreased in proportion to some measure of similarity between its input and following output of the node. That output is normalized sum of all inputs, or their centroid.
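
The scalar version of Hebbian learning can be sketched as below. The similarity measure is an assumption of this example: each input's distance to the node's output is compared with the mean distance over all inputs, so closer-than-average inputs gain weight and the rest lose it. The function name and learning rate are illustrative.

```python
def hebbian_step(weights, inputs, rate=0.5):
    """One local update of a single node. Returns (new_weights, output)."""
    # node output: normalized weighted sum of inputs, i.e. their centroid
    output = sum(w * x for w, x in zip(weights, inputs)) / sum(weights)
    distances = [abs(x - output) for x in inputs]
    mean_distance = sum(distances) / len(distances)
    # assumed similarity: how much closer to the output an input is than average
    new_weights = [
        max(w + rate * (mean_distance - d), 1e-6)   # keep weights positive
        for w, d in zip(weights, distances)
    ]
    return new_weights, output

weights = [1.0, 1.0, 1.0, 1.0]
inputs = [1.0, 1.1, 0.9, 5.0]       # three similar inputs and one outlier
for _ in range(5):
    weights, output = hebbian_step(weights, inputs)
```

After a few steps the weight of the outlier collapses and the output settles near 1.0: the node has become a centroid of the cluster formed by its three similar inputs.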

Such learning is local, within each node. But it’s still a product of vertical comparison: centroid is higher order of composition than individual inputs. This comparison across composition drives all statistical learning, but it destroys positional information at each layer. Compared to autoencoders: main backprop-driven unsupervised learning technique, Hebbian learning lacks the decoding stage (as does the proposed algorithm). Decoding decomposes hidden layers, to equalize composition orders of output and compared template.

Inspiration by the brain kept ANN research going for decades before they became useful. Their “neurons” are mere stick figures, but that’s not a problem, most of neuron’s complexity is due to constraints of biology. The problem is that core mechanism in ANN, weighted summation, may also be a no-longer needed compensation for such constraints: neural memory requires dedicated connections. That makes representation and cross-comparison of individual inputs very expensive, so they are summed. But again, we now have dirt-cheap RAM.

Other biological constraints are very slow neurons, and the imperative of fast reaction for survival in the wild. Both favor fast though crude summation, at the cost of glacial training. Reaction speed became less important: modern society is quite secure, while continuous learning is far more important because of accelerating progress. Summation also reduces noise, which is very important for neurons that often fire at random, to initiate and maintain latent connections. But that’s irrelevant for electronic circuits.

Evolution is extremely limited in complexity that can be added before it is pruned by natural selection, I see no way it could produce the proposed algorithm. And that selection is for reproduction, while intelligence is distantly instrumental. The brain evolved to guide the body, with neurons originating as instinctive stimulus-to-response converters. Hence, both SGD and Hebbian learning are fitting, driven by feedback of action-triggering weighted input sum. Pattern discovery is their instrumental upshot, not an original purpose.

Uri Hasson, Samuel Nastase, Ariel Goldstein reach a similar conclusion in “Direct fit to nature: an evolutionary perspective on biological and artificial neural networks”: “We argue that neural computation is grounded in brute-force direct fitting, which relies on over-parameterized optimization algorithms to increase predictive power (generalization) without explicitly modeling the underlying generative structure of the world. Although ANNs are indeed highly simplified models of BNNs, they belong to the same family of over-parameterized, direct-fit models, producing solutions that are mistakenly interpreted in terms of elegant design principles but in fact reflect the interdigitation of ‘‘mindless’’ optimization processes and the structure of the world.”

Comparison to Capsule Networks

The nearest experimentally successful method is the recently introduced “capsules”. Some similarities to CogAlg:

- capsules also output multivariate vectors, “encapsulating” several parameters, similar to my patterns,

- these parameters also include pose: coordinates and dimensions, compared to compute transformations,

- these transformations are compared to find affine transformations or equivariance: my match of misses,

- capsules also send direct feedback to lower layer: dynamic routing, vs. trans-hidden-layer backprop in ANN.

My main problems with CapsNet and alternative treatment:

- Object is defined as a recurring configuration of different parts. But such recurrence can’t be assumed, it should be derived by cross-comparing relative position among parts of matching objects. This can only be done after their positions are cross-compared, which is after their objects are cross-compared: two levels above the level that forms initial objects. So, objects formed by positional equivariance would be secondary, though they may displace initial segmentation objects as a primary representation. Stacked Capsule Autoencoders also have exclusive segmentation on the first layer, but proximity doesn’t matter on their higher layers.

- Routing by agreement is basically recursive centroid clustering, by match of input vector to the output vector. The output (centroid) represents inputs at all locations, so its comparison to inputs is effectively mixed-distance. Thus, clustering in CapsNet is fuzzy and discontinuous, forming redundant representations. Routing by agreement reduces that redundancy, but not consistently so, it doesn’t specifically account for it. My default clustering is exclusive segmentation: each element (child) initially belongs to one cluster (parent). Fuzzy clustering is selective to inputs valued above the cost of adjusting for overlap in representation, which increases with the range of cross-comparison. This conditional range increase is done on all levels of composition.

- Instantiation parameters are application-specific, CapsNet has no general mechanism to derive them. My general mechanism is cross-comparison of input capsule parameters, which forms higher-order parameters. First level forms pixel-level gradient, similar to edge detection in CNN. But then it forms proximity-constrained clusters, defined by gradient and parameterized by summed pixel intensity, dy, dx, gradient, angle. This cross-comparison followed by clustering is done on all levels, with incremental number of parameters per input.

- Number of layers is fixed, while I think it should be incremental with experience. My hierarchy is a dynamic pipeline: patterns are displaced from a level by criterion sign change and sent to existing or new higher level. So, both hierarchy of patterns per system and sub-hierarchy of derivatives per pattern expand with experience. The derivatives are summed within a pattern, then evaluated for extending intra-pattern search and feedback.

- Output vector of higher capsules combines parameters of all lower layers into Euclidean distance. That is my default too, but they should also be kept separate, for potential cross-comp among layer-wide representations.
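
The first-level clustering mentioned in the list above (proximity-constrained clusters defined by gradient and parameterized by summed pixel intensity, dy, dx, gradient, angle) can be sketched as a flood fill. This is an illustration under assumptions: forward-difference gradient, 4-connectivity, a hand-picked filter `ave`, and hypothetical names throughout.

```python
import math
from collections import deque

def form_blobs(image, ave):
    """Segment an image into blobs of same-sign gradient deviation from `ave`,
    summing pixel-level parameters within each blob."""
    h, w = len(image) - 1, len(image[0]) - 1
    derts = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            p = image[y][x]
            dy, dx = image[y + 1][x] - p, image[y][x + 1] - p
            derts[y][x] = (p, dy, dx, math.hypot(dy, dx))
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y0 in range(h):
        for x0 in range(w):
            if seen[y0][x0]:
                continue
            sign = derts[y0][x0][3] < ave          # True: low-gradient (flat) blob
            blob = {"sign": sign, "area": 0, "I": 0, "Dy": 0, "Dx": 0, "G": 0.0}
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:                            # breadth-first flood fill
                y, x = queue.popleft()
                p, dy, dx, g = derts[y][x]
                blob["area"] += 1
                blob["I"] += p
                blob["Dy"] += dy
                blob["Dx"] += dx
                blob["G"] += g
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and (derts[ny][nx][3] < ave) == sign):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blob["angle"] = math.atan2(blob["Dy"], blob["Dx"])   # summed gradient direction
            blobs.append(blob)
    return blobs
```

Each element belongs to exactly one blob, which is the exclusive segmentation described above; the summed parameters are what a higher level would cross-compare between blobs.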

Overall, CapsNet is a variation of ANN, with input summation first and dynamic routing second. So, it’s a type of Hebbian learning, with most of the problems that I listed in the previous section.

Elaboration, parts 4 and below are out of date:

0. Cognition vs. evolution, analog vs. symbolic initial input

Some say intelligence can be recognized but not defined. I think that’s absurd: we recognize some implicit definition. Others define intelligence as a problem-solving ability, but the only general problem is efficient search for solutions. Efficiency is a function of selection among inputs, vs. brute-force all-to-all search. This selection is by predicted value of the inputs, and prediction is interactive projection of their patterns. Some agree that intelligence is all about pattern discovery, but define pattern as a crude statistical coincidence.

Of course, the only mechanism known to produce human-level intelligence is even cruder, and that shows in haphazard construction of our brains. Algorithmically simple, biological evolution alters heritable traits at random and selects those with above-average reproductive fitness. But this process requires almost inconceivable computing power because selection is extremely coarse: on the level of whole genome rather than individual traits, and also because intelligence is only one of many factors in reproductive fitness.

Random variation in evolutionary algorithms, generative RBMs, and so on, is antithetical to intelligence. Intelligent variation must be driven by feedback within cognitive hierarchy: higher levels are presumably “smarter” than lower ones. That is, higher-level inputs represent operations that formed them, and are evaluated to alter future lower-level operations. Basic operations are comparison and summation among inputs, defined by their range and resolution, analogous to reproduction in genetic algorithms.

Range of comparison per conserved-resolution input should increase if projected match (cognitive fitness function) exceeds average match per comparison. In any non-random environment, average match declines with the distance between comparands. Thus, search over increasing distance requires selection of above-average comparands. Any delay, coarseness, and inaccuracy of such selection is multiplied at each search expansion, soon resulting in combinatorial explosion of unproductive (low additive match) comparisons.
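
A simplified sketch of conditional range extension in 1D. It assumes inverse match (a hand-picked filter `ave` minus the absolute difference) and extends the range for the whole sequence at once, rather than selectively per input, so it only illustrates how match declines with distance and where extension stops paying off. The function name and parameters are hypothetical.

```python
def extend_range(inputs, ave, max_range):
    """Compare inputs at increasing distance while mean match per comparison
    stays positive. Returns (last productive distance, summed match)."""
    productive_range, total_match = 0, 0.0
    for distance in range(1, max_range + 1):
        pairs = list(zip(inputs, inputs[distance:]))
        if not pairs:
            break
        # inverse match per comparison: filter minus absolute difference
        matches = [ave - abs(b - a) for a, b in pairs]
        if sum(matches) / len(matches) <= 0:
            break    # mean match fell to or below the filter: stop extending
        productive_range = distance
        total_match += sum(matches)
    return productive_range, total_match

result = extend_range([10, 11, 12, 14, 15, 17, 18, 20], ave=4, max_range=5)
```

For this smoothly rising sequence, comparisons at distances 1 and 2 are productive, while at distance 3 the mean difference exceeds the filter and extension stops.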

Hence, my model is strictly incremental: search starts with minimal-complexity inputs and expands with minimal increments in their range and complexity (syntax). At each level, there is only one best increment, projected to discover the greatest additive match. No other AGI approach follows this principle.

I guess people who aim for human-level intelligence are impatient with small increments and simple sensory data. Yet, this is the most theoretical problem ever, demanding the longest delay in gratification.

symbolic obsession and its discontents

Current Machine Learning and related theories (AIT, Bayesian inference, etc.) are largely statistical also because they were developed primarily for symbolic data. Such data, pre-compressed and pre-selected by humans, is far more valuable than sensory inputs it was ultimately derived from. But due to this selection and compression, proximate symbols are not likely to match, and partial match between them is very hard to quantify. Hence, symbolic data is a misleading initial target for developing conceptually consistent algorithm.

Use of symbolic data as initial inputs in AGI projects betrays profound misunderstanding of cognition. Even children, predisposed to learn language, only become fluent after years of directly observing things their parents talk about. Words are mere labels for concepts, the most important of which are spatio-temporal patterns, generalized from multi-modal sensory experience. Top-down reconstruction of such patterns solely from correlations among their labels should be exponentially more difficult than their bottom-up construction.

All our knowledge is ultimately derived from senses, but lower levels of human perception are unconscious. Only generalized concepts make it into our consciousness, AKA declarative memory, where we assign them symbols (words) to facilitate communication. This brain-specific constraint creates heavy symbolic vs. sub-symbolic bias, especially strong in artificial intelligentsia. Which is putting a cart in front of a horse: most words are meaningless unless coupled with implicit representations of sensory patterns.

To be incrementally selective, cognitive algorithm must exploit proximity first, which is only productive for continuous and loss-tolerant raw sensory data. Symbolic data is already compressed: consecutive characters and words in text won’t match. It’s also encoded with distant cross-references, that are hardly ever explicit outside of a brain. Text looks quite random unless you know the code: operations that generalized pixels into patterns (objects, processes, concepts). That means any algorithm designed specifically for text will not be consistently incremental in the range of search, which will impair its scalability.

In Machine Learning, input is string, frame, or video sequence of a defined length, with artificial separation between training and inference. In my approach, learning is continuous and interactive. Initial inputs are streamed pixels of maximal resolution, and higher-level inputs are multi-variate patterns formed by comparing lower-level inputs. Spatio-temporal range of inputs, and selective search across them, is extended indefinitely. This expansion is directed by higher-level feedback, just as it is in human learning.

Everything ever written is related to my subject, but nothing is close enough: no other method is meant to be fully consistent. Hence a dire scarcity of references here. My approach is self-contained, it doesn’t require references. But it does require clean context, hopefully cleaned-up by reader’s introspective generalization.

1. Atomic comparison: quantifying match and miss between two variables

First, we need to quantify predictive value. Algorithmic information theory defines it as compressibility of representation, which is perfectly fine. But compression is currently computed only for sequences of inputs, while I think a logical start is analog input digitization: a rock bottom of organic compression hierarchy. The next level is cross-comparison among resulting pixels, commonly known as edge detection, and higher levels will cross-compare resulting patterns. Partial match computed by comparison is a measure of compression.

Partial match between two variables is the complement of miss, in the corresponding power of comparison:

- Boolean match is AND and miss is XOR (two zero inputs form zero match and zero miss),

- comparison by subtraction increases match to a smaller comparand and reduces miss to a difference,

- comparison by division increases match to min * integer part of ratio and reduces miss to a fractional part

(Direct match works for tactile input, but reflected light in vision requires an inverse definition of initial match.)

In other words, match is a compression of larger comparand’s magnitude by replacing it with miss. Which means that match = smaller input: a common subset of both inputs, = sum of AND between their uncompressed (unary code) representations. Ultimate criterion is recorded magnitude, rather than bits of memory it occupies, because the former represents physical impact that we want to predict. The volume of memory used to record that magnitude depends on prior compression, which is not an objective parameter.
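The identity between match, the smaller comparand, and the sum of AND over unary codes can be checked with a minimal Python sketch. The helper names `unary`, `unary_match` and `unary_miss` are illustrative assumptions, they are not taken from the draft code.

```python
def unary(n, width):
    """Unary (uncompressed) code of a non-negative integer: n ones, zero-padded."""
    return [1] * n + [0] * (width - n)

def unary_match(a, b):
    """Sum of bitwise AND between the unary codes of a and b."""
    width = max(a, b)
    return sum(x & y for x, y in zip(unary(a, width), unary(b, width)))

def unary_miss(a, b):
    """Sum of bitwise XOR between the unary codes of a and b."""
    width = max(a, b)
    return sum(x ^ y for x, y in zip(unary(a, width), unary(b, width)))
```

For non-negative integers, `unary_match(a, b)` equals `min(a, b)` and `unary_miss(a, b)` equals `abs(a - b)`, so two zero inputs give zero match and zero miss.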

Some may object that match includes the case when both inputs equal zero, but then match should also be zero. The purpose here is prediction, which represents conservation of some physical property of observed objects. Ultimately, we’re predicting potential impact on observer, represented by input. Zero input means zero impact, which has no conservable property (inertia), thus no intrinsic predictive value.

Given incremental complexity, initial inputs should have binary resolution and implicit coordinate (which is a macro-parameter, so its resolution lags that of an input). Compression of bit inputs by AND is well known as digitization: substitution of two lower 1 bits with one higher 1 bit. Resolution of coordinate (input summation span) is adjusted by feedback to form integers that are large enough to produce above-average match.
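One possible reading of digitization by AND is a binary counter: whenever two 1 bits of the same weight meet, their AND becomes a single 1 bit of the next higher weight, and their XOR is what remains at the current weight. The sketch below is only an illustration under that assumption. The names `digitize` and `to_int` are hypothetical, and feedback-driven adjustment of the summation span is not modelled.

```python
def digitize(bits):
    """Sum a span of 0/1 inputs into binary digits, least significant first."""
    digits = []
    for bit in bits:
        carry, k = bit, 0
        while carry:
            if k == len(digits):
                digits.append(0)
            # AND marks a matched pair of 1s, promoted to the next power of two;
            # XOR is the unmatched remainder left at the current power.
            carry, digits[k] = digits[k] & carry, digits[k] ^ carry
            k += 1
    return digits

def to_int(digits):
    """Integer value of least-significant-first binary digits."""
    return sum(d << k for k, d in enumerate(digits))
```

With this reading, `to_int(digitize(bits))` is the count of 1 bits in the span, so a longer span yields larger integers.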

Next-order compression is comparison between consecutive integers, with binary (before | after) coordinate.

Additive match is achieved by comparison of a higher power than that which produced comparands: AND will not further compress integers digitized by AND. Rather, initial comparison between integers is by subtraction, resulting difference is miss, and smaller input is absolute match. Compression of represented magnitude is by replacing i1, i2 with their derivatives: match (min) and miss (difference). If we sum each pair:

inputs: 5 + 7 -> 12, derivatives: match = 5 + miss = 2 -> 7. Compression by replacing = match: 12 - 7 -> 5. Difference is smaller than XOR (non-zero complementary of AND) because XOR may include opposite-sign (opposite-direction) bit pairs 0, 1 and 1, 0, which are cancelled-out by subtraction.
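The arithmetic above can be restated as a short Python sketch. The function names `comp_sub` and `compression` are assumptions made for illustration. The sign convention for the difference (later input minus earlier input) is also an assumption.

```python
def comp_sub(i1, i2):
    """Comparison by subtraction: match is the smaller input, miss is the difference."""
    match = min(i1, i2)
    diff = i2 - i1          # signed miss; its magnitude is what gets summed below
    return match, diff

def compression(i1, i2):
    """Magnitude saved by replacing the input pair with its derivatives."""
    match, diff = comp_sub(i1, i2)
    return (i1 + i2) - (match + abs(diff))
```

For the pair 5, 7 this gives match 5, miss 2 and compression 5, the same as match. The comparison with XOR is easier to see on 7 and 8: their difference is 1, while `7 ^ 8` is 15, because the opposite-direction bit pairs are not cancelled out.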

Comparison by division forms ratio, which is a compressed difference. This compression is explicit in long division: match is accumulated over iterative subtraction of smaller comparand from remaining difference. In other words, this is also a comparison by subtraction, but between different orders of derivation. Resulting match is smaller comparand * integer part of ratio, and miss is final remainder or fractional part of ratio. The ratio can be further compressed by converting it to radix or logarithm, and so on.
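A minimal sketch of comparison by division as iterative subtraction, assuming positive integer comparands. The name `comp_div` and the returned tuple layout are illustrative choices, not the draft code.

```python
def comp_div(i1, i2):
    """Comparison by division, done as repeated subtraction of the smaller comparand.

    Returns (match, miss, multiple): match = smaller * integer part of ratio,
    miss = final remainder, multiple = integer part of ratio.
    """
    small, large = sorted((i1, i2))
    if small <= 0:
        raise ValueError("comp_div assumes positive comparands")
    multiple, remainder = 0, large
    while remainder >= small:       # each subtraction adds one more copy of small to match
        remainder -= small
        multiple += 1
    return small * multiple, remainder, multiple
```

For 17 and 5 the match is 15 (three copies of 5) and the miss is 2. For 5 and 7 the ratio is below 2, so the result is the same as in comparison by subtraction: match 5, miss 2. In both cases match = larger input - miss.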

By reducing miss, higher-power comparison increases complementary match (match = larger input - miss):

to be compressed: larger input | XOR | difference: combined current-order match & miss

additive match: AND | opposite-sign XOR | multiple: of a smaller input within a difference

remaining miss: XOR | difference | fraction: complementary to multiple within a ratio

But the costs of operations and incidental sign, fraction, irrational fraction, etc. may grow even faster. To justify the costs, the power of comparison should only increase in patterns of above-average match from prior order of comparison: AND for bit inputs, SUB for integer inputs, DIV for pattern inputs, etc. Inclusion into such patterns is by relative match: match - ave: past match that co-occurs with average higher-level match.

Match value should be weighted by the correlation between input intensity and its stability: mass / energy / hardness of an observed object. Initial input, such as reflected light, is likely to be incidental: such correlation is very low. Since match is the magnitude of smaller input, its weight should also be low if not zero. In this case projected match consists mainly of its inverse component: match cancellation by co-derived miss, see below.

The above discussion is on match from current comparison, but we really want to know projected match to future or distant inputs. That means the value of match needs to be projected by co-derived miss. In comparison by subtraction, projected match = min (i1, i2) * weight (fractional) - difference (i1, i2) / 2 (divide by 2 because the difference only reduces projected input, thus min( input, projected input), in the direction in which it is negative. It doesn’t affect min in the direction where projected input is increasing).
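The projection formula can be written as a one-line Python sketch. It assumes that the difference enters by its magnitude and that `weight` is a fraction between 0 and 1. The name `projected_match` is hypothetical.

```python
def projected_match(i1, i2, weight):
    """Match from comparison by subtraction, projected by the co-derived miss.

    weight: assumed fractional correlation between input intensity and stability.
    The difference is halved because it only lowers the projected minimum
    in the direction where the projected input decreases.
    """
    return min(i1, i2) * weight - abs(i1 - i2) / 2
```

With inputs 5 and 7, a weight of 1.0 gives projected match 4.0. A weight of 0.0, as suggested for incidental inputs such as reflected light, leaves only the inverse component: -1.0.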

quantifying lossy compression

There is a general agreement that compression is a measure of similarity, but no one seems to apply it from the bottom up, the bottom being single scalars. Also, any significant compression must be lossy. This is currently evaluated by perceived similarity of reconstructed input to the original input, as well as compression rate. Which is very coarse and subjective. Compression in my level of search is lossless, represented by match on all levels of pattern. All derived representations are redundant, so it’s really an expansion vs. compression overall.

The lossy part comes after evaluation of resulting patterns on the next level of search. Top level of patterns is cross-compared by default, evaluation is per lower level: of incremental derivation and detail in each pattern. Loss is when low-relative-match buffered inputs or alternative derivatives are not cross-compared. Such loss is quantified as the scope * resolution of representation in these lower levels, not some subjective quality.

2. Forward search and patterns, implementation for image recognition in video

Pattern is a contiguous span of inputs that form above-average matches, similar to conventional cluster.

As explained above, matches and misses (derivatives) are produced by comparing consecutive inputs. These derivatives are summed within a pattern and then compared between patterns on the next level of search, adding new derivatives to a higher pattern. Patterns are defined contiguously on each level, but positive and negative patterns are always interlaced, thus next-level same-sign comparison is discontinuous.

Negative patterns represent contrast or discontinuity between positive patterns, which is a one- or higher- dimensional equivalent of difference between zero-dimensional pixels. As with differences, projection of a negative pattern competes with projection of adjacent positive pattern. But match and difference are derived from the same input pair, while positive and negative patterns represent separate spans of inputs.

Negative match patterns are not predictive on their own but are valuable for allocation: computational resources of no-longer predictive pattern should be used elsewhere. Hence, the value of negative pattern is borrowed from predictive value of co-projected positive pattern, as long as combined additive match remains above average. Consecutive positive and negative patterns project over same future input span, and these projections partly cancel each other. So, they should be combined to form feedback, as explained in part 3.

Initial match is evaluated for inclusion into higher positive or negative pattern. The value is summed until its sign changes, and if positive, evaluated again for cross-comparison among constituent inputs over increased distance. Second evaluation is necessary because the cost of incremental syntax generated by cross-comparing is per pattern rather than per input. Pattern is terminated and outputted to the next level when value sign changes. On the next level, it is compared to previous patterns of the same compositional order.
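The forward pass described above can be sketched in a few lines of Python. This is a simplified illustration, not the draft code: it assumes match is min of the two comparands, a fixed filter `ave`, and pattern termination on sign change of the per-input relative match. The second evaluation for extended cross-comparison is left out. The class `Pattern` and its field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Pattern:
    sign: bool      # True for positive (above-average match) pattern
    x0: int         # coordinate of the first included input
    L: int = 0      # number of included inputs
    I: int = 0      # summed inputs
    D: int = 0      # summed differences
    M: int = 0      # summed relative matches
    elements: list = field(default_factory=list)   # (input, d, m) per included input

def form_patterns(inputs, ave):
    """Compare consecutive inputs and cluster them into same-sign patterns."""
    patterns, P = [], None
    for x in range(1, len(inputs)):
        prior, i = inputs[x - 1], inputs[x]
        d = i - prior                 # miss
        m = min(i, prior) - ave       # relative match: deviation from the filter
        sign = m > 0
        if P is None or sign != P.sign:
            if P is not None:
                patterns.append(P)    # sign changed: terminate and output the pattern
            P = Pattern(sign=sign, x0=x)
        P.L += 1; P.I += i; P.D += d; P.M += m
        P.elements.append((i, d, m))
    if P is not None:
        patterns.append(P)
    return patterns
```

For `form_patterns([10, 11, 12, 3, 2, 12, 13], ave=5)` this yields three interlaced patterns: positive with L=2 and M=11, negative with L=3 and M=-8, then positive with L=1 and M=7.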

Initial inputs are pixels of video, or equivalent limit of positional resolution in other modalities. Hierarchical search on higher levels should discover patterns representing empirical objects and processes, and then relational logical and mathematical shortcuts, eventually exceeding generality of our semantic concepts.

In cognitive terms, everything we know is a pattern, the rest of input is noise, filtered out by perception. For online learning, all levels should receive inputs from lower levels and feedback from higher levels in parallel.

space-time dimensionality and initial implementation

Any prediction has two components: what and where. We must have both: value of prediction = precision of what * precision of where. That “where” is currently neglected: statistical ML represents space-time at greatly reduced resolution, if at all. In the brain and some neuromorphic models, “where” is represented in a separate network. That makes transfer of positional information very expensive and coarse, reducing predictive value of representations. There is no such separation in my patterns, they represent both what and where as local vars.

My core algorithm is 1D: time only (part 4). Our space-time is 4D, but each of these dimensions can be mapped on one level of search. This way, levels can select input patterns that are strong enough to justify the cost of representing additional dimension, as well as derivatives (matches and differences) in that dimension.

Initial 4D cycle of search would compare contiguous inputs, similarly to connected-component analysis:

level 1 compares consecutive 0D pixels within horizontal scan line, forming 1D patterns: line segments.

level 2 compares contiguous 1D patterns between consecutive lines in a frame, forming 2D patterns: blobs.

level 3 compares contiguous 2D patterns between incremental-depth frames, forming 3D patterns: objects.

level 4 compares contiguous 3D patterns in temporal sequence, forming 4D patterns: processes.

(in simple video, time is added on level 3 and depth is computed from derivatives)
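The first two levels of this cycle can be sketched in Python, in the spirit of connected-component analysis. This is a rough illustration under several assumptions: match is min of adjacent pixels minus a fixed `ave`, contiguity between lines means overlap in x of same-sign segments, and diagonal contact is ignored. The names `segments` and `form_blobs` are hypothetical.

```python
def segments(row, ave):
    """Level 1: split one scan line into same-sign spans [sign, x_start, x_end)."""
    segs = []
    for x in range(1, len(row)):
        sign = min(row[x - 1], row[x]) - ave > 0
        if segs and segs[-1][0] == sign:
            segs[-1][2] = x + 1              # extend the current segment
        else:
            segs.append([sign, x, x + 1])    # sign changed: start a new segment
    return segs

def form_blobs(image, ave):
    """Level 2: merge same-sign segments that overlap between consecutive lines."""
    parent = {}

    def find(key):
        while parent[key] != key:
            parent[key] = parent[parent[key]]    # path compression
            key = parent[key]
        return key

    prev_line = []
    for y, row in enumerate(image):
        line = segments(row, ave)
        for n, (sign, x0, xn) in enumerate(line):
            parent[(y, n)] = (y, n)
            for pn, (psign, px0, pxn) in enumerate(prev_line):
                if psign == sign and x0 < pxn and px0 < xn:   # overlap in x
                    parent[find((y, n))] = find((y - 1, pn))
        prev_line = line
    blobs = {}
    for key in parent:
        blobs.setdefault(find(key), []).append(key)
    return list(blobs.values())
```

Each blob is returned as a list of (line, segment index) keys. Levels 3 and 4 would repeat the same linking step over frames and over time, with blobs and then objects in place of segments.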

Subsequent cycles would compare 4D input patterns over increasing distance in each dimension, forming longer-range discontinuous patterns. These cycles can be coded as implementation shortcut, or form by feedback of core algorithm itself, which should be able to discover maximal dimensionality of inputs. “Dimension” here is a parameter that defines external sequence and distance among inputs. This is different from conventional clustering, where both external and internal parameters are dimensions. More in part 6.

However, average match at a given distance in our space-time is presumably equal over all four dimensions. That means patterns defined in fewer dimensions will be fundamentally limited and biased by the angle of scanning. Hence, initial pixel comparison and clustering into patterns should also be over 4D at once, or at least over 2D for images and 3D for video. This is our-universe-specific extension of my core algorithm.

There is also a vision-specific adaptation in the way I define initial match. Predictive visual property is albedo, which means locally stable ratio of brightness / intensity. Since lighting is usually uniform over much larger area than pixel, the difference in brightness between adjacent pixels should also be stable. Relative brightness indicates some underlying property, so it should be cross-compared to form patterns. But it’s reflected: doesn’t really represent physical quantity / density of an object. Thus, initial match is inverse deviation of gradient.
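A minimal sketch of this vision-specific match, assuming a grayscale image given as a list of rows, a gradient computed from right and lower neighbours, and a fixed average gradient `ave_g`. The kernel choice and the name `gradient_match` are assumptions for illustration only.

```python
import math

def gradient_match(image, ave_g):
    """Initial match as inverse deviation of gradient: ave_g - g per pixel.

    Positive values mark low-gradient (stable) areas, negative values mark edges.
    The last row and column are dropped because they lack a neighbour to compare.
    """
    out = []
    for y in range(len(image) - 1):
        out_row = []
        for x in range(len(image[0]) - 1):
            dx = image[y][x + 1] - image[y][x]    # lateral difference
            dy = image[y + 1][x] - image[y][x]    # vertical difference
            g = math.hypot(dx, dy)                # gradient magnitude
            out_row.append(ave_g - g)
        out.append(out_row)
    return out
```

On a flat area next to a vertical edge, for example rows of `[10, 10, 50]` with `ave_g = 15`, the flat pixel gets match 15 and the pixel at the edge gets -25.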

We are currently coding 1st level algorithm: https://github.com/boris-kz/CogAlg/wiki. 1D code is complete, but not macro-recursive. We are extending it to 2D for image recognition, then to 3D video for object and process recognition. Higher levels for each D-cycle algorithm will process discontinuous search among full-D patterns. Complete hierarchical (meta-level) algorithm will consist of:

- 1st level algorithm: contiguous cross-comparison over full-D cycle, plus bit-filter feedback

- recurrent increment in complexity, extending current-level alg to next-level alg. It will unfold increasingly complex higher-level input patterns for cross-comparison, then combine results for evaluation and feedback.

We will then add colors, maybe audio and text. Initial testing could be recognition of labeled images, but 2D is a poor representation of our 4D world, video or stereo video is far better. Variation across space is a product of past interactions, thus predictive of variation over time (which is normally lower: we can’t speed-up time).

3. Feedback filters, attentional input selection, imagination, motor action

After evaluation for inclusion into higher-level pattern, the input is also accumulated into feedback to lower levels. Feedback is update to filters that evaluate forward (Λ) and feedback (V), as described above but on lower level. Feedback value = absolute value of summed input parameter - filter-filter (opportunity cost of filter update). Default feedback is combined level-sequentially, while more expensive shortcut feedback may be sent to selected levels to filter inputs that are already in the pipeline, or to rearrange levels in the hierarchy.

There is internal filter for each compared variable of an input, and external filter per coordinate in which the inputs are ordered. Basic internal filter is average projected match that co-occurs with (predicts) average higher-level match, and basic external filter is a distance to the next input of projected average value. Thus, coordinate filter is a span of inputs skipped because they are projected to be either too predictable or too noisy to bother with. External filters have lower resolution / higher scope at the same order of quantization.

Both input and coordinate filters discussed above are integers, but they can be of any order of quantization. Binary filters are the least and the most significant bits of input value and coordinate (input summation span). For coordinate filter, LSB is pixel size and MSB is frame size. These filters are adjusted to balance overflow and underflow. Then there are higher-than-integer filters: ratios or coefficients, AKA weights, and so on. They adjust magnitude per input variable type, in proportion to relative higher-level match of these variables.

Lower filters are min values for input inclusion in higher-composition inputs, and upper filters are max values that trigger higher-input termination: bit -> pixel -> pattern -> pattern_of_patterns (code starts from pixels).

The number of updateable filters will increase with elevation:

1st level may update only:

- value bit filters: lower: LSB, upper: word size -> MSB, and

- coord lower bit filter: LSB, which is a pixel size

2nd level may also update:

- coord upper bit filters, such as frame dimensions -> coordinate MSB, causing premature P termination

- value integer filters: lower: average match, upper: max Match -> premature average match feedback

3rd level may add:

- coord lower integer filter: starting coordinate (next C or skip-to distance), and

- coord upper integer filter: max next C -> premature next-C feedback

Etc.

novelty vs. generality

Any system must have a common fitness function or selection criterion. Two obvious criteria in cognition are novelty and generality: miss and match. But we can’t select for both, they exhaust all possibilities. Novelty can’t be primary criterion: it would select for noise and filter out all patterns, which are defined by match. On the other hand, to maximize match of inputs to memory we can stare at a wall: lock into predictable environments. But of course, natural curiosity actively skips predictable locations, thus reducing the match.

This dilemma is resolved if we maximize predictive power: projected vs. confirmed match of inputs to records (all records are predictions, else they are forgotten). To the extent that new match is predictable, it doesn’t add to total projected match of the model. But neither does noise: novelty (difference from records) of inputs that won’t match in the future. So, match is positive in feedforward but negative in feedback: the sign is reversed with direction. Projected match is the same as compression, which includes skipping low-value input spans.

We can see this in individual derivatives:

- higher-level match is specific to past inputs, thus it’s a filter for future inputs, projected from the past.

- higher-higher level match of a match is more detached from specific inputs, thus less accurate as a filter.

Conversely, it projects higher match among future inputs, independently from their match to past inputs.

And so on, higher derivation orders of match are increasingly positive (less filtering) for future inputs.

So, selection for novelty is done by subtracting higher-level projection from corresponding input parameter. Higher-order positional selection is skipping (or avoiding processing) predictable future input spans. Skipped input span is formally a *coordinate* filter feedback: next coordinate of inputs with expected above-average *additive* predictive value. Thus, next input location is selected first by proximity and then by novelty, both relative to a template comparand. This is covered in more detail in part 4, level 3.

Vertical evaluation computes deviations, to form positive or negative higher-level patterns. Evaluation is relative to higher-level averages, which represent past inputs, thus should be projected over feedback delay: average += average difference * (delay / average span) /2. Average per input variable may also be a feedback, representing redundancy to higher level, which also depends on higher-level match rate: rM = match / input.

If rM > average per cost of processing: additive match = input match - input-to-average match * rM.
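The two formulas of vertical evaluation can be transcribed directly into Python. This is a sketch only: all quantities are passed in as plain numbers, and the condition on rM is left to the caller because "average per cost of processing" is not specified here. The function names are hypothetical.

```python
def project_average(average, average_difference, delay, average_span):
    """Project a higher-level average over the feedback delay."""
    return average + average_difference * (delay / average_span) / 2

def match_rate(match, input_value):
    """rM: higher-level match rate, match per unit of input."""
    return match / input_value

def additive_match(input_match, input_to_average_match, rM):
    """Match of the input that is not redundant to the higher-level average."""
    return input_match - input_to_average_match * rM
```

For example, an average of 10 that grows by 2 per span of 8 inputs, projected over a delay of 4 inputs, becomes 10.5.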

Lateral comparison computes differences, to project corresponding parameters of all derivation orders:

difference in magnitude of initial inputs: projected next input = last input + difference/2,

difference in input match, a subset of magnitude: projected next match = last match + match difference/2,

difference in match of match, a sub-subset of magnitude, projected correspondingly, and so on.
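The lateral projections listed above share one form, which a short Python sketch makes explicit. The names `project_next` and `project_orders` are assumptions. Each derivation order is projected by its own difference.

```python
def project_next(last, difference):
    """Projected next value = last value + difference / 2."""
    return last + difference / 2

def project_orders(last_values, differences):
    """Project every derivation order: input, match, match of match, and so on.

    last_values and differences are parallel lists, one entry per order.
    """
    return [project_next(v, d) for v, d in zip(last_values, differences)]
```

With last input 7 and difference 2, the projected next input is 8.0. `project_orders([7, 5, 3], [2, -2, 1])` projects input, match and match of match in one call.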

Ultimate criterion is top order of match on a top level of search: the most predictive parameter in a system.

imagination, planning, action

Imagination is never truly original, it can only be formalized as interactive projection of known patterns. As explained above, patterns send feedback to filter lower-level sources. This feedback is to future sources, where the patterns are projected to continue or re-occur. Stronger upstream patterns and correspondingly higher filters reduce resolution of or totally skip predictable input spans. But when multiple originally distant patterns are projected into the same location, their feedback cancels out in proportion to their relative difference.

In other words, combined filter is cancelled-out to the extent that co-projected patterns are mutually exclusive:

filter = max_pattern_feedback - alt_pattern_feedback * match_rate. By default, match_rate used here is average (match / max_comparand). But it has average error: average abs(match_rate - average_match_rate). To improve filter accuracy, we can derive actual match rate by cross-comparing co-projected patterns. I think imagination is just that: search across co-projected patterns, before accessing their external target sources.
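The combined filter and the two ways of obtaining match_rate can be sketched in Python. The variable names follow the formula above. The helper names and the use of scalar feedback values are assumptions for illustration.

```python
def combined_filter(max_pattern_feedback, alt_pattern_feedback, match_rate):
    """Feedback of the strongest pattern, reduced by the overlapping part of the alternative."""
    return max_pattern_feedback - alt_pattern_feedback * match_rate

def actual_match_rate(a, b):
    """Match rate from direct comparison: match / max_comparand."""
    return min(a, b) / max(a, b)

def average_rate_error(match_rates):
    """Average abs(match_rate - average_match_rate) over past comparisons."""
    mean = sum(match_rates) / len(match_rates)
    return sum(abs(r - mean) for r in match_rates) / len(match_rates)
```

With feedbacks 10 and 6 and a default match_rate of 0.5, the combined filter is 7.0. If the co-projected patterns are cross-compared and the actual rate turns out to be 0.75, the filter drops to 5.5. The average error indicates how much such cross-comparison could improve on the default.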

Patterns are projected in space and time, depending on their past S-T span and a vector of input derivatives over that span. So, pattern input parameters in some future location can be projected as:

(recorded input parameters) + (corresponding derivatives * relative distance) / 2.

Where relative distance = (projected coords - current coords) / span of the pattern in the same direction.
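A small Python sketch of this projection, assuming one coordinate dimension and parameters stored in dictionaries keyed by name. The names `relative_distance` and `project_params` are hypothetical.

```python
def relative_distance(projected_coord, current_coord, span):
    """Distance to the projected location, in units of the pattern's span."""
    return (projected_coord - current_coord) / span

def project_params(params, derivatives, projected_coord, current_coord, span):
    """Recorded parameters + corresponding derivatives * relative distance / 2."""
    rd = relative_distance(projected_coord, current_coord, span)
    return {name: params[name] + derivatives[name] * rd / 2 for name in params}
```

For a pattern with span 20 at coordinate 10, projected to coordinate 30, the relative distance is 1.0, so each parameter is shifted by half of its summed derivative.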

Any search is defined by location: contiguous coordinate span. Span of feedback target is that of feedback source’s input pattern: narrower than the span of feedback source unit’s output pattern. So, search across co-projected patterns is performed on a conceptually lower level, but patterns themselves belong to higher level. Meaning that search will be within intersection of co-projected patterns, vs. whole patterns. Intersection is a location within each of the patterns, and cross-comparison will be among pattern elements in that location.

Combined filter is then prevaluated: projected value of positive patterns is compared to the cost of evaluating all inputs, both within a target location. If prevalue is negative: projected inputs are not worth evaluating, their location is skipped and “imagination” moves to the next nearest one. Filter search continues until prevalue turns positive (with above-average novelty) and the sensor is moved to that location. This sensor movement, along with adjustment of its threshold, is the most basic form of motor feedback, AKA action.
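The prevaluation loop can be illustrated with a short Python sketch. It assumes a 1D coordinate, a flat cost per input, and candidate locations given as dictionaries with hypothetical keys `coord`, `projected_value` and `n_inputs`. None of this is taken from the draft code.

```python
def next_target(locations, sensor_coord, cost_per_input):
    """Return (coord, prevalue) of the nearest location worth evaluating, or None.

    Locations are visited by proximity to the sensor. Those with non-positive
    prevalue are skipped without accessing their external sources.
    """
    by_proximity = sorted(locations, key=lambda loc: abs(loc["coord"] - sensor_coord))
    for loc in by_proximity:
        prevalue = loc["projected_value"] - cost_per_input * loc["n_inputs"]
        if prevalue > 0:
            return loc["coord"], prevalue    # move the sensor here
    return None
```

The returned coordinate is where the sensor would be moved. Returning `None` means every projected location was skipped.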

Cognitive component of action is planning: a form of imagination where projected patterns include those that represent the system itself. Feedback of such self-patterns eventually reaches the bottom of representational hierarchy: sensors and actuators, adjusting their sensitivity | intensity and coordinates. This adjustment is action. Such environmental interface is a part of any cognitive system, although actuators are optional.

4. Initial levels of search and corresponding orders of feedback (fine to skip)

This part recapitulates and expands on my core algorithm, which operates in one dimension: time only. Spatial and derived dimensions are covered in part 6. Even within 1D, the search is hierarchical in scope, containing any number of levels. New level is added when current top level terminates and outputs the pattern it formed.

Higher-level patterns are fed back to select future inputs on lower levels. Feedback is sent to all lower levels because span of each pattern approximates combined span of inputs within whole hierarchy below it.

So, deeper hierarchy forms higher orders of feedback, with increasing elevation and scope relative to its target: same-level prior input, higher-level match average, beyond-the-next-level match value average, etc.

These orders of feedback represent corresponding order of input compression: input, match between inputs, match between matches, etc. Such compression is produced by comparing inputs to feedback of all orders.

Comparisons form patterns, of the order that corresponds to relative span of compared feedback:

1: prior inputs are compared to the following ones on the same level, forming difference patterns dPs,

2: higher-level match is used to evaluate match between inputs, forming deviation patterns vPs,

3: higher-hierarchy value revaluates positive values of match, forming more selective shortcut patterns sPs

Feedback of 2nd order consists of input filters (if) defining value patterns, and coordinate filters (Cf) defining positional resolution and relative distance to future inputs.

Feedback of 3rd order is shortcut filters for beyond-the-next level. These filters, sent to a location defined by attached coordinate filters, form higher-order value patterns for deeper internal and distant-level comparison.

Higher-order patterns are more selective: difference is as likely to be positive as negative, while value is far more likely to be negative, because positive patterns add costs of re-evaluation for extended cross-comparison among their inputs. And so on, with selection and re-evaluation for each higher order of positive patterns. Negative patterns are still compared as a whole: their weak match is compensated by greater span.

All orders of patterns formed on the same level are redundant representations of the same inputs. Patterns contain representation of match between their inputs, which are compared by higher-order operations. Such operations increase overall match by combining results of lower-order comparisons across pattern’s variables:

0Le: AND of bit inputs to form digitized integers, containing multiple powers of two

1Le: SUB of integers to form patterns, over additional external dimensions = pattern length L

2Le: DIV of multiples (L) to form ratio patterns, over additional distances = negative pattern length LL

3Le: LOG of powers (LLs), etc. Starting from second level, comparison is selective per element of an input.

Such power increase also applies in comparison to higher-order feedback, with a lag of one level per order.

Power of coordinate filters also lags the power of input filters by one level:

1Le fb: binary sensor resolution: minimal and maximal detectable input value and coordinate increments

2Le fb: integer-valued average match and relative initial coordinate (skipping intermediate coordinates)

3Le fb: rational-valued coefficient per variable and multiple skipped coordinate range

4Le fb: real-valued coefficients and multiple coordinate-range skip

I am defining initial levels to find recurring increments in operations per level, which could then be applied to generate higher levels recursively, by incrementing syntax of output patterns and of feedback filters per level.

Operations per generic level (out of date)

Level 0 digitizes inputs, filtered by minimal detectable magnitude: least significant bit (i LSB). These bits are AND- compared, then their matches are AND- compared again, and so on, forming integer outputs. This is identical to iterative summation and bit-filtering by sequentially doubled i LSB.

Level 1 compares consecutive integers, forming ± difference patterns (dP s). dP s are then evaluated to cross-compare their individual differences, and so on, selectively increasing derivation of patterns.

Evaluation: dP M (summed match) - dP aM (dP M per average match between differences in level 2 inputs).
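Level 1 as described here can be sketched in Python: consecutive integers are compared, and the results are clustered by the sign of difference. This is an illustration under assumptions: match is min of the comparands, a zero difference counts as negative, and `ave_M` stands for the dP aM filter that would come from level 2. The names `form_dPs` and `eval_dP` are hypothetical.

```python
def form_dPs(inputs):
    """Cluster consecutive comparisons into ± difference patterns (dPs)."""
    dPs = []
    for prior, i in zip(inputs, inputs[1:]):
        d = i - prior
        m = min(i, prior)
        sign = d > 0
        if dPs and dPs[-1]["sign"] == sign:
            dP = dPs[-1]                      # same sign: keep accumulating
        else:
            dP = {"sign": sign, "L": 0, "D": 0, "M": 0, "ds": []}
            dPs.append(dP)                    # sign changed: start a new dP
        dP["L"] += 1
        dP["D"] += d
        dP["M"] += m
        dP["ds"].append(d)
    return dPs

def eval_dP(dP, ave_M):
    """Value of cross-comparing differences within the dP: summed match - filter."""
    return dP["M"] - ave_M
```

A positive `eval_dP` result would select the dP for cross-comparison among its individual differences, kept here in `ds`.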

Integers are limited by the number of digits (#b), and input span: least significant bit of coordinate (C LSB).

No 1st level feedback: fL cost is additive to dP cost, thus must be justified by the value of dP (and coincident difference in value of patterns filtered by adjusted i LSB), which is not known till dP is outputted to 2nd level.

Level 2 evaluates match within dP s | bf L (dP) s, forming ± value patterns: vP s | vP (bf L) s. +vP s are evaluated for cross-comparison of their dP s, then of resulting derivatives, then of inputted derivation levels. +vP (bf L) s are evaluated to cross-compare bf L s, then dP s, adjusted by the difference between their bit filters, and so on.

dP variables are compared by subtraction, then resulting matches are combined with dP M (match within dP) to evaluate these variables for cross-comparison by division, to normalize for the difference in their span.

// match filter is also normalized by span ratio before evaluation, same-power evaluation and comparison?

Feedback: input dP s | bf L (dP) are back-projected and resulting magnitude is evaluated to increment or decrement 0th level i LSB. Such increments terminate bit-filter span ( bf L (dP)), output it to 2nd level, and initiate a new i LSB span to filter future inputs. // bf L (dP) representation: bf , #dP, Σ dP, Q (dP).

Level 3 evaluates match in input vP s or f L (vP) s, forming ± evaluation-value patterns: eP s | eP (fL) s. Positive eP s are evaluated for cross-comparison of their vP s ( dP s ( derivatives ( derivation levels ( lower search-level sources: buffered or external locations (selected sources may directly specify strong 3rd level sub-patterns).

Feedback: input vP is back-projected, resulting match is compared to 2nd level filter, and the difference is evaluated vs. filter-update filter. If update value is positive, the difference is added to 2nd level filter, and filter span is terminated. Same for adjustment of previously covered bit filters and 2nd level filter-update filters?

This is similar to 2nd level operations, but input vP s are separated by skipped-input spans. These spans are a filter of coordinate (Cf, higher-order than f for 2nd level inputs), produced by pre-valuation of future inputs:

projected novel match = projected magnitude * average match per magnitude - projected-input match?

Pre-value is then evaluated vs. 3rd level evaluation filter + lower-level processing cost, and negative prevalue-value input span (= span of back-projecting input) is skipped: its inputs are not processed on lower levels.

// no prevaluation on 2nd level: the cost is higher than potential savings of only 1st level processing costs?

As distinct from input filters, Cf is defined individually rather than per filter span. This is because the cost of Cf update: span representation and interruption of processing on all lower levels, is minor compared to the value of represented contents? ±eP = ±Cf: individual skip evaluation, no flushing?

or interruption is predetermined, as with Cb, fixed C f within C f L: a span of sampling across fixed-L gaps?

alternating signed Cf s are averaged ±vP s?

Division: between L s, also inputs within minimal-depth continuous d-sign or m-order derivation hierarchy?

tentative generalizations and extrapolations

So, filter resolution is increased per level, first for i filters and then for C filters: level 0 has input bit filter,

level 1 adds coordinate bit filter, level 2 adds input integer filter, level 3 adds coordinate integer filter.

// coordinate filters (Cb, Cf) are not input-specific, patterns are formed by comparing their contents.

Level 4 adds input multiple filter: eP match and its derivatives, applied in parallel to corresponding variables of input pattern. Variable-values are multiplied and evaluated to form pattern-value, for inclusion into next-level ±pattern // if separately evaluated, input-variable value = deviation from average: sign-reversed match?

Level 5 adds coordinate multiple filter: a sequence of skipped-input spans by iteratively projected patterns, as described in imagination section of part 3. Alternatively, negative coordinate filters implement cross-level shortcuts, described in level 3 sub-part, which select for projected match-associated novelty.

Additional variables in positive patterns increase cost, which decreases positive vs. negative span proportion.

Increased difference in sign, syntax, span, etc., also reduces match between positive and negative patterns. So, comparison, evaluation, pre-valuation... on higher levels is primarily for same-sign patterns.

Consecutive different-sign patterns are compared due to their proximity, forming ratios of their span and other variables. These ratios are applied to project match across different-sign gap or contrast pattern:

projected match += (projected match - intervening negative match) * (negative value / positive value) / 2?

ΛV selection is incremented by induction: forward and feedback of actual inputs, or by deduction: algebraic compression of input syntax, to find computational shortcuts. Deduction is faster, but actual inputs also carry empirical information. Relative value of additive information vs. computational shortcuts is set by feedback.

following parts cover three initial levels in more detail, but mostly out of date:

Level 1: comparison to past inputs, forming difference patterns and match patterns

Inputs to the 1^st level of search are single integers, representing pixels of 1D scan line across an image, or equivalents from other modalities. Consecutive inputs are compared to form differences, difference patterns, matches, relative match patterns. This comparison may be extended, forming higher and distant derivatives:

resulting variables per input: *=2 derivatives (d,m) per comp, + conditional *=2 (xd, xi) per extended comp:

8 derivatives // ddd, mdd, dd_i, md_i, + 1-input-distant dxd, mxd, + 2-input-distant d_ii, m_ii,

/ \

4 der 4 der // 2 consecutive: dd, md, + 2 derivatives between 1-input-distant inputs: d_i and m_i,

/ \ / \

d,m d,m d,m // d, m: derivatives from default comparison between consecutive inputs,

/ \ / \ / \

i >> i >> i >> i // i: single-variable inputs.

This is explained / implemented in my draft Python code: line_patterns. That first level is for a generic 1D cognitive algorithm; its adaptation into an image and then video recognition algorithm will be natively 2D.
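A minimal sketch of the default comparison between consecutive inputs, not taken from line_patterns. The function names and the dict layout are hypothetical. Match is assumed to be the smaller of the two compared inputs, and a zero difference is assumed to count as a negative sign:

```python
def compare_consecutive(inputs):
    """Compare each input to the prior one: d = difference, m = match (smaller input)."""
    derivatives = []
    for prior, current in zip(inputs, inputs[1:]):
        d = current - prior
        m = min(current, prior)
        derivatives.append((current, d, m))
    return derivatives


def form_sign_patterns(derivatives):
    """Sum consecutive (i, d, m) tuples into dPs: spans with the same sign of d."""
    patterns = []
    for i, d, m in derivatives:
        sign = d > 0  # assumption: d == 0 is treated as negative
        if patterns and patterns[-1]["sign"] == sign:
            p = patterns[-1]
            p["L"] += 1
            p["I"] += i
            p["D"] += d
            p["M"] += m
        else:
            patterns.append({"sign": sign, "L": 1, "I": i, "D": d, "M": m})
    return patterns


pixels = [10, 12, 15, 14, 11, 11, 13]
dPs = form_sign_patterns(compare_consecutive(pixels))
for dP in dPs:
    print(dP)
```

Each dP here carries only L, I, D, M; extended comparison and value patterns would add further variables.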

That’s what I spend most of my time on, the rest of this intro is significantly out of date.

bit-filtering and digitization

1^st level inputs are filtered by the value of most and least significant bits: maximal and minimal detectable magnitude of inputs. Maximum is a magnitude that co-occurs with average 1^st level match, projected by outputted dP s. Least significant bit value is determined by maximal value and number of bits per variable.

This bit filter is initially adjusted by overflow in 1^st level inputs, or by a set number of consecutive overflows.

It’s also adjusted by feedback of higher-level patterns, if they project overflow or underflow of 1^st level inputs that exceeds the cost of adjustment. Underflow is the average number of 0 bits above the top 1 bit.

Original input resolution may be increased by projecting analog magnification, by impact or by distance.

Iterative bit-filtering is digitization: bit value is doubled per higher digit, and the excess of summed input is carried to the next digit. A digit can be larger than binary if the cost of such filtering requires larger carry.

Digitization is the most basic way of compressing inputs, followed by comparison between resulting integers.

hypothetical: comparable magnitude filter, to form minimal-magnitude patterns

This doesn’t apply to reflected brightness, only to types of input that do represent physical quantity of a source.

Initial magnitude justifies basic comparison, and summation of below-average inputs only compensates for their lower magnitude, not for the cost of conversion. Conversion involves higher-power comparison, which must be justified by higher order of match, to be discovered on higher levels.

iP min mag span conversion cost and comparison match would be on 2^nd level, but it’s not justified by 1^st level match, unlike D span conversion cost and comparison match, so it is effectively the 1^st level of comparison?

possible +iP span evaluation: double evaluation + span representation cost < additional lower-bits match?

The inputs may be normalized by subtracting feedback of average magnitude, forming ± deviation, then by dividing it by next+1 level feedback, forming a multiple of average absolute deviation, and so on. Additive value of input is a combination of all deviation orders, starting with 0^th or absolute magnitude.
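A minimal sketch of the first two normalization orders. In the text the averages are feedback from higher levels; here, as a simplifying assumption, they are computed from the same batch of inputs. The function name is hypothetical:

```python
def normalize(inputs):
    """Return average, ± deviations, average absolute deviation, and deviation multiples."""
    avg = sum(inputs) / len(inputs)
    deviations = [i - avg for i in inputs]  # 1st order: ± deviation from average
    avg_abs_dev = sum(abs(d) for d in deviations) / len(deviations)
    if avg_abs_dev == 0:
        multiples = [0.0] * len(inputs)  # all inputs are equal: nothing to scale
    else:
        # 2nd order: deviation as a multiple of average absolute deviation
        multiples = [d / avg_abs_dev for d in deviations]
    return avg, deviations, avg_abs_dev, multiples


avg, devs, aad, mults = normalize([2, 4, 6, 8])
print(avg, devs, aad, mults)
```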

Initial input evaluation if any filter: cost < gain: projected negative-value (comparison cost - positive value):

by minimal magnitude > ± relative magnitude patterns (iP s), and + iP s are evaluated or cross-compared?

or by average magnitude > ± deviations, then by co-average deviation: ultimate bit filter?

Summation *may* compensate for conversion if its span is greater than average per magnitude spectrum?!

Summation on higher levels also increases span order, but within-order conversion is the same, and between-order comparison is intra-pattern only. bf spans overlap vP span, -> filter conversion costs?

Level 2: additional evaluation of input patterns for feedback, forming filter patterns (out of date)

Inputs to 2^nd level of search are patterns derived on 1^st level. These inputs are evaluated for feedback to update 0^th level i LSB, terminating same-filter span.

Feedback increment of LSB is evaluated by deviation (∆) of magnitude, to avoid input overflow or underflow:

∆ += I/ L - LSB a; |∆| > ff? while (|∆| > LSB a){ LSB ±; |∆| -= LSB a; LSB a *2};

LSB a is average input (* V/ L?) per LSB value, and ff is average deviation per positive-value increment;

Σ (∆) before evaluation: no V patterns? #b++ and C LSB-- are more expensive, evaluated on 3^rd level?

They are also compared to previously inputted patterns, forming difference patterns dPs and value patterns vPs per input variable, then combined into dPP s and vPP s per input pattern.

L * sign of consecutive dP s is a known miss, and match of dP variables is correlated by common derivation.

Hence, projected match of other +dP and -dP variables = amk * (1 - L / dP). On the other hand, same-sign dP s are distant by L, reducing projected match by amk * L, which is equal to reduction by miss of L?

So, dP evaluation is for two comparisons of equal value: cross-sign, then cross- L same-sign (1 dP evaluation is blocked by feedback of discovered or defined alternating sign and co-variable match projection).

Both of last dP s will be compared to the next one, thus past match per dP (dP M) is summed for three dP s:

dP M ( Σ ( last 3 dP s L+M)) - a dP M (average of 4Le +vP dP M) -> v, vs;; evaluation / 3 dP s -> value, sign / 1 dP.

while (vs = ovs){ ovs = vs; V+=v; vL++; vP (L, I, M, D) += dP (L, I, M, D);; default vP - wide sum, select preserv.

vs > 0? comp (3 dP s){ DIV (L, I, M, D) -> N, ( n, f, m, d); vP (N, F, M, D) += n, f, m, d;; sum: der / variable, n / input?

vr = v+ N? SUB (nf) -> nf m; vd = vr+ nf m, vds = vd - a;; ratios are too small for DIV?

while (vds = ovds){ ovds = vds; Vd+=vd; vdL++; vdP() += Q (d | ddP);; default Q (d | ddP) sum., select. preserv.

vds > 0? comp (1^st x l^st d | ddP s of Q (d) s);; splicing Q (d) s of matching dP s, cont. only: no comp ( Σ Q (d | ddP)?

Σ vP ( Σ vd P eval: primary for -P, redundant to individual dP s ( d s for +P, cost *2, same for +P' I and -P' M,D?

no Σ V | Vd evaluation of cont. comp per variable or division: cost + vL = comp cost? Σ V per fb: no vL, #comp;

- L, I, M, D: same value per mag, power / compression, but I | M, D redund = mag, +vP: I - 2a, - vP: M, D - 2a?

- no variable eval: cost (sub + vL + filter) > comp cost, but match value must be adjusted for redundancy?

- normalization for comparison: min (I, M, D) * rL, SUB (I, M, D)? Σ L (pat) vs C: more general but interrupted?

variable-length DIV: while (i > a){ while (i> m){ SUB (i, m) -> d; n++; i=d;}; m/=2; t=m; SUB (d, t); f+= d;}?

additive compression per d vs. m*d: > length cost?

tdP ( tM, tD, dP(), ddP Σ ( dMΣ (Q (dM)), dDΣ (Q (dD)), ddLΣ (Q (ddL)), Q (ddP))); // last d and D are within dP()?

Input filter is a higher-level average, while filter update is accumulated over multiple higher-level spans until it exceeds filter-update filter. So, filter update is 2^nd order feedback relative to filter, as is filter relative to match.

But the same filter update is 3^rd order of feedback when used to evaluate input value for inclusion into pattern defined by a previous filter: update span is two orders higher than value span.

Higher-level comparison between patterns formed by different filters is mediated, vs. immediate continuation of current-level comparison across filter update (mediated cont.: splicing between different-filter patterns by vertical specification of match, although it includes lateral cross-comparison of skip-distant specifications).

However, filter update feedback is periodic, so it doesn’t form continuous cross-filter comparison patterns xPs.

adjustment of forward evaluation by optional feedback of projected input

More precisely, additive value or novel magnitude of an input is its deviation from higher-level average. Deviation = input - expectation: (higher-level summed input - summed difference /2) * rL (L / hL).
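One possible reading of the deviation formula, as a sketch. It assumes that hI and hD are the higher-level summed input and summed difference, hL is the higher-level span, and L is the span of the current input. The function names are hypothetical:

```python
def expected_input(hI, hD, L, hL):
    """Expectation = (higher-level summed input - summed difference / 2) * rL, where rL = L / hL."""
    rL = L / hL
    return (hI - hD / 2) * rL


def deviation(I, hI, hD, L, hL):
    """Deviation (novel magnitude) = input - expectation."""
    return I - expected_input(hI, hD, L, hL)


# summed input of 110 over span 5, vs. higher-level sum 400 and difference 40 over span 20
print(deviation(I=110, hI=400, hD=40, L=5, hL=20))
```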

Inputs are compared to last input to form difference, and to past average to form deviation or novelty.

But last input is more predictive of the next one than a more distant average, thus the latter is compared on higher level than the former. So, input variable is compared sequentially and summed within resulting patterns. On the next level, the sum is compared vertically: to next-next-level average of the same variable.

Resulting vertical match defines novel value for higher-level sequential comparison:

novel value = past match - (vertical match * higher-level match rate) - average novel match:

nv = L+M - (m (I, (hI * rL)) * hM / hL) - hnM * rL; more precise than initial value: v = L+M - hM * rL;

Novelty evaluation is done if higher-level match > cost of feedback and operations, separately for I and D P s:

I, M ( D, M feedback, vertical SUB (I, nM ( D, ndM));

Impact on ambient sensor is separate from novelty and is predicted by representational-value patterns?

- next-input prediction: seq match + vert match * relative rate, but predictive selection is per level, not input.

- higher-order expectation is relative match per variable: pMd = D * rM, M/D, or D * rMd: Md/D,

- if rM | rMd are derived by intra-pattern comparison, when average M | Md > average per division?

one-input search extension within cross-compared patterns

Match decreases with distance, so initial comparison is between consecutive inputs. Resulting match is evaluated, forming ±vP s. Positive P s are then evaluated for expanded internal search: cross-comparison among 1-input-distant inputs within a pattern (on same level, higher-level search is between new patterns).

This cycle repeats to evaluate cross-comparison among 2-input-distant inputs, 3-input-distant inputs, etc., when summed current-distance match exceeds the average per evaluation.
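A sketch of this cycle over the inputs of a single positive pattern. It assumes that match is the smaller of two compared inputs, and that "average per evaluation" can be approximated by an average match per comparison times the number of comparisons at the current distance. The function and parameter names are hypothetical:

```python
def extend_range(inputs, avg_match_per_comparison):
    """Increase comparison distance while summed match at the current distance is above average.

    Returns {number of skipped inputs: summed match at that distance}.
    """
    summed_match = {}
    skipped = 0
    while skipped + 1 < len(inputs):
        step = skipped + 1
        matches = [min(a, b) for a, b in zip(inputs, inputs[step:])]
        summed = sum(matches)
        summed_match[skipped] = summed
        if summed <= avg_match_per_comparison * len(matches):
            break  # not above average: stop extending the range
        skipped += 1
    return summed_match


print(extend_range([10, 11, 10, 12, 11], avg_match_per_comparison=10))
```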

So, patterns of longer cross-comparison range are nested within selected positive patterns of shorter range. This is similar to 1^st level ddP s being nested within dP s.

Same input is re-evaluated for comparison at increased distance because match will decay: projected match = last match * match rate (mr), * (higher-level mr / current-level mr) * (higher-level distance / next distance)?

Or = input * average match rate for that specific distance, including projected match within negative patterns.

It is re-evaluated also because projected match is adjusted by past match: mr *= past mr / past projected mr?

Also, multiple comparisons per input form overlapping and redundant patterns (similar to fuzzy clusters), and must be evaluated vs. filter * number of prior comparisons, reducing value of projected match.

Instead of directly comparing incrementally distant input pairs, we can calculate their difference by adding intermediate differences. This would obviate multiple access to the same inputs during cross-comparison.

These differences are also subtracted (compared), forming higher derivatives and matches:

ddd, x1dd, x2d ( ddd: 3^rd derivative, x1dd: d of 2-input-distant d s, x2d: d of 2-input-distant inputs)

/ \

dd, x1d dd, x1d ( dd: 2^nd derivative, x1d = d+d = difference between 1-input-distant inputs)

/ \ / \

d d d ( d: difference between consecutive inputs)

/ \ / \ / \

i i i i ( i: initial inputs)

As always, match is a smaller input, cached or restored, selected by the sign of a difference.
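A small sketch of computing distant differences by summing intermediate ones, so that each input is accessed only once. Here "n-input-distant" is taken to mean that n inputs lie between the compared pair, so x1d sums two consecutive d's. The function names are hypothetical:

```python
def consecutive_differences(values):
    """d: difference between each value and the prior one."""
    return [b - a for a, b in zip(values, values[1:])]


def distant_differences(diffs, skipped):
    """Difference between inputs with `skipped` inputs in between = sum of skipped + 1 consecutive d's."""
    span = skipped + 1
    return [sum(diffs[k:k + span]) for k in range(len(diffs) - span + 1)]


inputs = [3, 7, 4, 9]
d = consecutive_differences(inputs)      # differences between consecutive inputs
x1d = distant_differences(d, skipped=1)  # 1-input-distant: d + d
x2d = distant_differences(d, skipped=2)  # 2-input-distant: d + d + d
dd = consecutive_differences(d)          # 2nd derivative: difference between d's
print(d, x1d, x2d, dd)
```

The summed results equal direct subtraction of the distant inputs, for example `inputs[2] - inputs[0]` for the first x1d.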

Comparison of both types is between all same-type variable pairs from different inputs.

Total match includes match of all its derivation orders, which will overlap for proximate inputs.

Incremental cost of cross-comparison is the same for all derivation orders. If projected match is equal to projected miss, then additive value for different orders of the same inputs is also the same: reduction in projected magnitude of differences will be equal to reduction in projected match between distant inputs?

multi-input search extension, evaluation of selection per input: tentative

On the next level, average match from expansion is compared to that from shorter-distance comparison, and resulting difference is decay of average match with distance. Again, this decay drives re-evaluation per expansion: selection of inputs with projected decayed match above average per comparison cost.

Projected match is also adjusted by prior match (if local decay?) and redundancy (symmetrical if no decay?)

Slower decay will reduce value of selection per expansion because fewer positive inputs will turn negative:

Value of selection = Σ |comp cost of neg-value inputs| - selection cost (average saved cost or relative delay?)

This value is summed between higher-level inputs, into average value of selection per increment of distance. Increments with negative value of selection should be compared without re-evaluation, adding to minimal number of comparisons per selection, which is evaluated for feedback as a comparison-depth filter:

Σ (selection value per increment) -> average selection value;; for negative patterns of each depth, | >1 only?

depth adjustment value = average selection value; while (|average selection value| > selection cost){

depth adjustment ±±; depth adjustment value -= selection value per increment (depth-specific?); };

depth adjustment > minimal per feedback? >> lower-level depth filter;; additive depth = adjustment value?

- match filter is summed and evaluated per current comparison depth?

- selected positive relative matches don’t reduce the benefit of pruning-out negative ones.

- skip if negative selection value: selected positive matches < selection cost: average value or relative delay?

Each input forms a queue of matches and misses relative to templates within comparison depth filter. These derivatives, both discrete and summed, overlap for inputs within each other’s search span. But representations of discrete derivatives can be reused, redundancy is only necessary for parallel comparison.

Assuming that environment is not random, similarity between inputs declines with spatio-temporal distance. To maintain proximity, an n-input search is FIFO: input is compared to all templates up to maximal distance, then added to the queue as a new template, while the oldest template is outputted into pattern-wide queue.
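A sketch of such FIFO search, using `collections.deque` as the template queue. The tuple layout and names are hypothetical, and match is assumed to be the smaller of input and template:

```python
from collections import deque


def fifo_search(inputs, max_distance):
    """Compare each input to all templates within max_distance, then queue it as a new template."""
    templates = deque()  # (index, value) of recent inputs, oldest first
    comparisons = []     # (input index, template index, difference, match)
    outputted = []       # oldest templates, passed on to the pattern-wide queue
    for idx, value in enumerate(inputs):
        for t_idx, t_value in templates:
            comparisons.append((idx, t_idx, value - t_value, min(value, t_value)))
        templates.append((idx, value))
        if len(templates) > max_distance:
            outputted.append(templates.popleft())
    return comparisons, outputted


comparisons, outputted = fifo_search([5, 8, 6, 9], max_distance=2)
for c in comparisons:
    print(c)
```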

value-proportional combination of patterns: tentative

Summation of +dP and -dP is weighted by their value: L (summed d-sign match) + M (summed i match).

Such relative probability of +dP vs. - dP is indicated by corresponding ratios: rL = +L/-L, and rM = +M/-M.

(Ls and Ms are compared by division: comparison power should be higher for more predictive variables).

But weighting complementation incurs costs, which must be justified by value of ratio. So, division should be of variable length, continued while the ratio is above average. This is shown below for Ls, also applies to Ms:

dL = +L - -L, mL = min (+L, -L); nL =0; fL=0; efL=1; // nL: L multiple, fL: L fraction, efL: extended fraction.

while (dL > adL){ dL = |dL|; // all Ls are positive; dL is evaluated for long division by adL: average dL.

while (dL > 0){ dL -= mL; nL++;} dL -= mL/2; dL >0? fL+= efL; efL/=2;} // ratio: rL= nL + fL.
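A runnable interpretation of this long division, not a transcription of the pseudocode above. It assumes that division is skipped unless |dL| exceeds the average adL, that the integer part nL counts how many times mL fits into |dL|, and that the fraction fL is built from successive halves of mL. Instead of continuing "while the ratio is above average", this sketch uses a fixed, hypothetical limit `max_fraction_bits`:

```python
def variable_length_ratio(pos_L, neg_L, avg_dL, max_fraction_bits=4):
    """Approximate |dL| / mL by repeated subtraction; returns (nL, fL)."""
    mL = min(pos_L, neg_L)
    dL = abs(pos_L - neg_L)  # all Ls are positive
    nL, fL = 0, 0.0
    if mL <= 0 or dL <= avg_dL:
        return nL, fL  # difference is not above average: no division
    while dL >= mL:  # integer multiple of mL
        dL -= mL
        nL += 1
    efL, divisor = 0.5, mL / 2  # binary fraction: halves, quarters, ...
    for _ in range(max_fraction_bits):
        if dL >= divisor:
            dL -= divisor
            fL += efL
        efL /= 2
        divisor /= 2
    return nL, fL


nL, fL = variable_length_ratio(pos_L=23, neg_L=6, avg_dL=4)
print(nL, fL)
```

Here nL + fL approximates |dL| / mL to within the last fraction bit; the ratio of larger to smaller L is then 1 + nL + fL.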

Ms’ long-division evaluation is weighted by rL: projected rM value = dM * nL (reduced-resolution rL) - adM.

Ms are then combined: cM = +M + -M * rL; // rL is relative probability of -M across iterated cL.

Ms are not projected (M+= D * rcL * rM D (MD/cD) /2): precision of higher-level rM D is below that of rM?

Prior ratios are combination rates: rL is probability of -M, and combined rL and rM (cr) is probability of -D.

If rM < arM, cr = rL, else: cr = (+L + +M) / (-L + -M) // cr = √(rL * rM) would lose L vs. M weighting.

cr predicts match of weighted cD between cdPs, where negative-dP variable is multiplied by above-average match ratio before combination: cD = +D + -D * cr. // after un-weighted comparison between Ds?
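A sketch of the combination rate and the weighted combination of Ds. It assumes that `neg_L` and `neg_M` are the nonzero magnitudes of the negative-dP variables, and that the negative D keeps its sign. The function names are hypothetical:

```python
def combination_rate(pos_L, neg_L, pos_M, neg_M, avg_rM):
    """cr = rL if rM is below average, else a ratio of summed L and M that keeps L vs. M weighting."""
    rL = pos_L / neg_L
    rM = pos_M / neg_M
    if rM < avg_rM:
        return rL
    return (pos_L + pos_M) / (neg_L + neg_M)


def combine_D(pos_D, neg_D, cr):
    """cD = +D + -D * cr: the negative-dP D is weighted by the combination rate."""
    return pos_D + neg_D * cr


cr = combination_rate(pos_L=8, neg_L=4, pos_M=40, neg_M=12, avg_rM=2)
cD = combine_D(pos_D=10, neg_D=-4, cr=cr)
print(cr, cD)
```

With these numbers rL = 2 and rM ≈ 3.33, so the geometric mean √(rL * rM) ≈ 2.58 would differ from the weighted cr.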

Averages: arL, arM, acr, are feedback of ratios that co-occur with above-average match of span-normalized variables, vs. input variables. Another feedback is averages that evaluate long division: adL, adM, adD.

Both are feedback of positive C pattern, which represents these variables, inputted & evaluated on 3^rd level.

; or 4^th level: value of dPs * ratio is compared to value of dPs, & the difference is multiplied by cL / hLe cL?

Comparison of opposite-sign Ds forms negative match = smaller |D|, and positive difference dD = +D+ |-D|.

dD magnitude predicts its match, not further combination. Single comparison is cheaper than its evaluation.

Comparison is by division if larger |D| co-occurs with hLe nD of above-average predictive value (division is sign-neutral & reductive). But average nD value is below the cost of evaluation, except if positive feedback?

So, default operations for L, M, D of complementary dPs are comparison by long division and combination.

D combination: +D -D*cr, vs. - cD * cr: +D vs. -D weighting is lost, meaningless if cD=0?

Combination by division is predictive if the ratio is matching on higher level (hLe) & acr is fed back as filter?

Resulting variables: cL, rL, cM, rM, cr, cD, dD, form top level of cdP: complemented dP.

Level 3: prevaluation of projected filter patterns, forming updated-input patterns

(out of date)

3^rd level inputs are ± V patterns, combined into complemented V patterns. Positive V patterns include derivatives of 1^st level match, which project match within future inputs (D patterns only represent and project derivatives of magnitude). Such projected-inputs-match is pre-valuated, negative prevalue-span inputs are summed or skipped (reloaded), and positive prevalue-span inputs are evaluated or even directly compared.

Initial upward (Λ) prevaluation by E filter selects for evaluation of V patterns, within resulting ± E patterns. Resulting prevalue is also projected downward (V), to select future input spans for evaluation, vs. summation or skipping. The span is of projecting V pattern, same as of lower hierarchy. Prevaluation is then iterated over multiple projected-input spans, as long as last |prevalue| remains above average for the cost of prevaluation.

Additional interference of iterated negative projection is stronger than positive projection of lower levels, and should flush them out of pipeline. This flushing need not be final, spans of negative projected value may be stored in buffers, to delay the loss. Buffers are implemented in slower and cheaper media (tape vs. RAM) and accessed if associated patterns match on a higher level, thus project above-average match among their inputs.

Iterative back-projection is evaluated starting from 3^rd level: to be projectable the input must represent derivatives of value, which are formed starting from 2^nd level. Compare this to 2^nd level evaluation:

Λ for input, V for V filter, iterated within V pattern. Similar sub-iteration in E pattern?

Evaluation value = projected-inputs-match - E filter: average input match that co-occurs with average higher-level match per evaluation (thus accounting for evaluation costs + selected comparison costs). Compare this to V filter that selects for 2^nd level comparison: average input match that co-occurs with average higher-level match per comparison (thus accounting for costs of default cross-comparison only).

E filter feedback starts from 4^th level of search, because its inputs represent pre-valuated lower-level inputs.

4^th level also pre-pre-valuates vs. prevaluation filter, forming pre-prevalue that determines prevaluation vs. summation of next input span. And so on: the order of evaluation increases with the level of search.

Higher levels are increasingly selective in their inputs, because they additionally select by higher orders derived on these levels: magnitude ) match and difference of magnitude ) match and difference of match, etc.

Feedback of prevaluation is ± pre-filter: binary evaluation-value sign that determines evaluating vs. skipping initial inputs within projected span, and flushing those already pipelined within lower levels.

Negative feedback may be iterated, forming a skip span.

Parallel lower hierarchies & skip spans may be assigned to different external sources or their internal buffers.

Filter update feedback is level-sequential, but pre-filter feedback is sent to all lower levels at once.

Pre-filter is defined per input, and then sequentially translated into pre-filters of higher derivation levels:

prior value += prior match -> value sign: next-level pre-filter. If there are multiple pre-filters of different evaluation orders from corresponding levels, they AND & define infra-patterns: sign ( input ( derivatives.

filter update evaluation and feedback

Negative evaluation-value blocks input evaluation (thus comparison) and filter updating on all lower levels. Not-evaluated input spans (gaps) are also outputted, which will increase coordinate range per contents of both higher-level inputs and lower-level feedback. Gaps represent negative projected-match value, which must be combined with positive value of subsequent span to evaluate comparison across the gap on a higher level. This is similar to evaluation of combined positive + negative relative match spans, explained above.

Blocking locations with expected inputs will result in preference for exploration & discovery of new patterns, vs. confirmation of the old ones. It is the opposite of upward selection for stronger patterns, but sign reversal in selection criteria is a basic feature of any feedback, starting with average match & derivatives.

Positive evaluation-value input spans are evaluated by lower-level filter, & this filter is evaluated for update:

combined update = (output update + output filter update / (same-filter span (fL) / output span)) /2.

both updates: -= last feedback, equal-weighted because higher-level distance is compensated by range: fL?

update value = combined update - update filter: average update per average higher-level additive match.

also differential costs of feedback transfer across locations (vs. delay) + representation + filter conversion?

If update value is negative: fL += new inputs, subdivided by their positive or negative predictive value spans.

If update value is positive: lower-level filter += combined update, new fL (with new filter representation) is initialized on a current level, while current-level part of old fL is outputted and evaluated as next-level input.
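A sketch of this update cycle under simplifying assumptions: all quantities are plain numbers, fL is tracked as a length only, and a zero update value is treated as negative. The function names and the returned tuple are hypothetical:

```python
def combined_update(output_update, output_filter_update, fL, output_span):
    """(output update + output filter update / (same-filter span / output span)) / 2."""
    return (output_update + output_filter_update / (fL / output_span)) / 2


def apply_feedback(filter_value, fL, new_inputs, combined, update_filter):
    """Return (filter, current fL, outputted old fL or None)."""
    update_value = combined - update_filter
    if update_value > 0:
        # filter is updated, a new fL is initialized, the old fL is outputted to the next level
        return filter_value + combined, 0, fL
    # otherwise the same-filter span keeps accumulating new inputs
    return filter_value, fL + new_inputs, None


cu = combined_update(output_update=6, output_filter_update=8, fL=40, output_span=10)
print(cu)
print(apply_feedback(filter_value=20, fL=40, new_inputs=10, combined=cu, update_filter=3))
print(apply_feedback(filter_value=20, fL=40, new_inputs=10, combined=cu, update_filter=5))
```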

In turn, the filter gets updates from higher-level outputs, included in higher-higher-level positive patterns by that level’s filter. Hence, each filter represents combined span-normalized feedback from all higher levels, of exponentially growing span and reduced update frequency.

Deeper hierarchy should block greater proportion of inputs. At the same time, increasing number of levels contribute to projected additive match, which may justify deeper search within selected spans.

Higher-level outputs are more distant from current input due to elevation delay, but their projection range is also greater. So, outputs of all levels have the same relative distance (distance/range) to a next input, and are equal-weighted in combined update. But if input span is skipped, relative distance of skip-initiating pattern to next input span will increase, and its predictive value will decrease. Hence, that pattern should be flushed or at least combined with a higher-level one:

combined V prevalue = higher-level V prevalue + ((current-level V prevalue - higher-level V prevalue) / ((current-level span / distance) / (higher-level span / distance))) / 2. // the difference between current-level and higher-level prevalues is reduced by the ratio of their relative distances.

To speed up selection, filter updates can be sent to all lower levels in parallel. Multiple direct filter updates are span-normalized and compared at a target level, and the differences are summed in combined update. This combination is equal-weighted because all levels have the same span-per-distance to next input, where the distance is the delay of feedback during elevation. // this happens automatically in level-sequential feedback?

combined update = filter update + distance-normalized difference between output & filter updates:

((output update - filter update) / (output relative distance / higher-output relative distance)) /2.

This combination method is accurate for post-skipped input spans, as well as next input span.
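A sketch of this distance-normalized combination, with relative distance taken as distance / range, as defined above for outputs of all levels. The function names and sample numbers are hypothetical:

```python
def relative_distance(distance, span):
    """Distance to the next input, relative to the projection range (span) of a pattern."""
    return distance / span


def combined_filter_update(filter_update, output_update, output_rel_dist, higher_rel_dist):
    """Filter update + difference to output update, reduced by the ratio of relative distances, / 2."""
    difference = output_update - filter_update
    return filter_update + (difference / (output_rel_dist / higher_rel_dist)) / 2


# next input span: output and higher output have equal relative distance
same = combined_filter_update(4, 8, relative_distance(10, 10), relative_distance(100, 100))
# post-skipped span: relative distance of the output has doubled
skipped = combined_filter_update(4, 8, relative_distance(20, 10), relative_distance(100, 100))
print(same, skipped)
```

After a skip the output update contributes less, because its relative distance to the next input span has grown.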

- filter can also be replaced by output + higher-level filter /2, but value of such feedback is not known.

- possible fixed-rate sampling, to save on feedback evaluation if slow decay, ~ deep feedforward search?

- selection can be by patterns, derivation orders, sub-patterns within an order, or individual variables?

- match across distance also projects across distance: additive match = relative match * skipped distance?

cross-level shortcuts: higher-level sub-filters and symbols

After individual input comparison, if match of a current scale (length-of-a-length…) projects positive relative match of input lower-scale / higher-derivation level, then the latter is also cross-compared between the inputs.

Lower scale levels of a pattern represent old lower levels of a search hierarchy (current or buffered inputs).

So, feedback of lower scale levels goes down to corresponding search levels, forming shortcuts to preserve detail for higher levels. Feedback is generally negative: expectations are redundant to inputs. But specifying feedback may be positive: lower-level details are novel to a pattern, & projected to match with it in the future.

Higher-span comparison power is increased if lower-span comparison match is below average:

variable subtraction ) span division ) super-span logarithm?

Shortcuts to individual higher-level inputs form a queue of sub-filters on a lower level, possibly represented by a queue-wide pre-filter. So, a level has one filter per parallel higher level, and sub-filter for each specified sub-pattern. Sub-filters of incrementally distant inputs are redundant to all previous ones.

Corresponding input value = match - sub-filter value * rate of match to sub-filter * redundancy?

Shortcut to a whole level won’t speed up search: higher-level search delay > lower-hierarchy search delay.

Resolution and parameter range may also increase through interaction of co-located counter-projections?

Symbols, for communication among systems that have common high-level concepts but no direct interface, are “co-author identification” shortcuts: their recognition and interpretation are performed on different levels.

Higher-level patterns have increasing number of derivation levels, that represent corresponding lower search levels, and project across multiple higher search levels, each evaluated separately?

Match across discontinuity may be due to additional dimensions or internal gaps within patterns.

Search depth may also be increased by cross-comparison between levels of scale within a pattern: match across multiple scale levels also projects over multiple higher- and lower- scale levels? Such comparison between variable types within a pattern would be of a higher order:

5. Comparison between variable types within a pattern (tentative)

To reiterate, elevation increases syntactic complexity of patterns: the number of different variable types within them. Syntax is identification of these types by their position (syntactic coordinate) within a pattern. This is analogous to recognizing parts of speech by their position within a sentence.

Syntax “synchronizes” same-type variables for comparison | aggregation between input patterns. Access is hierarchical, starting from sign->value levels within each variable of difference and relative match: sign is compared first, forming + and - segments, which are then evaluated for comparison of their values.

Syntactic expansion is pruned by selective comparison vs. aggregation of individual variable types within input patterns, over each coordinate type or resolution. As with templates, minimal aggregation span is resolution of individual inputs, & maximal span is determined by average magnitude (thus match) of new derivatives on a higher level. Hence, a basic comparison cycle generates queues of interlaced individual & aggregate derivatives at each template variable, and conditional higher derivatives on each of the former.

Sufficiently complex syntax or predictive variables will justify comparing across “syntactic” coordinates within a pattern, analogous to comparison across external coordinates. In fact, that’s what higher-power comparisons do. For example, division is an iterative comparison between difference & match: within a pattern (external coordinate), but across derivation (syntactic coordinate).

Also cross-variable is comparison between orders of match in a pattern: magnitude, match, match-of-match... This starts from comparison between match & magnitude: match rate (mr) = match / magnitude. Match rate can then be used to project match from magnitude: match = magnitude * output mr * filter mr.

In this manner, mr of each match order adjusts intra-order-derived sequentially higher-order match:

match *= lower inter-order mr. Additive match is then projected from adjusted matches & their derivatives.
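A small numeric sketch of inter-order projection. It assumes that "output mr" is the match rate of the projecting pattern and "filter mr" is an average rate fed back from a higher level. The function names and numbers are hypothetical:

```python
def match_rate(match, magnitude):
    """mr = match / magnitude: comparison between two orders of match within a pattern."""
    return match / magnitude if magnitude else 0.0


def project_match(magnitude, output_mr, filter_mr):
    """Projected match = magnitude * output mr * filter mr."""
    return magnitude * output_mr * filter_mr


mr = match_rate(match=30, magnitude=120)
projected = project_match(magnitude=200, output_mr=mr, filter_mr=0.5)
print(mr, projected)
```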

This inter-order projection continues up to the top order of match within a pattern, which is the ultimate selection criterion because that’s what’s left matching on the top level of search.

Inter-order vectors are ΛV symmetrical, but ΛV derivatives from lower order of match are also projected for higher-order match, at the same rate as the match itself?

Also possible is comparison across syntactic gaps: ΛY comparison -> difference, filter feedback VY hierarchy. For example, comparison between dimensions of a multi-D pattern will form possibly recurrent proportions.

Internal comparisons can further compress a pattern, but at the cost of adding a higher-order syntax, which means that they must be increasingly selective. This selection will increase “discontinuity” over syntactic coordinates: operations necessary to convert the variables before comparison. Eventually, such operators will become large enough to merit direct comparisons among them. This will produce algebraic equations, where the match (compression) is a reduction in the number of operations needed to produce a result.

The first such short-cut would be a version of the Pythagorean theorem, discovered during search in 2D (part 6) to compute cosines. If we compare 2D-adjacent 1D Ls by division, over 1D distance and derivatives (an angle), partly matching ratio between the ratio of 1D Ls and a 2nd derivative of 1D distance will be a cosine.

Cosines are necessary to normalize all derivatives and lengths (Ls) to a value they have when orthogonal to 1D scan lines (more in part 6).

Such normalization for a POV angle is similar to dimensionality reduction in Machine Learning, but is much more efficient because it is secondary to selective dimensionality expansion. It’s not really “reduction”: dimensionality is prioritized rather than reduced. That is, the dimension of pattern’s main axis is maximized, and dimensions sequentially orthogonal to higher axes are correspondingly minimized. The process of discovering these axes is so basic that it might be hard-wired in animals.

6. Cartesian dimensions and sensory modalities (out of date)

This is a recapitulation and expansion on incremental dimensionality introduced in part 2.

The term “dimension” here is reserved for a parameter that defines sequence and distance among inputs, initially Cartesian dimensions + Time. This is different from the terminology of combinatorial search, where a dimension is any parameter of an input, and their external order and distance don’t matter. My term for that is “variable”: external dimensions become types of a variable only after being encoded within input patterns.

For those with an ANN background, I want to stress that a level of search in my approach is a 1D queue of inputs, not a layer of nodes. The inputs to a node are combined regardless of difference and distance between them (the distance is the difference between laminar coordinates of source “neurons”).

These derivatives are essential because value of any prediction = precision of what * precision of where. Coordinates and co-derived differences are not represented in ANNs, so they can't be used to calculate Euclidean vectors. Without such vectors, prediction and selection of where must remain extremely crude.

Also, layers in an ANN are orthogonal to the direction of input flow, so the hierarchy is at least 2D. The direction of inputs to my queues is in the same dimension as the queue itself, which means that my core algorithm is 1D. A hierarchy of 1D queues is the most incremental way to expand search: we can add or extend only one coordinate at a time. This allows the algorithm to select inputs that are predictive enough to justify the cost of representing an additional coordinate and corresponding derivatives. Again, such incremental syntax expansion is my core principle, because it enables selective (thus scalable) search.

A common objection is that images are “naturally” 2D, and our space-time is 4D. Of course, these empirical facts are practically universal in our environment. But a core cognitive algorithm must be able to discover and forget any empirical specifics on its own. Additional dimensions can be discovered as some general periodicity in the input flow: distances between matching inputs are compared, match between these distances indicates a period of the lower dimension, and recurring periods form a higher-dimension coordinate.
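A toy sketch of discovering a dimension as periodicity. Everything specific here is an assumption for illustration: the input is a raster-flattened image, "matching" means values within a tolerance, only the nearest match per input is recorded, and "match between distances" is reduced to counting how often the same distance recurs. The name candidate_period is hypothetical.

```python
from collections import Counter

def candidate_period(stream, max_distance, tolerance=0):
    """Return the most recurrent distance between matching inputs, or None.

    For each input, the distance to its nearest matching input ahead is
    recorded; a distance that recurs most often is a candidate period,
    i.e. a candidate length of the lower dimension (such as line width).
    """
    counts = Counter()
    for i, value in enumerate(stream):
        for distance in range(1, max_distance + 1):
            j = i + distance
            if j >= len(stream):
                break
            if abs(stream[j] - value) <= tolerance:
                counts[distance] += 1
                break  # keep only the nearest match for this input
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Four identical rows of five pixels, flattened into one 1D input flow.
row = [3, 7, 1, 9, 4]
flow = row * 4
period = candidate_period(flow, max_distance=10)  # 5: the row length
```

In real images adjacent pixels also match, so a usable version would have to compare the distances selectively rather than take the nearest match; this sketch only shows the principle.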

But as a practical shortcut to the expensive dimension-discovery process, initial levels should be designed to specialize in sequentially higher spatial dimensions: 1D scan lines, 2D frames, 3D set of confocal “eyes”, 4D temporal sequence. These levels discover contiguous (positive match) patterns of increasing dimensionality:

1D line segments, 2D blobs, 3D objects, 4D processes. Higher 4D cycles form hierarchy of multi-dimensional orders of scale, integrated over time or distributed sensors. These higher cycles compare discontinuous patterns. Corresponding dimensions may not be aligned across cycles of different scale order.

Explicit coordinates and incremental dimensionality are unconventional. But the key for scalable search is input selection, which must be guided by cost-benefit analysis. Benefit is projected match of patterns, and cost is representational complexity per pattern. Any increase in complexity must be justified by corresponding increase in discovered and projected match of selected patterns. Initial inputs have no known match, thus must have minimal complexity: single-variable “what”, such as brightness of a grey-scale pixel, and single-variable “where”: pixel’s coordinate in one Cartesian dimension.

A single coordinate means that comparison between pixels must be contained within a 1D (horizontal) scan line, otherwise their coordinates are not comparable and can’t be used to select locations for extended search. Selection for contiguous or proximate search across scan lines requires a second (vertical) coordinate. That increases costs, thus must be selective according to projected match, discovered by past comparisons within a 1D scan line. So, comparison across scan lines must be done on the 2nd level of search. And so on.
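A minimal sketch of this 1st-level comparison and of the selection that gates the 2nd level. It is not the project's line_POC code: the function names, the use of min() as match, and the fixed cost threshold are all illustrative assumptions.

```python
def compare_scan_line(pixels):
    """Cross-compare consecutive pixels within one horizontal scan line.

    Returns a list of (x, difference, match) tuples, where x is the
    coordinate of the later pixel, difference = current - prior, and
    match = min(current, prior) (an assumed definition of match).
    """
    results = []
    for x in range(1, len(pixels)):
        prior, current = pixels[x - 1], pixels[x]
        results.append((x, current - prior, min(current, prior)))
    return results

def select_for_vertical_search(pixels, cost):
    """Extend search across scan lines only if the match found within
    the line exceeds the (assumed constant) cost of a second coordinate."""
    total_match = sum(m for _, _, m in compare_scan_line(pixels))
    return total_match > cost

line = [10, 12, 9]
derivatives = compare_scan_line(line)             # [(1, 2, 10), (2, -3, 9)]
extend = select_for_vertical_search(line, cost=15)  # True: 19 > 15
```

Here the vertical coordinate is never represented unless the line has already shown enough match, which is the point of adding one coordinate at a time.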

Dimensions are added in the order of decreasing rate of change. This means spatial dimensions are scanned first: their rate of change can be sped up by moving sensors. Comparison over a purely temporal sequence is delayed until accumulated change / variation justifies search for additional patterns. Temporal sequence is the original dimension, but it is mapped on spatial dimensions until the spatial continuum is exhausted. Dimensionality represented by patterns is increasing on higher levels, but each level is a 1D queue of patterns.

Also independently discoverable are derived coordinates: any variable with cumulative match that correlates with combined cumulative match of all other variables in a pattern. Such correlation makes a variable useful for sequencing patterns before cross-comparison.

It is discovered by summing matches for same-type variables between input patterns, then cross-comparing summed matches between all variables of a pattern. Variable with the highest resulting match of match (mm) is a candidate coordinate. That mm is then compared to mm of current coordinate. If the difference is greater than cost of reordering future inputs, sequencing feedback is sent to lower levels or sensors.
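The procedure above can be sketched as follows, with several assumptions of mine: patterns are dicts of named numeric variables, match between two values is their minimum, and each variable's summed match is compared to the combined summed match of all other variables (one reading of the definition of a derived coordinate given before this procedure). All names are hypothetical.

```python
def summed_matches(patterns):
    """Sum matches of same-type variables between consecutive input patterns."""
    totals = {}
    for prior, current in zip(patterns, patterns[1:]):
        for name in prior.keys() & current.keys():
            totals[name] = totals.get(name, 0) + min(prior[name], current[name])
    return totals

def match_of_match(totals):
    """mm per variable: match between its summed match and the combined
    summed match of all other variables (min is the assumed match)."""
    grand_total = sum(totals.values())
    return {name: min(m, grand_total - m) for name, m in totals.items()}

def select_coordinate(patterns, current, reorder_cost):
    """Return the variable to sequence future inputs by.

    The candidate replaces the current coordinate only if its mm exceeds
    the current coordinate's mm by more than the cost of reordering.
    """
    mm = match_of_match(summed_matches(patterns))
    candidate = max(mm, key=mm.get)
    if mm[candidate] - mm.get(current, 0) > reorder_cost:
        return candidate  # sequencing feedback would be sent to lower levels
    return current

patterns = [
    {"x": 1, "L": 5, "I": 20},
    {"x": 2, "L": 6, "I": 18},
    {"x": 3, "L": 4, "I": 25},
]
new_coordinate = select_coordinate(patterns, current="x", reorder_cost=5)
```

With these toy values the summed matches are x: 3, L: 9, I: 36, and mm is x: 3, L: 9, I: 12, so I replaces x when the reordering cost is below 9 and x is kept otherwise.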

Another type of empirically distinct variables is different sensory modalities: colors, sound and pitch, and so on, including artificial senses. Each modality is processed separately, up to a level where match between patterns of different modalities but same scope exceeds match between unimodal patterns across increased distance. Subsequent search will form multi-modal patterns within a common S-T frame of reference.

As with external dimensions, difference between modalities can be pre-defined or discovered. If the latter, inputs of different modalities are initially mixed, then segregated by feedback. Also as with dimensions, my core algorithm only assumes single-modal inputs; pre-defining multiple modalities would be an add-on.

7. Notes on working mindset and awards for contributions

My terminology is as general as the subject itself. It’s a major confounder: people crave context, but generalization is decontextualization. And a cognitive algorithm is a meta-generalization: the only thing in common for everything we learn. This introduction is very compressed, partly because much of the work is in progress. But I think it also reflects and cultivates the ruthlessly reductionist mindset required for such a subject.

My math is very simple, because algorithmic complexity must be incremental. Advanced math can accelerate learning on higher levels of generalization, but is too expensive for initial levels. And a minimal general learning algorithm must be able to discover computational shortcuts (AKA math) on its own, just like we do. Complex math is definitely not innate in humans on any level: cavemen didn’t do calculus.

This theory may seem too speculative, but any degree of generalization must be correspondingly lossy. Which is contrary to precision-oriented culture of math and computer science. Hence, current Machine Learning is mostly experimental, and the progress on algorithmic side is glacial. A handful of people aspire to work on AGI, but they either lack or neglect functional definition of intelligence, their theories are only vague inspiration.

I think working on this level demands greater delay of experimental verification than is acceptable in any established field. Except for philosophy, which has nothing else real to study. But established philosophers have always been dysfunctional fluffers, not surprisingly as their only paying customers are college freshmen.

Our main challenge in formalizing GI is a species-wide ADHD. We didn’t evolve for sustained focus on this level of generalization: that would cause extinction long before any tangible results. Which is no longer a risk: GI is the most important problem conceivable, and we have plenty of computing power for anything better than brute-force algorithms. But our psychology lags a light-year behind technology: we still hobble on mental crutches of irrelevant authority and peer support, flawed analogies and needless experimentation.

Awards for contributions

I offer prizes up to a total of $500K for debugging, optimizing and extending this algorithm: github.

Contributions must fit into incremental-complexity hierarchy outlined here. Unless you find a flaw in my reasoning, which would be even more valuable. I can also pay monthly, but there must be a track record.

Winners will have an option to convert the awards into an interest in all commercial applications of a final algorithm, at the rate of $10K per 1% share. This option is informal and likely irrelevant, mine is not a commercial enterprise. Money can’t be primary motivation here, but it saves time.

Awards so far:

2010: Todor Arnaudov, $600 for suggestion to buffer old inputs after search.

2011: Todor, $400 consolation prize for understanding some ideas that were not clearly explained here.

2014: Dan He, $600 for pushing me to be more specific and to compare my algorithm with others.

2016: Todor Arnaudov, $500 for multiple suggestions on implementing the algorithm, as well as for the effort.

Kieran Greer, $375 for an attempt to implement my level 1 pseudo code in C#

2017:

Alexander Loschilov, $2800 for help in converting my level 1 pseudo code into Python, consulting on PyCharm and SciPy, and for insistence on 2D clustering, February-April.

Todor Arnaudov: $2000 for help in optimizing level_1_2D, June-July.

Kapil Kashyap: $2000 for stimulation and effort, help with Python and level_1_2D, September-October

2018:

Todor Arnaudov, $1000 mostly for effort and stimulation, January-February

Andrei Demchenko, $1800 for conventional refactoring in line_POC_introductory.py, interface improvement and a few improvements in the code, April - May.

Todor Arnaudov, $2000 for help in debugging frame_dblobs.py, September - October.

Khanh Nguyen, $2700, for getting to work line_POC.

2019:

Stephan Verbeeck, $2000 for getting me to return to using minimally-coarse gradient and his perspective on colors and line tracing, January-June

Todor Arnaudov, $1600, frequent participant, March-June

Kok Wei Chee, $900, for diagrams of line_POC and frame_blobs, December

Khanh Nguyen, $10100, lead debugger and co-designer, January-December

2020:

Mayukh Sarkar, $600 for frame_blobs performance analysis and porting form_P to C++, January

Maria Parshakova, $1600, team developer, March-May

Khanh Nguyen, $8100, team developer

Kok Wei Chee, $14200, team developer

2021:

Many thanks to Chris Sun for his efforts to find collaborators!

Kok Wei Chee, $22000, lead developer, January-December

Khanh Nguyen, $5000, team developer, April-October

Alex Pitertsev, $1000: mostly visualization via dfs, July-August

Kelvin Spacey, $1840: port to dataframes, various, May-July

Yura Guruel, $1000: various, May-July

Aqib Mumtaz and Ayesha Ali, $1400: audio interfacing for 1D alg, April-May

2022:

Kok Wei Chee, $25300: lead developer, January-December

Alex Pitertsev, $2000 for porting line_comp to Julia, June

Intelligence

5/3/24

Cognitive Algorithm: Comparison-first alternative to Deep Learning