Intelligence is a general cognitive ability, ultimately the ability to predict.
That includes the cognitive component of action: planning is technically
self-prediction. Any prediction is an interactive projection of known patterns,
hence the primary cognitive process is pattern discovery. This perspective is well
established: pattern recognition is the core of any IQ test. But there is no general
and constructive definition of either pattern or recognition (quantified similarity).
Below, I define similarity for the simplest inputs, then describe a hierarchically
recursive algorithm to search for similarity (patterns) among incrementally complex
inputs (lower-level patterns).
For excellent popular introductions to the cognition-as-prediction thesis, see
“On Intelligence” by Jeff Hawkins and “How to Create a Mind” by Ray Kurzweil. But on
a technical level, they and most current researchers implement artificial neural
networks, which operate in a very coarse statistical fashion. Capsule Networks,
recently introduced by Geoffrey Hinton et al., are more selective but still rely on
Hebbian learning, which is coarse due to immediate input summation. My approach is
outlined below, then compared to ANN, BNN, and CapsNet.
We need help with design and implementation of this algorithm, in Python or Julia.
This is an open project: CogAlg, with current code explained in the WIKI. But I do pay
prizes for contributions, or monthly if there is a track record; see the last part
here. Contributions should be justified in terms of strictly incremental search for
similarity, forming hierarchical patterns: parameterized nearest-neighbour clusters
or capsules.
This content is published
under Creative Commons Attribution 4.0 International License.
Outline of my approach
The proposed algorithm is a first-principles alternative to deep learning,
non-neuromorphic and sub-statistical. It performs hierarchical search, cross-comparing
inputs over selectively incremental distance and composition, followed by
parameterized clustering. First-level comparands are sensory inputs at the limit of
resolution: adjacent pixels of video or equivalents in other modalities. Symbolic data
is encoded by a prior cognitive process. Such encoding is never explicit, which makes
it extremely hard to decode in a strictly bottom-up system.
Basic comparison is an inverse arithmetic operation between two single-variable
comparands, with incremental power: Boolean, subtraction, division, etc. Each order of
comparison forms miss or loss: XOR, difference, ratio, etc., and match or similarity,
which can be defined directly or as inverse deviation of miss. Direct match is
compression of represented magnitude by replacing the larger input with the
corresponding order of miss between the inputs: Boolean AND, the min input in
comparison by subtraction, the integer part of the ratio in comparison by division,
etc.; more in part 1.
These direct similarity measures work in modalities where input intensity represents
some conserved physical property of the source, anti-correlating with its variation.
But that’s not the case with albedo in visual input: it doesn’t really depend on
physical density or invariance, and dark objects can be just as stable as bright ones.
So, initial match in vision should be defined indirectly, as inverse deviation of
variation in intensity. The 1D version of variation is difference; multi-D comparison
will combine differences into Euclidean distance and gradient.
In 2D image processing, basic comparison is done by edge detectors, which form
gradient and its angle. They are used as the first layer in the proposed model, same
as in CNN. The model then segments the image into blobs (2D patterns) by the sign of
gradient deviation, which is also fairly conventional. But these blobs are
parameterized with summed pixel-level intensity, derivatives (initially gradient and
angle), and dimensions. Each parameter has independent predictive value, so they
should be preserved for next-level comparison between blobs. I don’t know of any model
that performs such parameterization, so the algorithm seems to be novel from this
point on.
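The first layer described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the CogAlg code: the function names, the forward-difference kernel, and the `ave_g` filter value are all hypothetical choices made for this example.

```python
from math import atan2, hypot


def pixel_derivatives(image):
    """Return per-pixel (intensity, dy, dx, gradient, angle) for a 2D list of intensities.

    Assumption: forward differences to the right and lower neighbours,
    so the last row and the last column are dropped.
    """
    rows = []
    for y in range(len(image) - 1):
        row = []
        for x in range(len(image[y]) - 1):
            p = image[y][x]
            dx = image[y][x + 1] - p   # horizontal difference
            dy = image[y + 1][x] - p   # vertical difference
            row.append((p, dy, dx, hypot(dy, dx), atan2(dy, dx)))
        rows.append(row)
    return rows


def gradient_deviation_signs(derivs, ave_g):
    """Sign of inverse gradient deviation: True where gradient is below ave_g."""
    return [[ave_g - g > 0 for (_, _, _, g, _) in row] for row in derivs]


image = [[10, 10, 50],
         [10, 10, 50],
         [10, 10, 50]]
derivs = pixel_derivatives(image)
signs = gradient_deviation_signs(derivs, ave_g=20)  # [[True, False], [True, False]]
```

Contiguous pixels with the same sign would then be grouped into a blob, with intensity, dy, dx, gradient, and angle summed per blob rather than discarded.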
Higher-level inputs are lower-level patterns; their parameters are selectively
cross-compared between patterns, forming match and miss per parameter. Thus, the
number of parameters per pattern may multiply on each level. Match and miss per
pattern are summed from matches and misses per parameter, respectively, and their
deviations define compositionally higher patterns. Cross-comparison is incremental in
distance, derivation, and composition. This implies a unique set of operations per
level of search, hence the singular in “cognitive algorithm”.
It’s a form of hierarchical connectivity clustering: patterns are contiguous because
they are defined by results of cross-comp, which should have a fixed range to encode
pose parameters: coordinates, dimensions, orientation. This is essential because
value of prediction = precision of what * precision of where. All params derived by
cross-comp are predictive: cross-comp computes predictive value. They should be
compared between patterns, to discover longer-range spatio-temporal and then
conceptual patterns. But this process is very complex and slow; it won’t pay off in
simple test problems, which is probably why such schemes are not actively explored.
The resulting hierarchy is a dynamic pipeline: terminated patterns are output for
comparison on the next level, hence a new level must be formed for a pattern
terminated by the current top level. This continues as long as the system receives
novel inputs. As distinct from autoencoders, there is no need for decoding: comparison
and clustering are done on each level, with the threshold adjusted by feedback of
summed deviation per parameter.
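A toy sketch of such a pipeline, under heavy simplifying assumptions: each level here compares a single scalar per input (real patterns carry many parameters), the filter `ave` is fixed and shared by all levels (no feedback), and the names `Level` and `run_pipeline` are hypothetical. It only illustrates how a new level is added when the current top level terminates a pattern.

```python
class Level:
    """One level of search: compares consecutive inputs, clusters them by match sign."""

    def __init__(self, ave):
        self.ave = ave        # filter: average match, fixed in this sketch
        self.prior = None     # previous input on this level
        self.pattern = None   # current pattern: [sign, count, summed_match]

    def feed(self, value):
        """Compare value to the prior input; return a terminated pattern or None."""
        terminated = None
        if self.prior is not None:
            match = min(value, self.prior)      # direct match by subtraction
            sign = match - self.ave > 0         # sign of match deviation
            if self.pattern is not None and self.pattern[0] != sign:
                terminated = self.pattern       # sign change terminates the pattern
                self.pattern = None
            if self.pattern is None:
                self.pattern = [sign, 0, 0]
            self.pattern[1] += 1
            self.pattern[2] += match
        self.prior = value
        return terminated


def run_pipeline(stream, ave):
    """Feed a stream into level 0; terminated patterns become inputs to higher levels."""
    levels = [Level(ave)]
    for value in stream:
        depth = 0
        while value is not None:
            if depth == len(levels):            # top level terminated a pattern:
                levels.append(Level(ave))       # form a new level for it
            out = levels[depth].feed(value)
            value = out[2] if out is not None else None  # pass summed match upward
            depth += 1
    return levels


levels = run_pipeline([5, 6, 7, 1, 1, 8, 9], ave=3)
print(len(levels))  # 2: the second level was created by the first terminated pattern
```

Passing only the summed match upward is a simplification; in the scheme described here, the whole parameterized pattern would be forwarded.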
Patterns are also hierarchical: each level of search adds a level of composition and
a sub-level of differentiation to those of the input pattern. To avoid combinatorial
explosion, next-level search is selective per input pattern level.
Again, autonomous cognition must start with analog inputs, such as video or audio.
Symbolic data, meaning any sort of language, is encoded by some prior cognitive
process. To discover meaningful patterns in a set of symbols, they must be decoded
before being cross-compared. And the difficulty of decoding is exponential with the
level of encoding, thus hierarchical learning starting with raw sensory input is by
far the easiest to implement (part 0).
Many readers see a gap between this outline and the algorithm, or a lack of the
latter. It’s true that the algorithm is far from complete, but the above-explained
principles are stable and we are translating them into code:
https://github.com/boriskz/CogAlg. The final algorithm will be a meta-level of search:
1st-level operations plus recursive increment in input complexity, which generate the
next-level alg. We are in a space-time continuum, thus each level will be a 3D or 4D
cycle. I avoid complex math because it’s not selectively incremental.
Comparison to artificial and biological neural networks
ANN learns via some version of Hebbian “fire together, wire together” coincidence
reinforcement. Normally, a “neuron’s” inputs are weighted at “synapses”, then summed
and thresholded into input to the next hidden layer.
Output of the last hidden layer is compared to a top-layer template, forming an error.
That error backpropagates, converting initially random weights into meaningful values
via Stochastic Gradient Descent. I have several basic problems with this whole
paradigm, listed below along with my alternatives:
- Hebbian learning is driven by vertical input-to-output comparison, secondary to
input summation. This is seductively simple per backprop cycle, but it takes tens of
thousands of cycles to form meaningful representations. That’s because summation is a
loss of resolution, which makes learning exponentially more coarse per layer. Lateral
parameterized cross-comparison is far more complex per layer, but its output is
immediately informative. Feedback here only adjusts layer-wide hyperparameters:
thresholds for the last step of pattern segmentation.
- Both initial weights and the sampling that feeds SGD are randomized, which is a
zero-knowledge option. But we do have prior knowledge for any raw data in real
space-time: proximity predicts similarity, thus search should proceed with
incremental comparison range and input composition. Methods like RBM and GAN are also
driven by random variation. There is nothing random in my model; that’s antithetical
to intelligence. Rather, variation here is pattern projection by co-derived miss:
projected input = input - (d_input * d_coordinate) / 2.
- SGD minimizes error (top-layer miss), which is quantitatively different from
maximizing match: compression. And that error is w.r.t. some specific template, while
my match is summed over all past input / experience. All inputs represent environment,
thus have positive value. But then they are packed (compressed) into patterns, which
have different range and precision, thus different representational value per
relatively fixed record cost.
- Representation is fully distributed, which mimics the brain. But the brain has no
alternative: no substrate for local memory or differentiated program in neurons. We
have it now; parallelization in computers is a simple speed vs. efficiency trade-off,
useful only for complex semantically isolated nodes. Such nodes are patterns,
encapsulating a set of co-derived “what” and “where” parameters. This is similar to a
neural ensemble, but parameters that are compared together should be localized in
memory, not distributed across a network.
Inspiration by the brain kept ANN research going for decades before they became
useful. Their “neurons” are mere stick figures, but that’s not a problem: most of a
neuron’s complexity is due to constraints of biology. The problem is that the core
mechanism in ANN, weighted summation, may also be a no-longer-needed compensation for
such constraints. Neural memory is dedicated connections, which makes representation
and cross-comparison of individual inputs very expensive, so they are summed. But we
now have dirt-cheap random-access memory.
Other biological constraints are very slow neurons and the imperative of fast
reaction for survival in the wild. Both favor fast though crude summation (vs. direct
parameterized clustering), at the cost of glacial training. Reaction speed became less
important: modern society is quite secure, while continuous learning is far more
important because of accelerating progress. Summation also reduces noise, which is
very important for neurons that often fire at random, to initiate and maintain latent
connections. But that’s irrelevant for electronic circuits.
Biological intelligence is a distant side effect of maximizing reproduction. The
brain evolved to guide the body; even our abstract thinking is always framed in terms
of action. Hence, Hebbian learning is driven by feedback of such action: output of
weighted input sum. Neurons evolved as instinctive stimulus-to-response converters;
they only do pattern recognition as an instrumental upshot, if integrated into large
networks. In my model, primary learning is comparison-defined connectivity clustering
within a level, which outputs immediately meaningful patterns.
Comparison to Capsule Networks
The nearest experimentally successful method is the recently introduced “capsules”.
Some similarities to CogAlg:
- capsules also output multivariate vectors, “encapsulating” several parameters,
similar to my patterns,
- these parameters also include pose: coordinates and dimensions, compared to compute
corresponding miss,
- these misses / distances are compared to find affine transformations or
equivariance: my match of misses,
- capsules also send direct feedback to the lower layer: dynamic routing, vs.
trans-hidden-layer backprop in ANN.
My main problems with CapsNet and alternative treatment:
- Object is defined as a recurring configuration of different parts. But such
recurrence can’t be assumed; it should be derived by cross-comparing relative position
among parts of matching objects. This can only be done after their positions are
cross-compared, which is after their objects are cross-compared: two levels above the
level that forms initial objects. So, objects formed by positional equivariance would
be secondary, though they may displace initial segmentation objects as a primary
representation. Stacked Capsule Autoencoders also have exclusive segmentation on the
first layer, but proximity doesn’t matter on their higher layers.
- Routing by agreement is basically recursive input clustering, by match of input
vector to the output vector. The output (centroid) represents inputs at all locations,
so its comparison to inputs is effectively mixed-distance. Thus, clustering in CapsNet
is fuzzy and discontinuous, forming redundant representations. Routing by agreement
reduces that redundancy, but not consistently so: it doesn’t specifically account for
it.
My default clustering is exclusive segmentation: each element (child) belongs to only
one cluster (parent). Fuzzy clustering is selective to inputs valued above the cost of
adjusting for overlap in representation, which increases with the range of
cross-comparison. Conditional range increase is done on all levels of composition.
- Instantiation parameters are application-specific; CapsNet has no general mechanism
to derive them. My general mechanism is cross-comparison of input capsule parameters,
which forms higher-order parameters. First level forms pixel-level gradient, similar
to edge detection in CNN. But then it forms proximity-constrained clusters, defined by
gradient and parameterized by summed pixel intensity, dy, dx, gradient, angle. This
cross-comparison followed by clustering is done on all levels, with incremental number
of parameters per input.
- Number of layers is fixed, while I think it should be incremental with experience.
My hierarchy is a dynamic pipeline: patterns are displaced from a level by criterion
sign change and sent to an existing or new higher level. So, both the hierarchy of
patterns per system and the sub-hierarchy of derivatives per pattern expand with
experience. The derivatives are summed within a pattern, then evaluated for extending
intra-pattern search and feedback.
- Output vector of higher capsules combines parameters of all lower layers into
Euclidean distance. That is my default too, but they should also be kept separate, for
potential cross-comp among layer-wide representations.
Overall, CapsNet is a variation of ANN, with input summation first and dynamic
routing second. So, it’s a type of Hebbian learning, with most of the problems that I
listed in the previous section.
Elaboration below, some out of date:
0. Cognition vs. evolution, analog vs. symbolic initial input
1. Comparison: quantifying match and miss between two variables
2. Forward search and patterns, implementation for image recognition in video
3. Feedback of filters, attentional input selection, imagination, motor action
4. Initial levels of search, corresponding orders of feedback, and resulting patterns:
level 1: comparison to past inputs, forming difference and relative match patterns
level 2: additional evaluation of resulting patterns for feedback, forming filter patterns
level 3: additional evaluation of projected filter patterns, forming updated-input patterns
5. Comparison between variable types within a pattern
6. Cartesian dimensions and sensory modalities
7. Notes on clustering, ANNs, and probabilistic inference
8. Notes on working mindset, “hiring” and a prize for contributions
0. Cognition vs. evolution, analog vs. symbolic initial input
Some say intelligence can be recognized but not defined. I think that’s absurd: we
recognize some implicit definition. Others define intelligence as a problem-solving
ability, but the only general problem is efficient search for solutions. Efficiency is
a function of selection among inputs, vs. brute-force all-to-all search. This
selection is by predicted value of the inputs, and prediction is interactive
projection of their patterns. Some agree that intelligence is all about pattern
discovery, but define pattern as a crude statistical coincidence.
Of course, the only mechanism known to produce human-level intelligence is even
cruder, and that shows in the haphazard construction of our brains. Algorithmically
simple, biological evolution alters heritable traits at random and selects those with
above-average reproductive fitness. But this process requires almost inconceivable
computing power because selection is extremely coarse: on the level of the whole
genome rather than individual traits, and also because intelligence is only one of
many factors in reproductive fitness.
Random variation in evolutionary algorithms, generative RBMs, and so on, is
antithetical to intelligence. Intelligent variation must be driven by feedback within
the cognitive hierarchy: higher levels are presumably “smarter” than lower ones. That
is, higher-level inputs represent operations that formed them, and are evaluated to
alter future lower-level operations. Basic operations are comparison and summation
among inputs, defined by their range and resolution, analogous to reproduction in
genetic algorithms.
Range of comparison per conserved-resolution input should increase if projected match
(cognitive fitness function) exceeds average match per comparison. In any non-random
environment, average match declines with the distance between comparands. Thus, search
over increasing distance requires selection of above-average comparands. Any delay,
coarseness, and inaccuracy of such selection is multiplied at each search expansion,
soon resulting in combinatorial explosion of unproductive (low additive match)
comparisons.
Hence, my model is strictly incremental: search starts with minimal-complexity inputs
and expands with minimal increments in their range and complexity (syntax). At each
level, there is only one best increment, projected to discover the greatest additive
match. To my knowledge, no other AGI approach follows this principle.
I guess people who aim for human-level intelligence are impatient with small
increments and simple sensory data. Yet, this is the most theoretical problem ever,
demanding the longest delay in gratification.
symbolic obsession and its discontents
Current Machine Learning and related theories (AIT, Bayesian inference, etc.) are
largely statistical also because they were developed primarily for symbolic data. Such
data, pre-compressed and pre-selected by humans, is far more valuable than the sensory
inputs it was ultimately derived from. But due to this selection and compression,
proximate symbols are not likely to match, and partial match between them is very hard
to quantify. Hence, symbolic data is a misleading initial target for developing a
conceptually consistent algorithm.
Use of symbolic data as initial inputs in AGI projects betrays a profound
misunderstanding of cognition. Even children, predisposed to learn language, only
become fluent after years of directly observing things their parents talk about. Words
are mere labels for concepts, the most important of which are spatio-temporal
patterns, generalized from multi-modal sensory experience. Top-down reconstruction of
such patterns solely from correlations among their labels should be exponentially more
difficult than their bottom-up construction.
All our knowledge is ultimately derived from senses, but lower levels of human
perception are unconscious. Only generalized concepts make it into our consciousness,
AKA declarative memory, where we assign them symbols (words) to facilitate
communication. This brain-specific constraint creates a heavy symbolic vs.
sub-symbolic bias, especially strong in artificial intelligentsia. This is putting the
cart before the horse: most words are meaningless unless coupled with implicit
representations of sensory patterns.
To be incrementally selective, a cognitive algorithm must exploit proximity first,
which is only productive for continuous and loss-tolerant raw sensory data. Symbolic
data is already compressed: consecutive characters and words in text won’t match. It’s
also encoded with distant cross-references that are hardly ever explicit outside of a
brain. Text looks quite random unless you know the code: operations that generalized
pixels into patterns (objects, processes, concepts). That means any algorithm designed
specifically for text will not be consistently incremental in the range of search,
which will impair its scalability.
In Machine Learning, input is a string, frame, or video sequence of a defined length,
with artificial separation between training and inference. In my approach, learning is
continuous and interactive. Initial inputs are streamed pixels of maximal resolution,
and higher-level inputs are multivariate patterns formed by comparing lower-level
inputs. Spatio-temporal range of inputs, and selective search across them, is extended
indefinitely. This expansion is directed by higher-level feedback, just as it is in
human learning.
Everything ever written is related to my subject, but nothing is close enough: no
other method is meant to be fully consistent. Hence a dire scarcity of references
here. My approach is presented bottom-up (parts 1 - 6), thus can be understood without
references. But that requires a clean context, hopefully cleaned out by the reader’s
own introspective generalization. Other (insufficiently) related approaches are
addressed above and in part 7. I also have a more advanced work-in-progress, but will
need meaningful feedback to elaborate.
1. Comparison: quantifying match and miss between two variables
First of all, we must quantify predictive value. Algorithmic information theory
defines it as compressibility of representation. This is perfectly fine, but
compression is currently computed only for sequences of inputs.
To enable far more incremental selection, thus scalable search, I quantify match
between individual inputs. Partial match is a finer dimension of analysis, vs. binary
same | different instances. This is similar to the way probabilistic inference
improved on classical logic, by quantifying probability vs. binary true | false
values.
Partial match between two variables is a complement of miss, in the corresponding
power of comparison:
- comparison by subtraction increases match to a smaller comparand and reduces miss to
a difference,
- comparison by division increases match to min * integer part of ratio and reduces
miss to a fractional part
(direct match works for tactile input, but reflected light in vision requires an
inverse definition of initial match)
In other words, match is a compression of the larger comparand’s magnitude by
replacing it with miss. This means that match = smaller input: a common subset of both
inputs, = sum of AND between their uncompressed (unary code) representations. The
ultimate criterion is recorded magnitude, rather than the bits of memory it occupies,
because the former represents the physical impact that we want to predict. The volume
of memory used to record that magnitude depends on prior compression, which is not an
objective parameter.
Some
may object that match includes the case when both inputs equal zero, but then
match should also be zero. The purpose here is prediction, which represents
conservation of some physical property of observed objects. Ultimately, we’re
predicting potential impact on observer, represented by input. Zero input means
zero impact, which has no conservable property (inertia), thus no intrinsic
predictive value.
Given incremental complexity, initial inputs should have binary resolution and
implicit coordinate (which is a macro-parameter, so its resolution lags that of an
input). Compression of bit inputs by AND is well known as digitization: substitution
of two lower 1 bits with one higher 1 bit. Resolution of coordinate (input summation
span) is adjusted by feedback to form integers that are large enough to produce
above-average match.
Next-order compression is comparison between consecutive integers, with binary
(before | after) coordinate.
Additive match is achieved by comparison of a higher power than that which produced
comparands: AND will not further compress integers digitized by AND. Rather, initial
comparison between integers is by subtraction: the resulting difference is miss, and
the smaller input is absolute match. If inputs are 4 and 7, then miss is 3, and their
match or common subset is 4. Difference is smaller than XOR (non-zero complementary of
AND) because XOR may include opposite-sign (opposite-direction) bit pairs 0, 1 and
1, 0, which are cancelled out by subtraction.
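The Boolean and subtraction orders of comparison can be checked with a short sketch. The function names are hypothetical, chosen only for this illustration, and inputs are assumed to be non-negative integers.

```python
def comp_bool(a, b):
    """Boolean comparison: match = AND, miss = XOR."""
    return a & b, a ^ b


def comp_sub(a, b):
    """Comparison by subtraction: match = smaller input, miss = signed difference."""
    return min(a, b), b - a


def unary(n, width):
    """Uncompressed (unary code) representation: n ones padded with zeros."""
    return [1] * n + [0] * (width - n)


def unary_match(a, b):
    """Sum of AND between unary representations; equals the smaller input."""
    width = max(a, b)
    return sum(x & y for x, y in zip(unary(a, width), unary(b, width)))


print(comp_sub(4, 7))     # (4, 3): match 4, miss 3, as in the example above
print(unary_match(4, 7))  # 4: common subset of 1111 and 1111111
print(comp_bool(5, 6))    # (4, 3): 101 vs 110, XOR counts both opposite-sign pairs
print(comp_sub(5, 6))     # (5, 1): subtraction cancels them, so miss is smaller
```

The pair 5 and 6 shows why subtraction is the higher-power comparison for integers: XOR leaves a miss of 3, while the difference is only 1.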
Comparison by division forms ratio, which is a compressed difference. This
compression is explicit in long division: match is accumulated over iterative
subtraction of the smaller comparand from the remaining difference. In other words,
this is also a comparison by subtraction, but between different orders of derivation.
Resulting match is smaller comparand * integer part of ratio, and miss is the final
remainder or fractional part of ratio.
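A minimal sketch of comparison by division, assuming positive integer inputs; the function names are hypothetical. The second function spells out the long-division view, starting from the larger input rather than from the difference, which gives the same result.

```python
def comp_div(a, b):
    """Comparison by division: match = smaller input * integer part of ratio,
    miss = remainder (the fractional part of the ratio, scaled by the smaller input)."""
    small, large = sorted((a, b))
    if small == 0:
        raise ValueError("comparison by division needs a non-zero smaller input")
    multiple, remainder = divmod(large, small)
    return small * multiple, remainder


def comp_div_by_subtraction(a, b):
    """Same result via iterative subtraction, accumulating match each step."""
    small, large = sorted((a, b))
    if small == 0:
        raise ValueError("comparison by division needs a non-zero smaller input")
    match, remaining = 0, large
    while remaining >= small:
        remaining -= small   # subtract the smaller comparand
        match += small       # and accumulate it as match
    return match, remaining


print(comp_div(4, 7))                  # (4, 3)
print(comp_div(3, 10))                 # (9, 1)
print(comp_div_by_subtraction(3, 10))  # (9, 1)
```

For 3 and 10, subtraction gives match 3 and miss 7, while division raises match to 9 and reduces miss to 1.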
Ratio
can be further compressed by converting it to radix or logarithm, and so on.
By reducing miss, higher-power comparison increases complementary match (match =
larger input - miss):
to be compressed: larger input | XOR | difference: combined current-order match & miss
additive match: AND | opposite-sign XOR | multiple: of a smaller input within a difference
remaining miss: XOR | difference | fraction: complementary to multiple within a ratio
But the
costs of operations and record of incidental sign, fraction, irrational
fraction, etc. may grow even faster. To justify these costs, power of
comparison should only increase for inputs sufficiently compressed by prior
order of comparison: AND for bit inputs, SUB for integer inputs, DIV for pattern
inputs, etc.
Selection criterion is a deviation of match: current match - past match co-occurring
with average higher-level projected match. Such past match is a filter that determines
inclusion of the input into a positive or negative (above- or below-average)
predictive value pattern. Relative match is accumulated until it exceeds the cost of
updating the lower-level filter, which terminates the filter pattern (same-filter
input span) and initializes a new one.
There
is a general agreement that compression is a measure of similarity, but no one
else seems to use it from the bottom up. Compression also depends on resolution
of coordinate (default input summation span), and on resolution of input
magnitude. Projected match can be kept above system’s average by adjusting
corresponding resolution filters: most significant bits and least significant
bits of both coordinate and magnitude.
Separate filters are formed for each type of compared variable. Initial input, such
as reflected light, is likely to be incidental and very indirectly representative of
physical properties in observed objects. Then its filter will increase, reducing the
number of positive patterns, potentially down to 0. But differences or ratios between
inputs represent variation, which is anti-correlated with match. They have negative
predictive value, inverted to get incrementally closer to intrinsically predictive
properties, such as mass or momentum.
So, absent significant correlation between input magnitude and represented physical
object magnitude, the proxy to match in initial comparison is inverse deviation of
difference: average_difference - difference. Though less accurate (defined via average
diff vs. individual input), this match is also a complement of diff: the complement of
difference within average_difference (= max of the differences), similar to minimum:
the complement of difference within max input.
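A minimal sketch of this inverse match over a 1D sequence. Two assumptions are made here that the text does not spell out: the magnitude of the difference (`abs`) is what gets compared to the average, and `ave_diff` is supplied as a fixed number rather than updated by feedback. The function name is hypothetical.

```python
def comp_inverse(inputs, ave_diff):
    """Compare consecutive inputs; return (difference, inverse match) per pair.

    Inverse match = ave_diff - |difference|: positive where local variation
    is below average, negative where it is above average.
    """
    results = []
    for prior, current in zip(inputs, inputs[1:]):
        d = current - prior       # miss: signed difference
        m = ave_diff - abs(d)     # match: inverse deviation of difference
        results.append((d, m))
    return results


print(comp_inverse([10, 12, 30, 31], ave_diff=5))
# [(2, 3), (18, -13), (1, 4)]
```

The large jump from 12 to 30 yields a negative match regardless of how bright either pixel is, which is the point of defining match indirectly for reflected light.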
2. Forward search and patterns, implementation for image recognition in video
Pattern is a contiguous span of inputs that form above-average matches, similar to a
conventional cluster.
As explained above, matches and misses (derivatives) are produced by comparing
consecutive inputs. These derivatives are summed within a pattern and then compared
between patterns on the next level of search, adding new derivatives to a higher
pattern. Patterns are defined contiguously on each level, but positive and negative
patterns are always interlaced, thus next-level same-sign comparison is discontinuous.
Negative patterns represent contrast or discontinuity between positive patterns,
which is a one- or higher-dimensional equivalent of difference between
zero-dimensional pixels. As with differences, projection of a negative pattern
competes with projection of an adjacent positive pattern. But match and difference are
derived from the same input pair, while positive and negative patterns represent
separate spans of inputs.
Negative match patterns are not predictive on their own but are valuable for
allocation: computational resources of a no-longer predictive pattern should be used
elsewhere. Hence, the value of a negative pattern is borrowed from the predictive
value of the co-projected positive pattern, as long as combined additive match remains
above average. Consecutive positive and negative patterns project over the same future
input span, and these projections partly cancel each other. So, they should be
combined to form feedback, as explained in part 3.
Initial match is evaluated for inclusion into a higher positive or negative pattern.
The value is summed until its sign changes, and if positive, evaluated again for
cross-comparison among constituent inputs over increased distance. Second evaluation
is necessary because the cost of incremental syntax generated by cross-comparing is
per pattern rather than per input. Pattern is terminated and output to the next level
when the value sign changes. On the next level, it is compared to previous patterns of
the same compositional order.
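The first evaluation described above can be sketched for a 1D input sequence. This is an illustration under assumptions, not the project code: match is the direct (min) match from part 1, the filter `ave_match` is a fixed number, the pattern terminates when the sign of the per-comparison value changes, and the parameter names L, I, D, M (length, summed input, summed difference, summed match) are hypothetical labels.

```python
def form_patterns(inputs, ave_match):
    """Cluster consecutive comparisons into alternating positive / negative patterns."""
    patterns = []
    current = None
    for prior, p in zip(inputs, inputs[1:]):
        d = p - prior            # miss: difference
        m = min(p, prior)        # direct match
        v = m - ave_match        # value: deviation of match from the filter
        sign = v > 0
        if current is None or current["sign"] != sign:
            if current is not None:
                patterns.append(current)   # sign change terminates the pattern
            current = {"sign": sign, "L": 0, "I": 0, "D": 0, "M": 0}
        current["L"] += 1        # span of the pattern
        current["I"] += p        # summed input
        current["D"] += d        # summed difference
        current["M"] += m        # summed match
    if current is not None:
        patterns.append(current)           # flush the last, unterminated pattern
    return patterns


for pattern in form_patterns([5, 6, 7, 1, 1, 8, 9], ave_match=3):
    print(pattern)
# {'sign': True, 'L': 2, 'I': 13, 'D': 2, 'M': 11}
# {'sign': False, 'L': 3, 'I': 10, 'D': 1, 'M': 3}
# {'sign': True, 'L': 1, 'I': 9, 'D': 1, 'M': 8}
```

The second evaluation, for cross-comparison over increased distance within positive patterns, and the comparison between patterns on the next level are omitted here.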
Initial
inputs are pixels of video, or equivalent limit of positional resolution in
other modalities. Hierarchical search on higher levels should discover patterns
representing empirical objects and processes, and then relational logical and
mathematical shortcuts, eventually exceeding generality of our semantic concepts.
In
cognitive terms, everything we know is a pattern, the rest of input is noise,
filtered out by perception. For online learning, all levels should receive
inputs from lower levels and feedback from higher levels in parallel.
space-time dimensionality and initial implementation
Any prediction has two components: what and where. We must have both: value of
prediction = precision of what * precision of where. That “where” is currently
neglected: statistical ML represents space-time at greatly reduced resolution, if at
all. In the brain and some neuromorphic models, “where” is represented in a separate
network. That makes transfer of positional information very expensive and coarse,
reducing predictive value of representations. There is no such separation in my
patterns: they represent both what and where as local vars.
My core algorithm is 1D: time only (part 4). Our space-time is 4D, but each of these
dimensions can be mapped on one level of search. This way, levels can select input
patterns that are strong enough to justify the cost of representing an additional
dimension, as well as derivatives (matches and differences) in that dimension.
Initial 4D cycle of search would compare contiguous inputs, similarly to
connected-component analysis:
level 1 compares consecutive 0D pixels within horizontal scan line, forming 1D patterns: line segments.
level 2 compares contiguous 1D patterns between consecutive lines in a frame, forming 2D patterns: blobs.
level 3 compares contiguous 2D patterns between incremental-depth frames, forming 3D patterns: objects.
level 4 compares contiguous 3D patterns in temporal sequence, forming 4D patterns: processes.
(in simple video, time is added on level 3 and depth is computed from derivatives)
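Levels 1 and 2 of this cycle resemble row-by-row connected-component labelling, which can be sketched as follows. Assumptions for this illustration: per-pixel signs are already computed, segments in consecutive lines are connected only if they share a sign and overlap in x (no diagonal links), and a small union-find merges them into blobs. All names are hypothetical, and the summing of parameters per blob is left out.

```python
def segments(signs):
    """Level 1: split one line of per-pixel signs into (sign, x_start, x_end) spans.

    x_end is exclusive.
    """
    spans, start = [], 0
    for x in range(1, len(signs) + 1):
        if x == len(signs) or signs[x] != signs[start]:
            spans.append((signs[start], start, x))
            start = x
    return spans


def label_blobs(sign_rows):
    """Level 2: group same-sign segments that overlap in x between consecutive lines."""
    parent = {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    all_segments, prev = [], []
    for y, signs in enumerate(sign_rows):
        current = []
        for sign, x0, xn in segments(signs):
            seg_id = len(all_segments)
            parent[seg_id] = seg_id
            all_segments.append((y, sign, x0, xn))
            current.append((seg_id, sign, x0, xn))
            for up_id, up_sign, up_x0, up_xn in prev:
                if up_sign == sign and up_x0 < xn and x0 < up_xn:
                    parent[find(up_id)] = find(seg_id)   # merge into one blob
        prev = current

    blobs = {}
    for seg_id, seg in enumerate(all_segments):
        blobs.setdefault(find(seg_id), []).append(seg)
    return list(blobs.values())


rows = [[True, True, False],
        [True, False, False],
        [False, False, True]]
blobs = label_blobs(rows)
print(len(blobs))  # 3: one positive blob, one negative blob, one isolated positive pixel
```

Levels 3 and 4 would repeat the same overlap test between blobs in consecutive frames, which is not shown.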
Subsequent cycles would compare 4D input patterns over increasing distance in each
dimension, forming longer-range discontinuous patterns. These cycles can be coded as
an implementation shortcut, or formed by feedback of the core algorithm itself, which
should be able to discover maximal dimensionality of inputs. “Dimension” here is a
parameter that defines external sequence and distance among inputs. This is different
from conventional clustering, where both external and internal parameters are
dimensions. More in part 6.
However, average match at a given distance in
our spacetime is presumably equal over all four dimensions. That means
patterns defined in fewer dimensions will be fundamentally limited and biased
by the angle of scanning. Hence, initial pixel comparison and clustering into
patterns should also be over 4D at once, or at least over 2D for images and 3D
for video. This is our-universe-specific extension of my core algorithm.
There is also a vision-specific adaptation in the way I define initial match. Predictive
visual property is albedo, which means locally stable ratio of brightness /
intensity. Since lighting is usually uniform over much larger area than a pixel,
the difference in brightness between adjacent pixels should also be stable. Relative
brightness indicates some underlying property, so it should be cross-compared
to form patterns. But it’s reflected: it doesn’t really represent physical quantity /
density of an object. Thus, initial match is inverse deviation of gradient.
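One possible reading of "inverse deviation of gradient" is match = average absolute gradient minus the absolute gradient of the current pair. The sketch below follows that reading; the constant `AVE` and the function name `gradient_match` are hypothetical, and in the scheme described here the average would be set by feedback rather than fixed.

```python
AVE = 5  # hypothetical average absolute gradient; normally a feedback filter


def gradient_match(pixels, ave=AVE):
    """Return (difference, match) for each pair of adjacent pixels."""
    results = []
    for prior, pixel in zip(pixels, pixels[1:]):
        d = pixel - prior      # brightness gradient between adjacent pixels
        m = ave - abs(d)       # match is positive where the gradient is below average
        results.append((d, m))
    return results


derivatives = gradient_match([100, 102, 101, 120, 121])
```

With these inputs, only the sharp step from 101 to 120 yields negative match, which is the kind of boundary that would separate two patterns.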
We are currently coding 1st level algorithm: https://github.com/boriskz/CogAlg/wiki. 1D code is complete, but not
macro-recursive. We are extending it to 2D for image recognition, then to 3D
video for object and process recognition. Higher levels for each D-cycle algorithm
will process discontinuous search among full-D patterns. Complete hierarchical (meta-level)
algorithm will consist of:

1st level algorithm: contiguous cross-comparison over full-D cycle, plus bit-filter feedback

recurrent increment in complexity, extending current-level alg to next-level
alg. It will unfold increasingly complex higher-level input patterns for
cross-comparison, then combine results for evaluation and feedback.
We will then add colors, maybe audio and text. Initial testing could be
recognition of labeled images, but 2D is a poor representation of our 4D world;
video or stereo video is far better. Variation across space is a product of
past interactions, thus predictive of variation over time (which is normally
lower: we can’t speed up time).
3. Feedback of filters, attentional input selection, imagination, motor action (needs work)
After evaluation for inclusion into higher-level pattern, input is also evaluated as
feedback to lower levels. Feedback is update to filters that evaluate forward (Λ) and feedback (V), as described above but on
lower level.
Basic filter is average value of input’s projected match that co-occurs with (thus
predicts) average higher-level match, within a positive (above-average)
pattern. Both values are represented in resulting patterns.
Feedback
value =  forward value  value of derivatives / 2, both per higherlevel input
pattern. In turn, forward value is determined by higher-level feedback, and so
on. Thus, all higher levels affect selection on lower-level inputs. This is
because the span of each pattern approximates, hence projects over, combined
span of all lower levels. Default feedback propagates level-sequentially; more
expensive shortcut feedback may be sent to selected levels to filter inputs
that are already in the pipeline, or to permanently rearrange hierarchy.
Negative derivatives project increasing match: match to subsequent inputs is greater
than to previous inputs. Such feedback will further increase lower-level filter.
If filter is zero, all inputs are cross-compared, and if filter is negative, it is
applied to cancel subsequent filters for incrementally longer-range cross-comparison.
There is one filter for each compared variable within input pattern, initialized at 0
and updated by feedback.
novelty vs. generality
Any integrated system must have a common selection criterion. Two obvious cognitive
criteria are novelty and generality: miss and match. But we can’t select for
both: together they exhaust all possibilities. Novelty can’t be primary criterion: it
would select for noise and filter out all patterns, which are defined by match.
On the other hand, to maximize match of inputs to memory we can stare at a wall:
lock into predictable environments. But of course, natural curiosity actively
skips predictable locations, thus reducing the match.
This dilemma is resolved if we maximize predictive power: projected match, rather
than actual match, of inputs to records. To the extent that new match was
projected by the derivatives of past inputs, it doesn’t add to their projected
match. But neither does noise: novelty (difference to past inputs) that is not likely
to persist (match) in the future. Projection is only computed over some
discontinuity, as feedback from terminated patterns:
Additive projected match = new match - downward projected match (a filter, where projection
is m + Dm / 2).
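A minimal numeric sketch of this selection, taking additive projected match as new match minus the downward projection m + Dm / 2. The function names are hypothetical, and treating m and Dm as plain numbers summed over a terminated pattern is an assumption.

```python
def projected_match(m, dm):
    """Match projected from a terminated pattern: m + Dm / 2."""
    return m + dm / 2


def additive_projected_match(new_match, m, dm):
    """New match minus the match that was already projected downward as a filter."""
    return new_match - projected_match(m, dm)


# A pattern with match 10 and match difference 4 projects a match of 12.
fully_predicted = additive_projected_match(12, 10, 4)   # nothing is added
partly_novel = additive_projected_match(15, 10, 4)      # 3 units beyond projection
```

Under this reading, an input that merely confirms the projection has zero additive value, which is how selection for novelty stays subordinate to match.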
So, selection for novelty is done by subtracting higher-level projection from
corresponding input parameter. Higher-order positional selection is skipping
(or avoiding processing) predictable future input spans. Skipped input span is formally
a *coordinate* filter feedback: next coordinate of inputs with expected above-average
*additive* predictive value. Thus, next input location is selected first by
proximity and then by novelty, both relative to a template comparand. This is
covered in more detail in part 4, level 3.
Vertical evaluation computes deviations, to form positive or negative higher-level
patterns. Evaluation is relative to higher-level averages, which represent past
inputs, thus should be projected over feedback delay: average += average difference * (delay / average span) / 2.
Average per input variable may also be a feedback, representing redundancy
to higher level, which also depends on higher-level match rate: rM = match / input.
If rM > average per cost of processing: additive match = input match -
input-to-average match * rM.
Lateral comparison computes differences, to project corresponding parameters of all
derivation orders:
difference in magnitude of initial inputs: projected next input = last input + difference/2,
difference in input match, a subset of magnitude: projected next match = last match + match difference/2,
difference in match of match, a sub-subset of magnitude, projected correspondingly, and so on.
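The same projection rule can be applied to each derivation order in turn. The sketch below does so for inputs and for their matches; `project_next` is a hypothetical helper, and match is again assumed to be the smaller of two compared inputs.

```python
def project_next(values):
    """Project the next value of a series as last value + last difference / 2."""
    last_difference = values[-1] - values[-2]
    return values[-1] + last_difference / 2


inputs = [8, 10, 14]
# match between consecutive inputs, assumed to be the smaller comparand
matches = [min(a, b) for a, b in zip(inputs, inputs[1:])]

next_input = project_next(inputs)    # 14 + (14 - 10) / 2
next_match = project_next(matches)   # 10 + (10 - 8) / 2
```

Higher orders (match of match, and so on) would reuse the same helper on the series of the corresponding order.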
Ultimate criterion is top order of match on a top level of search: the most predictive
parameter in a system.
imagination, planning, action
Imagination is never truly original; it can only be formalized as interactive projection of
known patterns. As explained above, patterns send feedback to filter lower-level
sources. This feedback is to future sources, where the patterns are projected
to continue or re-occur. Stronger upstream patterns and correspondingly higher
filters reduce resolution of or totally skip predictable input spans. But when multiple
originally distant patterns are projected into the same location, their
feedback cancels out in proportion to their relative difference.
In other words, combined filter is cancelled out to the extent that co-projected
patterns are mutually exclusive:
filter = max_pattern_feedback - alt_pattern_feedback * match_rate. By default, match_rate
used here is average (match / max_comparand). But it has average error: average
abs(match_rate - average_match_rate). To improve filter accuracy, we can derive
actual match rate by cross-comparing co-projected patterns. I think imagination
is just that: search across co-projected patterns, before accessing their external
target sources.
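A small sketch of the default versus cross-compared filter, taking the filter as maximal pattern feedback minus alternative feedback scaled by match rate. The function names, the sample comparand pairs, and match as the smaller comparand are assumptions made for illustration; comparands are assumed positive.

```python
def match_rate(a, b):
    """match / max_comparand, with match assumed to be the smaller comparand."""
    return min(a, b) / max(a, b)


def combined_filter(max_feedback, alt_feedback, rate):
    """filter = max_pattern_feedback - alt_pattern_feedback * match_rate."""
    return max_feedback - alt_feedback * rate


# Hypothetical history of compared pattern pairs, used for the default rate.
pairs = [(4, 8), (9, 10), (3, 6)]
rates = [match_rate(a, b) for a, b in pairs]
average_rate = sum(rates) / len(rates)
average_error = sum(abs(r - average_rate) for r in rates) / len(rates)

default_filter = combined_filter(20, 10, average_rate)
# Cross-comparing the two co-projected patterns gives their actual rate instead.
actual_filter = combined_filter(20, 10, match_rate(9, 10))
```

The gap between `default_filter` and `actual_filter` is the kind of error that cross-comparison among co-projected patterns is meant to remove.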
Patterns are projected in space and time, depending on their past ST span and a vector
of input derivatives over that span. So, pattern input parameters in some
future location can be projected as:
(recorded input parameters) + (corresponding derivatives * relative distance) / 2.
Where relative distance = (projected coords - current coords) / span of the pattern
in the same direction.
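The projection formula can be sketched for a single coordinate as follows. The function name, the dictionary layout of parameters, and the use of one scalar coordinate are assumptions; the formula itself is read as recorded parameter + derivative * relative distance / 2.

```python
def project_pattern(params, derivatives, coords, projected_coords, span):
    """Project recorded parameters to a future location along one dimension."""
    # relative distance = (projected coords - current coords) / span of the pattern
    relative_distance = (projected_coords - coords) / span
    return {
        name: params[name] + derivatives[name] * relative_distance / 2
        for name in params
    }


projected = project_pattern(
    params={"I": 100, "M": 40},        # recorded input parameters
    derivatives={"I": 10, "M": -4},    # corresponding derivatives over the span
    coords=20,
    projected_coords=30,
    span=5,
)
```

Here the target is two pattern spans away, so each parameter shifts by one full derivative.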
Any search is defined by location: contiguous coordinate span. Span of feedback target is that
of feedback source’s input pattern: narrower than the span of feedback source
unit’s output pattern. So, search across co-projected patterns is performed on a
conceptually lower level, but patterns themselves belong to higher level. This means
that search will be within intersection
of co-projected patterns, vs. whole patterns. Intersection is a location within
each of the patterns, and cross-comparison will be among pattern elements in
that location.
Combined filter is then prevaluated: projected value of positive patterns is compared to
the cost of evaluating all inputs, both within a target location. If prevalue
is negative: projected inputs are not worth evaluating, their location is
skipped and “imagination” moves to the next nearest one. Filter search continues
until prevalue turns positive (with above-average novelty) and the sensor is
moved to that location. This sensor movement, along with adjustment of its
threshold, is the most basic form of motor feedback, AKA action.
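The skipping loop described above might look like the sketch below. The tuple layout of candidate locations, the flat cost per input, and the function name are all hypothetical; only the rule "skip while prevalue is negative, stop at the first positive one" comes from the text.

```python
def select_location(locations, cost_per_input):
    """Return the nearest location whose prevalue is positive, or None.

    locations: (coordinate, projected_value, n_inputs) tuples, ordered by proximity.
    """
    for coordinate, projected_value, n_inputs in locations:
        # prevalue: projected value of positive patterns vs. cost of evaluating inputs
        prevalue = projected_value - cost_per_input * n_inputs
        if prevalue > 0:
            return coordinate, prevalue   # the sensor would be moved here
        # negative prevalue: location is skipped, search moves to the next nearest
    return None


target = select_location([(1, 5, 4), (2, 7, 4), (3, 12, 4)], cost_per_input=2)
```

In this example the first two locations are skipped because their projected value does not cover the evaluation cost.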
Cognitive component of action is planning: a form of imagination where projected patterns
include those that represent the system itself. Feedback of such self-patterns
eventually reaches the bottom of representational hierarchy: sensors and
actuators, adjusting their sensitivity or intensity, and coordinates. This adjustment
is action. Such environmental interface is a part of any cognitive system,
although actuators are optional.
4. Initial levels of search, corresponding orders of feedback and resulting patterns
This part recapitulates and expands on my core algorithm, which operates in one
dimension: time only. Spatial and derived dimensions are covered in part 6.
Even within 1D, the search is hierarchical in scope, containing any number of
levels. New level is added when current top level terminates and outputs the
pattern it formed.
Higher-level patterns are fed back to select future inputs on lower levels. Feedback is sent
to all lower levels because span of each pattern approximates combined span of
inputs within whole hierarchy below it.
So, deeper hierarchy forms higher orders of feedback, with increasing elevation and
scope relative to its target: same-level prior input, higher-level match
average, beyond-the-next-level match value average, etc.
These orders of feedback represent corresponding order of input compression: input,
match between inputs, match between matches, etc. Such compression is produced
by comparing inputs to feedback of all orders.
Comparisons form patterns, of the order that corresponds to relative span of compared
feedback:
1: prior inputs are compared to the following ones on the same level, forming difference patterns dPs,
2: higher-level match is used to evaluate match between inputs, forming deviation patterns vPs,
3: higher-hierarchy value revaluates positive values of match, forming more selective shortcut patterns sPs
Feedback of 2nd order consists of input filters (if) defining value patterns, and coordinate
filters (Cf) defining positional resolution and relative distance to future inputs.
Feedback of 3rd order is shortcut filters for beyond-the-next level. These filters,
sent to a location defined by attached coordinate filters, form higher-order value
patterns for deeper internal and distant-level comparison.
Higher-order patterns are more selective: difference is as likely to be positive as
negative, while value is far more likely to be negative, because positive
patterns add costs of re-evaluation for extended cross-comparison among their
inputs. And so on, with selection and re-evaluation for each higher order of
positive patterns. Negative patterns are still compared as a whole: their weak
match is compensated by greater span.
All orders of patterns formed on the same level are redundant representations of
the same inputs. Patterns contain representation of match between their inputs,
which are compared by higher-order operations. Such operations increase overall
match by combining results of lower-order comparisons across pattern’s variables:
0Le: AND of bit inputs to form digitized integers, containing multiple powers of two
1Le: SUB of integers to form patterns, over additional external dimensions = pattern length L
2Le: DIV of multiples (L) to form ratio patterns, over additional distances = negative pattern length LL
3Le: LOG of powers (LLs), etc. Starting from second level, comparison is selective per element of an input.
Such power increase also applies in comparison to higher-order feedback, with a lag
of one level per order.
Power of coordinate filters also lags the power of input filters by one level:
1Le fb: binary sensor resolution: minimal and maximal detectable input value and coordinate increments
2Le fb: integer-valued average match and relative initial coordinate (skipping intermediate coordinates)
3Le fb: rational-valued coefficient per variable and multiple skipped coordinate range
4Le fb: real-valued coefficients and multiple coordinate-range skip
I am defining initial levels to find recurring increments in operations per level,
which could then be applied to generate higher levels recursively, by incrementing
syntax of output patterns and of feedback filters per level.
operations per generic level (out of date)
Level 0 digitizes inputs, filtered by minimal detectable magnitude: least significant
bit (i LSB). These bits are AND compared,
then their matches are AND compared again, and so on, forming integer outputs.
This is identical to iterative summation and bit-filtering by sequentially
doubled i LSB.
Level 1 compares consecutive integers, forming ± difference patterns (dP s). dP s are then evaluated to
cross-compare their individual differences, and so on, selectively increasing
derivation of patterns.
Evaluation: dP M (summed match) - dP aM (dP M per average match between
differences in level 2 inputs).
Integers are limited by the number of digits (#b), and input span: least significant bit
of coordinate (C LSB).
No 1st level feedback: fL cost is additive to dP cost, thus must be
justified by the value of dP (and coincident difference in value of patterns
filtered by adjusted i LSB), which is not known till dP is outputted to 2nd level.
Level 2 evaluates match within dP s | bf L (dP) s, forming ± value patterns: vP s | vP (bf L) s. +vP s are evaluated for
cross-comparison of their dP s, then of resulting derivatives, then of inputted
derivation levels. +vP (bf L) s are evaluated to cross-compare bf L s, then dP s, adjusted by the
difference between their bit filters, and so on.
dP variables are compared by subtraction, then resulting matches are combined with
dP M (match within dP) to evaluate these variables for cross-comparison by division,
to normalize for the difference in their span.
// match filter is also normalized by span ratio before evaluation, same-power
evaluation and comparison?
Feedback: input dP s | bf L (dP) are back-projected and
resulting magnitude is evaluated to increment or decrement 0th level i LSB. Such increments
terminate bit-filter span ( bf L (dP)), output it to 2nd level, and initiate a new i LSB span to filter future inputs.
// bf L (dP) representation: bf , #dP, Σ dP, Q (dP).
Level 3 evaluates match in input vP s or f L (vP) s, forming ± evaluation-value patterns: eP s | eP (fL) s. Positive eP s are evaluated for
cross-comparison of their vP s ( dP s ( derivatives ( derivation levels ( lower search-level
sources: buffered or external locations (selected sources may directly specify
strong 3rd level sub-patterns).
Feedback: input vP is back-projected, resulting match is compared to 2nd level filter, and the
difference is evaluated vs. filter-update filter. If update value is positive,
the difference is added to 2nd level filter, and filter span is terminated.
Same for adjustment of previously covered bit filters and 2nd level filter-update
filters?
This is similar to 2nd level operations, but input vP s are separated by
skipped-input spans. These spans are a filter of coordinate (Cf, higher-order than f for 2nd level inputs), produced by
prevaluation of future inputs:
projected novel match = projected magnitude * average match per magnitude -
projected-input match?
Prevalue is then evaluated vs. 3rd level evaluation filter + lower-level processing cost,
and negative prevalue-value input span (= span of back-projecting input) is
skipped: its inputs are not processed on lower levels.
// no prevaluation on 2nd level: the cost is higher than potential savings of only
1st level processing costs?
As distinct from input filters, Cf is defined individually rather than per filter
span. This is because the cost of Cf update: span representation and
interruption of processing on all lower levels, is minor compared to the value
of represented contents? ±eP = ±Cf: individual skip evaluation, no flushing?
or interruption is predetermined, as with Cb, fixed C f within C f L: a span of sampling
across fixed-L gaps?
alternating signed Cf s are averaged ±vP s?
Division: between L s, also inputs within
minimal-depth continuous d-sign or m-order derivation hierarchy?
tentative generalizations and extrapolations
So, filter resolution is increased per level, first for i filters and then for C
filters: level 0 has input bit filter, level 1 adds coordinate bit filter,
level 2 adds input integer filter, level 3 adds coordinate integer filter.
// coordinate filters (Cb, Cf) are not input-specific, patterns are formed by
comparing their contents.
Level 4 adds input multiple filter: eP match and its derivatives, applied in parallel to
corresponding variables of input pattern. Variable-values are multiplied and
evaluated to form pattern-value, for inclusion into next-level ±pattern // if
separately evaluated, input-variable value = deviation from average:
sign-reversed match?
Level 5 adds coordinate multiple filter: a sequence of skipped-input spans by
iteratively projected patterns, as described in imagination section of part 3.
Alternatively, negative coordinate filters implement cross-level shortcuts,
described in level 3 sub-part, which select for projected match-associated
novelty.
Additional variables in positive patterns increase cost, which decreases positive vs.
negative span proportion.
Increased difference in sign, syntax, span, etc., also reduces match between positive and
negative patterns. So, comparison, evaluation, prevaluation... on higher
levels is primarily for same-sign patterns.
Consecutive different-sign patterns are compared due to their proximity, forming ratios of
their span and other variables. These ratios are applied to project match
across different-sign gap or contrast pattern:
projected match += (projected match - intervening negative match) * (negative value /
positive value) / 2?
ΛV selection is incremented by induction:
forward and feedback of actual inputs, or by deduction: algebraic compression
of input syntax, to find computational shortcuts. Deduction is faster, but
actual inputs also carry empirical information. Relative value of additive
information vs. computational shortcuts is set by feedback.
Following sub-parts cover three initial levels of search in more detail, though out of
date:
Level 1: comparison to past inputs, forming difference patterns and match patterns
Inputs to the 1st level of search are single
integers, representing pixels of 1D scan line across an image, or equivalents
from other modalities. Consecutive inputs are compared to form differences,
difference patterns, matches, relative match patterns. This comparison may be
extended, forming higher and distant derivatives:
resulting variables per input: *=2 derivatives (d,m) per comp, + conditional *=2 (xd, xi) per extended comp:

       8 derivatives      // ddd, mdd, dd_i, md_i, + 1-input-distant dxd, mxd, + 2-input-distant d_ii, m_ii,
        /       \
    4 der      4 der      // 2 consecutive: dd, md, + 2 derivatives between 1-input-distant inputs: d_i and m_i,
    /   \      /   \
  d,m     d,m     d,m     // d, m: derivatives from default comparison between consecutive inputs,
  /  \    /  \    /  \
 i  >>  i  >>  i  >>  i   // i: single-variable inputs.
This is explained / implemented in my draft Python code: line_POC. That first level is for
generic 1D cognitive algorithm; its adaptation for image and then video
recognition algorithm will be natively 2D.
That’s what I spend most of my time on; the rest of this intro is significantly out of
date.
bit-filtering and digitization
1st level inputs are filtered
by the value of most and least significant bits: maximal and minimal detectable
magnitude of inputs. Maximum is a magnitude that co-occurs with average 1st level match, projected by
outputted dP s. Least significant bit
value is determined by maximal value and number of bits per variable.
This bit filter is initially adjusted by overflow in 1st level inputs, or by a set
number of consecutive overflows.
It’s also adjusted by feedback of higher-level patterns, if they project overflow or
underflow of 1st level inputs that exceeds the cost of adjustment.
Underflow is average number of 0 bits above top 1 bit.
Original input resolution may be increased by projecting analog magnification, by impact
or by distance.
Iterative bit-filtering is digitization: bit is doubled per higher digit, and exceeding
summed input is transferred to next digit. A digit can be larger than binary if
the cost of such filtering requires larger carry.
Digitization is the most basic way of compressing inputs, followed by comparison between
resulting integers.
hypothetical: comparable magnitude filter, to form minimal-magnitude patterns
This doesn’t apply to reflected brightness, only to types of input that do represent
physical quantity of a source.
Initial magnitude justifies basic comparison, and summation of below-average inputs
only compensates for their lower magnitude, not for the cost of conversion.
Conversion involves higher-power comparison, which must be justified by higher
order of match, to be discovered on higher levels.
iP min mag span conversion cost and comparison match would be on 2nd level, but it’s not
justified by 1st level match, unlike D span
conversion cost and comparison match, so it is effectively the 1st level of comparison?
possible +iP span evaluation: double evaluation + span representation cost <
additional lower-bits match?
The inputs may be normalized by subtracting feedback of average magnitude, forming
± deviation, then by dividing it by next+1 level feedback, forming a multiple
of average absolute deviation, and so on. Additive value of input is a
combination of all deviation orders, starting with 0th or absolute magnitude.
Initial input evaluation if any filter: cost < gain: projected negative-value (comparison
cost - positive value):
by minimal magnitude > ± relative magnitude patterns (iP s), and + iP s are evaluated or
cross-compared?
or by average magnitude > ± deviations, then by co-average deviation: ultimate bit
filter?
Summation *may* compensate for conversion if its span is greater than average per
magnitude spectrum?!
Summation on higher levels also increases span order, but within-order conversion is the
same, and between-order comparison is intra-pattern only. bf spans overlap
vP span, > filter conversion costs?
Level 2: additional evaluation of input patterns for feedback, forming filter patterns (out of date)
Inputs to 2nd level of search are
patterns derived on 1st level. These inputs are evaluated for feedback to update
0th level i LSB, terminating same-filter span.
Feedback increment of LSB is evaluated by deviation (∆) of magnitude, to avoid input overflow or
underflow:
∆ += I/ L  LSB a; ∆ > ff? while (∆ > LSB a){ LSB ±; ∆ = LSB a; LSB a *2};
LSB a
is average input (* V/ L?) per LSB value, and ff
is average deviation per positivevalue increment;
Σ (∆) before evaluation: no V patterns? #b++ and C LSB are more expensive,
evaluated on 3rd level?
They are also compared to previously inputted patterns, forming difference patterns
dPs and value patterns vPs per input variable, then combined into dPP s and vPP s per input pattern.
L * sign of consecutive dP s is a known miss, and match of dP variables is
correlated by common derivation.
Hence, projected match of other +dP and -dP variables = amk * (1 - L / dP). On the other hand, same-sign dP s are distant by L,
reducing projected match by amk * L, which is equal to reduction by miss of L?
So, dP evaluation is for two comparisons of equal value: cross-sign, then cross-L
same-sign (1 dP evaluation is blocked by
feedback of discovered or defined alternating sign and co-variable match
projection).
Both of last dP s will be compared to the
next one, thus past match per dP (dP M) is summed for three dP s:
dP M ( Σ ( last 3 dP s L+M))  a dP M (average of 4Le +vP dP M) > v, vs;; evaluation / 3 dP s > value, sign / 1 dP.
while (vs = ovs){ ovs = vs; V+=v; vL++; vP (L, I, M, D) += dP (L, I, M, D);; default vP  wide sum, select preserv.
vs > 0? comp (3 dP s){ DIV (L, I, M, D) > N, ( n, f, m, d); vP (N, F, M, D) += n, f, m, d;; sum: der / variable, n / input?
vr = v+ N? SUB (nf) > nf m; vd = vr+ nf m, vds = vd  a;; ratios are too small
for DIV?
while (vds = ovds){ ovds = vds; Vd+=vd; vdL++; vdP() += Q (d  ddP);; default Q (d  ddP) sum., select. preserv.
vds > 0? comp (1^{st} x l^{st} d  ddP s of Q (d) s);; splicing Q (d) s of matching dP s, cont. only: no comp ( Î£ Q (d  ddP)?
Σ vP ( Σ vd P eval: primary for P,
redundant to individual dP s ( d s for +P, cost *2, same for +P' I and P' M,D?
no Σ V  Vd evaluation of cont. comp
per variable or division: cost + vL = comp cost? Σ V per fb: no vL, #comp;
 L, I, M, D: same value per mag,
power / compression, but I  M, D redund = mag, +vP: I  2a,  vP: M, D  2a?
 no variable eval: cost (sub + vL + filter) > comp cost, but
match value must be adjusted for redundancy?
- normalization for
comparison: min (I, M, D) * rL, SUB (I, M, D)? Σ L (pat) vs C: more general but
interrupted?
variable-length
DIV: while (i > a){ while (i> m){ SUB (i, m) > d; n++; i=d;}; m/=2; t=m; SUB (d, t); f+= d;}?
additive
compression per d vs. m*d: > length cost?
tdP ( tM, tD, dP(), ddP Σ ( dMΣ (Q (dM)), dDΣ (Q (dD)), ddLΣ (Q (ddL)), Q (ddP))); // last d and D are within
dP()?
Input filter is a higher-level average, while filter update is accumulated over
multiple higher-level spans until it exceeds filter-update filter. So, filter
update is 2nd order feedback relative to
filter, as is filter relative to match.
But the same filter update is 3rd order of feedback when used to evaluate input
value for inclusion into pattern defined by a previous filter: update span is
two orders higher than value span.
Higher-level comparison between patterns formed by different filters is mediated, vs.
immediate continuation of current-level comparison across filter update
(mediated cont.: splicing between different-filter patterns by vertical
specification of match, although it includes lateral cross-comparison of
skip-distant specifications).
However, filter update feedback is periodic, so it doesn’t form continuous cross-filter
comparison patterns xPs.
adjustment of forward evaluation by optional feedback of projected input
More precisely, additive value or novel magnitude of an input is its deviation from
higher-level average. Deviation = input - expectation: (higher-level summed
input - summed difference /2) * rL (L / hL).
Inputs are compared to last input to form difference, and to past average to form
deviation or novelty.
But last input is more predictive of the next one than a more distant average, thus
the latter is compared on higher level than the former. So, input variable is
compared sequentially and summed within resulting patterns. On the next level,
the sum is compared vertically: to next-next-level average of the same
variable.
Resulting vertical match defines novel value for higher-level sequential comparison:
novel value = past match - (vertical match * higher-level match rate) - average novel
match:
nv = L+M - (m (I, (hI * rL)) * hM / hL) - hnM * rL; more precise than
initial value: v = L+M - hM * rL;
Novelty evaluation is done if higher-level match > cost of feedback and operations,
separately for I and D P s:
I, M ( D, M feedback, vertical SUB (I, nM ( D, ndM));
Impact on ambient sensor is separate from novelty and is predicted by
representational-value patterns?
- next-input prediction: seq
match + vert match * relative rate, but predictive selection is per level, not
input.
- higher-order expectation is
relative match per variable: pMd = D * rM, M/D, or D * rMd: Md/D,
- if rM | rMd are derived by
intra-pattern comparison, when average M | Md > average per division?
one-input search extension within cross-compared patterns
Match decreases with distance, so initial comparison is between consecutive inputs.
Resulting match is evaluated, forming ±vP s. Positive P s are then evaluated for expanded internal
search: cross-comparison among 1-input-distant inputs within a pattern (on same
level, higher-level search is between new patterns).
This cycle repeats to evaluate cross-comparison among 2-input-distant inputs,
3-input-distant inputs, etc., when summed current-distance match exceeds the
average per evaluation.
So, patterns of longer cross-comparison range are nested within selected positive
patterns of shorter range. This is similar to 1st level ddP s being nested within dP s.
Same input is re-evaluated for comparison at increased distance because match will
decay: projected match = last match * match rate (mr), * (higher-level mr /
current-level mr) * (higher-level distance / next distance)?
Or = input * average match rate
for that specific distance, including projected match within negative patterns.
It is re-evaluated also because projected match is adjusted by past match: mr *= past mr / past projected
mr?
Also, multiple comparisons per input form overlapping and redundant patterns (similar
to fuzzy clusters), and must be evaluated vs. filter * number of prior comparisons,
reducing value of projected match.
Instead of directly comparing incrementally distant input pairs, we can calculate their
difference by adding intermediate differences. This would obviate multiple
access to the same inputs during cross-comparison.
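The identity behind this shortcut is simple to check: the difference between two distant inputs equals the sum of the consecutive differences between them. In the sketch below, the function names and the `span` parameter (the number of consecutive differences being summed) are hypothetical.

```python
def consecutive_differences(inputs):
    """Differences between each input and the one before it."""
    return [b - a for a, b in zip(inputs, inputs[1:])]


def distant_differences(differences, span):
    """Differences across `span` steps, built only from consecutive differences."""
    return [
        sum(differences[i:i + span])
        for i in range(len(differences) - span + 1)
    ]


inputs = [3, 7, 6, 10, 15]
d = consecutive_differences(inputs)       # inputs are accessed only once, here
x1d = distant_differences(d, 2)           # d + d: across one skipped input
direct = [b - a for a, b in zip(inputs, inputs[2:])]   # direct comparison, for checking
```

Both routes give the same values, so extended comparison can work on the stored differences alone.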
These differences are also subtracted (compared), forming higher derivatives and
matches:

      ddd, x1dd, x2d        ( ddd: 3rd derivative, x1dd: d of 2-input-distant d s, x2d: d of 2-input-distant inputs)
        /        \
   dd, x1d      dd, x1d     ( dd: 2nd derivative, x1d = d+d = difference between 1-input-distant inputs)
   /     \      /     \
  d        d        d       ( d: difference between consecutive inputs)
 / \      / \      / \
i     i       i       i     ( i: initial inputs)
As always, match is a smaller input, cached or restored, selected by the sign of a
difference.
Comparison of both types is between all same-type variable pairs from different inputs.
Total match includes match of all its derivation orders, which will overlap for
proximate inputs.
Incremental cost of cross-comparison is the same for all derivation orders. If projected
match is equal to projected miss, then additive value for different orders of
the same inputs is also the same: reduction in projected magnitude of differences
will be equal to reduction in projected match between distant inputs?
multi-input search extension, evaluation of selection per input: tentative
On the next level, average match from expansion is compared to that from shorter-distance
comparison, and resulting difference is decay of average match with distance.
Again, this decay drives re-evaluation per expansion: selection of inputs with
projected decayed match above average per comparison cost.
Projected match is also adjusted by prior match (if local decay?) and redundancy
(symmetrical if no decay?)
Slower decay will reduce value of selection per expansion because fewer positive
inputs will turn negative:
Value of selection = Σ comp cost of neg-value inputs - selection cost (average
saved cost or relative delay?)
This value is summed between higher-level inputs, into average value of selection
per increment of distance. Increments with negative value of selection should
be compared without re-evaluation, adding to minimal number of comparisons per
selection, which is evaluated for feedback as a comparison-depth filter:
Σ (selection value per
increment) > average selection value;; for negative patterns of each depth,
 >1 only?
depth
adjustment value = average selection value; while (average selection value > selection cost){
depth
adjustment ±±; depth adjustment value = selection value per increment
(depthspecific?); };
depth
adjustment > minimal per feedback? >> lowerlevel depth filter;;
additive depth = adjustment value?

- match filter is summed and evaluated per current comparison depth?
- selected positive relative matches don’t reduce the benefit of pruning-out negative ones.
- skip if negative selection value: selected positive matches < selection cost: average value or relative delay?
Each input forms a queue of matches and misses relative to templates within comparison depth filter. These derivatives, both discrete and summed, overlap for inputs within each other’s search span. But representations of discrete derivatives can be reused: redundancy is only necessary for parallel comparison.
Assuming that environment is not random, similarity between inputs declines with spatio-temporal distance. To maintain proximity, an n-input search is FIFO: input is compared to all templates up to maximal distance, then added to the queue as a new template, while the oldest template is outputted into pattern-wide queue.
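The FIFO search described above can be sketched with a double-ended queue. This is a minimal illustration, not the project's code: `fifo_search`, the dictionary layout of a template, and the use of `min` as match between non-negative inputs are assumptions.

```python
from collections import deque

def fifo_search(inputs, max_distance):
    """Compare each input to all templates within max_distance, then enqueue it."""
    templates = deque()  # newest template on the left, oldest on the right
    outputs = []         # templates that left the search span, with their results
    for coord, value in enumerate(inputs):
        for t in templates:
            distance = coord - t["coord"]
            d = value - t["value"]        # miss: difference
            m = min(value, t["value"])    # match: the smaller comparand
            t["results"].append((distance, d, m))
        templates.appendleft({"coord": coord, "value": value, "results": []})
        if len(templates) > max_distance:
            outputs.append(templates.pop())  # oldest template is outputted
    return outputs, list(templates)

outputs, remaining = fifo_search([5, 7, 4, 9], max_distance=2)
```

Each outputted template carries one (distance, difference, match) tuple per later input within its search span; templates still inside the span stay in the queue.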
value-proportional combination of patterns: tentative
Summation of +dP and -dP is weighted by their value: L (summed d-sign match) + M (summed i match).
Such relative probability of +dP vs. -dP is indicated by corresponding ratios: rL = +L / -L, and rM = +M / -M.
(Ls and Ms are compared by division: comparison power should be higher for more predictive variables).
But weighting complementation incurs costs, which must be justified by value of ratio. So, division should be of variable length, continued while the ratio is above average. This is shown below for Ls, also applies to Ms:
dL = +L - -L; mL = min (+L, -L); nL = 0; fL = 0; efL = 1; // nL: L multiple, fL: L fraction, efL: extended fraction.
while (dL > adL){ dL = |dL|; // all Ls are positive; dL is evaluated for long division by adL: average dL.
while (dL > 0){ dL -= mL; nL++;} dL -= mL/2; dL > 0? fL += efL; efL /= 2;} // ratio: rL = nL + fL.
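One possible reading of this variable-length division, as a hedged Python sketch: the whole multiple is found by repeated subtraction, then binary fractions are added while the remainder stays above a threshold. The function name, the `min_remainder` threshold (standing in for the average-based cutoff adL) and the `max_bits` cap are assumptions, not the original procedure.

```python
def variable_length_ratio(larger, smaller, min_remainder=0.0, max_bits=16):
    """Ratio of larger to smaller, refined only while the remainder is worth it."""
    if smaller <= 0:
        raise ValueError("smaller comparand must be positive")
    multiple = 0       # nL: whole multiple
    remainder = larger
    while remainder >= smaller:
        remainder -= smaller
        multiple += 1
    fraction = 0.0     # fL: accumulated binary fraction
    bit = 0.5          # efL: value of the next fraction bit
    part = smaller / 2
    for _ in range(max_bits):
        if remainder <= min_remainder:  # stop when further precision is not justified
            break
        if remainder >= part:
            remainder -= part
            fraction += bit
        bit /= 2
        part /= 2
    return multiple + fraction

exact = variable_length_ratio(7, 2)                       # divided to the end
coarse = variable_length_ratio(9, 4, min_remainder=1.5)   # fraction is skipped
```

With a zero threshold the loop converges on the ordinary quotient; a higher threshold truncates the ratio to lower resolution, which is the point of making division length conditional.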
Ms’ long-division evaluation is weighted by rL: projected rM value = dM * nL (reduced-resolution rL) - adM.
Ms are then combined: cM = +M + -M * rL; // rL is relative probability of -M across iterated cL.
Ms are not projected (M+= D * rcL * rM D (MD/cD) /2): precision of higher-level rM D is below that of rM?
Prior ratios are combination rates: rL is probability of -M, and combined rL and rM (cr) is probability of -D.
If rM < arM, cr = rL, else: cr = (+L + +M) / (-L + -M) // cr = √(rL * rM) would lose L vs. M weighting.
cr predicts match of weighted cD between cdPs, where negative-dP variable is multiplied by above-average match ratio before combination: cD = +D + -D * cr. // after unweighted comparison between Ds?
Averages: arL, arM, acr, are feedback of ratios that co-occur with above-average match of span-normalized variables, vs. input variables. Another feedback is averages that evaluate long division: adL, adM, adD.
Both are feedback of positive C pattern, which represents these variables, inputted & evaluated on 3rd level.
; or 4th level: value of dPs * ratio is compared to value of dPs, & the difference is multiplied by cL / hLe cL?
Comparison of opposite-sign Ds forms negative match = smaller D, and positive difference dD = +D + |-D|.
dD magnitude predicts its match, not further combination. Single comparison is cheaper than its evaluation.
Comparison is by division if larger D co-occurs with hLe nD of above-average predictive value (division is sign-neutral & reductive). But average nD value is below the cost of evaluation, except if positive feedback?
So, default operations for L, M, D of complementary dPs are comparison by long division and combination.
D combination: +D + -D * cr, vs. cD * cr: +D vs. -D weighting is lost, meaningless if cD = 0?
Combination by division is predictive if the ratio is matching on higher level (hLe) & acr is fed back as filter?
Resulting variables: cL, rL, cM, rM, cr, cD, dD, form top level of cdP: complemented dP.
Level 3: prevaluation of projected filter patterns, forming updated-input patterns (out of date)
3rd level inputs are ± V patterns, combined into complemented V patterns. Positive V patterns include derivatives of 1st level match, which project match within future inputs (D patterns only represent and project derivatives of magnitude). Such projected-inputs-match is prevaluated, negative prevalue-span inputs are summed or skipped (reloaded), and positive prevalue-span inputs are evaluated or even directly compared.
Initial upward (Λ) prevaluation by E filter selects for evaluation of V patterns, within resulting ± E patterns. Resulting prevalue is also projected downward (V), to select future input spans for evaluation, vs. summation or skipping. The span is of projecting V pattern, same as of lower hierarchy. Prevaluation is then iterated over multiple projected-input spans, as long as last prevalue remains above average for the cost of prevaluation.
Additional interference of iterated negative projection is stronger than positive projection of lower levels, and should flush them out of pipeline. This flushing need not be final: spans of negative projected value may be stored in buffers, to delay the loss. Buffers are implemented in slower and cheaper media (tape vs. RAM) and accessed if associated patterns match on a higher level, thus project above-average match among their inputs.
Iterative back-projection is evaluated starting from 3rd level: to be projectable the input must represent derivatives of value, which are formed starting from 2nd level. Compare this to 2nd level evaluation:
Λ for input, V for V filter, iterated within V pattern. Similar sub-iteration in E pattern?
Evaluation value = projected-inputs-match - E filter: average input match that co-occurs with average higher-level match per evaluation (thus accounting for evaluation costs + selected comparison costs). Compare this to V filter that selects for 2nd level comparison: average input match that co-occurs with average higher-level match per comparison (thus accounting for costs of default cross-comparison only).
E filter feedback starts from 4th level of search, because its inputs represent prevaluated lower-level inputs.
4th level also pre-prevaluates vs. prevaluation filter, forming pre-prevalue that determines prevaluation vs. summation of next input span. And so on: the order of evaluation increases with the level of search.
Higher levels are increasingly selective in their inputs, because they additionally select by higher orders derived on these levels: magnitude -> match and difference of magnitude -> match and difference of match, etc.
Feedback of prevaluation is ± prefilter: binary evaluation-value sign that determines evaluating vs. skipping initial inputs within projected span, and flushing those already pipelined within lower levels.
Negative feedback may be iterated, forming a skip span.
Parallel lower hierarchies & skip spans may be assigned to different external sources or their internal buffers.
Filter update feedback is level-sequential, but prefilter feedback is sent to all lower levels at once.
Prefilter is defined per input, and then sequentially translated into prefilters of higher derivation levels:
prior value += prior match -> value sign: next-level prefilter. If there are multiple prefilters of different evaluation orders from corresponding levels, they AND & define infra-patterns: sign -> input -> derivatives.
filter update evaluation and feedback
Negative evaluation-value blocks input evaluation (thus comparison) and filter updating on all lower levels. Not-evaluated input spans (gaps) are also outputted, which will increase coordinate range per contents of both higher-level inputs and lower-level feedback. Gaps represent negative projected-match value, which must be combined with positive value of subsequent span to evaluate comparison across the gap on a higher level. This is similar to evaluation of combined positive + negative relative match spans, explained above.
Blocking locations with expected inputs will result in preference for exploration & discovery of new patterns, vs. confirmation of the old ones. It is the opposite of upward selection for stronger patterns, but sign reversal in selection criteria is a basic feature of any feedback, starting with average match & derivatives.
Positive evaluation-value input spans are evaluated by lower-level filter, & this filter is evaluated for update:
combined update = (output update + output filter update / (same-filter span (fL) / output span)) /2.
both updates: = last feedback, equal-weighted because higher-level distance is compensated by range: fL?
update value = combined update - update filter: average update per average higher-level additive match.
also differential costs of feedback transfer across locations (vs. delay) + representation + filter conversion?
If update value is negative: fL += new inputs, subdivided by their positive or negative predictive value spans.
If update value is positive: lower-level filter += combined update, new fL (with new filter representation) is initialized on a current level, while current-level part of old fL is outputted and evaluated as next-level input.
In turn, the filter gets updates from higher-level outputs, included in higher-higher-level positive patterns by that level’s filter. Hence, each filter represents combined span-normalized feedback from all higher levels, of exponentially growing span and reduced update frequency.
Deeper hierarchy should block greater proportion of inputs. At the same time, increasing number of levels contribute to projected additive match, which may justify deeper search within selected spans.
Higher-level outputs are more distant from current input due to elevation delay, but their projection range is also greater. So, outputs of all levels have the same relative distance (distance/range) to a next input, and are equal-weighted in combined update. But if input span is skipped, relative distance of skip-initiating pattern to next input span will increase, and its predictive value will decrease. Hence, that pattern should be flushed or at least combined with a higher-level one:
combined V prevalue = higher-level V prevalue + ((current-level V prevalue - higher-level V prevalue) / ((current-level span / distance) / (higher-level span / distance))) /2. // the difference between current-level and higher-level prevalues is reduced by the ratio of their relative distances.
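The formula can be written out directly, as a small sketch. The function and parameter names are hypothetical, and the placement of the final division by 2 follows the comment: the difference is first reduced by the ratio of span-per-distance, then halved.

```python
def combined_prevalue(current, higher, current_span, current_distance,
                      higher_span, higher_distance):
    """Combine current-level and higher-level V prevalues, per the formula above."""
    current_rel = current_span / current_distance
    higher_rel = higher_span / higher_distance
    ratio = current_rel / higher_rel
    return higher + ((current - higher) / ratio) / 2

# Equal span-per-distance on both levels: plain average of the two prevalues.
equal = combined_prevalue(10, 6, current_span=4, current_distance=4,
                          higher_span=8, higher_distance=8)
# Ratio of 2 between the levels: the difference contributes half as much.
reduced = combined_prevalue(10, 6, current_span=4, current_distance=2,
                            higher_span=8, higher_distance=8)
```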
To speed up selection, filter updates can be sent to all lower levels in parallel. Multiple direct filter updates are span-normalized and compared at a target level, and the differences are summed in combined update. This combination is equal-weighted because all levels have the same span-per-distance to next input, where the distance is the delay of feedback during elevation. // this happens automatically in level-sequential feedback?
combined update = filter update + distance-normalized difference between output & filter updates:
((output update - filter update) / (output relative distance / higher-output relative distance)) /2.
This combination method is accurate for post-skipped input spans, as well as next input span.
- filter can also be replaced by output + higher-level filter /2, but value of such feedback is not known.
- possible fixed-rate sampling, to save on feedback evaluation if slow decay, ~ deep feed-forward search?
- selection can be by patterns, derivation orders, subpatterns within an order, or individual variables?
- match across distance also projects across distance: additive match = relative match * skipped distance?
cross-level shortcuts: higher-level subfilters and symbols
After individual input comparison, if match of a current scale (length-of-a-length…) projects positive relative match of input lower-scale / higher-derivation level, then the latter is also cross-compared between the inputs.
Lower scale levels of a pattern represent old lower levels of a search hierarchy (current or buffered inputs).
So, feedback of lower scale levels goes down to corresponding search levels, forming shortcuts to preserve detail for higher levels. Feedback is generally negative: expectations are redundant to inputs. But specifying feedback may be positive: lower-level details are novel to a pattern, & projected to match with it in the future.
Higher-span comparison power is increased if lower-span comparison match is below average:
variable subtraction -> span division -> super-span logarithm?
Shortcuts to individual higher-level inputs form a queue of subfilters on a lower level, possibly represented by a queue-wide prefilter. So, a level has one filter per parallel higher level, and subfilter for each specified subpattern. Subfilters of incrementally distant inputs are redundant to all previous ones.
Corresponding input value = match - subfilter value * rate of match to subfilter * redundancy?
Shortcut to a whole level won’t speed up search: higher-level search delay > lower-hierarchy search delay.
Resolution and parameter range may also increase through interaction of co-located counter-projections?
Symbols, for communication among systems that have common high-level concepts but no direct interface, are “co-author identification” shortcuts: their recognition and interpretation is performed on different levels.
Higher-level patterns have increasing number of derivation levels, that represent corresponding lower search levels, and project across multiple higher search levels, each evaluated separately?
Match across discontinuity may be due to additional dimensions or internal gaps within patterns.
Search depth may also be increased by cross-comparison between levels of scale within a pattern: match across multiple scale levels also projects over multiple higher and lower scale levels? Such comparison between variable types within a pattern would be of a higher order:
5. Comparison between variable types within a pattern (tentative)
To reiterate, elevation increases syntactic complexity of patterns: the number of different variable types within them. Syntax is identification of these types by their position (syntactic coordinate) within a pattern. This is analogous to recognizing parts of speech by their position within a sentence.
Syntax “synchronizes” same-type variables for comparison / aggregation between input patterns. Access is hierarchical, starting from sign->value levels within each variable of difference and relative match: sign is compared first, forming + and - segments, which are then evaluated for comparison of their values.
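The sign-first access order can be illustrated with a short sketch: values are first grouped into same-sign segments, and only the resulting segments carry summed values for later evaluation. The name `segment_by_sign` and the choice to treat zero as non-positive are assumptions made for this example.

```python
from itertools import groupby

def segment_by_sign(values):
    """Group consecutive same-sign values into + and - segments with summed value."""
    segments = []
    for positive, group in groupby(values, key=lambda v: v > 0):
        group = list(group)
        segments.append({
            "sign": "+" if positive else "-",  # zero falls into "-" (assumption)
            "length": len(group),
            "sum": sum(group),
            "values": group,
        })
    return segments

segments = segment_by_sign([2, 3, -1, -4, 5])
```

Comparing signs is cheaper than comparing values, so the value comparison can be made conditional on the summed value or length of each segment.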
Syntactic
expansion is pruned by selective comparison vs. aggregation of individual
variable types within input patterns, over each coordinate type or resolution.
As with templates, minimal aggregation span is resolution of individual inputs,
& maximal span is determined by average magnitude (thus match) of new
derivatives on a higher level. Hence, a basic comparison cycle generates queues
of interlaced individual & aggregate derivatives at each template variable,
and conditional higher derivatives on each of the former.
Sufficiently complex syntax or predictive variables will justify comparing across “syntactic” coordinates within a pattern, analogous to comparison across external coordinates. In fact, that’s what higher-power comparisons do. For example, division is an iterative comparison between difference & match: within a pattern (external coordinate), but across derivation (syntactic coordinate).
Also cross-variable is comparison between orders of match in a pattern: magnitude, match, match-of-match... This starts from comparison between match & magnitude: match rate (mr) = match / magnitude. Match rate can then be used to project match from magnitude: match = magnitude * output mr * filter mr.
In this manner, mr of each match order adjusts intra-order-derived sequentially higher-order match:
match *= lower inter-order mr. Additive match is then projected from adjusted matches & their derivatives.
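The two formulas above, match rate and projection of match from magnitude, amount to the following sketch. Function names are hypothetical, and returning a zero rate for zero magnitude is an assumption added to keep the example total.

```python
def match_rate(match, magnitude):
    """mr = match / magnitude; zero magnitude is treated as zero rate (assumption)."""
    return match / magnitude if magnitude else 0.0

def project_match(magnitude, output_mr, filter_mr):
    """Projected match = magnitude * output mr * filter mr."""
    return magnitude * output_mr * filter_mr

output_mr = match_rate(match=6, magnitude=8)
projected = project_match(magnitude=8, output_mr=output_mr, filter_mr=0.5)
```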
This inter-order projection continues up to the top order of match within a pattern, which is the ultimate selection criterion because that’s what’s left matching on the top level of search.
Inter-order vectors are ΛV symmetrical, but ΛV derivatives from lower order of match are also projected for higher-order match, at the same rate as the match itself?
Also possible is comparison across syntactic gaps: ΛY comparison -> difference, filter feedback VY hierarchy. For example, comparison between dimensions of a multi-D pattern will form possibly recurrent proportions.
Internal comparisons can further compress a pattern, but at the cost of adding a higher-order syntax, which means that they must be increasingly selective. This selection will increase “discontinuity” over syntactic coordinates: operations necessary to convert the variables before comparison. Eventually, such operators will become large enough to merit direct comparisons among them. This will produce algebraic equations, where the match (compression) is a reduction in the number of operations needed to produce a result.
The first such shortcut would be a version of Pythagorean theorem, discovered during search in 2D (part 6) to compute cosines. If we compare 2D-adjacent 1D Ls by division, over 1D distance and derivatives (an angle), partly matching ratio between the ratio of 1D Ls and a 2nd derivative of 1D distance will be a cosine.
Cosines
are necessary to normalize all derivatives and lengths (Ls) to a value they
have when orthogonal to 1D scan lines (more in part 6).
Such
normalization for a POV angle is similar to dimensionality reduction in Machine Learning, but
is much more efficient because it is secondary to selective dimensionality
expansion. It’s not really “reduction”: dimensionality is prioritized rather
than reduced. That is, the dimension of pattern’s main axis is maximized, and
dimensions sequentially orthogonal to higher axes are correspondingly
minimized. The process of discovering these axes is so basic that it might be
hardwired in animals.
6. Cartesian dimensions and sensory modalities (out of date)
This
is a recapitulation and expansion on incremental dimensionality introduced in
part 2.
Term “dimension” here is reserved for a parameter that defines sequence and distance among inputs, initially Cartesian dimensions + Time. This is different from terminology of combinatorial search, where dimension is any parameter of an input, and their external order and distance don’t matter. My term for that is “variable”: external dimensions become types of a variable only after being encoded within input patterns.
For those with ANN background, I want to stress that a level of search in my approach is a 1D queue of inputs, not a layer of nodes. The inputs to a node are combined regardless of difference and distance between them (the distance is the difference between laminar coordinates of source “neurons”).
These derivatives are essential because value of any prediction = precision of what * precision of where. Coordinates and co-derived differences are not represented in ANNs, so they can't be used to calculate Euclidean vectors. Without such vectors, prediction and selection of where must remain extremely crude.
Also,
layers in ANN are orthogonal to the direction of input flow, so hierarchy is at
least 2D. The direction of inputs to my queues is in the same dimension as the
queue itself, which means that my core algorithm is 1D. A hierarchy of 1D
queues is the most incremental way to expand search: we can add or extend only
one coordinate at a time. This allows algorithm to select inputs that are
predictive enough to justify the cost of representing additional coordinate and
corresponding derivatives. Again, such incremental syntax expansion is my core
principle, because it enables selective (thus scalable) search.
A common objection is that images are “naturally” 2D, and our space-time is 4D. Of course, these empirical facts are practically universal in our environment. But a core cognitive algorithm must be able to discover and forget any empirical specifics on its own. Additional dimensions can be discovered as some general periodicity in the input flow: distances between matching inputs are compared, match between these distances indicates a period of lower dimension, and recurring periods form higher-dimension coordinate.
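As a toy illustration of discovering a dimension from periodicity, the sketch below records the distance from each input to its nearest later matching input, then looks for a distance that recurs. The function name, the exact-or-tolerance notion of "matching", the `min_distance` parameter (to skip immediate neighbours) and the use of a simple frequency count as "match between distances" are all simplifying assumptions.

```python
from collections import Counter

def recurring_distance(seq, tolerance=0, min_distance=2):
    """Return the most recurrent distance between matching inputs, or None."""
    distances = []
    for i, value in enumerate(seq):
        for j in range(i + min_distance, len(seq)):
            if abs(seq[j] - value) <= tolerance:
                distances.append(j - i)  # distance to the nearest later match
                break
    if not distances:
        return None
    distance, count = Counter(distances).most_common(1)[0]
    return distance if count > 1 else None  # a single distance is not a period

# A 3-row "image" with identical rows of width 4, flattened into a 1D input flow.
rows = [[0, 5, 9, 2]] * 3
flow = [pixel for row in rows for pixel in row]
period = recurring_distance(flow)
```

Here the recovered period is the row width, which is what a second (vertical) coordinate would encode.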
But as a practical shortcut to expensive dimension-discovery process, initial levels should be designed to specialize in sequentially higher spatial dimensions: 1D scan lines, 2D frames, 3D set of confocal “eyes”, 4D temporal sequence. These levels discover contiguous (positive match) patterns of increasing dimensionality:
1D line segments, 2D blobs, 3D objects, 4D processes. Higher 4D cycles form hierarchy of multi-dimensional orders of scale, integrated over time or distributed sensors. These higher cycles compare discontinuous patterns. Corresponding dimensions may not be aligned across cycles of different scale order.
Explicit coordinates and incremental dimensionality are unconventional. But the key for scalable search is input selection, which must be guided by cost-benefit analysis. Benefit is projected match of patterns, and cost is representational complexity per pattern. Any increase in complexity must be justified by corresponding increase in discovered and projected match of selected patterns. Initial inputs have no known match, thus must have minimal complexity: single-variable “what”, such as brightness of a grey-scale pixel, and single-variable “where”: pixel’s coordinate in one Cartesian dimension.
Single coordinate means that comparison between pixels must be contained within 1D (horizontal) scan line, otherwise their coordinates are not comparable and can’t be used to select locations for extended search. Selection for contiguous or proximate search across scan lines requires second (vertical) coordinate. That increases costs, thus must be selective according to projected match, discovered by past comparisons within 1D scan line. So, comparison across scan lines must be done on 2nd level of search. And so on.
Dimensions are added in the order of decreasing rate of change. This means spatial dimensions are scanned first: their rate of change can be sped up by moving sensors. Comparison over purely temporal sequence is delayed until accumulated change / variation justifies search for additional patterns. Temporal sequence is the original dimension, but it is mapped on spatial dimensions until spatial continuum is exhausted. Dimensionality represented by patterns is increasing on higher levels, but each level is a 1D queue of patterns.
Also independently discoverable are derived coordinates: any variable with cumulative match that correlates with combined cumulative match of all other variables in a pattern. Such correlation makes a variable useful for sequencing patterns before cross-comparison.
It is discovered by summing matches for same-type variables between input patterns, then cross-comparing summed matches between all variables of a pattern. Variable with the highest resulting match of match (mm) is a candidate coordinate. That mm is then compared to mm of current coordinate. If the difference is greater than cost of reordering future inputs, sequencing feedback is sent to lower levels or sensors.
Another type of empirically distinct variables is different sensory modalities: colors, sound and pitch, and so on, including artificial senses. Each modality is processed separately, up to a level where match between patterns of different modalities but same scope exceeds match between unimodal patterns across increased distance. Subsequent search will form multimodal patterns within common ST frame of reference.
As with external dimensions, difference between modalities can be predefined or discovered. If the latter, inputs of different modalities are initially mixed, then segregated by feedback. Also as with dimensions, my core algorithm only assumes single-modal inputs: predefining multiple modalities would be an add-on.
7. Notes on clustering, ANNs, and probabilistic inference (out of date)
In terms of conventional machine learning, my approach is a form of hierarchical fuzzy clustering. Cluster is simply a different term for pattern: a set of matching inputs. Each set is represented by a centroid: an input with below-threshold combined “distance” to other inputs of the set. The equivalent of centroid in my model is an input with above-average match (a complement of distance) to other inputs within its search span. Such inputs are selected to search next-level queue and to indirectly represent other cross-compared but not selected inputs, via their discrete or aggregate derivatives relative to the selected one.
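The centroid analogy can be made concrete with a small sketch that selects inputs whose summed match to neighbours within a search span is above average. The name `select_templates`, the symmetric span, `min` as match between non-negative inputs, and the unnormalized totals (edge inputs have fewer comparands) are assumptions of this example, not the model's definitions.

```python
def select_templates(inputs, span):
    """Return indices of inputs with above-average summed match within +/- span."""
    totals = []
    for i, value in enumerate(inputs):
        lo, hi = max(0, i - span), min(len(inputs), i + span + 1)
        # match is taken as the smaller of two non-negative comparands
        totals.append(sum(min(value, inputs[j]) for j in range(lo, hi) if j != i))
    average = sum(totals) / len(totals)  # assumes a non-empty input list
    return [i for i, total in enumerate(totals) if total > average]

selected = select_templates([1, 9, 8, 9, 1, 1], span=1)
```

Unlike a randomly initialized centroid, each selected input here is chosen from the results of prior comparisons, which is the distinction drawn in the text.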
The crucial difference here is that conventional clustering methods initialize centroids with arbitrary random weights, while I use matches (and so on) from past comparisons. And the weights are usually defined in terms of one variable, while I select higher-level inputs based on a combination of all variables per pattern, the number of which increases with the level of search.
Current methods in unsupervised learning were developed / accepted because they solved specific problems with reasonable resources. But they aren’t comparable to human learning in scalability. I believe that requires an upfront investment in incremental complexity of representation: a syntactic overhead that makes such representation uncompetitive at short runs, but is necessary to predictively prune longer-range search.
The most basic example here is my use of explicit coordinates, and of input differences at their distances. I haven’t seen that in other low-level approaches, yet they are absolutely necessary to form Euclidean vectors. Explicit coordinates are why I start image processing with 1D scan lines, another thing that no one else does. Images seem to be “naturally” 2D, but expanding search in 2D at once adds the extra cost of two sets of new coordinates & derivatives. On the other hand, adding 1D at a time allows selecting inputs for each additional layer of syntax, reducing overall (number of variables * number of inputs) costs of search on the next level.
Artificial Neural Networks
The same coordinate-blind mindset pervades ANNs. Their learning is probabilistic: match is determined as an aggregate of multiple weighted inputs. Again, no derivatives (0D miss), or coordinates (1-4D miss) per individual input pair are recorded. Without them, there are no Euclidean vectors, thus pattern prediction must remain extremely crude. I use aggregation extensively, but this degradation of resolution is conditional on results of prior comparison between inputs. Neurons simply don’t have the capacity for primary comparison.
This creates a crucial difference in the way patterns are represented in the brain vs. my hierarchy of queues. Brain consists of neurons, each likely representing a variable of a given magnitude and modality. These variables are shared among multiple patterns, or co-activated networks (“cognits” in terms of J. Fuster).
This
is conceptually perverse: relatively constant values of specific variables
define a pattern, but there’s no reason that same values should be shared among
different patterns. The brain is forced to share variables because it has fixed
number of neurons, but a fluid and far greater number of their networks.
I think this is responsible for our crude taxonomy, such as using large, medium, small instead of specific numbers. So, it’s not surprising that our minds are largely irrational, even leaving aside all the subcortical nonsense. We don’t have to slavishly copy these constraints. Logically, a co-occurring set of variables should be localized within a pattern. This requires more local memory for redundant representations, but will reduce the need for interconnect and transfers to access global shared memory, which is far more expensive.
More broadly, neural network-centric mindset itself is detrimental: any function must be initially conceptualized as a sequential algorithm, parallelization is an optional superstructure.
Probabilistic inference: AIT, Bayesian logic, Markov models
A good introduction to Algorithmic information theory is Philosophical Treatise of Universal Induction by Rathmanner and Hutter. The criterion is the same as mine: compression and prediction. But, while a progress vs. frequentist probability calculus, both AIT and Bayesian inference still assume a prior, which doesn’t belong in a consistently inductive approach. In my approach, priors or models are simply past inputs and their patterns. Subject-specific priors could speed up learning, but unsupervised pattern discovery algorithm must be the core to which such shortcuts are added or from which they are removed.
More importantly, as with most contemporary approaches, Bayesian learning is statistical and probabilistic. Probability is estimated from simple incidence of events, which I think is way too coarse. These events hardly ever match or miss precisely, so their similarity should be quantified. This would add a whole new dimension of micro-grayscale: partial match, as in my approach, vs. binary incidence in probabilistic inference. It should improve accuracy to the same extent that probabilistic inference improved on classical logic, by adding a macro-grayscale of partial probability vs. binary true / false values of the former.
Resolution of my inputs is always greater than that of their coordinates, while Bayesian inference and AIT typically start with the reverse: strings of 1-bit inputs. These inputs, binary confirmations / disconfirmations, are an extremely crude way to represent “events” or inputs. Besides, the events are assumed to be high-level concepts: the kind that occupy our conscious minds and are derived from senses by subconscious cognitive processes, which must be built into general algorithm. Such choice of initial inputs in BI and AIT demonstrates a lack of discipline in incrementing complexity.
To attempt a general intelligence, Solomonoff introduced “universal prior”: a class of all models. That class is a priori infinite, which means that he hits combinatorial explosion even *before* receiving actual inputs. It’s a solution that only a mathematician may find interesting. Marginally practical implementation of AIT is Levin Search, which randomly generates models / algorithms of incremental complexity and selects those that happen to solve a problem or compress a bit string.
Again, I think starting with prior models is putting a cart before a horse: cognition must start with raw data, complex math only becomes cost-efficient on much higher levels of selection and generalization. And this distinction between input patterns and pattern discovery process is only valid within a level: algorithm is embedded in resulting patterns, hence is also compared on higher levels, forming “algorithmic patterns”.
8. Notes on working mindset and a prize for contributions
My terminology is as general as the subject itself. It’s a major confounder: people crave context, but generalization is decontextualization. And cognitive algorithm is a meta-generalization: the only thing in common for everything we learn. This introduction is very compressed, because much of it is work in progress. But I think it also reflects and cultivates ruthlessly reductionist mindset required for such subject.
My math is very simple, because algorithmic complexity must be incremental. Advanced math can accelerate learning on higher levels of generalization, but it’s too expensive for initial levels. And minimal general learning algorithm must be able to discover computational shortcuts (AKA math) on its own, just like we do. Complex math is definitely not innate in humans on any level: cavemen didn’t do calculus.
This theory
may seem too speculative, but any degree of generalization must be
correspondingly lossy. Which is contrary to precision-oriented culture of math
and computer science. Hence, current Machine Learning is mostly experimental, and
the progress on algorithmic side is glacial. A handful of people aspire to work
on AGI, but they either lack or neglect functional definition
of intelligence; their theories are only vague inspiration.
I
think working on this level demands greater delay of experimental verification
than is acceptable in any established field. Except for philosophy, which has
nothing else real to study. But established philosophers have always been
dysfunctional fluffers, not surprisingly as their only paying customers are
college freshmen.
Our
main challenge in formalizing GI is a species-wide ADHD. We didn’t evolve for
sustained focus on this level of generalization; that would cause extinction
long before any tangible results. Which is no longer a risk: GI is the most
important problem conceivable, and we have plenty of computing power for
anything better than brute-force algorithms. But our psychology lags a
light-year behind technology: we still hobble on mental crutches of irrelevant
authority and peer support, flawed analogies and needless experimentation.
Prize for contributions
I
offer prizes up to a total of $500K for debugging, optimizing, and extending
this algorithm: github.
Contributions
must fit into incremental-complexity hierarchy outlined here. Unless you find a
flaw in my reasoning, which would be even more valuable. I can also pay
monthly, but there must be a track record.
Winners
will have an option to convert the awards into an interest in all commercial
applications of a final algorithm, at the rate of $10K per 1% share. This option
is informal and likely irrelevant, mine is not a commercial enterprise. Money can’t
be primary motivation here, but it saves time.
Winners so far:
2010: Todor Arnaudov, $600 for suggestion to
buffer old inputs after search. This occurred to me more than once before, but
I rejected it as redundant to potential elevation of these inputs. Now that he
made me think about it again, I realized that partial redundancy can preserve
the detail at much lower cost than elevation.
The
buffer is accessed if coordinates of contents get relatively close to projected
inputs (that and justification is mine). It didn’t feel right because brain has
no substrate for passive memory, but we do now.
2011:
Todor, $400 consolation prize for understanding some ideas that were not
clearly explained here.
2016:
Todor Arnaudov, $500 for multiple
suggestions on implementing the algorithm, as well as for the effort.
2017:
Alexander Loschilov, $2800 for help in
converting my level 1 pseudo code into Python, consulting on PyCharm and SciPy,
and for insistence on 2D clustering, February-April.
Todor Arnaudov, $2000 for help in
optimizing level_1_2D, June-July.
Kapil Kashyap, $2000 for stimulation
and effort, help with Python and level_1_2D, September-October.
2018:
Todor Arnaudov, $1000 mostly for effort and
stimulation, January-February.
Andrei Demchenko, $1800 for conventional refactoring in
line_POC_introductory.py, interface improvement and a few improvements in the
code, April-May.
Todor Arnaudov, $2000 for help in debugging
frame_dblobs.py, September-October.
Khanh Nguyen, $2700 for getting line_POC to work.
2019:
Stephan Verbeeck, $2000 for getting me to return to using minimally-coarse gradient and
his perspective on colors and line tracing, January-June.
Todor Arnaudov, $1600, frequent participant,
March-June.
Kok Wei Chee, $900 for diagrams of line_POC and frame_blobs, December.
Khanh Nguyen, $10100, lead debugger and co-designer, January-December.