Intelligence is ability to predict and plan (self-predict), which
can only be done by discovering and projecting patterns. This definition is well
established: pattern recognition is a core of any IQ test, and prediction is
the only use of recognition. But there is no general and constructive definition of either
pattern or recognition. More precisely, recognition is a quantified similarity (match),
which currently lacks a consistent measure.

So, I came up with my own definitions, which directly translate
into algorithm introduced here.

For excellent popular introduction to cognition-as-prediction
thesis see “On Intelligence” by Jeff Hawkins and “How to Create a Mind“ by Ray Kurzweil. But on a technical level,
both of them and most current researchers implement pattern discovery via
artificial neural networks, which operate in a very coarse statistical fashion.

Less coarse (more selective) are Capsule Networks, recently introduced by
Geoffrey Hinton et al. But they are largely ad hock, still work-in-progress,
and still use layers of CNN. Neither CNN nor CapsNet
is theoretically derived. I outline my approach below and then compare it to
ANN, biological NN, CapsNet, and clustering.

I need help in design and implementation of this
algorithm, in Python. This is an open project on GitHub: CogAlg. I can
pay per contribution, or monthly if there is some track record, see last part
of this post. But this work is theory first, experimentation last. So, I need
to be sure that you can participate in designing algorithm from the first
principles. Unless you find flaws or omissions in my reasoning, which would be
even more valuable. Please contact me if interested: leave a comment here, on
GitHub, or via G+.

*This content is published under the Creative Commons Attribution 4.0 International License.*

*Outline of my approach*
Proposed algorithm is a
clean design for deep learning: non-neuromorphic, sub-statistical, comparison-first.

It’s a hierarchical search
for patterns, by cross-comparing inputs over incremental distance and
composition. First-level inputs are single variables, such as pixels, and
higher-level inputs are multi-parameter patterns formed on lower levels.
Parameters (variables) include match and miss per variable compared on lower
levels. Hence, higher-level patterns are more complex: variables per pattern
selectively multiply on each level.

I define pattern as
contiguous span of inputs forming the same sign of difference between input and
feedback (part 2). There can be multiple orders of feedback, forming
corresponding orders of overlapping patterns.

First-order feedback is
prior input, and patterns are spans of inputs with increasing or decreasing
magnitude.

Second-order feedback is
average higher-level match, and second-order patterns are spans of inputs with
above- or below- average match to prior inputs. And so on, with relatively longer-range
feedback compared to higher order of match between input and shorter-range
feedback.

Match and miss are formed
by cross-comparing inputs, over selectively extended range of search. Basic comparison is inverse arithmetic operation between two
single-variable inputs, starting with adjacent pixels.

Specific match and miss is
determined by power of comparison: Boolean match is AND and miss is XOR, comparison by subtraction increases match to a
smaller comparand and reduces miss to a difference,

comparison by division
increases match to a multiple and reduces miss to a fraction, and so on (part
1).

To generalize, match is
lossless compression per comparison. Match between patterns is a combined match
between their variables. Search expansion is strictly incremental, to
enable fine-grain selection of comparands. Thus, there is a unique set of
operations per level, hence a singular in “cognitive algorithm“ (CogAlg below). Within a level, search is incremental in distance between inputs
and in their derivation order (part 4, level 1). Between levels, search is
incremental in compositional scope and number of derived variables per pattern.

My hierarchy is a
dynamically extended pipeline: when pattern terminates, it is outputted for
comparison on the next level, and is replaced by initialized pattern on current
level. Thus, a new level must be formed for a pattern terminated by current top
level. This continues indefinitely, as long as system recieves new inputs.

As distinct from
autoencoders (current mainstay in unsupervised learning), there is no need for
decoding: comparison is done on each level, and level’s output patterns are
also fed back to filter lower levels.

Autonomous cognition must
start with analog inputs, such as video or audio. All symbolic data,
including natural languages, is encoded by some prior cognitive process. To discover meaningful patterns, these symbols must be decoded
before being cross-compared. And the difficulty of decoding is exponential
with the level of encoding, so hierarchical learning starting with raw sensory
input is by far the easiest to implement (part i).

To discover anything
complex at “polynomial” cost, resulting patterns should also be hierarchical,
with each level of search adding a level of composition and a sub-level of
differentiation to its input patterns. Higher-level search is selective per
level of resulting pattern. Such composition and selection speeds-up search, to
form long-range spatio-temporal and then conceptual patterns. Which also send
feedback: filters and then motor action, to select lower-level inputs and
locations with above-average additive predictive value (part 3).

Hierarchical approaches are
common in unsupervised learning, and all do some sort of
pattern recognition.

But none that I know of is
strictly incremental in scope and complexity of discoverable patterns. Incremental selection is necessary for scalability, vs.
combinatorial explosion in search space. But it is more expensive
upfront and won’t pay in simple test problems. So, it’s not suitable for
immediate experimentation, which is probably why no one else seems to be
working on anything sufficiently similar to my algorithm.

*Comparison to artificial and biological neural networks*
ANN learns via some version of Hebbian “fire together, wire together” coincidence reinforcement.
Normally, “neuron’s” inputs are weighed at synapses, then summed and
thresholded into output. Final output also back- propagates to synapses
and is combined with their last input to adjust the weights. This weight
adjustment is learning. But prior summation degrades initial resolution
of inputs, precluding any comparison between them (the
inputs are recoverable, but why degrade and then spend a ton of computation
restoring them).

This is an inherently
coarse statistical method: inputs are summed within samples defined by
initially random weights. These weights are trained into meaningful values by SGD: practically a brute force and too expensive to scale without
supervision or task-specific reinforcement. Again, CogAlg is comparison-first
and resulting patterns are immediately meaningful. Basically, my feedback is
infinitely finer-grained than backpropagation.

Currently the most
successful method is CNN, which computes match as a product: input * kernel (weights),
adjusted by some activation function. But kernels start with useless random
weights and are adjusted by very coarse trial and error, delayed by hidden
layers. It’s understandable in biology, the brain is born with a fixed number
of neurons, they can’t be grown on demand and learn only by adding and
adjusting connections.

But in software, generating
new nodes that locally represent learned content should be far more efficient.

And similarity to weights
is vastly exaggerated by matrix multiplication: conceptually, match is a common
subset, not superset of comparands. This exaggeration is compensated by some
activation function, but that’s another unprincipled and inaccurate hack, which
causes vanishing or exploding weight gradients.

Multiplication is analog
(cheap) in the brain, and exaggeration adds resistance against noise, but at
the cost of inherent inaccuracy. I believe noise cancellation should be
separate from recognition, specific to local properties of noise, which should
be learned and updated along with those of the input itself.

ANN compute difference
(error) on their output layer, but not between hidden layers. This
distinction is only justified in supervised learning, where we have some
specific expectation. In unsupervised learning, all layers are equally unknown
before input. So, each layer should compute both similarity (kernel) and
difference: feedback to lower layer. In a temporal input flow, such difference should
update expectations. I think this disconnected update is what causes
initialization bias in ANN, and similar confirmation bias in humans.

Also, both input and kernel
in ANN are arrays, often 2D or higher. This is far more coarse, thus less
potentially selective and efficient, than one-to-one comparison of laterally
adjacent inputs in my algorithm. I use “level” vs. “layer” because my levels are not identical.
Complexity of inputs and operations is incremental with elevation: initial
inputs are pixels, higher-level inputs are patterns formed on lower levels. My
inference is comparison that forms separate match and difference, and the difference
is fed back to update level-wide filters.

Inspiration by the brain
kept ANN research going for decades before they became useful. But their
“neurons” are mere stick figures for real ones. Of course, most of complexity
in a neuron is due to constraints of biology. Ironically, weighted summation in ANN may also be a no-longer needed
compensation for such constraint:

neural memory requires
dedicated connections (synapses), which makes individual
input representation and comparison prohibitively expensive. But not anymore,
we now have dirt-cheap random access memory.

Other biological
constraints are very slow neurons, and the imperative of fast reaction for
survival in the wild. Both favor fast though crude summation (vs.
slower one-to-one comparison), at the cost of glacial training. Reaction speed
became less important: modern society is quite secure, while continuous
learning is far more important because of accelerating progress. Another
constraint is noise: neurons often fire at random, so their spikes are summed
to reduce noise. Which is not a good reason to degrade far more precise
electronic signals.

*Comparison to Capsule Networks and clustering*
The nearest experimentally
successful method is recently introduced “capsules”. Some similarities to CogAlg:

- capsules also output
multi-variate vectors, “encapsulating” multiple properties, similar to my
patterns

- these properties also
include coordinates and dimensions, compared to compute differences and ratios

- these distances and
proportions are also compared to find “equivariance” or affine transformations

- capsules also send direct
feedback to lower layer (dynamic routing), vs. trans-hidden-layer backprop in
ANN

But measure of similarity
in CapsNet (“agreement” in dynamic routing) is still an unprincipled dot
product, vs. conceptually consistent compression in CogAlg. Almost all
recognition algorithms, biological or engineered, select for dot product. To
repeat my objections, multiplication vastly exaggerates similarity. The brain
is extremely noisy and exaggeration adds noise resistance, but this should be a
separate noise-specific function.

Pure similarity is a common
subset: the smaller of comparands. Which is also a compression of represented
magnitude by replacing larger comparand with the difference between the two.
This is an atomic implication of information-theoretical compression-uber-alles
principle, but I don’t know of anyone else using minimum as a measure of
similarity. It’s not sufficient, working measure would be minimum minus average
minumum, then minus projected co-derived differences, etc., but minimum is not
avoidable as a starting point.

CapsNets also use CNN
layers to recognize basic features, which are then fed to capsule layers. But
in a truly general method, the same principles must apply on all stages of
processing, any differention should be learned rather than built-in. All
variables in my patterns are derived by incrementally complex comparison.

And current implementation
of capsules contain only probability and pose variables. My patterns have match instead of probability,
and include lower-level properties, with their matches and differences, as well
as pose. In my terms, Hinton’s equivariance is a match between variables
representing differences and distances. After being summed within pattern,
these derivatives are evaluated to extend intra-pattern search and for
feedback.

These are direct
consequences of having comparison as a common atomic operation for all stages
of CogAlg.

Also, my hierarchy is a
dynamic pipeline: pattern is displaced from its level by a miss to new input,
then forwarded to existing or newly formed higher level. Which means that
higher-level patterns include lower-level variables, in addition to newly
derived ones. So, both hierarchy of patterns per system and sub-hierarchy of
variables per pattern expand with experience. This is fundamentally superior to
static design of CapsNet.

Another method similar to
mine is hierarchical clustering. But conventional clustering defines match as
inverted difference between inputs. This is the opposite of ANN, which computes
match but no difference. And it’s also wrong: match is a common subset of
comparands, distinct from and complementary to the difference between them.
Both should be computed because each has independent predictive value.

*elaboration below:*
i. Cognition vs. evolution, analog vs. symbolic initial input

1. Comparison: quantifying match and miss between two variables

2. Forward search and patterns, incremental space-time
dimensionality

3. Feedback of filters, attentional input selection, imagination,
motor action

4. Initial levels of search, corresponding orders of feedback, and
resulting patterns:

level 1: comparison to past
inputs, forming difference and relative match patterns

level 2: additional
evaluation of resulting patterns for feedback, forming filter patterns

level 3: additional
evaluation of projected filter patterns, forming updated-input patterns

5. Comparison between variable types within a pattern

6. Cartesian dimensions and sensory modalities

7. Notes on clustering, ANNs, and probabilistic inference

8. Notes on working mindset, “hiring” and a prize for contributions

*i. Cognition vs. evolution, analog vs. symbolic initial input*

Some say intelligence can
be recognized but not defined. I think that’s absurd: we recognize some implicit
definition. Others define intelligence as a problem-solving ability, but the
only general problem is efficient search for solutions. Efficiency is a
function of selection among inputs, vs. brute-force all-to-all search. This
selection is by predicted value of the inputs, and prediction is interactive
projection of their patterns. Some agree that intelligence is all about pattern
discovery, but define pattern as a crude statistical coincidence.

Of course, the only
mechanism known to produce human-level intelligence is even cruder, and that
shows in haphazard construction of our brains. Algorithmically simple,
biological evolution alters heritable traits at random and selects those with
above-average reproductive fitness. But this process requires almost
inconceivable computing power because selection is very coarse: on the level of
whole genome rather than individual traits, and also because intelligence is
only one of many factors in reproductive fitness.

Random variation in
evolutionary algorithms, generative RBMs, and so on, is antithetical to intelligence. Intelligent
variation must be driven by feedback within cognitive hierarchy: higher levels
are presumably “smarter” than lower ones. That is, higher-level inputs
represent operations that formed them, and are evaluated to alter future
lower-level operations. Basic operations are comparison and summation among
inputs, defined by their range and resolution, analogous to reproduction in
genetic algorithms.

Range of comparison per
conserved-resolution input should increase if projected match (cognitive
fitness function) exceeds average match per comparison. In any non-random
environment, average match declines with the distance between comparands. Thus,
search over increasing distance requires selection of above- average
comparands. Any delay, coarseness, and inaccuracy of such selection is
multiplied at each search expansion, soon resulting in combinatorial explosion
of unproductive (low additive match) comparisons.

Hence, my model is strictly
incremental: search starts with minimal-complexity inputs and expands with
minimal increments in their range and complexity (syntax). At each level, there
is only one best increment, projected to discover the greatest additive match.
No other AGI approach follows this principle.

I guess people who aim for
human-level intelligence are impatient with small increments and simple sensory
data. Yet, this is the most theoretical problem ever, demanding the longest
delay in gratification.

*symbolic obsession and its discontents*
Current Machine L

__earning__and related theories (AIT, Bayesian inference, etc.) are largely statistical also because they were developed primarily for symbolic data. Such data, pre-compressed and pre-selected by humans, is far more valuable than sensory inputs it was ultimately derived from. But due to this selection and compression, proximate symbols are not likely to match, and partial match between them is hard to quantify. Hence, symbolic data is a misleading initial target for developing conceptually consistent algorithm.
Use of symbolic data as
initial inputs in AGI projects betrays profound misunderstanding of cognition.
Even children, predisposed to learn language, only become fluent after years of
directly observing things their parents talk about. Words are mere labels for
concepts, the most important of which are spatio-temporal patterns, generalized
from multi-modal sensory experience. Top-down reconstruction of such patterns
solely from correlations among their labels should be exponentially more
difficult than their bottom-up construction.

All our knowledge is
ultimately derived from senses, but lower levels of human perception are
unconscious. Only generalized concepts make it into our consciousness, AKA declarative memory, where we assign them symbols (words) to
facilitate communication. This brain-specific constraint creates heavy symbolic
vs. sub-symbolic bias, especially strong in artificial intelligentsia. Which is
like putting a cart in front of a horse: most words are meaningless unless
coupled with representations of sensory patterns.

To be incrementally
selective, cognitive algorithm must exploit proximity first, which is only
productive for continuous and loss-tolerant raw sensory data. Symbolic
data is already compressed: consecutive characters and words in text won’t
match. It’s also encoded with distant cross-references, that are hardly ever
explicit outside of a brain. Text looks quite random unless you know the code:
operations that generalized pixels into patterns (objects, processes,
concepts). That means an algorithm designed specifically for text will not be
consistently incremental in the range of search, which will impair its
scalability.

In Machine Learning, input
is string, frame, or video sequence of a defined length, with artificial
separation between training and inference. In my approach, learning is
continuous and interactive. Initial inputs are streamed pixels of maximal
resolution, and higher-level inputs are multi-variate patterns formed by
comparing lower-level inputs. Spatio-temporal range of inputs, and selective
search across them, is extended indefinitely. This expansion is directed by
higher-level feedback, just as it is in human learning.

Everything ever written is
related to my subject, but nothing is close enough: not other method is meant
to be fully consistent. Hence a dire scarcity of references here. My approach
is presented bottom-up (parts 1 - 6), thus can be understood without
references. But that requires a clean context, - hopefully cleaned out by
reader‘s own introspective generalization. Other (insufficiently) related
approaches are addressed above and in part 7. I also have a more advanced
work-in-progress, but will need a meaningful feedback to elaborate.

*1. Comparison: quantifying match and miss between two variables*

First of all, we must
quantify predictive value. Algorithmic information theory defines it as compressibility of
representation, which is perfectly fine. But current implementations only
define compression for whole sequences of inputs. To enable far more
incremental selection (thus more scalable search), I start by quantifying match
(= compression) between individual inputs. Partial match is a finer dimension
of analysis, vs. binary same | different inputs in probabilistic inference.
This is similar to the way probabilistic inference improved on classical logic,
by quantifying partial probability vs. binary true | false value of statements.

I define match as a
complementary of miss. Basic miss is a difference between comparands, hence
match is a smaller comparand. In other words, match is a compression of larger
comparand’s magnitude by replacing it with difference relative to smaller
comparand. Ultimate criterion is recorded magnitude, rather than bits of memory
it occupies, because the former represents physical impact that we want to
predict. The volume of memory used to record that magnitude depends on prior
compression, which is not an objective parameter.

This definition is
tautological: smaller input is a common subset of both inputs, = sum of AND between their uncompressed (unary code) representations. Some may
object that match includes the case when both inputs equal zero, but that match
also equals zero. The purpose is prediction, which is a representational equivalent
of conservation in physics. Ultimately, we’re predicting some potential impact
on observer, represented by input. Zero input means zero impact, which has no
conservable inertia, thus no intrinsic predictive value.

Given incremental
complexity of representation, initial inputs should have binary resolution.
However, these binary inputs are more efficiently compressed by digitization
within a coordinate: substitution of every two lower-order bits of an integer
with one higher-order bit. Resolution of coordinate (input summation span) is
adjusted to form large enough integers, to produce average match that exceeds
the cost of comparison.

All costs here are
opportunity costs: average compression by equivalent computational resources.

Next order of compression
is by comparing integers across coordinates, initially only between consecutive
coordinates. Basic comparison is inverse arithmetic operation of incremental
power: AND, subtraction, division, logarithm, and so on. Additive match is
achieved by comparison of a higher power than that which produced comparands:
comparison by AND will not further compress integers previously digitized by
AND.

Rather, initial comparison
between integers is by subtraction, miss is resulting difference and absolute match is a smaller input. For example, if inputs are 4 and 7, then miss
is 3, and their match or common subset is 4. Difference is smaller than XOR
(complementary of AND) because XOR includes opposite-sign bit pairs 0,1 and
1,0, which will be cancelled-out by subtraction.

Comparison by division
forms ratio, which is a compressed difference.
Compression is explicit in long division: match is accumulated over iterative
subtraction of smaller comparand from remaining difference. In other words,
this is also a comparison by subtraction, but between different orders of
derivation. New match is smaller comparand * integer part of ratio, and miss is final reminder or fractional part of
ratio.

Ratio can be further
compressed by converting it to radix or logarithm, and so on.

By reducing miss,
higher-power comparison increases complementary match (match = larger input -
miss):

total per order:
1) larger input 2) XOR
3) difference: combined current-order match & miss

additive match:
AND opposite-sign
XOR multiple of a smaller input within a
difference

remaining miss:
XOR
difference
fraction:
complementary to multiple within a ratio

But the costs of operations
and record of signs, fractions, irrational fractions, etc. may grow even
faster. Thus, the power of comparison is increased only if higher-level
“meta-comparison” between results from different powers of comparison finds a
pattern of improving (compression - costs) with increasing power.

The above covers absolute
match, which must be projected on potential search location before evaluation.
Projection is recombination of co-derived matches and misses at a distant
location:

projected match = match + ((difference * (summed match / summed input)) * distance) / 2

This match projection from
results of a single comparison is not implemented in the algorithm. That’s
because projection is a feedback, which adjusts lower-level filters and is
represented in the output. Incurred costs must be justified by the value of
feedback, which is low per comparison but accumulates over multiple
comparisons.

So, actual feedback is
*average* projected match in higher-level patterns, formed by multiple vs.
individual comparisons. This projection is by recombining overlapping match
patterns and difference patterns in proportion to their predictive value (vs.
recombining individual matches and differences above).

But ultimate selection
criterion is relative match: current-level match - past current-level match
that co-occurs with average higher-level projected match. In other words,
relative match is positive or negative predictive value, which determines input
inclusion into positive or negative pattern.

*2. Forward search and patterns, incremental space-time dimensionality*

Pattern is a contiguous
span of inputs that form above-average matches, similar to conventional cluster.

As explained above, matches
and misses (derivatives) are produced by comparing consecutive inputs. These
derivatives are summed within a pattern and then compared between patterns on
the next level of search, adding new derivatives to a higher pattern. Patterns
are defined contiguously for each level, but are necessarily followed by a span
of below-average match (gap or negative pattern), thus next-level inputs are
discontinuous.

Negative patterns represent
contrast or discontinuity between positive patterns, which is a one- or higher-
dimensional equivalent of difference between zero-dimensional pixels. As with
differences, projection of a negative pattern competes with projection of
adjacent positive pattern. But match and difference is derived from the
same input pair, while positive and negative patterns represent separate
spans of inputs.

Negative pattern is not
predictive on its own, but is valuable for allocation: computational resources
of no-longer predictive pattern should be used elsewhere. Hence,
the value of negative pattern is borrowed from predictive value of co-projected
positive pattern, as long as combined additive match remains above average.

Consecutive positive and
negative patterns project over same future input span, and these projections
partly cancel each other. So, they should be combined to form feedback, as
explained in part 3.

Initial match is evaluated
for inclusion into higher positive or negative pattern. The value is summed
until its sign changes, and if positive, evaluated again for cross-comparison
among constituent inputs over increased distance. Second evaluation is
necessary because the cost of incremental syntax, produced by cross-comparing,
is per pattern rather than per input. Pattern is terminated and
outputted to the next level when value sign changes. On the next level, it is
compared to previous patterns of the same compositional order.

Initial inputs are pixels
of video, or equivalent limit of positional resolution in other modalities.
Hierarchical search on higher levels should discover patterns representing
empirical objects and processes, and then relational logical and mathematical
shortcuts, eventually exceeding generality of our semantic concepts. Which are
also patterns in cognitive terms: all information is either learned as patterns
or filtered out as noise. For online learning, all levels should receive inputs
from lower levels and feedback from higher levels in parallel.

*space-time dimensionality and initial implementation*
Any prediction has two
components: what and where. We must have both: value of prediction = precision
of what * precision of where. That “where” is currently neglected: statistical
ML represents space-time at greatly reduced resolution, if at all. In the brain
and some neuromorphic models, “where” is represented in a separate network.
That makes transfer of locations very expensive and coarse, reducing predictive
value of combined representations. There is no such immediate degradation of
positional information in my method.

My core algorithm is 1D:
time only (part 4). Our space-time is 4D, but each of these dimensions
can be mapped on one level of search. This way, levels can select input
patterns that are strong enough to justify the cost of representing additional
dimension, as well as derivatives (matches and differences) in that dimension.

Initial 4D cycle of search
would compare contiguous inputs, similarly to connected-component analysis:

level 1 compares
consecutive 0D pixels within horizontal scan line, forming 1D patterns: line
segments.

level 2 compares contiguous
1D patterns between consecutive lines in a frame, forming 2D patterns: blobs.

level 3 compares contiguous
2D patterns between incremental-depth frames, forming 3D patterns: objects.

level 4 compares contiguous
3D patterns in temporal sequence, forming 4D patterns: processes.

Subsequent cycles would
compare 4D input patterns over increasing distance in each dimension, forming
longer-range discontinuous patterns. These cycles can be coded
as implementation shortcut, or discovered by core algorithm itself, which can
adapt to inputs of any dimensionality. “Dimension” here is a
parameter that defines external sequence and distance among inputs. This is
different from conventional clustering, which treats both external and internal
parameters as dimensions. More in part 6.

However, average match in
our space-time is presumably equal over all four dimensions. That means
patterns defined in fewer dimensions will be biased by the angle of scanning,
introducing artifacts. Hence, initial pixel comparison and inclusion into
patterns should also be defined over 4Ds, or at least over 2D blobs for still
images. This is a universe-specific extension of my core algorithm.

I am currently working on
implementation of core algorithm adapted to process images and then video, but
it should be extensible to any type and scope of data. My work-in-progress
Python code is posted on github, suggestions and collaboration are most welcome, see the last
part here on prizes.

*3. Feedback of filters, attentional input selection, imagination, motor action*

*(needs work)*

After evaluation for
inclusion into higher-level pattern, input is also evaluated as feedback to
lower levels. Feedback is update to filters that evaluate forward (Î›) and
feedback (V), as described above but on
lower level.

Basic filter is average
value of input’s projected match that co-occurs with (thus predicts) average
higher-level match within a positive (above-average) pattern. Both values are
represented in resulting patterns.

Feedback value = forward
value - value of derivatives /2, both of an input pattern. In turn, forward
value is determined by higher-level feedback, and so on. Thus, all higher
levels affect selection on lower-level inputs. This is because the span of each
pattern approximates, hence projects over, combined span of all lower levels.

Indirect feedback
propagates level-sequentially, more expensive shortcut feedback is sent to
selected levels.

Negative derivatives
project increasing match: match to subsequent inputs is greater than to previous inputs.

Such feedback will reduce
lower-level filter. If filter is zero, all inputs are cross-compared, and if
filter is negative, it is applied to cancel subsequent filters for
incrementally longer-range cross-comparison.

There is one filter for
each compared variable within input pattern, initialized at 0 and
updated by feedback.

*novelty vs. generality*
Any integrated system must
have a common selection criterion. Two obvious cognitive criteria are novelty
and generality: miss and match, But we can’t select for both, they exhaust all
possibilities. Novelty can’t be primary criterion: it would select for noise
and filter out all patterns, which are defined by match. On the other hand, to
maximize match of inputs to memory we can “stare at a wall” and lock into
predictable environments. But of course, natural curiosity actively skips
predictable locations, thus reducing the match.

This dilemma is resolved if
we maximize predictive power: projected match, rather than actual match, of inputs to records. To the extent that new
match was projected by past inputs, it doesn’t add to their projected match.
But neither does noise: novelty (difference to past inputs) that is not
projected to persist (match) in the future.

Additive projected match =
downward (V) match to subsequent inputs - upward (Î›) match to previous inputs.

This Î›V asymmetry comes from projecting difference or
value (both are signed) of input relative to feedback.

Î› match is initially
projected from match between input and its higher-level average. Such averages
may be included in feedback, along with average match (filter). Each
average is compared to same-type variable of an input, and resulting match (redundancy to a higher level) is subtracted from input
match before evaluation.

Averages represent past vs.
current higher level and should be projected over feedback delay:

average += average
difference * (delay / average span) /2. Value of input-to-average
match is reduced by higher-level match rate: rM= match / input, so
additive match = input match - input-to-average match * rM.

Adjustment by
input-to-average match is done if rM is above-average for incurred cost of
processing.

So, basic selection for
novelty is subtraction of adjusted Î› match-to-filter. V selection for generality, on the other hand, is a function of
back-projected difference of match, which is a higher-derivation feedback.
Projected increase of match adds to, and decrease of match subtracts from,
predictive value of future inputs.

Higher-order selection for
novelty should skip (or avoid processing) future input spans predicted with
high certainty. This is a selection of projected vs. actual inputs, covered in
part 4, level 3.

More generally, Î›V projection asymmetry is expressed by
differences of incremental representation order:

difference in magnitude of initial inputs: projected next input = last input +
difference/2,

difference in input match, a subset of magnitude: projected next match = last match + match
difference/2,

difference in match of match, a sub-subset of magnitude, projected correspondingly, and so on.

Ultimate criterion is top
order of match on a top level of search: the most predictive parameter in a
system.

*imagination, planning, action*
Imagination is not actually
original, it can only be formalized as interactive projection of known
patterns. Strong patterns send positive feedback to lower-level locations where
they are projected to re-occur at higher resolution, to increase represented
detail. When originally distant patterns are projected to re-occur in the same
location, their projections interact and combine. This is a generative (vs.
reductive) mechanism.

Interaction is comparison
and summation between same-type variables of co-projected patterns: direct
feedback from multiple higher-level patterns levels to the same external to
them location. Location is a span of search within a lower level, which
receives feedback with matching projected coordinates. Evaluation of actual
inputs is delayed until additive match projected in their location exceeds the
value of combined filters.

Comparison forms patterns
of filters to compress their representations and selectively project their
match, same as with original inputs. Summation of their differences cancels or
reinforces corresponding variables, forming combined filter, including
secondary patterns of repulsion or attraction. These patterns per variable are
discovered on a higher level, by comparing past inputs with their
multiple-feedback projections.

Combined filter is
pre-evaluated: projected value of positive patterns is compared to projected
cost of evaluating all inputs, both within a filtered location. Resulting
prevalue (value of evaluation) is negative when the latter exceeds the former:
projected inputs are not worth evaluating and their span / location is skipped.

Iterative skipping of input
spans is the most basic motor feedback: it increments coordinates and
differences of the filters. Cross-filter search then proceeds over multiple
“imagined” locations, before finding one with projected above-average additive
match. That’s where skipping stops and actual inputs are received.

Cognitive component of
action is planning. It is technically identical to imagination, except that
projected patterns also represent the system itself. Feedback of such
self-patterns eventually reaches the bottom of representational hierarchy:
sensors and actuators, adjusting their sensitivity and coordinates. Such
environmental interface is part of any cognitive system, although actuators are
optional.

*4. Initial levels of search, corresponding orders of feedback and resulting patterns*

This part recapitulates and
expands on my core algorithm, which operates in one dimension: time only.
Spatial and derived dimensions are covered in part 6. Even within 1D, search is
hierarchical in scope, containing any number of levels. New level is added when current top level terminates and outputs
the pattern it formed.

Higher-level patterns are
fed back to select future inputs on lower levels. Feedback is sent to all lower
levels because span of each pattern approximates combined span of inputs within
whole hierarchy below it.

So, deeper hierarchy forms
higher orders of feedback, with increasing elevation and scope relative to its
target: same-level prior input, higher-level match average,
beyond-the-next-level match value average, etc.

These orders of feedback
represent corresponding order of input compression: input,
match between inputs, match between matches, etc. Such compression is produced
by comparing inputs to feedback of all orders. Comparisons form patterns, the
type of which corresponds to relative span of compared feedback:

1: prior inputs are compared to the following ones on the same level,
forming difference patterns dPs,

2: higher-level match is used to evaluate match between inputs, forming value
patterns vPs,

3: higher-hierarchy value revaluates positive values of match, forming more selective
shortcut patterns sPs

Feedback of 2

^{nd}order consists of input filters (if) defining value patterns, and coordinate filters (Cf) defining positional resolution and relative distance to future inputs.
Feedback of 3

^{rd}order is shortcut filters for beyond-the-next level. These filters, sent to a location defined by attached coordinate filters, form higher-order value patterns for deeper internal and distant-level comparison.
Higher-order patterns are
more selective: difference is as likely to be positive as negative, while value
is far more likely to be negative, because positive patterns add costs of
re-evaluation for extended cross-comparison among their inputs. And so on, with
selection and re-evaluation for each higher order of positive patterns.
Negative patterns are still compared as a whole: their weak match is
compensated by greater span.

All orders of patterns
formed on the same level are redundant representations of the same inputs.
Patterns contain representation of match between their inputs, which are
compared by higher-order operations. Such operations increase overall match by
combining results of lower-order comparisons across pattern’s variables:

0Le: AND of bit
inputs to form digitized integers, containing multiple powers of two

1Le: SUB of
integers to form patterns, over additional external dimensions = pattern length
L

2Le: DIV of
multiples (L) to form ratio patterns,
over additional distances = negative pattern length LL

3Le: LOG of
powers (LLs), etc.
Starting from second level, comparison is selective per element of an
input.

Such power increase also
applies in comparison to higher-order feedback, with a lag of one level per
order.

Power of coordinate filters
also lags the power of input filters by one level:

1Le fb: binary
sensor resolution: minimal and maximal detectable input value and coordinate
increments

2Le fb:
integer-valued average match and relative initial coordinate (skipping
intermediate coordinates)

3Le fb:
rational-valued coefficient per variable and multiple skipped coordinate range

4Le fb: real-valued
coefficients and multiple coordinate-range skip

I am defining initial
levels to find recurring increments in operations per level, which could then
be applied to generate higher levels recursively, by incrementing syntax of
output patterns and of feedback filters per level.

*operations per generic level (out of date)*
Level 0 digitizes inputs, filtered by minimal
detectable magnitude: least significant bit (i LSB). These bits are AND- compared, then their
matches are AND- compared again, and so on, forming integer
outputs. This is identical to iterative summation and bit-filtering by
sequentially doubled i LSB.

Level 1 compares consecutive integers, forming ± difference patterns (dP s). dP s are then evaluated to cross-compare their
individual differences, and so on, selectively increasing derivation of
patterns.

Evaluation: dP M (summed match) - dP aM (dP M per average match between
differences in level 2 inputs).

Integers are limited by the
number of digits (#b), and input span: least
significant bit of coordinate (C LSB).

No 1

^{st}level feedback: fL cost is additive to dP cost, thus must be justified by the value of dP (and coincident difference in value of patterns filtered by adjusted i LSB), which is not known till dP is outputted to 2^{nd}level.
Level 2 evaluates match within dP s | bf L (dP) s, forming ± value patterns: vP s | vP (bf L) s. +vP s are evaluated for cross-comparison of their dP s, then of resulting
derivatives, then of inputted derivation levels. +vP (bf L) s are evaluated to cross-compare bf L s, then dP s, adjusted by the difference between their bit filters, and so
on.

dP variables are compared by
subtraction, then resulting matches are combined with dP M (match
within dP) to
evaluate these variables for cross-comparison by division, to normalize
for the difference in their span.

// match filter is also normalized by span ratio
before evaluation, same-power evaluation and comparison?

Feedback: input dP s | bf L (dP) are back-projected and resulting magnitude is evaluated to
increment or decrement 0

^{th}level i LSB. Such increments terminate bit-filter span ( bf L (dP)), output it to 2^{nd}level, and initiate a new i LSB span to filter future inputs. // bf L (dP) representation: bf , #dP, Î£ dP, Q (dP).
Level 3 evaluates match in input vP s or f L (vP) s, forming ± evaluation-value patterns: eP s | eP (fL) s. Positive eP s are evaluated for cross-comparison of their vP s ( dP s ( derivatives ( derivation levels ( lower search-level sources: buffered or external locations (selected
sources may directly specify strong 3

^{rd}level sub-patterns).
Feedback: input vP is back-projected, resulting match is compared to 2

^{nd}level filter, and the difference is evaluated vs. filter-update filter. If update value is positive, the difference is added to 2^{nd}level filter, and filter span is terminated. Same for adjustment of previously covered bit filters and 2^{nd}level filter-update filters?
This is similar to 2

^{nd}level operations, but input vP s are separated by skipped-input spans. These spans are a filter of coordinate (Cf, higher-order than f for 2^{nd}level inputs), produced by pre-valuation of future inputs:
projected novel match =
projected magnitude * average match per magnitude - projected-input match?

Pre-value is then evaluated
vs. 3

^{rd}level evaluation filter + lower-level processing cost, and negative prevalue-value input span (= span of back-projecting input) is skipped: its inputs are not processed on lower levels.
// no prevaluation on 2

^{nd}level: the cost is higher than potential savings of only 1^{st}level processing costs?
As distinct from input
filters, Cf is defined individually rather than per filter
span. This is because the cost of Cf update: span representation
and interruption of processing on all lower levels, is minor compared to the
value of represented contents? ±eP = ±Cf: individual skip evaluation, no flushing?

or interruption is
predetermined, as with Cb, fixed C f within C f L: a span of sampling across fixed-L gaps?

alternating signed Cf s are averaged ±vP s?

Division: between L s, also inputs within
minimal-depth continuous d-sign or m-order derivation hierarchy?

*tentative generalizations and extrapolations*
So, filter resolution is
increased per level, first for i filters and then for C filters: level 0 has input bit filter,

level 1 adds coordinate bit filter, level 2 adds input integer filter, level 3
adds coordinate integer filter.

// coordinate filters (Cb, Cf) are not input-specific,
patterns are formed by comparing their contents.

Level 4 adds input multiple filter: eP match and its derivatives, applied in parallel to corresponding
variables of input pattern. Variable-values are multiplied and evaluated to
form pattern-value, for inclusion into next-level ±pattern // if separately evaluated, input-variable value = deviation from
average: sign-reversed match?

Level 5 adds coordinate multiple filter: a sequence of skipped-input spans by
iteratively projected patterns, as described in imagination section of part 3.
Alternatively, negative coordinate filters implement cross-level shortcuts,
described in level 3 sub-part, which select for projected match-associated
novelty.

Additional variables in
positive patterns increase cost, which decreases positive vs. negative span
proportion.

Increased difference in
sign, syntax, span, etc., also reduces match between positive and negative
patterns. So, comparison, evaluation, pre-valuation... on
higher levels is primarily for same-sign patterns.

Consecutive different-sign
patterns are compared due to their proximity, forming ratios of their span and
other variables. These ratios are applied to project match across
different-sign gap or contrast pattern:

projected match +=
(projected match - intervening negative match) * (negative value / positive
value) / 2?

Î›V selection is incremented
by induction: forward and feedback of actual inputs, or by deduction: algebraic compression of input syntax, to find computational
shortcuts. Deduction is faster, but actual inputs also
carry empirical information. Relative value of additive information vs.
computational shortcuts is set by feedback.

Following sub-parts cover
three initial levels of search in more detail, though out of date:

*Level 1: comparison to past inputs, forming difference patterns and match patterns*

Inputs to the 1

^{st}level of search are single integers, representing pixels of 1D scan line across an image, or equivalents from other modalities. Consecutive inputs are compared to form differences, difference patterns, matches, relative match patterns. This comparison may be extended, forming higher and distant derivatives:
resulting variables per
input: *=2 derivatives (d,m) per comp, + conditional *=2 (xd, xi) per extended
comp:

8 derivatives
// ddd, mdd, dd_i, md_i, + 1-input-distant dxd,
mxd, + 2-input-distant d_ii, m_ii,

/
\

4 der
4 der // 2 consecutive: dd, md, + 2
derivatives between 1-input-distant inputs: d_i and m_i,

/
\ / \

d,m d,m
d,m // d, m: derivatives from default comparison between consecutive
inputs,

/ \ /
\ / \

i >> i
>> i >> i // i: single-variable inputs.

This is explained /
implemented in my draft python code: level_1_working. That first level is for generic 1D cognitive algorithm, its
adaptation for image and then video recognition algorithm will be natively 2D.

That’s what I spend most
of my time on, the rest of this intro is significantly out of date.

*bit-filtering and digitization*
1

^{st}level inputs are filtered by the value of most and least significant bits: maximal and minimal detectable magnitude of inputs. Maximum is a magnitude that co-occurs with average 1^{st}level match, projected by outputted dP s. Least significant bit value is determined by maximal value and number of bits per variable.
This bit filter is
initially adjusted by overflow in 1

^{st}level inputs, or by a set number of consecutive overflows.
It’s also adjusted by
feedback of higher-level patterns, if they project over- or under- flow of 1

^{st}level inputs that exceeds the cost of adjustment. Underflow is average number of 0 bits above top 1 bit.
Original input resolution
may be increased by projecting analog magnification, by impact or by distance.

Iterative bit-filtering is
digitization: bit is doubled per higher digit, and exceeding summed input is
transferred to next digit. A digit can be larger than binary if the cost of
such filtering requires larger carry.

Digitization is the most
basic way of compressing inputs, followed by comparison between resulting
integers.

*hypothetical: comparable magnitude filter, to form minimal-magnitude patterns*
Initial magnitude justifies
basic comparison, and summation of below-average inputs only compensates for
their lower magnitude, not for the cost of conversion. Conversion involves
higher-power comparison, which must be justified by higher order of match, to
be discovered on higher levels. Or by average neg. mag. span?

iP min mag span conversion cost and comparison
match would be on 2

^{nd}level, but it’s not justified by 1^{st}level match, unlike D span conversion cost and comparison match, so it is effectively the 1^{st}level of comparison?
possible +iP span evaluation: double evaluation + span representation cost
< additional lower-bits match?

The inputs may be
normalized by subtracting feedback of average magnitude, forming ± deviation,
then by dividing it by next+1 level feedback, forming a multiple of average
absolute deviation, and so on. Additive value of input is a combination of all
deviation orders, starting with 0

^{th}or absolute magnitude.
Initial input evaluation if
any filter: cost < gain: projected negative-value (comparison cost -
positive value):

by minimal magnitude > ± relative magnitude patterns (iP s), and + iP s are evaluated or cross-compared?

or by average magnitude
> ± deviations, then by co-average
deviation: ultimate bit filter?

Summation *may* compensate for conversion if its span is greater than average per
magnitude spectrum?!

Summation on higher levels
also increases span order, but within-order conversion is the same, and
between-order comparison is intra-pattern only. bf spans overlap vP span, -> filter conversion costs?

*Level 2:*

*additional evaluation of input patterns for feedback, forming filter patterns*

*(out of date).*

Inputs to 2

^{nd}level of search are patterns derived on 1^{st}level: dP ( L, I, D, V, Q (d)) s or t dP ( tV, tD, dP ()) s.
These inputs are evaluated
for feedback to update 0

^{th}level i LSB, terminating same-filter span.
Feedback increment of LSB is evaluated by deviation (∆) of magnitude, to avoid input overflow or
underflow:

∆ += I/ L - LSB a; |∆| > ff? while (|∆| > LSB a){ LSB ±; |∆| -= LSB a; LSB a *2};

LSB a is average input (* V/ L?) per LSB value, and ff is average deviation per
positive-value increment;

Î£ (∆) before evaluation: no V patterns? #b++ and C LSB-- are more expensive,
evaluated on 3

^{rd}level?
They are also compared to
previously inputted patterns, forming difference patterns dPs and value patterns vPs per input variable, then combined into dPP s and vPP s per input pattern.

L * sign of consecutive dP s is a known miss, and
match of dP variables is correlated by common derivation.

Hence, projected match of
other +dP and -dP variables = amk * (1 - L / dP). On the other hand, same-sign dP s are distant by L, reducing projected match by amk * L, which is equal to reduction by miss of L?

So, dP evaluation is for two
comparisons of equal value: cross-sign, then cross- L same-sign (1 dP evaluation is blocked by feedback of discovered or defined
alternating sign and co-variable match projection).

Both of last dP s will be compared to the next one, thus past match per dP (dP M) is summed for three dP s:

dP M ( Î£ ( last 3 dP s L+M)) - a dP M (average of 4Le +vP dP M) -> v, vs;; evaluation / 3 dP s -> value, sign / 1 dP.

while (vs = ovs){ ovs = vs; V+=v; vL++; vP (L, I, M, D) += dP (L, I, M, D);; default vP - wide sum, select preserv.

vs > 0? comp (3 dP s){ DIV (L, I, M, D) -> N, ( n, f, m, d); vP (N, F, M, D) += n, f, m, d;; sum: der / variable, n / input?

vr = v+ N? SUB (nf) -> nf m; vd = vr+ nf m, vds = vd - a;; ratios are too small for DIV?

while (vds = ovds){ ovds = vds; Vd+=vd; vdL++; vdP() += Q (d | ddP);; default Q (d | ddP) sum., select. preserv.

vds > 0? comp (1

^{st}x l^{st}d | ddP s of Q (d) s);; splicing Q (d) s of matching dP s, cont. only: no comp ( Î£ Q (d | ddP)?
Î£ vP ( Î£ vd P eval: primary for -P, redundant to individual dP s ( d s for +P, cost *2, same for
+P' I and -P' M,D?

no Î£ V | Vd evaluation of cont. comp per variable or division: cost + vL = comp cost? Î£ V per fb: no vL, #comp;

- L, I, M, D: same value per mag, power / compression, but I | M, D redund = mag, +vP: I - 2a, - vP: M, D - 2a?

- no
variable eval: cost (sub + vL + filter) > comp cost, but match value must be
adjusted for redundancy?

- normalization
for comparison: min (I, M, D) * rL, SUB (I, M, D)? Î£ L (pat) vs C: more general but
interrupted?

variable-length DIV: while (i > a){ while (i> m){ SUB (i, m) -> d; n++; i=d;}; m/=2; t=m; SUB (d, t); f+= d;}?

additive compression per d vs. m*d: > length cost?

tdP ( tM, tD, dP(), ddP Î£ ( dMÎ£ (Q (dM)), dDÎ£ (Q (dD)), ddLÎ£ (Q (ddL)), Q (ddP))); // last d
and D are within dP()?

Input filter is a higher-level
average, while filter update is accumulated over multiple higher-level spans
until it exceeds filter-update filter. So, filter update is 2

^{nd}order feedback relative to filter, as is filter relative to match.
But the same filter update
is 3

^{rd}order of feedback when used to evaluate input value for inclusion into pattern defined by a previous filter: update span is two orders higher than value span.
Higher-level comparison
between patterns formed by different filters is mediated, vs. immediate continuation
of current-level comparison across filter update (mediated cont.: splicing
between different-filter patterns by vertical specification of match, although
it includes lateral cross-comparison of skip-distant specifications).

However, filter update feedback
is periodic, so it doesn’t form continuous cross-filter comparison patterns xPs.

*adjustment of forward evaluation by optional feedback of projected input*
More precisely, additive
value or novel magnitude of an input is its deviation from higher-level
average. Deviation = input - expectation: (higher-level summed input -
summed difference /2) * rL (L / hL).

Inputs are compared to last
input to form difference, and to past average to form deviation or novelty.

But last input is more
predictive of the next one than a more distant average, thus the latter is
compared on higher level than the former. So, input variable is compared
sequentially and summed within resulting patterns. On the next level, the sum
is compared vertically: to next-next-level average of the same variable.

Resulting vertical match
defines novel value for higher-level sequential comparison:

novel value = past match - (vertical
match * higher-level match rate) - average novel match:

nv = L+M - (m (I, (hI * rL)) * hM / hL) - hnM * rL; more precise than initial value: v = L+M - hM * rL;

Novelty evaluation is done
if higher-level match > cost of feedback and operations,
separately for I and D P s:

I, M ( D, M feedback, vertical SUB (I, nM ( D, ndM));

Impact on ambient sensor is
separate from novelty and is predicted by representational-value patterns?

- next-input
prediction: seq match + vert match * relative rate, but
predictive selection is per level, not input.

- higher-order
expectation is relative match per variable: pMd = D * rM, M/D, or D * rMd: Md/D,

- if rM |
rMd are derived by intra-pattern comparison, when average M | Md >
average per division?

*one-input search extension within cross-compared patterns*
Match decreases with
distance, so initial comparison is between consecutive inputs. Resulting match
is evaluated, forming ±vP s. Positive P s are then evaluated for expanded internal search:
cross-comparison among 1-input-distant inputs within a pattern (on same level,
higher-level search is between new patterns).

This cycle repeats to
evaluate cross-comparison among 2-input-distant inputs, 3-input-distant inputs,
etc., when summed current-distance match exceeds the average per evaluation.

So, patterns of longer
cross-comparison range are nested within selected positive patterns of shorter
range. This is similar to 1

^{st}level ddP s being nested within dP s.
Same input is re-evaluated
for comparison at increased distance because match will decay: projected match
= last match * match rate (mr), * (higher-level
mr / current-level mr) * (higher-level distance /
next distance)?

Or = input * average match rate for that specific
distance, including projected match within negative patterns.

It is re-evaluated also
because projected match is adjusted by past match: mr *= past mr / past projected mr?

Also, multiple comparisons
per input form overlapping and redundant patterns (similar to fuzzy clusters),

and must be evaluated vs.
filter * number of prior comparisons, reducing value of projected match.

Instead of directly
comparing incrementally distant input pairs, we can calculate their difference
by adding intermediate differences. This would obviate multiple access to the
same inputs during cross-comparison.

These differences are also
subtracted (compared), forming higher derivatives and matches:

ddd, x1dd, x2d ( ddd: 3

^{rd}derivative, x1dd: d of 2-input-distant d s, x2d: d of 2-input-distant inputs)
/ \

dd, x1d dd, x1d ( dd: 2

^{nd}derivative, x1d = d+d = difference between 1-input-distant inputs)
/
\
/ \

d d d ( d: difference between consecutive inputs)

/ \ /
\ / \

i
i
i i
( i: initial inputs)

As always, match is a
smaller input, cached or restored, selected by the sign of a difference.

Comparison of both types is
between all same-type variable pairs from different inputs.

Total match includes match
of all its derivation orders, which will overlap for proximate inputs.

Incremental cost of
cross-comparison is the same for all derivation orders. If projected match is
equal to projected miss, then additive value for different orders of the same
inputs is also the same: reduction in projected magnitude of differences will
be equal to reduction in projected match between distant inputs?

*multi-input search extension, evaluation of selection per input*

*:*

*tentative*
On the next level, average
match from expansion is compared to that from shorter-distance comparison, and
resulting difference is decay of average match with distance. Again,
this decay drives re-evaluation per expansion: selection of inputs with
projected decayed match above average per comparison cost.

Projected match is also
adjusted by prior match (if local decay?) and redundancy (symmetrical
if no decay?)

Slower decay will reduce
value of selection per expansion because fewer positive inputs will turn
negative:

Value of selection = Î£ |comp cost of neg-value inputs| - selection cost (average saved cost or relative delay?)

This value is summed
between higher-level inputs, into average value of selection per increment of
distance. Increments with negative value of selection should be compared
without re-evaluation, adding to minimal number of comparisons per selection,
which is evaluated for feedback as a comparison-depth filter:

Î£ (selection value per
increment) -> average selection value;; for negative patterns of each depth, | >1 only?

depth adjustment value =
average selection value; while (|average selection value| > selection cost){

depth adjustment ±±; depth adjustment value -= selection value per
increment (depth-specific?); };

depth adjustment >
minimal per feedback? >> lower-level depth filter;; additive depth = adjustment value?

- match filter is summed
and evaluated per current comparison depth?

- selected positive relative
matches don’t reduce the benefit of pruning-out negative ones.

- skip if negative
selection value: selected positive matches < selection
cost: average value or relative delay?

Each input forms a queue of
matches and misses relative to templates within comparison depth filter. These
derivatives, both discrete and summed, overlap for inputs within each other’s
search span. But representations of discrete derivatives can be reused,
redundancy is only necessary for parallel comparison.

Assuming that environment
is not random, similarity between inputs declines with spatio-temporal
distance. To maintain proximity, a n-input search is FIFO: input is compared to
all templates up to maximal distance, then added to the queue as a new
template, while the oldest template is outputted into pattern-wide queue.

*value-proportional combination of patterns*

*:*

*tentative*
Summation of +dP and -dP is
weighted by their value: L (summed d-sign match) + M (summed i match).

Such relative probability
of +dP vs. - dP is indicated by corresponding ratios: rL =
+L/-L, and rM = +M/-M.

(Ls and Ms are compared by
division: comparison power should be higher for more predictive variables).

But weighting
complementation incurs costs, which must be justified by value of ratio. So,
division should be of variable length, continued while the ratio is above
average. This is shown below for Ls, also applies to Ms:

dL = +L - -L, mL = min (+L, -L); nL =0; fL=0; efL=1; // nL: L multiple, fL: L fraction,
efL: extended fraction.

while (dL > adL){ dL =
|dL|; // all Ls are positive; dL is evaluated for long division by adL: average
dL.

while (dL > 0){ dL -=
mL; nL++;} dL -= mL/2; dL >0? fL+= efL; efL/=2;} // ratio: rL= nL + fL.

Ms’ long-division
evaluation is weighted by rL: projected rM value = dM * nL (reduced-resolution rL) - adM.

Ms are then combined: cM =
+M + -M * rL; // rL is relative probability of -M across iterated cL.

Ms are not projected (M+= D * rcL * rM D (MD/cD) /2): precision of higher-level rM D is below that of rM?

Prior ratios are
combination rates: rL is probability of -M, and combined rL and rM (cr) is probability of -D.

If rM < arM, cr = rL,
else: cr = (+L + +M) / (-L + -M) // cr = √(rL * rM) would lose L vs. M
weighting.

cr predicts match of
weighted cD between cdPs, where negative-dP variable is multiplied by
above-average match ratio before combination: cD = +D + -D * cr. // after un-weighted comparison between Ds?

Averages: arL, arM, acr, are feedback of ratios that co-occur with above-average match of
span-normalized variables, vs. input variables. Another feedback is averages that evaluate long division: adL, adM, adD.

Both are feedback of
positive C pattern, which represents these variables, inputted & evaluated on 3

^{rd}level.
; or 4

^{th}level: value of dPs * ratio is compared to value of dPs, & the difference is multiplied by cL / hLe cL?
Comparison of opposite-sign
Ds forms negative match = smaller |D|, and positive difference dD = +D+ |-D|.

dD magnitude predicts its
match, not further combination. Single comparison is cheaper than its
evaluation.

Comparison is by division
if larger |D| co-occurs with hLe nD of above-average predictive value (division
is sign-neutral & reductive). But average nD value is below the cost of
evaluation, except if positive feedback?

So, default operations for
L, M, D of complementary dPs are comparison by long division and combination.

D combination: +D -D*cr, vs. -
cD * cr: +D vs. -D weighting is lost, meaningless if
cD=0?

Combination by division is
predictive if the ratio is matching on higher level (hLe) & acr is fed back as filter?

Resulting variables: cL,
rL, cM, rM, cr, cD, dD, form top level of cdP: complemented dP.

*Level 3: prevaluation of projected filter patterns, forming updated-input patterns*

*(out of date)*

3

^{rd}level inputs are ± V patterns, combined into complemented V patterns. Positive V patterns include derivatives of 1^{st}level match, which project match within future inputs (D patterns only represent and project derivatives of magnitude). Such projected-inputs-match is pre-valuated, negative prevalue-span inputs are summed or skipped (reloaded), and positive prevalue-span inputs are evaluated or even directly compared.
Initial upward (Î›) prevaluation by E filter selects for evaluation of V patterns, within resulting
± E patterns. Resulting prevalue is also projected
downward (V), to select future input spans for evaluation, vs. summation or
skipping. The span is of projecting V pattern, same as of lower
hierarchy. Prevaluation is then iterated over multiple projected-input spans,
as long as last |prevalue| remains above average for the cost of prevaluation.

Additional interference of
iterated negative projection is stronger than positive projection of lower
levels, and should flush them out of pipeline. This flushing need not be final,
spans of negative projected value may be stored in buffers, to delay the loss.
Buffers are implemented in slower and cheaper media (tape vs. RAM) and accessed
if associated patterns match on a higher level, thus project above-average
match among their inputs.

Iterative back-projection
is evaluated starting from 3

^{rd}level: to be projectable the input must represent derivatives of value, which are formed starting from 2^{nd}level. Compare this to 2^{nd}level evaluation:
Î› for input, V for V filter, iterated within V pattern. Similar
sub-iteration in E pattern?

Evaluation value =
projected-inputs-match - E
filter: average input match that
co-occurs with average higher-level match per evaluation (thus accounting
for evaluation costs + selected comparison costs). Compare this to V filter that selects for 2

^{nd}level comparison: average input match that co-occurs with average higher-level match per comparison (thus accounting for costs of default cross-comparison only).
E filter feedback starts from 4

^{th}level of search, because its inputs represent pre-valuated lower-level inputs.
4

^{th}level also pre-pre-valuates vs. prevaluation filter, forming pre-prevalue that determines prevaluation vs. summation of next input span. And so on: the order of evaluation increases with the level of search.
Higher levels are
increasingly selective in their inputs, because they additionally select by
higher orders derived on these levels: magnitude ) match and difference of
magnitude ) match and difference of match, etc.

Feedback of prevaluation is
± pre-filter: binary evaluation-value sign that determines evaluating vs.
skipping initial inputs within projected span, and flushing those already
pipelined within lower levels.

Negative feedback may be
iterated, forming a skip span.

Parallel lower hierarchies & skip spans may be assigned to different external sources or their
internal buffers.

Filter update feedback is
level-sequential, but pre-filter feedback is sent to all lower levels at once.

Pre-filter is defined per
input, and then sequentially translated into pre-filters of higher derivation
levels:

prior value += prior match
-> value sign: next-level pre-filter. If there are multiple pre-filters of
different evaluation orders from corresponding levels, they AND & define
infra-patterns: sign ( input ( derivatives.

*filter update evaluation*

*and*

*feedback*
Negative evaluation-value blocks input evaluation (thus
comparison) and filter updating on all lower levels. Not-evaluated input spans (gaps)
are also outputted, which will increase coordinate range per contents of both
higher-level inputs and lower-level feedback. Gaps represent negative projected-match value, which must be combined with positive value of subsequent
span to evaluate comparison across the gap on a higher level. This is similar
to evaluation of combined positive + negative relative match spans, explained
above.

Blocking locations with
expected inputs will result in preference for exploration & discovery of new patterns, vs. confirmation of the old ones. It
is the opposite of upward selection for stronger patterns, but sign reversal in
selection criteria is basic feature of any feedback, starting with average
match & derivatives.

Positive evaluation-value input spans are evaluated by
lower-level filter, & this filter is evaluated for update:

combined update = (output
update + output filter update / (same-filter span (fL) / output span)) /2.

both updates: -= last
feedback, equal-weighted because higher-level distance is compensated by range:
fL?

update value = combined
update - update filter: average update per average higher-level additive
match.

also differential costs of
feedback transfer across locations (vs. delay) + representation +
filter conversion?

If update value is negative: fL += new inputs, subdivided by their positive or
negative predictive value spans.

If update value is positive: lower-level filter += combined update, new fL (with new filter representation) is initialized on a current
level, while current-level part of old fL is outputted and evaluated
as next-level input.

In turn, the filter gets
updates from higher-level outputs, included in higher-higher-level positive
patterns by that level’s filter. Hence, each filter represents combined
span-normalized feedback from all higher levels, of exponentially growing span
and reduced update frequency.

Deeper hierarchy should
block greater proportion of inputs. At the same time, increasing number of
levels contribute to projected additive match, which may justify deeper search
within selected spans.

Higher-level outputs are
more distant from current input due to elevation delay, but their projection
range is also greater. So, outputs of all levels have the same relative distance (distance/range) to a next input, and are equal-weighted in combined update. But if input span is skipped, relative
distance of skip-initiating pattern to next input span will increase, and its
predictive value will decrease. Hence, that pattern should be flushed or at
least combined with a higher-level one:

combined V prevalue = higher-level V prevalue + ((current-level V prevalue - higher-level V prevalue) / ((current-level span / distance) / (higher-level span / distance)) /2. // the difference between
current-level and higher-level prevalues is reduced by the ratio of their
relative distances.

To speed up selection,
filter updates can be sent to all lower levels in parallel. Multiple direct
filter updates are span-normalized and compared at a target level, and the
differences are summed in combined update. This combination is equal-weighted
because all levels have the same span-per-distance to next input, where the
distance is the delay of feedback during elevation. // this happens automatically
in level-sequential feedback?

combined update = filter update + distance-normalized difference between output
& filter updates:

((output update - filter
update) / (output relative distance / higher-output
relative distance)) /2.

This combination method is
accurate for post-skipped input spans, as well as next input span.

- filter can also be
replaced by output + higher-level filter /2, but value of such feedback is not known.

- possible fixed-rate
sampling, to save on feedback evaluation if slow decay, ~ deep feedforward
search?

- selection can be by
patterns, derivation orders, sub-patterns within an order, or individual
variables?

- match across distance
also projects across distance: additive match = relative match * skipped
distance?

*cross-level shortcuts*

*:*

*higher-level sub-filters*

*and*

*symbols*
After individual input
comparison, if match of a current scale (length-of-a-length…) projects positive
relative match of input lower-scale / higher-derivation level, then the later
is also cross-compared between the inputs.

Lower scale levels of a
pattern represent old lower levels of a search hierarchy (current or buffered
inputs).

So, feedback of lower scale
levels goes down to corresponding search levels, forming shortcuts to preserve detail for higher levels. Feedback is generally negative:
expectations are redundant to inputs. But specifying feedback may be positive:
lower-level details are novel to a pattern, &
projected to match with it in the future.

Higher-span comparison
power is increased if lower-span comparison match is below average:

variable subtraction ) span division )
super-span logarithm?

Shortcuts to individual
higher-level inputs form a queue of sub-filters on a lower level, possibly
represented by a queue-wide pre-filter. So, a level has one filter per parallel
higher level, and sub-filter for each specified sub-pattern.
Sub-filters of incrementally distant inputs are redundant to all previous ones.

Corresponding input value = match - sub-filter value * rate of match to sub-filter
* redundancy?

Shortcut to a whole level
won’t speed-up search: higher-level search delay > lower-hierarchy search
delay.

Resolution and parameter
range may also increase through interaction of co-located counter-projections?

Symbols, for communication
among systems that have common high-level concepts but no direct interface, are
“co-author identification” shortcuts: their recognition and interpretation is
performed on different levels.

Higher-level patterns have
increasing number of derivation levels, that represent corresponding lower
search levels, and project across multiple higher search levels, each evaluated
separately?

Match across discontinuity may be due to additional
dimensions or internal gaps within patterns.

Search depth may also be
increased by cross-comparison between levels of scale within a
pattern: match across multiple scale levels also projects over multiple higher-
and lower- scale levels? Such comparison between variable types within a
pattern would be of a higher order:

*5. Comparison between variable types within a pattern (tentative)*

To reiterate, elevation
increases syntactic complexity of patterns: the number of different variable
types within them. Syntax is identification of these types by their position
(syntactic coordinate) within a pattern. This is analogous to recognizing parts
of speech by their position within a sentence.

Syntax “synchronizes”
same-type variables for comparison | aggregation between input patterns. Access
is hierarchical, starting from sign->value levels within each variable of difference
and relative match: sign is compared first, forming + and - segments, which are
then evaluated for comparison of their values.

Syntactic expansion is
pruned by selective comparison vs. aggregation of individual variable types
within input patterns, over each coordinate type or resolution. As with
templates, minimal aggregation span is resolution of individual inputs, &
maximal span is determined by average magnitude (thus match) of new derivatives
on a higher level. Hence, a basic comparison cycle generates queues of
interlaced individual & aggregate derivatives at each template variable,
and conditional higher derivatives on each of the former.

Sufficiently complex syntax
or predictive variables will justify comparing across “syntactic“ coordinates
within a pattern, analogous to comparison across external coordinates. In fact,
that’s what higher-power comparisons do. For example, division is an iterative
comparison between difference & match: within a pattern (external
coordinate), but across derivation (syntactic coordinate).

Also cross-variable is
comparison between orders of match in a pattern: magnitude, match,
match-of-match... This starts from comparison between match & magnitude:
match rate (mr) = match / magnitude. Match
rate can then be used to project match from magnitude: match = magnitude * output mr * filter mr.

In this manner, mr of each match order adjusts intra-order-derived sequentially
higher-order match:

match *= lower inter-order mr. Additive match is then projected from adjusted matches &
their derivatives.

This inter-order projection
continues up to the top order of match within a pattern, which is the ultimate
selection criterion because that’s what’s left matching on the top level of
search.

Inter-order vectors are Î›V symmetrical, but Î›V derivatives from lower order of match are also
projected for higher-order match, at the same rate as the match itself?

Also possible is comparison
across syntactic gaps: Î›Y comparison -> difference, filter feedback VY hierarchy. For example, comparison between dimensions of a
multi-D pattern will form possibly recurrent proportions.

Internal comparisons can
further compress a pattern, but at the cost of adding a higher-order syntax,
which means that they must be increasingly selective. This selection will
increase “discontinuity” over syntactic coordinates: operations necessary to
convert the variables before comparison. Eventually, such operators will become
large enough to merit direct comparisons among them. This will produce algebraic
equations, where the match (compression) is a reduction in the number of
operations needed to produce a result.

The first such short-cut
would be a version of Pythagorean theorem, discovered during search in 2D (part
6) to compute cosines. If we compare 2D-adjacent 1D Ls by division, over 1D
distance and derivatives (an angle), partly matching ratio between the ratio of
1D Ls and a 2nd derivative of 1D distance will be a cosine.

Cosines are necessary to
normalize all derivatives and lengths (Ls) to a value they have when orthogonal
to 1D scan lines (more in part 6).

Such normalization for a
POV angle is similar to dimensionality reduction in Machine Learning, but is much more efficient
because it is secondary to selective dimensionality expansion. It’s not really
“reduction”: dimensionality is prioritized rather than reduced. That is, the
dimension of pattern’s main axis is maximized, and dimensions sequentially
orthogonal to higher axes are correspondingly minimized. The process of
discovering these axes is so basic that it might be hard-wired in animals.

*6. Cartesian dimensions and sensory modalities*

This is a recapitulation
and expansion on incremental dimensionality introduced in part 2.

Term “dimension” here is
reserved for a parameter that defines sequence and distance among inputs,
initially Cartesian dimensions + Time. This is different from terminology of
combinatorial search, where dimension is any parameter of an input, and their
external order and distance don’t matter. My term for that is “variable“,
external dimensions become types of a variable only after being encoded within
input patterns.

For those with ANN
background, I want to stress that a level of search in my approach is 1D queue
of inputs, not a layer of nodes. The inputs to a node are combined regardless
of difference and distance between them (the distance is the difference between
laminar coordinates of source “neurons”).

These derivatives are essential
because value of any prediction = precision of what *
precision of where. Coordinates and co-derived differences are not
represented in ANNs, so they can't be used to calculate Euclidean vectors.
Without such vectors, prediction and selection of where must remain extremely crude.

Also, layers in ANN are
orthogonal to the direction of input flow, so hierarchy is at least 2D. The
direction of inputs to my queues is in the same dimension as the queue itself,
which means that my core algorithm is 1D. A hierarchy of 1D queues is the most
incremental way to expand search: we can add or extend only one coordinate at a
time. This allows algorithm to select inputs that are predictive enough to
justify the cost of representing additional coordinate and corresponding
derivatives. Again, such incremental syntax expansion is my core principle,
because it enables selective (thus scalable) search.

A common objection is that
images are “naturally” 2D, and our space-time is 4D. Of course, these empirical
facts are practically universal in our environment. But, a core cognitive
algorithm must be able to discover and forget any empirical specifics on its
own. Additional dimensions can be discovered as some general periodicity in the
input flow: distances between matching inputs are compared, match between these
distances indicates a period of lower dimension, and recurring periods form
higher-dimension coordinate.

But as a practical shortcut
to expensive dimension-discovery process, initial levels should be designed to
specialize in sequentially higher spatial dimensions: 1D scan lines, 2D frames,
3D set of confocal “eyes“, 4D temporal sequence. These levels discover
contiguous (positive match) patterns of increasing dimensionality:

1D line segments, 2D blobs,
3D objects, 4D processes. Higher 4D cycles form hierarchy of multi-dimensional
orders of scale, integrated over time or distributed sensors. These higher
cycles compare discontinuous patterns. Corresponding dimensions may not be
aligned across cycles of different scale order.

Explicit coordinates and
incremental dimensionality are unconventional. But the key for scalable search
is input selection, which must be guided by cost-benefit analysis. Benefit is
projected match of patterns, and cost is representational complexity per
pattern. Any increase in complexity must be justified by corresponding increase
in discovered and projected match of selected patterns. Initial inputs have no
known match, thus must have minimal complexity: single-variable “what”, such as
brightness of a grey-scale pixel, and single-variable “where”: pixel’s
coordinate in one Cartesian dimension.

Single coordinate means
that comparison between pixels must be contained within 1D (horizontal) scan
line, otherwise their coordinates are not comparable and can’t be used to
select locations for extended search. Selection for contiguous or proximate
search across scan lines requires second (vertical) coordinate. That increases
costs, thus must be selective according to projected match, discovered by past
comparisons within 1D scan line. So, comparison across scan lines must be done
on 2

^{nd}level of search. And so on.
Dimensions are added in the
order of decreasing rate of change. This means spatial dimensions are scanned
first: their rate of change can be sped-up by moving sensors. Comparison over
purely temporal sequence is delayed until accumulated change / variation
justifies search for additional patterns. Temporal sequence is the original
dimension, but it is mapped on spatial dimensions until spatial continuum is
exhausted. Dimensionality represented by patterns is increasing on higher
levels, but each level is 1D queue of patterns.

Also independently
discoverable are derived coordinates: any variable with cumulative match that
correlates with combined cumulative match of all other variables in a pattern.
Such correlation makes a variable useful for sequencing patterns before
cross-comparison.

It is discovered by summing
matches for same-type variables between input patterns, then cross-comparing
summed matches between all variables of a pattern. Variable with the highest
resulting match of match (mm) is a candidate coordinate. That mm is then
compared to mm of current coordinate. If the difference is greater than cost of
reordering future inputs, sequencing feedback is sent to lower levels or
sensors.

Another type of empirically
distinct variables is different sensory modalities: colors, sound and pitch,
and so on, including artificial senses. Each modality is processed separately,
up a level where match between patterns of different modalities but same scope
exceeds match between unimodal patterns across increased distance. Subsequent
search will form multi-modal patterns within common S-T frame of reference.

As with external
dimensions, difference between modalities can be pre-defined or discovered. If
the latter, inputs of different modalities are initially mixed, then segregated
by feedback. Also as with dimensions, my core algorithm only assumes
single-modal inputs, pre-defining multiple modalities would be an add-on.

*7. Notes on clustering, ANNs, and probabilistic inference*

In terms of conventional
machine learning, my approach is a form of hierarchical fuzzy clustering. Cluster is simply a different term for
pattern: a set of matching inputs. Each set is represented by a centroid: an
input with below-threshold combined “distance” to other inputs of the set. The
equivalent of centroid in my model is an input with above-average match (a
complementary of a distance) to other inputs within its search span. Such
inputs are selected to search next-level queue and to indirectly represent
other cross-compared but not selected inputs, via their discrete or aggregate
derivatives relative to the selected one.

Crucial differences here is
that conventional clustering methods initialize centroids with arbitrary random
weights, while I use matches (and so on) from past comparisons. And the weights
are usually defined in terms of one variable, while I select higher-level
inputs based on a combination of all variables per pattern, the number of which
increases with the level of search.

Current methods in
unsupervised learning were developed / accepted because they solved specific
problems with reasonable resources. But, they aren’t comparable to human
learning in scalability. I believe that requires an upfront investment in
incremental complexity of representation: a syntactic overhead that makes such
representation uncompetitive at short runs, but is necessary to predictively
prune longer-range search.

The most basic example here
is my use of explicit coordinates, and of input differences at their distances.
I haven’t seen that in other low-level approaches, yet they are absolutely necessary
to form Euclidean vectors. Explicit coordinates is why I start image processing
with 1D scan lines, - another thing that no one else does. Images seem to be
“naturally” 2D, but expanding search in 2Ds at once adds the extra cost of two
sets of new coordinates & derivatives. On the other hand, adding 1D at a
time allows to select inputs for each additional layer of syntax, reducing
overall (number of variables * number of inputs) costs of search on the next
level.

*Artificial Neural Networks*
The same coordinate-blind
mindset pervades ANNs. Their learning is probabilistic: match is determined as
an aggregate of multiple weighted inputs. Again, no derivatives (0D miss), or
coordinates (1-4D miss) per individual input pair are recorded. Without them,
there is no Euclidean vectors, thus pattern prediction must remain extremely
crude. I use aggregation extensively, but this degradation of resolution is
conditional on results of prior comparison between inputs. Neurons simply don’t
have the capacity for primary comparison.

This creates a crucial
difference in the way patterns are represented in the brain vs. my hierarchy of
queues. Brain consists of neurons, each likely representing a variable of a
given magnitude and modality. These variables are shared among multiple
patterns, or co-activated networks (“cognits” in terms of J. Fuster).

This is conceptually
perverse: relatively constant values of specific variables define a pattern,
but there’s no reason that same values should be shared among different
patterns. The brain is forced to share variables because it has fixed number of
neurons, but a fluid and far greater number of their networks.

I think this is responsible
for our crude taxonomy, such as using large, medium, small instead of specific
numbers. So, it’s not surprising that our minds are largely irrational, even
leaving aside all the sub-cortical nonsense. We don’t have to slavishly copy
these constraints. Logically, a co-occurring set of variables should be
localized within a pattern. This requires more local memory for redundant
representations, but will reduce the need for interconnect and transfers to
access global shared memory, which is far more expensive.

More broadly, neural
network-centric mindset itself is detrimental, any function must be initially
conceptualized as a sequential algorithm, parallelization is an optional
superstructure.

*Probabilistic inference: AIT, Bayesian logic, Markov models*
A good introduction to
Algorithmic information theory is Philosophical Treatise of Universal Induction by Rathmanner and Hutter. The criterion is same
as mine: compression and prediction. But, while a progress vs. frequentist
probability calculus, both AIT and Bayesian inference still assume a prior,
which doesn’t belong in a consistently inductive approach. In my approach,
priors or models are simply past inputs and their patterns. Subject-specific priors
could speed-up learning, but unsupervised pattern discovery algorithm must be
the core on which such shortcuts are added or removed from.

More importantly, as with
most contemporary approaches, Bayesian learning is statistical and
probabilistic. Probability is estimated from simple incidence of events, which
I think is way too coarse. These events hardly ever match or miss precisely, so
their similarity should be quantified. This would add a whole new dimension of
micro-grayscale: partial match, as in my approach, vs. binary incidence in
probabilistic inference. It should improve accuracy to the same extent that
probabilistic inference improved on classical logic, by adding a
macro-grayscale of partial probability vs. binary true | false values of the
former.

Resolution of my inputs is
always greater than that of their coordinates, while Bayesian inference and AIT
typically start with the reverse: strings of 1bit inputs. These inputs, binary
confirmations / disconfirmations, are extremely crude way to represent “events”
or inputs. Besides, the events are assumed to be high-level concepts: the kind
that occupy our conscious minds and are derived from senses by subconscious
cognitive processes, which must be built into general algorithm. Such choice of
initial inputs in BI and AIT demonstrates a lack of discipline in incrementing
complexity.

To attempt a general
intelligence, Solomonoff introduced “universal prior“: a class of all models.
That class is a priori infinite, which means that he hits combinatorial
explosion even *before* receiving actual inputs. It‘s a solution that only a
mathematician may find interesting. Marginally practical implementation of AIT
is Levin Search, which randomly generates models / algorithms of incremental
complexity and selects those that happen to solve a problem or compress a bit
string.

Again, I think starting
with prior models is putting a cart before a horse: cognition must start with
raw data, complex math only becomes cost-efficient on much higher levels of
selection and generalization. And this distinction between input patterns and
pattern discovery process is only valid within a level: algorithm is embedded
in resulting patterns, hence is also compared on higher levels, forming
“algorithmic patterns“.

*8. Notes on working mindset and a prize for contributions*

My terminology is as
general as the subject itself. It’s a major confounder, - people crave context,
but generalization is decontextualization. And cognitive algorithm is a
meta-generalization: the only thing in common for everything we learn. This
introduction is very compressed, because much of it is work in progress. But I
think it also reflects and cultivates ruthlessly reductionist mindset required
for such subject.

My math is very simple,
because algorithmic complexity must be incremental. Advanced math can
accelerate learning on higher levels of generalization, but it’s too expensive
for initial levels. And general learning algorithm must be able to discover
computational shortcuts (= math) on it’s own, just like we do. Complex math
definitely isn’t innate in humans on any level: cavemen didn’t do calculus.

This work may seem too
speculative, but any degree of generalization must be correspondingly lossy.
Which is contrary to deduction-heavy culture of math and computer science.
Hence, current Machine Learning is mostly experimental, with constrained tasks
and supervised techniques. A handful of people aspire to work on AGI, but they either lack or neglect functional definition of
intelligence, their theories are only vague inspiration.

I think working on this
level demands greater delay of experimental verification than is acceptable in
any established field. Except for philosophy, which has nothing else real to
study. But established philosophers have always been dysfunctional fluffers, not
surprisingly as their paying customers are college freshmen.

Our main challenge in
formalizing GI is a specie-wide ADHD. We didn’t evolve for sustained focus on
this level of generalization, that would cause extinction long before any
tangible results. Which is no longer a risk, GI is the most important problem
conceivable, and we have plenty of computing power for anything better than
brute-force algorithms. But our psychology lags a light-year behind technology:
we still hobble on mental crutches of irrelevant authority and peer support,
flawed analogies and needless experimentation.

*Prize for contributions*
I offer prizes up to a
total of $500K for debugging, optimizing, and extending this algorithm: github.

Contributions must fit into
incremental-complexity hierarchy outlined here. Unless you find a flaw in my
reasoning, which would be even more valuable. I can also pay monthly, but there
must be a track record.

Winners will have an option
to convert the awards into an interest in all commercial applications of a
final algorithm, at the rate of $10K per 1% share. This would be informal and
likely unprofitable, mine is not a commercial enterprise. Money can’t be
primary motivation here, but it saves time.

Winners so far:

2010: Todor Arnaudov, $600 for suggestion to buffer old inputs after
search. This occurred to me more than once before, but I rejected it as
redundant to potential elevation of these inputs. Now that he made me think
about it again, I realized that partial redundancy can preserve the detail at
much lower cost than elevation.

The buffer is accessed if
coordinates of contents get relatively close to projected inputs (that and
justification is mine). It didn’t feel right because brain has no substrate for
passive memory, but we do now.

2011: Todor, $400
consolation prize for understanding some ideas that were not clearly explained
here.

2016: Todor, $500 for
multiple suggestions on implementing the algorithm, as well as for the effort.

2017: Alexander Loschilov, $2800 for help in converting my level 1 pseudo
code into Python, consulting on PyCharm and SciPy, and for insistence on 2D
clustering, February-April.

Todor: $2000 for help in
optimizing level_1_2D, June-July.

Kapil Kashyap: $ 2000 for stimulation and effort, help with
Python and level_1_2D, September-October

2018:
Todor Anaudov, $1000 mostly for effort and stimulation, January-February