Intelligence is the ability to predict and plan (selfpredict),
which can only be done by discovering and projecting patterns. This perspective is well
established: pattern recognition is a core of any IQ test.
But there is no general and constructive definition of either
pattern or recognition (quantified similarity).
So, I came up with my own definitions, which directly translate
into algorithm proposed below.
For excellent popular introductions to cognitionasprediction
thesis see “On
Intelligence” by Jeff Hawkins and “How to Create a Mind“ by Ray Kurzweil. But on a technical level, they
and most current researchers implement pattern discovery via artificial neural
networks, which operate in very coarse statistical fashion.
Less coarse (more selective) are Capsule Networks, recently introduced by
Geoffrey Hinton et al. But they are largely ad hock, still workinprogress,
and depend on layers of CNN. Neither CNN nor CapsNet
is theoretically derived. I outline my approach below and then compare it to
ANN, biological NN, CapsNet, and clustering.
I need help with design and implementation of
this algorithm: open project CogAlg on GitHub, in Python. All contributions should be
consistent with the principles introduced here. Unless you find some conceptual
flaws, that would be even more valuable. I pay per contribution, or per hour if
there is a track record, see the last part.
This content is published under Creative Commons
Attribution 4.0 International License.
Outline of my approach
Proposed algorithm is a clean design for deep learning:
nonneuromorphic, substatistical, comparisonfirst. It’s a search for hierarchical
patterns, by crosscomparing inputs over selectively incremental distance and
composition. Patterns are defined by a sign of deviation of match between
inputs, where match is compression of represented magnitude by replacing inputs
with their derivatives. These definitions are unfolded below.
“Incremental” means that firstlevel
comparands must be subsymbolic integers with binary (before  after)
coordinate. Such as pixels of video, consecutive in each dimension, or equivalents
in other modalities.
Their comparison must also
be minimal in complexity: lossless transform by inverse arithmetic operations. “Lossless”
means that resulting match and miss are preserved as alternative representation
of original inputs.
Specific match (compression)
and miss (complementary of match) are determined by the power of comparison:
 comparison by subtraction
increases match to a smaller comparand and reduces miss to a difference,
 comparison by division
increases match to a multiple and reduces miss to fraction, and so on (more in part
1).
Resulting patterns represent
spans of inputs that form samesign miss. Hierarchy should generate two orders
of feedback: within and between levels. Compared to inputs, this feedback forms
lateral and vertical patterns: Lateral feedback is prior inputs, and patterns
are spans of inputs with increasing or decreasing magnitude. Vertical feedback is
average prior match, and patterns are spans of inputs with above or below
average match (deep feedback is restricted to match: higher order of
representation, to justify redundancy to lateral patterns).
Higherlevel inputs are
patterns formed by lowerlevel comparisons. They represent results or
derivatives: match and miss per compared input parameter. So, number of parameters
per pattern is selectively multiplied on each level. Match and miss between
patterns are combined matches or misses between their parameters. To maximize selection, search must be strictly incremental in
distance, derivation, and composition over both. Which means a unique set of
operations per level of search, hence a singular in “cognitive algorithm“.
Resulting hierarchy is a
dynamically extended pipeline: terminated patterns are outputted for comparison
on the next level, and new level is formed for pattern terminated by current
top level. Which continues as long as system receives novel inputs. As distinct
from autoencoders (current mainstay in unsupervised learning), there is no need
for decoding: comparison is done on each level, whose output is also fed back
to filter lower levels.
Autonomous cognition must
start with analog inputs, such as video or audio. All symbolic data,
including that in natural languages, is encoded by some prior cognitive process. To discover meaningful patterns in symbols, they must be decoded
before being crosscompared. And the difficulty of decoding is exponential with
the level of encoding, so hierarchical learning starting with raw sensory input
is by far the easiest to implement (part 0).
To discover anything
complex at “polynomial” cost, resulting patterns should also be hierarchical.
In my model, each level of search adds a level of composition and a sublevel
of differentiation to each input pattern.
Higherlevel search is
selective per level of resulting pattern. Both composition and selection
speedup search, to form longer range spatiotemporal and then conceptual
patterns. Which also send feedback: filters and then motor action, to select
lowerlevel inputs and locations with aboveaverage additive predictive value
(part 3).
Hierarchical approaches are
common in unsupervised learning, and all
do some sort of pattern recognition.
But none that I know of is
strictly incremental in scope and complexity of discoverable patterns. Which is necessary for selection, thus scalability, vs.
combinatorial explosion in search space. But selection is more
expensive upfront and won’t pay in simple test problems. So, it’s not suitable for immediate experimentation. That’s
probably why no one else seems to be working on anything sufficiently similar
to my algorithm.
Comparison to artificial
and biological neural networks
ANN learns via some version of Hebbian
“fire together, wire together” coincidence reinforcement. Normally, “neuron’s”
inputs are weighed at synapses, then summed and thresholded into output. Final output also back propagates to synapses and is combined
with their last input to adjust the weights. This weight adjustment is learning. But prior summation degrades resolution of inputs, precluding any comparison between them (the inputs are
recoverable, but why degrade and then spend a ton of computation restoring
them).
This is an inherently
coarse statistical method: inputs are summed within samples defined by
initially random weights. These weights are trained into meaningful values by backpropagation
of Stochastic Gradient Descent, but this process is too expensive to scale without supervision
or taskspecific reinforcement.
CogAlg is comparisonfirst
and resulting patterns are immediately meaningful. In other words, my initial feedback
per pixel is simply a prior or adjacent pixel, which is infinitely
finergrained than backpropagation.
Currently the most
successful method is CNN, which computes similarity as a product:
input * kernel (weights), adjusted by some activation function. Again, kernels
start with meaningless random weights, and their training is delayed (coarsified)
by hidden layers. Human brain is born with a final number of neurons. Before birth,
they have to develop and connect by processing random noise, and then learn by
adjusting their connections. But in software, generating and deleting nodes that
represent new content locally should be far more efficient.
Conceptually, similarity is
a common subset of comparands, but multiplication forms a superset. Thus, it
vastly exaggerates similarity. Which is compensated by some activation
function, but that’s another unprincipled and grossly inaccurate hack, causing
vanishing or exploding gradients. In the brain, multiplication is analog
(cheap) and exaggeration adds resistance to noise. But noise cancellation should
be separate from recognition. In fact, the very distinction between input and
noise is casespecific, it should be learned and updated continuously.
ANN compute difference
(error) on their output layer, but not between hidden layers. This
distinction is only justified in supervised learning, where we have some
specific expectation. In unsupervised learning, all layers are equally unknown
before the input. So, each layer should compute both similarity and difference:
feedback to a lower layer. In a temporal input flow, such difference should
update expectations. I think it is this delayed update that causes
initialization bias in ANN, and much of similar confirmation
bias in humans.
Also, both input and kernel
in ANN are arrays, often 2D or higher. This is far more coarse, thus less
potentially selective and efficient, than onetoone comparison of laterally
adjacent inputs in my algorithm. I use “level” vs. “layer” because my levels are not identical.
Complexity of inputs and operations is incremental with elevation: initial
inputs are pixels, higherlevel inputs are patterns formed on lower levels. My
inference is comparison that forms separate match and miss, and training is
feedback of summed miss to update lowerlevel filters.
Inspiration by the brain
kept ANN research going for decades before they became useful. But their
“neurons” are mere stick figures for real ones. Of course, most of complexity
in a neuron is due to constraints of biology. Ironically, weighted summation in ANN may also be a nolonger needed
compensation for such constraint:
neural memory requires
dedicated connections (synapses), which makes individual
input representation and comparison prohibitively expensive. But not anymore,
we now have dirtcheap randomaccess memory.
Other biological
constraints are very slow neurons, and the imperative of fast reaction for
survival in the wild. Both favor fast though crude summation (vs.
slower onetoone comparison), at the cost of glacial training. Reaction speed
became less important: modern society is quite secure, while continuous
learning is far more important because of accelerating progress. Another
constraint is noise: neurons often fire at random, so their spikes are summed
to reduce noise. Which is not a good reason to degrade far more precise
electronic signals.
Comparison to Capsule
Networks and clustering
The nearest experimentally
successful method is recently introduced “capsules”.
Some similarities to CogAlg:
 capsules also output
multivariate vectors, “encapsulating” several properties, similar to my
patterns,
 these properties also
include coordinates and dimensions, compared to compute differences and ratios,
 these distances and
proportions are also compared to find “equivariance” or affine transformations,
 capsules also send direct
feedback to lower layer (dynamic routing), vs. transhiddenlayer backprop in
ANN
But measure of similarity
in CapsNet (“agreement” in dynamic routing) is still an unprincipled dot
product, vs. additive compression in CogAlg. This is not specific to CapsNet,
most current recognition algorithms, and seemingly the brain too, select for dot product. To repeat,
multiplication vastly exaggerates similarity. Which adds noise resistance, crucial
for our horribly noisy brain, but that should be a separate inputspecific
function.
Pure similarity is a common
subset: the smaller of comparands, = compression of represented magnitude by replacing
larger comparand with its difference to a smaller comparand. This is a direct implication
of informationtheoretical compressionuberalles principle, but no one else seems
to use minimum as a measure of similarity. It’s not sufficient per se, basic working
measure would probably be a deviation of adjusted match: (minimum  difference/4)
 average (minimum  difference/4). But minimum is unavoidable as a starting
point.
Some other problems I have
with current implementation of CapsNet:
 they use CNN for initial layers,
to recognize basic features, but a truly general method would apply the same
principles on all levels of processing, any differentiation should be learned
rather than builtin.
 capsules of all layers contain
the same parameters: probability and pose variables, while I think the number of
parameters should be incremental with elevation, each level forming new
derivatives.
 the number of layers is predetermined
by design, while I think it should be incremental with experience.
My patterns have match
instead of probability, a miss that includes pose variables, and selected properties
of lowerlevel patterns. In my terms, Hinton’s equivariance is a match between misses:
differences and distances.
All these variables are derived
by incrementally complex comparison: core operation on all levels of CogAlg.
My hierarchy is also
dynamic: pattern is displaced from level by a miss to new input, then forwarded
to existing or newly formed higher level. So, higherlevel patterns include
lowerlevel variables, as well as their derivatives. The derivatives are summed
within pattern, then evaluated for extending intrapattern search and feedback.
Thus, both hierarchy of patterns per system and subhierarchy of variables per
pattern expand with experience.
Another technique similar
to mine is hierarchical clustering. But conventional
clustering defines match as inverted difference between inputs. This is the
opposite of ANN, which computes match but not coincident difference. And it’s
also wrong: match is a common subset of comparands, distinct from and
complementary to the difference between them. Both should be computed, because
each has independent predictive value.
Some readers dismiss this outline as generalities, which lack a
direct connection to my code. But I don’t see a disconnect, beyond
adaptation to 2D or 3D format of the input. Please enlighten me, I will owe you
big time. Of course, current code only covers firstlevel processing, but
higher levels will be incremental in nature. Others complain about lack of
math, but CogAlg must be selectively incremental, and complex math is not. There is no pseudocode: incremental means that
lowlevel code must generate architecture from the input.
elaboration below, some out
of date:
0. Cognition vs. evolution, analog vs. symbolic initial input
1. Comparison: quantifying match and miss between two variables
2. Forward search and patterns, implementation for image recognition
in video
3. Feedback of filters, attentional input selection, imagination,
motor action
4. Initial levels of search, corresponding orders of feedback, and
resulting patterns:
level 1: comparison to past
inputs, forming difference and relative match patterns
level 2: additional
evaluation of resulting patterns for feedback, forming filter patterns
level 3: additional
evaluation of projected filter patterns, forming updatedinput patterns
5. Comparison between variable types within a pattern
6. Cartesian dimensions and sensory modalities
7. Notes on clustering, ANNs, and probabilistic inference
8. Notes on working mindset, “hiring” and a prize for contributions
0. Cognition vs. evolution,
analog vs. symbolic initial input
Some say intelligence can
be recognized but not defined. I think that’s absurd: we recognize some
implicit definition. Others define intelligence as a problemsolving ability,
but the only general problem is efficient search for solutions. Efficiency is a
function of selection among inputs, vs. bruteforce alltoall search. This
selection is by predicted value of the inputs, and prediction is interactive
projection of their patterns. Some agree that intelligence is all about pattern
discovery, but define pattern as a crude statistical coincidence.
Of course, the only
mechanism known to produce humanlevel intelligence is even cruder, and that
shows in haphazard construction of our brains. Algorithmically
simple, biological evolution alters heritable traits at random and selects
those with aboveaverage reproductive fitness. But this process requires almost
inconceivable computing power because selection is very coarse: on the level of
whole genome rather than individual traits, and also because intelligence is
only one of many factors in reproductive fitness.
Random variation in
evolutionary algorithms, generative RBMs, and so on, is antithetical to
intelligence. Intelligent variation must be driven by feedback within cognitive
hierarchy: higher levels are presumably “smarter” than lower ones. That is,
higherlevel inputs represent operations that formed them, and are evaluated to
alter future lowerlevel operations. Basic operations are comparison and
summation among inputs, defined by their range and resolution, analogous to
reproduction in genetic algorithms.
Range of comparison per
conservedresolution input should increase if projected match (cognitive
fitness function) exceeds average match per comparison. In any nonrandom
environment, average match declines with the distance between comparands. Thus,
search over increasing distance requires selection of above average comparands.
Any delay, coarseness, and inaccuracy of such selection is multiplied at each
search expansion, soon resulting in combinatorial explosion of unproductive
(low additive match) comparisons.
Hence, my model is strictly
incremental: search starts with minimalcomplexity inputs and expands with
minimal increments in their range and complexity (syntax). At each level, there
is only one best increment, projected to discover the greatest additive match.
No other AGI approach follows this principle.
I guess people who aim for
humanlevel intelligence are impatient with small increments and simple sensory
data. Yet, this is the most theoretical problem ever, demanding the longest
delay in gratification.
symbolic obsession and its
discontents
Current Machine
Learning and related theories (AIT, Bayesian inference, etc.) are largely statistical
also because they were developed primarily for symbolic data. Such data,
precompressed and preselected by humans, is far more valuable than sensory
inputs it was ultimately derived from. But due to this selection and
compression, proximate symbols are not likely to match, and partial match
between them is hard to quantify. Hence, symbolic data is a misleading initial
target for developing conceptually consistent algorithm.
Use of symbolic data as
initial inputs in AGI projects betrays profound misunderstanding of cognition.
Even children, predisposed to learn language, only become fluent after years of
directly observing things their parents talk about. Words are mere labels for
concepts, the most important of which are spatiotemporal patterns, generalized
from multimodal sensory experience. Topdown reconstruction of such patterns
solely from correlations among their labels should be exponentially more
difficult than their bottomup construction.
All our knowledge is
ultimately derived from senses, but lower levels of human perception are
unconscious. Only generalized concepts make it into our consciousness, AKA declarative
memory, where we assign them symbols (words) to facilitate
communication. This brainspecific constraint creates heavy symbolic vs.
subsymbolic bias, especially strong in artificial intelligentsia. Which is
like putting a cart in front of a horse: most words are meaningless unless
coupled with implicit representations of sensory patterns.
To be incrementally
selective, cognitive algorithm must exploit proximity first, which is only
productive for continuous and losstolerant raw sensory data. Symbolic
data is already compressed: consecutive characters and words in text won’t
match. It’s also encoded with distant crossreferences, that are hardly ever
explicit outside of a brain. Text looks quite random unless you know the code:
operations that generalized pixels into patterns (objects, processes,
concepts). That means any algorithm designed specifically for text will not be
consistently incremental in the range of search, which will impair its
scalability.
In Machine Learning, input
is string, frame, or video sequence of a defined length, with artificial separation
between training and inference. In my approach, learning is continuous and
interactive. Initial inputs are streamed pixels of maximal resolution, and
higherlevel inputs are multivariate patterns formed by comparing lowerlevel
inputs. Spatiotemporal range of inputs, and selective search across them, is
extended indefinitely. This expansion is directed by higherlevel feedback,
just as it is in human learning.
Everything ever written is
related to my subject, but nothing is close enough: not other method is meant
to be fully consistent. Hence a dire scarcity of references here. My approach
is presented bottomup (parts 1  6), thus can be understood without
references. But that requires a clean context,  hopefully cleaned out by
reader‘s own introspective generalization. Other (insufficiently) related
approaches are addressed above and in part 7. I also have a more advanced
workinprogress, but will need a meaningful feedback to elaborate.
1. Comparison: quantifying
match and miss between two variables
First of all, we must
quantify predictive value. Algorithmic information theory defines it as
compressibility of representation. Which is perfectly fine, but compression is currently
computed only for sequences of inputs.
To enable far more
incremental selection, thus scalable search, I quantify match between
individual inputs. Partial match is a finer dimension of analysis, vs. binary
same  different instances. This is similar to the way probabilistic inference
improved on classical logic, by quantifying probability vs. binary true  false
values.
I define match as a
complementary of miss. Basic miss is a difference between comparands, hence
match is a smaller comparand. In other words, match is a compression of larger
comparand’s magnitude by replacing it with difference relative to smaller
comparand. Ultimate criterion is recorded magnitude, rather than bits of memory
it occupies, because the former represents physical impact that we want to
predict. The volume of memory used to record that magnitude depends on prior
compression, which is not an objective parameter.
This definition is
tautological: smaller input is a common subset of both inputs, = sum of AND
between their uncompressed (unary code) representations. Some may object that
match includes the case when both inputs equal zero, but then match also equals
zero. The purpose is prediction, which is a representational equivalent of
conservation in physics. Ultimately, we’re predicting some potential impact on
observer, represented by input. Zero input means zero impact, which has no
conservable inertia, thus no intrinsic predictive value.
With incremental complexity,
initial inputs have binary resolution and implicit shared coordinate (as a macroparameter,
resolution of coordinate lags that of an input). Compression of bit inputs by
AND is well known as digitization: substitution of two lower 1 bits with one
higher 1 bit. Resolution of coordinate (input summation span) is adjusted by
feedback to form integers that are large enough to produce aboveaverage match.
Nextorder compression is
comparison between consecutive integers, with binary (before  after) coordinate.
Basic comparison is inverse arithmetic operation of incremental power: AND,
subtraction, division, logarithm, and so on. Additive match is achieved by
comparison of a higher power than that which produced comparands: comparison by
AND will not further compress integers previously digitized by AND.
Rather, initial comparison
between integers is by subtraction, resulting difference is miss, and smaller
input is absolute match. For example, if inputs are 4 and 7, then miss
is 3, and their match or common subset is 4. Difference is smaller than XOR (nonzero
complementary of AND) because XOR may include oppositesign (oppositedirection)
bit pairs 0, 1 and 1, 0, which are cancelledout by subtraction.
Comparison by division
forms ratio, which is a compressed difference. This
compression is explicit in long division: match is accumulated over iterative
subtraction of smaller comparand from remaining difference. In other words,
this is also a comparison by subtraction, but between different orders of
derivation. Resulting match is smaller comparand * integer part of ratio, and miss is final reminder or
fractional part of ratio.
Ratio can be further
compressed by converting it to radix or logarithm, and so on.
By reducing miss,
higherpower comparison increases complementary match (match = larger input 
miss):
compressible: larger input  XOR
 difference: combined currentorder match &
miss
additive match:
AND

oppositesign XOR  multiple: of a smaller input within a
difference
remaining miss:
XOR

difference 
fraction: complementary to multiple
within a ratio
But the costs of operations
and of recording signs, fractions, irrational fractions, etc. may grow even
faster. To justify these costs, power of comparison should only increase for
inputs sufficiently compressed by prior order of comparison: AND for bit inputs,
SUB for integer inputs, DIV for pattern inputs, etc. Actual compression depends
on input and on resolution of its coordinate: input  derivative summation
span. We can’t control the input, thus average additive match must be kept above
system’s average by adjusting resolution of coordinate.
To filter
future inputs, this absolute match is projected: recombined with coderived
miss at a distance: projected match = match + (miss * distance) / 2. Filter
deviation is accumulated until it exceeds the cost of updating lowerlevel
filter. Which then forms relative match: current match  past match that
cooccurs with average higherlevel projected match. Relative match represents
above or below average predictive value, which determines input inclusion
into positive or negative value pattern.
2. Forward search and
patterns, implementation for image recognition in video
Pattern is a contiguous
span of inputs that form aboveaverage matches, similar to conventional cluster.
As explained above, matches
and misses (derivatives) are produced by comparing consecutive inputs. These
derivatives are summed within a pattern and then compared between patterns on
the next level of search, adding new derivatives to a higher pattern. Patterns
are defined contiguously for each level, but are necessarily followed by a span
of belowaverage match (gap or negative pattern), thus nextlevel inputs are
discontinuous.
Negative patterns represent
contrast or discontinuity between positive patterns, which is a one or higher
dimensional equivalent of difference between zerodimensional pixels. As with
differences, projection of a negative pattern competes with projection of
adjacent positive pattern. But match and difference are derived from the
same input pair, while positive and negative patterns represent separate
spans of inputs.
Negative pattern is not
predictive on its own, but is valuable for allocation: computational resources
of nolonger predictive pattern should be used elsewhere. Hence,
the value of negative pattern is borrowed from predictive value of coprojected
positive pattern, as long as combined additive match remains above average.
Consecutive positive and
negative patterns project over same future input span, and these projections
partly cancel each other. So, they should be combined to form feedback, as
explained in part 3.
Initial match is evaluated
for inclusion into higher positive or negative pattern. The value is summed
until its sign changes, and if positive, evaluated again for crosscomparison
among constituent inputs over increased distance. Second evaluation is
necessary because the cost of incremental syntax, produced by crosscomparing,
is per pattern rather than per input. Pattern is terminated and
outputted to the next level when value sign changes. On the next level, it is
compared to previous patterns of the same compositional order.
Initial inputs are pixels
of video, or equivalent limit of positional resolution in other modalities.
Hierarchical search on higher levels should discover patterns representing
empirical objects and processes, and then relational logical and mathematical
shortcuts, eventually exceeding generality of our semantic concepts. Which are
also patterns in cognitive terms: all information is either learned as patterns
or filtered out as noise. For online learning, all levels should receive inputs
from lower levels and feedback from higher levels in parallel.
spacetime dimensionality
and initial implementation
Any prediction has two
components: what and where. We must have both: value of prediction = precision
of what * precision of where. That “where” is currently neglected: statistical
ML represents spacetime at greatly reduced resolution, if at all. In the brain
and some neuromorphic models, “where” is represented in
a separate network. That makes transfer of locations very expensive and coarse,
reducing predictive value of combined representations. There is no such default
degradation of positional information in my method.
My core algorithm is 1D:
time only (part 4). Our spacetime is 4D, but each of these dimensions
can be mapped on one level of search. This way, levels can select input
patterns that are strong enough to justify the cost of representing additional
dimension, as well as derivatives (matches and differences) in that dimension.
Initial 4D cycle of search
would compare contiguous inputs, similarly to connectedcomponent analysis:
level 1 compares
consecutive 0D pixels within horizontal scan line, forming 1D patterns: line
segments.
level 2 compares contiguous
1D patterns between consecutive lines in a frame, forming 2D patterns: blobs.
level 3 compares contiguous
2D patterns between incrementaldepth frames, forming 3D patterns: objects.
level 4 compares contiguous
3D patterns in temporal sequence, forming 4D patterns: processes.
Subsequent cycles would
compare 4D input patterns over increasing distance in each dimension, forming
longerrange discontinuous patterns. These cycles can be coded
as implementation shortcut, or discovered by core algorithm itself, which can
adapt to inputs of any dimensionality. “Dimension” here is a
parameter that defines external sequence and distance among inputs. This is
different from conventional clustering, which treats both external and internal
parameters as dimensions. More in part 6.
However,
average match in our spacetime is presumably equal over all four dimensions.
That means patterns defined in fewer dimensions will be biased by the angle of
scanning, introducing artifacts. Hence, initial pixel comparison and inclusion
into patterns should also be over 4D at once, or at least over 2D at once for
images. This is a universespecific extension of my core algorithm.
I have working code for basic 1D core
algorithm: line_POC,
am currently working on its adaptation to process images: frame_blobs_draft
and frame_draft.
Initial testing could be recognition and automatic labeling of manually labeled
images. This algorithm will be organically extended to process colors, then video,
then stereo video. For video, level 2 will process consecutive frames and
derive temporal patterns, and levels 3 and higher will process discontinuous 2D
+ time patterns. It should also extend to any type and scope of data.
Suggestions and collaboration are most welcome, see the last part
on prizes.
3. Feedback of filters,
attentional input selection, imagination, motor action
(needs work)
After evaluation for
inclusion into higherlevel pattern, input is also evaluated as feedback to
lower levels. Feedback is update to filters that evaluate forward (Î›) and
feedback (V), as described above but on
lower level.
Basic filter is average
value of input’s projected match that cooccurs with (thus predicts) average
higherlevel match within a positive (aboveaverage) pattern. Both values are
represented in resulting patterns.
Feedback value = forward
value  value of derivatives /2, both of an input pattern. In turn, forward
value is determined by higherlevel feedback, and so on. Thus, all higher
levels affect selection on lowerlevel inputs. This is because the span of each
pattern approximates, hence projects over, combined span of all lower levels.
Indirect feedback
propagates levelsequentially, more expensive shortcut feedback is sent to
selected levels.
Negative derivatives
project increasing match: match to subsequent inputs is greater than to previous inputs.
Such feedback will reduce
lowerlevel filter. If filter is zero, all inputs are crosscompared, and if
filter is negative, it is applied to cancel subsequent filters for
incrementally longerrange crosscomparison.
There is one filter for
each compared variable within input pattern, initialized at 0 and
updated by feedback.
novelty vs. generality
Any integrated system must
have a common selection criterion. Two obvious cognitive criteria are novelty
and generality: miss and match, But we can’t select for both, they exhaust all
possibilities. Novelty can’t be primary criterion: it would select for noise
and filter out all patterns, which are defined by match. On the other hand, to
maximize match of inputs to memory we can “stare at a wall” and lock into
predictable environments. But of course, natural curiosity actively skips
predictable locations, thus reducing the match.
This dilemma is resolved if
we maximize predictive power: projected match, rather than actual match, of inputs to records. To the extent that new
match was projected by past inputs, it doesn’t add to their projected match.
But neither does noise: novelty (difference to past inputs) that is not
projected to persist (match) in the future.
Additive projected match =
downward (V) match to subsequent inputs  upward (Î›) match to previous inputs.
This Î›V asymmetry comes from projecting difference or
value (both are signed) of input relative to feedback.
Î› match is initially
projected from match between input and its higherlevel average. Such averages
may be included in feedback, along with average match (filter). Each
average is compared to sametype variable of an input, and resulting match (redundancy to a higher level) is subtracted from input
match before evaluation.
Averages represent past vs.
current higher level and should be projected over feedback delay:
average += average
difference * (delay / average span) /2. Value of inputtoaverage
match is reduced by higherlevel match rate: rM= match / input, so
additive match = input match  inputtoaverage match * rM.
Adjustment by
inputtoaverage match is done if rM is aboveaverage for incurred cost of
processing.
So, basic selection for
novelty is subtraction of adjusted Î› matchtofilter. V selection for generality, on the other hand, is a function of
backprojected difference of match, which is a higherderivation feedback.
Projected increase of match adds to, and decrease of match subtracts from,
predictive value of future inputs.
Higherorder selection for
novelty should skip (or avoid processing) future input spans predicted with
high certainty. This is a selection of projected vs. actual inputs, covered in
part 4, level 3.
More generally, Î›V projection asymmetry is expressed by
differences of incremental representation order:
difference in magnitude of initial inputs: projected next input = last input +
difference/2,
difference in input match, a subset of magnitude: projected next match = last match + match
difference/2,
difference in match of match, a subsubset of magnitude, projected correspondingly, and so on.
Ultimate criterion is top
order of match on a top level of search: the most predictive parameter in a
system.
imagination, planning,
action
Imagination is not actually
original, it can only be formalized as interactive projection of known
patterns. Strong patterns send positive feedback to lowerlevel locations where
they are projected to reoccur at higher resolution, to increase represented
detail. When originally distant patterns are projected to reoccur in the same
location, their projections interact and combine. This is a generative (vs.
reductive) mechanism.
Interaction is comparison
and summation between sametype variables of coprojected patterns: direct
feedback from multiple higherlevel patterns levels to the same external to
them location. Location is a span of search within a lower level, which
receives feedback with matching projected coordinates. Evaluation of actual
inputs is delayed until additive match projected in their location exceeds the
value of combined filters.
Comparison forms patterns
of filters to compress their representations and selectively project their
match, same as with original inputs. Summation of their differences cancels or
reinforces corresponding variables, forming combined filter, including
secondary patterns of repulsion or attraction. These patterns per variable are
discovered on a higher level, by comparing past inputs with their
multiplefeedback projections.
Combined filter is
preevaluated: projected value of positive patterns is compared to projected
cost of evaluating all inputs, both within a filtered location. Resulting
prevalue (value of evaluation) is negative when the latter exceeds the former:
projected inputs are not worth evaluating and their span / location is skipped.
Iterative skipping of input
spans is the most basic motor feedback: it increments coordinates and
differences of the filters. Crossfilter search then proceeds over multiple
“imagined” locations, before finding one with projected aboveaverage additive
match. That’s where skipping stops and actual inputs are received.
Cognitive component of
action is planning. It is technically identical to imagination, except that
projected patterns also represent the system itself. Feedback of such
selfpatterns eventually reaches the bottom of representational hierarchy:
sensors and actuators, adjusting their sensitivity and coordinates. Such
environmental interface is part of any cognitive system, although actuators are
optional.
4. Initial levels of
search, corresponding orders of feedback and resulting patterns
This part recapitulates and
expands on my core algorithm, which operates in one dimension: time only.
Spatial and derived dimensions are covered in part 6. Even within 1D, search is
hierarchical in scope, containing any number of levels. New level is added when current top level terminates and outputs
the pattern it formed.
Higherlevel patterns are
fed back to select future inputs on lower levels. Feedback is sent to all lower
levels because span of each pattern approximates combined span of inputs within
whole hierarchy below it.
So, deeper hierarchy forms
higher orders of feedback, with increasing elevation and scope relative to its
target: samelevel prior input, higherlevel match average,
beyondthenextlevel match value average, etc.
These orders of feedback
represent corresponding order of input compression: input,
match between inputs, match between matches, etc. Such compression is produced
by comparing inputs to feedback of all orders. Comparisons form patterns, the
type of which corresponds to relative span of compared feedback:
1: prior inputs are compared to the following ones on the same level,
forming difference patterns dPs,
2: higherlevel match is used to evaluate match between inputs, forming value
patterns vPs,
3: higherhierarchy value revaluates positive values of match, forming more selective
shortcut patterns sPs
Feedback of 2^{nd} order consists of input filters (if) defining value patterns, and coordinate filters
(Cf) defining positional
resolution and relative distance to future inputs.
Feedback of 3^{rd} order is shortcut filters for beyondthenext level. These
filters, sent to a location defined by attached coordinate filters, form
higherorder value patterns for deeper internal and distantlevel comparison.
Higherorder patterns are
more selective: difference is as likely to be positive as negative, while value
is far more likely to be negative, because positive patterns add costs of
reevaluation for extended crosscomparison among their inputs. And so on, with
selection and reevaluation for each higher order of positive patterns.
Negative patterns are still compared as a whole: their weak match is
compensated by greater span.
All orders of patterns
formed on the same level are redundant representations of the same inputs.
Patterns contain representation of match between their inputs, which are
compared by higherorder operations. Such operations increase overall match by
combining results of lowerorder comparisons across pattern’s variables:
0Le: AND of bit
inputs to form digitized integers, containing multiple powers of two
1Le: SUB of
integers to form patterns, over additional external dimensions = pattern length
L
2Le: DIV of
multiples (L) to form ratio patterns,
over additional distances = negative pattern length LL
3Le: LOG of
powers (LLs), etc.
Starting from second level, comparison is selective per element of an
input.
Such power increase also
applies in comparison to higherorder feedback, with a lag of one level per
order.
Power of coordinate filters
also lags the power of input filters by one level:
1Le fb: binary
sensor resolution: minimal and maximal detectable input value and coordinate
increments
2Le fb:
integervalued average match and relative initial coordinate (skipping
intermediate coordinates)
3Le fb:
rationalvalued coefficient per variable and multiple skipped coordinate range
4Le fb: realvalued
coefficients and multiple coordinaterange skip
I am defining initial
levels to find recurring increments in operations per level, which could then
be applied to generate higher levels recursively, by incrementing syntax of
output patterns and of feedback filters per level.
operations per generic
level (out of date)
Level 0 digitizes inputs, filtered by minimal
detectable magnitude: least significant bit (i LSB). These bits are AND compared, then their
matches are AND compared again, and so on, forming integer
outputs. This is identical to iterative summation and bitfiltering by
sequentially doubled i LSB.
Level 1 compares consecutive integers, forming ± difference patterns (dP s). dP s are then evaluated to crosscompare their
individual differences, and so on, selectively increasing derivation of
patterns.
Evaluation: dP M (summed match)  dP aM (dP M per average match between
differences in level 2 inputs).
Integers are limited by the
number of digits (#b), and input span: least
significant bit of coordinate (C LSB).
No 1^{st} level feedback: fL cost is additive to dP cost, thus must be justified by the value of dP (and coincident difference in value of patterns
filtered by adjusted i LSB), which is not known till dP is outputted to 2^{nd} level.
Level 2 evaluates match within dP s  bf L (dP) s, forming ± value patterns: vP s  vP (bf L) s. +vP s are evaluated for crosscomparison of their dP s, then of resulting
derivatives, then of inputted derivation levels. +vP (bf L) s are evaluated to crosscompare bf L s, then dP s, adjusted by the difference between their bit filters, and so
on.
dP variables are compared by
subtraction, then resulting matches are combined with dP M (match
within dP) to
evaluate these variables for crosscomparison by division, to normalize
for the difference in their span.
// match filter is also normalized by span ratio
before evaluation, samepower evaluation and comparison?
Feedback: input dP s  bf L (dP) are backprojected and resulting magnitude is evaluated to
increment or decrement 0^{th} level i LSB. Such increments terminate bitfilter span ( bf L (dP)), output it to 2^{nd} level,
and initiate a new i LSB span to filter future
inputs. // bf L (dP) representation: bf , #dP, Î£ dP, Q (dP).
Level 3 evaluates match in input vP s or f L (vP) s, forming ± evaluationvalue patterns: eP s  eP (fL) s. Positive eP s are evaluated for crosscomparison of their vP s ( dP s ( derivatives ( derivation levels ( lower searchlevel sources: buffered or external locations (selected
sources may directly specify strong 3^{rd} level
subpatterns).
Feedback: input vP is backprojected, resulting match is compared to 2^{nd} level filter, and the difference is evaluated vs. filterupdate
filter. If update value is positive, the difference is
added to 2^{nd} level filter, and filter span is terminated.
Same for adjustment of previously covered bit filters and 2^{nd} level filterupdate filters?
This is similar to 2^{nd} level operations, but input vP s are separated by skippedinput spans. These spans
are a filter of coordinate (Cf, higherorder than f for 2^{nd} level inputs), produced by prevaluation of future inputs:
projected novel match =
projected magnitude * average match per magnitude  projectedinput match?
Prevalue is then evaluated
vs. 3^{rd} level evaluation filter + lowerlevel
processing cost, and negative prevaluevalue input span (= span of backprojecting
input) is skipped: its inputs are not processed on lower levels.
// no prevaluation on 2^{nd} level: the cost is higher than potential savings of only 1^{st} level processing costs?
As distinct from input
filters, Cf is defined individually rather than per filter
span. This is because the cost of Cf update: span representation
and interruption of processing on all lower levels, is minor compared to the
value of represented contents? ±eP = ±Cf: individual skip evaluation, no flushing?
or interruption is
predetermined, as with Cb, fixed C f within C f L: a span of sampling across fixedL gaps?
alternating signed Cf s are averaged ±vP s?
Division: between L s, also inputs within
minimaldepth continuous dsign or morder derivation hierarchy?
tentative generalizations
and extrapolations
So, filter resolution is
increased per level, first for i filters and then for C filters: level 0 has input bit filter,
level 1 adds coordinate bit filter, level 2 adds input integer filter, level 3
adds coordinate integer filter.
// coordinate filters (Cb, Cf) are not inputspecific,
patterns are formed by comparing their contents.
Level 4 adds input multiple filter: eP match and its derivatives, applied in parallel to corresponding
variables of input pattern. Variablevalues are multiplied and evaluated to
form patternvalue, for inclusion into nextlevel ±pattern // if separately evaluated, inputvariable value = deviation from
average: signreversed match?
Level 5 adds coordinate multiple filter: a sequence of skippedinput spans by
iteratively projected patterns, as described in imagination section of part 3.
Alternatively, negative coordinate filters implement crosslevel shortcuts,
described in level 3 subpart, which select for projected matchassociated
novelty.
Additional variables in
positive patterns increase cost, which decreases positive vs. negative span
proportion.
Increased difference in
sign, syntax, span, etc., also reduces match between positive and negative
patterns. So, comparison, evaluation, prevaluation... on
higher levels is primarily for samesign patterns.
Consecutive differentsign
patterns are compared due to their proximity, forming ratios of their span and
other variables. These ratios are applied to project match across
differentsign gap or contrast pattern:
projected match +=
(projected match  intervening negative match) * (negative value / positive
value) / 2?
Î›V selection is incremented
by induction: forward and feedback of actual inputs, or by deduction: algebraic compression of input syntax, to find computational
shortcuts. Deduction is faster, but actual inputs also
carry empirical information. Relative value of additive information vs.
computational shortcuts is set by feedback.
Following subparts cover
three initial levels of search in more detail, though out of date:
Level 1: comparison to past
inputs, forming difference patterns and match patterns
Inputs to the 1^{st} level of search are single integers, representing pixels of 1D
scan line across an image, or equivalents from other modalities. Consecutive inputs are compared to form differences, difference
patterns, matches, relative match patterns. This comparison may be extended,
forming higher and distant derivatives:
resulting variables per
input: *=2 derivatives (d,m) per comp, + conditional *=2 (xd, xi) per extended
comp:
8 derivatives
// ddd, mdd, dd_i, md_i, + 1inputdistant dxd,
mxd, + 2inputdistant d_ii, m_ii,
/
\
4 der
4 der // 2 consecutive: dd, md, + 2
derivatives between 1inputdistant inputs: d_i and m_i,
/
\ / \
d,m d,m
d,m // d, m: derivatives from default comparison between consecutive
inputs,
/ \ /
\ / \
i >> i
>> i >> i // i: singlevariable inputs.
This is explained /
implemented in my draft python code: level_1_working. That first level is for generic 1D cognitive algorithm, its
adaptation for image and then video recognition algorithm will be natively 2D.
That’s what I spend most
of my time on, the rest of this intro is significantly out of date.
bitfiltering and
digitization
1^{st} level inputs are filtered by the value of most and least
significant bits: maximal and minimal detectable magnitude of inputs. Maximum
is a magnitude that cooccurs with average 1^{st} level
match, projected by outputted dP s. Least significant bit value is determined by maximal value and
number of bits per variable.
This bit filter is
initially adjusted by overflow in 1^{st} level
inputs, or by a set number of consecutive overflows.
It’s also adjusted by
feedback of higherlevel patterns, if they project over or under flow of 1^{st} level inputs that exceeds the cost of adjustment. Underflow is
average number of 0 bits above top 1 bit.
Original input resolution
may be increased by projecting analog magnification, by impact or by distance.
Iterative bitfiltering is
digitization: bit is doubled per higher digit, and exceeding summed input is
transferred to next digit. A digit can be larger than binary if the cost of
such filtering requires larger carry.
Digitization is the most
basic way of compressing inputs, followed by comparison between resulting
integers.
hypothetical: comparable
magnitude filter, to form minimalmagnitude patterns
Initial magnitude justifies
basic comparison, and summation of belowaverage inputs only compensates for
their lower magnitude, not for the cost of conversion. Conversion involves
higherpower comparison, which must be justified by higher order of match, to
be discovered on higher levels. Or by average neg. mag. span?
iP min mag span conversion cost and comparison
match would be on 2^{nd} level, but it’s not justified by 1^{st} level match, unlike D span conversion cost and comparison match,
so it is effectively the 1^{st} level of comparison?
possible +iP span evaluation: double evaluation + span representation cost
< additional lowerbits match?
The inputs may be
normalized by subtracting feedback of average magnitude, forming ± deviation,
then by dividing it by next+1 level feedback, forming a multiple of average
absolute deviation, and so on. Additive value of input is a combination of all
deviation orders, starting with 0^{th} or
absolute magnitude.
Initial input evaluation if
any filter: cost < gain: projected negativevalue (comparison cost 
positive value):
by minimal magnitude > ± relative magnitude patterns (iP s), and + iP s are evaluated or crosscompared?
or by average magnitude
> ± deviations, then by coaverage
deviation: ultimate bit filter?
Summation *may* compensate for conversion if its span is greater than average per
magnitude spectrum?!
Summation on higher levels
also increases span order, but withinorder conversion is the same, and
betweenorder comparison is intrapattern only. bf spans overlap vP span, > filter conversion costs?
Level 2: additional evaluation of input patterns for feedback, forming
filter patterns (out of date).
Inputs to 2^{nd} level of search are patterns derived on 1^{st} level: dP ( L, I, D, V, Q (d)) s or t dP ( tV, tD, dP ()) s.
These inputs are evaluated
for feedback to update 0^{th} level i LSB, terminating samefilter span.
Feedback increment of LSB is evaluated by deviation (∆) of magnitude, to avoid input overflow or
underflow:
∆ += I/ L  LSB a; ∆ > ff? while (∆ > LSB a){ LSB ±; ∆ = LSB a; LSB a *2};
LSB a is average input (* V/ L?) per LSB value, and ff is average deviation per
positivevalue increment;
Î£ (∆) before evaluation: no V patterns? #b++ and C LSB are more expensive,
evaluated on 3^{rd} level?
They are also compared to
previously inputted patterns, forming difference patterns dPs and value patterns vPs per input variable, then combined into dPP s and vPP s per input pattern.
L * sign of consecutive dP s is a known miss, and
match of dP variables is correlated by common derivation.
Hence, projected match of
other +dP and dP variables = amk * (1  L / dP). On the other hand, samesign dP s are distant by L, reducing projected match by amk * L, which is equal to reduction by miss of L?
So, dP evaluation is for two
comparisons of equal value: crosssign, then cross L samesign (1 dP evaluation is blocked by feedback of discovered or defined
alternating sign and covariable match projection).
Both of last dP s will be compared to the next one, thus past match per dP (dP M) is summed for three dP s:
dP M ( Î£ ( last 3 dP s L+M))  a dP M (average of 4Le +vP dP M) > v, vs;; evaluation / 3 dP s > value, sign / 1 dP.
while (vs = ovs){ ovs = vs; V+=v; vL++; vP (L, I, M, D) += dP (L, I, M, D);; default vP  wide sum, select preserv.
vs > 0? comp (3 dP s){ DIV (L, I, M, D) > N, ( n, f, m, d); vP (N, F, M, D) += n, f, m, d;; sum: der / variable, n / input?
vr = v+ N? SUB (nf) > nf m; vd = vr+ nf m, vds = vd  a;; ratios are too small for DIV?
while (vds = ovds){ ovds = vds; Vd+=vd; vdL++; vdP() += Q (d  ddP);; default Q (d  ddP) sum., select. preserv.
vds > 0? comp (1^{st} x l^{st} d  ddP s of Q (d) s);; splicing Q (d) s of matching dP s, cont. only: no comp ( Î£ Q (d  ddP)?
Î£ vP ( Î£ vd P eval: primary for P, redundant to individual dP s ( d s for +P, cost *2, same for
+P' I and P' M,D?
no Î£ V  Vd evaluation of cont. comp per variable or division: cost + vL = comp cost? Î£ V per fb: no vL, #comp;
 L, I, M, D: same value per mag, power / compression, but I  M, D redund = mag, +vP: I  2a,  vP: M, D  2a?
 no
variable eval: cost (sub + vL + filter) > comp cost, but match value must be
adjusted for redundancy?
 normalization
for comparison: min (I, M, D) * rL, SUB (I, M, D)? Î£ L (pat) vs C: more general but
interrupted?
variablelength DIV: while (i > a){ while (i> m){ SUB (i, m) > d; n++; i=d;}; m/=2; t=m; SUB (d, t); f+= d;}?
additive compression per d vs. m*d: > length cost?
tdP ( tM, tD, dP(), ddP Î£ ( dMÎ£ (Q (dM)), dDÎ£ (Q (dD)), ddLÎ£ (Q (ddL)), Q (ddP))); // last d
and D are within dP()?
Input filter is a higherlevel
average, while filter update is accumulated over multiple higherlevel spans
until it exceeds filterupdate filter. So, filter update is 2^{nd} order feedback relative to filter, as is filter relative to
match.
But the same filter update
is 3^{rd} order of feedback when used to evaluate
input value for inclusion into pattern defined by a previous filter: update
span is two orders higher than value span.
Higherlevel comparison
between patterns formed by different filters is mediated, vs. immediate continuation
of currentlevel comparison across filter update (mediated cont.: splicing
between differentfilter patterns by vertical specification of match, although
it includes lateral crosscomparison of skipdistant specifications).
However, filter update feedback
is periodic, so it doesn’t form continuous crossfilter comparison patterns xPs.
adjustment of forward
evaluation by optional feedback of projected input
More precisely, additive
value or novel magnitude of an input is its deviation from higherlevel
average. Deviation = input  expectation: (higherlevel summed input 
summed difference /2) * rL (L / hL).
Inputs are compared to last
input to form difference, and to past average to form deviation or novelty.
But last input is more
predictive of the next one than a more distant average, thus the latter is
compared on higher level than the former. So, input variable is compared
sequentially and summed within resulting patterns. On the next level, the sum
is compared vertically: to nextnextlevel average of the same variable.
Resulting vertical match
defines novel value for higherlevel sequential comparison:
novel value = past match  (vertical
match * higherlevel match rate)  average novel match:
nv = L+M  (m (I, (hI * rL)) * hM / hL)  hnM * rL; more precise than initial value: v = L+M  hM * rL;
Novelty evaluation is done
if higherlevel match > cost of feedback and operations,
separately for I and D P s:
I, M ( D, M feedback, vertical SUB (I, nM ( D, ndM));
Impact on ambient sensor is
separate from novelty and is predicted by representationalvalue patterns?
 nextinput
prediction: seq match + vert match * relative rate, but
predictive selection is per level, not input.
 higherorder
expectation is relative match per variable: pMd = D * rM, M/D, or D * rMd: Md/D,
 if rM 
rMd are derived by intrapattern comparison, when average M  Md >
average per division?
oneinput search extension
within crosscompared patterns
Match decreases with
distance, so initial comparison is between consecutive inputs. Resulting match
is evaluated, forming ±vP s. Positive P s are then evaluated for expanded internal search:
crosscomparison among 1inputdistant inputs within a pattern (on same level,
higherlevel search is between new patterns).
This cycle repeats to
evaluate crosscomparison among 2inputdistant inputs, 3inputdistant inputs,
etc., when summed currentdistance match exceeds the average per evaluation.
So, patterns of longer
crosscomparison range are nested within selected positive patterns of shorter
range. This is similar to 1^{st} level ddP s being nested within dP s.
Same input is reevaluated
for comparison at increased distance because match will decay: projected match
= last match * match rate (mr), * (higherlevel
mr / currentlevel mr) * (higherlevel distance /
next distance)?
Or = input * average match rate for that specific
distance, including projected match within negative patterns.
It is reevaluated also
because projected match is adjusted by past match: mr *= past mr / past projected mr?
Also, multiple comparisons
per input form overlapping and redundant patterns (similar to fuzzy
clusters),
and must be evaluated vs.
filter * number of prior comparisons, reducing value of projected match.
Instead of directly
comparing incrementally distant input pairs, we can calculate their difference
by adding intermediate differences. This would obviate multiple access to the
same inputs during crosscomparison.
These differences are also
subtracted (compared), forming higher derivatives and matches:
ddd, x1dd, x2d ( ddd: 3^{rd}
derivative, x1dd: d of 2inputdistant d s, x2d: d of 2inputdistant
inputs)
/ \
dd, x1d dd, x1d ( dd: 2^{nd} derivative, x1d = d+d = difference between 1inputdistant
inputs)
/
\
/ \
d d d ( d: difference between consecutive inputs)
/ \ /
\ / \
i
i
i i
( i: initial inputs)
As always, match is a
smaller input, cached or restored, selected by the sign of a difference.
Comparison of both types is
between all sametype variable pairs from different inputs.
Total match includes match
of all its derivation orders, which will overlap for proximate inputs.
Incremental cost of
crosscomparison is the same for all derivation orders. If projected match is
equal to projected miss, then additive value for different orders of the same
inputs is also the same: reduction in projected magnitude of differences will
be equal to reduction in projected match between distant inputs?
multiinput search
extension, evaluation of selection per input: tentative
On the next level, average
match from expansion is compared to that from shorterdistance comparison, and
resulting difference is decay of average match with distance. Again,
this decay drives reevaluation per expansion: selection of inputs with
projected decayed match above average per comparison cost.
Projected match is also
adjusted by prior match (if local decay?) and redundancy (symmetrical
if no decay?)
Slower decay will reduce
value of selection per expansion because fewer positive inputs will turn
negative:
Value of selection = Î£ comp cost of negvalue inputs  selection cost (average saved cost or relative delay?)
This value is summed
between higherlevel inputs, into average value of selection per increment of
distance. Increments with negative value of selection should be compared
without reevaluation, adding to minimal number of comparisons per selection,
which is evaluated for feedback as a comparisondepth filter:
Î£ (selection value per
increment) > average selection value;; for negative patterns of each depth,  >1 only?
depth adjustment value =
average selection value; while (average selection value > selection cost){
depth adjustment ±±; depth adjustment value = selection value per
increment (depthspecific?); };
depth adjustment >
minimal per feedback? >> lowerlevel depth filter;; additive depth = adjustment value?
 match filter is summed
and evaluated per current comparison depth?
 selected positive relative
matches don’t reduce the benefit of pruningout negative ones.
 skip if negative
selection value: selected positive matches < selection
cost: average value or relative delay?
Each input forms a queue of
matches and misses relative to templates within comparison depth filter. These
derivatives, both discrete and summed, overlap for inputs within each other’s
search span. But representations of discrete derivatives can be reused,
redundancy is only necessary for parallel comparison.
Assuming that environment
is not random, similarity between inputs declines with spatiotemporal
distance. To maintain proximity, a ninput search is FIFO: input is compared to
all templates up to maximal distance, then added to the queue as a new
template, while the oldest template is outputted into patternwide queue.
valueproportional
combination of patterns:
tentative
Summation of +dP and dP is
weighted by their value: L (summed dsign match) + M (summed i match).
Such relative probability
of +dP vs.  dP is indicated by corresponding ratios: rL =
+L/L, and rM = +M/M.
(Ls and Ms are compared by
division: comparison power should be higher for more predictive variables).
But weighting
complementation incurs costs, which must be justified by value of ratio. So,
division should be of variable length, continued while the ratio is above
average. This is shown below for Ls, also applies to Ms:
dL = +L  L, mL = min (+L, L); nL =0; fL=0; efL=1; // nL: L multiple, fL: L fraction,
efL: extended fraction.
while (dL > adL){ dL =
dL; // all Ls are positive; dL is evaluated for long division by adL: average
dL.
while (dL > 0){ dL =
mL; nL++;} dL = mL/2; dL >0? fL+= efL; efL/=2;} // ratio: rL= nL + fL.
Ms’ longdivision
evaluation is weighted by rL: projected rM value = dM * nL (reducedresolution rL)  adM.
Ms are then combined: cM =
+M + M * rL; // rL is relative probability of M across iterated cL.
Ms are not projected (M+= D * rcL * rM D (MD/cD) /2): precision of higherlevel rM D is below that of rM?
Prior ratios are
combination rates: rL is probability of M, and combined rL and rM (cr) is probability of D.
If rM < arM, cr = rL,
else: cr = (+L + +M) / (L + M) // cr = √(rL * rM) would lose L vs. M
weighting.
cr predicts match of
weighted cD between cdPs, where negativedP variable is multiplied by
aboveaverage match ratio before combination: cD = +D + D * cr. // after unweighted comparison between Ds?
Averages: arL, arM, acr, are feedback of ratios that cooccur with aboveaverage match of
spannormalized variables, vs. input variables. Another feedback is averages that evaluate long division: adL, adM, adD.
Both are feedback of
positive C pattern, which represents these variables, inputted & evaluated on 3^{rd} level.
; or 4^{th} level: value of dPs * ratio is compared to value of dPs, & the difference is multiplied by cL / hLe cL?
Comparison of oppositesign
Ds forms negative match = smaller D, and positive difference dD = +D+ D.
dD magnitude predicts its
match, not further combination. Single comparison is cheaper than its
evaluation.
Comparison is by division
if larger D cooccurs with hLe nD of aboveaverage predictive value (division
is signneutral & reductive). But average nD value is below the cost of
evaluation, except if positive feedback?
So, default operations for
L, M, D of complementary dPs are comparison by long division and combination.
D combination: +D D*cr, vs. 
cD * cr: +D vs. D weighting is lost, meaningless if
cD=0?
Combination by division is
predictive if the ratio is matching on higher level (hLe) & acr is fed back as filter?
Resulting variables: cL,
rL, cM, rM, cr, cD, dD, form top level of cdP: complemented dP.
Level 3: prevaluation of
projected filter patterns, forming updatedinput patterns
(out of date)
3^{rd} level inputs are ± V patterns, combined into
complemented V patterns. Positive V patterns include derivatives of 1^{st} level match, which project match within future inputs (D patterns only represent and project derivatives of magnitude).
Such projectedinputsmatch is prevaluated, negative prevaluespan inputs are summed or
skipped (reloaded), and positive prevaluespan inputs are evaluated or even
directly compared.
Initial upward (Î›) prevaluation by E filter selects for evaluation of V patterns, within resulting
± E patterns. Resulting prevalue is also projected
downward (V), to select future input spans for evaluation, vs. summation or
skipping. The span is of projecting V pattern, same as of lower
hierarchy. Prevaluation is then iterated over multiple projectedinput spans,
as long as last prevalue remains above average for the cost of prevaluation.
Additional interference of
iterated negative projection is stronger than positive projection of lower
levels, and should flush them out of pipeline. This flushing need not be final,
spans of negative projected value may be stored in buffers, to delay the loss.
Buffers are implemented in slower and cheaper media (tape vs. RAM) and accessed
if associated patterns match on a higher level, thus project aboveaverage
match among their inputs.
Iterative backprojection
is evaluated starting from 3^{rd} level:
to be projectable the input must represent derivatives of value, which are
formed starting from 2^{nd} level. Compare this to 2^{nd} level evaluation:
Î› for input, V for V filter, iterated within V pattern. Similar
subiteration in E pattern?
Evaluation value =
projectedinputsmatch  E
filter: average input match that
cooccurs with average higherlevel match per evaluation (thus accounting
for evaluation costs + selected comparison costs). Compare this to V filter that selects for 2^{nd} level
comparison: average input match that cooccurs with average higherlevel match
per comparison (thus accounting for costs of default crosscomparison only).
E filter feedback starts from 4^{th} level of search, because its inputs represent prevaluated
lowerlevel inputs.
4^{th} level also preprevaluates vs. prevaluation filter, forming preprevalue
that determines prevaluation vs. summation of next input span. And so on: the
order of evaluation increases with the level of search.
Higher levels are
increasingly selective in their inputs, because they additionally select by
higher orders derived on these levels: magnitude ) match and difference of
magnitude ) match and difference of match, etc.
Feedback of prevaluation is
± prefilter: binary evaluationvalue sign that determines evaluating vs.
skipping initial inputs within projected span, and flushing those already
pipelined within lower levels.
Negative feedback may be
iterated, forming a skip span.
Parallel lower hierarchies & skip spans may be assigned to different external sources or their
internal buffers.
Filter update feedback is
levelsequential, but prefilter feedback is sent to all lower levels at once.
Prefilter is defined per
input, and then sequentially translated into prefilters of higher derivation
levels:
prior value += prior match
> value sign: nextlevel prefilter. If there are multiple prefilters of
different evaluation orders from corresponding levels, they AND & define
infrapatterns: sign ( input ( derivatives.
filter update evaluation and feedback
Negative evaluationvalue blocks input evaluation (thus
comparison) and filter updating on all lower levels. Notevaluated input spans (gaps)
are also outputted, which will increase coordinate range per contents of both
higherlevel inputs and lowerlevel feedback. Gaps represent negative projectedmatch value, which must be combined with positive value of subsequent
span to evaluate comparison across the gap on a higher level. This is similar
to evaluation of combined positive + negative relative match spans, explained
above.
Blocking locations with
expected inputs will result in preference for exploration & discovery of new patterns, vs. confirmation of the old ones. It
is the opposite of upward selection for stronger patterns, but sign reversal in
selection criteria is basic feature of any feedback, starting with average
match & derivatives.
Positive evaluationvalue input spans are evaluated by
lowerlevel filter, & this filter is evaluated for update:
combined update = (output
update + output filter update / (samefilter span (fL) / output span)) /2.
both updates: = last
feedback, equalweighted because higherlevel distance is compensated by range:
fL?
update value = combined
update  update filter: average update per average higherlevel additive
match.
also differential costs of
feedback transfer across locations (vs. delay) + representation +
filter conversion?
If update value is negative: fL += new inputs, subdivided by their positive or
negative predictive value spans.
If update value is positive: lowerlevel filter += combined update, new fL (with new filter representation) is initialized on a current
level, while currentlevel part of old fL is outputted and evaluated
as nextlevel input.
In turn, the filter gets
updates from higherlevel outputs, included in higherhigherlevel positive
patterns by that level’s filter. Hence, each filter represents combined
spannormalized feedback from all higher levels, of exponentially growing span
and reduced update frequency.
Deeper hierarchy should
block greater proportion of inputs. At the same time, increasing number of
levels contribute to projected additive match, which may justify deeper search
within selected spans.
Higherlevel outputs are
more distant from current input due to elevation delay, but their projection
range is also greater. So, outputs of all levels have the same relative distance (distance/range) to a next input, and are equalweighted in combined update. But if input span is skipped, relative
distance of skipinitiating pattern to next input span will increase, and its
predictive value will decrease. Hence, that pattern should be flushed or at
least combined with a higherlevel one:
combined V prevalue = higherlevel V prevalue + ((currentlevel V prevalue  higherlevel V prevalue) / ((currentlevel span / distance) / (higherlevel span / distance)) /2. // the difference between
currentlevel and higherlevel prevalues is reduced by the ratio of their
relative distances.
To speed up selection,
filter updates can be sent to all lower levels in parallel. Multiple direct
filter updates are spannormalized and compared at a target level, and the
differences are summed in combined update. This combination is equalweighted
because all levels have the same spanperdistance to next input, where the
distance is the delay of feedback during elevation. // this happens automatically
in levelsequential feedback?
combined update = filter update + distancenormalized difference between output
& filter updates:
((output update  filter
update) / (output relative distance / higheroutput
relative distance)) /2.
This combination method is
accurate for postskipped input spans, as well as next input span.
 filter can also be
replaced by output + higherlevel filter /2, but value of such feedback is not known.
 possible fixedrate
sampling, to save on feedback evaluation if slow decay, ~ deep feedforward
search?
 selection can be by
patterns, derivation orders, subpatterns within an order, or individual
variables?
 match across distance
also projects across distance: additive match = relative match * skipped
distance?
crosslevel shortcuts: higherlevel subfilters and symbols
After individual input
comparison, if match of a current scale (lengthofalength…) projects positive
relative match of input lowerscale / higherderivation level, then the later
is also crosscompared between the inputs.
Lower scale levels of a
pattern represent old lower levels of a search hierarchy (current or buffered
inputs).
So, feedback of lower scale
levels goes down to corresponding search levels, forming shortcuts to preserve detail for higher levels. Feedback is generally negative:
expectations are redundant to inputs. But specifying feedback may be positive:
lowerlevel details are novel to a pattern, &
projected to match with it in the future.
Higherspan comparison
power is increased if lowerspan comparison match is below average:
variable subtraction ) span division )
superspan logarithm?
Shortcuts to individual
higherlevel inputs form a queue of subfilters on a lower level, possibly
represented by a queuewide prefilter. So, a level has one filter per parallel
higher level, and subfilter for each specified subpattern.
Subfilters of incrementally distant inputs are redundant to all previous ones.
Corresponding input value = match  subfilter value * rate of match to subfilter
* redundancy?
Shortcut to a whole level
won’t speedup search: higherlevel search delay > lowerhierarchy search
delay.
Resolution and parameter
range may also increase through interaction of colocated counterprojections?
Symbols, for communication
among systems that have common highlevel concepts but no direct interface, are
“coauthor identification” shortcuts: their recognition and interpretation is
performed on different levels.
Higherlevel patterns have
increasing number of derivation levels, that represent corresponding lower
search levels, and project across multiple higher search levels, each evaluated
separately?
Match across discontinuity may be due to additional
dimensions or internal gaps within patterns.
Search depth may also be
increased by crosscomparison between levels of scale within a
pattern: match across multiple scale levels also projects over multiple higher
and lower scale levels? Such comparison between variable types within a
pattern would be of a higher order:
5. Comparison between
variable types within a pattern (tentative)
To reiterate, elevation
increases syntactic complexity of patterns: the number of different variable
types within them. Syntax is identification of these types by their position
(syntactic coordinate) within a pattern. This is analogous to recognizing parts
of speech by their position within a sentence.
Syntax “synchronizes”
sametype variables for comparison  aggregation between input patterns. Access
is hierarchical, starting from sign>value levels within each variable of difference
and relative match: sign is compared first, forming + and  segments, which are
then evaluated for comparison of their values.
Syntactic expansion is
pruned by selective comparison vs. aggregation of individual variable types
within input patterns, over each coordinate type or resolution. As with
templates, minimal aggregation span is resolution of individual inputs, &
maximal span is determined by average magnitude (thus match) of new derivatives
on a higher level. Hence, a basic comparison cycle generates queues of
interlaced individual & aggregate derivatives at each template variable,
and conditional higher derivatives on each of the former.
Sufficiently complex syntax
or predictive variables will justify comparing across “syntactic“ coordinates
within a pattern, analogous to comparison across external coordinates. In fact,
that’s what higherpower comparisons do. For example, division is an iterative
comparison between difference & match: within a pattern (external
coordinate), but across derivation (syntactic coordinate).
Also crossvariable is
comparison between orders of match in a pattern: magnitude, match,
matchofmatch... This starts from comparison between match & magnitude:
match rate (mr) = match / magnitude. Match
rate can then be used to project match from magnitude: match = magnitude * output mr * filter mr.
In this manner, mr of each match order adjusts intraorderderived sequentially
higherorder match:
match *= lower interorder mr. Additive match is then projected from adjusted matches &
their derivatives.
This interorder projection
continues up to the top order of match within a pattern, which is the ultimate
selection criterion because that’s what’s left matching on the top level of
search.
Interorder vectors are Î›V symmetrical, but Î›V derivatives from lower order of match are also
projected for higherorder match, at the same rate as the match itself?
Also possible is comparison
across syntactic gaps: Î›Y comparison > difference, filter feedback VY hierarchy. For example, comparison between dimensions of a
multiD pattern will form possibly recurrent proportions.
Internal comparisons can
further compress a pattern, but at the cost of adding a higherorder syntax,
which means that they must be increasingly selective. This selection will
increase “discontinuity” over syntactic coordinates: operations necessary to
convert the variables before comparison. Eventually, such operators will become
large enough to merit direct comparisons among them. This will produce algebraic
equations, where the match (compression) is a reduction in the number of
operations needed to produce a result.
The first such shortcut
would be a version of Pythagorean theorem, discovered during search in 2D (part
6) to compute cosines. If we compare 2Dadjacent 1D Ls by division, over 1D
distance and derivatives (an angle), partly matching ratio between the ratio of
1D Ls and a 2nd derivative of 1D distance will be a cosine.
Cosines are necessary to
normalize all derivatives and lengths (Ls) to a value they have when orthogonal
to 1D scan lines (more in part 6).
Such normalization for a
POV angle is similar to dimensionality reduction in Machine Learning,
but is much more efficient because it is secondary to selective dimensionality
expansion. It’s not really “reduction”: dimensionality is prioritized rather
than reduced. That is, the dimension of pattern’s main axis is maximized, and
dimensions sequentially orthogonal to higher axes are correspondingly
minimized. The process of discovering these axes is so basic that it might be
hardwired in animals.
6. Cartesian dimensions and
sensory modalities
This is a recapitulation
and expansion on incremental dimensionality introduced in part 2.
Term “dimension” here is
reserved for a parameter that defines sequence and distance among inputs,
initially Cartesian dimensions + Time. This is different from terminology of
combinatorial search, where dimension is any parameter of an input, and their
external order and distance don’t matter. My term for that is “variable“,
external dimensions become types of a variable only after being encoded within
input patterns.
For those with ANN
background, I want to stress that a level of search in my approach is 1D queue
of inputs, not a layer of nodes. The inputs to a node are combined regardless
of difference and distance between them (the distance is the difference between
laminar coordinates of source “neurons”).
These derivatives are essential
because value of any prediction = precision of what *
precision of where. Coordinates and coderived differences are not
represented in ANNs, so they can't be used to calculate Euclidean vectors.
Without such vectors, prediction and selection of where must remain extremely crude.
Also, layers in ANN are
orthogonal to the direction of input flow, so hierarchy is at least 2D. The
direction of inputs to my queues is in the same dimension as the queue itself,
which means that my core algorithm is 1D. A hierarchy of 1D queues is the most
incremental way to expand search: we can add or extend only one coordinate at a
time. This allows algorithm to select inputs that are predictive enough to
justify the cost of representing additional coordinate and corresponding
derivatives. Again, such incremental syntax expansion is my core principle,
because it enables selective (thus scalable) search.
A common objection is that
images are “naturally” 2D, and our spacetime is 4D. Of course, these empirical
facts are practically universal in our environment. But, a core cognitive
algorithm must be able to discover and forget any empirical specifics on its
own. Additional dimensions can be discovered as some general periodicity in the
input flow: distances between matching inputs are compared, match between these
distances indicates a period of lower dimension, and recurring periods form
higherdimension coordinate.
But as a practical shortcut
to expensive dimensiondiscovery process, initial levels should be designed to
specialize in sequentially higher spatial dimensions: 1D scan lines, 2D frames,
3D set of confocal “eyes“, 4D temporal sequence. These levels discover
contiguous (positive match) patterns of increasing dimensionality:
1D line segments, 2D blobs,
3D objects, 4D processes. Higher 4D cycles form hierarchy of multidimensional
orders of scale, integrated over time or distributed sensors. These higher
cycles compare discontinuous patterns. Corresponding dimensions may not be
aligned across cycles of different scale order.
Explicit coordinates and
incremental dimensionality are unconventional. But the key for scalable search
is input selection, which must be guided by costbenefit analysis. Benefit is
projected match of patterns, and cost is representational complexity per
pattern. Any increase in complexity must be justified by corresponding increase
in discovered and projected match of selected patterns. Initial inputs have no
known match, thus must have minimal complexity: singlevariable “what”, such as
brightness of a greyscale pixel, and singlevariable “where”: pixel’s
coordinate in one Cartesian dimension.
Single coordinate means
that comparison between pixels must be contained within 1D (horizontal) scan
line, otherwise their coordinates are not comparable and can’t be used to
select locations for extended search. Selection for contiguous or proximate
search across scan lines requires second (vertical) coordinate. That increases
costs, thus must be selective according to projected match, discovered by past
comparisons within 1D scan line. So, comparison across scan lines must be done
on 2^{nd} level of search. And so on.
Dimensions are added in the
order of decreasing rate of change. This means spatial dimensions are scanned
first: their rate of change can be spedup by moving sensors. Comparison over
purely temporal sequence is delayed until accumulated change / variation
justifies search for additional patterns. Temporal sequence is the original
dimension, but it is mapped on spatial dimensions until spatial continuum is
exhausted. Dimensionality represented by patterns is increasing on higher
levels, but each level is 1D queue of patterns.
Also independently
discoverable are derived coordinates: any variable with cumulative match that
correlates with combined cumulative match of all other variables in a pattern.
Such correlation makes a variable useful for sequencing patterns before
crosscomparison.
It is discovered by summing
matches for sametype variables between input patterns, then crosscomparing
summed matches between all variables of a pattern. Variable with the highest
resulting match of match (mm) is a candidate coordinate. That mm is then
compared to mm of current coordinate. If the difference is greater than cost of
reordering future inputs, sequencing feedback is sent to lower levels or
sensors.
Another type of empirically
distinct variables is different sensory modalities: colors, sound and pitch,
and so on, including artificial senses. Each modality is processed separately,
up a level where match between patterns of different modalities but same scope
exceeds match between unimodal patterns across increased distance. Subsequent
search will form multimodal patterns within common ST frame of reference.
As with external
dimensions, difference between modalities can be predefined or discovered. If
the latter, inputs of different modalities are initially mixed, then segregated
by feedback. Also as with dimensions, my core algorithm only assumes
singlemodal inputs, predefining multiple modalities would be an addon.
7. Notes on clustering,
ANNs, and probabilistic inference
In terms of conventional
machine learning, my approach is a form of hierarchical fuzzy
clustering. Cluster is simply a different term for pattern: a set of
matching inputs. Each set is represented by a centroid: an input with
belowthreshold combined “distance” to other inputs of the set. The equivalent
of centroid in my model is an input with aboveaverage match (a complementary
of a distance) to other inputs within its search span. Such inputs are selected
to search nextlevel queue and to indirectly represent other crosscompared but
not selected inputs, via their discrete or aggregate derivatives relative to
the selected one.
Crucial differences here is
that conventional clustering methods initialize centroids with arbitrary random
weights, while I use matches (and so on) from past comparisons. And the weights
are usually defined in terms of one variable, while I select higherlevel
inputs based on a combination of all variables per pattern, the number of which
increases with the level of search.
Current methods in
unsupervised learning were developed / accepted because they solved specific
problems with reasonable resources. But, they aren’t comparable to human
learning in scalability. I believe that requires an upfront investment in
incremental complexity of representation: a syntactic overhead that makes such
representation uncompetitive at short runs, but is necessary to predictively
prune longerrange search.
The most basic example here
is my use of explicit coordinates, and of input differences at their distances.
I haven’t seen that in other lowlevel approaches, yet they are absolutely necessary
to form Euclidean vectors. Explicit coordinates is why I start image processing
with 1D scan lines,  another thing that no one else does. Images seem to be
“naturally” 2D, but expanding search in 2Ds at once adds the extra cost of two
sets of new coordinates & derivatives. On the other hand, adding 1D at a
time allows to select inputs for each additional layer of syntax, reducing
overall (number of variables * number of inputs) costs of search on the next
level.
Artificial Neural Networks
The same coordinateblind
mindset pervades ANNs. Their learning is probabilistic: match is determined as
an aggregate of multiple weighted inputs. Again, no derivatives (0D miss), or
coordinates (14D miss) per individual input pair are recorded. Without them,
there is no Euclidean vectors, thus pattern prediction must remain extremely
crude. I use aggregation extensively, but this degradation of resolution is
conditional on results of prior comparison between inputs. Neurons simply don’t
have the capacity for primary comparison.
This creates a crucial
difference in the way patterns are represented in the brain vs. my hierarchy of
queues. Brain consists of neurons, each likely representing a variable of a
given magnitude and modality. These variables are shared among multiple
patterns, or coactivated networks (“cognits” in terms of J. Fuster).
This is conceptually
perverse: relatively constant values of specific variables define a pattern,
but there’s no reason that same values should be shared among different
patterns. The brain is forced to share variables because it has fixed number of
neurons, but a fluid and far greater number of their networks.
I think this is responsible
for our crude taxonomy, such as using large, medium, small instead of specific
numbers. So, it’s not surprising that our minds are largely irrational, even
leaving aside all the subcortical nonsense. We don’t have to slavishly copy
these constraints. Logically, a cooccurring set of variables should be
localized within a pattern. This requires more local memory for redundant
representations, but will reduce the need for interconnect and transfers to
access global shared memory, which is far more expensive.
More broadly, neural
networkcentric mindset itself is detrimental, any function must be initially
conceptualized as a sequential algorithm, parallelization is an optional
superstructure.
Probabilistic inference:
AIT, Bayesian logic, Markov models
A good introduction to
Algorithmic information theory is Philosophical Treatise of Universal Induction
by Rathmanner and Hutter. The criterion is same as mine: compression and
prediction. But, while a progress vs. frequentist probability calculus, both
AIT and Bayesian inference still assume a prior, which doesn’t belong in a
consistently inductive approach. In my approach, priors or models are simply
past inputs and their patterns. Subjectspecific priors could speedup
learning, but unsupervised pattern discovery algorithm must be the core on
which such shortcuts are added or removed from.
More importantly, as with
most contemporary approaches, Bayesian learning is statistical and
probabilistic. Probability is estimated from simple incidence of events, which
I think is way too coarse. These events hardly ever match or miss precisely, so
their similarity should be quantified. This would add a whole new dimension of
micrograyscale: partial match, as in my approach, vs. binary incidence in
probabilistic inference. It should improve accuracy to the same extent that
probabilistic inference improved on classical logic, by adding a
macrograyscale of partial probability vs. binary true  false values of the
former.
Resolution of my inputs is
always greater than that of their coordinates, while Bayesian inference and AIT
typically start with the reverse: strings of 1bit inputs. These inputs, binary
confirmations / disconfirmations, are extremely crude way to represent “events”
or inputs. Besides, the events are assumed to be highlevel concepts: the kind
that occupy our conscious minds and are derived from senses by subconscious
cognitive processes, which must be built into general algorithm. Such choice of
initial inputs in BI and AIT demonstrates a lack of discipline in incrementing
complexity.
To attempt a general
intelligence, Solomonoff introduced “universal prior“: a class of all models.
That class is a priori infinite, which means that he hits combinatorial
explosion even *before* receiving actual inputs. It‘s a solution that only a
mathematician may find interesting. Marginally practical implementation of AIT
is Levin Search,
which randomly generates models / algorithms of incremental complexity and
selects those that happen to solve a problem or compress a bit string.
Again, I think starting
with prior models is putting a cart before a horse: cognition must start with
raw data, complex math only becomes costefficient on much higher levels of
selection and generalization. And this distinction between input patterns and
pattern discovery process is only valid within a level: algorithm is embedded
in resulting patterns, hence is also compared on higher levels, forming
“algorithmic patterns“.
8. Notes on working mindset
and a prize for contributions
My terminology is as
general as the subject itself. It’s a major confounder,  people crave context,
but generalization is decontextualization. And cognitive algorithm is a
metageneralization: the only thing in common for everything we learn. This
introduction is very compressed, because much of it is work in progress. But I
think it also reflects and cultivates ruthlessly reductionist mindset required
for such subject.
My math is very simple,
because algorithmic complexity must be incremental. Advanced math can
accelerate learning on higher levels of generalization, but it’s too expensive
for initial levels. And general learning algorithm must be able to discover
computational shortcuts (= math) on it’s own, just like we do. Complex math
definitely isn’t innate in humans on any level: cavemen didn’t do calculus.
This work may seem too
speculative, but any degree of generalization must be correspondingly lossy.
Which is contrary to deductionheavy culture of math and computer science.
Hence, current Machine Learning is mostly experimental, with constrained tasks
and supervised techniques. A handful of people aspire to work on AGI,
but they either lack or neglect functional definition of intelligence, their
theories are only vague inspiration.
I think working on this
level demands greater delay of experimental verification than is acceptable in
any established field. Except for philosophy, which has nothing else real to
study. But established philosophers have always been dysfunctional fluffers,
not surprisingly as their paying customers are college freshmen.
Our main challenge in
formalizing GI is a speciewide ADHD. We didn’t evolve for sustained focus on
this level of generalization, that would cause extinction long before any
tangible results. Which is no longer a risk, GI is the most important problem
conceivable, and we have plenty of computing power for anything better than
bruteforce algorithms. But our psychology lags a lightyear behind technology:
we still hobble on mental crutches of irrelevant authority and peer support,
flawed analogies and needless experimentation.
Prize for contributions
I offer prizes up to a
total of $500K for debugging, optimizing, and extending this algorithm: github.
Contributions must fit into
incrementalcomplexity hierarchy outlined here. Unless you find a flaw in my
reasoning, which would be even more valuable. I can also pay monthly, but there
must be a track record.
Winners will have an option
to convert the awards into an interest in all commercial applications of a
final algorithm, at the rate of $10K per 1% share. This option is informal and
likely irrelevant, mine is not a commercial enterprise. Money can’t be primary
motivation here, but it saves time.
Winners so far:
2010: Todor
Arnaudov, $600 for suggestion to buffer old inputs after search.
This occurred to me more than once before, but I rejected it as redundant to
potential elevation of these inputs. Now that he made me think about it again,
I realized that partial redundancy can preserve the detail at much lower cost
than elevation.
The buffer is accessed if
coordinates of contents get relatively close to projected inputs (that and
justification is mine). It didn’t feel right because brain has no substrate for
passive memory, but we do now.
2011: Todor, $400
consolation prize for understanding some ideas that were not clearly explained
here.
2014: Dan He, $600 for pushing me to be more
specific and to compare my algorithm with others.
2016: Todor, $500 for
multiple suggestions on implementing the algorithm, as well as for the effort.
2017: Alexander Loschilov, $2800 for help in
converting my level 1 pseudo code into Python, consulting on PyCharm and SciPy,
and for insistence on 2D clustering, FebruaryApril.
Todor: $2000 for help in
optimizing level_1_2D, JuneJuly.
Kapil Kashyap: $ 2000 for stimulation and effort, help with
Python and level_1_2D, SeptemberOctober
2018:
Todor Anaudov, $1000 mostly for effort and stimulation, JanuaryFebruary
Andrei Demchenko, $1800 for conventional
refactoring in line_POC_introductory.py, interface improvement and few improvements
in the code, April  May.
108 comments:
I wish you had illustrations to make this easier to follow. I think what I'm missing is sufficient context to follow your initial development of your ideas.
I am not sure about illustrations Aaron, they may mislead rather than explain. The only macro "architecture" I have is a hierarchy of generalization. There are micro hierarchies of derivation within each level, but they have variable / incremental depth & dimensionality. I don't think you're missing context,  there isn't any. Generalization is decontextualization, & nothing's more general than GI. It comes with the territory,  you pay for an altitude with a lack of air:). But I'll think about illustrations, & please let me know if you have specific questions.
Alper E:
I need an example (such as character recognition, or pattern recognition) in relation to which I can think on the algorithm. You already mention "pixels across a 1 dimensional scan line". Do you have a more elaborate example and a description of the problem you are trying to tackle with this alogrithm?
Thanks,
Alper
All of this is pattern recognition, Alper, according to my definition of a pattern (haven't seen any constructive alternative to that). It's just that for me a pattern can be of any level of complexity, & you are probably thinking of something specific. For example, a character is a contiguous 2D pattern. So in my scheme it would be formed on the second level, out of 1D segments (even of length = 1) contiguously matching in 2nd D (across scan lines). "Contiguous" means that minimal d_coordinate between any constituent pixels of the segments = 1, in both dimensions. These 2D patterns (characters) will be automatically compared & recognized on higher levels: 3D & higher contiguous or 2D & higher discontinuous search. This comparison will be between sametype variables of both comparands (multivariate patterns). These variables will include presumablyidentical brightness & linewidth (angle (d_1D_coordinate) normalized 1D segments' length), & characterspecific length (anglenormalized 2nd D height). This 2D length will mediate constituent subpatterns: 2Dcontiguous sequence of angles & 2D lines, which will be compared across patterns if their macropatterns (above) match. Overall match will be a sum of matches between individual variables & subpatterns, each possibly weighted by feedback.
This is a very rough description, I don't design this algorithm for character recognition, or any specific task. It should simply recognize anything recognizable, on corresponding level of search & with sufficient resources.
My "problem" is efficient search, where efficiency is resource allocation in proportion to predictive value, on all levels of representation. Predictive value is past_match  systemaverage_match, both projected by coderived partial misses: feedforward & feedback vectors.
Alper E said:
It now becomes a little bit more clear (but, only a little bit :)
Do you have a description of your algorithm in a formal language (you had mentioned a pseudo language). I really need a formal description of it, because many of the terms you use do not look clear to me. Moreover, it'd be fine if we stick to a simple example, like character recognition.
How would you propose to proceed?
Alper
That pseudo is my own very highlevel language, I sent you a sample. I tried more formal descriptions, but it's very awkward to think in these terms, I need more rhyme & (lossy) compression. None of that pseudo is about 2D, though. I currently work on the core algorithm, which is 1D: timeonly. Spatial dimensions are empiricallyspecific addons to that core algorithm. Still worth thinking about, might be suggestive beyond spatial specifics. So, we can talk in terms of character recognition. But email is a low throughput medium, especially as I am a slow typist. For me to initially explain my terms, it should be questionanswer in realtime, that's why I suggested skype or conference.
I think since so little is know about the algorithmic structure of the human brain you had better leave it to evolutionary algorithms to forge an AI. It would seem from the evolutionary timeline here on earth that creating higher intelligence is a trivial matter compared to perfecting the biochemical basis of life. I think you give an evolutionary algorithm as facile a medium to work with as possible and then, simply let it work. The medium of which I speak is probably some form of patterned arithmetic structure where where mutations do not generally have catastrophic results. And where there is an extremely high level of malleability with regard to what is constructed.
>I think since so little is know about the algorithmic structure of the human brain
I know enough to realize that it's a misshapen kludge, a worthy product of dumb & blind evolution.
> you had better leave it to evolutionary algorithms to forge an AI.
We did, for 3B years.
Now that we have brains, why not use them?
> creating higher intelligence is a trivial matter compared to perfecting the biochemical basis of life.
That "biochemical basis" is all "arithmetic" too. If you design an algorithm to maximize its own reproduction, then you're starting from 3B years ago.
And if you want to maximize intelligence, then you need to define it functionally, which is where I start in this intro.
> The medium of which I speak ...
You can't get a *general* intelligence to evolve in a severely *constrained* medium.
Laszlo Abraham:
Hi, interesting "project". Read your article (and the indicated literature). In my opinion, your approach to this domain is too philosophical. You set a so high level of generality that keeps you out from any concrete (usable) discovery / solution .
The problem is not to understand your ideas. Any interested scholar, with a sufficient culture in the domain, at a moment, obviously arrives to these conclusions about intelligence and pattern matching. The problems are starting only from here.
If you want to say something new (usable for others too) you must dive in the details. You must split the general problem in small, maybe banal and trivial parts, until the solution is feasible and achievable. This is what Plato names diaeresis. From your level of generality this is impossible. You are blocked there. You want a too big junk with one bite. Philosophy will never solve any concrete problem.
Keep in mind what Henry Ford said: "Nothing is impossible if we could disassemble it in small (simple, solvable) parts". Every serious researcher in the domain is doing this. do not disdain them. This not means that they all miss the general view you just discovered. This means they want more: DO something, even if they could not solve alone all the problems. This is a cooperative game.
Of course we must see the forest, not only the trees. It is important, it will guide us. But, in the same time, never could make a fire with the "forest", only with woods.
Told you all these because (it seems) you traversed the same stages as I did years ago. My critics want to be constructive, want to help you. Don't be offended.
I am working in this domain from more then 20 years. Have some ideas, what and how to do. Never had enough time (money) to work only on these ideas. Would be very happy if could realize something together, but ONLY if you don't want the Moon from the sky in one step, and we have a clear (partial) target every moment.
Waiting for your reply,
I think it's a solvable problem, Laszlo: we have a solution between the ears. And I did define it, from the very beginning. It's true that no solution is final,  there is no limit to how efficient / scalable a search can be.
But there must be a strictly incrementalcomplexity progression of improving scalability without losing generality, as I outlined. Small steps, as you say.
The difference is that I want these steps to implement first principles, & you want them to solve specific problems. That's a common dilemma in such work,  scalability vs. quick fixes.
You say any reasonable scholar would agree with my conclusions,  name one? "Indicated literature" in my intro is precisely to explain fundamental differences between my approach & anything else I came across.
If you can show me one work that follows roughly the same progression as mine,  that's a big award right there. Even something as basic as defining a match: absolute, projected, relative,  as in part 2? Or *one* approach to clustering that proceeds with incremental spatiotemporal dimensionality, even if their definition of "distance" is not fully consistent with mine. Hello?
This is not philosophy, never mind Plato or Ford.
Re "partial targets, in my sense, I am currently trying to better define mean match, as a criterion for extended search.
Basic definition is: value of currentlevel projected match that cooccurs with (thus predicts) a mean projected higherlevel match.
But I think that mean currentlevel projected match must be adjusted by projected consequences of its own feedback?
Would be interested in your ideas, of course. Can't promise to accept them, but if I don't, I will tell you exactly why.
Laszlo Abraham said:
Hi Boris,
Unfortunately, you feel offended. Sorry for this. Have no time to make a thorough (statement level) analysis of your theory. I think I understood the main message. Sorry, but I don't saw any original idea in it. You are using other words for the known things. This is not improving our knowledge in the domain. Additionally, the problem you "defined" is existing, at least, from 1676 (see "Characteristica universalis" of Leibnitz).
Since this domain (AI) existed long before we started to study it, and many smart people contributed to it with concrete and relevant results, it is an elementary expectation to use the already consecrated terminology when we are discussing about these things. For example, your "incremental spatialtemporal dimensionality" is the well known phase space in physics. Searching patterns in it is the identification problem in the dynamic systems theory, and the analog problem in the abstract notions world is modeled very well with the causal (knowledge) graphs. If you want to introduce a new terminology, you must first argument its necessity. It is very good if you want to discover some new things, but this is not a linguistic problem: renaming the notions and methods is not helping us at all. Let's speak the same language!
When you try to build a new theory, you must know almost everything about what others done in this direction, and respect their results. In other case, you will prove only your ignorance, if some will take the time to decrypt what you want to say. Also, when you expose your new theory, you must offer links to the already known and generally accepted facts, and your article must contain relevant citations. In academic circles this is an elementary expectation too. At least, you must explain why your theory is necessary, what improvements it brings, and what are YOUR ORIGINAL RESULTS (as Einstein did in 1904). If you want to advocate for a new method (theory), first you must have results, Maybe you're right, maybe your idea is genial, anyhow nobody will believe you till you prove it. So, let's start with the small problems!
Regarding the examples you ask: at a popular science level Kurzweil's, Hofstadter's and Mero's books contain all your ideas about the link between intelligence and pattern matching (and much more). At a scientific level, conforming to my knowledge, Schmidhuber's and Hutter's works are the examples which are working with the same notions / methods, and HAVE SPECTACULAR RESULTS. For example, google for the Hutter Price. This is a very good example how an award promising theoretical problem most be exposed.
Additionally, you could take a look on Judea Pearl's books, and the domains of knowledge graphs, chat robots ontology and the Loebner Price. In my opinion, you rediscovered the contextuality of the knowledge. Bravo, how we go further?
Of course, you are free to publish anything on your website, even on Elance. The risk is only to receive unpleasant comments from old dinosaurs as I am. Again, sorry if offended you, don't bother you more. Except if you really want to put at work the contextual model of the world.
Thank you for your attention,
Laszlo
As you admitted Laszlo, you didn’t get very far in my intro. Just recognized some basic points you already knew, & ignored parts you didn’t understand,  most of it.
Which means that you can’ judge whether it’s original. For example “phase space” has nothing to do with my incremental dimensionality, see part 3.
And other suggestions are pure fluff: “philosophy” as you say.
Except for two: I don’t use accepted terminology & build on other people’s work.
Well, there are several fields involved, all significantly off of my focus. And each has its own terminology, usually incoherent & mindpolluting.
Like I said, generalization is a reduction. You can spend many lifetimes reading related work, only to realize that it’s not related enough.
Or, you can define & describe the problem in general English terms & phrases that fit it best. Then you google them, & see that my site is on top of the list.
I don’t think the problem here is my terminology, it simply takes a while to acquire appropriate context (in my case, ~ 27 years).
I am dully impressed that there are plenty of smart people in the world, many smarter that I am. But, as with anything else, intelligence is not what you got, it’s how you use it.
And almost all use it mainly to succeed in life, according to societal standards. Yes, even in science: it’s all about peer review. Which is great, but only if you’re lucky enough to have peers.
Anyway, thanks for trying.
Dan He said:
I briefly read your blog, very interesting work. I have the following thoughts.
The proposed method is contrary to deep learning. In deep learning, the number of features in the model presentation is reduced layer by layer, starting from the full set of features. In your model, the number of features increases layer by layer, starting from single features. However, your model also allows cross layer communications from top layers down to the bottom layers, which is similar to deep learning. I think this is indeed how the brain works: when we learn something, we often jump forwards and backwards to connect different levels of details and information. I think this process should be developed in a more systematic way.
Human brains learn new knowledge, or patterns, differently. Someone may pay attention to the big picture first, someone may pay attention to details first. Someone may pay attention to shape first, someone may pay attention to color first. Someone may pay attention to red color first, someone may pay attention to blue color first. Thus there should not be a single "best" order of the dimensions to be selected, alternatively speaking, the incremental growth could start from different dimensions. Instead of fixing the order of the dimensions to be selected as proposed by the blog, we could use a mergebased strategy: we first select a set of size1 dimensions that are significant enough, assuming we select dimensions A,B,C,D. Then we merge these significant dimensions into size2 dimensions, AB,AC,AD,BC,BD,CD and then select from them the significant size2 dimensions. Assuming we select AB,AC,BC,BD,CD. Then we merge these significant size2 dimensions into size3 dimensions, ABC,BCD. We repeat the process till no more dimensions can be merged. The advantage of the mergebased strategy is we allow different start points in the search space in a systematic way, which models the multiple possibilities of the brain functions.
One question I have is: the dimensions for the incremental selection seems to be different from the features of the data. For example, an image may have thousands of pixels and each one is considered as a feature. But it seems your dimension here is the 2D space. Then when you make comparisons, how exactly you handle the features of the data, which is critical for the scalability of the method?
Finally, I would suggest you have a section of "concept definition" and "table of annotations". The current writing is kind of hard to follow as the definitions of some concepts and annotations are missing, or at least not easy to locate.
These are all my brief ideas, please let me know if you think we can talk more.
Thanks
Dan
Thanks for your comments Dan,
> In deep learning, the number of features in the model presentation is reduced layer by layer, starting from the full set of features...
"Features" in NNs correspond to two different things in my model:
1) Positionally distinct inputs / patterns, ordered along ST dimensions within input flow: 1D length ) 2D width ) 3D depth ) 4D duration. These dimensions are defined by explicit coordinates in my model, but are not easily represented & compared in a NN, which I think is one of their basic flaws. The number of such patterns is reduced with elevation, just as it is in deep NN, because input patterns to higher levels are ST contiguous spans of lowerlevel patterns.
2) Variables internal to these patterns, such as different original input modalities, & matches / differences derived from them by lowerlevel comparisons. Again, there is no such internal variables in NNs, instead, they are distributed across network / brain, & shared among multiple "patterns" or "cognits" in terms of J. Fuster (which are comprised of longrange connections to such variables). This is another basic flaw of NNs because such longrange connections are very expensive, & must be activated all at once. The number of these variables per pattern is increasing with elevation in my model because they are necessary to represent increasingly complex relationships discovered during lower levels of search. This is also true for neocortex, as the input flow proceeds from primary to association cortices. Again, in my model such incremental syntax is localised within a pattern, but is distributed across "neocortex" in neural models.
I tried to explain these things in parts 4 & 8, don't know if you made it that far.
> Someone may pay attention to the big picture first, someone may pay attention to details first...
"Attention" is a feedback from higher levels, you don't initially have it because higher levels are "empty",  they fillin with generalized outputs of lower levels. A designer determines the order of feedforward, which can latter be altered by selfgenerated feedback.
> One question I have is: the dimensions for the incremental selection seems to be different from the features of the data. For example, an image may have thousands of pixels and each one is considered as a feature. But it seems your dimension here is the 2D space.
It's 4D spacetime, see above.
> Then when you make comparisons, how exactly you handle the features of the data, which is critical for the scalability of the method?
I compare sametype variables between different patterns: their syntax must "synchronized" before comparison.
> Finally, I would suggest you have a section of "concept definition" and "table of annotations". The current writing is kind of hard to follow as the definitions of some concepts and annotations are missing, or at least not easy to locate.
This would depend on reader's background, which I don't know. That's why I use general English terms, usually in their original meaning. I know it's confusing for a specialist, but my subject is *general* intelligence :).
Thanks again Dan, hopefully we can get into procedural details.
Daniel Michulle said:
Hi there,
I am reading your description on the web site and will put down some of my thoughts:
1. "Incremental": I am not sure whether incremental approaches to pattern discovery work since this requires sequential processing. That is, unless you mean 1a or 1b.
1a: Attention: Attention is a sequential process that looks for "interesting things". This would mean that you need a guidance system for attention that I presume should have the same structure as the cognitive apparatus you intend to create. This can be just another level in the hierarchy.
1b: Hierarchical increments which you already included by stating you want a hierarchic system.
2. Feedback: Feedback must be two way in your solution.
2a Either you are able to do this fully parallely or
2b you define an artificial discrete state and update it in artificial sequentially without any input/output occurring during this time or
2c or you select just some of the processing units in your cognitive apparatus to update and try to stay realtime
Anyhow, control flow seems a critical design decision.
PS: Found this today, nothing scientific but maybe a useful analogy: http://nautil.us/issue/12/feedback/antsswarmlikebrainsthink
3. Homoiconicity
In order to be efficient at implementing your Cognitive Apparatus (CA), you would do well in restricting input to some sort of data structure representable by modern day PCs. Bitwise is an option but w.r.t. efficiency I presume that "array of double" is a better format. You can downscale later.
Also, in order to be able to create your CA in a finite amount of time, I presume it is much easier to define the output of any functional unit in your CA to be of the type of your selected input. In this way you can compose your CA of different units.
Still, you could later substitute some units by others without sacrificing composability.
Given these thoughts, its no wonder why often people like deep learning based on ANNs because they are homoiconic and composable.
4. Parameters
Your system will have a huge number of parameters and you will need to search for optimal parameters. You don't want to do this by hand so you'll need a metaalgorithm.
Up to now I just found one "universal algorithm" that works context free: Evolutionary algorithms where competition leads to improvement (you dont know how good your CA will be but if you have 10 you at least have a measure of how good it is compared to peers)
Moreover, if you like challenges (and it certinaly seems so), you can use genetic programming to find components. Also, you will obtain a measure for the overall cost of the system which may prove useful in finding:
5. local computation cost
Since you mention search cost, you will need an intrinsic measure of cost. I have no idea how you could implement that but brains do this via its neurotransmitters and vesicles. These are only slowly recharged and you cant spend more than you have. The disadvantage is that you need to model time in this case.
6. Background: I am a big fan of LISP. I encourage you to realize any of your ideas in Clojure because of homoiconicity and representation of logic (core.logic) and the huge amount of libraries in Clojure and Java.
I myself worked on two different AI "leaps":
6.1. A General Game Playing agent: An agent that is able to play any game (deterministic games with complete information, e.g. Chess, Checkers, Connect Four, TicTacToe, RockPaperScissors)
6.2 A generic time series forecasting mechanism that takes any kind of time series data and is able to extract all the info for forecasting, including selecting its internal model, time sieres transformations, outlier detection , ...
I founded a company in 2012 for promoting the system and we have modest success right now.
That being said, I am a big fan of this kind of ideas but I lack time.
Hope you find this interesting!
BR, Daniel
Thanks for your comments, Daniel,
>1. "Incremental": I am not sure whether incremental approaches to pattern discovery work since this requires sequential processing.
Any pattern discovery is an iterative comparison to other inputs within a search range. I tried to explain how search should incrementally / hierarchically expand, let me know what's unclear.
> 1a: Attention: Attention is a sequential process that looks for "interesting things". This would mean that you need a guidance system for attention that I presume should have the same structure as the cognitive apparatus you intend to create. This can be just another level in the hierarchy.
Attention is simply another term for feedback down the hierarchy, which I also tried to explain.
> 2. Feedback: Feedback must be two way in your solution.
2a Either you are able to do this fully parallely
All levels should process in parallel, both feedforward & feedback.
> Anyhow, control flow seems a critical design decision.
"Control" is also another term for feedback.
> 3. Homoiconicity: In order to be efficient at implementing your Cognitive Apparatus (CA), you would do well in restricting input to some sort of data structure representable by modern day PCs. Bitwise is an option but w.r.t. efficiency I presume that "array of double" is a better format. You can downscale later.
All I need is bitmapped images.
> I presume it is much easier to define the output of any functional unit in your CA to be of the type of your selected input.
Basic units in my approach are levels, syntactic complexity of their inputs & outputs is incremental with elevation, as I tried to explain.
> 4. Parameters: Your system will have a huge number of parameters and you will need to search for optimal parameters. You don't want to do this by hand so you'll need a metaalgorithm. Up to now I just found one "universal algorithm" that works context free: Evolutionary algorithms where competition leads to improvement (you dont know how good your CA will be but if you have 10 you at least have a measure of how good it is compared to peers)
Input parameters are also altered by feedback, not at random as in GAs, see part 1.
> Since you mention search cost, you will need an intrinsic measure of cost. I have no idea how you could implement that
I use "opportunity cost": average projected match per search step. This is the most basic form of feedback, see part 2.
> The disadvantage is that you need to model time in this case.
I use explicit 4D spacetime coordinates, see part 4.
> I encourage you to realize any of your ideas in Clojure because of homoiconicity and representation of logic (core.logic) and the huge amount of libraries in Clojure and Java.
My approach is expressed in its own language. The core algorithm is supposed to be selfcontained, I don't plan to use any external libraries for it.
Sorry for being brisk. It's just that I've been repeating these things forever, don't know how to make it any clearer...
Dan He said:
Hi, Boris:
Sorry for the late response. I was traveling the last couple of weeks.
SO your pattern is positionally distinct patterns. What about other types of patterns? Is your model generic to all types of patterns or only to this type of specific patterns?
Also if the dimension is only 4D, does incremental really help? I am
not sure if my understanding is correct, but it seems you will only have 4 layers?
Thanks
Dan
No problem Dan.
"Positionally" includes temporally distinct,  each level gets a temporal sequence of inputs, each with distinct temporal coordinate. In that sense, any type of pattern is positionally distinct. Initial "patterns" are pixels in 1D scan line,  temporal sequence that is mapped into spatial coordinate on a higher level.
On the other hand, all constituent variables cooccur in a pattern, thus are not positionally distinct. On a given level of differentiation, of course. A pattern may contain multiple levels of partial differences / derivatives relative to other patterns, which are positionally distinct.
Incremental dimensionality is: 1D 1st level, 2D 2nd level, 3D 3rd level,TD 4rth level.
This 4level cycle repeats indefinitely, with patterns of hierarchically greater span / scale.
Dimensions in cycles of different order of scale are not necessarily aligned with each other.
Anyway, our 3D space is empirically specific & this dimensionality should be learnable.
My core algorithm is 1D,  time only.
The point of having incremental dimensionality is that it lets you filterout weaker lowerD patterns before adding the costs of higherD coordinates.
Dan He said:
If the level could be infinite, what do you think that we don't fix the order of the incremental search, instead we increase the levels by merging the lower levels? I think this way we would still consider all possible
combination of levels while pruning the search space effectively.
Dan
Not sure I understand you, Dan.
"Level" is a span of search before selection = elevation.
(It's actually more complex, see part 3, somewhat out of date)
Such span is "increment" in incremental search.
That span ends when the benefit of selection exceeds its cost.
The benefit of selection is (cost  benefit) of comparing subcritical
(cost > benefit) patterns, which should be prunedout.
This point is reached because the cost of comparison is fixed,
but the benefit (average match) declines with the distance
between comparands (if the data set is not random).
That point doesn't depend on total memory of the system,
so we fix the number of inputs per level, not the number of levels.
I am glad you are thinking about it Dan, but these ideas must be justified
:).
Dan He said:
Do you know Apriori algorithm? The idea is justified there. It's just a
matter whether it can be applied
here.
Dan
Looking at wiki, I don't see a justification.
Anyway, in breadthfirst all data is available at once.
In realtime learning, inputs are searched in the order
they come in, & data set is never complete.
It's neither breadth nor depth, but proximity first.
Dan He said:
Why we have to follow the input order? I mean we could cache the input
within a certain short amount of time window and the apriori algorithm can
then be applied, right?
Dan
We have to follow proximity between inputs because it's
the most basic predictor of similarity / match.
This may not sound obvious to you because you are used to
dealing with heavily preprocessed data. Compressive
preprocessing (starting with imagecompression transforms)
already exploited this proximity>similarity correlation,
& you can't do that again.
I start from raw images because the algorithm has to scale
down in order to scale up. Generality / scalability means that the same principles should apply on all levels, only the syntax of input data is getting more complex.
"Time window" is my search level, & yes, it's breadthfirst
within it. But apriori algorithm searches for binary match between
combinations among adjacent inputs, & I start by searching for
grayscale match between individual inputs.
Again, this is because I start from the beginning, while the algorithms you are used to are designed to deal with preprocessed / humangenerated symbolic data. That's why they don't scale.
Dan He said:
I don't think Apriori can only deal with binary match. You can use any matching criteria to determine if a candidate should be selected or not.
Also when you include a new dimension, it's either include or not, right?
That's the same thing for Apriori. Even if you allow partially include, you could still do the same thing in Aprior, with partial membership, something similar to fuzzy logic. I just don't see why you can not apply Aprior here.
Dan
I guess you can morph any algorithm into any other, if definitions are fuzzy enough.
That's why I work on the level of operations, rather than wholesale algorithms. What exactly do you suggest to change / add to my algorithm?
Dan He said:
Well, my suggestion is:
Instead of sequential incremental search, we might be able to do parallel incremental search from multiple dimensions, where we increase the dimensions in parallel by merging the dimensions of the current level to generate the candidates for the next level. At each level, the candidates are selected using any user specified rule.
Dan
With incremental dimensionality, a 2nd D (vertical) level
receives inputs (1D patterns) that were compressed &
selected on a lower level. And so on.
In your version, multiple levels would process in parallel
the same uncompressed & unselected inputs, which is
a horrible waste.
Incremental selection is my core principle: it's what
should make the algorithm scalable. Nice try :).
> At each level, the candidates are selected using any user
specified rule.
The algorithm is supposed to be general: autonomous &
unsupervised. Selection criteria is what it's all about, there's
no user in the loop.
BTW, I'd like to have this discussion in comments on my
intro, someone else may find it useful. I'll copy old messages
there, If you don't mind.
Dan He said:
Only on 1D the inputs will be uncompressed & unselected. Since 2D, we will be merging the dimensions from the previous level so it won't be uncompressed. At each level we make selections and we only merge the dimensions that are "significant".
The selection is generic. It doesn't need to be user defined. Whatever method you used to select a dimension can be used here.
Yeah, you can publish these discussions.
Dan
You said:
> parallel incremental search from multiple dimensions,
> where we increase the dimensions in parallel by merging
> the dimensions of the current level to generate the
> candidates for the next level.
Do you mean expanding search in multiple additional
dimensions (& adding corresponding coordinates) per
single additional level, then?
If so, I already answered that in May 11nth message:
"The point of having incremental dimensionality is that it
lets you filterout weaker lowerD patterns before adding
the costs of higherD coordinates."
This is an instance of "incremental syntax",  another core
principle in my approach.
I don't know what you mean by "merging" dimensions?
In my model, lowerlevel variables are selectively
represented or merged with related variables on a
higher level, see part 5.
Although that part is a bit messy & out of date, sorry.
Dan He said:
If you have ABCD 4 dimensions, and at current level, say 2 dimension level, you have AB and BC, both are "significant", then the next level is 3 dimension, you merge AB and BC and you obtain ABC, which is 3 dimension.
That's what I mean by "merging". You merge two ndimensions to obtain one n+1dimension, where the two ndimensions overlap by n1 dimensions.
Apriori algorithm then guarantees that you won't miss any significant dimensions and you don't need to do exhaustive search.
Dan
I thought we agreed to use "dimension" to mean Cartesian
coordinates + time the order of input, for clarity?
You are talking about dimensions of combinatorial search,
while mine is proximityordered. As in, proximity between
comparands, rather than proximity in expanding the number
of constituent variables within a comparand.
I have multivariate patterns, but their variables aren't included from outside, they are *derived* through lowerlevel comparisons to a fixed syntax.
When I compare these patterns, I don't compare their variables in all possible combinations, I only compare variables of the same type between different patterns.
So, the difference is that variables in my patterns ("dimensions" in your terms) are added to represent derivatives from comparing & compressing lowerlevel patterns. While in apriori algorithm, they *are* lowerlevel patterns, combined into higherlevel ones if they cooccur.
My model is finergrained, thus should be more compressive.
Dan He said:
For proximityorder, you can still cache the input for a certain amount of time, right? Then we can search in parallel? If not, what do you exactly mean "proximity order"?
So why Apriori can't be applied on the variables in your patterns? And also when your patterns grow to higherlevel, eventually Apriori can be applied,right?
Dan
> For proximityorder, you can still cache the input for a certain
> amount of time, right? Then we can search in parallel?
Yes, that;s how you get fuzzy patterns (~fuzzy clusters).
Each input member is compared to a sequence of old inputs
(template), possibly in parallel, forming a sequence of
derivatives that overlaps those of adjacent inputs.
This overlapping / redundant representation is far more
expensive than unique representations produced by strictly
sequential search.
> why Apriori can't be applied on the variables in your patterns?
Because individual variable's match is "merged" with those of
coderived variables,  a group is defined by derivation rather
than simple proximity. That's one of the difference between
variables & patterns: the former are ordered / merged
by derivation syntax rather than input sequence.
Dan He said:
What's your definition of "derivation" and "derivatives"? You compare the
inputs with the templates,
and you call those overlapped ones as "derivatives"?
I still don't get the point that you eventually need to increase the
dimensionality or whatever, and whatever you need to increase, can be feed
into the Apriori algorithm, right? Even if the derivatives depend on the
sequence of inputs, if we can cache the inputs, we could obtain the
sequence of derivatives, right?
Dan
Derivatives are not inputs, they are matches & misses
produced by comparing inputs. They replace original
inputs to achieve lossless compression. I described
this process in part 3, but it's messy & out of date.
I am rewriting it, will let you know when it's done.
Dan He said:
Ok, so please let me know when you have it updated. Also if you could
provide a concrete runningexample, that would be very helpful. The current
description contains lots of definitions and it's hard to map them into
some math models.
Dan
Hi Dan, sorry it took me so long.
I was working on something else, & also,
it’s hard to explain a workinprogress.
Anyway, I just posted a bunch of updates,
especially in part 3, take a look.
Obviously, I am not done yet.
So, I can’t select  exclude variables depending on
their cooccurrence, as in apriority algorithm,
because then syntactic order will be lost.
The syntax is fixed for each search & differentiation
level, & variable types can be deleted / excluded only
by a level of syntax: a group of variables with a
common root (variable from which they are derived).
Such as dP: i: last input, L: samedsign length,
D: summed ds, M: summed ms, Q(d): discrete ds.
See part 3.
So, match is combined within a group only.
If it is below average for a corresponding root
variable, the group is deleted, & derivation flag
at the root variable is set to = 0.
> The current description contains lots of definitions
> and it's hard to map them into some math models.
It’s hard because they don’t fit, at least those I know
(not very many). That’s why I start from the scratch.
Terminology & methods must fit the problem,
not the other way around.
Anyway, it’s good that you pushed me to get into
details, here is $200 as a token of my appreciation :).
Dan He said:
Thanks
Also if I fully understand your method, I believe there should be more I can contribute.
Dan
Right, I hope you will understand it better, let me know if any questions.
Regarding the third paragraph  in other words our mind understands the experiences in our life by remembering patterns and later associating them with different patterns, right? I mean, this is what I have been theorizing myself and you have reached conclusions very similar to my own.
For example let's take an apple. When you read that word, you suddenly remember the smell, look, taste and feel of this entity  patterns from different sensory inputs.
Yes, but the key is to formally define patterns, and the algorithm to discover them. Do you have any specific ideas or comments on my algorithm? Thanks.
Hi, Boris, a few comments from a partial reading:
1. Noise IS a pattern and IS predictable  that's why it's boring
2. Higher levels are not always smarter
3. Lower levels of cognition are not unconscius
4. Causality, causes and correlations
...
1. Noise IS a pattern and IS predictable  that's why it's boring
Boris: "Novelty can’t be a primary criterion: it would select for noise and filter out all patterns, which are defined by match."
Todor:
I think that's a common, partly superficial, cliche. Noise, if having a stable or predictable distribution, is also a pattern, actually a very simple one, and it is recognized as some kind of distribution or remembered/classified otherwise (with a generative capabilities), depending on the system.
Noise's representations are "boring" (for watching, like TV static) not because they are too "unpredictable", but the opposite. Respectively it is not true that noise is "uncompressible".
It is compressible  to a noise distribution which is precise enough, since or if the cognitive system cannot remember the specific details, thus it wouldn't care or notice in real time if the specific low level values are slightly altered to the original.
If one tries to compress noise as specific values exactly, then she treats that input not as noise.
So in a series of frames with TV static only the specific values of the lowest level representation are "unpredictable", but the distribution is "the same", given the resolution of perception  in abstract sense of resolution,  that goes also for other details such as the same screen/dynamic area (physical coordinates), the same average brightness  applied in many levels/resolution of aggregation (and matching between each other) etc.. Also noise is locally selfpredictive, selfmatching after General Processing** or adjacent comparison, which makes it boring too.
Noise matches the predictions too match and watching the "static" on a TV is somewhat similar to watching the wall.
That goes for audio noise as well, it immediately starts to sound like a boring periodic pattern and different spectra of noise are recognized  white, pink, low frequency, middle frequency, high frequency* and if one is given synthesizing tools to play with, she can recreate the noise that she has heard, i.e. it is a pattern.***
* These are "generalized", higher level labels, but they are just pointers to the representations. People with good musical ear or with "absolutist" audio sense could remember and recreate or even tell the frequencies, just like they could replay music by ear. See 3. below.
** ...
*** Another theory is that the apparent local unpredictability overdrives brain processing, the system suffers cognitive overload due the unsuccessful attempts to compress, but I think that's more likely true only for the "garbage input", confusing ones, ones with too many recognizable details, like pictures of dumping grounds, general "disorder"; that is, something IS recognized, some part are "meaningful", but there's an overload in making sense of the whole picture. In the case of noise, the properties are highly local and homogeneous.
2. Higher levels are not always smarter
Boris: "... Intelligent variation must be driven by feedback within cognitive hierarchy: higher levels are presumably “smarter” than lower ones. That is, higherlevel inputs represent operations that formed them, and are evaluated to alter future lowerlevel operations. ..."
Todor:
I don't think that's always true (only "presumably" and sometimes), the "cleverness" of the higher level depends on the specific environment and the history of the experience and the biases, for how long the higher level was relatively "stable", how "stable", how "stability" is measured, how stable are/were the lower levels etc. In a more dynamic environment the lower levels might be "smarter", because they would adapt and react first and higher level plans might go completely wrong, with (correct) prediction value = 0, like with the branch prediction in the CPUs or the cache misses.
There's also something that Kant and Schopenhauer call "facculty of judgment", which is the glue between different levels of generalization and the capability to recognize and classify correctly the lowest sensory input to correct respective higher level representations.
The higher level is given higher *power* in longer term, because it may push the agent to this or that path for a bigger spatiotemporal radius, however whether this is more "predictive" in the long run depends on the whole Universe of things. "Smart" leaders often send their followers to hell and apparently wise maxims have kept people in darkness, and the world is occupied with other agents of which there are limitations for predictability. Their intentions are both partially hidden/unattainable or their goals and behavior is flexible and may change in reaction to your predictions/actions.
Brain is not strictly cognitive and the prefrontal cortex integrates too many inputs, but the world is also not just cognitive, and the topdown tyranny in general is prone to and causes corruption, conservatism, political tyranny, pseudodemocracy.
3. Lower levels of cognition are not unconscius
Boris: "All our knowledge is ultimately derived from senses, but lower levels of perception are unconscious. Only generalized concepts make it into human consciousness, AKA declarative memory, where we assign them symbols (words) to facilitate communication."
Todor:
The artists of all arts and the athletes, especially in technical sports are supposed to operate masterfully with "lowest level" stuff, that is simple actions or perceptions. Making a concept verbal might be just labeling, a quick addressing. A talented artist knows and opera