This intro is out of date, please see Readme on GitHub
I am designing this algorithm for strictly bottom-up hierarchical clustering, from pixels to eternity, draft code: agg_recursion. The process is derived from the definition of intelligence as the ability to predict from prior or proximate input. Practically, that means forming and tracing causal links in segmented graphs (connectivity-based clusters or patterns), same as much-ballyhooed reasoning and planning. Any prediction is interactive projection of known patterns, hence the primary process must be pattern discovery. AKA unsupervised learning: an obfuscating negation-first term. This perspective is not novel, pattern recognition is a core of any IQ test, and main focus of ML.
But in statistical ML, generalization / pattern discovery is a side-effect of randomization, not an explicit objective. Such methods are not a rational design, they rely on brute-force fitting. Especially Neural Nets, neuromorphic or artificial:
- Hebbian Learning is a local weight adjustment by output / input coincidence: a binary version of direct similarity between input and normalized sum of weighted inputs. So each neuron becomes a fuzzy centroid-based cluster of its inputs.
- In Deep Learning, the weights are adjusted by backprop of decomposed error: inverse similarity to the target layer. Basic ANN: multi-layer perceptron or KAN performs lossy stochastic chain-rule curve fitting. The logic is basically prototype-based clustering, clusters being the features represented by hidden-layer nodes. But it's non-linear and fuzzy: fully connected in MLP, with multi-layer summation / distribution in backprop. All this fitting is a vestige of supervised learning, that's why they call it "self-supervised", it doesn't belong in purely unsupervised learning.
Modern ANNs combine such vertical training with lateral cross-correlation, within an input vector. CNN filters are designed to converge on edge-detection in initial layers. Edge detection means computing lateral gradient, by weighted pixel cross-comparison within kernels. Graph NNs embed lateral edges, representing similarity and/or difference between nodes, also produced by their cross-comparison. Popular transformers can be seen as a variation of Graph NN. Their first step is self-attention: computing dot product between query and key vectors within context window of an input. This is a form of cross-comparison because dot product serves as a measure of similarity, although an unprincipled one.
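To make the "dot product as cross-comparison" point concrete, here is a minimal sketch of self-attention weights in plain Python. It assumes the simplest case: queries and keys are the raw token vectors themselves, with no learned projections, and the function name `attention_weights` is purely illustrative.

```python
import math

def attention_weights(queries, keys):
    """Softmax-normalized, scaled dot products between each query and every key."""
    dim = len(keys[0])
    weights = []
    for q in queries:
        # dot product of one query with every key: the pairwise "comparison"
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(scores)                       # subtract max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Assumption: queries == keys == raw token vectors (no learned projections)
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights = attention_weights(tokens, tokens)
```

Each row of `weights` ranks all tokens in the window by their dot product with one token, so the first token attends more to the similar second token than to the orthogonal third one.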
So basic operation in both trained CNN and self-attention is what I call cross-comparison, but the former selects for variance and the latter for similarity. I think the difference is due to relative rarity of each in respective target data: mostly low gradients in raw images and sparse similarities in compressed text. This rarity or surprise determines information content of the input. But almost all text ultimately describes generalized images and objects therein, so there should be a gradual transition between the two. In my scheme higher-level cross-comparison computes both variance and similarity, for differential correlation clustering.
GNN, transformers, and Hinton's Capsule Networks all use positional embeddings, as I use explicit coordinates. But they are still trained through destructive backprop: randomized summation first, meaningful output-to-template comparison last. This primary summation degrades resolution of the whole learning process, exponentially with the number of layers. Hence, a ridiculous number of backprop cycles is needed to fit hidden layers into generalized representations of the input. Most practitioners agree that this process is not very smart, that noise-worship alone is the definition of stupidity. I think it's just a low-hanging fruit for terminally lazy evolution, and slightly more disciplined human coding. And it's easy to parallelize, which is crucial for glacially slow cell-based biology.
I propose a reversed sequence: cross-comparison of original inputs, followed by summing into match-defined clusters. That's a lateral proximity-constrained connectivity-based clustering, vs. stochastic vertical fitting in NN. This cross-comp and clustering should be recursively hierarchical, forming patterns of patterns, and so on. Initial connectivity is in space-time, but feedback should re-order input along all sufficiently predictive derived dimensions (eigenvectors). This is similar to spectral clustering, but the actual clustering is still connectivity-based, within a new frame of reference. Feedback will only adjust hyperparameters to filter future inputs: no top-down training, just bottom-up learning.
Connectivity likely represents local interactions, which may form both similarity clusters and their high-variance boundaries. Such boundaries reflect stability (resilience to external impact) of the core similarity cluster. The most basic example is image contours, which are initially more informative than the flat areas they demarcate. Cross-similarity is not likely to continue immediately beyond such contours, they represent separability of the core cluster. The next cross-comp should be between new higher-composition and higher-derivation cluster + contour representations. This "complemented" clustering is inherently discontinuous and much more complex, due to the layer of new derivatives added in cross-comparison. So we need to alternate generative and compressive phases in each level:
- cross-comp of incremental range and derivation: a generative stage, followed by two compressive clustering stages:
- exemplar selection and centroid clustering, reducing multiple similar patterns to a single template / type,
- connectivity clustering of compared nodes and correlation clustering of resulting links,
- feedback to scale filters and input coordinates, potentially reframing to consolidate connections and projecting to simulate.
This is vaguely similar to alternating self-attention and feed-forward in transformer layers, but with explicit clustering and pruning, both strictly laminar and vastly different in detail.
Connectivity clustering is among the oldest methods in ML, but I think my scheme should be uniquely scalable in complexity of discoverable patterns:
- Links are valued by both similarity and variance between the nodes.
- Similarity is defined as compression (vs. dot product), direct or inverse. Variance has intrinsically negative value, borrowed from co-projected similarity.
- This projected similarity is a general cognitive objective / fitness function, which can be used to optimize the process automatically.
- Comparison derivatives parameterize clusters for higher-order cross-comp, incremental in range, derivation, and composition (of both nodes and their param sets).
The process is self-contained, there is no preprocessing, and all operations are derived from the above principles. This conceptual integrity provides the confidence to design an indefinitely nested param set per node, to be compared on a higher composition level. Such compressive encoding should be far more meaningful and interpretable than huge flat weight matrices in ANNs. But it’s very complex to consistently design and parallelize, precluding immediate trial and error that dominates ML. Which is probably why I can't find anything that's close enough.
Below I describe it in more detail, then extend comparisons to ANN and BNN. This is an open project: CogAlg, we need help with design and implementation, in Python. I offer awards for contributions, or monthly payment if there is a track record, see the last part here.
This content is published under Creative Commons Attribution 4.0 International License.
Outline of my approach
Initial clustering levels, positional resolution (macro) lags value resolution (micro) by one quantization order:
| Inputs | Comparison | Positional resolution | Outputs | Conventionally known as |
|---|---|---|---|---|
| unary intensity | AND | none, all in the same coordinates | pixels of intensity | digitization |
| integer pixels | SUB | binary: direction of comparison | blobs of gradient | edge detection, flood fill |
| float: averaged params of blobs | DIV: comp ave params | integer: distance between blob centers | graphs of blobs | connectivity-based clustering |
| complex: norm. params of graphs | LOG: params hierarchy | float: distance between graph centers | hierarchical graphs | agglomerative clustering |
And so on, higher levels should be added recursively. Such
process is very complex and deeply structured, there is no way it could evolve
naturally. Since the code is supposed to be recursive, testing before it is complete
is useless. Which is probably why no one seems to work on such methods. But once
the design is done, there is no need for interminable glacial and opaque training,
my feedback only adjusts hyperparameters.
So, a pattern is
a cluster of matching input elements, where match is compression achieved by
encoding the input as derivatives, see “Comparison” section below. Some define
pattern as a recurring item or group, in my terms these are pattern elements.
If the items co-vary: don't match but their derivatives do, then they form
higher-derivation pattern, where the elements are derivatives. 
But
lower-derivation and shorter-range cross-comp must be done first, starting with
consecutive atomic inputs. That means sensory input at the limit of resolution:
adjacent pixels of video or equivalents in other modalities. All primary
modalities form dense array of such inputs in Cartesian dimensions, symbolic
data is subsequent encoding. To discover meaningful patterns, the symbols must
be decoded, which is exponentially more difficult with the level of encoding. Thus
a start with raw sensory input is by far the easiest to implement (part 0).
This low-level process, directly translated into my code,
seems like quite a jump from the generalities above. But it really isn’t, internally
consistent pattern discovery must be strictly bottom-up, in complexity of both
inputs and operations. And there is no ambiguity at the bottom: initial predictive
value that defines patterns is a match from cross-comparison among their
elements, starting with pixels. So, I think my process is uniquely consistent
with high-level definitions, please let me know if you see any discrepancy in
either.
Comparison, more in part 1:
Basic comparison is
inverse arithmetic operation between single-variable comparands, of incremental
power: Boolean, subtraction, division, etc. Each order of comparison forms miss or variance:
XOR, difference, ratio, etc., and match or similarity, which can be defined directly or as inverse
deviation of miss. Direct match is compression of represented magnitude by
replacing larger input with corresponding miss between the inputs: Boolean AND,
the smaller input in
comp by subtraction, integer part of ratio in comp by division, etc.
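The first three orders of comparison can be sketched as follows. This is an illustration of the definitions above, not the project's code: the function names are hypothetical, and taking the ratio as larger over smaller input in `comp_div` is an assumption.

```python
def comp_bool(a, b):
    """Boolean comparison of single bits: match = AND, miss = XOR."""
    return a & b, a ^ b

def comp_sub(a, b):
    """Comparison by subtraction: match = the smaller input, miss = signed difference."""
    return min(a, b), a - b

def comp_div(a, b):
    """Comparison by division of positive inputs.

    Assumption: the ratio is larger / smaller, so it is always >= 1.
    Match = integer part of the ratio, miss = the ratio itself.
    """
    ratio = max(a, b) / min(a, b)
    return int(ratio), ratio

bool_match, bool_miss = comp_bool(1, 0)
sub_match, sub_miss = comp_sub(5, 3)
div_match, div_miss = comp_div(7, 2)
```

In each case the match is the part of the larger input that is already represented by the smaller one, which is why it can be read as compression.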
These direct similarity measures work if input intensity
corresponds to some measure of stability of an object: mass, energy, hardness. This
is the case in tactile but not in visual input: brightness doesn’t correlate
with inertia or invariance, dark objects are just as stable as bright ones.
Thus, initial match in vision should be defined indirectly, as inverse
deviation of variation in intensity. 1D variation is difference, ratio, etc.,
while multi-D comparison has to combine them into Euclidean distance and
gradient, as in common edge detectors.
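A minimal sketch of this indirect match for 2D input, assuming a 2x2 kernel anchored at the top-left pixel and a hypothetical filter `ave` (average gradient) supplied by the caller. Neither choice is taken from the project's code.

```python
import math

def comp_pixels(image, ave):
    """Compare each pixel to its lower and right neighbors.

    Returns rows of (dy, dx, g, m), where g is the gradient magnitude and
    m = ave - g is the inverse match: positive where variation is below
    the filter, negative where it is above.
    """
    out = []
    for y in range(len(image) - 1):
        row = []
        for x in range(len(image[0]) - 1):
            dy = image[y + 1][x] - image[y][x]   # vertical difference
            dx = image[y][x + 1] - image[y][x]   # horizontal difference
            g = math.hypot(dy, dx)               # Euclidean combination of dy and dx
            row.append((dy, dx, g, ave - g))
        out.append(row)
    return out

derivatives = comp_pixels([[0, 3], [4, 0]], ave=10)
```

Here the single kernel yields dy = 4, dx = 3, g = 5.0 and m = 5.0, so this location counts as a match relative to the filter regardless of how bright the pixels are.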
Patterns, more in part 2:
Cross-comparison among patterns forms match and miss per
parameter, as well as dimensions and distances: external match and miss (these
are separate parameters: value = precision of what * precision of where).
Comparison is limited by max. distance between patterns. Overall hierarchy has
incremental dimensionality: search levels ( param levels ( pattern levels)).., and
pattern comparison is selectively incremental per such level. This is hard to
explain in NL, please see the code, starting with line_Ps and line_PPs.
Resulting matches and misses are summed into lateral match
and miss per pattern. Proximate input patterns with above-average match to
their nearest neighbors are clustered into higher-level patterns. This adds two
pattern levels: of composition and derivation, per level of search. Conditional
cross-comp over incremental range and derivation, among the same inputs, may
also add sub-levels in selected newly formed patterns. On a pixel level,
incremental range is using larger kernels, and incremental derivation starts
with using Laplacian. 
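Reduced to one dimension, incremental range and incremental derivation on the pixel level might look like the sketch below. The 1D simplification and both function names are assumptions made for illustration only.

```python
def diff_at_range(pixels, rng):
    """Incremental range: differences between pixels `rng` positions apart."""
    return [pixels[i + rng] - pixels[i] for i in range(len(pixels) - rng)]

def laplacian_1d(pixels):
    """Incremental derivation: second-order difference p[i-1] - 2*p[i] + p[i+1]."""
    return [pixels[i - 1] - 2 * pixels[i] + pixels[i + 1]
            for i in range(1, len(pixels) - 1)]

pixels = [1, 2, 4, 7, 11]
d_rng1 = diff_at_range(pixels, 1)   # adjacent pixels
d_rng2 = diff_at_range(pixels, 2)   # larger kernel: skipping one pixel
dd = laplacian_1d(pixels)           # difference of differences
```

The 1D Laplacian equals the result of comparing adjacent first-order differences, which is one way to read "higher derivation": the comparands are derivatives formed by the previous comparison.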
Feedback, more in part 3 (needs editing):
Average match is the first order of value filter,
computed on higher levels. There are also positional filters, starting
with pixel size and kernel size, which determine external dimensions of the
input. Quantization (bit, integer, float..) of internal and external filters
corresponds to the order of comparison. The filters are similar to hyperparameters
in Neural Nets, with values updated by feedback. But I have no equivalent of
weight matrix: my learning is connectivity clustering, vs. vertical clustering
via backprop or Hebbian learning.
All filter types represent co-averages to a higher-level
average value, locally projected by higher-level patterns. Clustering on a
filtered level is by the sign of deviation from those filters
(cross-input-element-match - filter), so using averages balances positive and
negative patterns: spans of above- and below-average cross-match in future inputs.
Resulting positive patterns contain
input elements that are both novel: exceeding expectations of higher levels,
and similar to each other: making them predictive of future input.   
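Clustering by the sign of deviation from a filter can be sketched in 1D as below. This is a toy under stated assumptions: match is the direct kind (the smaller of two adjacent inputs), `ave` is a fixed hypothetical filter rather than one projected by higher levels, and the dictionary keys are illustrative.

```python
def form_patterns(pixels, ave):
    """Cluster adjacent comparisons into spans with the same sign of (match - ave)."""
    patterns = []
    for left, right in zip(pixels, pixels[1:]):
        d = right - left            # miss: difference
        m = min(left, right)        # direct match: the smaller comparand
        sign = (m - ave) > 0        # deviation from the filter defines the pattern sign
        if patterns and patterns[-1]["sign"] == sign:
            p = patterns[-1]        # same sign: accumulate into the current pattern
            p["M"] += m
            p["D"] += d
            p["L"] += 1
        else:                       # sign change: terminate the pattern, start a new one
            patterns.append({"sign": sign, "M": m, "D": d, "L": 1})
    return patterns

patterns = form_patterns([10, 12, 11, 3, 2, 9, 10], ave=5)
```

The sequence splits into a positive span, a negative span, and another positive span, each carrying summed match M, summed difference D and length L as parameters for higher-level comparison.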
Hierarchy, part 4 but out of date:
There is a single global hierarchy: feedforward inputs
and feedback filters pass through the same levels of search and composition. Each
higher level is a nested hierarchy, with depth proportional to elevation, but
sub-hierarchies are unfolded sequentially. That’s why I don’t have many
diagrams: they are good at showing relations in 2D, but I have a simple 1D sequence
of levels. Nested sub-hierarchies are generated by the process itself,
depending on elevation in a higher-order hierarchy. That means I can’t show
them in a generic diagram.  
Brain-inspired schemes have separate sensory and motor
hierarchies, in mine they are combined into one. The equivalent of motor patterns in
my scheme are positional filter patterns, which ultimately move the sensor. The
first level is co-located sensors: targets of input filters, and more coarse
actuators: targets of positional filters. I can think of two reasons they are
separated in the brain: neurons and axons are unidirectional, and training
process has to take the whole hierarchy off-line. Neither constraint applies to
my scheme.
Final algorithm will consist of first-level operations +
recursive increment in operations per level. The latter is a meta-algorithm
that extends working level-algorithm, to handle derivatives added to current inputs.
So, the levels are: 1st level: G(x), 2nd level: F(G)(x), 3rd level: F(F(G))(x)..,
where F() is the recursive code increment.
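The G / F notation can be illustrated with higher-order functions. In this toy, G is a stand-in first-level algorithm (adjacent differences) and F merely adds one more comparison of the derivatives produced by the level it wraps. A real increment would add new operations per level, so treat both definitions as placeholders.

```python
def compare_adjacent(values):
    """Stand-in comparison: differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]

def G(x):
    """Stand-in 1st-level algorithm."""
    return compare_adjacent(x)

def F(level_fn):
    """Stand-in recursive increment: extend a level algorithm to also
    process the derivatives that it outputs."""
    def next_level(x):
        return compare_adjacent(level_fn(x))
    return next_level

x = [1, 4, 9, 16, 25]
level_1 = G(x)
level_2 = F(G)(x)
level_3 = F(F(G))(x)
```

Each application of F yields a new callable, so the hierarchy of levels is generated from G and F alone rather than written out per level.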
Resulting hierarchy is a pipeline: patterns are outputted
to the next level, forming a new level if there is none. Given novel inputs, higher
levels will discover longer-range spatio-temporal and then conceptual patterns.
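A toy version of such a pipeline, where a level is created only when something is output to it. The termination rule here (a fixed `span` of two inputs, merged by summation) is a deliberately crude stand-in chosen for illustration, not the criterion described in this outline.

```python
def feed(levels, item, depth=0, span=2):
    """Add `item` to level `depth`; when the level holds `span` items,
    merge them and output the result to the next level."""
    if depth == len(levels):
        levels.append([])                 # form a new level if there is none
    levels[depth].append(item)
    if len(levels[depth]) == span:
        merged = sum(levels[depth])       # stand-in for forming a higher-level pattern
        levels[depth].clear()
        feed(levels, merged, depth + 1, span)

levels = []
for value in [1, 2, 3, 4, 5]:
    feed(levels, value)
```

After five inputs the pipeline has grown to three levels without any level count being specified in advance.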
Some notes:
- There should be a unique set of operations added per
level, hence a singular in “cognitive algorithm”.
- Core design must be done theoretically: generality
requires large upfront investment in process complexity, which makes it a huge
overkill for any specific task. That’s one reason why such schemes are not
explored.
- Many readers note disconnect between abstractions in
this outline, and the amount of detail in current code. That’s because we are
in space-time continuum: search must follow proximity in each dimension, which
requires specific processing. It’s not specific to vision, the process is
mostly the same for all raw modalities. 
- Another complaint is that I don't use mathematical
notation, but it simply doesn't have the flexibility to express deeply
conditional process, with recursively increasing complexity.
- Most people who aspire to work on AGI think in terms
of behavior and robotics. I think this is far too coarse to make progress, the most
significant mechanisms are on the level of perception. Feedforward (perception)
must drive feedback (action), not the other way around.
- Other distractions are supervision and reinforcement.
These are optional task-specific add-ons, core cognitive process is
unsupervised pattern discovery, and main problem here is scaling in complexity.
- Don’t even start me on chatbots.  
Comparison to artificial and biological neural networks
All unsupervised learning is some form of pattern
discovery, by input comparison and clustering. I do both laterally: among
inputs within a level, while in statistical learning they are vertical: between
layers of weighted summation. Weight adjustment from error in final comparison is
a soft clustering: modulated inclusion or exclusion of subsequent
inputs into next output. So, vertical weighted summation is primary to
comparison, which makes the comparands distant. This is a conceptual flaw: comparison
must follow proximity.
Neural Nets is a version of statistical learning, I think
it is best understood as centroid clustering (centroid
doesn’t have to be a single value, fitted line in linear regression can be
considered a one-dimensional centroid). Basic ANN is a multi-layer perceptron: each node weighs the inputs at synapses,
then sums and thresholds them into output. This normalized sum of inputs is
their centroid. Output of the top layer is compared to some template, forming
an error. Stochastic Gradient Descent then backpropagates the error, training
initially random weights into transformations (reversed vertical derivatives) that
reduce future error.
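The reading of a node's output as a centroid can be shown with a single node. Dividing by the sum of weights is this sketch's interpretation of "normalized sum" and assumes positive weights. A standard perceptron applies an activation function to the raw weighted sum instead.

```python
def node_output(inputs, weights, threshold=0.0):
    """Weigh inputs, sum, normalize by total weight, then threshold.

    Assumption: weights are positive, so the normalized sum is a weighted
    mean (centroid) of the inputs.
    """
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    centroid = weighted_sum / sum(weights)
    return centroid if centroid > threshold else 0.0

equal = node_output([1.0, 3.0], [1.0, 1.0])      # plain mean of the inputs
skewed = node_output([1.0, 3.0], [3.0, 1.0])     # pulled toward the heavier input
silent = node_output([1.0, 3.0], [1.0, 1.0], threshold=5.0)
```

Training changes the weights, which moves this centroid toward the inputs that reduce the error.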
That usually means training CNN to perform some sort of
edge-detection or cross-correlation (same as my comparison but the former terms
lose meaning on higher levels of search). But CNN operations are initially
random, while my process is designed for cross-comp from the start. This is why
it can be refined by my feedback, updating the filters, which is far more subtle
and selective than training by backprop. 
So, I have several problems with basic process in ANN:
- Vertical learning (via feedback of error) takes tens of
thousands of cycles to form accurate representations. That's because summation
per layer degrades positional input resolution. With each added layer, the
output that ultimately drives learning contains exponentially smaller fraction
of original information. Cross-comp and clustering is far more complex per
level, but the output contains all information of the input. Lossy selection is
only done on the next level, after evaluation per pattern (vs. before
evaluation in statistical methods). 
- Both initial weights and sampling that feeds SGD are
randomized. Also driven by random variation are RBMs, GANs, VAEs, etc. But randomization
is antithetical to intelligence, it's only useful in statistical methods
because they merge inputs with weights irreversibly. Thus, any non-random
initialization and variation will introduce bias. All input modification in my
scheme is via hyper-parameters, stored separately and then used to normalize
(remove bias) inputs for comparison to inputs formed with different-value hyper-parameters.
- SGD minimizes error (top-layer miss), which is
quantitatively different from maximizing match: compression. And that error is
w.r.t. some specific template, while my match is summed over all past input /
experience. The “error” here is plural: lateral misses (differences, ratios,
etc.), computed by cross-comparison within a level. All inputs represent
environment and have positive value. But then they are packed (compressed) into
patterns, which have different range and precision, thus different relative
value per relatively fixed record cost.
- Representation in ANN is fully distributed, similar to
the brain. But the brain has no alternative: there is no substrate for local
memory or program in neurons. Computers have RAM, so parallelization is a
simple speed vs. efficiency trade-off, useful only for complex semantically
isolated nodes. Such nodes are patterns, encapsulating a set of co-derived
“what” and “where” parameters. This is similar to neural ensemble, but
parameters that are compared together should be localized in memory, not
distributed across a network.
More basic neural learning mechanism is Hebbian, though
it is rarely used in ML. Conventional spiking version is that weight is
increased if the synapse often receives a spike just before the node fires,
else the weight is decreased. But input and output don't have to be binary, the
same logic can be applied to scalar values: the weight is increased / decreased
in proportion to some measure of similarity between its input and following
output of the node. That output is normalized sum of all inputs, or their
centroid.
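A scalar version of this rule might look like the sketch below. The similarity measure is an assumption: each input's absolute deviation from the output, relative to the mean deviation across inputs, so that weights of closer-than-average inputs grow and the rest shrink. The name `hebbian_step` and the `rate` parameter are illustrative.

```python
def hebbian_step(inputs, weights, rate=0.1):
    """One local update: adjust each weight by similarity of its input to the output."""
    # Output is the normalized weighted sum of inputs: their centroid
    output = sum(w * x for w, x in zip(weights, inputs)) / sum(weights)
    deviations = [abs(x - output) for x in inputs]
    mean_dev = sum(deviations) / len(deviations)
    # Assumed similarity: below-average deviation is positive, above-average is negative
    new_weights = [w + rate * (mean_dev - dev) for w, dev in zip(weights, deviations)]
    return output, new_weights

output, new_weights = hebbian_step([1.0, 1.0, 4.0], [1.0, 1.0, 1.0])
```

The two inputs near the centroid gain weight and the outlier loses it, so repeated steps pull the centroid toward the denser group of inputs.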
Such learning is local, within each node. But it’s still a
product of vertical comparison: centroid is higher order of composition than
individual inputs. This comparison across composition drives all statistical
learning, but it destroys positional information at each layer. Compared to
autoencoders: main backprop-driven unsupervised learning technique, Hebbian learning
lacks the decoding stage (as does the proposed algorithm). Decoding decomposes hidden
layers, to equalize composition orders of output and compared template.
Inspiration by the brain kept ANN research going for
decades before they became useful. Their “neurons” are mere stick figures, but
that’s not a problem, most of neuron’s complexity is due to constraints of
biology. The problem is that core mechanism in ANN, weighted summation, may
also be a no-longer needed compensation for such constraints: neural memory
requires dedicated connections. That makes representation and cross-comparison
of individual inputs very expensive, so they are summed. But again, we now have
dirt-cheap RAM.
Other biological constraints are very slow neurons, and the
imperative of fast reaction for survival in the wild. Both favor fast though
crude summation, at the cost of glacial training. Reaction speed became less
important: modern society is quite secure, while continuous learning is far
more important because of accelerating progress. Summation also reduces noise,
which is very important for neurons that often fire at random, to initiate and
maintain latent connections. But that’s irrelevant for electronic circuits.
Evolution is extremely limited in complexity that can be
added before it is pruned by natural selection, I see no way it could produce
proposed algorithm. And that selection is for reproduction, while intelligence is distantly instrumental. The brain evolved to guide the
body, with neurons originating as instinctive stimulus-to-response converters.
Hence, both SGD and Hebbian learning are fitting, driven by feedback of action-triggering
weighted input sum. Pattern discovery is their instrumental upshot, not an
original purpose.
Uri Hasson, Samuel Nastase, Ariel Goldstein reach a
similar conclusion in “Direct fit to nature: an evolutionary perspective on
biological and artificial neural networks”: “We argue that neural computation is grounded in
brute-force direct fitting, which relies on over-parameterized optimization
algorithms to increase predictive power (generalization) without explicitly
modeling the underlying generative structure of the world. Although
ANNs are indeed highly simplified models of BNNs, they belong to the same
family of over-parameterized, direct-fit models, producing solutions that are
mistakenly interpreted in terms of elegant design principles but in fact
reflect the interdigitation of ‘‘mindless’’ optimization processes and the
structure of the world.”
Comparison to Capsule Networks
The nearest
experimentally successful method is recently introduced “capsules”. Some similarities to CogAlg:
- capsules also
output multivariate vectors, “encapsulating” several parameters, similar to my
patterns,
- these parameters
also include pose: coordinates and dimensions, compared to compute transformations,
- these transformations
are compared to find affine transformations or equivariance: my match of
misses,
- capsules also send
direct feedback to lower layer: dynamic routing, vs. trans-hidden-layer
backprop in ANN.
My main problems
with CapsNet and alternative treatment:
- Object is defined as a recurring configuration of different parts. But such recurrence
can’t be assumed, it should be derived by cross-comparing relative position
among parts of matching objects. This can only be done after their positions
are cross-compared, which is after their objects are cross-compared: two levels
above the level that forms initial objects. So, objects formed by positional
equivariance would be secondary, though they may displace initial segmentation
objects as a primary representation. Stacked
Capsule Autoencoders also have exclusive
segmentation on the first layer, but proximity doesn’t matter on their higher
layers.
- Routing by agreement is basically recursive centroid clustering, by match of input
vector to the output vector. The output (centroid) represents inputs at all
locations, so its comparison to inputs is effectively mixed-distance. Thus,
clustering in CapsNet is fuzzy and discontinuous, forming redundant
representations. Routing by agreement reduces that redundancy, but not
consistently so, it doesn’t specifically account for it. My default clustering
is exclusive segmentation: each element (child) initially belongs to one
cluster (parent). Fuzzy clustering is selective to inputs valued above the cost
of adjusting for overlap in representation, which increases with the range of
cross-comparison. This conditional range increase is done on all levels of
composition.
- Instantiation parameters are application-specific, CapsNet has no general mechanism to derive
them. My general mechanism is cross-comparison of input capsule parameters,
which forms higher-order parameters. First level forms pixel-level gradient,
similar to edge detection in CNN. But then it forms proximity-constrained
clusters, defined by gradient and parameterized by summed pixel intensity, dy,
dx, gradient, angle. This cross-comparison followed by clustering is done on
all levels, with incremental number of parameters per input.
- Number of layers is fixed, while I think it should be incremental with experience. My
hierarchy is a dynamic pipeline: patterns are displaced from a level by
criterion sign change and sent to existing or new higher level. So, both
hierarchy of patterns per system and sub-hierarchy of derivatives per pattern
expand with experience. The derivatives are summed within a pattern, then
evaluated for extending intra-pattern search and feedback.
- Output vector of higher capsules combines parameters of all lower layers into
Euclidean distance. That is my default too, but they should also be kept
separate, for potential cross-comp among layer-wide representations.
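The first-level clusters mentioned in the list above, parameterized by summed pixel intensity, dy, dx, gradient and angle, could be summarized as in this sketch. It assumes the per-pixel derivatives are already computed and grouped into one cluster. Taking the angle from the summed Dy and Dx, the `area` field and the key names are assumptions, not the project's data structure.

```python
import math

def summarize_blob(pixels):
    """Sum per-pixel parameters of one cluster.

    `pixels` is a list of (intensity, dy, dx) tuples that belong to the cluster.
    """
    I = Dy = Dx = 0
    G = 0.0
    for intensity, dy, dx in pixels:
        I += intensity
        Dy += dy
        Dx += dx
        G += math.hypot(dy, dx)           # per-pixel gradient magnitude
    return {
        "I": I, "Dy": Dy, "Dx": Dx, "G": G,
        "angle": math.atan2(Dy, Dx),      # assumed: direction of the summed gradient
        "area": len(pixels),
    }

blob = summarize_blob([(10, 3, 4), (12, 0, 5)])
```

These summed parameters are what a higher level would cross-compare between clusters, which adds another layer of derivatives per parameter.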
Overall, CapsNet is
a variation of ANN, with input summation first and dynamic routing second. So,
it’s a type of Hebbian learning, with most of the problems that I listed in the
previous section.
Elaboration, parts 4 and below are out of date:
0. Cognition vs. evolution, analog vs. symbolic initial input
Some say intelligence can
be recognized but not defined. I think that’s absurd: we recognize some
implicit definition. Others define intelligence as a problem-solving ability,
but the only general problem is efficient search for solutions. Efficiency is a
function of selection among inputs, vs. brute-force all-to-all search. This
selection is by predicted value of the inputs, and prediction is interactive
projection of their patterns. Some agree that intelligence is all about pattern
discovery, but define pattern as a crude statistical coincidence.
Of course, the only
mechanism known to produce human-level intelligence is even cruder, and that
shows in haphazard construction of our brains. Algorithmically simple,
biological evolution alters heritable traits at random and selects those with
above-average reproductive fitness. But this process requires almost
inconceivable computing power because selection is extremely coarse: on the
level of whole genome rather than individual traits, and also because
intelligence is only one of many factors in reproductive fitness.
Random variation in
evolutionary algorithms, generative RBMs, and so on, is antithetical to intelligence. Intelligent
variation must be driven by feedback within cognitive hierarchy: higher levels
are presumably “smarter” than lower ones. That is, higher-level inputs
represent operations that formed them, and are evaluated to alter future
lower-level operations. Basic operations are comparison and summation among
inputs, defined by their range and resolution, analogous to reproduction in
genetic algorithms.
Range of comparison per
conserved-resolution input should increase if projected match (cognitive
fitness function) exceeds average match per comparison. In any non-random
environment, average match declines with the distance between comparands. Thus,
search over increasing distance requires selection of above-average
comparands. Any delay, coarseness, and inaccuracy of such selection is
multiplied at each search expansion, soon resulting in combinatorial explosion
of unproductive (low additive match) comparisons.
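A toy illustration of match-conditional range extension in 1D. It assumes direct match (the smaller comparand) and uses the last observed match as a crude stand-in for projected match. The function name and the fixed filter `ave` are hypothetical.

```python
def extend_search(inputs, start, ave):
    """Compare inputs[start] to increasingly distant inputs while match stays above ave.

    Returns (range reached, accumulated match).
    """
    anchor = inputs[start]
    rng = 0
    total_match = 0
    for other in inputs[start + 1:]:
        m = min(anchor, other)        # direct match with the next, more distant input
        if m <= ave:                  # below-average match: stop extending the range
            break
        total_match += m
        rng += 1
    return rng, total_match

reached, accumulated = extend_search([9, 8, 7, 3, 8], start=0, ave=5)
```

The search stops at the first below-average comparison and never reaches the last input, so the cost of comparison is spent only where match was above average so far.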
Hence, my model is strictly
incremental: search starts with minimal-complexity inputs and expands with
minimal increments in their range and complexity (syntax). At each level, there
is only one best increment, projected to discover the greatest additive match.
No other AGI approach follows this principle.
I guess people who aim for
human-level intelligence are impatient with small increments and simple sensory
data. Yet, this is the most theoretical problem ever, demanding the longest
delay in gratification.
Symbolic obsession and its discontents
Current Machine Learning and related theories (AIT, Bayesian inference, etc.) are largely
statistical also because they were developed primarily for symbolic data. Such
data, pre-compressed and pre-selected by humans, is far more valuable than
sensory inputs it was ultimately derived from. But due to this selection and compression,
proximate symbols are not likely to match, and partial match between them is very
hard to quantify. Hence, symbolic data is a misleading initial target for
developing conceptually consistent algorithm.
Use of symbolic data as
initial inputs in AGI projects betrays profound misunderstanding of cognition.
Even children, predisposed to learn language, only become fluent after years of
directly observing things their parents talk about. Words are mere labels for
concepts, the most important of which are spatio-temporal patterns, generalized
from multi-modal sensory experience. Top-down reconstruction of such patterns
solely from correlations among their labels should be exponentially more
difficult than their bottom-up construction.
All our knowledge is
ultimately derived from senses, but lower levels of human perception are
unconscious. Only generalized concepts make it into our consciousness, AKA declarative memory, where we assign them
symbols (words) to facilitate communication. This brain-specific constraint
creates heavy symbolic vs. sub-symbolic bias, especially strong in artificial
intelligentsia. Which is putting a cart in front of a horse: most words are meaningless
unless coupled with implicit representations of sensory patterns.
To be incrementally
selective, a cognitive algorithm must exploit proximity first, which is only
productive for continuous and loss-tolerant raw sensory data. Symbolic
data is already compressed: consecutive characters and words in text won’t
match. It’s also encoded with distant cross-references that are hardly ever
explicit outside of a brain. Text looks quite random unless you know the code:
operations that generalized pixels into patterns (objects, processes,
concepts). That means any algorithm designed specifically for text will not be
consistently incremental in the range of search, which will impair its
scalability.
In Machine Learning, input
is a string, frame, or video sequence of a defined length, with artificial
separation between training and inference. In my approach, learning is
continuous and interactive. Initial inputs are streamed pixels of maximal
resolution, and higher-level inputs are multi-variate patterns formed by
comparing lower-level inputs. Spatio-temporal range of inputs, and selective
search across them, is extended indefinitely. This expansion is directed by
higher-level feedback, just as it is in human learning.
Everything ever written is
related to my subject, but nothing is close enough: no other method is meant
to be fully consistent. Hence a dire scarcity of references here. My approach
is self-contained, it doesn’t require references. But it does require clean
context, hopefully cleaned up by the reader’s introspective generalization.
1. Atomic comparison: quantifying match and miss between two variables
First, we need to quantify
predictive value. Algorithmic information theory defines it as
compressibility of representation, which is perfectly fine. But compression is
currently computed only for sequences of inputs, while I think a logical start
is analog input digitization: a rock bottom of organic compression hierarchy.
The next level is cross-comparison among resulting pixels, commonly known as
edge detection, and higher levels will cross-compare resulting patterns. Partial
match computed by comparison is a measure of compression.
Partial match between two
variables is a complement of miss, in the corresponding power of comparison:
- Boolean match is AND and miss is XOR (two zero inputs form zero
match and zero miss),
- comparison by subtraction
increases match to the smaller comparand and reduces miss to the difference,
- comparison by division
increases match to min * integer part of the ratio and reduces miss to the
fractional part.
(Direct match works for
tactile input, but reflected light in vision requires an inverse definition of
initial match.)
In other words, match is a
compression of larger comparand’s magnitude by replacing it with miss. Which
means that match = smaller input: a common subset of both inputs, = sum of AND between their uncompressed
(unary code) representations. Ultimate criterion is recorded magnitude, rather
than bits of memory it occupies, because the former represents physical impact
that we want to predict. The volume of memory used to record that magnitude
depends on prior compression, which is not an objective parameter. 
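The three powers of comparison listed above can be sketched in Python. This is only an illustration of the definitions in this part, not the project's code: the function names (comp_bool, comp_sub, comp_div, unary_and_sum) are hypothetical, inputs are assumed to be non-negative integers, and comparison by division assumes the smaller comparand is the divisor and is non-zero.

```python
def comp_bool(a, b):
    """Boolean comparison: match is AND, miss is XOR."""
    return a & b, a ^ b


def comp_sub(a, b):
    """Comparison by subtraction: match is the smaller comparand, miss is the signed difference."""
    return min(a, b), b - a


def comp_div(a, b):
    """Comparison by division: match is min * integer part of the ratio, miss is the remainder."""
    small, large = sorted((a, b))
    if small == 0:
        raise ValueError("zero comparand has no ratio")  # assumption: not covered in the text
    multiple, remainder = divmod(large, small)
    return small * multiple, remainder


def unary_and_sum(a, b):
    """Sum of AND between unary (uncompressed) codes of a and b, which equals min(a, b)."""
    width = max(a, b)
    unary_a = [1] * a + [0] * (width - a)
    unary_b = [1] * b + [0] * (width - b)
    return sum(x & y for x, y in zip(unary_a, unary_b))


print(comp_bool(0, 0))      # (0, 0): two zero inputs form zero match and zero miss
print(comp_sub(5, 7))       # (5, 2)
print(comp_div(3, 10))      # (9, 1)
print(unary_and_sum(5, 7))  # 5
```

The last function is there only to show why match = smaller input: in unary code the common subset of both inputs is exactly the shorter run of ones.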
Some may object that match
includes the case when both inputs equal zero, but then match should also be
zero. The purpose here is prediction, which represents conservation of some physical
property of observed objects. Ultimately, we’re predicting potential impact on
observer, represented by input. Zero input means zero impact, which has no
conservable property (inertia), thus no intrinsic predictive value.
Given incremental
complexity, initial inputs should have binary resolution and implicit
coordinate (which is a macro-parameter, so its resolution lags that of an
input). Compression of bit inputs by AND is well known as digitization:
substitution of two lower 1 bits with one higher 1 bit. Resolution of coordinate
(input summation span) is adjusted by feedback to form integers that are large
enough to produce above-average match. 
Next-order compression is
comparison between consecutive integers, with binary (before | after)
coordinate.
Additive match is achieved
by comparison of a higher power than that which produced comparands: AND will
not further compress integers digitized by AND. Rather, initial comparison
between integers is by subtraction, resulting difference is miss, and smaller
input is absolute match. Compression of represented magnitude is by replacing
i1, i2 with their derivatives: match (min) and miss (difference). If we sum
each pair:
inputs: 5 + 7 -> 12; derivatives: match (5) + miss (2) -> 7.
Compression by replacement = match: 12 - 7 -> 5.
Difference is smaller than XOR (non-zero complementary of AND) because XOR may
include opposite-sign (opposite-direction) bit pairs 0, 1 and 1, 0, which are
cancelled-out by subtraction.
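The arithmetic above, and the claim that difference is never larger than XOR, can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes non-negative integer inputs and takes miss as the magnitude of the difference.

```python
def compression_by_sub(i1, i2):
    """Summed inputs minus summed derivatives (match + miss); this equals the match."""
    match = min(i1, i2)
    miss = abs(i1 - i2)
    return (i1 + i2) - (match + miss)


def xor_vs_difference(a, b):
    """Return (XOR, |difference|) for comparison of the two kinds of miss."""
    return a ^ b, abs(a - b)


print(compression_by_sub(5, 7))  # 5
print(xor_vs_difference(7, 8))   # (15, 1): opposite-direction bit pairs cancel in subtraction
print(xor_vs_difference(5, 7))   # (2, 2): no opposite-direction pairs, so both are equal
```

For 7 and 8, XOR counts every differing bit position (0111 vs. 1000), while subtraction nets them out to 1.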
Comparison by division
forms ratio, which is a compressed difference. This
compression is explicit in long division: match is accumulated over iterative
subtraction of smaller comparand from remaining difference. In other words,
this is also a comparison by subtraction, but between different orders of
derivation. Resulting match is smaller comparand * integer part of ratio, and miss is the final remainder or
fractional part of ratio. The ratio can be further
compressed by converting it to radix or logarithm, and so on.
By reducing miss, higher-power
comparison increases complementary match (match = larger input - miss):
to be compressed:   larger input   |   XOR                   |   difference: combined current-order match & miss
additive match:     AND            |   opposite-sign XOR     |   multiple: of a smaller input within a difference
remaining miss:     XOR            |   difference            |   fraction: complementary to multiple within a ratio
But the costs of operations
and incidental sign, fraction, irrational fraction, etc. may grow even faster.
To justify the costs, the power of comparison should only increase in patterns of
above-average match from prior order of comparison: AND for bit inputs, SUB for
integer inputs, DIV for pattern inputs, etc. Inclusion into such patterns is by relative match: match - ave,
where ave is past match that co-occurs with average higher-level match.
Match value should be weighted
by the correlation between input intensity and its stability: mass / energy /
hardness of an observed object. Initial
input, such as reflected light, is likely to be incidental: such correlation is
very low. Since match is the magnitude of smaller input, its weight should also
be low if not zero. In this case projected match consists mainly of its inverse
component: match cancellation by co-derived miss, see below. 
The
above discussion is on match from current comparison, but we really want to
know projected match to future or distant inputs. That means the value of match
needs to be projected by co-derived miss. In comparison by subtraction:
projected match = min (i1, i2) * weight (fractional) - difference (i1, i2) / 2.
The difference is divided by 2 because it only reduces projected input, and thus
min (input, projected input), in the direction in which it is negative. It doesn’t
affect min in the direction where projected input is increasing.
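A minimal Python sketch of the projected match formula for comparison by subtraction. The function name and the weight argument are hypothetical, and the sketch assumes that "difference" in the formula means its magnitude, which the text does not state explicitly.

```python
def projected_match(i1, i2, weight):
    """Projected match = min * weight - |difference| / 2.

    weight: fractional correlation between input intensity and its stability.
    Assumption: the difference enters by magnitude, halved because it only
    reduces the projected min in one of the two directions.
    """
    match = min(i1, i2)
    difference = i2 - i1
    return match * weight - abs(difference) / 2


print(projected_match(5, 7, 1.0))  # 4.0
print(projected_match(5, 7, 0.0))  # -1.0
```

With zero weight, as suggested above for incidental inputs such as reflected light, the result is purely the inverse component: match cancellation by co-derived miss.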
quantifying lossy compression
There is a general
agreement that compression is a measure of similarity, but no one seems to apply
it from the bottom up, the bottom being single scalars. Also, any significant
compression must be lossy. This is currently evaluated by perceived similarity of
reconstructed input to the original input, as well as compression rate. Which is
very coarse and subjective. Compression in my level of search is lossless, represented
by match on all levels of pattern. All derived representations are redundant, so
it’s really an expansion vs. compression overall.  
The lossy part comes after
evaluation of resulting patterns on the next level of search. Top level of
patterns is cross-compared by default, evaluation is per lower level: of incremental
derivation and detail in each pattern. Loss is when low-relative-match buffered
inputs or alternative derivatives are not cross-compared. Such loss is
quantified as the scope * resolution of representation in these lower levels,
not some subjective quality. 
2. Forward search and patterns, implementation for image recognition in video
A pattern is a contiguous
span of inputs that form above-average matches, similar to a conventional cluster.
As explained above, matches
and misses (derivatives) are produced by comparing consecutive inputs. These
derivatives are summed within a pattern and then compared between patterns on
the next level of search, adding new derivatives to a higher pattern. Patterns
are defined contiguously on each level, but positive and negative patterns are
always interlaced, thus next-level same-sign comparison is discontinuous.
Negative patterns represent
contrast or discontinuity between positive patterns, which is a one- or higher-
dimensional equivalent of difference between zero-dimensional pixels. As with
differences, projection of a negative pattern competes with projection of
adjacent positive pattern. But match and difference are derived from the same
input pair, while positive and negative patterns represent separate spans of
inputs.
Negative match patterns are
not predictive on their own but are valuable for allocation: computational
resources of a no-longer-predictive pattern should be used elsewhere. Hence, the
value of negative pattern is borrowed from predictive value of co-projected
positive pattern, as long as combined additive match remains above average. Consecutive
positive and negative patterns project over same future input span, and these
projections partly cancel each other. So, they should be combined to form
feedback, as explained in part 3.
Initial match is evaluated
for inclusion into higher positive or negative pattern. The value is summed
until its sign changes, and if positive, evaluated again for cross-comparison
among constituent inputs over increased distance. Second evaluation is
necessary because the cost of incremental syntax generated by cross-comparing
is per pattern rather than per input. Pattern is terminated and outputted to
the next level when value sign changes. On the next level, it is compared to
previous patterns of the same compositional order.
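The forward pass described above can be sketched for a single 1D sequence. This is a simplified illustration, not the project's code: the Pattern class, its field names and the ave filter are hypothetical, match is taken as min of the comparands (the general definition from part 1, not the vision-specific one), and the second evaluation for extended cross-comparison within positive patterns is omitted.

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    sign: bool   # True if the relative match (match - ave) is positive
    L: int = 0   # number of comparisons in the span
    I: int = 0   # summed inputs
    D: int = 0   # summed differences (miss)
    M: int = 0   # summed matches
    V: int = 0   # summed value: match - ave


def form_patterns(inputs, ave):
    """Compare consecutive inputs and sum derivatives until the value sign changes."""
    patterns = []
    P = None
    for prior, current in zip(inputs, inputs[1:]):
        d = current - prior      # miss
        m = min(current, prior)  # match
        v = m - ave              # relative match
        sign = v > 0
        if P is not None and sign != P.sign:
            patterns.append(P)   # terminate and output to the next level
            P = None
        if P is None:
            P = Pattern(sign)
        P.L += 1
        P.I += current
        P.D += d
        P.M += m
        P.V += v
    if P is not None:
        patterns.append(P)
    return patterns


for P in form_patterns([10, 12, 11, 3, 2, 9, 10], ave=5):
    print(P)
```

On this input the sketch yields three interlaced patterns: positive (L=2, M=21), negative (L=3, M=7) and positive (L=1, M=9). Next-level comparison would then run between same-sign patterns, which is discontinuous because of the interlacing.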
Initial inputs are pixels
of video, or equivalent limit of positional resolution in other modalities.
Hierarchical search on higher levels should discover patterns representing
empirical objects and processes, and then relational logical and mathematical
shortcuts, eventually exceeding generality of our semantic concepts. 
In cognitive terms,
everything we know is a pattern, the rest of input is noise, filtered out by
perception. For online learning, all levels should receive inputs from lower
levels and feedback from higher levels in parallel.
space-time dimensionality and initial implementation
Any prediction has two
components: what and where. We must have both: value of prediction = precision
of what * precision of where. That “where” is currently neglected: statistical
ML represents space-time at greatly reduced resolution, if at all. In the brain
and some neuromorphic models, “where” is represented in a separate
network. That makes transfer of positional information very expensive and
coarse, reducing predictive value of representations. There is no such separation
in my patterns, they represent both what and where as local vars.
My core algorithm is 1D:
time only (part 4). Our space-time is 4D, but each of  these dimensions
can be mapped on one level of search. This way, levels can select input
patterns that are strong enough to justify the cost of representing additional
dimension, as well as derivatives (matches and differences) in that dimension.
Initial 4D cycle of search
would compare contiguous inputs, similarly to connected-component analysis:
level 1 compares
consecutive 0D pixels within horizontal scan line, forming 1D patterns: line
segments.
level 2 compares contiguous
1D patterns between consecutive lines in a frame, forming 2D patterns: blobs.
level 3 compares contiguous
2D patterns between incremental-depth frames, forming 3D patterns: objects.
level 4 compares contiguous
3D patterns in temporal sequence, forming 4D patterns: processes.
(in simple video, time is added on level 3
and depth is computed from derivatives)  
Subsequent cycles would
compare 4D input patterns over increasing distance in each dimension, forming
longer-range discontinuous patterns. These cycles can be coded as an
implementation shortcut, or formed by feedback of the core algorithm itself, which should
be able to discover maximal dimensionality of inputs. “Dimension” here is a
parameter that defines external sequence and distance among inputs. This is
different from conventional clustering, where both external and internal
parameters are dimensions. More in part 6.
However, average match at a given distance in
our space-time is presumably equal over all four dimensions. That means
patterns defined in fewer dimensions will be fundamentally limited and biased
by the angle of scanning. Hence, initial pixel comparison and clustering into
patterns should also be over 4D at once, or at least over 2D for images and 3D
for video. This is our-universe-specific extension of my core algorithm. 
There
is also a vision-specific adaptation in the way I define initial match. Predictive
visual property is albedo, which means locally stable ratio of brightness /
intensity. Since lighting is usually uniform over much larger area than pixel,
the difference in brightness between adjacent pixels should also be stable. Relative
brightness indicates some underlying property, so it should be cross-compared
to form patterns. But it’s reflected: doesn’t really represent physical quantity /
density of an object. Thus, initial match
is inverse deviation of gradient. 
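One possible reading of "inverse deviation of gradient", sketched in Python for a 1D row of pixels. This is an assumption rather than a statement of the project's code: match is taken as ave - |d|, where d is the difference between adjacent pixels and ave is a hypothetical filter (average gradient magnitude that would be supplied by feedback).

```python
def comp_pixels(pixels, ave):
    """Return (difference, inverse match) per pair of adjacent pixels.

    Assumption: match = ave - |difference|, so a below-average gradient
    gives positive match and an above-average gradient gives negative match.
    """
    derivatives = []
    for prior, current in zip(pixels, pixels[1:]):
        d = current - prior
        m = ave - abs(d)
        derivatives.append((d, m))
    return derivatives


print(comp_pixels([10, 12, 30], ave=5))  # [(2, 3), (18, -13)]
```

Unlike min-based match, this value does not grow with brightness itself, only with the stability of brightness across adjacent pixels.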
We
are currently coding 1st level algorithm: https://github.com/boris-kz/CogAlg/wiki. 1D code is complete, but not
macro-recursive. We are extending it to 2D for image recognition, then to 3D
video for object and process recognition. Higher levels for each D-cycle algorithm
will process discontinuous search among full-D patterns. Complete hierarchical (meta-level)
algorithm will consist of: 
- 1st level algorithm: contiguous cross-comparison over full-D cycle, plus bit-filter
feedback
- recurrent increment in complexity, extending current-level alg to next-level
alg. It will unfold increasingly complex higher-level input patterns for
cross-comparison, then combine results for evaluation and feedback.
We
will then add colors, maybe audio and text. Initial testing could be
recognition of labeled images, but 2D is a poor representation of our 4D world,
video or stereo video is far better. Variation across space is a product of
past interactions, thus predictive of variation over time (which is normally
lower: we can’t speed-up time).
3. Feedback filters, attentional input selection, imagination, motor action
After evaluation for
inclusion into higher-level pattern, the input is also accumulated into
feedback to lower levels. Feedback is an update to filters that evaluate forward (Λ) and feedback (V), as described above but on
lower level. Feedback value = absolute value of summed input parameter -
filter-filter (opportunity cost of filter update). Default feedback is combined
level-sequentially, while more expensive shortcut feedback may be sent to
selected levels to filter inputs that are already in the pipeline, or to
rearrange levels in the hierarchy.
There is internal filter
for each compared variable of an input, and external filter per coordinate in
which the inputs are ordered. Basic internal filter is average projected match
that co-occurs with (predicts) average higher-level match, and basic external filter
is a distance to the next input of projected average value. Thus, coordinate
filter is a span of inputs skipped because they are projected to be either too
predictable or too noisy to bother with. External filters have lower resolution
/ higher scope at the same order of quantization. 
Both input and coordinate
filters discussed above are integers, but they can be of any order of
quantization. Binary filters are the least and the most significant bits of
input value and coordinate (input summation span). For coordinate filter, LSB
is pixel size and MSB is frame size. These filters are adjusted to balance
overflow and underflow. Then there are higher-than-integer filters: ratios or
coefficients, AKA weights, and so on. They adjust magnitude per input variable
type, in proportion to relative higher-level match of these variables.
Lower filters are min
values for input inclusion in higher-composition inputs, and upper filters are max
values that trigger higher-input termination: bit -> pixel -> pattern
-> pattern_of_patterns  (code starts
from pixels). 
The number of updateable filters
will increase with elevation:
1st level may
update only: 
- value bit filters:  lower: LSB, upper: word size -> MSB, and
- coord lower bit filter:
LSB, which is a pixel size
2nd level may
also update:
- coord upper bit filters,
such as frame dimensions -> coordinate MSB, causing premature P termination
- value integer filters:
lower: average match, upper: max Match -> premature average match feedback
3rd level may
add:
- coord lower integer filter:
starting coordinate (next C or skip-to distance), and
- coord upper integer filter:
max next C -> premature next-C feedback
Etc.
novelty vs. generality
 
Any system must have a common
fitness function or selection criterion. Two obvious criteria in cognition are
novelty and generality: miss and match. But we can’t select for both, they
exhaust all possibilities. Novelty can’t be primary criterion: it would select
for noise and filter out all patterns, which are defined by match. On the other
hand, to maximize match of inputs to memory we can stare at a wall: lock into
predictable environments. But of course, natural curiosity actively skips
predictable locations, thus reducing the match.
 
This dilemma is resolved if
we maximize predictive power: projected vs. confirmed match of inputs to
records (all records are predictions, else they are forgotten). To the extent
that new match is predictable, it doesn’t add to total projected match of the
model. But neither does noise: novelty (difference from records) of inputs that
won’t match in the future. So, match is positive in feedforward but negative in
feedback: the sign is reversed with direction. Projected match is the same as
compression, which includes skipping low-value input spans.
  
We can see this in
individual derivatives:
- higher-level match is
specific to past inputs, thus it’s a filter for future inputs, projected from
the past.
- higher-higher level match
of a match is more detached from specific inputs, thus less accurate as a
filter.
Conversely, it
projects higher match among future inputs, independently of their match to
past inputs.
And so on, higher
derivation orders of match are increasingly positive (less filtering) for
future inputs.
So, selection for novelty
is done by subtracting higher-level projection from corresponding input parameter.
Higher-order positional selection is skipping (or avoiding processing) predictable
future input spans. Skipped input span is formally a *coordinate* filter feedback:
next coordinate of inputs with expected above-average *additive* predictive value.
Thus, next input location is selected first by proximity and then by novelty,
both relative to a template comparand. This is covered in more detail in part
4, level 3.
Vertical evaluation
computes deviations, to form positive or negative higher-level patterns. Evaluation
is relative to higher-level averages, which represent past inputs, thus should
be projected over feedback delay: average += average
difference * (delay / average span) /2. Average per input variable may also be
a feedback, representing redundancy to higher level, which also depends on higher-level
match rate: rM = match / input. 
If rM > average per cost
of processing: additive match = input match - input-to-average match * rM.
Lateral comparison computes
differences, to project corresponding parameters of all derivation orders:
difference in magnitude of
initial inputs: projected next input = last input + difference/2,
difference in input match,
a subset of magnitude: projected next match = last match + match difference/2,
difference in match of
match, a sub-subset of magnitude, projected correspondingly, and so on.
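The two kinds of projection above, vertical (average over feedback delay) and lateral (next parameter from last difference), can be sketched as follows. Function and argument names are hypothetical; the formulas are copied from the text.

```python
def project_average(average, average_difference, delay, average_span):
    """Vertical: average += average difference * (delay / average span) / 2."""
    return average + average_difference * (delay / average_span) / 2


def project_next(last, difference):
    """Lateral: projected next parameter = last parameter + its difference / 2.

    The same form applies at each derivation order: input, match, match of match.
    """
    return last + difference / 2


print(project_average(10.0, 4.0, delay=2, average_span=4))  # 11.0
print(project_next(7, 2))    # 8.0, projected next input
print(project_next(5, -2))   # 4.0, projected next match
```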
Ultimate criterion is top
order of match on a top level of search: the most predictive parameter in a
system.
 
imagination, planning, action
Imagination is never truly original,
it can only be formalized as interactive projection of known patterns. As
explained above, patterns send feedback to filter lower-level sources. This feedback
is to future sources, where the patterns are projected to continue or re-occur.
Stronger upstream patterns and correspondingly higher filters reduce resolution
of or totally skip predictable input spans. But when multiple originally
distant patterns are projected into the same location, their feedback cancels
out in proportion to their relative difference.
In other words, combined
filter is cancelled-out to the extent that co-projected patterns are mutually
exclusive:
filter = max_pattern_feedback
- alt_pattern_feedback * match_rate. By default, match_rate used here is
average (match / max_comparand). But it has average error: average abs(match_rate
- average_match_rate). To improve filter accuracy, we can derive actual match
rate by cross-comparing co-projected patterns. I think imagination is just that:
search across co-projected patterns, before accessing their external target
sources. 
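A minimal sketch of the combined filter formula, under loud assumptions: feedback from each co-projected pattern is reduced to a single number, the stronger one is treated as max_pattern_feedback and the weaker as alt_pattern_feedback, and the "actual" match rate is approximated by comparing those two numbers directly. In the scheme described here it would come from cross-comparing the patterns' elements, which this sketch does not do.

```python
def match_rate(a, b):
    """Match / max comparand for two non-negative values."""
    larger = max(a, b)
    return min(a, b) / larger if larger else 0.0


def combined_filter(feedback_a, feedback_b, rate):
    """filter = max_pattern_feedback - alt_pattern_feedback * match_rate."""
    max_feedback = max(feedback_a, feedback_b)
    alt_feedback = min(feedback_a, feedback_b)
    return max_feedback - alt_feedback * rate


average_rate = 0.5  # hypothetical default: average match rate
print(combined_filter(10.0, 4.0, average_rate))            # 8.0
print(combined_filter(10.0, 4.0, match_rate(10.0, 4.0)))   # uses the derived rate instead
```

The second call stands in for "imagination": the default average rate is replaced by a rate derived from the co-projected patterns themselves, before any external source is accessed.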
Patterns are projected in
space and time, depending on their past S-T span and a vector of input
derivatives over that span. So, pattern input parameters in some future
location can be projected as:
(recorded input parameters)
+ (corresponding derivatives * relative distance) / 2. 
Where relative distance = (projected
coords - current coords) / span of the pattern in the same direction.
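The projection formula above, sketched for a single coordinate. The names are hypothetical, and parameters and derivatives are assumed to be dictionaries keyed by the same parameter names; a full version would apply this per dimension of the pattern's S-T span.

```python
def project_params(recorded, derivatives, projected_coord, current_coord, span):
    """Projected parameter = recorded + (derivative * relative distance) / 2.

    relative distance = (projected coord - current coord) / span of the pattern
    in the same direction.
    """
    relative_distance = (projected_coord - current_coord) / span
    return {
        name: value + derivatives[name] * relative_distance / 2
        for name, value in recorded.items()
    }


recorded = {"I": 100.0, "M": 40.0}    # summed input and match of a pattern
derivatives = {"I": 8.0, "M": -4.0}   # their differences over the pattern's span
print(project_params(recorded, derivatives, projected_coord=30, current_coord=20, span=5))
# {'I': 108.0, 'M': 36.0}
```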
Any search is defined by
location: contiguous coordinate span. Span of feedback target is that of
feedback source’s input pattern: narrower than the span of feedback source unit’s
output pattern. So, search across co-projected patterns is performed on a
conceptually lower level, but patterns themselves belong to higher level. Meaning
that search will be within intersection
of co-projected patterns, vs. whole patterns. Intersection is a location within
each of the patterns, and cross-comparison will be among pattern elements in
that location.  
Combined filter is then prevaluated:
projected value of positive patterns is compared to the cost of evaluating all
inputs, both within a target location. If prevalue is negative: projected
inputs are not worth evaluating, their location is skipped and “imagination” moves
to the next nearest one. Filter search continues until prevalue turns positive
(with above-average novelty) and the sensor is moved to that location. This sensor
movement, along with adjustment of its threshold, is the most basic form of
motor feedback, AKA action.
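The prevaluation loop described above can be sketched as a nearest-first search over candidate locations. Everything here is a hypothetical simplification: each candidate is a tuple of (coordinate, projected value of positive patterns, cost of evaluating all inputs in that location), coordinates are 1D, and prevalue is taken as the plain difference between projected value and cost.

```python
def select_location(candidates, current_coord):
    """Return the nearest coordinate whose prevalue is positive, or None.

    candidates: iterable of (coord, projected_value, evaluation_cost).
    Locations with negative prevalue are skipped without being evaluated.
    """
    by_distance = sorted(candidates, key=lambda c: abs(c[0] - current_coord))
    for coord, projected_value, evaluation_cost in by_distance:
        prevalue = projected_value - evaluation_cost
        if prevalue > 0:
            return coord  # the sensor would be moved here
    return None


candidates = [(40, 20.0, 5.0), (12, 3.0, 5.0), (15, 9.0, 5.0)]
print(select_location(candidates, current_coord=10))  # 15
```

Location 12 is nearer but its prevalue is negative, so it is skipped; location 40 has the highest prevalue but is never reached, because search stops at the first positive one.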
Cognitive component of
action is planning: a form of imagination where projected patterns include
those that represent the system itself. Feedback of such self-patterns
eventually reaches the bottom of representational hierarchy: sensors and
actuators, adjusting their sensitivity | intensity and coordinates. This adjustment
is action. Such environmental interface is a part of any cognitive system,
although actuators are optional.
  
4. Initial levels of search and corresponding orders of feedback (fine to skip)
  
This part recapitulates and
expands on my core algorithm, which operates in one dimension: time only.
Spatial and derived dimensions are covered in part 6. Even within 1D, the search
is hierarchical in scope, containing any number of levels. New level is added
when current top level terminates and outputs the pattern it formed.
Higher-level patterns are
fed back to select future inputs on lower levels. Feedback is sent to all lower
levels because span of each pattern approximates combined span of inputs within
whole hierarchy below it.
So, deeper hierarchy forms
higher orders of feedback, with increasing elevation and scope relative to its
target: same-level prior input, higher-level match average,
beyond-the-next-level match value average, etc.
These orders of feedback
represent corresponding order of input compression: input, match between
inputs, match between matches, etc. Such compression is produced by comparing
inputs to feedback of all orders. 
Comparisons form patterns, of
the order that corresponds to relative span of compared feedback:
1: prior inputs are compared to the following ones on the
same level, forming difference patterns dPs,
2: higher-level match is used to evaluate match between
inputs, forming deviation patterns vPs,
3: higher-hierarchy value revaluates positive values of
match, forming more selective shortcut patterns sPs
 
Feedback of 2nd order consists of input
filters (if) defining value patterns, and coordinate
filters (Cf) defining positional resolution and relative
distance to future inputs.
Feedback of 3rd order is shortcut filters
for beyond-the-next level. These filters, sent to a location defined by
attached coordinate filters, form higher-order value patterns for deeper
internal and distant-level comparison.
 
Higher-order patterns are
more selective: difference is as likely to be positive as negative, while value
is far more likely to be negative, because positive patterns add costs of
re-evaluation for extended cross-comparison among their inputs. And so on, with
selection and re-evaluation for each higher order of positive patterns.
Negative patterns are still compared as a whole: their weak match is compensated
by greater span.
 
All orders of patterns
formed on the same level are redundant representations of the same inputs.
Patterns contain representation of match between their inputs, which are
compared by higher-order operations. Such operations increase overall match by
combining results of lower-order comparisons across pattern’s variables:
 
0Le: AND of bit inputs to form digitized integers, containing
multiple powers of two
1Le: SUB  of integers to form patterns, over additional
external dimensions = pattern length L
2Le: DIV  of multiples (L) to form ratio patterns,
over additional distances = negative pattern length LL
3Le: LOG of powers (LLs), etc.  Starting from
second level, comparison is selective per element of an
input.
 
Such power increase also applies
in comparison to higher-order feedback, with a lag of one level per order.
Power of coordinate filters
also lags the power of input filters by one level:
1Le fb: binary sensor resolution: minimal and maximal detectable
input value and coordinate increments
2Le fb: integer-valued average match and relative initial
coordinate (skipping intermediate coordinates)
3Le fb: rational-valued coefficient per variable and multiple
skipped coordinate range
4Le fb: real-valued coefficients and multiple coordinate-range
skip
 
I am defining initial
levels to find recurring increments in operations per level, which could then
be applied to generate higher levels recursively, by incrementing syntax of
output patterns and of feedback filters per level.
 
Operations per generic level (out of date)
 
Level 0 digitizes inputs,
filtered by minimal detectable magnitude: least significant bit (i LSB). These bits are AND-
compared, then their matches are AND- compared again, and so on, forming
integer outputs. This is identical to iterative summation and bit-filtering by
sequentially doubled i LSB.
 
Level 1 compares
consecutive integers, forming ± difference patterns (dP s). dP s are then evaluated to cross-compare their
individual differences, and so on, selectively increasing derivation of
patterns.
Evaluation: dP M (summed match) - dP aM (dP M per average match between
differences in level 2 inputs).
 
Integers are limited by the
number of digits (#b), and input span: least significant bit of coordinate (C LSB).
No 1st level feedback: fL cost is additive to dP cost, thus must be
justified by the value of dP (and coincident difference in value of patterns
filtered by adjusted i LSB), which is not known till dP is outputted to 2nd level.
 
Level 2 evaluates match
within dP s | bf L (dP) s, forming ± value patterns: vP s | vP (bf L) s. +vP s are evaluated for
cross-comparison of their dP s, then of resulting derivatives, then of inputted
derivation levels. +vP (bf L) s are evaluated to cross-compare bf L s, then dP s, adjusted by the
difference between their bit filters, and so on.
 
dP variables are compared
by subtraction, then resulting matches are combined with dP M (match within dP) to
evaluate these variables for cross-comparison by division, to normalize for the
difference in their span.
// match filter is also
normalized by span ratio before evaluation, same-power evaluation and
comparison?
 
Feedback: input dP s | bf L (dP) are back-projected and
resulting magnitude is evaluated to increment or decrement 0th level i LSB. Such increments
terminate bit-filter span ( bf L (dP)), output it to 2nd level, and initiate a new i LSB span to filter future
inputs. // bf L (dP) representation: bf , #dP, Σ dP, Q (dP).
 
Level 3 evaluates match in
input vP s or f L (vP) s, forming ± evaluation-value patterns:
eP s | eP (fL) s. Positive eP s are evaluated for
cross-comparison of their vP s ( dP s ( derivatives ( derivation levels ( lower search-level
sources: buffered or external locations)))). Selected sources may directly specify
strong 3rd level sub-patterns.
 
Feedback: input vP is
back-projected, resulting match is compared to 2nd level filter, and the
difference is evaluated vs. filter-update filter. If update value is positive,
the difference is added to 2nd level filter, and filter span is terminated.
Same for adjustment of previously covered bit filters and 2nd level filter-update
filters?
 
This is similar to 2nd level operations, but
input vP s are separated by
skipped-input spans. These spans are a filter of coordinate (Cf, higher-order than f for 2nd level inputs), produced by
pre-valuation of future inputs:
projected novel match =
projected magnitude * average match per magnitude - projected-input match?
 
Pre-value is then evaluated
vs. 3rd level evaluation filter +
lower-level processing cost, and negative prevalue-value input span (= span of
back-projecting input) is skipped: its inputs are not processed on lower
levels.
// no prevaluation on 2nd level: the cost is higher
than potential savings of only 1st level processing costs?
 
As distinct from input
filters, Cf is defined individually rather than per filter span. This is
because the cost of Cf update: span representation and interruption of
processing on all lower levels, is minor compared to the value of represented
contents? ±eP = ±Cf: individual skip evaluation, no flushing?
 
or interruption is
predetermined, as with Cb, fixed C f within C f L: a span of sampling across fixed-L gaps?
alternating signed Cf s are averaged ±vP s?
Division: between L s, also inputs within
minimal-depth continuous d-sign or m-order derivation hierarchy?
 
tentative generalizations and extrapolations
 
So, filter resolution is
increased per level, first for i filters and then for C filters: level 0 has
input bit filter,
level 1 adds coordinate bit
filter, level 2 adds input integer filter, level 3 adds coordinate integer
filter.
// coordinate filters (Cb, Cf) are not input-specific,
patterns are formed by comparing their contents.
 
Level 4 adds input multiple
filter: eP match and its derivatives,
applied in parallel to corresponding variables of input pattern. Variable-values
are multiplied and evaluated to form pattern-value, for inclusion into
next-level ±pattern // if separately evaluated, input-variable value =
deviation from average: sign-reversed match?
 
Level 5 adds coordinate
multiple filter: a sequence of skipped-input spans by iteratively projected
patterns, as described in imagination section of part 3. Alternatively,
negative coordinate filters implement cross-level shortcuts, described in level
3 sub-part, which select for projected match-associated novelty.
 
Additional variables in
positive patterns increase cost, which decreases positive vs. negative span
proportion.
Increased difference in
sign, syntax, span, etc., also reduces match between positive and negative
patterns. So, comparison, evaluation, pre-valuation... on higher levels is
primarily for same-sign patterns.
 
Consecutive different-sign
patterns are compared due to their proximity, forming ratios of their span and
other variables. These ratios are applied to project match across different-sign
gap or contrast pattern:
projected match +=
(projected match - intervening negative match) * (negative value / positive
value) / 2?
 
ΛV selection is incremented by induction: forward and
feedback of actual inputs, or by deduction: algebraic compression of input
syntax, to find computational shortcuts. Deduction is faster, but actual inputs
also carry empirical information. Relative value of additive information vs.
computational shortcuts is set by feedback.
The following parts cover the three initial levels in more detail, but are
mostly out of date:
  
Level 1: comparison to past
inputs, forming difference patterns and match patterns
 
Inputs to the 1st level of search are single
integers, representing pixels of 1D scan line across an image, or equivalents
from other modalities. Consecutive inputs are compared to form differences,
difference patterns, matches, relative match patterns. This comparison may be
extended, forming higher and distant derivatives:
 resulting variables per
input: *=2 derivatives (d,m) per comp, + conditional *=2 (xd, xi) per extended
comp:
 
        8 derivatives            // ddd, mdd, dd_i, md_i, + 1-input-distant dxd, mxd, + 2-input-distant d_ii, m_ii,
            /   \
        4 der   4 der            // 2 consecutive: dd, md, + 2 derivatives between 1-input-distant inputs: d_i and m_i,
        /   \   /   \
     d,m     d,m     d,m         // d, m: derivatives from default comparison between consecutive inputs,
    /   \   /   \   /   \
  i  >>   i  >>   i  >>   i      // i: single-variable inputs.
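
The default comparison at the bottom of this diagram can be sketched in a few lines of Python. This is a minimal illustration, not the line_patterns code: the function names and the dictionary layout are hypothetical, match is taken as the smaller of the two compared inputs, and a zero difference is assumed to join the non-positive span.

```python
from itertools import groupby

def compare_consecutive(inputs):
    """Return (d, m) per pair of consecutive inputs.

    d is the difference, m is the match, taken here as the smaller input.
    """
    return [(current - prior, min(current, prior))
            for prior, current in zip(inputs, inputs[1:])]

def form_dPs(derivatives):
    """Group consecutive (d, m) pairs into spans of same-sign difference.

    Assumption: d == 0 is grouped with negative d.
    """
    patterns = []
    for positive, group in groupby(derivatives, key=lambda dm: dm[0] > 0):
        group = list(group)
        patterns.append({
            "sign": positive,
            "L": len(group),                 # span length
            "D": sum(d for d, _ in group),   # summed difference
            "M": sum(m for _, m in group),   # summed match
        })
    return patterns

pixels = [3, 5, 4, 4]
derivatives = compare_consecutive(pixels)
print(derivatives)            # [(2, 3), (-1, 4), (0, 4)]
print(form_dPs(derivatives))
```

Summing L, D and M per span is what lets a higher level compare whole patterns instead of individual pixels.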
 
This is explained and implemented in my draft Python code: line_patterns.
That first level is for a generic 1D cognitive algorithm; its adaptation for
image and then video recognition will be natively 2D.
That’s what I spend most of my time on; the rest of this intro is
significantly out of date.
 
 
bit-filtering and
digitization
 
 
1st level inputs are filtered
by the value of most and least significant bits: maximal and minimal detectable
magnitude of inputs. Maximum is a magnitude that co-occurs with average 1st level match, projected by
outputted dP s. Least significant bit
value is determined by maximal value and number of bits per variable.
 
This bit filter is
initially adjusted by overflow in 1st level inputs, or by a set number of
consecutive overflows.
It’s also adjusted by feedback of higher-level patterns, if they project
overflow or underflow of 1st level inputs that exceeds the cost of adjustment.
Underflow is the average number of 0 bits above the top 1 bit.
Original input resolution
may be increased by projecting analog magnification, by impact or by distance.
 
Iterative bit-filtering is
digitization: bit is doubled per higher digit, and exceeding summed input is
transferred to next digit. A digit can be larger than binary if the cost of
such filtering requires larger carry.
Digitization is the most
basic way of compressing inputs, followed by comparison between resulting
integers.
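
One possible reading of the bit filter, as a rough sketch: the least significant bit position and the number of bits per variable define a window, inputs above the window overflow, and unused top bits count as underflow. The names bit_filter and underflow, and the choice to clamp on overflow, are assumptions made for illustration only.

```python
def bit_filter(value, lsb, n_bits):
    """Keep n_bits of a non-negative integer input, starting at bit position lsb.

    Returns the filtered value and an overflow flag. Clamping to the maximal
    representable value on overflow is an assumption of this sketch.
    """
    shifted = value >> lsb                # drop bits below the least significant bit
    max_value = (1 << n_bits) - 1         # maximal detectable magnitude
    overflow = shifted > max_value
    return min(shifted, max_value), overflow

def underflow(value, n_bits):
    """Number of 0 bits above the top 1 bit, within an n_bits window."""
    return n_bits - value.bit_length()

print(bit_filter(300, 1, 8))    # (150, False)
print(bit_filter(1000, 1, 8))   # (255, True)
print(underflow(5, 8))          # 5
```

Repeated overflow would then argue for raising lsb, and persistent underflow for lowering it, subject to the cost of adjustment.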
 
 
hypothetical: comparable
magnitude filter, to form minimal-magnitude patterns
 
 
This doesn’t apply to
reflected brightness, only to types of input that do represent physical
quantity of a source.
Initial magnitude justifies
basic comparison, and summation of below-average inputs only compensates for
their lower magnitude, not for the cost of conversion. Conversion involves
higher-power comparison, which must be justified by higher order of match, to
be discovered on higher levels.
iP min mag span conversion
cost and comparison match would be on 2nd level, but it’s not justified by 1st level match, unlike D span
conversion cost and comparison match, so it is effectively the 1st level of comparison?
possible +iP span
evaluation: double evaluation + span representation cost < additional
lower-bits match?
 
The inputs may be
normalized by subtracting feedback of average magnitude, forming ± deviation,
then by dividing it by next+1 level feedback, forming a multiple of average
absolute deviation, and so on. Additive value of input is a combination of all
deviation orders, starting with 0th or absolute magnitude.  
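
The first two normalization orders can be sketched as follows. In the text both averages arrive as feedback from higher levels; here they are simply passed in as arguments, and the function name is hypothetical.

```python
def normalize(inputs, avg_magnitude, avg_abs_deviation):
    """Return signed deviations and their multiples of average absolute deviation.

    avg_magnitude and avg_abs_deviation stand in for higher-level feedback.
    """
    deviations = [i - avg_magnitude for i in inputs]
    multiples = [dev / avg_abs_deviation for dev in deviations]
    return deviations, multiples

deviations, multiples = normalize([4, 8, 12], avg_magnitude=8, avg_abs_deviation=4)
print(deviations)   # [-4, 0, 4]
print(multiples)    # [-1.0, 0.0, 1.0]
```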
Initial input evaluation if
any filter: cost < gain: projected negative-value (comparison cost -
positive value):
by minimal magnitude > ± relative magnitude patterns
(iP s), and + iP s are evaluated or cross-compared?
or by average magnitude
> ± deviations, then by
co-average deviation: ultimate bit filter?
 
Summation *may* compensate
for conversion if its span is greater than average per magnitude spectrum?!
Summation on higher levels
also increases span order, but within-order conversion is the same, and
 between-order comparison is intra-pattern only. bf spans overlap vP span,
-> filter conversion costs?
 
 
Level 2: additional evaluation of
input patterns for feedback, forming filter patterns (out of date)
 
 
Inputs to 2nd level of search are
patterns derived on 1st level. These inputs are evaluated for feedback to update
0th level i LSB, terminating
same-filter span.
Feedback increment of LSB
is evaluated by deviation (∆) of magnitude, to avoid input overflow or underflow:
 
∆ += I/ L - LSB a; |∆| > ff? while (|∆| > LSB a){ LSB ±; |∆| -= LSB a; LSB a *2};
LSB a is average input (* V/ L?) per LSB value, and ff
is average deviation per positive-value increment;
Σ (∆) before evaluation: no V patterns? #b++ and C LSB-- are more expensive,
evaluated on 3rd level?
They are also compared to
previously inputted patterns, forming difference patterns dPs and value
patterns vPs per input variable, then combined into dPP s and vPP s per input pattern.
 
L * sign of consecutive dP s is a known miss, and
match of dP variables is correlated by common derivation.
Hence, projected match of
other +dP and -dP variables = amk * (1 - L / dP). On the other hand,
same-sign dP s are distant by L,
reducing projected match by amk * L, which is equal to reduction by miss of L?
 
So, dP evaluation is for
two comparisons of equal value: cross-sign, then cross- L same-sign (1 dP evaluation is blocked by
feedback of discovered or defined alternating sign and co-variable match
projection).
Both of last dP s will be compared to the
next one, thus past match per dP (dP M) is summed for three dP s:
dP M ( Σ ( last 3 dP s L+M)) - a dP M (average of 4Le +vP dP M) -> v, vs;; evaluation / 3 dP s -> value, sign / 1 dP.
while (vs = ovs){ ovs = vs; V+=v; vL++; vP (L, I, M, D) += dP (L, I, M, D);; default vP - wide sum, select preserv.
 
vs > 0? comp (3 dP s){ DIV (L, I, M, D) -> N, ( n, f, m, d); vP (N, F, M, D) += n, f, m, d;; sum: der / variable, n / input?  
vr = v+ N? SUB (nf) -> nf m; vd = vr+ nf m, vds = vd - a;; ratios are too small
for DIV?
 
while (vds = ovds){ ovds = vds; Vd+=vd; vdL++; vdP() += Q (d | ddP);; default Q (d | ddP) sum., select. preserv.
vds > 0? comp (1st x lst d | ddP s of Q (d) s);; splicing Q (d) s of matching dP s, cont. only: no comp ( Σ Q (d | ddP)?
 
Σ vP ( Σ vd P eval: primary for -P,
redundant to individual dP s ( d s  for +P, cost *2, same for +P' I and -P' M,D?
no Σ V | Vd evaluation of cont. comp
per variable or division: cost + vL = comp cost? Σ V per fb: no vL, #comp;
 
- L, I, M, D: same value per mag,
power / compression, but I | M, D redund = mag, +vP: I - 2a, - vP: M, D - 2a?
- no variable eval: cost (sub + vL + filter) > comp cost, but
match value must be adjusted for redundancy?
- normalization for
comparison: min (I, M, D) * rL, SUB (I, M, D)? Σ L (pat) vs C: more general but
interrupted?
 
variable-length DIV: while (i > a){ while (i> m){ SUB (i, m) -> d; n++; i=d;}; m/=2; t=m;  SUB (d, t); f+= d;}?
additive compression per d
vs. m*d: > length cost?
tdP ( tM, tD, dP(), ddP Σ ( dMΣ (Q (dM)), dDΣ (Q (dD)), ddLΣ (Q (ddL)), Q (ddP))); // last d and D are within
dP()?
 
Input filter is a
higher-level average, while filter update is accumulated over multiple
higher-level spans until it exceeds filter-update filter. So, filter update is
2nd order feedback relative to
filter, as is filter relative to match.
But the same filter update
is 3rd  order of feedback
when used to evaluate input value for inclusion into pattern defined by a
previous filter: update span is two orders higher than value span.
 
Higher-level comparison
between patterns formed by different filters is mediated, vs. immediate
continuation of current-level comparison across filter update (mediated cont.:
splicing between different-filter patterns by vertical specification of match,
although it includes lateral cross-comparison of skip-distant specifications).
However, filter update
feedback is periodic, so it doesn’t form continuous cross-filter comparison
patterns xPs.
 
 
adjustment of forward
evaluation by optional feedback of projected input
 
 
More precisely, additive
value or novel magnitude of an input is its deviation from higher-level
average. Deviation = input - expectation: (higher-level summed input - summed
difference /2) * rL (L / hL).
Inputs are compared to last
input to form difference, and to past average to form deviation or novelty.
 
But last input is more
predictive of the next one than a more distant average, thus the latter is
compared on higher level than the former. So, input variable is compared
sequentially and summed within resulting patterns. On the next level, the sum
is compared vertically: to next-next-level average of the same variable.
 
Resulting vertical match
defines novel value for higher-level sequential comparison:
novel value = past match -
(vertical match * higher-level match rate) - average novel match:
nv = L+M - (m (I, (hI * rL)) * hM / hL) - hnM * rL; more precise than
initial value: v = L+M - hM * rL;
 
Novelty evaluation is done
if higher-level match > cost of feedback and operations, separately for I
and D P s:
I, M ( D, M feedback, vertical SUB (I, nM ( D, ndM));
Impact on ambient sensor is
separate from novelty and is predicted by representational-value patterns?
 
- next-input prediction: seq
match + vert match * relative rate, but predictive selection is per level, not
input.
- higher-order expectation is
relative match per variable: pMd = D * rM, M/D, or D * rMd: Md/D,
- if rM | rMd are derived by
intra-pattern comparison, when average M | Md > average per division?
 
 
one-input search extension
within cross-compared patterns
 
 
Match decreases with
distance, so initial comparison is between consecutive inputs. Resulting match
is evaluated, forming ±vP s. Positive P s are then evaluated for expanded internal
search: cross-comparison among 1-input-distant inputs within a pattern (on same
level, higher-level search is between new patterns).
 
This cycle repeats to
evaluate cross-comparison among 2-input-distant inputs, 3-input-distant inputs,
etc., when summed current-distance match exceeds the average per evaluation.
 
So, patterns of longer
cross-comparison range are nested within selected positive patterns of shorter
range. This is similar to 1st level ddP s being nested within dP s.
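
The extension cycle above can be sketched as a loop over increasing comparison distance. This is a simplified illustration that assumes a single positive pattern is given as a list of inputs and that one threshold (avg_per_evaluation, a hypothetical name) gates each extension. Note that distance 1 in the code means consecutive inputs, so "1-input-distant" in the text corresponds to distance 2 here.

```python
def extend_search(inputs, avg_per_evaluation):
    """Cross-compare inputs at increasing distance within one pattern.

    Comparison at the next distance is done only while the summed match at
    the current distance exceeds avg_per_evaluation.
    Returns {distance: [(d, m), ...]}.
    """
    comparisons = {}
    distance = 1
    while distance < len(inputs):
        pairs = [(later - earlier, min(earlier, later))
                 for earlier, later in zip(inputs, inputs[distance:])]
        comparisons[distance] = pairs
        if sum(m for _, m in pairs) <= avg_per_evaluation:
            break                      # summed match does not justify extension
        distance += 1
    return comparisons

result = extend_search([5, 6, 5, 6, 1], avg_per_evaluation=13)
print(sorted(result))   # [1, 2]
print(result[2])        # [(0, 5), (0, 6), (-4, 1)]
```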
 
Same input is re-evaluated
for comparison at increased distance because match will decay: projected match
= last match * match rate (mr), * (higher-level mr / current-level mr) *
(higher-level distance / next distance)?
Or = input * average match rate
for that specific distance, including projected match within negative patterns.
 
It is re-evaluated also
because projected match is adjusted by past match: mr *= past mr / past projected
mr?
Also, multiple comparisons
per input form overlapping and redundant patterns (similar to fuzzy clusters),
and must be evaluated vs.
filter * number of prior comparisons, reducing value of projected match.
 
Instead of directly
comparing incrementally distant input pairs, we can calculate their difference
by adding intermediate differences. This would obviate multiple access to the
same inputs during cross-comparison.
These differences are also
subtracted (compared), forming higher derivatives and matches:
 
             ddd, x1dd, x2d            ( ddd: 3rd derivative, x1dd: d of 2-input-distant d s, x2d: d of 2-input-distant inputs )
                 /     \
           dd, x1d     dd, x1d         ( dd: 2nd derivative, x1d = d+d = difference between 1-input-distant inputs )
           /     \     /     \
        d           d           d      ( d: difference between consecutive inputs )
     /     \     /     \     /     \
  i           i           i           i      ( i: initial inputs )
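
The shortcut of adding intermediate differences, instead of re-accessing distant input pairs, can be checked with a short sketch. The function name and the skipped parameter (number of inputs skipped between the compared pair) are introduced here for illustration.

```python
def distant_differences(inputs, skipped):
    """Differences between inputs separated by `skipped` intermediate inputs,
    computed only from differences between consecutive inputs."""
    d = [b - a for a, b in zip(inputs, inputs[1:])]
    span = skipped + 1                       # number of consecutive d s to add
    return [sum(d[k:k + span]) for k in range(len(d) - span + 1)]

inputs = [3, 5, 4, 4, 7]
print(distant_differences(inputs, 1))   # [1, -1, 3]  same as 4-3, 4-5, 7-4
print(distant_differences(inputs, 2))   # [1, 2]      same as 4-3, 7-5
```

With skipped = 1 this is x1d = d + d from the diagram; the sums equal the direct differences because the intermediate inputs cancel out.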
 
As always, match is a
smaller input, cached or restored, selected by the sign of a difference.
Comparison of both types is
between all same-type variable pairs from different inputs.
Total match includes match
of all its derivation orders, which will overlap for proximate inputs.
 
Incremental cost of
cross-comparison is the same for all derivation orders. If projected match is
equal to projected miss, then additive value for different orders of the same
inputs is also the same: reduction in projected magnitude of differences will
be equal to reduction in projected match between distant inputs?
 
 
multi-input search
extension, evaluation of selection per input: tentative
 
 
On the next level, average
match from expansion is compared to that from shorter-distance comparison, and
resulting difference is decay of average match with distance. Again, this decay
drives re-evaluation per expansion: selection of inputs with projected decayed
match above average per comparison cost.
 
Projected match is also
adjusted by prior match (if local decay?) and redundancy (symmetrical if no
decay?)
Slower decay will reduce
value of selection per expansion because fewer positive inputs will turn
negative:
Value of selection = Σ
|comp cost of neg-value inputs| - selection cost (average saved cost or
relative delay?)
 
This value is summed
between higher-level inputs, into average value of selection per increment of
distance. Increments with negative value of selection should be compared
without re-evaluation, adding to minimal number of comparisons per selection,
which is evaluated for feedback as a comparison-depth filter:
 
Σ (selection value per
increment) -> average selection value;; for negative patterns of each depth,
| >1 only?
depth adjustment value =
average selection value; while (|average selection value| > selection cost){
depth adjustment ±±; depth
adjustment value -= selection value per increment (depth-specific?); };
depth adjustment >
minimal per feedback? >> lower-level depth filter;; additive depth = adjustment
value?
 
- match filter is summed
and evaluated per current comparison depth?
- selected positive
relative matches don’t reduce the benefit of pruning-out negative ones.
- skip if negative
selection value: selected positive matches < selection cost: average value
or relative delay?
 
Each input forms a queue of
matches and misses relative to templates within comparison depth filter. These
derivatives, both discrete and summed, overlap for inputs within each other’s
search span. But representations of discrete derivatives can be reused,
redundancy is only necessary for parallel comparison.
 
Assuming that the environment is not random, similarity between inputs
declines with spatio-temporal distance. To maintain proximity, an n-input
search is FIFO: each input is compared to all templates up to maximal
distance, then added to the queue as a new template, while the oldest
template is outputted into a pattern-wide queue.
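
A minimal sketch of such a FIFO search, assuming templates are kept in a double-ended queue and each new input stores its own list of (difference, match) pairs, nearest template first. The dictionary layout and the final flush of remaining templates are assumptions of this illustration.

```python
from collections import deque

def fifo_search(inputs, max_distance):
    """Compare each input to up to max_distance prior templates, FIFO."""
    templates = deque()
    outputs = []                                   # pattern-wide queue
    for value in inputs:
        # compare to all current templates, nearest first
        comparisons = [(value - t["input"], min(value, t["input"]))
                       for t in reversed(templates)]
        templates.append({"input": value, "comparisons": comparisons})
        if len(templates) > max_distance:
            outputs.append(templates.popleft())    # oldest template is outputted
    outputs.extend(templates)                      # flush at the end of input
    return outputs

for template in fifo_search([3, 5, 4], max_distance=2):
    print(template)
```

For the last input, 4, the stored comparisons are [(-1, 4), (1, 3)]: first against 5, then against 3.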
 
 
value-proportional
combination of patterns: tentative  
 
 
Summation of +dP and -dP is
weighted by their value: L (summed d-sign match) + M (summed i match).
Such relative probability
of +dP vs. - dP is indicated by corresponding ratios: rL = +L/-L, and rM =
+M/-M.
(Ls and Ms are compared by
division: comparison power should be higher for more predictive variables).
 
But weighting
complementation incurs costs, which must be justified by value of ratio. So,
division should be of variable length, continued while the ratio is above
average. This is shown below for Ls, also applies to Ms:
dL = +L - -L, mL = min (+L, -L); nL = 0; fL = 0; efL = 1; // nL: L multiple, fL: L fraction, efL: extended fraction.
while (dL > adL){ dL = |dL|; // all Ls are positive; dL is evaluated for long division by adL: average dL.
while (dL > 0){ dL -= mL; nL++;} dL -= mL/2; dL > 0? fL += efL; efL /= 2;} // ratio: rL = nL + fL.
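
One way to read the variable-length division above is as restoring long division: repeated subtraction gives the integer multiple, then successively halved steps give binary fraction digits, continued only while the remainder is above an average threshold. The sketch below follows that reading and is not a transcription of the pseudocode: it divides the larger L by the smaller one, starts the fraction weight at 1/2, and the names avg_remainder and max_fraction_bits are assumptions.

```python
def long_division(pos_L, neg_L, avg_remainder=0.0, max_fraction_bits=16):
    """Variable-length ratio of the larger to the smaller L.

    Returns (n, f): integer multiple and binary fraction, so ratio = n + f.
    Fraction digits are added only while the remainder exceeds avg_remainder.
    """
    larger, smaller = max(pos_L, neg_L), min(pos_L, neg_L)
    if smaller <= 0:
        raise ValueError("both Ls must be positive")
    n, remainder = 0, larger
    while remainder >= smaller:            # integer part by repeated subtraction
        remainder -= smaller
        n += 1
    f, weight, step = 0.0, 0.5, smaller / 2
    bits = 0
    while remainder > avg_remainder and bits < max_fraction_bits:
        if remainder >= step:              # this binary fraction digit is 1
            remainder -= step
            f += weight
        weight /= 2
        step /= 2
        bits += 1
    return n, f

print(long_division(7, 2))                    # (3, 0.5)
print(long_division(5, 4))                    # (1, 0.25)
print(long_division(7, 2, avg_remainder=1))   # (3, 0.0)
```

The third call shows the selective part: with a higher threshold the division stops at the integer multiple, which is the reduced-resolution ratio.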
 
Ms’ long-division
evaluation is weighted by rL: projected rM value = dM * nL (reduced-resolution
rL) - adM.
Ms are then combined: cM =
+M + -M * rL; // rL is relative probability of -M across iterated cL.
Ms are not projected (M+= D * rcL * rM D (MD/cD) /2): precision of
higher-level rM D is below that of rM?
 
Prior ratios are
combination rates: rL is probability of -M, and combined rL and rM (cr) is
probability of -D.
If rM < arM, cr = rL,
else: cr = (+L + +M) / (-L + -M) // cr = √(rL * rM) would lose L vs. M
weighting.
cr predicts match of
weighted cD between cdPs, where negative-dP variable is multiplied by
above-average match ratio before combination: cD = +D + -D * cr. // after un-weighted
comparison between Ds?
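
The combination rules for cM, cr and cD can be put together in one small function. For brevity this sketch uses plain division for rL and rM instead of the variable-length division, and the dictionary keys and function name are hypothetical.

```python
def combine_dPs(pos, neg, avg_rM):
    """Combine a positive and a negative dP, each given as {"L", "M", "D"}.

    Assumption: plain division replaces the variable-length long division.
    """
    rL = pos["L"] / neg["L"]
    rM = pos["M"] / neg["M"]
    cM = pos["M"] + neg["M"] * rL                 # cM = +M + -M * rL
    if rM < avg_rM:
        cr = rL
    else:
        cr = (pos["L"] + pos["M"]) / (neg["L"] + neg["M"])
    cD = pos["D"] + neg["D"] * cr                 # cD = +D + -D * cr
    return {"rL": rL, "rM": rM, "cM": cM, "cr": cr, "cD": cD}

pos = {"L": 6, "M": 2, "D": 8}
neg = {"L": 2, "M": 2, "D": -4}
print(combine_dPs(pos, neg, avg_rM=0.5))   # cr = 2.0, cD = 0.0
print(combine_dPs(pos, neg, avg_rM=1.5))   # cr = 3.0, cD = -4.0
```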
 
Averages: arL, arM, acr,
are feedback of ratios that co-occur with above-average match of
span-normalized variables, vs. input variables. Another feedback is averages
that evaluate long division: adL, adM, adD.
Both are feedback of positive
C pattern, which represents these variables, inputted & evaluated on 3rd level.
; or 4th level: value of dPs * ratio is compared to value
of dPs, & the difference is multiplied by cL / hLe cL?
 
Comparison of opposite-sign
Ds forms negative match = smaller |D|, and positive difference dD = +D+ |-D|.
dD magnitude predicts its
match, not further combination. Single comparison is cheaper than its
evaluation.
Comparison is by division
if larger |D| co-occurs with hLe nD of above-average predictive value (division
is sign-neutral & reductive). But average nD value is below the cost of
evaluation, except if positive feedback?
 
So, default operations for
L, M, D of complementary dPs are comparison by long division and combination.
D combination: +D -D*cr,
vs. - cD * cr: +D vs. -D weighting is
lost, meaningless if cD=0?
Combination by division is
predictive if the ratio is matching on higher level (hLe) & acr is fed back
as filter?                     
Resulting variables: cL,
rL, cM, rM, cr, cD, dD, form top level of cdP: complemented dP.
 
 
Level 3: prevaluation of
projected filter patterns, forming updated-input patterns
(out of date)
 
 
3rd level inputs are ± V
patterns, combined into complemented V patterns. Positive V patterns include
derivatives of 1st level match, which project match within future inputs (D
patterns only represent and project derivatives of magnitude). Such
projected-inputs-match is pre-valuated, negative prevalue-span inputs are
summed or skipped (reloaded), and positive prevalue-span inputs are evaluated
or even directly compared.
 
Initial upward (Λ) prevaluation by E filter
selects for evaluation of V patterns, within resulting ± E patterns. Resulting
prevalue is also projected downward (V), to select future input spans for evaluation,
vs. summation or skipping. The span is of projecting V pattern, same as of
lower hierarchy. Prevaluation is then iterated over multiple projected-input
spans, as long as last |prevalue| remains above average for the cost of
prevaluation.
 
Additional interference of
iterated negative projection is stronger than positive projection of lower
levels, and should flush them out of pipeline. This flushing need not be final,
spans of negative projected value may be stored in buffers, to delay the loss.
Buffers are implemented in slower and cheaper media (tape vs. RAM) and accessed
if associated patterns match on a higher level, thus project above-average
match among their inputs.
 
Iterative back-projection
is evaluated starting from 3rd level: to be projectable the input must
represent derivatives of value, which are formed starting from 2nd level. Compare this to 2nd level evaluation:
Λ for input, V for V
filter, iterated within V pattern. Similar sub-iteration in E pattern?
 
Evaluation value =
projected-inputs-match - E filter: average input match that co-occurs with
average  higher-level match per evaluation (thus accounting for evaluation
costs + selected comparison costs). Compare this to V filter that selects for 2nd level comparison: average
input match that co-occurs with average higher-level match per comparison (thus
accounting for costs of default cross-comparison only).
 
E filter feedback starts
from 4th level of search, because
its inputs represent pre-valuated lower-level inputs.
4th level also pre-pre-valuates
vs. prevaluation filter, forming pre-prevalue that determines prevaluation vs.
summation of next input span. And so on: the order of evaluation increases with
the level of search.
Higher levels are
increasingly selective in their inputs, because they additionally select by
higher orders derived on these levels: magnitude ) match and difference of
magnitude ) match and difference of match, etc.
 
Feedback of prevaluation is
± pre-filter: binary evaluation-value sign that determines evaluating vs.
skipping initial inputs within projected span, and flushing those already
pipelined within lower levels.
Negative feedback may be
iterated, forming a skip span.
Parallel lower hierarchies
& skip spans may be assigned to different external sources or their
internal buffers.
 
Filter update feedback is
level-sequential, but pre-filter feedback is sent to all lower levels at once.
Pre-filter is defined per
input, and then sequentially translated into pre-filters of higher derivation
levels:
prior value += prior match
-> value sign: next-level pre-filter. If there are multiple pre-filters of
different evaluation orders from corresponding levels, they AND & define
infra-patterns: sign ( input ( derivatives.


filter update evaluation
and feedback

 
Negative evaluation-value blocks input evaluation (thus comparison) and filter updating on all lower levels. Not-evaluated input spans (gaps) are also outputted, which will increase coordinate range per contents of both higher-level inputs and lower-level feedback. Gaps represent negative projected-match value, which must be combined with positive value of subsequent span to evaluate comparison across the gap on a higher level. This is similar to evaluation of combined positive + negative relative match spans, explained above.
Blocking locations with expected inputs will result in preference for exploration & discovery of new patterns, vs. confirmation of the old ones. It is the opposite of upward selection for stronger patterns, but sign reversal in selection criteria is a basic feature of any feedback, starting with average match & derivatives.
 
Positive evaluation-value
input spans are evaluated by lower-level filter, & this filter is evaluated
for update:
combined update = (output
update + output filter update / (same-filter span (fL) / output span)) /2.
both updates: -= last
feedback, equal-weighted because higher-level distance is compensated by range:
fL?
update value = combined
update - update filter: average update per average higher-level additive match.
also differential costs of
feedback transfer across locations (vs. delay) + representation + filter
conversion?
 
If update value is
negative: fL += new inputs, subdivided by their positive or negative predictive
value spans.
If update value is
positive: lower-level filter += combined update, new fL (with new filter
representation) is initialized on a current level, while current-level part of
old fL is outputted and evaluated as next-level input.
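
The update evaluation above can be sketched as a single step. This is an illustration under simplifying assumptions: subtraction of the last feedback and the differential transfer costs are left out, the filter is a single number, and the function name and return convention are hypothetical.

```python
def evaluate_filter_update(filter_value, fL, output_update,
                           output_filter_update, output_span, update_filter):
    """One step of filter-update evaluation.

    Returns (filter_value, fL, terminated). If update value is positive, the
    filter is incremented and a new same-filter span starts at 0; otherwise
    the filter and the span are kept, and the caller keeps adding inputs to fL.
    """
    combined = (output_update + output_filter_update / (fL / output_span)) / 2
    update_value = combined - update_filter
    if update_value > 0:
        return filter_value + combined, 0, True
    return filter_value, fL, False

print(evaluate_filter_update(10, 8, 6, 4, 2, update_filter=3))   # (13.5, 0, True)
print(evaluate_filter_update(10, 8, 6, 4, 2, update_filter=5))   # (10, 8, False)
```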
 
In turn, the filter gets
updates from higher-level outputs, included in higher-higher-level positive
patterns by that level’s filter. Hence, each filter represents combined
span-normalized feedback from all higher levels, of exponentially growing span
and reduced update frequency.
Deeper hierarchy should
block greater proportion of inputs. At the same time, increasing number of
levels contribute to projected additive match, which may justify deeper search
within selected spans.
 
Higher-level outputs are
more distant from current input due to elevation delay, but their projection
range is also greater. So, outputs of all levels have the same relative
distance (distance/range) to a next input, and are equal-weighted in combined
update. But if input span is skipped, relative distance of skip-initiating
pattern to next input span will increase, and its predictive value will
decrease. Hence, that pattern should be flushed or at least combined with a
higher-level one:
 
combined V prevalue = higher-level V prevalue + ((current-level V prevalue - higher-level V prevalue) / ((current-level
span / distance) / (higher-level span / distance))) / 2. // the difference
between current-level and higher-level prevalues is reduced by the ratio of
their relative distances.
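
The formula translates directly into code. The argument names are introduced here for illustration; span and distance are taken separately per level.

```python
def combine_prevalues(current, higher, current_span, current_distance,
                      higher_span, higher_distance):
    """Combine current-level and higher-level V prevalues.

    Their difference is divided by the ratio of span-per-distance of the two
    levels, then halved and added to the higher-level prevalue.
    """
    ratio = (current_span / current_distance) / (higher_span / higher_distance)
    return higher + ((current - higher) / ratio) / 2

# current level: span 8 at distance 2; higher level: span 16 at distance 8
print(combine_prevalues(10, 4, 8, 2, 16, 8))   # 5.5
```

When both levels have the same span-per-distance, the ratio is 1 and the result is the plain average of the two prevalues.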
 
To speed up selection,
filter updates can be sent to all lower levels in parallel. Multiple direct
filter updates are span-normalized and compared at a target level, and the
differences are summed in combined update. This combination is equal-weighted
because all levels have the same span-per-distance to next input, where the
distance is the delay of feedback during elevation. // this happens
automatically in level-sequential feedback?
 
combined update = filter
update + distance-normalized difference between output & filter updates:
((output update - filter
update) / (output relative distance / higher-output relative distance)) /2.
This combination method is
accurate for post-skipped input spans, as well as next input span.
 
- filter can also be
replaced by output + higher-level filter /2, but value of such feedback is not
known.
- possible fixed-rate
sampling, to save on feedback evaluation if slow decay, ~ deep feedforward
search?
- selection can be by
patterns, derivation orders, sub-patterns within an order, or individual
variables?
- match across distance
also projects across distance: additive match = relative match * skipped
distance?
 
 
cross-level shortcuts:
higher-level sub-filters and symbols
 
 
After individual input comparison, if match of a current scale
(length-of-a-length…) projects positive relative match of input lower-scale /
higher-derivation level, then the latter is also cross-compared between the
inputs.
Lower scale levels of a
pattern represent old lower levels of a search hierarchy (current or buffered
inputs).
 
So, feedback of lower scale
levels goes down to corresponding search levels, forming shortcuts to preserve
detail for higher levels. Feedback is generally negative: expectations are
redundant to inputs. But specifying feedback may be positive: lower-level details
are novel to a pattern, & projected to match with it in the future.
Higher-span comparison
power is increased if lower-span comparison match is below average:
variable subtraction ) span division )
super-span logarithm?
 
Shortcuts to individual
higher-level inputs form a queue of sub-filters on a lower level, possibly
represented by a queue-wide pre-filter. So, a level has one filter per parallel
higher level, and sub-filter for each specified sub-pattern. Sub-filters of
incrementally distant inputs are redundant to all previous ones.
Corresponding input value =
match - sub-filter value * rate of match to sub-filter * redundancy?  
 
Shortcut to a whole level won’t speed up search: higher-level search delay >
lower-hierarchy search delay.
Resolution and parameter
range may also increase through interaction of co-located counter-projections?
 
Symbols, for communication
among systems that have common high-level concepts but no direct interface, are
“co-author identification” shortcuts: their recognition and interpretation is
performed on different levels.
 
Higher-level patterns have
increasing number of derivation levels, that represent corresponding lower
search levels, and project across multiple higher search levels, each evaluated
separately?
Match across discontinuity
may be due to additional dimensions or internal gaps within patterns.
 
Search depth may also be
increased by cross-comparison between levels of scale within a pattern: match
across multiple scale levels also projects over multiple higher- and lower-
scale levels? Such comparison between variable types within a pattern would be
of a higher order:
  
5. Comparison between variable types
within a pattern (tentative)
 
 To reiterate, elevation increases syntactic
complexity of patterns: the number of different variable types within them.
Syntax is identification of these types by their position (syntactic
coordinate) within a pattern. This is analogous to recognizing parts of speech
by their position within a sentence.
Syntax “synchronizes”
same-type variables for comparison | aggregation between input patterns. Access
is hierarchical, starting from sign->value levels within each variable of
difference and relative match: sign is compared first, forming + and - segments,
which are then evaluated for comparison of their values.
 
Syntactic expansion is
pruned by selective comparison vs. aggregation of individual variable types
within input patterns, over each coordinate type or resolution. As with
templates, minimal aggregation span is resolution of individual inputs, &
maximal span is determined by average magnitude (thus match) of new derivatives
on a higher level. Hence, a basic comparison cycle generates queues of
interlaced individual & aggregate derivatives at each template variable,
and conditional higher derivatives on each of the former.
 
Sufficiently complex syntax or predictive variables will justify comparing
across “syntactic” coordinates within a pattern, analogous to comparison
across external coordinates. In fact, that’s what higher-power comparisons do.
For example, division is an iterative comparison between difference & match:
within a pattern (external coordinate), but across derivation (syntactic
coordinate).
 
Also cross-variable is
comparison between orders of match in a pattern: magnitude, match,
match-of-match... This starts from comparison between match & magnitude:
match rate (mr) = match / magnitude. Match rate can then be used to project
match from magnitude: match = magnitude * output mr * filter mr.
In this manner, mr of each
match order adjusts intra-order-derived sequentially higher-order match:
match *= lower inter-order
mr. Additive match is then projected from adjusted matches & their
derivatives.
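
A minimal sketch of the match rate and its use in projection, with hypothetical function names. Returning 0 for zero magnitude is an assumption made to keep the example safe to run.

```python
def match_rate(match, magnitude):
    """mr = match / magnitude (0.0 if magnitude is 0, by assumption)."""
    return match / magnitude if magnitude else 0.0

def project_match(magnitude, output_mr, filter_mr):
    """Projected match = magnitude * output mr * filter mr."""
    return magnitude * output_mr * filter_mr

mr = match_rate(30, 120)
print(mr)                              # 0.25
print(project_match(200, mr, 0.5))     # 25.0
```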
 
This inter-order projection
continues up to the top order of match within a pattern, which is the ultimate
selection criterion because that’s what’s left matching on the top level of
search.
Inter-order vectors are ΛV
symmetrical, but ΛV derivatives from lower order of match are also projected
for higher-order match, at the same rate as the match itself?
 
Also possible is comparison
across syntactic gaps: ΛY comparison -> difference, filter feedback VY
hierarchy. For example, comparison between dimensions of a multi-D pattern will
form possibly recurrent proportions.
 
Internal comparisons can
further compress a pattern, but at the cost of adding a higher-order syntax,
which means that they must be increasingly selective. This selection will
increase “discontinuity” over syntactic coordinates: operations necessary to
convert the variables before comparison. Eventually, such operators will become
large enough to merit direct comparisons among them. This will produce
algebraic equations, where the match (compression) is a reduction in the number
of operations needed to produce a result.
 
The first such short-cut
would be a version of the Pythagorean theorem, discovered during search in 2D (part
6) to compute cosines. If we compare 2D-adjacent 1D Ls by division, over 1D
distance and derivatives (an angle), partly matching ratio between the ratio of
1D Ls and a 2nd derivative of 1D distance will be a cosine.
Cosines are necessary to
normalize all derivatives and lengths (Ls) to a value they have when orthogonal
to 1D scan lines (more in part 6).
 
Such normalization for a
POV angle is similar to dimensionality reduction in Machine Learning, but
is much more efficient because it is secondary to selective dimensionality
expansion. It’s not really “reduction”: dimensionality is prioritized rather
than reduced. That is, the dimension of pattern’s main axis is maximized, and dimensions
sequentially orthogonal to higher axes are correspondingly minimized. The
process of discovering these axes is so basic that it might be hard-wired in
animals.
  
6. Cartesian dimensions and sensory modalities (out of date)
  
This is a recapitulation
and expansion on incremental dimensionality introduced in part 2.
The term “dimension” here is
reserved for a parameter that defines sequence and distance among inputs,
initially Cartesian dimensions + Time. This is different from terminology of
combinatorial search, where dimension is any parameter of an input, and their
external order and distance don’t matter. My term for that is “variable”;
external dimensions become types of a variable only after being encoded within
input patterns.
 
For those with ANN
background, I want to stress that a level of search in my approach is a 1D queue
of inputs, not a layer of nodes. The inputs to a node are combined regardless
of difference and distance between them (the distance is the difference between
laminar coordinates of source “neurons”).
These derivatives are
essential because the value of any prediction = precision of what * precision of
where. Coordinates and co-derived differences are not represented in ANNs, so
they can't be used to calculate Euclidean vectors. Without such vectors,
prediction and selection of where must remain extremely crude.
 
Also, layers in ANN are
orthogonal to the direction of input flow, so hierarchy is at least 2D. The
direction of inputs to my queues is in the same dimension as the queue itself,
which means that my core algorithm is 1D. A hierarchy of 1D queues is the most
incremental way to expand search: we can add or extend only one coordinate at a
time. This allows algorithm to select inputs that are predictive enough to
justify the cost of representing additional coordinate and corresponding
derivatives. Again, such incremental syntax expansion is my core principle,
because it enables selective (thus scalable) search.
 
A common objection is that
images are “naturally” 2D, and our space-time is 4D. Of course, these empirical
facts are practically universal in our environment. But, a core cognitive
algorithm must be able to discover and forget any empirical specifics on its
own. Additional dimensions can be discovered as some general periodicity in the
input flow: distances between matching inputs are compared, match between these
distances indicates a period of lower dimension, and recurring periods form
higher-dimension coordinate.
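
A rough sketch of such periodicity discovery, in Python. It is an illustration under strong assumptions, not the actual algorithm: "matching inputs" are reduced to exactly equal values, "match between distances" to equal consecutive distances, and the names find_period and to_coordinates are hypothetical.

```python
from collections import Counter, defaultdict


def find_period(stream):
    """Return the most recurrent distance between matching inputs, or None.
    Matching is simplified to exact equality (an assumption of this sketch)."""
    positions = defaultdict(list)
    for i, value in enumerate(stream):
        positions[value].append(i)

    recurring = Counter()
    for idx in positions.values():
        # distances between consecutive occurrences of the same input
        distances = [b - a for a, b in zip(idx, idx[1:])]
        # compare the distances: a repeated distance is a candidate period
        for d1, d2 in zip(distances, distances[1:]):
            if d1 == d2:
                recurring[d1] += 1
    if not recurring:
        return None
    return recurring.most_common(1)[0][0]


def to_coordinates(index, period):
    """Split a 1D index into (higher-dimension, lower-dimension) coordinates."""
    return divmod(index, period)


# three "scan lines" of width 4, flattened into one input flow
flow = [5, 7, 2, 8, 5, 7, 3, 8, 5, 7, 2, 8]
period = find_period(flow)
```

Here the recurring distance is the line width, so the discovered period lets a flat index be re-expressed as a pair of coordinates.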
 
But as a practical shortcut
to expensive dimension-discovery process, initial levels should be designed to
specialize in sequentially higher spatial dimensions: 1D scan lines, 2D frames,
3D set of confocal “eyes”, 4D temporal sequence. These levels discover
contiguous (positive match) patterns of increasing dimensionality:
1D line segments, 2D blobs,
3D objects, 4D processes. Higher 4D cycles form hierarchy of multi-dimensional
orders of scale, integrated over time or distributed sensors. These higher
cycles compare discontinuous patterns. Corresponding dimensions may not be
aligned across cycles of different scale order.
 
Explicit coordinates and
incremental dimensionality are unconventional. But the key for scalable search
is input selection, which must be guided by cost-benefit analysis. Benefit is projected
match of patterns, and cost is representational complexity per pattern. Any
increase in complexity must be justified by corresponding increase in
discovered and projected match of selected patterns. Initial inputs have no
known match, thus must have minimal complexity: single-variable “what”, such as
brightness of a grey-scale pixel, and single-variable “where”: pixel’s
coordinate in one Cartesian dimension.
 
Single coordinate means
that comparison between pixels must be contained within 1D (horizontal) scan
line, otherwise their coordinates are not comparable and can’t be used to
select locations for extended search. Selection for contiguous or proximate
search across scan lines requires second (vertical) coordinate. That increases
costs, thus must be selective according to projected match, discovered by past
comparisons within 1D scan line. So, comparison across scan lines must be done
on 2nd level of search. And so
on.
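
The 1st-level search described above can be sketched in Python as follows. This is a simplified illustration, not the project's code: the names compare_line, form_patterns and ave are hypothetical, and match is assumed to be the smaller comparand minus a fixed filter ave.

```python
def compare_line(pixels, ave=10):
    """Compare each pixel to its predecessor within one scan line.
    Returns (x, p, d, m): coordinate, input, difference, match.
    Match = min of comparands minus filter `ave` (an assumption of this sketch)."""
    derivs = []
    for x in range(1, len(pixels)):
        p, prior = pixels[x], pixels[x - 1]
        d = p - prior
        m = min(p, prior) - ave
        derivs.append((x, p, d, m))
    return derivs


def form_patterns(derivs):
    """Cluster contiguous same-sign-of-match inputs into 1D patterns.
    Each pattern keeps its sign, start coordinate x0, length L,
    and summed input I, difference D and match M."""
    patterns = []
    for x, p, d, m in derivs:
        sign = m > 0
        if patterns and patterns[-1]["sign"] == sign:
            P = patterns[-1]
            P["L"] += 1
            P["I"] += p
            P["D"] += d
            P["M"] += m
        else:
            patterns.append({"sign": sign, "x0": x, "L": 1, "I": p, "D": d, "M": m})
    return patterns


line = [20, 22, 21, 5, 4, 6, 30, 31]
patterns = form_patterns(compare_line(line))
```

Only patterns with sufficient summed match would then justify the cost of a second coordinate and comparison across scan lines.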
 
Dimensions are added in the
order of decreasing rate of change. This means spatial dimensions are scanned
first: their rate of change can be sped-up by moving sensors. Comparison over
purely temporal sequence is delayed until accumulated change / variation
justifies search for additional patterns. Temporal sequence is the original
dimension, but it is mapped on spatial dimensions until spatial continuum is
exhausted. Dimensionality represented by patterns is increasing on higher
levels, but each level is 1D queue of patterns.
 
Also independently
discoverable are derived coordinates: any variable with cumulative match that
correlates with combined cumulative match of all other variables in a pattern.
Such correlation makes a variable useful for sequencing patterns before
cross-comparison.
It is discovered by summing
matches for same-type variables between input patterns, then cross-comparing
summed matches between all variables of a pattern. Variable with the highest
resulting match of match (mm) is a candidate coordinate. That mm is then
compared to mm of current coordinate. If the difference is greater than cost of
reordering future inputs, sequencing feedback is sent to lower levels or
sensors.
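
A crude Python sketch of this selection. The names are hypothetical, and mm is reduced here to the smaller of a variable's summed match and the combined summed match of all other variables, which is only a stand-in for the cross-comparison described above.

```python
def summed_matches(comparisons):
    """Sum matches per same-type variable over a sequence of pattern comparisons.
    Each comparison is a dict: variable name -> match."""
    totals = {}
    for matches in comparisons:
        for var, m in matches.items():
            totals[var] = totals.get(var, 0) + m
    return totals


def select_coordinate(totals, current, reorder_cost):
    """Return the variable to use for sequencing future inputs.
    mm per variable = min(own summed match, combined summed match of the others):
    an assumed stand-in for match of match."""
    combined = sum(totals.values())
    mm = {var: min(m, combined - m) for var, m in totals.items()}
    candidate = max(mm, key=mm.get)
    # switch only if the gain exceeds the cost of reordering future inputs
    if mm[candidate] - mm[current] > reorder_cost:
        return candidate
    return current


comparisons = [
    {"x": 10, "brightness": 30, "size": 20},
    {"x": 30, "brightness": 60, "size": 40},
]
totals = summed_matches(comparisons)
```

With these numbers the candidate replaces the current coordinate "x" when reorder_cost is 20, but not when it is 60.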
 
Another type of empirically
distinct variables is different sensory modalities: colors, sound and pitch,
and so on, including artificial senses. Each modality is processed separately,
up a level where match between patterns of different modalities but same scope
exceeds match between unimodal patterns across increased distance. Subsequent
search will form multi-modal patterns within common S-T frame of reference.
 
As with external
dimensions, difference between modalities can be pre-defined or discovered. If
the latter, inputs of different modalities are initially mixed, then segregated
by feedback. Also as with dimensions, my core algorithm only assumes
single-modal inputs, pre-defining multiple modalities would be an add-on.
  
7. Notes on working mindset and awards for contributions
  
My terminology is as
general as the subject itself. It’s a major confounder, - people crave context,
but generalization is decontextualization. And cognitive algorithm is a
meta-generalization: the only thing in common for everything we learn. This
introduction is very compressed, partly because much of the work is in progress.
But I think it also reflects and cultivates ruthlessly reductionist mindset
required for such subject.
 
My math is very simple,
because algorithmic complexity must be incremental. Advanced math can
accelerate learning on higher levels of generalization, but is too expensive for
initial levels. And minimal general learning algorithm must be able to discover
computational shortcuts (AKA math) on its own, just like we do. Complex math is
definitely not innate in humans on any level: cavemen didn’t do calculus.
 
This theory may seem too speculative,
but any degree of generalization must be correspondingly lossy. Which is
contrary to precision-oriented culture of math and computer science. Hence,
current Machine Learning is mostly experimental, and the progress on
algorithmic side is glacial. A handful of people aspire to work on AGI, but they either lack or
neglect functional definition of intelligence, their theories are only vague
inspiration.
I think working on this level
demands greater delay of experimental verification than is acceptable in any
established field. Except for philosophy, which has nothing else real to study.
But established philosophers have always been dysfunctional fluffers, not
surprisingly as their only paying customers are college freshmen.
 
Our main challenge in
formalizing GI is a species-wide ADHD. We didn’t evolve for sustained focus on
this level of generalization, that would cause extinction long before any
tangible results. Which is no longer a risk, GI is the most important problem
conceivable, and we have plenty of computing power for anything better than
brute-force algorithms. But our psychology lags a light-year behind technology:
we still hobble on mental crutches of irrelevant authority and peer support,
flawed analogies and needless experimentation.
 
Awards for contributions
 
I offer prizes up to a
total of $500K for debugging, optimizing and extending this algorithm: github.
Contributions must fit into
incremental-complexity hierarchy outlined here. Unless you find a flaw in my
reasoning, which would be even more valuable. I can also pay monthly, but there
must be a track record.
Winners will have an option
to convert the awards into an interest in all commercial applications of a
final algorithm, at the rate of $10K per 1% share. This option is informal and
likely irrelevant, mine is not a commercial enterprise. Money can’t be primary
motivation here, but it saves time.
 
Awards so far:
 
2010: Todor Arnaudov, $600 for suggestion to
buffer old inputs after search. 
2011: Todor, $400
consolation prize for understanding some ideas that were not clearly explained
here.
2014: Dan He, $600 for pushing me to be more specific and
to compare my algorithm with others.
2016: Todor Arnaudov, $500 for multiple
suggestions on implementing the algorithm, as well as for the effort.
Kieran Greer, $375 for an attempt to
implement my level 1 pseudo code in C#
2017:
Alexander Loschilov, $2800 for help in
converting my level 1 pseudo code into Python, consulting on PyCharm and SciPy,
and for insistence on 2D clustering, February-April.
Todor Arnaudov: $2000 for help in
optimizing level_1_2D, June-July.
Kapil Kashyap: $2000 for stimulation
and effort, help with Python and level_1_2D, September-October
2018: 
Todor Arnaudov, $1000 mostly for effort and
stimulation, January-February
Andrei Demchenko, $1800 for conventional refactoring in
line_POC_introductory.py, interface improvement and few improvements in the
code, April - May.
Todor Arnaudov, $2000 for help in debugging
frame_dblobs.py, September - October.
Khanh Nguyen, $2700, for getting to work line_POC.
2019:
Stephan Verbeeck, $2000 for getting me to return to using minimally-coarse gradient and
his perspective on colors and line tracing, January-June
Todor Arnaudov, $1600, frequent participant,
March-June
Kok Wei Chee, $900, for diagrams of line_POC and frame_blobs, December
Khanh Nguyen, $10100, lead debugger and co-designer, January-December
2020:
Mayukh Sarkar, $600 for
frame_blobs performance analysis and porting form_P to C++, January
Maria Parshakova, $1600,
team developer, March-May 
Khanh Nguyen, $8100, team
developer
Kok Wei Chee, $14200, team
developer
2021:
Many thanks to Chris Sun for
his efforts to find collaborators!
Kok Wei Chee, $22000, lead
developer, January-December
Khanh Nguyen, $5000, team
developer, April-October
Alex Pitertsev, $1000:
mostly visualization via dfs, July-August
Kelvin Spacey, $1840: port
to dataframes, various, May-July
Yura Guruel, $1000: various,
May-July
Aqib Mumtaz and Ayesha Ali,
$1400: audio interfacing for 1D alg, April-May
2022:
Kok Wei Chee, $25300: lead
developer, January-December 
Alex Pitertsev, $2000 for
porting line_comp to Julia, June

 
 
145 comments:
I wish you had illustrations to make this easier to follow. I think what I'm missing is sufficient context to follow your initial development of your ideas.
I am not sure about illustrations Aaron, they may mislead rather than explain. The only macro "architecture" I have is a hierarchy of generalization. There are micro- hierarchies of derivation within each level, but they have variable / incremental depth & dimensionality. I don't think you're missing context, - there isn't any. Generalization is decontextualization, & nothing's more general than GI. It comes with the territory, - you pay for an altitude with a lack of air:). But I'll think about illustrations, & please let me know if you have specific questions.
Alper E:
I need an example (such as character recognition, or pattern recognition) in relation to which I can think on the algorithm. You already mention "pixels across a 1 dimensional scan line". Do you have a more elaborate example and a description of the problem you are trying to tackle with this algorithm?
Thanks,
Alper
All of this is pattern recognition, Alper, according to my definition of a pattern (haven't seen any constructive alternative to that). It's just that for me a pattern can be of any level of complexity, & you are probably thinking of something specific. For example, a character is a contiguous 2D pattern. So in my scheme it would be formed on the second level, out of 1D segments (even of length = 1) contiguously matching in 2nd D (across scan lines). "Contiguous" means that minimal d_coordinate between any constituent pixels of the segments = 1, in both dimensions. These 2D patterns (characters) will be automatically compared & recognized on higher levels: 3D & higher contiguous or 2D & higher discontinuous search. This comparison will be between same-type variables of both comparands (multi-variate patterns). These variables will include presumably-identical brightness & line-width (angle (d_1D_coordinate) -normalized 1D segments' length), & character-specific length (angle-normalized 2nd D height). This 2D length will mediate constituent sub-patterns: 2D-contiguous sequence of angles & 2D lines, which will be compared across patterns if their macro-patterns (above) match. Overall match will be a sum of matches between individual variables & sub-patterns, each possibly weighted by feedback.
This is a very rough description, I don't design this algorithm for character recognition, or any specific task. It should simply recognize anything recognizable, on corresponding level of search & with sufficient resources.
My "problem" is efficient search, where efficiency is resource allocation in proportion to predictive value, on all levels of representation. Predictive value is past_match - system-average_match, both projected by co-derived partial misses: feed-forward & feed-back vectors.
Alper E said:
It now becomes a little bit more clear (but, only a little bit :)
Do you have a description of your algorithm in a formal language (you had mentioned a pseudo language). I really need a formal description of it, because many of the terms you use do not look clear to me. Moreover, it'd be fine if we stick to a simple example, like character recognition.
How would you propose to proceed?
Alper
That pseudo is my own very high-level language, I sent you a sample. I tried more formal descriptions, but it's very awkward to think in these terms, I need more rhyme & (lossy) compression. None of that pseudo is about 2D, though. I currently work on the core algorithm, which is 1D: time-only. Spatial dimensions are empirically-specific add-ons to that core algorithm. Still worth thinking about, might be suggestive beyond spatial specifics. So, we can talk in terms of character recognition. But email is a low through-put medium, especially as I am a slow typist. For me to initially explain my terms, it should be question-answer in real-time, that's why I suggested skype or conference.
I think since so little is known about the algorithmic structure of the human brain you had better leave it to evolutionary algorithms to forge an AI. It would seem from the evolutionary time-line here on earth that creating higher intelligence is a trivial matter compared to perfecting the biochemical basis of life. I think you give an evolutionary algorithm as facile a medium to work with as possible and then, simply let it work. The medium of which I speak is probably some form of patterned arithmetic structure where mutations do not generally have catastrophic results. And where there is an extremely high level of malleability with regard to what is constructed.
>I think since so little is known about the algorithmic structure of the human brain
I know enough to realize that it's a misshapen kludge,- a worthy product of dumb & blind evolution.
> you had better leave it to evolutionary algorithms to forge an AI.
We did, for 3B years.
Now that we have brains, why not use them?
> creating higher intelligence is a trivial matter compared to perfecting the biochemical basis of life.
That "biochemical basis" is all "arithmetic" too. If you design an algorithm to maximize its own reproduction, then you're starting from 3B years ago.
And if you want to maximize intelligence, then you need to define it functionally, which is where I start in this intro.
> The medium of which I speak ...
You can't get a *general* intelligence to evolve in a severely *constrained* medium.
Laszlo Abraham:
Hi, interesting "project". Read your article (and the indicated literature). In my opinion, your approach to this domain is too philosophical. You set a so high level of generality that keeps you out from any concrete (usable) discovery / solution .
The problem is not to understand your ideas. Any interested scholar, with a sufficient culture in the domain, at a moment, obviously arrives to these conclusions about intelligence and pattern matching. The problems are starting only from here.
If you want to say something new (usable for others too) you must dive in the details. You must split the general problem in small, maybe banal and trivial parts, until the solution is feasible and achievable. This is what Plato names diaeresis. From your level of generality this is impossible. You are blocked there. You want a too big junk with one bite. Philosophy will never solve any concrete problem.
Keep in mind what Henry Ford said: "Nothing is impossible if we could disassemble it in small (simple, solvable) parts". Every serious researcher in the domain is doing this. do not disdain them. This not means that they all miss the general view you just discovered. This means they want more: DO something, even if they could not solve alone all the problems. This is a cooperative game.
Of course we must see the forest, not only the trees. It is important, it will guide us. But, in the same time, never could make a fire with the "forest", only with woods.
Told you all these because (it seems) you traversed the same stages as I did years ago. My critics want to be constructive, want to help you. Don't be offended.
I have been working in this domain for more than 20 years. Have some ideas, what and how to do. Never had enough time (money) to work only on these ideas. Would be very happy if could realize something together, but ONLY if you don't want the Moon from the sky in one step, and we have a clear (partial) target every moment.
Waiting for your reply,
I think it's a solvable problem, Laszlo: we have a solution between the ears. And I did define it, from the very beginning. It's true that no solution is final, - there is no limit to how efficient / scalable a search can be.
But there must be a strictly incremental-complexity progression of improving scalability without losing generality, as I outlined. Small steps, as you say.
The difference is that I want these steps to implement first principles, & you want them to solve specific problems. That's a common dilemma in such work, - scalability vs. quick fixes.
You say any reasonable scholar would agree with my conclusions, - name one? "Indicated literature" in my intro is precisely to explain fundamental differences between my approach & anything else I came across.
If you can show me one work that follows roughly the same progression as mine, - that's a big award right there. Even something as basic as defining a match: absolute, projected, relative, - as in part 2? Or *one* approach to clustering that proceeds with incremental spatio-temporal dimensionality, even if their definition of "distance" is not fully consistent with mine. Hello?
This is not philosophy, never mind Plato or Ford.
Re "partial targets": in my sense, I am currently trying to better define mean match, as a criterion for extended search.
Basic definition is: value of current-level projected match that co-occurs with (thus predicts) a mean projected higher-level match.
But I think that mean current-level projected match must be adjusted by projected consequences of its own feedback?
Would be interested in your ideas, of course. Can't promise to accept them, but if I don't, I will tell you exactly why.
Laszlo Abraham said:
Hi Boris,
Unfortunately, you feel offended. Sorry for this. Have no time to make a thorough (statement level) analysis of your theory. I think I understood the main message. Sorry, but I don't saw any original idea in it. You are using other words for the known things. This is not improving our knowledge in the domain. Additionally, the problem you "defined" is existing, at least, from 1676 (see "Characteristica universalis" of Leibnitz).
Since this domain (AI) existed long before we started to study it, and many smart people contributed to it with concrete and relevant results, it is an elementary expectation to use the already consecrated terminology when we are discussing about these things. For example, your "incremental spatial-temporal dimensionality" is the well known phase space in physics. Searching patterns in it is the identification problem in the dynamic systems theory, and the analog problem in the abstract notions world is modeled very well with the causal (knowledge) graphs. If you want to introduce a new terminology, you must first argument its necessity. It is very good if you want to discover some new things, but this is not a linguistic problem: renaming the notions and methods is not helping us at all. Let's speak the same language!
When you try to build a new theory, you must know almost everything about what others done in this direction, and respect their results. In other case, you will prove only your ignorance, if some will take the time to decrypt what you want to say. Also, when you expose your new theory, you must offer links to the already known and generally accepted facts, and your article must contain relevant citations. In academic circles this is an elementary expectation too. At least, you must explain why your theory is necessary, what improvements it brings, and what are YOUR ORIGINAL RESULTS (as Einstein did in 1904). If you want to advocate for a new method (theory), first you must have results, Maybe you're right, maybe your idea is genial, anyhow nobody will believe you till you prove it. So, let's start with the small problems!
Regarding the examples you ask: at a popular science level Kurzweil's, Hofstadter's and Mero's books contain all your ideas about the link between intelligence and pattern matching (and much more). At a scientific level, conforming to my knowledge, Schmidhuber's and Hutter's works are the examples which are working with the same notions / methods, and HAVE SPECTACULAR RESULTS. For example, google for the Hutter Prize. This is a very good example how an award promising theoretical problem must be exposed.
Additionally, you could take a look on Judea Pearl's books, and the domains of knowledge graphs, chat robots ontology and the Loebner Prize. In my opinion, you rediscovered the contextuality of the knowledge. Bravo, how we go further?
Of course, you are free to publish anything on your website, even on Elance. The risk is only to receive unpleasant comments from old dinosaurs as I am. Again, sorry if offended you, don't bother you more. Except if you really want to put at work the contextual model of the world.
Thank you for your attention,
Laszlo
As you admitted Laszlo, you didn’t get very far in my intro. Just recognized some basic points you already knew, & ignored parts you didn’t understand, - most of it.
Which means that you can’t judge whether it’s original. For example “phase space” has nothing to do with my incremental dimensionality, see part 3.
And other suggestions are pure fluff: “philosophy” as you say.
Except for two: I don’t use accepted terminology & build on other people’s work.
Well, there are several fields involved, all significantly off of my focus. And each has its own terminology, usually incoherent & mind-polluting.
Like I said, generalization is a reduction. You can spend many life-times reading related work, only to realize that it’s not related enough.
Or, you can define & describe the problem in general English terms & phrases that fit it best. Then you google them, & see that my site is on top of the list.
I don’t think the problem here is my terminology, it simply takes a while to acquire appropriate context (in my case, ~ 27 years).
I am duly impressed that there are plenty of smart people in the world, many smarter than I am. But, as with anything else, intelligence is not what you got, it’s how you use it.
And almost all use it mainly to succeed in life, according to societal standards. Yes, even in science: it’s all about peer review. Which is great, but only if you’re lucky enough to have peers.
Anyway, thanks for trying.
Dan He said:
I briefly read your blog, very interesting work. I have the following thoughts.
The proposed method is contrary to deep learning. In deep learning, the number of features in the model presentation is reduced layer by layer, starting from the full set of features. In your model, the number of features increases layer by layer, starting from single features. However, your model also allows cross layer communications from top layers down to the bottom layers, which is similar to deep learning. I think this is indeed how the brain works: when we learn something, we often jump forwards and backwards to connect different levels of details and information. I think this process should be developed in a more systematic way.
Human brains learn new knowledge, or patterns, differently. Someone may pay attention to the big picture first, someone may pay attention to details first. Someone may pay attention to shape first, someone may pay attention to color first. Someone may pay attention to red color first, someone may pay attention to blue color first. Thus there should not be a single "best" order of the dimensions to be selected, alternatively speaking, the incremental growth could start from different dimensions. Instead of fixing the order of the dimensions to be selected as proposed by the blog, we could use a merge-based strategy: we first select a set of size-1 dimensions that are significant enough, assuming we select dimensions A,B,C,D. Then we merge these significant dimensions into size-2 dimensions, AB,AC,AD,BC,BD,CD and then select from them the significant size-2 dimensions. Assuming we select AB,AC,BC,BD,CD. Then we merge these significant size-2 dimensions into size-3 dimensions, ABC,BCD. We repeat the process till no more dimensions can be merged. The advantage of the merge-based strategy is we allow different start points in the search space in a systematic way, which models the multiple possibilities of the brain functions.
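The merge step in this suggestion resembles Apriori-style candidate generation. A minimal Python sketch of that step only, assuming a merged set is kept as a candidate when every one of its smaller subsets was selected; the function name is hypothetical, and the significance test that picks the selected sets is not specified in the comment and is left out.

```python
from itertools import combinations


def merge_candidates(selected, size):
    """Return size-`size` combinations whose every (size - 1)-subset was selected.
    `selected` is a list of strings such as "AB", one character per dimension."""
    chosen = {frozenset(s) for s in selected}
    items = sorted(set().union(*chosen))
    return [
        "".join(combo)
        for combo in combinations(items, size)
        if all(frozenset(sub) in chosen for sub in combinations(combo, size - 1))
    ]


pairs = merge_candidates(["A", "B", "C", "D"], 2)
triples = merge_candidates(["AB", "AC", "BC", "BD", "CD"], 3)
```

With the selections from the example above, the size-3 candidates come out as ABC and BCD.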
One question I have is: the dimensions for the incremental selection seems to be different from the features of the data. For example, an image may have thousands of pixels and each one is considered as a feature. But it seems your dimension here is the 2-D space. Then when you make comparisons, how exactly you handle the features of the data, which is critical for the scalability of the method?
Finally, I would suggest you have a section of "concept definition" and "table of annotations". The current writing is kind of hard to follow as the definitions of some concepts and annotations are missing, or at least not easy to locate.
These are all my brief ideas, please let me know if you think we can talk more.
Thanks
Dan
Thanks for your comments Dan,
> In deep learning, the number of features in the model presentation is reduced layer by layer, starting from the full set of features...
"Features" in NNs correspond to two different things in my model:
1) Positionally distinct inputs / patterns, ordered along S-T dimensions within input flow: 1D length ) 2D width ) 3D depth ) 4D duration. These dimensions are defined by explicit coordinates in my model, but are not easily represented & compared in a NN, which I think is one of their basic flaws. The number of such patterns is reduced with elevation, just as it is in deep NN, because input patterns to higher levels are S-T contiguous spans of lower-level patterns.
2) Variables internal to these patterns, such as different original input modalities, & matches / differences derived from them by lower-level comparisons. Again, there is no such internal variables in NNs, instead, they are distributed across network / brain, & shared among multiple "patterns" or "cognits" in terms of J. Fuster (which are comprised of long-range connections to such variables). This is another basic flaw of NNs because such long-range connections are very expensive, & must be activated all at once. The number of these variables per pattern is increasing with elevation in my model because they are necessary to represent increasingly complex relationships discovered during lower levels of search. This is also true for neocortex, as the input flow proceeds from primary to association cortices. Again, in my model such incremental syntax is localised within a pattern, but is distributed across "neocortex" in neural models.
I tried to explain these things in parts 4 & 8, don't know if you made it that far.
> Someone may pay attention to the big picture first, someone may pay attention to details first...
"Attention" is a feedback from higher levels, you don't initially have it because higher levels are "empty", - they fill-in with generalized outputs of lower levels. A designer determines the order of feedforward, which can latter be altered by self-generated feedback.
> One question I have is: the dimensions for the incremental selection seems to be different from the features of the data. For example, an image may have thousands of pixels and each one is considered as a feature. But it seems your dimension here is the 2-D space.
It's 4D space-time, see above.
> Then when you make comparisons, how exactly you handle the features of the data, which is critical for the scalability of the method?
I compare same-type variables between different patterns: their syntax must be "synchronized" before comparison.
> Finally, I would suggest you have a section of "concept definition" and "table of annotations". The current writing is kind of hard to follow as the definitions of some concepts and annotations are missing, or at least not easy to locate.
This would depend on reader's background, which I don't know. That's why I use general English terms, usually in their original meaning. I know it's confusing for a specialist, but my subject is *general* intelligence :).
Thanks again Dan, hopefully we can get into procedural details.
Daniel Michulle said:
Hi there,
I am reading your description on the web site and will put down some of my thoughts:
1. "Incremental": I am not sure whether incremental approaches to pattern discovery work since this requires sequential processing. That is, unless you mean 1a or 1b.
1a: Attention: Attention is a sequential process that looks for "interesting things". This would mean that you need a guidance system for attention that I presume should have the same structure as the cognitive apparatus you intend to create. This can be just another level in the hierarchy.
1b: Hierarchical increments which you already included by stating you want a hierarchic system.
2. Feedback: Feedback must be two way in your solution.
2a Either you are able to do this fully in parallel, or
2b you define an artificial discrete state and update it sequentially, without any input/output occurring during this time, or
2c you select just some of the processing units in your cognitive apparatus to update and try to stay real-time
Anyhow, control flow seems a critical design decision.
PS: Found this today, nothing scientific but maybe a useful analogy: http://nautil.us/issue/12/feedback/ants-swarm-like-brains-think
3. Homoiconicity
In order to be efficient at implementing your Cognitive Apparatus (CA), you would do well in restricting input to some sort of data structure representable by modern day PCs. Bit-wise is an option but w.r.t. efficiency I presume that "array of double" is a better format. You can downscale later.
Also, in order to be able to create your CA in a finite amount of time, I presume it is much easier to define the output of any functional unit in your CA to be of the type of your selected input. In this way you can compose your CA of different units.
Still, you could later substitute some units by others without sacrificing composability.
Given these thoughts, it's no wonder that people often like deep learning based on ANNs: they are homoiconic and composable.
4. Parameters
Your system will have a huge number of parameters and you will need to search for optimal parameters. You don't want to do this by hand so you'll need a meta-algorithm.
Up to now I just found one "universal algorithm" that works context free: Evolutionary algorithms where competition leads to improvement (you don't know how good your CA will be but if you have 10 you at least have a measure of how good it is compared to peers)
Moreover, if you like challenges (and it certainly seems so), you can use genetic programming to find components. Also, you will obtain a measure for the overall cost of the system which may prove useful in finding:
5. local computation cost
Since you mention search cost, you will need an intrinsic measure of cost. I have no idea how you could implement that but brains do this via their neurotransmitters and vesicles. These are only slowly recharged and you can't spend more than you have. The disadvantage is that you need to model time in this case.
6. Background: I am a big fan of LISP. I encourage you to realize any of your ideas in Clojure because of homoiconicity and representation of logic (core.logic) and the huge amount of libraries in Clojure and Java.
I myself worked on two different AI "leaps":
6.1. A General Game Playing agent: An agent that is able to play any game (deterministic games with complete information, e.g. Chess, Checkers, Connect Four, Tic-Tac-Toe, Rock-Paper-Scissors)
6.2 A generic time series forecasting mechanism that takes any kind of time series data and is able to extract all the info for forecasting, including selecting its internal model, time series transformations, outlier detection, ...
I founded a company in 2012 for promoting the system and we have modest success right now.
That being said, I am a big fan of this kind of ideas but I lack time.
Hope you find this interesting!
BR, Daniel
Thanks for your comments, Daniel,
>1. "Incremental": I am not sure whether incremental approaches to pattern discovery work since this requires sequential processing.
Any pattern discovery is an iterative comparison to other inputs within a search range. I tried to explain how search should incrementally / hierarchically expand, let me know what's unclear.
> 1a: Attention: Attention is a sequential process that looks for "interesting things". This would mean that you need a guidance system for attention that I presume should have the same structure as the cognitive apparatus you intend to create. This can be just another level in the hierarchy.
Attention is simply another term for feedback down the hierarchy, which I also tried to explain.
> 2. Feedback: Feedback must be two way in your solution.
> 2a Either you are able to do this fully in parallel
All levels should process in parallel, both feedforward & feedback.
> Anyhow, control flow seems a critical design decision.
"Control" is also another term for feedback.
> 3. Homoiconicity: In order to be efficient at implementing your Cognitive Apparatus (CA), you would do well in restricting input to some sort of data structure representable by modern day PCs. Bit-wise is an option but w.r.t. efficiency I presume that "array of double" is a better format. You can downscale later.
All I need is bit-mapped images.
> I presume it is much easier to define the output of any functional unit in your CA to be of the type of your selected input.
Basic units in my approach are levels, syntactic complexity of their inputs & outputs is incremental with elevation, as I tried to explain.
> 4. Parameters: Your system will have a huge number of parameters and you will need to search for optimal parameters. You don't want to do this by hand so you'll need a meta-algorithm. Up to now I just found one "universal algorithm" that works context free: Evolutionary algorithms where competition leads to improvement (you don't know how good your CA will be but if you have 10 you at least have a measure of how good it is compared to peers)
Input parameters are also altered by feedback, not at random as in GAs, see part 1.
> Since you mention search cost, you will need an intrinsic measure of cost. I have no idea how you could implement that
I use "opportunity cost": average projected match per search step. This is the most basic form of feedback, see part 2.
> The disadvantage is that you need to model time in this case.
I use explicit 4D space-time coordinates, see part 4.
> I encourage you to realize any of your ideas in Clojure because of homoiconicity and representation of logic (core.logic) and the huge amount of libraries in Clojure and Java.
My approach is expressed in its own language. The core algorithm is supposed to be self-contained, I don't plan to use any external libraries for it.
Sorry for being brusque. It's just that I've been repeating these things forever, don't know how to make it any clearer...
Dan He said:
Hi, Boris:
Sorry for the late response. I was traveling the last couple of weeks.
So your patterns are positionally distinct. What about other types of patterns? Is your model generic to all types of patterns or only to this specific type?
Also if the dimension is only 4-D, does incremental really help? I am not sure if my understanding is correct, but it seems you will only have 4 layers?
Thanks
Dan
No problem Dan.
"Positionally" includes temporally distinct, - each level gets a temporal sequence of inputs, each with distinct temporal coordinate. In that sense, any type of pattern is positionally distinct. Initial "patterns" are pixels in 1D scan line, - temporal sequence that is mapped into spatial coordinate on a higher level.
On the other hand, all constituent variables co-occur in a pattern, thus are not positionally distinct. On a given level of differentiation, of course. A pattern may contain multiple levels of partial differences / derivatives relative to other patterns, which are positionally distinct.
Incremental dimensionality is: 1D 1st level, 2D 2nd level, 3D 3rd level, TD 4th level.
This 4-level cycle repeats indefinitely, with patterns of hierarchically greater span / scale.
Dimensions in cycles of different order of scale are not necessarily aligned with each other.
Anyway, our 3D space is empirically specific & this dimensionality should be learnable.
My core algorithm is 1D, - time only.
The point of having incremental dimensionality is that it lets you filter-out weaker lower-D patterns before adding the costs of higher-D coordinates.
Dan He said:
If the number of levels could be infinite, what if we don't fix the order of the incremental search, but instead increase the levels by merging the lower levels? I think this way we would still consider all possible combinations of levels while pruning the search space effectively.
Dan
Not sure I understand you, Dan.
"Level" is a span of search before selection = elevation.
(It's actually more complex, see part 3, somewhat out of date)
Such span is "increment" in incremental search.
That span ends when the benefit of selection exceeds its cost.
The benefit of selection is (cost - benefit) of comparing sub-critical
(cost > benefit) patterns, which should be pruned-out.
This point is reached because the cost of comparison is fixed,
but the benefit (average match) declines with the distance
between comparands (if the data set is not random).
That point doesn't depend on total memory of the system,
so we fix the number of inputs per level, not the number of levels.
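
The stopping rule above can be illustrated with a minimal sketch. It assumes that match between two inputs is the smaller of the two values, and that the cost of one comparison is a fixed number; the function names `average_match` and `search_span` are hypothetical and are not taken from the actual algorithm.

```python
def average_match(seq, distance):
    """Mean match over all pairs of inputs that are `distance` apart.

    Assumption: match of two inputs is min(a, b).
    """
    pairs = [(seq[i], seq[i + distance]) for i in range(len(seq) - distance)]
    return sum(min(a, b) for a, b in pairs) / len(pairs)


def search_span(seq, cost):
    """Largest distance at which average match still exceeds the fixed cost.

    Search range is extended one step at a time and stops as soon as
    comparison at the next distance would no longer pay for itself.
    """
    span = 0
    for distance in range(1, len(seq)):
        if average_match(seq, distance) <= cost:
            break
        span = distance
    return span


# On this ramp, average match declines as distance grows: 5.0, 4.5, 4.0, ...
ramp = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
print(search_span(ramp, cost=4))  # 2
```

The span depends only on how fast match declines with distance relative to the cost, not on how much data is stored, which is the sense in which the number of inputs per level is fixed rather than the number of levels.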
I am glad you are thinking about it Dan, but these ideas must be justified :).
Dan He said:
Do you know Apriori algorithm? The idea is justified there. It's just a
matter whether it can be applied
here.
Dan
Looking at wiki, I don't see a justification.
Anyway, in breadth-first all data is available at once.
In real-time learning, inputs are searched in the order
they come in, & data set is never complete.
It's neither breadth- nor depth-, but proximity- first.
Dan He said:
Why we have to follow the input order? I mean we could cache the input
within a certain short amount of time window and the apriori algorithm can
then be applied, right?
Dan
We have to follow proximity between inputs because it's
the most basic predictor of similarity / match.
This may not sound obvious to you because you are used to
dealing with heavily pre-processed data. Compressive
pre-processing (starting with image-compression transforms)
already exploited this proximity->similarity correlation,
& you can't do that again.
I start from raw images because the algorithm has to scale
down in order to scale up. Generality / scalability means that the same principles should apply on all levels, only the syntax of input data is getting more complex.
"Time window" is my search level, & yes, it's breadth-first
within it. But apriori algorithm searches for binary match between
combinations among adjacent inputs, & I start by searching for
gray-scale match between individual inputs.
Again, this is because I start from the beginning, while the algorithms you are used to are designed to deal with preprocessed / human-generated symbolic data. That's why they don't scale.
Dan He said:
I don't think Apriori can only deal with binary match. You can use any matching criteria to determine if a candidate should be selected or not.
Also when you include a new dimension, it's either include or not, right?
That's the same thing for Apriori. Even if you allow partial inclusion, you could still do the same thing in Apriori, with partial membership, something similar to fuzzy logic. I just don't see why you cannot apply Apriori here.
Dan
I guess you can morph any algorithm into any other, if definitions are fuzzy enough.
That's why I work on the level of operations, rather than wholesale algorithms. What exactly do you suggest to change / add to my algorithm?
Dan He said:
Well, my suggestion is:
Instead of sequential incremental search, we might be able to do parallel incremental search from multiple dimensions, where we increase the dimensions in parallel by merging the dimensions of the current level to generate the candidates for the next level. At each level, the candidates are selected using any user specified rule.
Dan
With incremental dimensionality, a 2nd D (vertical) level
receives inputs (1D patterns) that were compressed &
selected on a lower level. And so on.
In your version, multiple levels would process in parallel
the same uncompressed & unselected inputs, which is
a horrible waste.
Incremental selection is my core principle: it's what
should make the algorithm scalable. Nice try :).
> At each level, the candidates are selected using any user
specified rule.
The algorithm is supposed to be general: autonomous &
unsupervised. Selection criteria is what it's all about, there's
no user in the loop.
BTW, I'd like to have this discussion in comments on my
intro, someone else may find it useful. I'll copy old messages
there, If you don't mind.
Dan He said:
Only on 1D the inputs will be uncompressed & unselected. Since 2D, we will be merging the dimensions from the previous level so it won't be uncompressed. At each level we make selections and we only merge the dimensions that are "significant".
The selection is generic. It doesn't need to be user defined. Whatever method you used to select a dimension can be used here.
Yeah, you can publish these discussions.
Dan
You said:
> parallel incremental search from multiple dimensions,
> where we increase the dimensions in parallel by merging
> the dimensions of the current level to generate the
> candidates for the next level.
Do you mean expanding search in multiple additional
dimensions (& adding corresponding coordinates) per
single additional level, then?
If so, I already answered that in May 11th message:
"The point of having incremental dimensionality is that it
lets you filter-out weaker lower-D patterns before adding
the costs of higher-D coordinates."
This is an instance of "incremental syntax", - another core
principle in my approach.
I don't know what you mean by "merging" dimensions?
In my model, lower-level variables are selectively
represented or merged with related variables on a
higher level, see part 5.
Although that part is a bit messy & out of date, sorry.
Dan He said:
If you have ABCD 4 dimensions, and at current level, say 2 dimension level, you have AB and BC, both are "significant", then the next level is 3 dimension, you merge AB and BC and you obtain ABC, which is 3 dimension.
That's what I mean by "merging". You merge two n-dimensions to obtain one n+1-dimension, where the two n-dimensions overlap by n-1 dimensions.
Apriori algorithm then guarantees that you won't miss any significant dimensions and you don't need to do exhaustive search.
Dan
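
For reference, the merge step described in this comment can be sketched as follows. This is only the candidate-generation part of Apriori, written from its textbook description; the function name `apriori_join` and the `prune` flag are hypothetical. Standard Apriori also requires every n-item subset of a candidate to be frequent, which is what `prune=True` checks.

```python
from itertools import combinations


def apriori_join(frequent, k, prune=False):
    """Merge k-item sets that overlap in k-1 items into (k+1)-item candidates.

    With prune=True, keep a candidate only if all of its k-item subsets
    are themselves in `frequent` (the standard Apriori prune step).
    """
    frequent = {frozenset(s) for s in frequent}
    candidates = set()
    for a, b in combinations(frequent, 2):
        union = a | b
        if len(union) != k + 1:
            continue  # the two sets do not overlap in exactly k-1 items
        if prune and not all(frozenset(sub) in frequent
                             for sub in combinations(union, k)):
            continue
        candidates.add(union)
    return candidates


# AB and BC merge into ABC, as in the example above.
print(apriori_join([{"A", "B"}, {"B", "C"}], k=2) == {frozenset("ABC")})  # True
```

With `prune=True` the same call returns an empty set, because AC is not among the significant pairs; that is the sense in which Apriori avoids exhaustive search.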
I thought we agreed to use "dimension" to mean Cartesian
coordinates + time |the order of input, for clarity?
You are talking about dimensions of combinatorial search,
while mine is proximity-ordered. As in, proximity between
comparands, rather than proximity in expanding the number
of constituent variables within a comparand.
I have multivariate patterns, but their variables aren't included from outside, they are *derived* through lower-level comparisons to a fixed syntax.
When I compare these patterns, I don't compare their variables in all possible combinations, I only compare variables of the same type between different patterns.
So, the difference is that variables in my patterns ("dimensions" in your terms) are added to represent derivatives from comparing & compressing lower-level patterns. While in apriori algorithm, they *are* lower-level patterns, combined into higher-level ones if they co-occur.
My model is finer-grained, thus should be more compressive.
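
A minimal sketch of same-type comparison, under stated assumptions: a pattern is a dict of named variables, match is the smaller of two values, and difference is their signed subtraction. The variable names (`L`, `I`, `M`) and the function `compare_patterns` are hypothetical placeholders, not the actual syntax.

```python
def compare_patterns(p, q):
    """Compare only same-type variables of two patterns.

    Returns {name: (match, difference)} per variable.
    Assumptions: match = min of the two values, difference = q - p.
    """
    if p.keys() != q.keys():
        raise ValueError("syntax must be synchronized before comparison")
    return {name: (min(p[name], q[name]), q[name] - p[name]) for name in p}


p = {"L": 3, "I": 30, "M": 12}
q = {"L": 5, "I": 42, "M": 9}
print(compare_patterns(p, q))
# {'L': (3, 2), 'I': (30, 12), 'M': (9, -3)}
```

Two patterns with n variables each need n comparisons here, instead of the n * n needed to compare every variable with every other.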
Dan He said:
For proximity-order, you can still cache the input for a certain amount of time, right? Then we can search in parallel? If not, what do you exactly mean "proximity order"?
So why can't Apriori be applied to the variables in your patterns? And also, when your patterns grow to higher levels, eventually Apriori can be applied, right?
Dan
> For proximity-order, you can still cache the input for a certain
> amount of time, right? Then we can search in parallel?
Yes, that's how you get fuzzy patterns (~fuzzy clusters).
Each input member is compared to a sequence of old inputs
(template), possibly in parallel, forming a sequence of
derivatives that overlaps those of adjacent inputs.
This overlapping / redundant representation is far more
expensive than unique representations produced by strictly
sequential search.
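
The cost difference can be seen in a small sketch. It assumes a template is simply the last `rng` inputs, and that each comparison yields a (match, difference) pair with match = min; `compare_to_template` and `rng` are hypothetical names.

```python
def compare_to_template(inputs, rng):
    """Compare each input to up to `rng` preceding inputs (its template).

    Returns one list of (match, difference) pairs per input.
    rng=1 is strictly sequential search; rng>1 gives overlapping,
    redundant derivatives.
    """
    derivatives = []
    for i, x in enumerate(inputs):
        template = inputs[max(0, i - rng):i]
        derivatives.append([(min(x, t), x - t) for t in template])
    return derivatives


inputs = [5, 7, 6, 9]
print(compare_to_template(inputs, rng=1))
# [[], [(5, 2)], [(6, -1)], [(6, 3)]]
print(sum(len(d) for d in compare_to_template(inputs, rng=2)))  # 5
```

With `rng=1` each pair of adjacent inputs is represented once; with larger `rng` the same input contributes to several overlapping derivative sequences, so the representation grows roughly in proportion to `rng`.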
> why Apriori can't be applied on the variables in your patterns?
Because individual variable's match is "merged" with those of
co-derived variables, - a group is defined by derivation rather
than simple proximity. That's one of the differences between
variables & patterns: the former are ordered / merged
by derivation syntax rather than input sequence.
Dan He said:
What's your definition of "derivation" and "derivatives"? You compare the
inputs with the templates,
and you call those overlapped ones as "derivatives"?
I still don't get the point that you eventually need to increase the
dimensionality or whatever, and whatever you need to increase, can be feed
into the Apriori algorithm, right? Even if the derivatives depend on the
sequence of inputs, if we can cache the inputs, we could obtain the
sequence of derivatives, right?
Dan
Derivatives are not inputs, they are matches & misses
produced by comparing inputs. They replace original
inputs to achieve lossless compression. I described
this process in part 3, but it's messy & out of date.
I am re-writing it, will let you know when it's done.
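
As a rough illustration of why replacing inputs with derivatives can be lossless, here is ordinary delta encoding, used as a stand-in: the first input plus the sequence of differences is enough to rebuild every original input. Matches are left out because they are not needed for reconstruction. The names `encode` and `decode` are hypothetical, and this is not the process described in part 3.

```python
def encode(inputs):
    """Keep the first input and replace the rest with consecutive differences."""
    diffs = [b - a for a, b in zip(inputs, inputs[1:])]
    return inputs[0], diffs


def decode(first, diffs):
    """Rebuild the original inputs by re-adding differences in order."""
    out = [first]
    for d in diffs:
        out.append(out[-1] + d)
    return out


first, diffs = encode([5, 7, 6, 9])
print(first, diffs)          # 5 [2, -1, 3]
print(decode(first, diffs))  # [5, 7, 6, 9]
```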
Dan He said:
Ok, so please let me know when you have it updated. Also if you could
provide a concrete running-example, that would be very helpful. The current
description contains lots of definitions and it's hard to map them into
some math models.
Dan
Hi Dan, sorry it took me so long.
I was working on something else, & also,
it’s hard to explain a work-in-progress.
Anyway, I just posted a bunch of updates,
especially in part 3, take a look.
Obviously, I am not done yet.
So, I can’t select | exclude variables depending on
their co-occurrence, as in the Apriori algorithm,
because then syntactic order will be lost.
The syntax is fixed for each search & differentiation
level, & variable types can be deleted / excluded only
by a level of syntax: a group of variables with a
common root (variable from which they are derived).
Such as dP: i: last input, L: same-d-sign length,
D: summed ds, M: summed ms, Q(d): discrete ds.
See part 3.
So, match is combined within a group only.
If it is below average for a corresponding root
variable, the group is deleted, & derivation flag
at the root variable is set to 0.
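
A minimal sketch of how a dP with the variables listed above might be accumulated from a 1D sequence. It assumes d is the difference between consecutive inputs, m is the smaller of the two, and that d = 0 counts as non-positive; the dict layout and the name `form_dPs` are hypothetical. Evaluation and deletion of below-average groups are not modeled here.

```python
def form_dPs(inputs):
    """Group consecutive comparisons into spans of same-sign difference.

    Each dP holds: i (last input), L (span length), D (summed ds),
    M (summed ms), Q (the discrete ds), plus the sign of d for the span.
    """
    dPs = []
    current = None
    for prev, x in zip(inputs, inputs[1:]):
        d = x - prev          # difference (miss)
        m = min(x, prev)      # match, assumed to be the smaller comparand
        sign = d > 0
        if current is None or current["sign"] != sign:
            current = {"sign": sign, "i": None, "L": 0, "D": 0, "M": 0, "Q": []}
            dPs.append(current)
        current["i"] = x
        current["L"] += 1
        current["D"] += d
        current["M"] += m
        current["Q"].append(d)
    return dPs


for dP in form_dPs([3, 5, 8, 6, 4, 7]):
    print(dP)
# {'sign': True, 'i': 8, 'L': 2, 'D': 5, 'M': 8, 'Q': [2, 3]}
# {'sign': False, 'i': 4, 'L': 2, 'D': -4, 'M': 10, 'Q': [-2, -2]}
# {'sign': True, 'i': 7, 'L': 1, 'D': 3, 'M': 4, 'Q': [3]}
```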
> The current description contains lots of definitions
> and it's hard to map them into some math models.
It’s hard because they don’t fit, at least those I know
(not very many). That’s why I start from scratch.
Terminology & methods must fit the problem,
not the other way around.
Anyway, it’s good that you pushed me to get into
details, here is $200 as a token of my appreciation :).
Dan He said:
Thanks
Also if I fully understand your method, I believe there should be more I can contribute.
Dan
Right, I hope you will understand it better, let me know if any questions.
Regarding the third paragraph - in other words our mind understands the experiences in our life by remembering patterns and later associating them with different patterns, right? I mean, this is what I have been theorizing myself and you have reached conclusions very similar to my own.
For example let's take an apple. When you read that word, you suddenly remember the smell, look, taste and feel of this entity - patterns from different sensory inputs.
Yes, but the key is to formally define patterns, and the algorithm to discover them. Do you have any specific ideas or comments on my algorithm? Thanks.
Hi, Boris, a few comments from a partial reading:
1. Noise IS a pattern and IS predictable - that's why it's boring
2. Higher levels are not always smarter
3. Lower levels of cognition are not unconscious
4. Causality, causes and correlations
...
1. Noise IS a pattern and IS predictable - that's why it's boring
Boris: "Novelty can’t be a primary criterion: it would select for noise and filter out all patterns, which are defined by match."
Todor:
I think that's a common, partly superficial, cliche. Noise, if having a stable or predictable distribution, is also a pattern, actually a very simple one, and it is recognized as some kind of distribution or remembered/classified otherwise (with a generative capabilities), depending on the system.
Noise's representations are "boring" (for watching, like TV static) not because they are too "unpredictable", but the opposite. Respectively it is not true that noise is "uncompressible".
It is compressible - to a noise distribution which is precise enough, since or if the cognitive system cannot remember the specific details, thus it wouldn't care or notice in real time if the specific low level values are slightly altered to the original.
If one tries to compress noise as specific values exactly, then she treats that input not as noise.
So in a series of frames with TV static only the specific values of the lowest level representation are "unpredictable", but the distribution is "the same", given the resolution of perception - in abstract sense of resolution, - that goes also for other details such as the same screen/dynamic area (physical coordinates), the same average brightness - applied in many levels/resolution of aggregation (and matching between each other) etc.. Also noise is locally self-predictive, self-matching after General Processing** or adjacent comparison, which makes it boring too.
Noise matches the predictions too much, and watching the "static" on a TV is somewhat similar to watching the wall.
That goes for audio noise as well, it immediately starts to sound like a boring periodic pattern and different spectra of noise are recognized - white, pink, low frequency, middle frequency, high frequency* and if one is given synthesizing tools to play with, she can recreate the noise that she has heard, i.e. it is a pattern.***
* These are "generalized", higher level labels, but they are just pointers to the representations. People with good musical ear or with "absolutist" audio sense could remember and recreate or even tell the frequencies, just like they could replay music by ear. See 3. below.
** ...
*** Another theory is that the apparent local unpredictability overdrives brain processing, the system suffers cognitive overload due to the unsuccessful attempts to compress, but I think that's more likely true only for the "garbage input", confusing ones, ones with too many recognizable details, like pictures of dumping grounds, general "disorder"; that is, something IS recognized, some parts are "meaningful", but there's an overload in making sense of the whole picture. In the case of noise, the properties are highly local and homogeneous.
2. Higher levels are not always smarter
Boris: "... Intelligent variation must be driven by feedback within cognitive hierarchy: higher levels are presumably “smarter” than lower ones. That is, higher-level inputs represent operations that formed them, and are evaluated to alter future lower-level operations. ..."
Todor:
I don't think that's always true (only "presumably" and sometimes), the "cleverness" of the higher level depends on the specific environment and the history of the experience and the biases, for how long the higher level was relatively "stable", how "stable", how "stability" is measured, how stable are/were the lower levels etc. In a more dynamic environment the lower levels might be "smarter", because they would adapt and react first and higher level plans might go completely wrong, with (correct) prediction value = 0, like with the branch prediction in the CPUs or the cache misses.
There's also something that Kant and Schopenhauer call "faculty of judgment", which is the glue between different levels of generalization and the capability to recognize and classify correctly the lowest sensory input to correct respective higher level representations.
The higher level is given higher *power* in longer term, because it may push the agent to this or that path for a bigger spatio-temporal radius, however whether this is more "predictive" in the long run depends on the whole Universe of things. "Smart" leaders often send their followers to hell and apparently wise maxims have kept people in darkness, and the world is occupied with other agents of which there are limitations for predictability. Their intentions are both partially hidden/unattainable or their goals and behavior is flexible and may change in reaction to your predictions/actions.
Brain is not strictly cognitive and the prefrontal cortex integrates too many inputs, but the world is also not just cognitive, and the top-down tyranny in general is prone to and causes corruption, conservatism, political tyranny, pseudo-democracy.
3. Lower levels of cognition are not unconscious
Boris: "All our knowledge is ultimately derived from senses, but lower levels of perception are unconscious. Only generalized concepts make it into human consciousness, AKA declarative memory, where we assign them symbols (words) to facilitate communication."
Todor:
The artists of all arts and the athletes, especially in technical sports, are supposed to operate masterfully with "lowest level" stuff, that is simple actions or perceptions. Making a concept verbal might be just labeling, a quick addressing. A talented artist knows and operates these concepts (patterns) even without giving them a verbal name. Lower levels are "unconscious" only for the ones who do not understand/recognize or cannot control their actions with an appropriate/sufficient resolution of causality and who lack introspective capabilities. There are degrees in that.
If consciousness is the ability to know, to alter, to select, to address etc. these lower level components, then one who does is conscious about them.
Sure, you may call these "general representations" as well, but for example the process of drawing and painting the spectrum of all different specific basic trajectories/forms/shapes/gradients etc. is as specific (low level) as possible for a motion/action/transformations performed with the respective tools of the trade, and the artist can control each pixel, color, shade etc.
In comments to a book (200, 210 of http://razumir.twenkid.com/kakvomu_notes.html#210, http://razumir.twenkid.com/kakvomu_notes.html#200) I discuss "The Hypothesis of the Deeper Consciousness" of mine (in Bulgarian).
Jeff Hawkins' hypothesis about the motion of the representations lower in the hierarchy through training is related to that.
For example, there's a common confusion regarding creativity. A guitar player who's improvising at a high speed might appear as "unconscious" to the audience ("instinctive"), but she's not. She is not "unconscious" about the future tonal possibilities, respectively her motor actions (predictions), she knows what would sound well, respectively what motions she could and want to apply etc., thus she chooses this or that, and she has to control the motions at the maximum possible resolution. She expresses her predictions and what she feels/cognizes through these motions, because speech is inappropriate and too slow, but that doesn't mean that she doesn't understand (cognize) her actions and their very components.
4. Causality, causes and correlations (comment to a G+ topic on the "causes")
Causality as a concept is useful for benchmarking the system in a simulated or in the real world. The causes are defined there as physical laws or others, the correlations are found by the system and compared. They are marks of the maximum or the target resolution of perception and causation (alteration, change), achievable in the given environment. The "God" of the cognitive algorithm aims at making his pupil reach to that maximum possible resolution or to the wanted one, according to given point of view and premises.
Yes, causality is a matter of degree and POV, especially when looking at the world not literally at a full resolution (in that case it's a literal record) and when asking questions like "Why this car hit that man, who was crossing the street?" or "Why that car has parked here and not there?". The "Why" question requires target resolution/ limitations/ bandwidth/ agents/ attention[span]/ radius[space-time]/ motivation ... and other premises to be given, before answered. If all is run in the same pool, the premises are also based on the primary causes, but are at different levels and require some kind of Understanding that selects, encodes and decodes different types and levels of "causes". The physical laws do not understand life or human thoughts.
Regarding the claim that we don't know more of "correlations" - that's somewhat Locke-Hume-Mach-style of thought, - in opposition to Kant-Schopenhauer-Marx-Engels-Lenin. Usually that's said when the philosophers forget about the existence of their bodies. Through the experience we see too clearly that our bodies are built of a big chunk of these same "correlations" of the physical laws and are result of series/networks of them, therefore our thought and action processes are expressions of these very "correlations".
The physical laws are implanted in us and we are directly connected to the causal chain with a concept/something that was once called "Will" by Schopenhauer, related to "The Thing in Itself" of Kant and to the "Will To Power" of Nietzsche.
Sure, technically, for the implementation, it doesn't matter whether you call it "strong patterns" or parts of the "causal network".
Todor,
1). You should know by now that by “pattern” and “noise” I mean components (not categories) of real inputs. You are saying that all noisy inputs have *some* sort of pattern in them. That’s nothing new, I am talking about majority-patterns: spans of above-average match, and majority-noise: spans of below-average match.
2). We've had this discussion before. The only feasible “glue” between levels of generalization is even higher level of generalization.
3). Yes, so what?
4). "Sure, technically, for the implementation, it doesn't matter whether you call it "strong patterns" or parts of the "causal network"."
It does matter, “pattern” is constructive and “causal” is distracting.
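
The distinction in point 1) can be sketched as a simple segmentation. It assumes a sequence of per-input match values and a single average used as the threshold; the function name `segment_by_match` and the labels are hypothetical.

```python
from itertools import groupby


def segment_by_match(matches, ave):
    """Split match values into contiguous spans above or below the average.

    Spans with m > ave are labeled 'pattern', the rest 'noise'.
    """
    return [("pattern" if above else "noise", list(span))
            for above, span in groupby(matches, key=lambda m: m > ave)]


print(segment_by_match([6, 7, 2, 1, 3, 8], ave=4))
# [('pattern', [6, 7]), ('noise', [2, 1, 3]), ('pattern', [8])]
```

Both labels apply to components of the same input stream, so a "noisy" input is just one where the below-average spans dominate.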
Todor,
In general, I don’t want to argue about part i, it is mostly suggestive. I know you love to go on a tangent, but it serves no constructive purpose. What I really need to figure out is the process of forming incrementally higher orders of feedback: part 4. Feedback is the most fundamental concept in my approach. I have three basic orders so far: templates: past inputs for comparison, filters: averages of past inputs for evaluation, and expectations: back-projected past inputs for prevaluation (which allows for skipping whole spans of future inputs, rather than filtering them one-by-one).
The first two are straightforward: templates are not modified at all, and filter = higher-level cumulative match / (higher-level span / lower-level span).
But back-projection involves re-integrating co-derived differences, which doesn’t fit the “pattern” of simple span-normalization formed by the first two orders. And I need a common “meta-pattern” to derive subsequent orders of feedback.
Also, averages of past inputs should be compared, and input-to-average match should contribute to ultimate filtering by cumulative past match.
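
The filter formula stated above (cumulative match divided by the ratio of spans) works out as follows. The function name `feedback_filter` and the numbers are made up for illustration only.

```python
def feedback_filter(cumulative_match, higher_span, lower_span):
    """Higher-level cumulative match, normalized to the lower level's span.

    filter = cumulative match / (higher-level span / lower-level span)
    """
    return cumulative_match / (higher_span / lower_span)


# A higher level that accumulated match of 1200 over 64 inputs,
# fed back to a lower level whose span is 4 inputs:
print(feedback_filter(1200, higher_span=64, lower_span=4))  # 75.0
```

This is plain span normalization: the higher level covers 16 lower-level spans, so the lower level's share of the cumulative match is 1200 / 16.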
Any ideas? BTW, this is getting to be too much for comment section here. Let me know if you plan to continue, I will start a new “discussion” post.
Yes, I do plan to give a try of the new material.
1. OK, I mean also that the above-average match sounds a bit dogmatic to me for the earliest stages, the "child" machine may explore more and collect more low level data.
Also, is your system supposed to have external/additional/"lower"/interface drives to push its development, to help it avoid local minima, obstacles, lack of immediate predictive value?
That's both "emotions" and a "teacher/mentor", and is related also to early navigational/exploration behaviors, like babies scanning the area with their eyes when put in the dark (search for contrast, novelty seeking), or changing the direction of search/scan/coordinate adjustments when reaching the end of the coordinate space/dimension/... These may emerge from the principles of the cognitive algorithm, but may be partially "external", driven in case the environment is not "good"/stimulating enough or that kind of knowledge is inappropriate/inefficient to learn through "free" exploration.
Yes, selectivity will increase with experience, via feedback from higher levels, which are added incrementally. All filters (averages) are initialized at 0. I just wish you get over your holistic, analogic, qualitative, anthropomorphic thinking, - it's not constructive.
Supervision and reinforcement can be added at any point, but that's trivial, not an interesting part. Real work is in designing scalable unsupervised learning algorithm.
I should post an update soon, mostly on parts 3 and 4.
I have a preliminary complete AGI theory:
https://groups.google.com/forum/#!topic/artificial-general-intelligence/UVUZ93Zep6Y
https://groups.google.com/forum/#!forum/artificial-general-intelligence
Thanks Keghn. You don't go into detail, but rely on ANN. I stated specific objections to using ANNs in the first section.
Yes. I do unsupervised learning with Kolmogorov complexity and Frieze logic
patterns. I first did clustering with this logic in a non-ANN way. And then
later found out a way to do image clustering with NN, with the above logic. And
then temporal clustering.
http://www.scientificcomputing.com/articles/2015/09/patterns-are-math-we-love-look
https://en.wikipedia.org/wiki/Frieze_group
The AGI NN Brain is a cascading LSTM:
https://www.reddit.com/r/alife/comments/3dkcdv/agi_brain_a_cascading_rnn/
For me it is easier to do it in math and programming in OpenCV than in NN. Later on I would make the conversion to NN.
So if you do not like NN, I will not talk about it any more, from here on out.
My system takes raw input into a giant ring buffer and looks for repeating patterns. Then it hits a pattern with an output, in the hope of improving the pattern.
This is done in an evolutionary way, starting from micro movements and then up.
My engram snapshots are JPG images of edge-detector output, and the bottom half of
the image is metadata: arm and leg position, sound track, and other info, like
Kalman filters.
A snapshot is taken every 1/20 of a second. I use differential compression:
the first image is the reference, and only the changes are stored in the following
snapshots.
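The differential compression described here can be sketched as follows. This is a minimal illustration, assuming each snapshot is a flat list of pixel values and that changes are recorded relative to the previous snapshot (the comment does not say whether they are relative to the previous snapshot or to the first one). The function names are hypothetical.

```python
def encode(frames):
    """Keep the first frame whole; for each later frame keep only changed pixels."""
    if not frames:
        raise ValueError("need at least one frame")
    reference = list(frames[0])
    deltas = []
    previous = reference
    for frame in frames[1:]:
        # Record {pixel index: new value} only where the pixel differs.
        deltas.append({i: new for i, (old, new) in enumerate(zip(previous, frame))
                       if old != new})
        previous = frame
    return reference, deltas


def decode(reference, deltas):
    """Rebuild all frames by applying each set of changes to the previous frame."""
    frames = [list(reference)]
    for delta in deltas:
        frame = list(frames[-1])
        for i, value in delta.items():
            frame[i] = value
        frames.append(frame)
    return frames


snapshots = [[1, 2, 3], [1, 2, 4], [1, 5, 4]]
reference, deltas = encode(snapshots)
restored = decode(reference, deltas)
```

Here each later snapshot differs from its predecessor in one pixel, so each delta holds a single entry, and decoding restores the original sequence.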
These engram snapshots are formed into graph theory maps of the world:
https://en.wikipedia.org/wiki/Graph_theory
My theory is very detailed. I only post little details at a time.
So if there is anything that resonates with you, then I will talk more about it.
Because everyone wants to do AI in their own style. Bill Gates could fund AGI and have
it next year. But he wants to have one of his kids develop it. So he will
fund them whether they have a working theory or not.
It's good that you are working on it Keghn, but I don't think you have a theory yet. Otherwise you would have a coherent write-up, one simply doesn't happen without another.
Well, if this brief description of my work does not resonate with you, then
we are in different ballparks. Your work does not resonate with me.
And thus it is not in my interest to give the full description of my work to
someone who will find it uninteresting.
Filter is used many times in different contexts and it somewhat sounds as something abstract/undefined.
You mention "average" and it's supposed to be a "selection criteria", however it'd be more clear for beginners if it's emphasized that "f." is just a value/scalar? to which input/variables (values) are compared.
Depending on the outcome, the evaluated item is either selected or skipped, i.e. the f. are borderlines/limits.
That's because filtering (feedback) is incrementally complex, starting from bit filtering by input and coordinate LSB and MSB.
Average is the second-order: integer filter. But I start explanation with average because it's easier to understand and bit filtering is pointless with canned images.
It's actually dual hierarchy: of filter resolution ) complexity, and of feedback range ) depth. I cover it in part 4, but the only way to make it fully explicit is code.
BTW, it's not necessarily selected or skipped, rather included into positive or negative pattern. What happens to these patterns then depends on the level of processing.
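The exchange above can be made concrete with a small sketch: the filter is a single scalar, and each input's match is not dropped but assigned to a positive or negative pattern according to the sign of (match - filter). The function name, the dictionary layout, and the choice to count a match exactly equal to the filter as negative are assumptions for the example.

```python
def segment_by_filter(matches, filter_value):
    """Group consecutive match values into patterns by the sign of (match - filter).

    A positive pattern is a span of above-filter match, a negative pattern
    a span of match at or below the filter (assumption: equality is negative).
    """
    patterns = []
    for m in matches:
        deviation = m - filter_value
        positive = deviation > 0
        if patterns and patterns[-1]["positive"] == positive:
            # Same sign as the current span: extend it.
            patterns[-1]["values"].append(m)
            patterns[-1]["summed"] += deviation
        else:
            # Sign change: terminate the span and start a new pattern.
            patterns.append({"positive": positive, "values": [m], "summed": deviation})
    return patterns


patterns = segment_by_filter([5, 7, 2, 1, 6], filter_value=4)
```

With a filter of 4, the five matches form three spans: positive, negative, positive. What is done with each span afterwards is left open, as in the comment.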
BK: "However, average match in our space-time is presumably equal over all four dimensions."
TA: Is it? Is it so in a mathematically random environment, or may it be assumed because of the spherical propagation of gravity and other forces? IMO initially it's more likely equally *unknown* to an Alg. Also, in real environments, it doesn't feel equal in *all* dimensions to me.
It's equal if you define match as some average that covers all four dimensions and treat the samples as in one common bucket, thus they are defined as equal, but that requires initial expectation of these dimensions.
Also aver.match is very coarse and not meaningful. What's the average match (to what average) for a starry sky - or the Universe taken as a "photo"?
Within the area of a star or moon, there is continuity of high intensity. Within the dark areas there's continuity of lacking intensity. The global average match to some "average" is perhaps ~ 0,000...01, because the most is dark.
That's one of the earliest inputs in which humans have systematically searched for patterns.
Besides the 4D (time) is different. The CogAlg is not supposed to think in relativistic terms, but the relativistic 4D is "space-time", where time is not an independent dimension.
In Newtonian (and probable practical) treatment, the outcomes in "time" are generally irreversible and "unique", while the spatial are "more reversible" and repeatable/"randomly" accessible (from the POV of the operating mind).
Furthermore, "time" is more "introspective" and "abstract" than the others, unless it's built-in - thus not discovered, but just mapped to specific inputs.
It's either built-in as "different" (or assumed as a position in the input queues), or it's discovered as some kind of pattern within the other spaces (dimensions), in the memory, and that pattern (model) is called "time" and is then also not independent from the other dimensions.
While on the other hand, the positional dimensions in a simple virtual implementation are directly mapped to some input or feedback coordinates/variables.
Yes, space-time is an add-on, not part of my core algorithm. As are dimensions in general, core algorithm is 1D only. I have 4D built-in as an implementation shortcut, discovering it may not be easy.
Whatever specific anisotropy there is among individual dimensions is another layer of add-ons, this one to be learned by the algorithm itself.
As for averages, they are selected feedback of individual higher-level patterns, as fine or coarse as the patterns themselves.
Hi
I have completed my AGI theory. Well pretty much.
BK: "Crucial differences here is that conventional clustering methods initialize centroids with arbitrary random weights, while I use matches (and so on) from past comparisons."
T: Is it so crucial? I think it doesn't matter: at least in simple K-means, the center should converge after scanning all patterns; the "past comparisons" are the differences ("the distance") to other patterns within the set, which adjust the center. The number of buckets/sets matters, though, plus the specific empirical values: they define how the borderline cases are spread within flat/one-stage (non-hierarchical) clustering.
Todor, what's crucial here is that randomizing anything is idiotic.
There is no reason to introduce artifacts and then waste resources on eliminating them when you can start with real data.
Biology simply can't help it: wetware is horribly noisy, but in software randomizing is a proof of intellectual laziness.
Where did you get that it's randomized as a rule? It's rather averaged as well, or probably randomized when the algorithm is supposed to adjust and converge (or it doesn't matter) during scanning ("comparing") all inputs, thus the starting point is of little importance in the long run, especially on the low level.
You also set the averages in the middle of the input range, but it's your statistical guess based on "real data" (isn't it random what your first inputs and thus pattern would be?), and will not start well with some pictures, as well your guess that average match is equal in all dimensions.
Randomizing (within reasonable ranges) is not idiotic, in some algorithms it increases the performance, because the structure of the input is unknown and is not predictable from the local values, it has to be processed ("compared") entirely in order to decide, and there are border cases where the algorithm could perform badly if scanning pedantically. The initial input is unknown and doesn't have "predictive value", you also guess it has, and when it will become actually and meaningfully predictive depends on the specific data set.
Traversal of the whole set is expensive and for big amounts of data and long run it is cheaper and more productive to make guesses and randomly or cyclically change the directions of traversal etc.
There are limiting conditions, e.g. sorting, thus the end result will converge to the desired "pattern".
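Whether the starting centroids matter in simple K-means can be checked on a toy case. The sketch below is a plain one-dimensional Lloyd iteration written for this example (the function name, the data, and both starting points are assumptions). It compares centroids taken from the first inputs against arbitrary starting values.

```python
def kmeans_1d(data, centroids, iterations=20):
    """Plain 1D K-means: assign each point to the nearest centroid, then re-average."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # converged: assignments no longer move the centers
        centroids = updated
    return centroids


data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
from_inputs = kmeans_1d(data, data[:2])      # start from the first two real inputs
from_guess = kmeans_1d(data, [0.0, 100.0])   # start from arbitrary values
```

On this data the two starts converge to different results: starting from real inputs finds the two groups, while the arbitrary start leaves one centroid with no members. So in this small case the starting point is not washed out by scanning all inputs, although the example says nothing about which initialization is better in general.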
From video, I like using sub-features of the outline of an object, generated by an
edge detector. Then I pair each sub-feature with a weight. Then I do clustering by
selecting two object outlines and iteratively changing the weights until they
are the same. The distance in N-space is the amount of change I had to apply.
Well, random *should* mean produced by pseudo-random algorithm, quantum noise, etc.
Average or middle of the range is not random, there is a reason for it.
To be consistent in my algorithm, all filters would be initialized at 0, because the contents of higher levels are 0. But the contents of *my* higher levels are not 0, so I can make a guess. Direction change is not random either, it is a coordinate filter feedback.
But you might be right, I didn't get into details because there are more important things wrong with clustering. It wouldn't surprise me, their use of "distance" and "dimension" is also misleading, there is a difference between internal and external parameters.
Anyway, that part is peripheral, I haven't edited it in years.
The way I see it, when an AGI device turns on, the first thing it does is sample
the incoming data at the highest level of detail, down around the level of Brownian motion:
https://en.wikipedia.org/wiki/Brownian_motion
At this level there should be very little repeating data, only the little that is
allowed by the laws of chance.
At too low a resolution, everything matches and looks alike. So the sweet spot
is somewhere in between. Now it comes down to temporally repeatable patterns that
can be done at the highest level of resolution. That can make self-reward happen
and avoid anti-reward. Incoming reward and anti-reward data are the only data that are true values and are set at one true resolution.
Rewards are based on energy management and anti-rewards are based on damage to an AGI
body, for example.
https://phys.org/news/2017-09-monkey-metacognitive-illusion-monkeys.html
Keghn, you should do a coherent write-up on your own site.
Yes, it really needs to be redone. Lots of ideas on simplification, but it will still be a tough read.
https://www.researchgate.net/profile/Lane_Friesen
I noticed that you are what's known as the Teacher cognitive style like me and Lane Friesen. So if you can imagine someone with your same level of focus in a different domain integrating the various structures of thinking I thought you would find his work immediately useful. The added benefit is since it's a personality theory it actually provides the framework for a division-of-labor and contextualizes your role in the evolutionary makeup of consciousness.
Interesting Brandon, I think most readers here would disagree that I am a good teacher :).
Teacher is a task rather than cognitive style, it depends on subject and audience. And I am yet to find audience sufficiently interested in my subject.
Cognitive style (in my terms “bias”) would be generalist vs. specialist, I have a post on that: http://cognitive-focus.blogspot.com/2014/10/cortical-trade-offs-specialist-vs.html
Sorry, I can’t imagine someone with my style working in a different domain. Generalization is a reduction, so there is only one domain left on the top level, and that is my subject.
The problem with neuro-psy work is that it’s not selective enough. You are looking at phylogeny and ontogeny- specific brain / mind, with a huge number of kinks and artifacts. The only universal function in all that is generalization, and I am designing algorithm optimized for it. So, mine is a meta-generalization.
Hence the problem with “division of labor”: it has to be top-down, and there is a dispute as to who is on top :). As I mentioned at the conclusion of the post linked above, competence of a generalist is much harder to evaluate than that of a specialist.
1. There is a catalog of human thought as one of the publications on Lane Friesen's page. Enumerated circuits spanning the entirety of human thought is the exact opposite of neuro-psy not being selective enough.
2. The terms 'Teacher' and 'Cognitive Style' have internal meanings from the framework of Cognitive Styles (aka MBNI) that correspond to specific groupings of personality traits, localized brain regions and their associated functions, hippocampal circuits, and the components which make up a 'complete thought' (it divides labor).
3. Look how simply Cognitive Styles explains your reaction to my post
https://web.archive.org/web/20020803113646/http://209.87.142.42:80/y/a2/ctable4.htm
Notice how you didn't leave room for the possibility of "Theories A and B are close if in some domain (their intersection) they are alternative ways of describing the same input." What you attempted was to say this theory has a smaller domain than your theory and is only a special case. This is highly doubtful seeing as I anticipated you being a Teacher person and this characterization of stereotypical Teacher behavior was written more than 15 years ago.
4. What you are describing as kinks and artifacts are simply facts. These facts together form an associative web and become the Big Picture for Perceiver persons.
https://web.archive.org/web/20020607210820/http://209.87.142.42:80/y/a2/c7.htm
Notice how there is no fundamental difference between a big picture that grows in comprehensiveness versus a theory that grows in domain, except for the fact that you follow the predicted cognitive division of labor.
5. There isn't anything coercive or authoritarian or even hierarchical about a division-of-labor
"...the Mercy may want loyalty and mutual submission in a friendship. The Perceiver desires commitment to principle that is demonstrated in ‘personhood.’ The Facilitator can look for discretion, wisdom, kind-heartedness and good motives. The Contributor easily appreciates dependability, skill, common goals and a self-contained efficiency. The Teacher wants others to share and value his understanding. The Server looks for those who appreciate his help; the Exhorter in contrast wants others to help him in doing what he thinks that they want to do—in return, he gives hope and excitement."
You want others to share and value your understanding
"Who then is the leader in society? Our profile indicates that it’s not the Facilitator. Should it be the Perceiver? No, for he cannot think apart from a set of foundational assumptions. The ‘leader’ must be an ‘abstract system of understanding,’ leading to the rule of law, which in turn will develop an intelligent electorate. Then, democracy can work!"
Or stated for the individual
"The Extraverted Thinking ‘plan object’ is to enable Perceiver ‘beliefs’ to coalesce into Teacher ‘understanding’ of intelligent Server ‘actions’ under various Extraverted Sensing contingencies,1 so as to generate Mercy ‘happiness.’"
Stated alternatively
"If you trusted the laws of cause and effect to the extent that you stopped trying to uphold everything together on your own, then you as a generalist could really become a source of inspiration for others. Selective and intelligent withdrawal on your part would encourage others to do their best and be their best, in those areas where they can excel; it would facilitate their interaction. What would it take for you to do this? A realization that people are different according to a very solid and predictable pattern would be enough. That’s a changeless principle."
With further suggestion
"I would suggest that the principles which govern the operation of the mind could comprise a good initial set of axioms"
6. Here is the theory of theories
http://www.megafoundation.org/CTMU/Articles/Theory.html
http://www.megafoundation.org/CTMU/Articles/Nexus.html
7. I still maintain this theory is immediately useful to you
I am sorry Brandon, but you are commenting on a wrong post. And it would be off-topic even on a right one. You are not addressing anything in my posts, pushing your "theory of theories" instead. Which I don't find very interesting. This is a perfect example of not being selective enough.
I will address the only relatively specific point that you made:
"You as a generalist could really become a source of inspiration for others. Selective and intelligent withdrawal on your part would encourage others to do their best"
This is extremely naive. There are two things that inspire people: success and insight.
Insight can only be understood by those who work on the same problem, on the same level. And if I withdraw, then there will be no more insight.
Success in computer science means working code. No one really cares about theory in that field, and these are the people I need to inspire. So, this intro is peripheral, the only two ways I can get things done are coding on my own and paying.
That's real life for you.
1. UML and other coherent ways of modeling a problem domain are used extensively in the Computer Science field.
2. THEORETICAL Computer Science is the field that cares about theory.
3. What's naïve is not being able to assemble a team when people on the Autism spectrum are known for failing to properly construct a theory of mind in others.
4. Why is your curiosity relevant to whether or not a theory provides new success or insight in a problem domain? You simply don't know what you don't know, simply refusing inputs that you didn't anticipate is obviously the wrong approach and shows you're not working from the most general action context.
5. Welcome to real life
Theoretical Computer Science will never consider any theory without working code.
That's a written rule in any computer science publication.
You are advising me on how to work without understanding the first thing about the work I've already done. If you did, you would comment on substance.
Grow up, dude. I wouldn't be here if I wasn't trying to help.
https://en.wikipedia.org/wiki/Universal_Systems_Language
You are not helping, nothing you posted is relevant enough. Yes, that includes USL.
Basically, you have a dirty mind. Nothing special, it's a very common problem.
But a fatal flaw for my terminally reductionist subject.
So, I appreciate your trying to help, but this discussion seems hopeless.
If you know someone who can work on this project directly, I am in debt for life.
Otherwise, best regards.
So here we have the extremely unique situation of a FEMALE generalist with real world success, insight, and implementation that isn't able to add anything that you haven't already anticipated simply because you said so. Why can't you simply be wrong?
You are not even trying to understand what I am doing here.
You are not a mind reader
Then it's not working. I appreciate your effort, but this is a subject at which almost everyone fails.
Any outside observer can judge for themselves whether this is an accurate picture of your personality traits or not:
"This person considers himself to be very ideological, consistent, principled, and is very conservative in this. Becomes irritated by those who criticize his ideas. He lives by the "wholeness" of the internal situation. Often able to see "through" things, to the inner essence of something or someone. Romantic and idealist. Lives by his internal harmony, tranquility, serenity, is able to draw inspiration within himself, and gets annoyed by those who try to disturb it. Generally does not like when people try to look inside of him, gets frustrated and angry when this happens. Strives to be inwardly calm in all situations and internally consistent. "Fluid like a river": involuntarily adjusts himself to the interlocutor in conversation by taking form of consciousness that is best fitted for the situation. By this he isn't playing a role, his consciousness is simply multifaceted and he is directed by his inner "wholeness". That is, he simply presents a version of himself. Communicating with you, he always feels your moods as if he is living through them together with you, adjusts himself to this. Loves to introspect and to meditate. In case of failure, can make a qualitative self-analysis. Being present in some place he as if tunes himself out, tries to become invisible like a chameleon, especially if he perceives it as a threat to his inner tranquility: for example, in the workplace so that no one bothers him. Can even hide it in some clever way: arrange a barricade of folders so that behind them he is not visible. Does not like restless, internally discordant individuals, as their state can get transmitted to him, will try to escape from their company at any price. This is especially funny in a situation where a male representative of this type flees from ladies, and they pursue him like prey, because they feel that he has something that they so desperately need: inner peace. 
But for him this inner "wholeness" is not the product but material for inner consumption, so he can only share this with a small number of people, but sometimes someone might snatch a piece - this makes him very angry. Often, especially in circle of family, he becomes a critic, since deviation in behavior away from his principles turns him aggressive. If in another situation he will somehow restrain himself, at home he may allow himself to explode with anger."
Unsupervised truth
I have had a thought of using GANs to do transmissions and self learning at the same time, with a one pixel Generating Pokemon Adversarial Network system, GANs.
There is a problem in neuroscience of what the meaning of "same" or "equal" is at the level of a neuron.
GANs are made up of two NNs: the detector NN and the regenerator NN, which regenerates what the detector NN has seen.
The detector NN detects some random color and then turns to the
output of the generator NN.
A standard value is input into the generator NN from the detector NN that
indicates training or detector activation. When the detector NN
sees its detection on the output of the generator NN, it fires off its activation, which
locks the weights of the generator NN. Completely dynamic random noise is used for the
weight values.
If you had an alternating row of detector NNs (DeNN) and regenerator NNs (ReNN), the information could be passed along in a daisy-chain fashion, for a global reference.
Like so:
DeNN, ReNN, DeNN, ReNN, DeNN, ReNN, DeNN, ReNN, DeNN, ReNN.............................
The weights are randomly selected and locked in place within the detector NN.
No training.
So the color and brightness it detects is a random catch.
With many one-pixel GANs in the first layer of the NN system, there will be overlap.
The outputs of these tiny GANs will be combined in the next
layers of the deep NN to form edge detection and blob detection.
These NNs only detect and regenerate the color of one pixel or an average
of a small group of pixels. So it will not be too slow to auto-train.
This way, different GAN captures can be compared against each other.
Generative Adversarial Nets - Fresh Machine Learning #2:
https://www.youtube.com/watch?v=deyOX6Mt_As
Brandon,
This is off-topic, but maybe you could help me. Your profile is mostly correct, but all these things are easily predictable from a generalist cognitive bias. And most of them are simply different ways to avoid distractions. Because I need to focus on work, which is too abstract for anyone I’ve come across.
I do need collaborators, but my experience is not encouraging. So, I tend to treat everything as a distraction, unless proven otherwise. Maybe I could use less technical stimulation, basically cognitive nagging. Probably won’t help unless there is believable interest in my work. I can pay but don’t know how to shop for it. Any ideas?
@Brandon, aren't you irritated when somebody challenges your "theory", way of thinking, opinion and continuously insists about it? It seems you were...
(Who wouldn't be? Why don't you try to preach any researcher or an entrepreneur, even a graduate student - go and tell her "you're wrong, I'm John Doe, I say so - because ...")
This is an average reaction when somebody you believe is not qualified or doesn't understand you pops up from somewhere and pushes you, giving you very confident advice, insisting that he's right and you're wrong etc.
Most likely many wouldn't bother to answer.)
As of the problem of assembling a team - as a long-lasting "observer" and somebody who has tried to help in finding people, I'd say, that while Boris is not the kindest person, being rude has its reasons, and just being kinder or agreeing more with others wouldn't help much.
The way he defines and presents his project has some "requirements" which the "candidates" cannot, would not, or don't work hard enough to meet or only rarely "touch", given the particular conditions.
(BTW, the English speakers confuse "understanding" (to comprehend) with "to agree" (the same word). One may understand you better than yourself, i.e. to know why/how/when/because of/what about your goals/desires/... yet she may not agree (the ethical aspect), may not like you or may disregard your opinion or don't think you're worth it, thus deliberately being not-friendly (want you to go away), may not agree to change or to act as she has in order to attract you/to make you to like her/etc., may not agree to pay a given cost etc. etc. )
...
As of the teams - who and at what cost can assemble a team to do something so ambitious, hard, unclear, rebellious (it claims it's not related enough to the other methods), requiring a lot of mental efforts, probably unpaid or paid only in some cases which are not clearly defined, requiring very high qualifications and specific mindset etc.???
Most of the people can't assemble a team of TWO people (for free), three is very hard - many start-ups are of two or three or several people, not more, and happen very rarely, usually are backed with investors and are about solving specific, clear problem, which is technical, manageable and just a matter of some focused work to be solved.
Let's compare what other teams were assembled in the field of AGI and what were their backgrounds:
1. Ben Goertzel's Novamente, AGIRI, OpenCog - he was a known prodigy? (PhD @ 22-23 years), thus contacts, he's a writer and has readers of his books (more accessible than Boris' writing), participates in conferences, he's popular/people like him, created the first? email list, organized the AGI conferences, has given many interviews, works at a university (having students for work and help) etc.
Yet, after so many years, money invested/foundation/contacts/advertising, he and his team are running short of people, OpenCog is developing slowly and is a hard matter.
2. Jeff Hawkins - a multimillionaire engineer/entrepreneur - contacts, authority, a lot of investments, a book/bestseller etc.
3. Other foundations and companies - millions of dollars invested, a lot of highly qualified full-time researchers and developers
4. DeepMind - millions invested, led by a famous and commercially successful developer and researcher (Hassabis) and a famous AGI researcher - Shane Legg, previously working with another famous researcher in Universal AI - Hutter.
The above means - authority, easier to attract collaborators and investors, based on more established grounds with "official" and "scientifically solid" publications with one of the masters in the field, Reinforcement learning etc. Then DeepMind was bought by Google etc. Etc...
What are their powers to attract participants, what are Boris'??? He says that he "pays", but he has invested nothing substantial so far. The prizes are symbolic and the people who get involved are more or less volunteers, given the complexity of the task and that the total of the prizes is on the order of a month's salary in the USA for a single software developer who's coding easy and mundane well-defined algorithms.
One other issue is the notation and explanation. Boris, you have to code it yourself, because your notation and definitions are not expressed clearly or "interestingly" enough (or you yourself are not sure yet how it should be coded) and the project has not yet an environment where it will show results.
CogAlg needs another notation and representation, but I'm not focused enough to try to approach the problem myself yet...
...
However, despite the minimal investments, the trend in attracting collaborators seems positive, so... congratulations on that.
Todor,
I have an excellent chance to attract a right person, - no one else has comprehensive and constructive theory. The problem is finding him, if such an animal exists. Theory is notoriously hard to collaborate on, and it doesn’t get more theoretical than mine. “Interesting” is subjective.
It’s true that my awards have been minor so far, but you have to consider return on investment. For example, I am currently on a break from coding, clarifying fundamental principles of feedback to define 2nd level operations. If I had a coder, he would just spin his wheels. But I will need a good coder once I have a prototype.
As for my notation, it’s hard to learn but easy to work with. Anyone can bitch, try to come up with something better. I just spent a month and $2K to have a freelancer redo level 1 notation, ended up with complete crap. By his own admission.
Kindness is a complementary of emotional detachment, which is a must for true intellectual independence. Newton and Einstein weren’t kind either. Here is Einstein’s quote:
“I gang my own gait and have never belonged to my country, my home, my friends, or even my immediate family, with my whole heart."
By a "better notation" I don't mean just slight rearrangement of the low level Python code... I think I told you in my "coder session" that IMO such optimizations and tinkering are in vain, and I'm not surprised that you're not content with that new piece of work.
I mean more "meaningful" structured representations/graphs/colours/symbols, another kind of code (another language), not a low level "conventional" code; custom algebra, diagrams, traces, IDE, simulations.
Clear model == easy coding in any language.
Besides, I think the low level representation shouldn't be coded manually in a typewriter style.
The low level executable code has to be generated from a higher level representation which is more clear and manageable, requires less mental effort and has adjustable parameters (including sliders, knobs, graphs/trajectories, presets, pictures of settings ...).
That kind of representation + the IDE should generate the low level code in a variety of programming languages, prepare simulations and visualizations, provide specific debugging and comparison capabilities - between different versions of the algorithm, for running machine learning benchmarks etc.
P.S. Research is not a reliable investment: you can't (always) expect an (immediate) return at the minimal possible expense.
Lots of words, no info. Oh, yes, you are not working yet. Just enough concentration for some valuable mentoring.
LOL! OK, I had a spark of focus for one question:
Where's the non-linearity (saturation) in CogAlg and how is it encoded? You don't mention the word in the text, and it's not visible in the code: + - / * ==> linear.
In DL and ANN the sigmoid/non-linear/... is considered a crucial part that pushed the field forward. https://en.wikipedia.org/wiki/Sigmoid_function
It might be supposed to be in the chain of -, /, log, "etc." comparison operators, with "etc." meaning "iteratively" (meaning recursively...) re-applying them, but it's not clear.
It might arise as an artifact of overflow/underflow/loss of precision, implied somewhere.
Or do you think that it will be discovered at some level somehow by linear + - / * ?
Or you assume that it's not important, "is needlessly lossy", ... ?
... ?
That's progress, but why do I need it?
I think transfer function is a hack, arbitrary distortion of output to avoid overflow.
They use it because matrix multiplication and summation generates ridiculous numbers, but I don't do that.
That's one of the benefits of fine-grained search: overflows should be very rare. And then they also trigger fine-grained feedback to adjust bit filters (part 4, level 1, not implemented in code) to prevent further overflows.
I don't think it's only about overflow in ANN. It's also:
1. Normalization.
2. Mapping nonlinear correlations.
First ANN was the linear perceptron which was found to be a limited classifier due to its linearity: https://en.wikipedia.org/wiki/Perceptron
(Well, that linearity is one objective reason why your code looks "not interesting" so far.)
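A minimal sketch of the linearity limitation mentioned above, with hypothetical weights chosen only for illustration. It shows that two stacked layers without a transfer function are equivalent to one merged linear layer, while a single non-linear unit type (ReLU is used here as an assumption, the comment itself names the sigmoid) is enough to express XOR:

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def relu(z):
    """Rectifier: one simple choice of non-linear transfer function."""
    return max(0.0, z)

# Hypothetical weights and input, not taken from any real model.
W1 = [[1.0, -2.0], [0.5, 3.0]]
W2 = [[2.0, 1.0], [-1.0, 4.0]]
x = [3.0, -1.0]

# Without a transfer function, layer 2 applied to layer 1 ...
two_linear_layers = matvec(W2, matvec(W1, x))
# ... gives the same output as a single layer with weights W2 * W1.
one_merged_layer = matvec(matmul(W2, W1), x)

def xor_with_relu(a, b):
    """XOR built from two ReLU units: not expressible by one linear unit."""
    return relu(a + b) - 2.0 * relu(a + b - 1.0)

xor_table = [xor_with_relu(a, b) for a in (0, 1) for b in (0, 1)]
```

This only illustrates the standard argument for non-linearity in ANNs; it says nothing about whether CogAlg needs it.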
In general I assume there might be a deep philosophical reason - the observed Universe laws are non-linear - quadratic, trigonometric, exponential, Gaussian, "Brownian motion" and other kinds of noise, fractals.
Technology exploits approximately linear ranges of operation of overall nonlinear elements (transistors, capacitors).
Besides, the linear part is for the simplest functions - amplification (replication, scaling). The interesting functions emerge in the nonlinear parts - saturation (e.g. wider range of operations without breaking), switching.
Also, analog amplifiers, except simple ones, use compensation across a set of non-linear elements in order to achieve higher linearity, because even resistors, which are generally passive and "most linear", also "bend" depending on the temperature; capacitors change their impedance (resistance) for different frequencies etc.
Overall, the linear correlations are represented by systems of non-linear elements, not the reverse. "Linear" in Universe is more complex/harder to represent or to sustain than non-linear. For our calculations everything linear seems better/simpler, but it's a POV of a high level of cognition, based on many layers of non-linear operations.
Therefore, what about:
Discovering linearity as a pattern of non-linear elements?
The universe is complex (non-linear is backwards), but that complexity is empirically specific.
It must be learned as input-specific patterns, not coded as some god-given transfer function.
The code should not look like “universe”, and the interesting part is reasoning behind it.
I have normalization as averaging per pattern, it must not be universal.
Discovering linearity as a pattern of non-linear elements is backwards, discovered complexity must be incremental. Same for complexity of the code.
This is not an ANN, I start with fitness function rather than analogies.
This incremental complexity of operations and code is defined by you as a particular cost of operations in your "Turing machine". Thus this CPU-related reasoning (now somewhat outdated regarding the real cost of operations) is itself the "god-given" complexity ladder, which is also a cage.
It's also a linear view of the incremental complexity. Complexity may and should grow and fall (after reconfiguring) even in such measurements of + - operations. Systems usually solve problems inefficiently at first, as existing means allow, then learn to solve them more efficiently and reconfigure and update the cost functions as well.
It's a very simple logic: complexity is a cost, it should only increase if it adds a benefit. Which starts at 0. Yes, it can decrease with the benefit too.
Operations will be costless when there's nothing left to do. You really need to practice this logic vs. analogic thing.
Actually, I just thought of another reason for transfer function.
As I said in paragraph 6, product exaggerates similarity.
So, transfer function seems to be an atheoretical hack to compensate for this exaggeration.
That does count as normalization, but you need to be specific as to what it normalizes for.
Of course, I don’t need it because I don’t exaggerate similarity in the first place.
I understand this logic and your ladder.
"Complexity" can have different measures, and "zero" can have a different meaning (and a different ladder). Also, until the algorithm actually works, a more important measure is the cognitive cost for those who design it: the cost of a representation that gives the most insight and allows easier continuation of the logic; a representation/model/notation that can be theoretically strong enough to prove/suggest that it will lead to discovery of particular patterns, that it will converge somewhere etc.
Low level code with + - / * calculations with hypothetical completely unspecified data does not give suggestions about the above.
As of the second comment:
Multiplication exaggerates similarity (the result) if the samples are > 1. However if the samples are normalized, multiplication leads to smaller values. The final recognition vector is also supposed to be in the [0;1] range as a "probability".
Normalization includes keeping it always within that range, while being non-linearly-mapped in order to introduce gradients/"artifacts", central- and border cases and a basic template for non-linear functions; maybe also to make the lower and upper ranges more "stable", slower changing, i.e. low/high-enough to be decisive ("true/false") for a wider/smaller range of values - as the transfer function is adjusted.
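The arithmetic behind the two positions can be sketched as follows. The sample values are hypothetical, and `min` is used here only as an assumed example of a non-multiplicative similarity measure for contrast; the sigmoid is the standard logistic function:

```python
import math

def product(a, b):
    """Multiplicative similarity, as in a dot product term."""
    return a * b

def smaller(a, b):
    """Assumed contrasting measure: the smaller comparand."""
    return min(a, b)

def sigmoid(z):
    """Logistic transfer function, squashes positive z into (0.5, 1]."""
    return 1.0 / (1.0 + math.exp(-z))

raw = (200, 180)                           # hypothetical 8-bit pixel values
normalized = tuple(v / 255 for v in raw)   # same values mapped into [0, 1]

raw_product = product(*raw)                # far above either input
norm_product = product(*normalized)        # below either input
squashed = sigmoid(raw_product)            # saturates at the top of the range
raw_smaller = smaller(*raw)                # stays within the input range
```

With samples above 1 the product exceeds both inputs and the sigmoid saturates; with samples in [0, 1] the product falls below both inputs, which is the effect described in the comment above.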
Your explanation seems too simplified to me. "Magic" should rather be in the traversal and the adjustment of the parameters, which involves more complex maths and correlations which are spread. That includes the specifics of the filters (kernels) and the way they cover the space of possible correlations: are they orthogonal, are they enough, do their combinations and interactions cover all the cases without collisions/confusion.
Similarity depends on the "node" of the kernels lying between the templates. It's not just the multiplication, since the parameters may be set anyhow so that for one input they may produce a bigger result, for others - smaller. The kernels can have the entire range of numbers: positive, negative and zeroes. It is also about how the results of the comparisons "interact/evolve" (change) through the layers, after application of different kernels.
In AI imaging, an "edge detector" can be done with complexity.
Divide the image up into boxes and estimate the complexity in each box.
Then find your edge by the difference in values between boxes.
A good edge can be found between land and rough water, or
between a sky and a tree full of leaves.
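The box-complexity idea above could be sketched like this. The complexity measure (mean absolute difference between adjacent pixels) is an assumption, since the comment does not say how complexity is estimated, and the test image is synthetic:

```python
def box_complexity(image, top, left, size):
    """Mean absolute difference between adjacent pixels inside one box."""
    total, count = 0, 0
    for y in range(top, top + size):
        for x in range(left, left + size):
            if x + 1 < left + size:          # horizontal neighbour in the box
                total += abs(image[y][x] - image[y][x + 1])
                count += 1
            if y + 1 < top + size:           # vertical neighbour in the box
                total += abs(image[y][x] - image[y + 1][x])
                count += 1
    return total / count if count else 0.0

def complexity_map(image, size):
    """Complexity of each non-overlapping size x size box."""
    rows, cols = len(image) // size, len(image[0]) // size
    return [[box_complexity(image, r * size, c * size, size)
             for c in range(cols)] for r in range(rows)]

def horizontal_box_edges(cmap):
    """Edge strength = difference in complexity between neighbouring boxes."""
    return [[abs(row[c + 1] - row[c]) for c in range(len(row) - 1)]
            for row in cmap]

# Synthetic 4 x 8 image: flat "sky" on the left, checkerboard "leaves" on the right.
image = [[100] * 4 + [0 if (x + y) % 2 else 255 for x in range(4)]
         for y in range(4)]
cmap = complexity_map(image, 4)
edges = horizontal_box_edges(cmap)
```

The flat box has zero complexity and the textured box has maximal complexity, so the edge between them shows up even though both boxes may have similar average brightness.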
-----
If I find something of interest, then no matter how complicated it is, it will
not be complicated to me.
Todor, none of that negates my main point:
Transfer function is a hack to compensate for rather extreme artifacts created by another hack: matrix multiplication.
Given my a priori definition of match (my fitness function), I see no reason to use either.
pre_AGI
AGI is about unsupervised learning. To do unsupervised learning: A) you look for repeating patterns in a stream of information, such
as video and audio. Or B) randomly create a detector, then stick it in an information stream and see if it will detect anything.
If not, then scramble its weights and try again. The idea is to build up a large number of detectors for the
first layer of a Neural Network.
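A rough sketch of option B as described above: generate a random detector, test it against the stream, and re-randomize if it detects nothing. The detector here is just a short exact-match sequence rather than a weighted unit, and all names, sizes and the fixed seed are illustrative assumptions:

```python
import random

def random_detector(rng, length, alphabet):
    """A detector is modelled as a short random sequence of symbols."""
    return [rng.choice(alphabet) for _ in range(length)]

def detections(stream, detector):
    """Positions in the stream where the detector matches exactly."""
    n = len(detector)
    return [i for i in range(len(stream) - n + 1) if stream[i:i + n] == detector]

def collect_detectors(stream, alphabet, length=2, wanted=3, max_tries=1000, seed=0):
    """Keep re-randomizing detectors and retain only those that fire on the stream."""
    rng = random.Random(seed)
    kept = {}
    for _ in range(max_tries):
        detector = random_detector(rng, length, alphabet)
        hits = detections(stream, detector)
        if hits:                       # detector fired: keep it
            kept[tuple(detector)] = hits
        if len(kept) >= wanted:        # enough detectors for a "first layer"
            break
    return kept

stream = [0, 1, 1, 0, 2, 1, 1, 0]
first_layer = collect_detectors(stream, alphabet=[0, 1, 2])
```

Each retained detector comes with the positions where it fired, which is the raw material for the "first layer" mentioned in the comment.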
The next step is to look for repeating patterns in a data stream, by way of forecasting in a passive state. When all patterns
have been found, then the AGI outputs, by way of motors, in hopes of finding more patterns or improving existing patterns.
When this phase is finished, all hand, arm, leg, and all other personal body movement patterns will have been learned.
Next is the movement relative to others.
In this step the AGI uses an internal 3D simulator to learn of "self" and "other", or become self-aware.
B) is what you do when you have no clue.
If you go with plan B then select something small. It will most
likely be common. Start with one pixel and one color. If it is found in the data,
save it, then evolve it to something else and go see if that exists in the
data stream or recorded data stream. Or in a different location in the same
image.
A detector can be constructed to detect it, like NNs do.
Or you could make a small picture
of it and do brute force matching.
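The brute force variant above might look like the following sketch: find a one-pixel seed, "evolve" it by growing one pixel to the right from its first location, and check whether the grown patch also exists elsewhere in the same image. The growth rule and the toy image are assumptions made for illustration:

```python
def find_template(image, template):
    """Brute force search: all (row, col) positions where template matches exactly."""
    th, tw = len(template), len(template[0])
    hits = []
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            if all(image[y + dy][x + dx] == template[dy][dx]
                   for dy in range(th) for dx in range(tw)):
                hits.append((y, x))
    return hits

def grow_right(image, y, x, template):
    """Evolve a found template by one extra column taken from the image."""
    th, tw = len(template), len(template[0])
    if x + tw >= len(image[0]):
        return None                      # no room to grow
    return [image[y + dy][x:x + tw + 1] for dy in range(th)]

image = [
    [0, 0, 7, 3, 0],
    [0, 0, 0, 0, 0],
    [7, 3, 0, 0, 0],
]
seed = [[7]]                             # one pixel, one color
seed_hits = find_template(image, seed)
grown = grow_right(image, *seed_hits[0], seed)
grown_hits = find_template(image, grown)
```

Here the grown two-pixel patch is found at both locations of the seed, so it would be kept and evolved further.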
I have two AGI models.
Model "A", which is a pure computer model with no Neural Networks.
And I have the plan "B" model. This model is an AGI neural network model and I could
only get it to work with plan B.
Plan A model is pattern matching of patterns that occur in two different
locations in the data.
Plan B is having an internal doodle board. A generative algorithm. A clueless
mind generates something simple and then looks for it in the instantaneous
information coming in on its sensors. If it is detected, it is recorded as data.
This first detection is temporally recorded. So first detections will have distance
between detections. At first the recorded video is just dots, then it builds up
into complete video.
If you had a theory, you wouldn't need plan B.
I don't think B) is clueless - as defined with the "doodle", it's integration of the imagination from the start. However, it doesn't have to be absolutely clueless in the generation, or to be a separate "B"; it could go together with A) - you need both recognition (~ feed-forward matching) and generation (~ feed-back adjustment/selection of particular combined "patterns").
The key is in the adjustment, modification. If the algorithm is "convergent to progress" (as it's supposed to be if it's "incremental"), it shouldn't matter from where exactly you start, although I'm a proponent of a proper education trajectory which'd make it faster. We humans do not learn "randomly" as well.
keghn, if you have it developed, would you link to something more specific and to results?
Thanks Tosh.
What you described is a generative adversarial network, GAN.
I have two models.
My model A is pure computer model.
In general, all pattern matching and detection are not that good.
Neural networks are almost 100%, and may go beyond this in a few years.
My pure computer model was easier to do. Finished it a year ago. Only in theory.
This year the NN model has been pretty much completed. Based on GANs. Only in
theory.
And it uses unsupervised ANN.
https://www.academia.edu/37275998/A_Nice_Artificial_General_Intelligence_How_To_Make_A_Nice_Artificial_General_Intelligence
I have done some research on patterns and noticed an all-pervasive feature common to all of them. You could even argue that all patterns participate in one greater grand pattern. The precepts of this pattern are that things must stay the same until they change, and when they do change, they must change but repeat aspects of the state they are changing from. The degree of utility or pleasure the human brain derives depends fully on its complexity. I will qualify that later.
What complexity means is really contrast. If you have low frequency parts, you need parts that also contain mid frequency and high in the mix. If you have a lot of discrete note-based parts, you will need a lead or string to balance it. Balancing is adjusting contrast. If you have a part where most of the notes go up, then you need to follow it with a part where most of the notes go down. If you have a part with quick changes, you need to follow it with a part with slow changes.
Now there are exceptions, but the exceptions follow the same rule. You CAN have lots of non-contrast in a part of that song, as long as some other part of the song strongly contrasts with that one... i.e. a contrast between parts that internally are not contrasting in some respect. The larger the cohesion in a part, the larger the contrast needed to balance it. An example would be a chorus and a verse, or if both the chorus and verse are cohesive, then the song would naturally need a bridge to be good.
Of course the complexity in it is not vanilla complexity as I will now qualify it. If in the media there have been lots of complex (put your own definition here) songs for a long time, (cohesion) then to contrast that cohesion, a simple (read opposite of the definition you gave for complexity) song will be the most successful.
What we have done is, we have moved from the realm of individual songs and used the same rules found within songs to describe the rules for hit songs.
What you have to understand is that that cohesion described above, of lots of complex songs in a row, depends on the definition you gave of complexity. Use another definition and you could have totally different songs in the charts that follow this rule.
That is the source of the different musical tastes in the world. We know that the posterior distribution for RNB music is different from that of SALSA, because of the different definitions of complexity.
Why then is there both RNB and SALSA in existence? Because they use contrasting definitions of complexity. Where have we seen that word “contrasting” before? It seems we can apply the same rules found within a particular song, within the class of all genres of all musical types. Observe the musical genres of the world, don’t they contrast. In fact their existence is a function of the extent to which they contrast.
Now look at something that seems totally unrelated. If you are awake for a long time, won’t you need to sleep for a long time too? If you examine all aspects of your life, isn’t a balanced one encouraged, where for example, if you eat meat for a long time you need to eat veggies for a bit? Or once you crack too many jokes you need to say some less exhausting everyday things. Now all these things exist together, so a hit song will depend on all these other complexities. If there have been too many slow songs, a fast one would ordinarily have the greatest value. But if, for example, a queen dies, yet one more slow song expressing the loss might instead possess greater value than any fast song, BECAUSE there is contrast in that a queen doesn’t die often.
Now the difference between all the ways this rule of contrast is expressed depends fully on our interpretation of complexity and simplicity. You must be complex, unless you should be simple. And you should be simple only if the set of those two contrasts, of the complex part that that simplicity contrasts with, is complex. This forms a hierarchy of sets of complexity, and sets consisting of sets of complexity combined with their simplistic dual, all contained in another set of complexity. Which itself has a dual, and both are contained in yet another complexity…and so on…
If we are intelligent we could choose one level of complexity and modulate the distribution of all the elements in the subsets below in order to maximise the value of that level of complexity. This type of analysis means that it should be simpler to encode patterns in your system after taking it into consideration, as there is a lot of redundancy that you shouldn't have to explicitly code for. This might lead you to consider fractal geometry in the formulation of the algorithm.
So a 1D pattern would have motifs within the same medium, while a multidimensional pattern would have different mediums expressing the same pattern while having contrasting mediums. Reality seems to consist of this one all-pervasive fractal generator.
https://alternativeai.blogspot.com/2018/05/advanced-pattern-recognition.html?m=1
Yes, it all depends on definitions. I think there is a bias in your observations, you select subjectively interesting data. Such selection is a feedback to filter out predictable inputs.
So, this expectation of periodic / fractal structure is something that should be learned from previous inputs. We can't code it in: all inputs can't have above-average "structure".
But, yes, there is an inevitable alternation of positive and negative patterns, I do cover that in part 2, etc.
I find it kind of interesting that if you model music without NN, you need to decompose
the bit stream of a wav file into a couple of different data bit stream formats.
Like spline smoothing. Because I do not like using regression algorithms.
Also FFT, like you're talking about, 1D DFT, above. And the last format is
the change of amplitude from one position to the next. This last one will help when a man says "hello" and when a woman says "hello".
NN can find it, but it is harder to troubleshoot in an AGI model.
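The three formats listed above can be sketched on a toy signal. A moving average stands in for spline smoothing (an assumption, to stay within the standard library), the DFT is the naive O(n^2) form rather than an FFT, and the samples are synthetic rather than read from a wav file:

```python
import cmath
import math

def moving_average(samples, width=3):
    """Simple smoothing, used here as a stand-in for spline smoothing."""
    half = width // 2
    out = []
    for i in range(len(samples)):
        window = samples[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def dft_magnitudes(samples):
    """Naive 1D DFT: magnitude of each frequency bin, scaled by 1/n."""
    n = len(samples)
    return [abs(sum(s * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, s in enumerate(samples))) / n
            for k in range(n)]

def amplitude_changes(samples):
    """Change of amplitude from one position to the next."""
    return [b - a for a, b in zip(samples, samples[1:])]

# Synthetic tone: 2 cycles over 16 samples.
tone = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
smoothed = moving_average(tone)
magnitudes = dft_magnitudes(tone)
peak_bin = max(range(8), key=lambda k: magnitudes[k])

# Amplitude changes ignore a constant offset between two recordings.
changes_a = amplitude_changes([1, 3, 2])
changes_b = amplitude_changes([11, 13, 12])
```

The last two lines show one property of the difference format: two signals that differ only by a constant offset produce the same sequence of changes.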
It's more than a coincidence. This is fundamentally linked to the axiom of identity, which comes in two #contrasting parts: A=A and A != !A. Note that though they contrast, they use similar terms. An example of an application that uses fractals would be the following. We have an image of a person and the face is cropped out. The algorithm will be responsible for filling it in. What we will use is a Lindenmayer system of fractals, where we take elements from the rest of the image and merge their representations together within the cropped space. So we select multiple points at random within the image, then move the different arms of the system into the cropped space, inverting those elements at the multiple points and merging them together to create new motifs. As it is, it is unlikely to produce a face, but that is because it has only the present image to work with. Given a history of images, it will have the choice to migrate elements from those images into the empty space. This means that it will likely place some sort of face, even of someone it has never seen.
Keghn's notes are interesting, but IMO too general and obvious regarding contrast.
DIFFERENCE
For any classification, decision or change, at least two elements are required, so they must be somewhat different. If there's only one element it will repeat forever and would be "the same" - either the same sequence, pattern, piece. Thus there must be [recognizable] differences in the input in order to partition, to remember or to recognize anything.
As of balance - yes, but I think the better term is spectrum coverage. Balance is one kind of spectrum coverage. It's especially evident in music. As Keghn says, harmony comes with counterparts, middle and high notes sound better with bass. IMO one reason a symphonic orchestra sounds mighty and complete is that it has instruments that cover the whole or a wide spectrum of tones and timbres, locally for short periods or for the whole work.
I agree also that in fact there are a few essential "general patterns", as structures, up to a certain level of complexity, however of course including the generative grammar for copying, extending, modifying links, comparisons etc. - something like CogAlg code. "The alphabet and the grammar" of all patterns.
After that, it starts to repeat or adjust minor parameters and just traversals for different ranges, levels, different domains, different modalities etc. If a chunk of its representation as plain data is taken, it would map to some patterns among that basic, relatively simple, set of foundation patterns, laws, correlations, structures, chains. Well, it should be like that, if the alg. is incremental, and it's like the case with all images, sounds, texts etc. that could be represented as a sequence of their basic elements and their syntax.
SPECTRUM COVERAGE
Regarding the spectrum, it's supposed to go for all kinds of spectrum that exist in the cognitive system. That applies for spectrum of depths within the hierarchy of patterns, spectrum of resolutions, of ranges in all kinds of measures - time, space; and all kinds of ranges within appropriate limits.
The best movies with the best photography have all kinds of visual patterns, on the lower level 1D, 2D, 3D, near-far ranges - in the same frame, in time, ... slow motion, fast motion,... Variety of shapes, planes (near, middle, background), motions etc. They also have depth in the plot - the ending concludes the beginning, high temporal range; good script/lines (working memory verbal and visual-motor range), music which covers the spectrum of scales, tones, noises (local sound "tissue", working memory musical harmonies, longer melody structure sequences, complete pieces with different tempo for different emotions and action context etc.), also spectrum of emotions as patterns: light, strong, happy, sad, ... etc. On semantic/object-level: rich enough scenery, recognizable items, different actions and motions with them etc.
I think the process of discovery, understanding etc. is also a pattern and it also has different levels and ranges: within a frame (visual scanning, noticing something, a glance), frame-to-frame (change of the setting, a character entering, new object... etc.).
The pattern dynamics is also a kind of pattern (an entity that could be stored, recognized, analyzed, compared).
In general, a "massive spectral masterpiece" would be a rich input which ignites and involves the whole spectrum of patterns and pattern dynamics within the cognitive hierarchy.
(Some tricks for attracting attention are "dirty" and not cognitive, though: they involve the fight-or-flight/alert reactions, such as abrupt and loud noises, very quick editing, explosions and other rapid changes of the whole picture etc. The specific content as well - erotics, terror, ...)
SONGS
As of the different pop songs - IMO this turns too general. Popularity of a piece of art in societies is only partially based on aesthetics or "natural" tastes. Taste of people is adjusted and trained for profit or other reasons, some entities have enough resources to try to do it and it seems they sometimes manage to succeed. I could see it by the decline in some genres, technically and artistically.
There are fashions which go by fashion cycles or campaigns or within cultural domains, of which the song "tissue" and structure itself* is not the most important aspect, regarding the changes of styles or the active songs.
* Besides that it has to be some kind of music/sequence of elements, to have some kind of varying enough and appropriate, acceptable structure and complexity for its domain.
Todor,
"I think the process of discovery, understanding etc. is also a pattern and it also has different levels and ranges: within a frame (visual scanning, noticing something, a glance), frame-to-frame (change of the setting, a character entering, new object... etc.)."
I guess you are talking about filter patterns, which is feedback from input patterns, ultimately motor feedback.
Well, thanks for the remark, yes it seems that these phenomena and entities map to the filter hierarchy.
Reg. terminological mapping -"hierarchical contextualizations of sensorimotor decisions" in that 2018 paper (with incremental precision): "Hierarchical Active Inference: A Theory of Motivated Control" https://www.sciencedirect.com/science/article/pii/S1364661318300226#bib0250
Good read for newbies, but it sounds to me too similar to my theoretical claims in "my theory" works from the 2000s, except the mappings to brain regions and the huge amount of references.
The word contextualization is suggestive, it's among my more recent conceptualizations as well. Besides different precision of the decisions for the hierarchy levels for different time-space timescale ranges/resolutions, it seems that different sets of patterns would be "primed"/prefetched in different settings/modes of operation.
Reg. CogAlg, do you envision such a mapping to "mega filter patterns" which contain a library, a hierarchy of buffered filters for a given environment?
For example such a filter-switching may involve activation of a prevailing amount of patterns from a particular pattern-set, rather than patterns in another pattern set (a library, a hierarchy). It's like a switch of attention or getting yourself "in the zone".
Match within a level seems similar, a pattern with above threshold match which is selected for elevation attracts the "attention", that is additional or more selective processing than the rest of the input.
Above threshold* ("average") match/mismatch of the compared variable/pattern invokes the respective "context" of filters for that more selective evaluation.
"Reg. CogAlg, do you envision such a mapping to "mega filter patterns" which contain a library, a hierarchy of buffered filters for a given environment?
For example such a filter-switching may involve activation of a prevailing amount of patterns from a particular pattern-set, rather than patterns in another pattern set (a library, a hierarchy). It's like a switch of attention or getting yourself "in the zone"."
I think the algorithm won't be much different from that for search over input patterns.
When a filter is updated, the same-filter span of inputs is terminated and its hierarchical representation (including conditionally unfolded buffers of all input levels) is cross-compared on the next filter-search level.
Filter-search hierarchy is higher-order than input-search hierarchy, but additional complexity can be unfolded recursively.
So, your library will be a pattern of filter patterns, which will sequentially access its element patterns for comparison to an input, which might be your context or a simple clock.
If there is above-average difference between input filter and template filter, that difference will be sent back to update current filter, in the sequence of higher-level filter pattern.
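A very loose sketch of the mechanism described above, under strong simplifying assumptions: filters are single numbers, `proposed_filters` is a hypothetical stand-in for the template filters arriving from a higher level, and `ave_diff` is a hypothetical threshold for "above-average difference". It only shows spans of inputs being terminated whenever the filter is updated:

```python
def run_filter_feedback(inputs, proposed_filters, initial_filter, ave_diff):
    """Group inputs into same-filter spans, updating the filter by feedback.

    Returns a list of (filter, span) pairs. All names are illustrative,
    not taken from the CogAlg code.
    """
    spans = []
    span = []
    current = initial_filter
    for value, proposed in zip(inputs, proposed_filters):
        diff = proposed - current
        if abs(diff) > ave_diff:         # above-average difference: send feedback
            if span:
                spans.append((current, span))   # terminate the same-filter span
            current += diff              # update current filter by the difference
            span = []
        span.append(value)
    if span:
        spans.append((current, span))
    return spans

spans = run_filter_feedback(
    inputs=[5, 6, 7, 20, 21, 22],
    proposed_filters=[10, 11, 10, 30, 29, 30],
    initial_filter=10,
    ave_diff=5,
)
```

In the full scheme each terminated span would carry its own hierarchical representation and be cross-compared on the next filter-search level; that part is omitted here.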
BTW, would you clarify that: "B: Resolution of my inputs is always greater than that of their coordinates, while Bayesian inference and AIT typically start with the reverse: strings of 1bit inputs. These inputs, binary confirmations / disconfirmations, are extremely crude way to represent “events” or inputs."
OK, about the one-bit representations (similar to the embeddings, word-embeddings in ANN for NLP).
COORDINATES ARE A PART OF THE INPUT
I think practically at the lowest level the coordinates are a significant part of the input, maybe even more significant than the brightness for particular input regarding the feedback.
In simple cognition light on the right vs dark on the left may drive motion towards right etc., which is completely different feedback for otherwise matching brightness-based patterns.
At a higher level it's similar, because some kind of a match of an object on the right/left/up/down and the specific coordinates involves change of the focus or a motion/grasp etc. to the right/left/the particular coordinates.
The process of ignoring the coordinates is an abstraction. I.e. maybe total input resolution actually includes the resolution of the coordinates and their dimensions/shape (in numpy).
INTERNAL COORDINATES (ADDRESSES)
I remember once you've made a distinction between coord./addr., but I don't remember how precisely, is it still valid? Something like coordinates - in the sensory space, where addresses - in the memory of the hierarchy (where the address "in vivo" would be a particular configuration of current input and the internal state of the filters or something).
The memory footprint (in some measures) of the internal coordinates + content (patterns + address), being smaller than of the input, seems meaningful as a form of compression.
However there's additional "maintenance" data that's accumulated per pattern. The algorithm should justify that etc., however it seems that's feasible for higher levels only. For the lowest level there's a lot of overhead.
To me it "feels" that the compression is supposed to be present mainly regarding a space of possible states of the patterns, among a family of patterns, rather than simple sensory input. It's logical since a pattern is supposed to be more than the literal input - a concept.
Simple low level input that is like a picture may occupy more memory in the hierarchy than as a sensory input, however the stored one is meaningful as a pattern if it is a representative of a class of instantiations within a range of variable parameters.
COMPRESSION OF THE SPACE OF INSTANTIATIONS
The memory needed to generate all possible instantiations at the lowest level of representation (lower than particular level), if it's feasible/possible at all, should be larger than the memory of the compressed model.
For example at a high level: the pattern (concept) of a triangle (Kant has an example regarding that, "schema"), a rectangle, a human face given particular records of experience/data set etc. A single triangle, face etc. could be "cheaper", measured as a plain byte-based memory space; however the pattern could recognize an infinite amount of triangles and faces, and what's compressed is namely that potential generative space. (As done by GANs etc.)
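A small illustration of the point about the generative space, using an assumed toy setting: triangles with vertices on an n x n integer grid. The "concept" is a one-line predicate of fixed size, while the number of instances it covers grows quickly with the grid:

```python
from itertools import combinations

def is_triangle(p, q, r):
    """The compact 'concept': three points that are not collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]) != 0

def count_triangles(n):
    """Number of distinct triangles with vertices on an n x n grid."""
    points = [(x, y) for x in range(n) for y in range(n)]
    return sum(1 for triple in combinations(points, 3) if is_triangle(*triple))

instances_3 = count_triangles(3)
instances_4 = count_triangles(4)
```

Storing every instance would cost memory proportional to these counts (unbounded as the grid grows), whereas the predicate stays the same size, which is the sense of "compressing the space of instantiations" used above.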
"I think practically at the lowest level the coordinates are a significant part of the input, maybe even more significant than the brightness for particular input regarding the feedback"
My initial comparison is between consecutive inputs, effective coordinate here is a binary before | after sign.
There is no feedback from the lowest level, it can only come from patterns derived from such comparisons.
Feedback always lags feedforward, and coordinate res. lags input res. because it's a macro-parameter.
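A minimal sketch of comparison between consecutive inputs as described above. The only positional information is the before | after order of each pair, reflected in the sign of the difference. Using `min` as the match measure is an assumption made for illustration, as are the sample values:

```python
def compare_consecutive(inputs):
    """Compare each input to the one before it.

    Returns (difference, match) per pair: the difference is signed by
    before | after order, the match is the smaller comparand (assumed).
    """
    results = []
    for prior, current in zip(inputs, inputs[1:]):
        difference = current - prior
        match = min(prior, current)
        results.append((difference, match))
    return results

derivatives = compare_consecutive([3, 5, 5, 2])
```

Patterns, and any feedback, would then be formed from these derived values rather than from the raw inputs.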
"The process of ignoring the coordinates is an abstraction. I.e. maybe total input resolution actually includes the resolution of the coordinates and their dimensions/shape (in numpy)."
Then you lose information.
"I remember once you've made a distinction between coord./addr., but I don't remember how precisely, is it still valid? Something like coordinates - in the sensory space, where addresses - in the memory of the hierarchy (where the address "in vivo" would be a particular configuration of current input and the internal state of the filters or something)."
My internal coordinate is the order of parameters packed in a pattern. That's used only in comparison between parameters of the same pattern, such as computing dimensional proportions, etc.
"The memory footprint (in some measures) of the internal coordinates + content (patterns + address), being smaller than of the input, seems meaningful as a form of compression.
However there's additional "maintenance" data that's accumulated per pattern. The algorithm should justify that etc., however it seems that's feasible for higher levels only. For the lowest level there's a lot of overhead.
To me it "feels" that the compression is supposed to be present mainly regarding a space of possible states of the patterns, among a family of patterns, rather than simple sensory input. It's logical since a pattern is supposed to be more than the literal input - a concept."
Space of possible states is machine-specific, I am not dealing with that yet. Such compression could be achieved by feeding core algorithm the data on various levels of the machine.
Otherwise, "family of patterns" is simply a higher-level pattern.
COMPRESSION OF THE SPACE OF INSTANTIATIONS
The memory needed to generate all possible instantiations at the lowest level of representation (lower than a particular level), if that's feasible/possible at all, should be larger than the memory of the compressed model.
For example, at a high level: the pattern (concept) of a triangle (Kant has an example regarding that, "schema"), a rectangle, a human face given particular records of experience/data set, etc. A single triangle, face, etc. could be "cheaper", measured as plain byte-based memory space; however, the pattern could recognize an infinite number of triangles and faces, and what's compressed is precisely that potential generative space. (As done by GANs etc.)
You are talking about math: discovery of abstract computational shortcuts. Yes, the algorithm should be able to do that.
But... you are not designing algorithm,
B: "My initial comparison is between consecutive inputs, effective coordinate here is a binary before | after sign. There is no feedback from the lowest level, it can only come from patterns derived from such comparisons."
Aha, OK, you mention that in your write-up, but yet there are also the other coordinates within the sensory input space.
B: "Feedback always lags feedforward, and coordinate res. lags input res. because it's a macro-parameter."
And "micro-p." are these inputs only (the sampled single values)? And parameters (middle) - the internal variables within the algorithm?
Yes, first drive is to sample/search/compare/select input. After that the result could be applied to adjust the search.
T: "The process of ignoring the coordinates is an abstraction. I.e. maybe total input resolution actually includes the resolution of the coordinates and their dimensions/shape (in numpy)."
B: "Then you lose information."
I didn't get that. Yes, you lose information when abstracting the coordinates, I meant that was supposed to happen at a higher level. The coordinates are there at the lower levels (for a feedback such as grasping), and lost at a level where e.g. only a qualitative list of objects is created, such as "Which items are present". Isn't losing parameters from the focus of processing the solution for incremental complexity and one of the general purposes of the hierarchical processing? Complexity at each level could be kept under control, a limited amount of new variables. Each lower level keeps its parameters, but if all are cross-compared/processed, the demand would grow exponentially.
I understand that you start with the simplest (one-bit) and also that we can assume that the specific absolute coordinates within the input frame are less important, because the input frame is supposed to be a window towards another bigger space, thus the window location can be adjusted by feedback and the given pattern could travel around in a predictable way. Shouldn't that be learnable?
During the first cycles, the system may not know about that and whether/when the changes are due to the feedback or due to yet unpredictable processes in the input space.
That's discovered by comparing the feedback with the changes in the input and thus the patterns' parameters, between frames/clocks.
"B: "Feedback always lags feedforward, and coordinate res. lags input res. because it's a macro-parameter."
And "micro-p." are these inputs only (the sampled single values)? And parameters (middle) - the internal variables within the algorithm? "
Not sure if that's what you meant, but higher-level inputs are hierarchical and unfolded top-down. So, their parameters are micro-micro.., etc.
T: "The process of ignoring the coordinates is an abstraction. I.e. maybe total input resolution actually includes the resolution of the coordinates and their dimensions/shape (in numpy)."
B: "Then you lose information."
"I didn't get that. Yes, you lose information when abstracting the coordinates, I meant that was supposed to happen at a higher level. The coordinates are there at the lower levels (for a feedback such as grasping), and lost at a level where e.g. only a qualitative list of objects is created, such as "Which items are present". Isn't losing parameters from the focus of processing the solution for incremental complexity and one of the general purposes of the hierarchical processing? Complexity at each level could be kept under control, a limited amount of new variables. Each lower level keeps its parameters, but if all are cross-compared/processed, the demand would grow exponentially. "
Higher-level coordinates will lose resolution but increase range. There must be "what" and "where" parameters on each level, but their range and resolution will vary depending on relative predictive value.
Even "where" of top-level parameters is limited by total scope of experience accumulated by the system, they can't be projected to infinity.
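The trade-off between coordinate resolution and range can be sketched as follows. This is an illustration under stated assumptions, not the algorithm's actual rule: the function `coordinate_scale`, the fixed number of coordinate steps per level, and the fixed span multiplier are all hypothetical. In the description above, range and resolution vary with relative predictive value rather than by a fixed factor.

```python
def coordinate_scale(level, steps=8, span=2):
    """Return (step, reach) of the "where" parameter at a given level.

    Hypothetical assumptions: every level keeps the same number of
    coordinate steps, and each level spans `span` units of the level below.
    """
    step = span ** level    # input units per coordinate step: coarser resolution
    reach = steps * step    # input units covered by all steps: wider range
    return step, reach


for level in range(4):
    print(level, coordinate_scale(level))
```

With these numbers the step grows from 1 to 8 input units over four levels, while the reach grows from 8 to 64: resolution is lost and range is gained, and the reach stays finite at every level.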
"I understand that you start with the simplest (one-bit) and also that we can assume that the specific absolute coordinates within the input frame are less important, because the input frame is supposed to be a window towards another bigger space, thus the window location can be adjusted by feedback and the given pattern could travel around in a predictable way. Shouldn't that be learnable?"
There are no absolute coordinates, they are all within some frame of reference.
They can be re-calibrated by coordinate filter feedback (learnable), but initially represent simple order of input into the system (a priori).
" During the first cycles, the system may not know about that and whether/when the changes are due to the feedback or due to yet unpredictable processes in the input space.
That's discovered by comparing the feedback with the changes in the input and thus the patterns' parameters, between frames/clocks. "
Right, but that's true in any time frame, initial or not.
> Higher-level coordinates will lose resolution but increase range.
Yes, but this is somewhat by default. (And lower resolution for the same range means bigger pixels.)
> Even "where" of top-level parameters is limited by total scope of experience accumulated by the system, they can't be projected to infinity.
Sure, it must be mapped to something "real", internal to the Alg: collected data and its derivatives. "Infinity" doesn't lead to specific explicit feedback.
> "I understand that you start with the simplest (one-bit) and also that we can assume that the specific absolute coordinates within the input frame are less important, because the input frame is supposed to be a window towards another bigger space, thus the window location can be adjusted by feedback and the given pattern could travel around in a predictable way. Shouldn't that be learnable?"
>
> There are no absolute coordinates, they are all within some frame of reference.
One of these frames of reference is the location in the sensory matrices, and more precisely the limits of the sensory space, where the scanning itself, besides the pattern, faces interruptions. The highest amount of novelty is supposed to happen there, scrolling through the unknown. The borders of the primary input spaces would have empirically different properties than the intermediate regions, right from the start and in the internal operations.
> " During the first cycles, the system may not know about that and whether/when the changes are due to the feedback or due to yet unpredictable processes in the input space.
> That's discovered by comparing the feedback with the changes in the input and thus the patterns' parameters, between frames/clocks. "
>
> Right, but that's true in any time frame, initial or not.
>
After experience is gained, these checks are supposed to be suppressed, or skipped after no or only partial processing, when deliberate moves (feedback) are initiated.
I believe that human computation derives from human experience. We observe what is going on in the real world, and then, using a form of synaesthesia, we manipulate information using the experience of reality as a basis. So, for example, we reuse concepts a lot in speech formation. If I tell you to "raise your voice", I am tying the concept of raising an object to that of the voice. Even in that last sentence I used the concept of tying, as in tying a string, to tie two concepts together. When interest rates fall, we are using the analogy of falling objects to illuminate what we are saying. The patterns we see in reality are the basis of ALL mental computation. In fact, we could define computation as applying the laws of physics to a domain of concepts, given that the term "applying" derives from applying one substance over another.
Thinking then becomes treating concepts as objects and manipulating them using patterns found in reality. If I say two concepts "clash", I am drawing a parallel between two physical objects clashing and the subsequent discord or lack of harmony in the result. See, there I go again, applying the terms discord and harmony, found in music, to this example.
If we were to create a pattern-based agent, it would need this form of synaesthesia to handle reality (see the word "handle" being used in that context there?). Also, the word "form" derives from physical entities.
So there is an almost implicit rule that the way to manipulate higher-order concepts is either to treat them as physical objects and relate them using physical laws, or to treat the concepts as the physical laws themselves. One weird thing is that certain concepts pair up better with certain physical entities or laws. Why would "a clash of ideas" lead to consequences eerily similar to those of a clash of two physical objects, in that the derivatives of such an action also correlate strongly with each other? I believe that we could pair up any concept with any physical correlate, but some do so more naturally, in terms of both their derivatives correlating, and the derivatives of their derivatives correlating, and so forth. One avenue to establish intelligence would be to use concepts in such a way that they and their physical correlates create similar sorts of information. If I say to you "physics is a very sad subject", the adjective "sad" does not produce useful (meaningful) derivatives. One could observe how sad people react (the derivative of being sad) and not find a way to relate it to the term "physics". Compare the phrase "physics is a very interesting subject". We know of interesting phenomena in the world: e.g. I could have first learnt its context by observing interesting colours in a sunset, which led me to study it more. In that way the derivatives of the term "interesting", when applied to physics and to colours, are in harmony, because you would presumably want to study physics more. So computation would then be defined as the process of finding ways to express concepts such that they behave most similarly to physical phenomena found in reality. The more intelligent you are, the longer the chain of coincidences in which the words you use to describe a conceptual phenomenon correlate with some physical process, that chain of coincidences at each stage being a result of a stage in the processing of that purely conceptual phenomenon.
Thanks Tofara. It's fun to speculate in anthropomorphic terms, but I prefer to focus on core cognitive function: prediction. And to design it from the bottom-up, regardless of the way the brain works. Because it obviously doesn't work well.
Hi, it is not an anthropomorphic theory. It is based on the laws of physics and how objects interact. We observe how objects interact, and this becomes the basis for all future classifications of patterns. What this does is move away from a multi-layer approach to pattern recognition. If you process information this way, at each stage in the process converting your information into different representations that can be described in a way that correlates with some physical process... that is what we mean by understanding the material and deriving new information from old. There is no new information unless it can be described as the derivative of a process in the real world. With this in mind, processing information involves finding congruences between chains of physical processes and concepts. This does not have to differ radically from your approach: you simply need to define patterns that trigger whenever a certain phenomenon in the real world occurs, then associate those concepts/patterns with higher-order patterns that you have found the way you describe in your pattern recognition system. This could be learnt behaviour, through reinforcement, where the most successful correlations get selected for.
I think this is a terminological confusion. Everything is ultimately "physical", down to pixels.
Laws are patterns, objects are patterns, and interactions are patterns: all these things must be learned, hierarchically.
Correlations, correspondence, "congruence": these are different words for some sort of similarity / match.
Your "triggered" or "associated" is my recognized and included into higher-order pattern.
I select for projected match, don't see how that's different from your "reinforcement". Unless you mean manual supervision, which I don't care for.
An example would be "to break a glass" and "to break an engagement". The way you have it, those are totally different sets of patterns. I encourage the reuse of the first pattern as a way of processing information. Both broken glass and broken engagements can result in shattered pieces of some sort; that is why it is useful to describe them with the same pattern. Could your AGI realise this, with the system you have set up, in a processor-efficient way? The way it is set up now, those two statements would require different sets of resources to process, and I feel that means the AGI doesn't *understand* them. I also believe that will require more processing and storage resources in the long run.
They are very different, and that's important to understand. There is a superficial similarity, which will be recognized and probably ignored, depending on the context.
Look, we can't operate on the level of examples, there is no limit to them.
If this ever has a chance to get constructive, we have to operate with definitions.
YOU: They are very different, and that's important to understand. ******There is a superficial similarity*******, which will be recognized and probably ignored, depending on the context.
ME: The whole point of this is to recognise patterns.
If the AGI cannot understand that they form the same pattern, it will never be able to understand higher-level concepts. ALL higher-level concepts that we have use this basis in reality to describe them. Think of how the concept of "space" is ubiquitous in mathematics. Vector spaces do not just have superficial similarity... that similarity is everything.
Imagine your AGI was to observe music videos and to learn their posterior distribution in such a way that it can recreate them. Your system could identify groupings of pixels and how they move in relation to each other, but it will not scale to the level where it will recognise plots. We recognise plots in the same way we recognise the movement of pixels, e.g. the example I gave of broken glass. But if I were to ask you to describe a plot where two people's relationship was broken in terms other than the raw pixel movements (i.e. don't use separate, break, leave, etc.), you would not be able to.
So how would a system based on simple patterns scale to be able to understand social relationships, if the ways the patterns are represented don't relate in consistent ways? BUT if it COULD, it would then model the joint distribution of pixel movements conditioned on social dynamics... which, as it is, I don't believe your system will be able to scale to.
If you believe I am way off topic, you are free to say so and I will end this thread of dialogue.
But I would think you would find it useful to have as few distinct patterns as possible.
Patterns should be selected by their relative strength.
I define strength as projected match, accumulated from all levels of a pattern.
I don't see how you define it, quantitatively, all you have is examples.
This is not constructive.
A pattern is an isomorphism between two things. The Hebbian principle at its simplest: if two things always occur together, then they follow the same pattern. So when I observe that broken things leave shattered pieces, and broken relationships leave shattered individuals, that is an isomorphism. Vector spaces are all isomorphisms of actual space. Just as it's impossible to understand something without the notion of a space, we also use isomorphism with what happens in actual space to understand things in the space we are trying to fathom. If an agent is said to understand something, I believe it is relating different spaces together. Your agent *needs* to be able to do this. Breakups rarely have the same dynamics at the physical bit level, so it would seem that their patterns are sparsely distributed among the space of patterns, i.e. the Hebbian process is only loosely followed from one breakup to another within actual space. In the social space, however, they unsurprisingly have a lot in common, and if an agent is sensitive to variations within this space it can recognise the correlation/match quite easily. But since, on the most basic level, the agent only has access to that space through the physical representations, it must trigger (fire) low-level physicalist concepts whenever those of any other space fire, in order for that other space to even be a space (e.g. actual space fires whenever we encounter a problem (the problem space), then its other facets fire as we solve it (the solution space)). That triggering is the Hebbian factor correlating what happens in the real world with what happens in the other spaces.
You before: There is a superficial similarity, which will be recognized and probably ignored, depending on the context.
The way you have set it up, these similarities will be very difficult to pick up on, although picking up on them is absolutely crucial. Simply because of the sparse distribution of the aspects that correlate, they won't be much above noise, and in a lot of cases will be the same, which will make it very difficult for it to model a social space, for example.
Hi there, Tofara. You think in the right domain; however, I agree with Boris that such general verbal explanations, without specific quantification and mathematical mapping, are not good enough. They declare relatedness of schools of thought, but a formal schema and code are needed.
What you talk about is termed "symbol grounding", sensori-motor grounding, or embodied cognition, and is applied in developmental robotics.
As of the shattered patterns - yes, they are similar at abstract verbal level, where the pattern definitions and their visual and "physical"* form is simple (short) enough to have a small enough set of variables that allow them to match. In your case the pattern is a sequence from continuous to discontinuous, from a whole to parts.
Such level would happen when the system uses natural-like language (in its semantics at least, the encoding could be different) and the respective compressed patterns.
That's one of its purposes - operating with highly compressed representations in order to achieve high enough match, in order to allow precise enough communication and possible storage with the limited and unreliable output bandwidth of speech and writing.
The triggering of the lower-level patterns is match at the lower-level patterns. In Boris's architecture there's always a chain of interaction from the lowest to the highest and back, through several forms of feedforward (elevation) and feedback (adjustments of the search at the lower level). The ultimate lower-level feedback is a coordinate adjustment of an actuator. The process of scanning the coordinate space or the recorded patterns could also be imagined as a virtual walk through that space.
* By "physical" I think you mean lowest-level input, and sensori-motor simulatable, including continuity-discontinuity in space-time (which maps to span and coordinates of match/mismatch in more patternistic terms).
Thanks Todor.
Tofara, you are right that the algorithm must ultimately search across input spaces, from space-time to derived / symbolic, and I don't think mine will need a modification to do so, if implemented according to the principles introduced here.
But I don't want to argue about it, life is not long enough. I think any direct discussion of all things NLP and social is utterly hopeless when we are talking about seed AI, essentially a retina level.
Isomorphism means everything and nothing.
Co-occurrence is a very crude way to measure match, basically a combination of binary contents' match (similarity) and binary coordinates' match (proximity).
Yes, the neurons do appear to rely on that, at least in part. From my perspective, this is one of many ways in which biology is grossly sub-optimal and I have no intention of imitating it.
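The description of co-occurrence as a combination of two binary matches can be made concrete with a small sketch. The event format (coordinate, value), the function names, and the `window` parameter are hypothetical. The graded alternative, using the smaller value as match, is an assumption added for contrast and is not defined in this thread.

```python
def co_occurrence(a, b, window=1):
    """Binary co-occurrence of two events given as (coordinate, value)."""
    same_content = a[1] == b[1]             # binary content match (similarity)
    nearby = abs(a[0] - b[0]) <= window     # binary coordinate match (proximity)
    return same_content and nearby


def graded_match(a, b):
    """Graded alternative (assumed): keep magnitudes instead of yes/no."""
    content_match = min(a[1], b[1])         # assumed: shared magnitude of values
    distance = abs(a[0] - b[0])             # coordinate difference, not thresholded
    return content_match, distance


print(co_occurrence((0, 5), (1, 5)))        # same value, adjacent coordinates
print(co_occurrence((0, 5), (1, 4)))        # nearly the same value is still no match
print(graded_match((0, 5), (1, 4)))
```

The second call shows the crudeness: values 5 and 4 are almost equal, yet the binary test discards that similarity entirely, while the graded version keeps it.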
I appreciate your interest, but don't see how this could get productive.
Thanks guys...the presence of this algorithm and your views on coming up with implementable ideas are useful in modifying my own thinking.
Great. Sorry for being brusque, I developed an allergy to AI philosophy after decades of interminable discussions :).
That's ok :)
One wording of the problem with philosophy without mapping to schema. We need:
Definitions which are specific enough in order to make specific decisions on specific current input and input history, yet that are general enough in order to cover all kinds of spatio-temporally continuous input and incrementally map it to a hierarchical search space that allows reliable predictions of future inputs based on the past ones.
For the symbol grounding/sensory-motor grounding, the system has to be able to map to pixel-level input and sequences and to respective smallest step coordinate adjustments.
Here is an example of some of my work on AGI. If you don't like me posting this, you are free to remove it, Boris.
===========
INVERSE REINFORCEMENT LEARNING CONDITIONED ON BRAIN SCANS
https://www.researchgate.net/publication/331152721_Inverse_reinforcement_learning_conditioned_on_Brain_Scans
We outline a way for an agent to learn the dispositions of a particular individual through inverse reinforcement learning where the state space at time t includes an fMRI scan of the individual, to represent his brain state at that time. The fundamental assumption being that the information shown on an fMRI scan of an individual is conditioned on his thoughts and thought processes. The system models both long and short term memory as well as any internal dynamics we may not be aware of that are in the human brain. The human expert will put on a suit for a set duration with sensors whose information will be used to train a policy network, while a generative model will be trained to produce the next fMRI scan image conditioned on the present one and the state of the environment. During operation the humanoid robot's actions will be conditioned on this evolving fMRI and the environment it is in.
Tofara, my intro makes it very clear that I have low regard for coarse modeling. Please stop posting here.
The human brain is incremental. Artificial models do not tend to go that way. I like that you do.
Look at this picture of a newborn baby's brain: https://pl.wikipedia.org/w/index.php?title=Paul_Flechsig&oldid=18203928#/media/File:FlechsigSaggital4.jpg Only the motor and sensory axons are connected with fat. That is why a baby's brain is so small. Later on, more and more neurons get included in the circuit.
Thanks. My model has some high-level similarities with the brain, but I think the differences are more illuminating.