Discussion with Andrey K


Hi Boris,

recently I looked at visual attention models like this: http://arxiv.org/abs/1406.6247 The models provide better results than convolutional networks and require fewer computations. This seems inspired by the fact that the eye doesn't perceive the whole view at once, but very quickly moves to project various details of the view onto a tiny spot in the retina.

It appears 'attention' is a key concept in learning. The senses generate a huge amount of input, and trying to learn patterns from all that data would quickly overwhelm any system. I searched your algorithm description for 'attention' and didn't find it. Perhaps you use a different term? What are your thoughts on the matter?


Andrey, attention is synonymous to feedback in my terms. Feedback is increasingly coarse with relative elevation:
previous inputs for comparison, next-level inputs for evaluation (they are converted into filters), next-next-level inputs for pre-valuation (they are converted into pre-filters), and so on. 
Attention is how you select among future inputs. Filters select inputs for spatio-temporal resolution: comparison vs. summation, and pre-filters select future input spans for initial detection: evaluation vs. skipping. This is equivalent to skipping by saccades, but my selection process is supposed to be much finer-grained. It's an implementation of selection for additive predictive value, a strictly unsupervised process.
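A toy illustration of filter-based selection, with my own simplifications (scalar inputs, match taken as the smaller magnitude of a pair, the filter fed back as a running average match; all names and the update rule are mine, not from the algorithm description):

```python
# Toy sketch of filter-based selection: inputs whose match to the previous
# input exceeds a feedback threshold are kept at full resolution (compared),
# the rest are coarsened (summed). Threshold = running average match so far.
def select(inputs):
    compared, summed = [], 0
    threshold, prev = 0.0, inputs[0]
    for i, x in enumerate(inputs[1:], 1):
        match = min(prev, x)                      # match as overlap of magnitudes
        if match > threshold:
            compared.append((prev, x, x - prev))  # fine: keep the difference
        else:
            summed += x                           # coarse: lump into a summary
        threshold += (match - threshold) / i      # feedback: running average
        prev = x
    return compared, summed
```

The point is only that selection is input-driven and self-adjusting: the more matching data accumulates, the higher the bar for fine-grained comparison.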


Right, I missed that.

From what I understand, the objective of feedback is to help discover new patterns at the higher levels it comes from, right?

If so, it can still blow up. If we take 'pixels in video' as input then yes, it may work. But let's take a human, who has 3D vision, hearing, tactile pressure senses over the body, smell, temperature and so on. These are many modalities. Having a "discover patterns" objective + the ability to actively change the environment via movement would result in exponential growth. The brain handles this via a more complex objective, mostly focused on survival. This certainly limits the brain's ability to discover new patterns, but it also saves it from blowing up.


The objective is to maximize the predictive value of the model of the world. Patterns are how we quantize that model, to make it selective. It can't blow up because the threshold (filter, feedback) for selection is the average predictive value per computational resource in the model. The more data in the model, the higher the threshold, and the more selective it gets. The problem is not selectivity per se, it's the accuracy in quantifying predictive correspondence.

The brain doesn't have much in the way of fixed objectives. Instincts are only a starting point; most human motivation is relatively fluid conditioning.


Got it, thanks.

Usually a model is represented as a probability distribution:
P(S_t | S_{t-1}, S_{t-2}, ...),
where S_t is the state at time t, S_{t-1} the state at time t-1, and so on.
So P is the probability density function (PDF) of the new state given all previous states.
In Markovian systems, only the previous state S_{t-1} is used.

It can also be deterministic, where instead of a PDF there would be a regular function.

Without going into particular algorithm details, is this definition similar to what you use? 

The typical way to create such models is to collect a lot of state-transition data and approximate the PDF via regression. Then this model can be used to plan/perform optimal actions to achieve specific objectives. Usually the model alone has no practical application, so some objectives beyond creating a world model are desirable. My understanding is that there are two methods to define objectives:
1) set a specific target state S_target and get optimal actions via a differentiable state-transition function;
2) if the target state is unknown and/or the state transition is not differentiable, define state rewards and follow them to achieve maximum cumulative reward within a limited number of actions, by dynamic programming and other methods.
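As a minimal sketch of that typical approach, here is a tabular Markov model estimated by counting (all names are illustrative; real continuous-state models would use regression instead of a table):

```python
from collections import Counter, defaultdict

# Estimate a tabular Markov model P(s_t | s_{t-1}) from observed transitions
# by normalized counting; a stand-in for PDF regression on continuous states.
def fit(transitions):
    counts = defaultdict(Counter)
    for prev, nxt in transitions:
        counts[prev][nxt] += 1
    return {s: {n: c / sum(cs.values()) for n, c in cs.items()}
            for s, cs in counts.items()}

model = fit([('a', 'b'), ('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'a')])
# Greedy one-step "planning": follow the most probable successor of 'a'.
best_next = max(model['a'], key=model['a'].get)
```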

I'm not sure whether that correlates with what you have.


Andrey, my model is the whole hierarchy of patterns. 
The difference from a sequence of states is that patterns have variable length, past match, and all other variables that determine where and to what extent they are predictive.
I define predictive value as projected match, grey-scale, part 2. This is a major difference from probabilistic prediction, where a match of state to input is defined with binary precision.
And so on, all of my intro is about principles and mechanics of prediction. BTW, I just updated it, mostly parts 4, 5, 5b.

Yes, you can bias a model or patterns therein with practical targets and rewards. But the hard part is to have accurate and compressed model in the first place.


OK, I'm still missing a high-level definition. Again, without going into details such as the hierarchy of patterns, variables, etc. Let's leave that in a black box.
A model assumes an input, a state and an output. It is clear what the input and state are. What is the output, or in other words, what does the model predict?

From what I understand so far, you build a compressed representation of the input. This is certainly cool, but there are other ways to do it, e.g. deep convolutional networks, sparse coding, auto-encoders and other methods of self-taught learning. I do believe your algorithm provides a more effective representation than those methods. However, the problem with those methods is not a lack of effectiveness; the problem is "how to bias them with practical targets", in your terms. While classification tasks such as speech, object, image and video recognition are being successfully solved, shaping a policy of actions by rewards is an extremely difficult task: e.g. in chess you get a reward only at the end of a long sequence of moves, and there is nothing in between. Classification works because you can define the "bias" (or shortcuts) via labels, so it becomes supervised learning.


The model predicts future inputs, duh. Higher levels are increasingly compressed and predictive representations. Compression, by classifying inputs into patterns and variables, is all science does. It will never be effective enough: the complexity of a subject matter grows with the data. The only limit here is one's imagination. Supervision of this process is only difficult if this classification is not explicit (low resolution), as it is in all statistical learning. If you have well-defined and stable patterns (potential targets) and variables (potential rewards), weighting or biasing them is a no-brainer. They should also be easy to associate with the subjective taxonomy of a human expert in the subject. You simulate such models by setting target values of specific variables, and then multiplying associated vectors (differences, ratios, etc., within patterns) by the ratio of their current value to that target value.
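A minimal sketch of that biasing step, with an assumed pattern layout (a dict of parameters and derived vectors; the structure and names are illustrative, not the actual pattern syntax):

```python
# Bias a pattern toward a target: scale its derived vectors (differences,
# ratios within the pattern) by the ratio of the chosen variable's current
# value to its target value. Pattern layout here is a simplifying assumption.
def bias_pattern(pattern, var, target):
    current = pattern['params'][var]
    ratio = current / target if target else 0.0
    return {v: d * ratio for v, d in pattern['vectors'].items()}

p = {'params': {'size': 4.0}, 'vectors': {'d_size': 2.0, 'd_angle': 0.5}}
biased = bias_pattern(p, 'size', 8.0)   # ratio = 4.0 / 8.0 = 0.5
```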
See, you won't get very far by black-boxing the process.


Well, I don't see much difference from current unsupervised learning. It is easy to get a hierarchy of firing neurons with one of them firing at a specific high-level pattern. Maybe not as easy as in your algorithm, since due to shared weights there might be many firing neurons, but still this is not an obstacle. The obstacle is to identify why it is actually firing: maybe it properly recognized an apple, or maybe it just fires because of a round form.
They should also be easy to associate with subjective taxonomy of an human expert in the subject.
This is very questionable. You will never know for sure with your algorithm either, unless you show many different objects and observe the pattern values associated with them. Basically it is the same supervised learning, which BTW is not difficult at all. The only difficulty is getting that many objects and showing them. So we come back to regular supervised learning, as with statistical approaches.

My point is that hierarchical pattern representation is not really the critical part; it could be as vague as neurons or as precise as in your algorithm. The most complicated part is to get something practical on top of this representation. Or in other words, to set objectives with minimal effort.


The obstacle is to identify why it is actually firing 

That was my point re variables being potential rewards. NN nodes recognize patterns, but they don't represent specific derived parameters within them. My pattern is a set of parameters in a fixed sequence (syntax), so you know which ones are being recognized in each case. And then you can reinforce the parameters you are interested in, not a whole pattern. This way you don't need any supervised training at all: the system already knows what to look for. Basically, it will use a strong-enough parameter as a coordinate to re-order inputs, compare them in that order, and derive vectors associated with an increase in that "reward" parameter (I have that at the end of part 7). Then you iteratively project these vectors to maximize the reward.
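A toy sketch of that re-ordering, assuming patterns are dicts, 'r' is the reward parameter and 'x' is another parameter (all names are mine):

```python
# Use a "reward" parameter as a coordinate: sort patterns by that parameter,
# compare neighbors in the new order, and derive the vectors (differences in
# other parameters) associated with its increase.
def reward_vectors(patterns, reward, other):
    ordered = sorted(patterns, key=lambda p: p[reward])
    return [(b[reward] - a[reward], b[other] - a[other])
            for a, b in zip(ordered, ordered[1:])]

ps = [{'r': 3, 'x': 9}, {'r': 1, 'x': 4}, {'r': 2, 'x': 5}]
vecs = reward_vectors(ps, 'r', 'x')   # vectors along increasing reward
```

Projecting along these vectors (extrapolating the 'x' differences) is then a purely unsupervised way to maximize the chosen parameter.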

The only way to do this with NNs is to have multiple nets, each recognizing patterns according to one parameter, and then integrate these nets. So, patterns will be microstructure and parameters macrostructure, which is extremely perverse and inefficient. Again, in my approach patterns are macro and variables are micro. 

There is no fixed ratio of supervised to unsupervised learning: the more efficient the latter, the less supervision you need. If the algorithm can scale to understanding why experts do what they do (lots of data there), it can then predict / project what they will do, thus doing their job for them. All strictly unsupervised.


A simple question: how do you know the actual meaning of a particular pattern, or even a particular parameter represented by a variable in that pattern?

My understanding is that you may only guess by observing the variables while showing objects. And still your guess would not be reliable enough unless you show a statistically representative number of samples. The task is as complicated as guessing why a particular neuron is firing. This is like trying to understand the meaning of an unknown language with millions of words in its vocabulary.

Am I wrong?


I did say that last time we talked about it, but mostly because I am not into supervision.

Syntax should be easy to track: it's a tree with a known root at the bottom level and selective branching on each subsequent level. So you end up with a lot of branches, but the map is preserved and easy to visualize, because the topological proximity of branches directly corresponds to their semantic proximity. When you have a strong variable in a high-level pattern, you can trace its origins back across the levels and relate them to known salient parameters of a subject matter. And understanding the salient variables of a pattern tells you what it represents.

This is very different from neurons: they fire from summed inputs, so the contribution of each is not preserved. Except through some convoluted and protracted pruning by feedback, which drastically reduces the effective complexity of traceable causes.


Boris, what you describe might be a very complicated procedure. Relying on external translation of internal representation is overoptimistic.

Certain feedback can be considered as actions: the actions that communicate internal representation to an external observer. The system should exhibit some behavior that indicates its state, otherwise it would be very difficult to analyze its internals. Besides, actions would help to explore more environment states and gain more understanding of the environment.

It's actually simple, Andrey. My patterns are the same as equations of analytical geometry. It's very straightforward to decompress them into 4D distributions, outlines, skeletons, with multiple levels of detail, which will be visually recognizable. 

I do cover actions at the end of part 4, on feedback. But they are a way to selectively expand the scope of inputs; interpretation is easier by visualizing the resulting high-level motor patterns.

BTW, I just updated the intro, mostly part 5. I am trying to generalize several orders of feedback that I already defined. Feedback is a key concept; having self-generated higher orders might give me a scalable algorithm.


Thanks, I'm checking the updates.

Following your comments on geometry equations, does it mean the algorithm is specifically adjusted to visual input? If so, then it is not general intelligence, and the level algorithms would have to be adjusted for each new modality, right?


No, any modality will have 4D distribution, because we live in 4D continuum. Discontinuity will increase on higher levels of search and generalization, but it's only a matter of degree, and their inputs are previously discovered 4D patterns. Mind you, symbols are not new modalities, they only stand for such patterns, sometimes implicitly. Except for pure math.
The algorithm is not specific to space-time, the core is time-only, and it can derive any number of secondary dimensions (end of part 7). But it always follows some explicit coordinates, so the same rules apply.
Of course, you can only interpret resulting patterns if you understand the subject matter at least as well as the system does.




Hi Boris,

you refer to On Intelligence by Hawkins as an excellent introduction, as it describes foundations: hierarchical memory, predictions, feedback, etc.

Hawkins also outlines that a layer's output is a squeezed summary of input sequences, e.g. one output per 10 inputs that form some recognized sequence (temporal pattern). This is required to get an invariant representation of a sequence and for unfolding feedback.

It does not appear that such a concept is used in any current ANN structures. They do not provide invariance on sequences of inputs, only invariance of a single input (spatial invariance).

The implementation would imply that the frequency of upper layers/levels gets lower. Indeed, an upper level has to wait until the lower level recognizes an input sequence. This appears to be a waste of resources, and I doubt it actually works this way.

Could you please comment on the importance of this concept: do you use it in your algorithm?




Invariance is another term for match, which defines my patterns. But in Hawkins model the inputs are binary, every synapse receives a sequence, and pattern consists of coincident inputs, their match is also binary. My inputs are integers, match is grey-scale, and there is only one "synapse" and sequence per level. That's because the first step in my model is analog-to-digital conversion that forms integers. It's a start because digitization is the simplest form of lossless compression, and complexity must be strictly incremental with elevation.
So, his layer is multiple sequences of binary inputs, mine is a single sequence of integer inputs. That's because my dimensionality is strictly incremental: processing starts with 0D digitization, which is a far more efficient compressor for raw inputs than comparison among multiple sequences.

A regular ANN can process sequences if a layer is 1D. I think the use of binary inputs in HTM restricts, rather than improves, its usefulness. Hawkins makes a big deal of using connections rather than weights, but it's only a POV: a connection is the same as a binary weight.

Reduced frequency is an interesting question, but it doesn't mean that higher levels have to wait. In my model their inputs are complex multi-level patterns, which take a lot longer to search through, compensating for the delay of inputs. In the cortex, higher association areas have longer-range connections, and the number of possible associations is exponential with their range. So, they "reverberate" longer to establish connections, also compensating for the delay of inputs. 


Interesting, thanks for the reply, Boris.

I'm still in doubt on sequences. The frequency decrease depends on sequence length. If the length is 10, then the frequency is decreased by 10^6 after six levels, which is a huge difference for the cortex, I think. It does not look like this can be explained by longer-range connections. By the way, what is the length of the sequences: is it constant or variable? If variable, where does a sequence end?

Just a side comment on Hawkins. The Hawkins sequence is not per-synapse (binary). He describes a region of cortical columns, and a time step is the activation status of all columns in the region, not just one. This is the same as integers. But I really don't want to go into a comparison of Hawkins vs. your algorithm. Hawkins does not propose any algorithm; he just outlines ideas of how the cortex may work.


The effect of connection length is exponential because the pattern / unit of representation in the cortex is a co-activated network of neurons, or cognit: http://charbonniers.org/2009/12/31/fusters-theory-of-cognits/ , http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3457701/
These networks are similar to RNNs, the nodes continue to activate each other for some time after initial input.
I think that time is exponential with the number * length of connections within a network.

A sequence is a pattern, so the length is variable: it ends when inputs stop matching. In my model it's a complemented pattern: a span of contiguous match + a span of contiguous miss. But the criterion (average match) is adjusted by feedback to keep pattern length within a certain range. Again, this is different from Hawkins' and statistical models, because primary comparison in my model is within a sequence, not between sequences.
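A toy sketch of such variable-length termination, with my simplifications (integer inputs, match as the smaller of each consecutive pair, a fixed average-match criterion instead of feedback, and the complementary miss span omitted):

```python
# Variable-length pattern termination: compare consecutive integers and extend
# the current span while the match (min of the pair) exceeds the average-match
# criterion 'ave'; terminate and start a new pattern when inputs stop matching.
def form_patterns(inputs, ave):
    patterns, span = [], [inputs[0]]
    for prev, x in zip(inputs, inputs[1:]):
        if min(prev, x) > ave:   # still matching: extend the span
            span.append(x)
        else:                    # stopped matching: terminate the pattern
            patterns.append(span)
            span = [x]
    patterns.append(span)
    return patterns
```

Raising or lowering `ave` by feedback would directly lengthen or shorten the resulting patterns.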

See appendix A: the inputs and node activations within a column are binary, not integers.


While recurrent connectivity can be long, I'm not sure it is used for communication between levels. I think recurrence is used to memorize sequences. Given the low speed of neural signals, the frequency decrease would stop all activity after very few levels. The entire concept of passing an invariant representation of sequences to a higher level looks nice, but I don't see how it could be workable, particularly for variable sequence length. Besides delays, it will desynchronize communication with the higher level, as the "waiting time" would depend on sequence length and could be either too short or prohibitively long. Maybe sequence length is limited, like the spatial perceptive field is limited for each region? For example, each region could memorize sequences up to length 10.

Another solution would be not waiting until the sequence ends (prediction mismatch), but sending the invariant representation (based on the prediction) immediately. Though that would bring too much noise.

It is still not clear to me how invariant representations of sequences could be formed in the cortex.

Thanks for the links! I didn't know Hawkins had made more progress. Right, columns are binary, but the term "sequence" is applied to a region of columns. A region has many columns, and their combined activation is one time step in a sequence.


Recurrent connectivity of a cognit is mostly within a level: it needs to reverberate to sift through the increased number of possible associations there. Search within a level compensates for the reduced frequency of lower-level inputs. The increased temporal receptive field of higher association levels is a well-established fact.
Again, sequences can be represented by 1D ANN layers. You simply accumulate them over time; otherwise it's the same layers, but 1D instead of 2D.

Yes, my patterns are also outputted before termination if they exceed a certain length; I explain it briefly in part 3.
Resolution of both input and coordinate is limited by bit and byte; that maximal length is the coordinate's byte.

Hawkins has a sequence of binary inputs for each synapse, see that appendix A. Although the node seems to OR these inputs between sequences, that's no processing at all. I don't understand the point of that, but his basics are wrong anyway.



Again, sequences can be represented by 1D ANN layers. You simply accumulate them over time

How to accumulate them? I checked the latest Hawkins document and didn't find anything on the matter either. In fact he completely omits how a temporal invariant representation is formed. He focuses a lot on simple tasks, such as prediction and memorization of sequences, but omits how an invariant representation of a predicted sequence goes to upper levels.

I need some realistic explanation of how the cortex may do that, otherwise it is not a well-established fact for me. True, it is known that neurons have different firing frequencies, but this difference is too small to be explained by an increased temporal field. The only explanation I have so far is outputting before sequence termination, as in your algorithm. But I don't think it would work well, due to too much noise: errors in prediction of long sequences.

I'm not sure why you are referring to Appendix A. That appendix has not a single appearance of the word "sequence". Where sequences are defined in the text, they are always applied to regions and never to single columns.

Actually I found a convincing explanation in the Hawkins document. The idea is based on predicting multiple steps ahead in time:
A region doesn’t just predict what will happen immediately next. If it can, it will predict multiple steps ahead in time. Let’s say a region can predict five steps ahead. When a new input arrives, the newly predicted step changes but the four of the previously predicted steps might not. Consequently, even though each new input is completely different, only a part of the output is changing, making outputs more stable than inputs. 

In this case I assume that the invariant representation of a sequence would be a summary of several future inputs (predictions). And there is no waiting: each input produces an output.

I gave it more thought and don't think it is that convincing. The issue is how to create a summary of several future inputs. The key word is "summary". The summary inevitably increases the size of the output pro rata to the sequence length, which cannot be true if we want to increase the receptive field (assuming each level has the same number of inputs as the previous one, so the number of outputs should be less than the number of inputs).

Boris, could you please explain why increasing the temporal receptive field is so important? Sequences at all levels can be well predicted without that. Musical rhythms may also be modeled via fixed-size silence intervals. Most people do not really have a good sense of relative time: you may think that you spent more time on something while in fact it is the opposite. Insignificant frequency differences in neuron firing can be explained by spatial receptive field increase, and so forth. Is there something that mandates increasing temporal receptive fields?


Again, sequences can be represented by 1D ANN layers. You simply accumulate them over time
How to accumulate them? I checked latest Hawkins document and didn't find anything on the matter as well.

I meant that this is how conventional ANNs recognize sequences. There are different ways to quantize them; I didn't get into details because I don't like weighted summation anyway.

In fact he completely omits how temporal invariant representation is formed. He focuses a lot on simple tasks, such as prediction and memorizing of sequences, but omits how invariant representation of a predicted sequence goes to upper levels.

I think you are asking how the sequences are quantized before being correlated. It's what he calls "temporal pooling", and he is quite fuzzy about it: https://github.com/numenta/nupic/wiki/New-Ideas-About-Temporal-Pooling
There is more on IBM collaboration:

I need some realistic explanation of how cortex may do that, otherwise it is not well-established fact for me. True, it is known that neurons have different firing frequencies but this difference is too small to describe it by increased temporal field. 

Increased temporal receptive field is about reduced output frequency, not about defining sequential patterns.
I have a bunch of references in my post: http://cognitive-focus.blogspot.com/2014/10/cortical-trade-offs-specialist-vs.html , the section on differences between higher and lower cortical regions.

The only explanation I have so far is outputting before sequence termination, as in your algorithm. But I don't think it would work well due to much noise - errors in prediction of long sequences. 

In my algorithm it's an exception, not the rule. These prematurely terminated patterns are then spliced on the next level, which eliminates the noise.

I'm not sure why you referring Appendix A. That Appendix has no a single appearance of "sequence" word. Where sequences are defined in the text they are always applied to the regions and never to single columns.

It's not words, it's graphics: sequences of binary inputs are shown at each synapse, and the "neuron" ORs them. Which doesn't make any sense to me. There are sequences on all levels of scale, but the actual processing must be done by the node.

In this case I assume that invariant representation of a sequence would be summary of several future inputs (predictions). And there is no waiting - each input produces an output.

That doesn't explain how the input sequence and the predicted sequence are aligned, which is the main problem.
Prediction generation is precisely why higher-level cognits reverberate so long; that's not waiting.
Each input doesn't produce an output: a match or projected match to predictions blocks the input. Otherwise there is no compression and no point in predicting.

Increase in receptive field is synonymous with increase of generality | invariance | predictive value: the whole point of cognition. It has to be both spatial and temporal: time is the macro-dimension for prediction. Music is not a good test case here: it's utterly meaningless.


Thanks, it is clearer now.


If you are not familiar with it, the author has a few blog posts on the matter:

At one point he raises this concern:
"But is prediction really the main, or even the most important, tool underlying fluid adaptive intelligence? Or is multi-level neural prediction just one small cog in the complex evolved adaptive machine? I’m betting on it being way, way more than just one small cog. But beyond that, your (multi-level, precision-modulated) guess is as good as mine."

I would appreciate your opinion on those "dark sides of the predictive brain", if you could find a moment to look at this.


That book seems very vague and philosophical, not the level I work at these days. Regarding the “dark side”, I think he is conflating purely cognitive and instinctive + conditioned mechanisms in the brain. Very roughly:
- brainstem and hypothalamus implement low-level innate instincts,
- limbic system implements various types of conditioning, including reinforcement learning,
- neocortex and cerebellum implement cognitive mechanisms, with lots of innate biases and “phyletic memories”.

I am only interested in the third category, without biases and neuro-artifacts. In purely cognitive terms, Clark thinks that maximizing prediction will minimize surprise / novelty, which is not what we are doing. That’s true for maximizing the match of inputs to memory, but not for maximizing prediction: projected match, which is what my model does. Relevant paragraphs in part 3:

Hierarchical input selection must use a common criterion, which defines a system. Two obvious choices are novelty and generality (miss and match), but we can’t select for both: they exhaust all possibilities. Novelty can’t be primary criterion: it would select for noise and filter out all patterns, which are defined by match. On the other hand, to maximize the match of inputs to memory we can “stare at a wall” and lock into predictable environments. But of course, natural curiosity actively avoids predictable locations, thus reducing the match.

This contradiction is resolved if we maximize predictive power: projected match, rather than actual match, of inputs to records. To the extent that new match was projected by previous knowledge, it is not additive to system’s projected match. But neither are novel inputs that are projected to miss subsequent inputs.
To increase predictive power, new inputs should match subsequent but not previous inputs:
additive projected match = downward (V) projected match - upward (Λ) projected match.

This also covers his “prediction by action”. It’s true that making environment more predictable will increase locally projected match. But equivalent resources spent on exploration and cognitive capacity will increase globally projected match, which is far greater quantitatively.
So, we don’t need to bring up biological and survival aspects to resolve this novelty vs. generality dilemma.

BTW, do you mind if I post this and last three of our threads in the “discussion” section of my blog?


Thanks for the reply, Boris. Understood regarding the "dark sides".

Regarding the actions to increase prediction: I think of actions as feedback to facilitate invariance. For example, rotate/shift/zoom an image to match a pattern (or to increase prediction, in other words). The number of such actions should be limited, of course; they may start randomly and move in the direction of higher match. Does it make sense? Note, an image is just an example; it could be any other input.

Sure, you may share our messages.


Thanks Andrey, I just posted the discussion.
I also posted update to the core article, there are some significant changes in the way levels 2 and higher operate, especially since we last talked. This is mostly in part 4, the only one I work on now.

Implementation will be my level 1 adapted to process horizontal scan lines, level 2 to process frames, level 3 to process temporal sequences of frames. Depth would probably be a derived dimension, not explicitly declared. I am trying to formalize these 3 levels into pseudo code; currently working on level 2, it is still a mess. So actual programming has to wait for the pseudo code. But if you want to pitch in formalizing level 2, by all means, be my best friend :).

Regarding “invariating” perception by action: this is the most basic way to make the environment more predictable. It helps the human brain, but I think that’s because the brain is so horribly inefficient. For something like my algorithm, “mental” rotation, zooming, etc., should be much cheaper than doing it physically. Except as a way to gain new information, but that won’t likely be invariant.
