Andrey:
Recently I looked at visual attention models like this one: http://arxiv.org/abs/1406.6247 The models provide better results than convolutional networks and require less computation. This seems inspired by the fact that the eye doesn't perceive the whole view at once but very quickly moves to project various details onto a tiny spot on the retina.
It appears 'attention' is a key concept in learning. The senses generate a huge amount of input, and trying to learn patterns from all that data would quickly overwhelm any system. I searched your algorithm description for 'attention' and didn't find it. Perhaps you use a different term? What are your thoughts on the matter?
Boris:
Andrey, attention is synonymous with feedback in my terms. Feedback is increasingly coarse with relative elevation:
previous inputs for comparison, next-level inputs for evaluation (they are converted into filters), next-next-level inputs for pre-valuation (they are converted into pre-filters), and so on.
Attention is how you select among future inputs. Filters select inputs for spatio-temporal resolution: comparison vs. summation, and pre-filters select future input spans for initial detection: evaluation vs. skipping. This is equivalent to skipping by saccades, but my selection process is supposed to be much finer-grained. It's an implementation of selection for additive predictive value, a strictly unsupervised process.
Andrey:
Right, I missed that.
From what I understand, the objective of feedback is discovering new patterns on the higher levels it comes from, right?
If so, it can still blow up. If we take 'pixels in video' as input, then yes, it may work. But let's take a human, who has 3D vision, hearing, tactile pressure senses over the body, smell, temperature, etc. That is a lot of modalities. Combining a 'discover patterns' objective with the ability to actively change the environment by moving would result in exponential growth. The brain handles this via a more complex objective, mostly focused on survival. This certainly limits the brain's ability to discover new patterns, but it also saves it from blowing up.
Boris:
The objective is to maximize the predictive value of the model of the world. Patterns are how we quantize that model, to make it selective. It can't blow up, because the threshold (filter, feedback) for selection is the average predictive value per computational resource in the model. The more data in the model, the higher the threshold and the more selective it gets. The problem is not selectivity per se, it's the accuracy in quantifying predictive correspondence.
The brain doesn't have much in the way of fixed objectives. Instincts are only a starting point; most human motivation is relatively fluid conditioning.
Andrey:
Got it, thanks.
Usually a model is represented as a probability distribution:
P(S_t | S_{t-1}, S_{t-2}, ...),
where S_t is the state at time t, S_{t-1} is the state at time t-1, and so on.
So P is the probability density function (PDF) of the new state given all previous states.
In Markovian systems, it is conditioned only on the previous state S_{t-1}.
The model can also be deterministic, in which case the PDF is replaced by a regular function.
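For concreteness, here is a minimal sketch of such a model in the Markovian case, fitted from transition counts (all names and numbers are just illustrative):

```python
import numpy as np

# Minimal first-order Markov model: estimate P(S_t | S_t-1) from an
# observed state sequence by counting transitions (illustrative only).

def fit_transition_matrix(sequence, n_states):
    """Count transitions and normalize rows into a stochastic matrix."""
    counts = np.ones((n_states, n_states))  # add-one smoothing
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        counts[prev, curr] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def predict_next(transitions, state):
    """PDF over the next state, conditioned only on the previous state."""
    return transitions[state]

sequence = [0, 1, 1, 2, 0, 1, 2, 2, 0]
T = fit_transition_matrix(sequence, n_states=3)
print(predict_next(T, state=1))  # distribution over the next state
```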
Without going into particular algorithm details, is this definition similar to what you use?
The typical way to create such models is to collect a lot of state-transition data and approximate the PDF via regression. The model can then be used to plan/perform optimal actions to achieve specific objectives. A model alone usually has no practical application, so some objectives beyond building a world model are desirable. My understanding is that there are two methods to define objectives (a toy sketch of the second follows the list):
1) Set a specific target state S_target and derive optimal actions via a differentiable state-transition function.
2) If the target state is unknown and/or the state transition is not differentiable, then define state rewards and follow them to achieve maximum cumulative reward within a limited number of actions, via dynamic programming and other methods.
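For method 2, a toy value-iteration sketch over a random MDP (all sizes, rewards, and transitions are made up for illustration):

```python
import numpy as np

# Toy value iteration for method 2: state rewards + dynamic programming.
# Everything here (sizes, rewards, transitions) is illustrative.

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

# P[a, s, s'] = probability of moving s -> s' under action a
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.normal(size=n_states)  # reward for landing in each state

V = np.zeros(n_states)
for _ in range(100):
    # Bellman backup: Q(a, s) = sum_s' P(s'|s,a) * (R(s') + gamma * V(s'))
    Q = np.einsum("ase,e->as", P, R + gamma * V)
    V = Q.max(axis=0)  # best achievable value in each state

policy = Q.argmax(axis=0)  # greedy action per state
print(V, policy)
```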
I'm not sure whether that correlates with what you have.
Boris:
Andrey, my model is the whole hierarchy of patterns.
The difference from a sequence of states is that patterns have variable length, past match, and all other variables that determine where and to what extent they are predictive.
I define predictive value as projected match, which is grey-scale (part 2). This is a major difference from probabilistic prediction, where the match of a state to an input is defined with binary precision.
And so on, all of my intro is about principles and mechanics of prediction. BTW, I just updated it, mostly parts 4, 5, 5b.
Yes, you can bias a model, or patterns therein, with practical targets and rewards. But the hard part is to have an accurate and compressed model in the first place.
Andrey:
OK, I still lack a high-level definition. Again, without going into details such as the hierarchy of patterns, variables, etc.; let's leave all that within a black box.
A model assumes an input, a state, and an output. It is clear what the input and state are. What is the output, or in other words, what does the model predict?
From what I understand so far, you build a compressed representation of the input. This is certainly cool, but there are other ways to do it, e.g. deep convolutional networks, sparse coding, auto-encoders, and other methods of self-taught learning. I do believe your algorithm provides a more effective representation than those methods. However, the problem with those methods is not a lack of effectiveness; the problem is "how to bias them with practical targets," in your terms. While classification tasks such as speech, object, image, and video recognition are being successfully solved, shaping a policy of actions by rewards is an extremely difficult task: e.g. in chess you get a reward only at the end of a long sequence of moves, and there is nothing in between. Classification works because you can define the "bias" (or shortcuts) via labels, so it becomes supervised learning.
Boris:
The model predicts future inputs, duh. Higher levels are increasingly compressed and predictive representations. Compression, by classifying inputs into patterns and variables, is all science does. It will never be effective enough: the complexity of a subject matter grows with the data. The only limit here is one's imagination. Supervision of this process is only difficult if the classification is not explicit (low resolution), as it is in all statistical learning. If you have well-defined and stable patterns (potential targets) and variables (potential rewards), weighting or biasing them is a no-brainer. They should also be easy to associate with the subjective taxonomy of a human expert in the subject. You simulate such models by setting target values of specific variables, and then multiplying associated vectors (differences, ratios, etc., within patterns) by a ratio of their current value to that target value.
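Schematically, something like this, with the pattern structure greatly simplified and the names purely illustrative:

```python
# Schematic of biasing a pattern toward a target: scale the pattern's
# derived vectors by the ratio of a variable's current value to its target.
# The pattern structure and names are greatly simplified / illustrative.

def project_to_target(pattern, var, target):
    """Multiply derived vectors by the current/target ratio of one variable."""
    ratio = pattern["params"][var] / target
    return {name: v * ratio for name, v in pattern["derivatives"].items()}

# a pattern as a fixed sequence (syntax) of parameters plus derived vectors
pattern = {
    "params": {"match": 40.0, "brightness": 120.0},
    "derivatives": {"d_brightness": 5.0, "r_brightness": 1.04},
}

print(project_to_target(pattern, "brightness", target=60.0))
```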
See, you won't get very far by black-boxing the process.
Andrey:
Well, I don't see much difference from current unsupervised learning. It is easy to get a hierarchy of firing neurons with one of them firing at a specific high-level pattern. Maybe not as easy as in your algorithm, since due to shared weights there might be many firing neurons, but still, this is not an obstacle. The obstacle is identifying why it is actually firing: maybe it properly recognized an apple, or maybe it just fires because of a round shape.
"
They should also be easy to associate with subjective taxonomy of an human expert in the subject.
"
This is very questionable. You will never know for sure with your algorithm either, unless you show it many different objects and observe the pattern values associated with them. Basically, it is the same supervised learning, which, BTW, is not difficult at all. The only difficulty is getting that many objects and showing them. So we come back to regular supervised learning, just as with statistical approaches.
My point is that the hierarchical pattern representation is not really the critical part; it could be as vague as neurons or as precise as in your algorithm. The most complicated part is getting something practical on top of this representation, or in other words, setting objectives with minimal effort.
Boris:
"The obstacle is identifying why it is actually firing"
That was my point re variables being potential rewards. NN nodes recognize patterns, but they don't represent specific derived parameters within them. My pattern is a sequence of parameters in a fixed order (syntax), so you know which ones are being recognized in each case. And then you can reinforce the parameters you are interested in, not a whole pattern. This way you don't need any supervised training at all; the system already knows what to look for. Basically, it will use a strong-enough parameter as a coordinate to re-order inputs, compare them in that order, and derive vectors associated with an increase in that "reward" parameter (I have that at the end of part 7). Then you interactively project these vectors to maximize the reward.
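Roughly, and again greatly simplified with illustrative names:

```python
# Rough sketch: re-order inputs along a "reward" parameter, then derive
# vectors (differences) of another parameter in that order. Illustrative.

def derive_along(patterns, reward_param, other_param):
    """Sort patterns by the reward parameter, then difference another
    parameter in that order: the vector associated with reward increase."""
    ordered = sorted(patterns, key=lambda p: p[reward_param])
    return [b[other_param] - a[other_param]
            for a, b in zip(ordered[:-1], ordered[1:])]

patterns = [
    {"reward": 2.0, "angle": 30.0},
    {"reward": 5.0, "angle": 42.0},
    {"reward": 3.0, "angle": 35.0},
]

# how "angle" changes as the "reward" parameter increases
print(derive_along(patterns, "reward", "angle"))  # [5.0, 7.0]
```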
The only way to do this with NNs is to have multiple nets, each recognizing patterns according to one parameter, and then integrate these nets. So, patterns will be microstructure and parameters macrostructure, which is extremely perverse and inefficient. Again, in my approach patterns are macro and variables are micro.
There is no fixed ratio of supervised to unsupervised learning; the more efficient the latter, the less supervision you need. If the algorithm can scale to understanding why experts do what they do (there is a lot of data there), it can then predict/project what they will do, thus doing their job for them. All strictly unsupervised.
Andrey:
A simple question: how do you know the actual meaning of a particular pattern, or even of a particular parameter represented by a variable in that pattern?
My understanding is that you can only guess by observing the variables while showing objects. And still, your guess would not be reliable unless you show a statistically representative number of samples. The task is as complicated as guessing why a particular neuron is firing. This is like trying to understand an unknown language with millions of words in its vocabulary.
Am I wrong?
Boris:
I did say that the last time we talked about it, but mostly because I am not into supervision.
Syntax should be easy to track: it's a tree with a known root at the bottom level and selective branching on each subsequent level. So you end up with a lot of branches, but the map is preserved and easy to visualize, because the topological proximity of branches directly corresponds to their semantic proximity. When you have a strong variable in a high-level pattern, you can trace its origins back across the levels and relate them to known salient parameters of the subject matter. And understanding the salient variables of a pattern tells you what it represents.
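Schematically, the tracing itself is trivial; the structure and names below are illustrative, not my actual syntax:

```python
# Schematic: trace a high-level variable back to its origin through the
# derivation tree by walking parent links down to the root input.

def trace_origins(var):
    """Collect the chain of names from a derived variable to the root."""
    chain = []
    while var is not None:
        chain.append(var["name"])
        var = var["derived_from"]
    return chain

pixel = {"name": "brightness", "derived_from": None}
level1 = {"name": "d_brightness", "derived_from": pixel}
level2 = {"name": "dd_brightness", "derived_from": level1}

print(trace_origins(level2))  # ['dd_brightness', 'd_brightness', 'brightness']
```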
This is very different from neurons: they fire from summed inputs, so the contribution of each is not preserved, except through some convoluted and protracted pruning by feedback, which drastically reduces the effective complexity of traceable causes.
Andrey:
Boris, what you describe might be a very complicated procedure. Relying on external translation of the internal representation is over-optimistic.
Certain feedback can be considered actions: actions that communicate the internal representation to an external observer. The system should exhibit some behavior that indicates its state; otherwise it would be very difficult to analyze its internals. Besides, actions would help explore more environment states and gain more understanding of them.
Boris:
It's actually simple, Andrey. My patterns are the same as equations of analytical geometry. It's very straightforward to decompress them into 4D distributions, outlines, skeletons, with multiple levels of detail, which will be visually recognizable.
I do cover actions at the end of part 4, on feedback. But they are a way to selectively expand the scope of inputs; interpretation is easier by visualizing the resulting high-level motor patterns.
BTW, I just updated the intro, mostly part 5. I am trying to generalize several orders of feedback that I have already defined. Feedback is a key concept; having self-generated higher orders might give me a scalable algorithm.
Andrey:
Thanks, I'm checking the updates.
Following your comments on geometry equations, does that mean the algorithm is specifically adjusted to visual input? If so, then it is not general intelligence, and the level algorithms would have to be adjusted for each new modality, right?
Boris:
No, any modality will have a 4D distribution, because we live in a 4D continuum. Discontinuity will increase on higher levels of search and generalization, but it's only a matter of degree, and their inputs are previously discovered 4D patterns. Mind you, symbols are not a new modality; they only stand for such patterns, sometimes implicitly. Except for pure math.
The algorithm is not specific to space-time, the core is time-only, and it can derive any number of secondary dimensions (end of part 7). But it always follows some explicit coordinates, so the same rules apply.
Of course, you can only interpret the resulting patterns if you understand the subject matter at least as well as the system does.
Andrey:
OK, thanks.