Quite recently, when using multi-layered ELMo representations of words, I started thinking about the implications of having such an “auxiliary”, preliminary architecture in natural language processing. What’s going on here is an important kind of looking inside yourself for a better understanding of the world around you. If this sounds vaguely New Age, let me explain.
A regular NLP architecture that uses word embeddings works somewhat like a 1980s game console: you put into it one or more cartridges (embedding vectors), each representing a word. You use as many words of context as you need for the task. But the “cartridges” are static in the sense that for a given word, you will always use the same one (the same vector). This means that, for example, the representation of the word ‘play’ will mix meanings related to sports, acting and gaming. (Or it will be dominated by the meaning that was most frequent in the corpus on which the embeddings were trained.)
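To make the “cartridge” analogy concrete, here is a minimal sketch of a static embedding lookup. The vocabulary and the four-dimensional vectors are made up for illustration (real tables such as GloVe or word2vec use hundreds of dimensions); the point is just that ‘play’ comes out identical regardless of its sentence:

```python
import numpy as np

# Toy static embedding table: one fixed vector per word.
rng = np.random.default_rng(0)
vocab = {"play": 0, "guitar": 1, "football": 2}
embeddings = rng.normal(size=(len(vocab), 4))

def embed(word):
    # The same "cartridge" comes out no matter the context.
    return embeddings[vocab[word]]

# 'play' gets an identical vector in both sentences:
v1 = embed("play")  # ... as in "play football"
v2 = embed("play")  # ... as in "play a concert"
assert np.array_equal(v1, v2)
```

Whatever disambiguation happens has to happen downstream, in the consumer network; the lookup itself cannot tell the two senses apart.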
But what if you could have your word embedding re-manufactured each time you need it? This is more or less what ELMo brought to the table: instead of training a vector for each word, we train a network capable of taking the words and characters of the context and producing a multi-layer representation of all that.
Then, for each new task, we only have to find the most useful weights for the layers of the embedding network. The intuition is that the first layer (of three, which is the default) should represent mostly low-level, phonological, letter-related information; the middle one, morphology and common, meaningful chunks of words; and the third layer, serious semantics and big-picture stuff. It makes sense that different tasks may make the most use of different levels of this linguistic modelling.
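The layer weighting itself is simple enough to sketch. Below is a rough NumPy version of the scalar mixing from the ELMo paper: softmax-normalized weights over the layers plus a global scale. The layer activations and weight values here are invented for illustration; a task that mostly cares about the top, “semantic” layer might learn something like the skewed weights shown:

```python
import numpy as np

def elmo_mix(layer_states, scalar_weights, gamma):
    """Collapse per-layer representations for one token into a single
    task-specific vector: softmax over learned scalars, times a scale."""
    # layer_states: (num_layers, dim) activations for one token.
    s = np.exp(scalar_weights - scalar_weights.max())
    s /= s.sum()  # softmax over the layers
    return gamma * (s[:, None] * layer_states).sum(axis=0)

# Three hypothetical layer activations for one token (dim 4):
layers = np.arange(12, dtype=float).reshape(3, 4)
# Weights heavily favoring the third (topmost) layer:
mixed = elmo_mix(layers,
                 scalar_weights=np.array([-2.0, 0.0, 2.0]),
                 gamma=1.0)
```

Since the softmax weights sum to one, `mixed` is a convex combination of the layers (here scaled by `gamma`), pulled toward whichever layer the task has learned to trust.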
Processing vs. perception
Let’s say that when you want to make use of some piece of information, you may be in one of two situations (broadly speaking). In the first, the information is structured and prepared in a way that makes your job as straightforward as possible: you get exactly the data you need, in a form that you can digest easily. A pure example is being a bureaucrat and receiving a properly filled-out form. Let’s call this mode processing.
The opposite situation is when the information is in no way meant for your needs. It can be too little or, more likely, too much; the structure is unknown and you can only hope to make some sense of it. I would call this perception.
I meant to use just human vision as an example here, but frankly, most of the things we get to see in high civilization were indeed meant to be seen by us: machines, buildings, art, advertising, electronic interfaces and so on. This was not the case in the ocean, where other fish would often prefer to hide from you, and the dark waters and rocks just did not care. Still, the visual parts of our nervous system learned to break down what they perceive into lines, dots and shapes, and to build from there some representation, which for all we know can be rather frivolous in relation to what is “really” going on. But it does seem to serve its purpose. There are no ancient horrors from hidden dimensions, as far as we can tell.
So, are we introspecting yet?
Now, when, instead of putting “word cartridges” into it, you show a language-savvy computer system (say, a neural net) a complex and unpredictable representation, it would appear that we move from processing to perception. The embedding network, such as ELMo, is at least initially trained without knowledge of its final task (which is admittedly the case with language modelling in general), and it encodes a level of detail that is probably not necessary for any particular language processing job.
Confronted with this breadth of linguistic information, the “consumer” network has to come up with some strategy for using it: focus on some things and ignore others. This makes the weighting of ELMo layers itself a primitive form of attention. At least we get our data as real numbers, which artificial neural nets can digest, instead of raw characters, with which we would hardly know what to do.
Is this really introspection? Maybe not. My first intuition would be that introspection is a situation where I perceive something that directly determines my behavior (my weights, if I were an artificial neural net) and, at the same time, something that I can modify. But as a human being, I can “introspect” emotions and other states that I cannot directly control. So this leaves us just with something determining my behavior.
It is trivially true that a neural network can “see” one of its shallower layers from a deeper one, just to compute the state of the latter, but the question stands: is it more processing or perceiving? I’d argue that in the normal case it is obviously processing: the training algorithm, whatever it is, makes sure that any information flow from layer to layer is as streamlined as possible.
But in the scenario where the previous layer is a transplanted, separately designed representation, there is a somewhat believable case that the network is perceiving itself. This would be more clear-cut if the connection were recurrent. Some sort of recurrent introspection does take place in the line of research starting from Neural Turing Machines. Then again, these networks fit their method of storing and retrieving information strictly to their specific task. Thus we are back to processing.
Into oneself, outside of the comfort zone
The main benefit of perception, I think, is that if the representation were simple and aggressively optimized for the final task, we would miss the information that is useful in less obvious ways. We could be in the situation of a bureaucrat who receives detailed forms for cavalry incursions into our lands, when suddenly tanks, or hurricanes, or elephants come in.
To put it in more serious terms, when we strive directly to model the world, e.g. by predicting language, every little bit of understanding helps in the optimization process. We are incentivized to disentangle potentially multi-level, non-stationary regularities, even if the payoff for the downstream task becomes apparent only at the very end, and only in some cases. Otherwise we would have no reason to attempt the whole journey through the error landscape of the parameter space. There is use in broadly, almost indiscriminately reflecting the complexity of the world in yourself, so you can exploit it fully.
This is what happens when we, as intelligent beings, perceive our internal states: the ones tightly coupled to the outside world (such as vision) and the ones related to our inner workings, such as emotions. These states, while already somewhat prepared by our mind for processing, contain far more than is necessary for any particular task. In this way, I suspect that separate, pretrained representations lead us in an interesting direction. When an AI looks into itself and is truly surprised, something important will happen.