The vocabulary of a neural network is represented as vectors — which I wrote about yesterday. This enables many related words to be “close to” one another, which is how the network perceives similarity and difference. This is as near as a computer comes to understanding meaning — which is not very near at all, but good enough for a lot of practical applications of natural language processing.
An earlier way of representing vocabulary for a neural network was to give each word its own single input. If the neural net had a vocabulary of 20,000 words, that meant it had 20,000 separate inputs in the first layer, the input layer. (I discussed neural nets in an earlier post here.) For each word, only one input was activated, and all the others stayed at zero. This is called “one-hot encoding.”
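To make one-hot encoding concrete, here is a minimal Python sketch. The five-word vocabulary and the function name `one_hot` are made up for illustration; a real vocabulary would have thousands of entries, but the idea is the same.

```python
# A toy vocabulary (hypothetical; a real one might have 20,000 words).
vocab = ["cat", "dog", "car", "truck", "apple"]

# Give every word a position in the list.
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return one slot per vocabulary word; only the slot for `word` is 1."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("dog"))  # [0, 1, 0, 0, 0]
```

Each slot in the list corresponds to one input in the input layer. Notice that nothing about the list for "dog" says it is more like "cat" than like "truck": every word is equally different from every other word.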
Representing words as vectors (instead of with a single number) means that each number in the array for one word is an input for the neural net. Among the many possible inputs, several or many are “hot,” not just one, and each can hold a value somewhere in between rather than being simply on or off.
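Here is a small Python sketch of what "close to" can mean for word vectors. The three-number vectors below are invented for illustration only; real word vectors are learned from text and usually have dozens or hundreds of numbers. Cosine similarity is one common way to measure closeness, and I am assuming it here rather than claiming any particular system uses it.

```python
import math

# Made-up vectors for illustration; real ones are learned, not hand-written.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "truck": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Return a score near 1.0 when two vectors point in nearly the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_truck = cosine_similarity(embeddings["cat"], embeddings["truck"])

print("cat vs. dog:  ", round(cat_dog, 3))
print("cat vs. truck:", round(cat_truck, 3))
```

With these invented numbers, "cat" and "dog" score higher than "cat" and "truck." That higher score is all the system has to go on when it treats two words as similar.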
As I was sorting this out in my mind today, reading and thinking, I had to consider how to convey to my students (who might have no computer science background at all) this idea of words. The word itself doesn’t exist. The word is represented in the system as a list of numbers. The numbers have meaning; they locate the word-object in a mathematical space, the kind of thing for which computers are ideally suited. But there is no word.
Long ago in school I learned about the signifier and the signified. Together, they create a sign. Language is our way of representing the world in speech and in writing. The word is not the thing itself; the map is not the territory. And here we are, building a representation of human language in code, where a vocabulary of tens of thousands of human words exists in an imaginary space consisting of numbers — because numbers are the only things a computer can use.
I had a much easier time understanding the concepts of image recognition than I am having with NLP.
AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.