At Nalantis, they know all too well the challenges of building language technology. The Antwerp-based company has been working for years on tools that enable computers to process human language, both from written documents and from audio and video files. “One of the biggest obstacles is that it is very difficult for a computer to give ‘meanings’ to words,” says Chief Technology Officer Jan Van Sas. “A computer lacks a representation of our world. A computer does not have the connection with our environment that is stored in our human brains.”
Researchers are working frantically to get computers to recognize such connections, using AI techniques such as deep learning and technologies such as vector distance, but the results are still far from the way our brains work. Scientists have not yet found the right form in which to make the mathematical link, says Van Sas.
It is therefore primarily semantics, the ‘theory of meaning’, that is difficult for AI systems to master. After all, it is not enough to derive the meaning of a word from some kind of dictionary; you also have to look at the word in its context. In other words: there has to be text comprehension. Van Sas gives an example that hinges on the Dutch word ‘bank’, which means both ‘bench’ and ‘bank’: “If you tell me that you are ‘sitting on a bench’, there is a high probability that I can infer from our conversation that you are resting on a wooden structure. And not that you are at the top of a financial institution (laughs). For an AI system, however, such a thing is absolutely not trivial. And we haven’t even talked yet about the way in which you say something. That, too, can add a whole new layer of meaning that a computer doesn’t recognize.”
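To make the role of context concrete, here is a minimal Python sketch of choosing between two senses of the Dutch word ‘bank’ by counting overlapping context words. The sense inventory SENSES, its cue words and the function disambiguate are hypothetical illustrations, not part of any Nalantis system; real disambiguation needs far richer knowledge than a word list.

```python
# Hypothetical sense inventory for the Dutch word "bank":
# each sense is described by a handful of English cue words.
SENSES = {
    "bench": {"sit", "sitting", "park", "wooden", "rest", "garden"},
    "financial institution": {"money", "account", "loan", "director", "interest"},
}


def disambiguate(context: str) -> str:
    """Return the sense whose cue words overlap most with the context."""
    words = set(context.lower().split())
    scores = {sense: len(words & cues) for sense, cues in SENSES.items()}
    # On a tie, max() returns the first sense listed in SENSES.
    return max(scores, key=scores.get)
```

Called with “I am sitting on the bank in the park”, the sketch picks the bench sense, because ‘sitting’ and ‘park’ match its cue words. It also shows the limits Van Sas describes: a context with no cue words at all simply falls back to the first sense.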
The infinity of language
The reason AI systems have such difficulty with language, context and meaning is that they are often based on big data, machine learning and deep learning. They process pre-existing data (for example, millions of web pages) in order to ‘learn’ and get smarter. Often these systems can even do this with unstructured data, without any human intervention. That is what is called ‘unsupervised learning’. However, language is a kind of ‘infinite’ object. Although AI systems are good at math-based operations, such as recognizing and reproducing existing patterns, they have not come nearly as far in making sense of words. “The weakness of deep learning is that it does not have an explicit understanding of what happens in a conversation,” explains Van Sas. “Deep learning is a kind of statistical game, with probability calculations and thresholds, but it cannot say exactly what is being said or what is being referred to.”
That does not mean that deep learning is completely worthless in language technology, Van Sas emphasizes. “Absolutely not, we also use it when we think it makes sense. Deep learning is, for example, quite good at predicting which word will come next once your sentence is already well under way. If you have a sentence with five words, a deep learning system can estimate reasonably well what the sixth word is. But that is also its shortcoming. Such predictions sometimes involve trillions of parameters; the models simply become too complex.”
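The idea of next-word prediction as a ‘statistical game’ can be illustrated with a deliberately tiny stand-in. The Python sketch below is not deep learning: it is a simple bigram counter that predicts the word most often seen after the previous one. The function names train_bigrams and predict_next are assumptions made for this illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for every word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]
```

Trained on a few sentences about a council approving or rejecting permits, the model predicts whichever verb followed ‘council’ most often. It has no notion of what a council or a permit is, which is exactly the lack of explicit understanding described above.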
And that is why the future of language technology is hybrid, says Van Sas. Betting on deep learning alone will not get us there. It should be supplemented with other techniques, such as natural language processing (NLP) and natural language understanding (NLU). The idea here is to divide natural language into smaller, more manageable parts and to run specialized algorithms that analyze these parts. In this way, interrelationships, dependencies and context between the different parts can be identified. The natural language is thus processed and transformed into a kind of standardized structure. Text understanding can then be derived from that structure by inferring content, searching for context and generating insight. In other words: meaning is extracted.
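As a rough illustration of this divide-and-analyze approach, the Python sketch below splits a text into sentences and tokens, stores them in a uniform record structure, and applies one hand-written rule to extract a fact. The rule GRANT_RULE, the record layout and the function names are hypothetical; a production NLP/NLU pipeline would use proper syntactic and semantic analysis rather than a single regular expression.

```python
import re

# Hypothetical extraction rule: "<actor> granted <thing> to <recipient>"
GRANT_RULE = re.compile(
    r"^(?P<actor>.+?) granted (?P<thing>.+?) to (?P<recipient>.+?)[.!?]?$"
)


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def analyze(text):
    """Turn free text into a list of standardized records."""
    records = []
    for sentence in split_sentences(text):
        record = {
            "sentence": sentence,
            "tokens": re.findall(r"\w+", sentence.lower()),
            "facts": [],
        }
        match = GRANT_RULE.match(sentence)
        if match:
            # Named groups become the fields of the extracted fact.
            record["facts"].append(match.groupdict())
        records.append(record)
    return records
```

For the input “The council granted a building permit to the school. The meeting ended at noon.”, the first record carries one fact with an actor, a thing and a recipient, while the second record has no facts. Questions such as “who received the permit?” can then be answered from the structure rather than from the raw text.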
“Compare that to how a person learns language,” says Van Sas. “We don’t need to read half of the internet to be able to communicate excellently through language. We receive a limited amount of input, think about it, and build a language system in our heads through semantic analysis and generation. So it’s totally different from how deep learning systems handle it.”
Van Sas draws a comparison with the theories of the well-known psychologist Daniel Kahneman. “Do you know his book Thinking, Fast and Slow? In it, he explains that there are actually two competing systems in our brain. The ‘fast’ model learns things very quickly and intuitively, especially things that we as humans need to survive. But on the other hand, knowledge is also created in a ‘slow’ model, where we consider everything rationally and sensibly. Language technology is similar. We will have to rely partly on statistical solutions, but partly also on rational solutions.”
Nalantis puts this into practice itself with an engine it developed in-house, dubbed SAGE. “SAGE stands for Semantic Analysis and Generation Engine,” says Van Sas. “It is a hybrid system in which NLP techniques and deep learning models are combined. It breaks paragraphs down into sentences and words and looks at syntax. From this follows a semantic analysis, so we try to understand what a specific text means. Work on SAGE started more than ten years ago. A first version was built in Java; a later incarnation we reprogrammed completely in Python, the standard language now used for AI applications.”
SAGE’s language technology is already being used in some very specific niches, says Van Sas. “Our innovations are used, for example, to convert recordings of municipal council meetings into comprehensible data about what was decided there. This data can then be consulted by municipal employees and citizens who have questions about certain decisions: who received a specific permit? When was it issued? In this way, we contribute to transparency in government. The city of Ghent has already started working with this technology.”
For FPS Finance, Nalantis worked out a proof of concept that helps determine which tax offenders should be brought to justice. “Which parameters make it worthwhile to prosecute someone, and what chance of success does the case have? We did a language analysis of all the court cases, the lawyers’ arguments and the verdicts. Certain words like ‘self-employed’, ‘car’ or ‘expense’ steered the recommendations to the officials in a particular direction.”
Another important area of activity for Nalantis is human resources. “SAGE can automatically link the CVs that a company receives to the vacancies it has open,” says Van Sas. “It recognizes what experience a particular candidate has, where he has worked before and what studies he has done. The system also knows all the job descriptions and can therefore suggest the most suitable candidates for a particular job. You can imagine that such an initial selection can save the companies’ HR departments an enormous amount of time and trouble.”
No black box
Finally, a very important aspect of the way Nalantis works is its ‘no black box’ approach. What exactly is that? “With a lot of deep learning models, we don’t really know what’s going on behind the scenes,” says Van Sas. “It’s like a black, closed box. The algorithm uses millions of data points as input, makes correlations between specific data characteristics, and thus generates a certain output. It’s a largely self-directed process and very often difficult to interpret, for the data scientists as well as the programmers and end users. We know the algorithms that are used well enough, but exactly which statistical coincidence has occurred can no longer be traced.”
And that is a problem, because it entails the risk of bias, for example. Bias refers to errors in the output of AI systems that arise because the algorithm was fed biased assumptions. Suppose you, as an AI specialist, are asked to design a system that can search for images based on a keyword. Today we live in a world where 90 percent of CEOs are white men. So do you design your system to reflect this reality, so that only white men appear when someone types ‘CEO’? Or do you create a search engine that shows a more balanced mix, even if that mix does not reflect today’s reality?
“We want to counter those kinds of issues,” says Van Sas. “Nalantis never works according to the black box principle. We are always able to have the internal ‘rules’ of the system adapted by our linguists. Everything the system does can be explained and tracked. So we can always dive under the hood and know where to look if we want to change something.”
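What ‘explainable and trackable’ can mean in practice is sketched below in Python, under heavy simplification. Every decision is returned together with the names of the rules that produced it, so a linguist can see which rule to adjust. The Rule class, the example RULES and the classify function are hypothetical and say nothing about how SAGE is actually built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str      # identifier a linguist can look up and edit
    keyword: str   # word that triggers the rule
    label: str     # label assigned when the rule fires


# Hypothetical rule set, editable by hand.
RULES = [
    Rule("R1", "permit", "permit decision"),
    Rule("R2", "budget", "budget decision"),
]


def classify(text, rules=RULES):
    """Label a text and report exactly which rules fired."""
    tokens = set(text.lower().split())
    fired = [rule for rule in rules if rule.keyword in tokens]
    return {
        "labels": sorted({rule.label for rule in fired}),
        "explanation": [f"{rule.name}: found '{rule.keyword}'" for rule in fired],
    }
```

For the text “The council approved the permit”, the result contains the label ‘permit decision’ and the explanation that rule R1 fired on the word ‘permit’. If that label turns out to be wrong, the rule responsible is known by name and can be changed directly, which is the opposite of the untraceable statistical process described earlier.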