DeepMind: RETRO language model more powerful thanks to an “external memory”

It has been two years since OpenAI released its impressive GPT-3 language model. Since then, most of the well-known AI laboratories have built their own text generators. Google, Facebook and Microsoft, as well as a handful of Chinese companies, have developed systems that generate content that looks convincing at first glance, and that can chat with people, answer questions and much more.

These systems are known as large language models because of the enormous size of the neural networks on which they are based, and they have become a dominant trend in AI. They have clear strengths and weaknesses: alongside the remarkable ability to generate believable text, they exhibit the biases typical of AI and, above all, consume enormous amounts of computing power.

So far, DeepMind has been conspicuous in the world of large language models mainly by its absence. But last week the British Google subsidiary, which is behind some of the most impressive achievements in the field of AI, including AlphaZero and AlphaFold, weighed in with three major studies on new language models. The most important result of this research is an AI with a special feature: it has an external memory in the form of a huge database of text passages, which it uses as a kind of cheat sheet when generating new phrases and sentences.

The AI, called RETRO (for “Retrieval-Enhanced Transformer”), matches the performance of neural networks 25 times its size, according to the developers, saving the time and money needed to train very large models. The researchers also claim that the database makes it easier to analyze what the AI has learned, which could help filter out bias and hate speech.

“Being able to look up things on the fly instead of having to memorize everything can often be useful – just like with humans,” says Jack Rae of DeepMind, who leads the company’s research on large language models.

Language models generate text by predicting which words come next in a sentence or conversation. The larger a model, the more information about the world it can absorb during training, which makes its predictions better. GPT-3 has 175 billion parameters, that is, values in the neural network that store data and are adjusted as the model is trained. The Megatron-Turing language model from Microsoft and Nvidia has 530 billion parameters. However, large models also require enormous amounts of computing power to train, making them affordable only for the richest companies.
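To make the basic mechanism concrete, here is a toy sketch in Python. It is not code from DeepMind or OpenAI: it replaces the billions of learned parameters of a real model with simple word-pair counts over a tiny example corpus, but it performs the same “predict the next word” task.

```python
# Toy illustration of next-word prediction. Not DeepMind or OpenAI code:
# a neural network's billions of learned parameters are replaced here with
# simple counts of which word follows which in a tiny example corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat ate the fish .".split()

# For each word, count how often every other word appears right after it.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the continuation seen most often after `word`, or None."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # -> "cat" ("cat" follows "the" twice in the corpus)
print(predict_next("cat"))  # -> "sat" ("sat" and "ate" are tied; the first seen wins)
```

A real language model learns far richer statistics than these counts, but the prediction task it is trained on is the same.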

With RETRO, DeepMind has tried to cut the cost of training without cutting how much the AI learns. The researchers trained the model on a huge data set of news articles, Wikipedia pages, books and text from GitHub, the popular online code repository. The data set contains text in 10 languages, including English, Spanish, German, French, Russian, Chinese, Swahili and Urdu.

The base version of the RETRO neural network has only 7 billion parameters. To compensate, the system has a database of around 2 trillion text passages. Both the database and the neural network are trained at the same time. When RETRO generates text, it uses the database to look up and compare passages similar to the one it is currently writing, which makes its predictions more accurate. By outsourcing part of the neural network’s memory to the database, RETRO can do more with less training.
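The retrieval step can be sketched in a few lines of illustrative Python, with hypothetical helper names rather than the actual RETRO implementation: before generating, the system looks up the database passage most similar to the text it is working on and uses it as extra context. RETRO itself encodes chunks of text with a neural network and attends to retrieved neighbors inside the model; the crude word-overlap lookup below only illustrates the principle.

```python
# Conceptual sketch of retrieval-augmented generation (hypothetical helper
# names, not the actual RETRO code): look up the external passage most
# similar to the current text and hand it to the model as extra context.
import re
from collections import Counter

database = [
    "Serena Williams won the US Open in 1999.",
    "RETRO is a retrieval-enhanced transformer developed by DeepMind.",
    "GPT-3 has 175 billion parameters.",
]

def tokens(text):
    """Lowercased word counts, ignoring punctuation."""
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))

def similarity(a, b):
    """Crude word-overlap score between two strings."""
    return sum((tokens(a) & tokens(b)).values())

def retrieve(query, db):
    """Return the database passage most similar to the query."""
    return max(db, key=lambda passage: similarity(query, passage))

prompt = "Which lab built the RETRO transformer?"
neighbor = retrieve(prompt, database)

# A real system would now condition its next-word predictions on both the
# prompt and the retrieved passage; here we only show the lookup step.
print(neighbor)  # -> "RETRO is a retrieval-enhanced transformer developed by DeepMind."
```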

The idea is not new. But this is the first time such a “reference work” has been developed for a large language model, and the first time the results of this approach can compete with the performance of the best language AI systems on the market. RETRO builds on two other studies carried out by DeepMind: one examines how the size of a model affects its performance, the other the potential problems this kind of AI could cause.

To study the effects of size, DeepMind built a large language model called Gopher with 280 billion parameters. It beat the most advanced competing models on 82 percent of the more than 150 common language tasks used in the evaluation. The researchers then pitted it against RETRO and found that the 7-billion-parameter model could keep up with Gopher on most tasks.


The second study deals with the problem of generated hate speech. It is a comprehensive overview of the known problems associated with large language models: these models pick up biases, misinformation and toxic language from the articles and books they are trained on. As a result, they sometimes spit out harmful statements, repeating what they found in the training text without knowing what it means. “Even a model that perfectly mimicked all of the data would be biased,” says Rae.

According to DeepMind, RETRO could help with this problem, because it is easier to see what the AI has learned by examining the database than by studying the entire neural network. In theory, examples of problematic language could then be filtered out or “balanced” with unproblematic training data. However, DeepMind has not yet tested this assumption. “The problem has not yet been fully resolved, and work is continuing to address this challenge,” says Laura Weidinger, a researcher at DeepMind.

The database can also be updated without retraining the neural network. This means that new information, for example who won the US Open tennis tournament, can be quickly added and outdated or incorrect information can be removed. Systems like RETRO are more transparent than black box models like GPT-3, says Devendra Sachan, a PhD student at McGill University in Canada. “But that’s not a guarantee that they will prevent toxic language and bias.” Sachan developed a forerunner of RETRO in a previous collaboration with DeepMind, but was not involved in this latest work.

For Sachan, fixing the problematic behavior of language models requires careful curation of the training data before training begins. Still, systems like RETRO can help: “It is easier to adopt such guidelines when a model uses external data for its predictions.” DeepMind may be late to this discussion, but instead of trying to outdo existing AI systems, it is offering an alternative approach. “This is the future of large language models,” Sachan believes.

