MIT researchers warn that deep learning is approaching computational limits




We are approaching the computational limits of deep learning. That’s according to researchers from the Massachusetts Institute of Technology, Underwood International College, and the University of Brasilia, who found in a recent study that progress in deep learning has been “strongly dependent” on increases in computation. They argue that continued progress will require “dramatically” more computationally efficient deep learning methods, either through changes to existing techniques or through new methods not yet discovered.

“We show that deep learning is not computationally expensive by accident, but by design. The same flexibility that makes it great for modeling various phenomena and outperforming expert models also makes it dramatically more computationally expensive,” the co-authors wrote. “Despite this, we found that the actual computational burden of deep learning models is increasing faster than the (known) lower bounds from theory, suggesting that substantial improvements could be possible.”

Deep learning is the subfield of machine learning concerned with algorithms inspired by the structure and function of the brain. These algorithms, called artificial neural networks, consist of layered functions (neurons) that transmit signals to other neurons. The signals, which are the product of input data fed into the network, travel from layer to layer and slowly “tune” the network, in effect adjusting the synaptic strength (weights) of each connection. The network eventually learns to make predictions by extracting features from the data set and identifying cross-sample trends.
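
To make that description concrete, here is a minimal sketch of the tuning loop: a tiny two-layer network, written with NumPy only, whose connection weights are nudged by gradient descent until it fits a toy task. It is purely illustrative and is not one of the models examined in the study.

```python
# Minimal sketch of the weight-tuning loop described above: a tiny two-layer
# network fit to a toy XOR task with plain gradient descent. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = XOR(x1, x2)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Layered functions ("neurons") with connection weights
W1 = rng.normal(size=(2, 8))
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(5000):
    # Forward pass: signals travel from layer to layer
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: adjust the "synaptic strength" (weights) of each connection
    grad_out = (out - y) * out * (1 - out)
    grad_W2 = h.T @ grad_out
    grad_b2 = grad_out.sum(axis=0)
    grad_h = (grad_out @ W2.T) * h * (1 - h)
    grad_W1 = X.T @ grad_h
    grad_b1 = grad_h.sum(axis=0)

    W1 -= lr * grad_W1
    b1 -= lr * grad_b1
    W2 -= lr * grad_W2
    b2 -= lr * grad_b2

print(np.round(out, 2))  # predictions after training; should approach [[0], [1], [1], [0]]
```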


The researchers analyzed 1,058 papers from the Arxiv.org preprint server, as well as other benchmark sources, to understand the connection between deep learning performance and computation, paying particular attention to domains including image classification, object detection, question answering, named entity recognition, and machine translation. They conducted two separate analyses of computational requirements, reflecting the two types of information available:

  • Computation per network pass, or the number of floating-point operations required for a single pass (i.e., weight adjustment) through a given deep learning model.
  • Hardware burden, or the computational capability of the hardware used to train the model, calculated as the number of processors multiplied by the computation rate and the training time; a rough version of this calculation is sketched below. (The researchers acknowledge that while it is an imprecise measure of computation, it was reported more widely in the papers they analyzed than other benchmarks.)
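
As a rough illustration of the second measure, the sketch below multiplies out hypothetical processor counts, speeds, and training times; the numbers are placeholders, not figures from the paper.

```python
# Back-of-the-envelope sketch of the "hardware burden" measure described above.
# All numbers are hypothetical placeholders, not values from the study.

def hardware_burden(num_processors: int,
                    flops_per_second_per_processor: float,
                    training_time_seconds: float) -> float:
    """Hardware burden = processors x computation rate x training time."""
    return num_processors * flops_per_second_per_processor * training_time_seconds

# e.g. 8 accelerators at 1e14 FLOP/s each, trained for one week (hypothetical)
burden = hardware_burden(8, 1e14, 7 * 24 * 3600)
print(f"Hardware burden: {burden:.2e} FLOPs")

# Computation per network pass would instead count the floating-point
# operations of a single pass through the model (roughly 2 x the number of
# weights in multiply-adds for a dense layer's forward pass).
```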

The co-authors report slopes that are “highly statistically significant” and have “strong explanatory power” for all benchmarks except machine translation from English to German, where there was little variation in the computational power used. Object detection, named entity recognition, and machine translation in particular showed large increases in hardware burden with relatively small improvements in results, with computational power accounting for 43% of the variation in image classification accuracy on the popular open source ImageNet benchmark.

The researchers estimate that three years of algorithmic improvement is equivalent to a 10-fold increase in computing power. “Taken together, our results make clear that, across many areas of deep learning, progress in training models has depended on large increases in the amount of computing power used,” they wrote. “Another possibility is that obtaining an algorithmic improvement may itself require complementary increases in computing power.”
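
Taking the quoted equivalence at face value, a bit of arithmetic gives the implied per-year multiplier (an illustration of the claim, not a figure from the paper):

```python
# Rough arithmetic on the equivalence quoted above: if three years of
# algorithmic progress is worth a 10x increase in computing power, the
# implied per-year multiplier is 10**(1/3).
per_year = 10 ** (1 / 3)
print(f"~{per_year:.2f}x equivalent compute per year of algorithmic progress")
# ~2.15x per year
```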

In the course of their work, the researchers also extrapolated these projections to understand the computational power required to reach various theoretical benchmarks, along with the associated economic and environmental costs. By even the most optimistic calculation, reducing the image classification error rate on ImageNet would require 10^5 times more computing.
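
For a sense of how such a projection can be made, the sketch below fits a line relating log compute to log error on invented data points and solves for the compute a target error rate would imply; the study’s actual regressions and numbers differ.

```python
# Sketch of the kind of extrapolation described above: fit a line relating
# log(compute) to log(error) on historical points, then solve for the compute
# a target error would imply. The data points here are invented for
# illustration; they are not the study's measurements.
import numpy as np

compute = np.array([1e16, 1e17, 1e18, 1e19])   # hypothetical training FLOPs
error = np.array([0.20, 0.15, 0.11, 0.08])     # hypothetical error rates

slope, intercept = np.polyfit(np.log10(compute), np.log10(error), 1)

target_error = 0.01
required_log_compute = (np.log10(target_error) - intercept) / slope
print(f"Projected compute for {target_error:.0%} error: 1e{required_log_compute:.1f} FLOPs")
```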

[Chart: Extrapolated projections from the researchers.]

To that point, a Synced report estimated that the University of Washington’s Grover fake news detection model cost $25,000 to train over roughly two weeks. OpenAI reportedly racked up a whopping $12 million to train its GPT-3 language model, and Google spent an estimated $6,912 training BERT, a bidirectional transformer model that redefined the state of the art for 11 natural language processing tasks.

In a separate report last June, researchers at the University of Massachusetts at Amherst concluded that the amount of energy required to train and search a certain model entails the emission of approximately 626,000 pounds of carbon dioxide. That’s equivalent to nearly five times the lifetime emissions of the average American car.
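
As a quick sanity check on that comparison (using the approximate lifetime-emissions figure cited in the Amherst report):

```python
# Quick arithmetic check of the comparison above (approximate figures):
# 626,000 lbs of CO2 versus the roughly 126,000 lbs attributed to an average
# American car over its lifetime, including fuel, in that report.
model_emissions_lbs = 626_000
car_lifetime_lbs = 126_000  # approximate figure from the Amherst study
print(f"~{model_emissions_lbs / car_lifetime_lbs:.1f}x a car's lifetime emissions")
# ~5.0x
```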

“We do not anticipate that the computational requirements implied by the targets … The hardware, environmental, and monetary costs would be prohibitive,” the researchers wrote. “Achieving this economically will require more efficient hardware, more efficient algorithms, or other improvements such that the net impact is this large a gain.”

The researchers note that there is historical precedent for deep learning improvements at the algorithmic level. They point to the emergence of hardware accelerators like Google’s tensor processing units (TPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs), as well as attempts to reduce computational complexity through network compression and acceleration techniques. They also cite neural architecture search and meta-learning, which use optimization to find architectures that retain good performance on a class of problems, as paths to computationally efficient methods of improvement.
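
As a toy illustration of the neural architecture search idea, the sketch below runs a random search over a handful of candidate hidden-layer configurations for a small scikit-learn classifier; real systems search vastly larger spaces with far more sophisticated optimizers, and everything here is illustrative rather than drawn from the study.

```python
# Toy sketch of the idea behind neural architecture search: treat the
# architecture itself as something to optimize. A random search picks
# hidden-layer sizes for a tiny scikit-learn classifier on synthetic data.
# All names and numbers here are illustrative, not from the study.
import random

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

search_space = [(8,), (16,), (32,), (8, 8), (16, 16), (32, 16)]
best_arch, best_score = None, -1.0

random.seed(0)
for arch in random.sample(search_space, k=4):  # evaluate a few candidates
    model = MLPClassifier(hidden_layer_sizes=arch, max_iter=2000, random_state=0)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_arch, best_score = arch, score

print(f"Best architecture found: {best_arch} (validation accuracy {best_score:.2f})")
```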

In fact, an OpenAI study suggests that the amount of compute needed to train an AI model to the same image classification performance on ImageNet has been decreasing by a factor of two every 16 months since 2012. Google’s Transformer architecture surpassed seq2seq, a previous state-of-the-art model also developed by Google, with 61 times less compute three years after seq2seq’s introduction. And DeepMind’s AlphaZero, a system that taught itself from scratch how to master the games of chess, shogi, and Go, took eight times less compute to match an improved version of its predecessor, AlphaGo Zero, one year later.
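
Compounded, a twofold efficiency gain every 16 months adds up quickly; the rough arithmetic below shows the implied factor over an approximate seven-year span (illustrative only):

```python
# Rough arithmetic on the efficiency trend quoted above: a 2x reduction in
# required compute every 16 months compounds quickly.
months = 7 * 12  # e.g. roughly 2012 to 2019
factor = 2 ** (months / 16)
print(f"~{factor:.0f}x less compute for the same accuracy over {months} months")
# roughly 38x
```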

“The explosion of computing power used for deep learning models has ended the ‘AI winter’ and set new benchmarks for computer performance across a wide range of tasks. However, deep learning’s prodigious appetite for computing power imposes a limit on how far it can improve performance in its current form, particularly in an era when improvements in hardware performance are slowing,” the researchers wrote. “The likely impact of these computational limits is to force … machine learning toward techniques that are more computationally efficient than deep learning.”