Large AI models, environmental concerns & (ethical) Occam’s razor
Last week, I had the pleasure of kickstarting the conversation at the monthly AI Ethics meetup in Utrecht, the Netherlands, by presenting the environmental and societal impact of large (language) models. The topic led to some interesting points and a good discussion, so I wanted to share it here as well!
First, though, a small disclaimer. In this post, I will talk about large AI models, and why they are oftentimes bad and irresponsible. At the same time, as an active AI and ML practitioner, fan, and data scientist, I do use a lot of such models at work and in personal projects. So I am interested in this topic and would like to actively participate in the transition to better use of AI, but there is a slight conflict of interest from time to time ;)
It’s the age of large AI models…
For the last few years, it has been the age of large AI models. 2020 and 2021 were particularly filled with larger and larger language models, and the trend certainly seems to continue into 2022.
In 2020, we watched with our jaws wide open as GPT-3 arrived with 175 Billion parameters. I still remember reading about it on release day while working on my thesis, and I just couldn’t believe it. Within a year and a half, we had Wu Dao 2.0, a model with 1.75 Trillion parameters: a whole 10x in size. And the craziest thing is, we weren’t even that impressed.
A new paper that was released a month ago introduced a new concept called Switch Transformers, which allows efficient training and serving of models with trillions of parameters.
While researching these huge models, I saw the following headline:
While I think we won’t see it this year, the possibility of such a model is definitely there, and one might ask: what are the prerequisites, and what are the consequences?
To put the growth in perspective, GPT-2, the predecessor of GPT-3, was released in 2019 and it had 1.5 Billion parameters, which was an awful lot back then. By the way, when I say “back then”, I really mean three years ago.
Anyway, GPT-3, which came a year later, had more than 100x as many parameters!
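To make these growth figures concrete, here is a minimal Python sketch that computes the ratios from the parameter counts quoted in this post. The dictionary and function names are my own and purely illustrative.

```python
# Parameter counts as quoted in this post (not independently verified here).
PARAMS = {
    "GPT-2": 1.5e9,
    "GPT-3": 175e9,
    "Wu Dao 2.0": 1.75e12,
}


def growth_factor(smaller: float, larger: float) -> float:
    """Return how many times larger `larger` is than `smaller`."""
    return larger / smaller


gpt2_to_gpt3 = growth_factor(PARAMS["GPT-2"], PARAMS["GPT-3"])
gpt3_to_wudao = growth_factor(PARAMS["GPT-3"], PARAMS["Wu Dao 2.0"])

print(f"GPT-2 -> GPT-3: about {gpt2_to_gpt3:.0f}x")
print(f"GPT-3 -> Wu Dao 2.0: about {gpt3_to_wudao:.0f}x")
```

With these inputs, the first step works out to roughly 117x and the second to 10x.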
Another interesting thing is to look at the dataset sizes over that time. We went from 16 GB of data to 745 GB of data for, mostly, transformer models within a span of two years:
Of course, with GPT models following a roughly one-year gap between releases, people are expecting GPT-4 sometime this year or early next year. While reading about it, I stumbled on the following words from the CEO of OpenAI:
GPT-4 will not be any bigger than GPT-3, but it will use more compute resources.
I think the idea is that new GPT models will focus on specialized use-cases more than GPT-3 did. I am of course referring to the likes of the Codex model that powers GitHub Copilot, which can write code for you. Presumably, OpenAI wants to go further in that direction, but man, was this a confusing statement to read.
Steps in the right direction…
That being said, there is work being done on creating models with similar performance to GPT-3, but using fewer resources. Steps in the right direction…
… or are they?
Looking at the paper, the thing that jumps out at me is that while the technical implementation of FLAN, the model proposed in the paper, looks solid, the end result is still a model with 137 Billion parameters. How demanding is such a model, exactly?
If we scroll further into the paper, just before the references, an interesting page appears that I definitely did not expect in an ICLR paper, and that pleasantly surprised me: page 10 contains the “Ethical Considerations” and “Environmental Considerations” paragraphs! They are small, but I had never seen pieces dedicated to these topics in a paper before, so this is definitely a step in the right direction! However, the contents of the paragraphs, not so much…
The paper reports the following numbers:
The energy cost and carbon footprint for the pretrained models were 451 MWh and 26 tCO2e, respectively.
I was not really familiar with energy consumption figures in megawatt-hours, nor with the ways to measure carbon dioxide emissions. A quick Google search revealed that the average US household consumes 10.65 megawatt-hours of electricity a year. A simple division tells us that the reported 451 MWh equals 451 / 10.65 ≈ 42 US households’ worth of yearly electricity use.
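The same division, written out as a small Python sketch so the inputs are easy to swap. Both numbers are the ones quoted in this post; the function name is hypothetical.

```python
# Figures quoted in this post.
TRAINING_ENERGY_MWH = 451.0        # energy cost reported in the paper
US_HOUSEHOLD_MWH_PER_YEAR = 10.65  # average US household, from a quick search


def household_years(energy_mwh: float, household_mwh_per_year: float) -> float:
    """Number of households whose yearly electricity use adds up to `energy_mwh`."""
    return energy_mwh / household_mwh_per_year


us_equivalent = household_years(TRAINING_ENERGY_MWH, US_HOUSEHOLD_MWH_PER_YEAR)
print(f"About {us_equivalent:.0f} US households for one year")
```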
Looking at the carbon dioxide emissions, I found the following comparison sheet:
The first thing that jumps out, though, is not even related to CO2: according to Milieucentraal, the average household consumption in the Netherlands is just 2765 kWh, or 2.765 MWh. That is almost a whopping 4 times less than in the United States! Makes you think… But I digress.
Looking at the actual emission numbers, the paper reports 26 tCO2e. Let’s take economy flights from Amsterdam to Rome as an example, at 2.6 flights per tonne of CO2. 2.6 x 26 tonnes is around 68 flights. That is a lot of flights!
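For anyone who wants to redo these comparisons with other inputs, here is a minimal sketch. It assumes, as the calculation above implies, that the comparison sheet equates one tonne of CO2 with 2.6 economy flights from Amsterdam to Rome. All other numbers are the ones quoted in this post, and the variable names are illustrative.

```python
REPORTED_EMISSIONS_TCO2E = 26.0  # reported in the paper
FLIGHTS_PER_TONNE = 2.6          # ASSUMPTION: taken from the comparison sheet

TRAINING_ENERGY_MWH = 451.0      # reported in the paper
US_HOUSEHOLD_MWH = 10.65         # average US household per year
NL_HOUSEHOLD_MWH = 2.765         # average Dutch household per year

# Emissions expressed as a number of Amsterdam to Rome economy flights.
flights = REPORTED_EMISSIONS_TCO2E * FLIGHTS_PER_TONNE

# How much more electricity a US household uses than a Dutch one.
us_vs_nl = US_HOUSEHOLD_MWH / NL_HOUSEHOLD_MWH

# The training energy expressed in Dutch household-years.
nl_households = TRAINING_ENERGY_MWH / NL_HOUSEHOLD_MWH

print(f"About {flights:.0f} flights")
print(f"US household uses about {us_vs_nl:.2f}x the Dutch average")
print(f"About {nl_households:.0f} Dutch households for one year")
```

Measured against Dutch households instead of American ones, the same 451 MWh covers roughly 163 households for a year rather than 42.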
The numbers don't lie: the training process is costly, both in financial and environmental terms. But is this necessarily a bad thing?
If you are a (scientific) optimist and believe that the results of papers such as the one discussed here will actually be implemented widely, then perhaps the numbers above are a reasonable price to pay for progress. In reality, however, I fear that Big Tech oftentimes prints AI papers for the sake of printing AI papers, and not necessarily for the benign reasons of scientific progress or betterment of the world. Let us at least hope that the financial benefits of having smaller models are a good enough reason in themselves.
BigScience Large Language Model Training
Another example of a project in which the NLP enthusiast in me clashes with my ethical and environmentally conscious self is the HuggingFace BigScience project.
In itself, the project’s idea is extremely interesting and beneficial for the NLP community: an open-source, multilingual, large language model is something we could only wish for as an AI community. Moreover, the open scientific collaboration behind this year-long enterprise is very nice to see.
However, I remember visiting the HuggingFace model repository a few months back and seeing a banner at the top of the webpage that said something along the following lines:
Training completion: 7%. Time to completion: 3 months.
I’m sorry, 3 months?! From the model card:
The training of BigScience’s main model started on March 11, 2022 11:42am PST and will continue for 3–4 months on 384 A100 80GB GPUs of the Jean Zay public supercomputer
Admittedly, the authors do mention environmental considerations in the model card: they state that the energy used mostly comes from a lower-carbon source and that they reuse the heat. But looking at 384 80GB GPUs running 24/7 for 3–4 months, I don’t even want to know how much energy is used for the process (I actually do want to know).
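A very rough way to get a feel for the order of magnitude is a back-of-the-envelope estimate for the GPUs alone. This is a sketch under loud assumptions, not a measurement: it assumes every GPU draws a constant 400 W (roughly the rated maximum of an A100 SXM module), treats 3–4 months as 90 to 120 days, and ignores CPUs, networking, storage and cooling entirely. Only the GPU count comes from the model card; the function name is hypothetical.

```python
def gpu_energy_mwh(num_gpus: int, watts_per_gpu: float, days: float) -> float:
    """Energy in MWh for `num_gpus` each drawing `watts_per_gpu` non-stop for `days`."""
    hours = days * 24
    watt_hours = num_gpus * watts_per_gpu * hours
    return watt_hours / 1_000_000  # Wh -> MWh


NUM_GPUS = 384                 # from the model card
ASSUMED_WATTS_PER_GPU = 400.0  # ASSUMPTION: constant draw near the rated maximum
ASSUMED_DAYS_LOW = 90          # ASSUMPTION: "3 months"
ASSUMED_DAYS_HIGH = 120        # ASSUMPTION: "4 months"

low = gpu_energy_mwh(NUM_GPUS, ASSUMED_WATTS_PER_GPU, ASSUMED_DAYS_LOW)
high = gpu_energy_mwh(NUM_GPUS, ASSUMED_WATTS_PER_GPU, ASSUMED_DAYS_HIGH)

print(f"GPU-only estimate: {low:.0f} to {high:.0f} MWh")
```

Under these assumptions the GPUs alone would use somewhere around 330 to 440 MWh. The real figure depends on actual utilisation and on everything around the GPUs, so treat this as an order-of-magnitude guess at best.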
Data, representativity, and values
Looking beyond the environmental concerns and into the domain of ethical considerations, I collected a few snippets of text from different sources that act as a good starting point for thinking.
More and more papers talk about the inherent bias of big language models, which is of course related to the data the models are trained on. The question is: does collecting *a lot* of data help?
As it turns out: not really. The following snippet comes from the very well-written paper by Emily Bender, Timnit Gebru, and others:
Based on the above, saying “Scrape more of the internet” simply doesn’t fix it, as long as the sources stay unrepresentative.
Alright, I get it, large models are bad
…not necessarily. It depends on what we want to achieve. One of the biggest reasons to train such excessively large AI models in recent years has been the ongoing pursuit of AI that acts and reasons like a human. In the field, this “ideal” end-goal for AI is called Artificial General Intelligence, or AGI.
When looking at the relation between the large models and AGI, one of the questions is: In the pursuit of AGI, do we need these large models or can we approximate the AGI “feeling” with simpler models?
The thing is, for most use-cases that our *current* society has for AI, we do not necessarily need such big models. In most cases, the practical use of machine learning in 2022 is either to gain insights from business data or to optimise some kind of business process. For that, one mostly does not need a model trained on the whole contents of Wikipedia. Granted, for use-cases where natural language is involved, we can’t really go back: there is no way we are moving away from transformer models in the coming years, and they do require fairly large amounts of data to become good. But even then, a pre-trained BERT model from HuggingFace Hub will probably do wonders, if properly used and, potentially, adapted to the use-case. We don’t really need to have an instance of GPT-3 running at every company that uses AI in its operations. And frankly, we probably can’t because of the compute requirements.
Or, at the other end of the line: are we so far away from reaching AGI that these large, energy-slurping and ethically dubious models are an expensive underkill?
They say that a picture is worth a thousand words:
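To get a feel for those compute requirements, here is a rough sketch of the memory needed just to hold a model's weights. It assumes 2 bytes per parameter (16-bit floats) and ignores activations, batching and any serving overhead, so real requirements would be higher. The BERT-base size of roughly 110 Million parameters is an outside figure not stated in this post, and the function names are hypothetical.

```python
import math

BYTES_PER_PARAM_FP16 = 2  # ASSUMPTION: weights stored as 16-bit floats
GPU_MEMORY_GB = 80        # e.g. an 80GB A100, as used for the BigScience training

GPT3_PARAMS = 175e9       # as quoted in this post
BERT_BASE_PARAMS = 110e6  # ASSUMPTION: approximate size of BERT-base


def weights_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus(num_params: float, gpu_memory_gb: float = GPU_MEMORY_GB) -> int:
    """Smallest number of GPUs whose combined memory can hold the weights."""
    return math.ceil(weights_gb(num_params) / gpu_memory_gb)


for name, params in [("GPT-3", GPT3_PARAMS), ("BERT-base", BERT_BASE_PARAMS)]:
    print(f"{name}: {weights_gb(params):.2f} GB of weights, "
          f"at least {min_gpus(params)} GPU(s) of {GPU_MEMORY_GB} GB")
```

Under these assumptions, the GPT-3 weights alone take about 350 GB, so at least five 80GB GPUs, while BERT-base fits in a fraction of a gigabyte.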
Occam’s razor is a problem-solving principle which says: of all competing and valid explanations of a phenomenon, choose the one with the fewest assumptions or variables. In the case of machine learning, that would mean choosing the model with the least complexity in terms of parameters, architecture or underlying assumptions.
There are multiple ways to apply Occam’s razor in this scenario. A practical one could be choosing linear or logistic regression over neural nets, as many problems are probably linear and can be solved with simpler, much more explainable algorithms. While I think it’s a great idea to do that, and definitely one of the key steps, this is not the topic I would like to elaborate on in this post.
Nor am I focusing on more sophisticated approaches such as model pruning or other model complexity reduction techniques, which could be a great topic for another post.
What I would like to convey in this post is a broader concept. Should we not accept that, given the current achievements within AI, the focus should lie on responsibly applying the state of the art and efficiently making use of the already very impressive results we have achieved?
Or should we continue developing even larger and larger models in pursuit of “better” intelligence, and maybe even AGI, which, for 90% of the end uses, is not a necessity?
In this post, I discussed large language models and why they are potentially ethically dubious and environmentally hazardous. In the last few years, we have seen ever-increasing AI model sizes, with ever-increasing-but-unrepresentative data behind them.
All that while the value AI brings now, in 2022, is still mostly in the form of generating very business-specific data insights, or automating or enhancing some business logic.
Applying the scientific principle of Occam’s razor to the field of large AI models, we end up weighing the scientific curiosity of creating larger and larger models against the actual use of such models, while keeping the ethical and environmental implications in check.