As a newcomer to the Machine Translation community, I observe that there is a strong focus on competition. Research groups all strive to outdo one another through a combination of computing power and cleverness, and "better" is evaluated with one or two standard measures intended to capture some notion of translation quality. As our computing facilities struggle to support the energy and cooling demands that this brings, it would be useful to adopt the practice of considering measures of resource usage alongside translation quality.
An adult human uses energy at an average rate of about 100W, and the brain accounts for about 20% of that budget. This wetware allows us to learn not only how to navigate our environment but also how to think and communicate. It is enough to acquire language, often several languages, and even to produce the gold-standard reference translations used to evaluate the systems we are developing to emulate these abilities in silico.
Over a span of 80 years, the human brain will use 14MWh of energy. According to Ofgem, the UK energy regulator, a household heated with electricity (typical of that category, though conspicuously large and inefficient by general standards) will consume about half of that, 7.1MWh, in the course of one year. Recently, to show that a typical neural machine translation system trained on data produced by the Paracrawl project yields a small but significant improvement in quality, we consumed that much energy (7MWh) in three days, translating 500 million German sentences into English.
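The lifetime figure above follows directly from the rates already given. A back-of-the-envelope check, using only the numbers from the text (100W whole-body rate, 20% brain share, 80-year span):

```python
# Back-of-the-envelope check of the brain-energy figure in the text.
# All inputs come from the figures quoted above.

HOURS_PER_YEAR = 365.25 * 24

body_power_w = 100        # whole-body metabolic rate, watts
brain_fraction = 0.20     # share of the energy budget used by the brain
lifetime_years = 80

brain_energy_wh = body_power_w * brain_fraction * lifetime_years * HOURS_PER_YEAR
brain_energy_mwh = brain_energy_wh / 1e6

print(f"Brain energy over {lifetime_years} years: {brain_energy_mwh:.1f} MWh")
# i.e. roughly two years' worth of the 7.1 MWh/year household consumption
```

Running this gives 14.0 MWh, matching the figure in the text.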
Humans are not good bulk data processors. Conservatively, it would take half a million people working full time for the better part of three weeks, at an energy cost of around 100MWh, to pull off the gargantuan feat of translating 500 million sentences. That is an order of magnitude less efficient than modern computers, and it does not even count the rest of those half-million people's energy budget. Perhaps it is surprising that the difference is only one order of magnitude.
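The "half a million people, three weeks" estimate can be sanity-checked with a small calculation. The per-translator throughput below is my own assumption for illustration (roughly 70 sentences per working day, a plausible professional pace), not a figure from the text:

```python
# Rough sanity check of the workforce estimate above.
# sentences_per_day is an assumed throughput, not a measured figure.

sentences_total = 500_000_000
translators = 500_000
sentences_per_day = 70            # assumed professional throughput

per_person = sentences_total / translators      # sentences per translator
working_days = per_person / sentences_per_day   # full-time working days needed

print(f"{per_person:.0f} sentences each, ~{working_days:.0f} working days")
```

At that pace each translator handles 1,000 sentences in about 14 working days, which is indeed the better part of three weeks.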
What humans lack in bulk throughput, they make up for in quality. By the standard way of measuring (the BLEU score), a translation made by a human, compared against reference translations, can be expected to score about 70. The measure is a percentage, and we do not expect humans to achieve a perfect score because translation inherently involves choices: there are many valid ways to render the same sentence. State-of-the-art machine translation systems achieve scores of 35-55 for a language like German, for which there is plenty of training data.
The catch is that the BLEU score is not linear. The difference between a score of 0 (complete, unintelligible or unrelated rubbish) and 35 (a decent machine translation system) is greater than the difference between 35 and 70 (a good human translation). The rule of thumb that an improvement of one or two BLEU points is a publishable result becomes less plausible as systems improve and approach parity with human translators: we enter a regime of diminishing returns.
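Since BLEU figures so prominently here, a minimal sketch of how it is computed may help: the score is the geometric mean of modified n-gram precisions (n = 1..4) multiplied by a brevity penalty. Real toolkits such as sacrebleu add corpus-level counting, tokenization and smoothing; the sentence-level function below is a bare illustration with invented example sentences:

```python
import math
from collections import Counter

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of modified n-gram
    precisions times a brevity penalty. Unsmoothed, so any n-gram
    level with zero overlap zeroes the whole score."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 100.0
print(bleu("the cat sat on a mat", "the cat sat on the mat"))    # ~53.7
```

Note how a single changed word costs nearly half the score: n-gram overlap punishes small divergences heavily, which is one reason legitimate human translation choices keep human scores well below 100.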
The important question, especially in the context of an incipient climate catastrophe, is when the energy cost of our computations starts to exceed the value of gaining one or two more BLEU points. As systems improve and those points become less valuable, it becomes ever more important to evaluate this trade-off critically. This is why we should adopt the practice of explicitly reporting the energy use of our experiments alongside the standard quality evaluation mechanisms.
Perhaps this practice will encourage more research into efficient ways of performing the expensive computations involved in training and running neural machine translation systems. Perhaps performance on quality metrics can sensibly be traded for energy savings: sacrificing 10% of quality for a tenfold increase in efficiency may well be worth it.
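One way such a trade-off could be reported is as quality per unit of energy. The BLEU and energy numbers below are invented purely to illustrate the 10%-quality-for-10x-efficiency scenario, not measurements:

```python
# Hypothetical illustration of the quality/energy trade-off above.
# All numbers are invented for the example.

systems = {
    "large":   {"bleu": 40.0, "energy_mwh": 7.0},
    "compact": {"bleu": 36.0, "energy_mwh": 0.7},  # -10% quality, 10x cheaper
}

for name, s in systems.items():
    efficiency = s["bleu"] / s["energy_mwh"]
    print(f"{name}: {s['bleu']} BLEU at {s['energy_mwh']} MWh "
          f"-> {efficiency:.1f} BLEU points per MWh")
```

Under this (crude) metric the compact system delivers roughly nine times more quality per megawatt-hour, which makes the case for reporting both numbers side by side.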