ENERGY DEMAND.

MARCH 2025.

The past 20 years have been defined by anemic US electricity demand growth. A key driver of this trend has been the shift toward more energy-efficient technologies. The adoption of LED lighting is a standout example. Residential LEDs use at least 75% less energy and last up to 25 times longer than incandescent lighting.¹ Outsourcing manufacturing capacity to emerging markets, most notably China, also led to a significant reduction in domestic energy demand. US manufacturing energy consumption fell 16% from 2002 to 2010, coinciding with peak offshoring.¹ This trend is now in the early stages of a sharp reversal.

Much of the recent focus has been on the increase in electricity demand from the build out of digital infrastructure to support both cloud computing and generative AI. According to McKinsey, the power needs of data centers are expected to roughly triple by the end of the decade, rising from 3-4% of total US power demand today to 11-12% in 2030.²

In late January, attention to the differentiated approach to Large Language Model (LLM) design behind DeepSeek's R1 model drove a sharp reassessment of the power supply needed to support growing compute demand. In our view, the interpretation of R1 as introducing significant risk to the power demand thesis was misplaced; in fact, the shift to reasoning-based models and the heightened demand for inference imply that compute demand should accelerate. In this report we offer our interpretation of the recent shift toward more efficient approaches to AI inference. There is some complexity here, but the scale of the opportunity for power demand growth, and the investable opportunity it presents, demands in our view detailed analysis and review.

AI advancements in regular LLMs have been driven by the "transformer architecture" and its "attention mechanism". The attention mechanism calculates the relationships between the words (measured in tokens) within the context length. Asking a chatbot "is the sky blue?" would be 5 tokens (4 words plus the question mark), while dropping 20 documents into the chatbot could amount to thousands or tens of thousands of tokens. Input tokens are processed efficiently in a parallel batch, since the transformer handles them with matrix multiplication.
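To make the mechanics concrete, the minimal sketch below (with a tiny made-up embedding size and random stand-ins for learned weights, not any production model) shows how scaled dot-product attention relates every token in the 5-token prompt to every other token in a single parallel matrix multiplication.

```python
# Minimal sketch: scaled dot-product attention over a short prompt, showing how
# all input tokens are processed together with matrix multiplication.
# Dimensions and weights are illustrative stand-ins only.
import numpy as np

rng = np.random.default_rng(0)

tokens = ["is", "the", "sky", "blue", "?"]    # 5 input tokens, as in the example
d_model = 8                                    # tiny embedding size for illustration

# Pretend embeddings and projection weights (random stand-ins for learned weights)
X = rng.normal(size=(len(tokens), d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v            # one batched matmul per projection

# Attention scores relate every token to every other token: a 5x5 matrix here,
# growing with the square of the context length.
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ V                           # weighted mix of value vectors
print(weights.shape, output.shape)             # (5, 5) (5, 8)
```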

Output tokens (i.e. the answer) are more compute- and memory-intensive. In our simple example, if the answer was "Yes", that would be just one token. If we instead asked "why is the sky blue?", for a total of 6 tokens, and got a 100-word paragraph as an answer, this would require substantially more compute and memory resources. Because the prediction of output tokens is autoregressive, it cannot be parallelized across output positions the way input tokens can. It is a step-wise process: apply the attention mechanism to token #1, calculate a new value, update it, and keep the entire model plus the prior token values in memory. The process then moves to token #2, #3…#99, #100, repeating the same calculation, updating, and storing to memory. It is worth noting that the compute and memory demand grow rapidly with the output context length: the attention work scales with the square of the sequence length, and the cached token values grow with every new token. Hence, long context/question inputs can only generate long context/answer outputs if you have access to a lot of compute and memory.
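The step-wise nature of output generation can be sketched as a simple loop. The Python below is illustrative only (the model forward pass is replaced by a random stand-in), but it shows why each new output token must attend to everything generated so far and why the cached keys/values keep growing.

```python
# Sketch of autoregressive decoding: each new token attends to every token
# produced so far, so the cached keys/values grow with the output length.
# All names and sizes here are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(1)
d_model = 8

def fake_next_token_state(history):
    """Stand-in for a forward pass that returns the new token's key/value vectors."""
    return rng.normal(size=d_model), rng.normal(size=d_model)

k_cache, v_cache = [], []                # keys/values for prompt + generated tokens
prompt_len, max_new_tokens = 6, 100      # "why is the sky blue?" -> ~100-token answer

for _ in range(prompt_len):              # prefill: cache the prompt tokens
    k, v = fake_next_token_state(k_cache)
    k_cache.append(k); v_cache.append(v)

for step in range(max_new_tokens):       # cannot be parallelized across steps
    q = rng.normal(size=d_model)         # query for the token being generated
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d_model)    # attends to *all* previous tokens
    w = np.exp(scores - scores.max()); w /= w.sum()
    _context = w @ V                     # would feed the next-token prediction
    k, v = fake_next_token_state(k_cache)
    k_cache.append(k); v_cache.append(v) # memory grows with every output token

print(len(k_cache))                      # 106 cached entries (6 prompt + 100 output)
```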

These "regular" LLMs improved primarily through what was called "train-time" compute, where three variables were increased at the same time: 1) the models got bigger (number of parameters), 2) the datasets got bigger (number of input tokens), and 3) the compute got bigger (number of FLOPs used). These drivers are behind the data center and power buildout for compute, and the quest to use synthetic, audio, image, and video data for more training material. The new "reasoning" models, which are no more than 3-5 months old, sought an alternative way to advance by "thinking longer" during the inference stage in what is called "test-time" compute. This is illustrated in Figure 1.
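As a rough illustration of the train-time versus test-time split, the back-of-the-envelope sketch below uses the commonly cited approximations of roughly 6 × parameters × tokens of FLOPs for training and roughly 2 × parameters of FLOPs per generated token for inference. The model and dataset sizes are placeholders, not any specific vendor's figures.

```python
# Back-of-the-envelope comparison of train-time and test-time compute, using the
# standard approximations: training ~ 6 * params * tokens, inference ~ 2 * params
# per generated token. All sizes below are illustrative assumptions.
def train_flops(params, train_tokens):
    return 6 * params * train_tokens

def inference_flops(params, output_tokens):
    return 2 * params * output_tokens

params = 70e9              # 70B-parameter model (illustrative)
train_tokens = 2e12        # 2T training tokens (illustrative)

print(f"train-time compute:                  {train_flops(params, train_tokens):.2e} FLOPs")
print(f"one short answer (100 tokens):       {inference_flops(params, 100):.2e} FLOPs")
print(f"one long reasoning trace (50k tok):  {inference_flops(params, 5e4):.2e} FLOPs")
```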

The reasoning process is illustrated in Figure 2, where the "regular" LLM could be the chatbot described in our "is the sky blue?" example above; it simply provides the one output/answer token "Yes". The reasoning process, by contrast, consists of several steps called "Thought process" 1, 2, and so on. As an example, suppose Thought process 1 has an output context length of 500 words/tokens. These were all calculated using the autoregressive attention steps of compute, update, save to memory… repeated for each of the 500 output tokens. However, this is not our answer but the input to Thought process 2, which, say, has an output context length of 1,500 words/tokens. It is now clear that for each thought process in the reasoning chain the LLM is 1) crunching a lot of numbers and 2) using a lot of memory. Given how quickly memory demand grows with context length, this was the key problem DeepSeek had to solve for.
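The compounding of context across thought-process steps can be shown with a few lines of arithmetic; the step lengths below are made-up numbers in the spirit of the example above.

```python
# Illustrative sketch of how context accumulates across reasoning ("thought
# process") steps: each step's output becomes part of the next step's input,
# so both the attention work and the cached memory keep growing.
# Step lengths are assumed, not taken from any published model.
step_output_tokens = [500, 1500, 3000, 6000]   # tokens produced by each thought step
context = 6                                    # the original 6-token question

for i, out_tokens in enumerate(step_output_tokens, start=1):
    # each new output token attends to the full running context, one token at a time
    attention_ops = sum(range(context, context + out_tokens))
    context += out_tokens
    print(f"step {i}: +{out_tokens} tokens, running context = {context}, "
          f"pairwise attention ops ~ {attention_ops:,}")
```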

DeepSeek addressed the memory issue by introducing the Multi-head Latent Attention (MLA) mechanism, which significantly reduces the amount of memory required as the context length increases. As evidenced by their data in Figure 3, the length of the output context in tokens went up (here running between 500 and 10,000) as the number of thought process steps increased. Again, the attention mechanism consists of 1) calculate (where the matrices are directly proportional to context length), 2) update, and 3) store in memory. The DeepSeek claims are about reducing the memory demand, not the number of instances of compute, which will still be required.
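A stylized way to see the memory saving is to compare the per-token key-value cache of standard multi-head attention with a compressed latent cache in the spirit of DeepSeek's Multi-head Latent Attention. The head count, head dimension, and latent width below are placeholder assumptions, not DeepSeek's published configuration.

```python
# Stylized key-value cache comparison: full per-head keys/values versus a single
# compressed latent per token. Dimensions are assumptions for illustration only.
def kv_cache_bytes(context_len, floats_per_token, bytes_per_float=2):  # fp16
    return context_len * floats_per_token * bytes_per_float

n_heads, head_dim = 64, 128
standard_per_token = 2 * n_heads * head_dim   # full keys + values per layer
latent_per_token = 512                        # one compressed latent per layer (assumed)

for ctx in (500, 10_000):                     # the Figure 3 range of output lengths
    std = kv_cache_bytes(ctx, standard_per_token)
    lat = kv_cache_bytes(ctx, latent_per_token)
    print(f"context {ctx:>6}: standard ~{std/1e6:6.1f} MB/layer, "
          f"latent ~{lat/1e6:5.2f} MB/layer ({std/lat:.0f}x smaller)")
```

Note that both caches still grow with context length; the compression lowers the per-token footprint, which is consistent with the point that the amount of compute is not reduced.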

To summarize the above, especially as it relates to future compute and power demand for AI, we can go back to Figure 1 and think of it as the total compute/energy spent on training one model, fine-tuning one model, and running one query. It was always understood that for regular LLMs the bulk of the energy would be spent on "train-time" compute, with inference being very small (again, with short context output). For the reasoning model, we still need the train-time compute to generate the large LLM (and, in DeepSeek's case, to "distill it down" to a smaller model), but we replace some part of the train-time compute with test-time compute in the inference stage.

As described above, due to the multiple thought process steps and long context outputs, we can conclude that the compute/energy intensity of the reasoning model did not necessarily diminish alongside the memory demand. More interesting still, if reasoning models unlock use in infinitely verifiable use cases, such as Level 4 digital twins and autonomous robots requiring inference at "the edge", the continuous nature and long, changing context lengths of this inference could see compute/power demand inflect very sharply.

The analysis and opinions above are based on the limited disclosure from DeepSeek on certain aspects of their training, datasets, and inference compute. However, early tests reported in the MIT Technology Review, which evaluated the DeepSeek model, align with the conclusions drawn. We will continue to monitor developments in reasoning models but largely stand behind the view that this is bullish for power demand and industrial adoption of physical AI.
