AI models consume significant energy, making efficiency a key concern for both costs and environmental impact. Training large models like GPT-3 can use as much electricity as 100 U.S. homes in a year, and data centers already account for up to 2% of global energy demand. By 2030, this could rise to 21%. Benchmarking is a practical way to measure and improve energy use in AI systems.
By measuring energy use and applying targeted optimization strategies, businesses can cut costs, meet sustainability goals, and improve AI performance. Benchmarking and optimization aren't just good practice - they're necessary for scaling AI responsibly.
Getting energy benchmarking right means using the right mix of hardware, software, and controlled environments. Without the proper tools, your results can be wildly inaccurate - sometimes overestimating energy usage by as much as 4.1 times the actual consumption. Precision here isn't optional; it's essential, and it hinges on having specialized hardware and purpose-built software.
Reliable energy benchmarking requires specialized AI hardware. GPUs are the backbone, consuming 50–70% of the total power provisioned in data centers running machine learning tasks. For standardized benchmarking, the NVIDIA H100 GPU with 80GB memory is the go-to option. On the other hand, the RTX 4090 with 24GB memory is ideal for single consumer GPU setups, while the A100 GPU with 40GB memory is frequently used in research environments. Standardizing hardware - like using identical NVIDIA H100 GPUs - removes variability and ensures consistent results across models.
For edge AI applications, processors like Intel's Core Ultra Series with integrated NPUs and iGPUs can handle tasks like single video streams at 30 fps. But when the workload ramps up, such as exceeding 120 fps, discrete GPUs like the NVIDIA RTX 4000 Ada SFF step in.
On the software side, tools are critical for real-time energy tracking. CodeCarbon monitors energy consumption across CPU, GPU, and RAM during inference, while Zeus, an open-source tool from the University of Michigan, specializes in measuring and optimizing energy use for deep learning tasks on both NVIDIA and AMD GPUs. To streamline the benchmarking process, the Optimum Benchmark package integrates with these tools, delivering detailed performance and efficiency metrics.
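As a concrete starting point, the following is a minimal sketch of wrapping an inference workload in CodeCarbon's `EmissionsTracker`. The `run_inference` function and the 1,000-query loop are illustrative placeholders, not part of any tool's API.

```python
# Minimal CodeCarbon sketch: wrap an inference workload in an
# EmissionsTracker to estimate CPU/GPU/RAM energy and CO2-equivalent.
# `run_inference` is a hypothetical stand-in for your own model call.
from codecarbon import EmissionsTracker

def run_inference():
    # placeholder for your model's forward pass
    pass

tracker = EmissionsTracker(project_name="energy-benchmark")
tracker.start()
for _ in range(1000):          # e.g. 1,000 queries for Wh-per-1,000 reporting
    run_inference()
emissions_kg = tracker.stop()  # estimated kg CO2-eq; detailed energy data
                               # is also written to emissions.csv
print(f"Estimated emissions: {emissions_kg:.6f} kg CO2-eq")
```

Keeping the tracked region tight around the inference loop (rather than the whole script) helps separate workload energy from setup overhead.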
High-performance memory is another must-have. At least 8GB of memory is required, and 16GB is recommended for larger models. High-bandwidth memory ensures smooth data transfers between processing units.
Beyond hardware and software, consistent benchmarking also depends on standardized datasets and controlled environments.
Controlled environments are key to producing reliable results. Effective benchmarking efforts often use custom datasets that reflect real-world usage by sampling from well-known sources.
For instance, the AI Energy Score method creates datasets with 1,000 data points for each task, pulling from three respected datasets per category. Here are some examples:
Task Category | Standard Datasets Used |
---|---|
Text Generation | WikiText, OSCAR, UltraChat |
Summarization | CNN Daily Mail, SamSUM, ArXiv |
Image Classification | ImageNet ILSVRC, Food 101, Bean Disease Dataset |
Object Detection | COCO 2017, Visual Genome, Plastic in River |
Speech Recognition | LibriSpeech, Common Voice, People's Speech |
To ensure consistency, the AI Energy Score uses FP32 precision for most tasks but switches to FP16 for text generation, allowing better GPU resource management for larger models. A batch size of 1 is standard, and each model is tested ten times on its dataset to ensure statistically reliable results.
Meanwhile, the ML.ENERGY Benchmark takes a different approach. It processes batches of 500 prompts from larger datasets and focuses on production-grade setups. Benchmarks are run on NVIDIA A100 and H100 GPUs, capturing steady-state energy consumption during extended deployments.
Reproducible environments are critical for consistent results. Both major benchmarking initiatives rely on cloud-based standardized instances. For example, the ML.ENERGY Benchmark uses AWS p4d.24xlarge and p5.48xlarge instances, which minimize variability caused by factors like cooling systems, power delivery, and background processes. These environments ensure consistent energy measurements across runs. Additionally, secure containerized setups allow organizations to benchmark proprietary models while safeguarding intellectual property. Validation scripts further ensure proper GPU utilization during benchmarking.
To benchmark the energy use of AI models effectively, you’ll need to follow a structured process. This involves three key phases: preparation, measurement, and reporting. Each phase requires careful attention to ensure your results are accurate and align with industry standards. Using the hardware and dataset guidelines mentioned earlier, here’s how to execute the benchmarking process.
Start by defining the tasks and methodology you’ll use to ensure your results are meaningful.
Define Your AI Tasks. Focus on widely used machine learning tasks that reflect practical applications across various domains. The AI Energy Score project, introduced at the AI Action Summit in Paris in February 2025, provides a good starting point with standard tasks like text generation, image classification, object detection, summarization, speech-to-text, image generation, and image captioning.
Create Representative Datasets. For each task, build a dataset by sampling equally from established sources like WikiText, OSCAR, and UltraChat for text generation. This approach minimizes the risk of training data contamination.
"The goal of AI Energy Score is to establish a standardized approach for evaluating the energy efficiency of AI model inference. By focusing on controlled and comparable metrics, such as specific tasks and hardware, we aim to provide useful insights for researchers, developers, organizations, and policymakers."
Configure Your Models Consistently. For text generation tasks, group models based on their hardware needs - whether they’re optimized for single consumer GPUs, single cloud GPUs, or multiple cloud GPUs. This classification ensures fair comparisons. Additionally, consider containerized testing to protect proprietary models.
Once your setup is ready, the next step is to measure energy use. This starts with establishing a baseline to isolate the energy consumed by your AI workloads.
Focus on GPU Energy Use. GPUs often account for about half of a server's total energy demand during AI tasks, making it essential to monitor their power draw. Tools like nvidia-smi, integrated with CodeCarbon, can track GPU energy consumption in real time.
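Once you have a stream of power readings, converting them to energy is simple integration. The sketch below assumes readings are polled at a fixed interval from `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits` (or NVML); the sample values are illustrative.

```python
# Convert periodic GPU power samples (watts) into energy (watt-hours).
# In practice the samples would come from polling
# `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits`
# at a fixed interval; the readings below are illustrative.
def samples_to_wh(power_watts, interval_s):
    """Energy in Wh from power samples taken every `interval_s` seconds."""
    joules = sum(power_watts) * interval_s   # W * s = J
    return joules / 3600.0                   # 3,600 J = 1 Wh

readings = [310.5, 402.1, 398.7, 405.0, 322.4]  # watts, sampled 1 s apart
print(f"GPU energy: {samples_to_wh(readings, interval_s=1.0):.3f} Wh")
```

Shorter sampling intervals reduce the error from power spikes between samples, at the cost of slightly more monitoring overhead.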
Track Energy Across Inference Phases. Modern tools let you measure energy use during different stages, such as preprocessing, prefill, and decoding. Summing these values gives you the total energy consumed. To reduce variability, run each model 10 times and use the average.
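The per-phase summing and 10-run averaging described above can be sketched as follows; the Wh values are illustrative stand-ins for measurements from a tool like CodeCarbon or Zeus.

```python
# Sum per-phase energy (preprocessing, prefill, decoding) for each run,
# then average across runs to smooth out variability, as described above.
from statistics import mean, stdev

def total_energy_wh(phases):
    """Total energy for one run from its per-phase measurements (Wh)."""
    return sum(phases.values())

# Illustrative per-run measurements in Wh; real values would come from
# your measurement tool. A full protocol would use 10 runs.
runs = [
    {"preprocessing": 0.02, "prefill": 0.11, "decoding": 0.35},
    {"preprocessing": 0.02, "prefill": 0.10, "decoding": 0.37},
    {"preprocessing": 0.03, "prefill": 0.12, "decoding": 0.34},
]
totals = [total_energy_wh(r) for r in runs]
print(f"mean={mean(totals):.3f} Wh, stdev={stdev(totals):.4f} Wh")
```

Reporting the standard deviation alongside the mean makes it easy to spot unstable runs that warrant re-measurement.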
For smaller models, CodeCarbon is a reliable tool as it monitors energy use across the CPU, GPU, and RAM. However, ensure your workloads run for at least 5 minutes to avoid errors like "No emissions data recorded." For larger language models, MLCarbon offers more comprehensive tracking, covering the full lifecycle - training, inference, and storage.
Monitor Total System Energy. To estimate total server energy, double the measured GPU energy consumption, as GPUs typically account for roughly half of a server's energy use. Once you’ve collected this data, you can move on to recording and standardizing your results.
Raw energy data needs to be normalized and presented in a clear, comparable format.
Standardize Results in Watt-Hours per 1,000 Queries. This metric allows for fair comparisons across different models, regardless of their absolute power consumption.
Convert Energy to Familiar Units. Express energy in kilowatt-hours (kWh) and calculate carbon emissions using U.S. carbon intensity figures. For example, the average in 2024 was 402.49 grams of CO₂ equivalent per kWh.
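These two normalizations are straightforward arithmetic; the sketch below uses the 2024 U.S. average grid intensity cited above, with an illustrative total taken from the figures later in this article.

```python
# Normalize raw measurements to Wh per 1,000 queries, then estimate
# CO2-equivalent using the 2024 U.S. average grid intensity cited
# above (402.49 g CO2-eq per kWh).
US_GRID_G_PER_KWH = 402.49

def wh_per_1000_queries(total_wh, n_queries):
    return total_wh / n_queries * 1000.0

def co2_grams(total_wh, g_per_kwh=US_GRID_G_PER_KWH):
    return (total_wh / 1000.0) * g_per_kwh   # Wh -> kWh -> grams CO2-eq

total_wh, queries = 12.69, 1000              # illustrative measurement
print(f"{wh_per_1000_queries(total_wh, queries):.2f} Wh / 1,000 queries")
print(f"{co2_grams(total_wh):.2f} g CO2-eq")
```

Using a regional or real-time grid intensity instead of the annual U.S. average will give a more accurate emissions estimate for a specific deployment.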
Relate kWh Values to Everyday Contexts:
Energy Context | Conversion Factor | Example |
---|---|---|
Microwave Operation | 1 kWh = 1 hour at 1,000W | 0.5 kWh = 30 minutes |
E-bike Range | 1 kWh = 20–40 miles | 0.1 kWh = 2–4 miles |
LED Bulb (10W) | 1 kWh = 100 hours | 0.01 kWh = 1 hour |
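The conversion factors in the table above can be bundled into a small helper for reports; the function name and output keys are illustrative choices.

```python
# Translate a kWh figure into the everyday equivalents from the table
# above (microwave minutes, e-bike miles, LED-bulb hours).
def everyday_equivalents(kwh):
    return {
        "microwave_minutes_at_1000W": kwh * 60,   # 1 kWh = 1 h at 1,000 W
        "ebike_miles": (kwh * 20, kwh * 40),      # 1 kWh = 20-40 miles
        "led_10w_hours": kwh * 100,               # 1 kWh = 100 h at 10 W
    }

print(everyday_equivalents(0.5))
```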
Document Your Methodology. Record all key variables, such as hardware specs (e.g., NVIDIA H100 GPUs), model precision settings, quantization configurations, and batching strategies. This ensures others can reproduce your results and understand any limitations.
Implement a Rating System. Alongside raw energy data, use a 1-to-5 star rating system (with 5 stars for the most energy-efficient models) to provide an easy-to-understand comparison. Update these ratings periodically as more efficient models become available.
Save Detailed Logs. Keep comprehensive logs of all benchmarking data. For instance, CodeCarbon saves data to a CSV file (emissions.csv), and you can integrate outputs with monitoring platforms like Prometheus for deeper analysis. These metrics help compare models directly and guide efforts to improve energy efficiency.
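A CodeCarbon log can be aggregated with nothing more than the standard library. The column names below (`energy_consumed` in kWh, `emissions` in kg CO2-eq) match recent CodeCarbon versions, but verify them against your own file; an inline sample stands in for the real CSV here.

```python
# Read a CodeCarbon-style emissions.csv and sum energy and emissions
# across runs. An inline sample stands in for the real file; column
# names should be verified against your CodeCarbon version.
import csv
import io

sample = io.StringIO(
    "timestamp,project_name,energy_consumed,emissions\n"
    "2025-05-01T10:00:00,bench,0.0120,0.0048\n"
    "2025-05-01T10:30:00,bench,0.0115,0.0046\n"
)
rows = list(csv.DictReader(sample))
total_kwh = sum(float(r["energy_consumed"]) for r in rows)
total_kg = sum(float(r["emissions"]) for r in rows)
print(f"{total_kwh:.4f} kWh, {total_kg:.4f} kg CO2-eq over {len(rows)} runs")
```

For a real file, replace the `io.StringIO` sample with `open("emissions.csv")`.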
After gathering your energy benchmarking data, the real challenge begins: interpreting those numbers to guide your AI model selection. Raw energy consumption figures alone aren’t enough - they need context. By understanding how different metrics interact, you can uncover the trade-offs between performance and efficiency, leading to smarter model comparisons and better choices.
Energy benchmarking provides a range of metrics that are essential for decision-making. One of the most basic yet crucial measurements is energy per inference, typically expressed in microjoules (μJ) for smaller models or watt-hours for larger systems. For instance, MLPerf Tiny measures energy consumption in μJ per inference, making it ideal for evaluating resource-constrained setups like IoT devices.
For larger language models, the AI Energy Score uses GPU energy consumption, measured in watt-hours per 1,000 queries, as its main metric. This standardized measurement ensures fair comparisons across models, regardless of size or architecture. Additionally, the AI Energy Score employs a 1-to-5 star rating system, where 5 stars indicate the most energy-efficient models and 1 star the least efficient.
Composite metrics go a step further by capturing the balance between efficiency and speed. One such metric is the Energy Delay Product (EDP), calculated by multiplying energy consumption by execution time (EDP = E × T). This metric is particularly valuable for devices where both efficiency and speed are critical, such as battery-powered systems.
"The EDP is a widely recognized metric in the literature and is commonly used in latency-sensitive applications to quantify efficiency." – Pietro Bartoli et al., Politecnico di Milano
Take, for example, tests conducted in May 2025 on the STM32N6 MCU. Switching the MCU from its High Performance to its Low Performance mode reduced the DSCNN model's energy per inference from 219.0 ± 4.1 μJ to 156.5 ± 3.9 μJ. This adjustment resulted in a 27% improvement in EDP, with only a minor increase in latency. Such insights highlight how composite metrics can uncover optimization opportunities that single metrics might miss.
The relative Energy Delay Product (rEDP) further simplifies this concept by showing EDP percentage changes compared to a reference configuration. This makes it easier to communicate efficiency improvements to stakeholders. Of course, energy efficiency alone isn’t enough - a model must still perform its intended tasks effectively. Metrics like accuracy, throughput, and latency remain critical. For this reason, the AI Energy Score initiative requires models to meet predefined accuracy thresholds before they can earn energy efficiency ratings.
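Both metrics are one-line formulas; the sketch below uses illustrative numbers in the same order of magnitude as the STM32N6 figures quoted above, not the actual measurements.

```python
# Energy Delay Product (EDP = E x T) and its relative form (rEDP),
# which expresses EDP as a percentage change versus a reference
# configuration. The numbers are illustrative, not the STM32N6 data.
def edp(energy, time_s):
    return energy * time_s

def redp_percent(edp_new, edp_ref):
    """Percent change in EDP versus a reference (negative = improvement)."""
    return (edp_new - edp_ref) / edp_ref * 100.0

ref = edp(energy=220e-6, time_s=0.010)   # e.g. 220 uJ at 10 ms latency
new = edp(energy=160e-6, time_s=0.012)   # lower energy, slightly slower
print(f"rEDP: {redp_percent(new, ref):+.1f}%")
```

Because EDP multiplies energy by latency, a configuration can win on energy alone yet lose on EDP if it slows inference too much, which is exactly the trade-off the metric is designed to surface.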
Once you understand these metrics, you can use them in comparison tables to make well-informed model selections.
Comparison tables are a practical way to evaluate models by combining energy metrics with performance indicators. These tables should include both direct energy measurements and relative efficiency ratings alongside key performance metrics like accuracy, precision, recall, and F1-scores.
For instance, a May 2025 evaluation of climate data showed that the Phi-4 model (14.7B parameters) achieved an accuracy of 0.8 - just 7% lower than the top-performing Qwen3-235B-A22B model (235B parameters). However, Phi-4 used 24 times less energy (12.69 Wh vs. 286 Wh) to complete the same task. A table summarizing these findings could look like this:
Model | Parameters | Accuracy | Energy (Wh) | Relative Efficiency | Performance Trade-off |
---|---|---|---|---|---|
Qwen3-235B-A22B | 235B | 0.867 | 286 | Baseline | - |
Phi-4 | 14.7B | 0.8 | 12.69 | 24× more efficient | 7% accuracy reduction |
Tables like these are even more powerful when they include before and after optimization data. For example, quantization and local inference techniques can reduce carbon emissions for large language models by up to 45%. A comparison of Llama 3.2's performance before optimization (0.45 accuracy, 0.012 kg CO₂) versus after optimization (0.48 accuracy, 0.005 kg CO₂) demonstrates how efficiency improvements can also enhance overall performance.
Breaking down energy consumption by stages - such as preprocessing, prefill, and decoding - can further refine your analysis. Modern benchmarking tools make this possible, helping you identify where optimizations will have the greatest impact.
Energy efficiency isn’t just about sustainability - it also directly reduces operational costs, especially in large-scale deployments. This makes it a key consideration for enterprises.
It’s essential to keep these comparison tables up to date. For example, the AI Energy Score leaderboard is recalibrated roughly every six months to reflect technological advancements. What qualifies as a 5-star efficiency rating today might be considered average in the near future.
When selecting models, always test them in environments that closely resemble your actual deployment setup. Factors like hardware configurations, batch sizes, and workload patterns can significantly affect energy consumption. Your tables should account for these real-world variables.
"The AI Energy Score builds on existing initiatives like MLPerf, Zeus, and Ecologits by focusing solely on standardized energy efficiency benchmarking for AI inference. Unlike MLPerf, which prioritizes performance with optional energy metrics, or Zeus and Ecologits, which may be limited by open-source constraints or estimation methods, the AI Energy Score provides a unified framework that evaluates both open-source and proprietary models consistently." – AI Energy Score
For procurement processes, consider including energy transparency requirements in your RFPs and tenders. Ask vendors to provide AI Energy Scores or equivalent energy consumption metrics. This not only promotes transparency but also encourages the adoption of energy-efficient AI practices across the industry.
Once you've benchmarked your AI models, the next step is to optimize them to lower energy consumption while maintaining performance. Techniques like quantization can shrink model sizes by as much as 75–80% with minimal impact on accuracy. This not only reduces operational costs but also helps cut emissions. These strategies build on benchmarking by directly addressing energy demands.
The journey to energy-efficient AI begins with choosing the right model architecture. Larger models with more parameters naturally consume more energy, but bigger doesn’t always mean better for every task.
One promising approach is using sparse models, which focus only on the components needed for a specific task. This can reduce computation requirements by 5 to 10 times. By comparison, dense models process every parameter, regardless of relevance.
Small Language Models (SLMs) are another efficient choice, especially for tasks in resource-constrained environments like edge devices. These models deliver robust performance for targeted tasks while consuming far less power than their larger, general-purpose counterparts.
Mixture of Experts (MoE) architectures take efficiency a step further. These models consist of multiple specialized sub-models, but they only activate the ones relevant to a given task. This selective activation minimizes computational load and energy use while retaining the advantages of specialization.
Matching your model to the task at hand is also critical. For example, using a massive model like GPT-4 (with over 1 trillion parameters) for a simple text classification task is overkill. A smaller, specialized model with just a few million parameters could achieve similar accuracy while using significantly less energy.
After selecting an efficient architecture, you can further lower energy consumption through techniques that simplify models and streamline their operation.
For maximum results, these methods can be combined. For example, pruning followed by quantization can create models that are 4–5 times smaller and 2–3 times faster.
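The pruning-then-quantization pipeline can be sketched in a few lines of PyTorch. The toy `nn.Sequential` model and the 30% pruning amount are illustrative assumptions; actual savings depend on the model and runtime.

```python
# A minimal PyTorch sketch of pruning followed by quantization: prune
# 30% of each Linear layer's weights by L1 magnitude, then apply
# dynamic int8 quantization for inference. A starting point, not a
# production recipe.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# 1) Prune 30% of weights and make the pruning permanent.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")

# 2) Quantize Linear layers to int8 for cheaper inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller and cheaper model
```

Re-running your energy benchmark on the quantized model closes the loop: the table above gives expected ranges, but only measurement confirms the savings for your workload.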
Even well-crafted prompts can improve efficiency by reducing the number of tokens processed. Tools like God of Prompt offer guides and optimized prompts to help streamline operations across various AI platforms.
Optimization Technique | Energy Benefits | Accuracy Impact | Best Use Cases |
---|---|---|---|
Quantization | 75–80% size reduction | <2% accuracy loss | Large-scale inference, edge devices |
Pruning | 30–50% parameter reduction | Minimal with tuning | Overparameterized models |
Knowledge Distillation | Significant size reduction | 90–95% of teacher performance retained | Resource-limited environments |
Combined (Quantization + Pruning) | 4–5× smaller, 2–3× faster | Varies by implementation | Production deployments |
Optimization doesn’t stop once a model is deployed. Continuous monitoring is essential to sustain energy efficiency throughout the AI lifecycle.
"Continuous monitoring of energy consumption during the operation of AI systems is essential for sustainable deployment. Utilizing tools that provide real-time energy consumption data can help teams make informed decisions on possible optimizations and adjustment needs. This proactive monitoring aids in maintaining the efficiency of AI applications throughout their life cycle."
- OrhanErgun.net Blog
Tools like CodeCarbon track energy usage for general machine learning models, whether running locally or in the cloud. For large language models, platforms like MLCarbon provide detailed tracking across their entire lifecycle.
This data isn’t just for show - it should guide real improvements. For example, you can adjust model complexity dynamically based on workload and energy data. Some organizations even use carbon-aware AI systems that optimize tasks based on the carbon intensity of the power source.
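Carbon-aware scheduling can be as simple as shifting deferrable work into the cleanest forecast window. The sketch below assumes an hourly carbon-intensity forecast is available (real deployments would pull one from a grid-data API); the values are illustrative.

```python
# Carbon-aware scheduling sketch: given an hourly carbon-intensity
# forecast (g CO2-eq/kWh), pick the cleanest start hour for a
# deferrable batch job. Forecast values are illustrative.
def best_start_hour(forecast, duration_h):
    """Start hour minimizing average intensity over `duration_h` hours."""
    windows = {
        h: sum(forecast[h:h + duration_h]) / duration_h
        for h in range(len(forecast) - duration_h + 1)
    }
    return min(windows, key=windows.get)

forecast = [420, 410, 380, 300, 250, 260, 310, 400]  # g/kWh per hour
print(f"Run the 2-hour job starting at hour {best_start_hour(forecast, 2)}")
```

The same window-selection logic applies to training jobs, batch inference, and model re-evaluation runs; only latency-sensitive serving needs to run regardless of grid conditions.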
Regular updates to software and hardware are also key. These updates often include performance improvements that reduce energy use. Staying current with optimization techniques ensures your models remain as efficient as possible.
Google's "4Ms" framework (Model, Machine, Mechanization, and Map) highlights how systematic optimization can slash energy use by up to 100× and CO₂ emissions by up to 1,000× during machine learning training.
"Energy management is an ongoing process. Businesses should continuously monitor consumption data and adapt their strategies as new patterns emerge, ensuring long-term energy efficiency and cost savings."
- Simon Stano and J. Mark Munoz, California Management Review
For high-volume systems, daily monitoring and monthly reviews are recommended. Smaller deployments may need less frequent attention. Keep in mind that inference often accounts for the bulk of AI’s energy consumption due to its repetitive use across millions - or even billions - of users. Regular updates and optimizations are essential to keep energy usage in check.
Benchmarking AI models plays a critical role in ensuring efficient performance while keeping costs and energy use in check. Consider this: global data centers consume about 1–1.3% of the world’s electricity, with energy use growing by 20–40% annually. AI applications alone account for 10–20% of that consumption. These numbers highlight the importance of monitoring and optimizing energy usage in AI systems.
By systematically measuring, comparing, and refining AI models, benchmarking provides actionable insights for improvement. Earlier sections covered tools and techniques for tracking energy consumption and outlined strategies - like pruning or quantization - that can cut energy use by as much as 50%.
A major energy drain in AI comes from inference, which can account for over 80% of a model’s total lifecycle energy use. This makes optimization efforts especially impactful, as they reduce energy consumption every time the model is used. For instance, a single GPT-4o query consumes 0.42 Wh - about 40% more energy than a typical Google search at 0.30 Wh. When scaled to millions of users, even small efficiency improvements can lead to significant energy savings.
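Scaling the per-query figure quoted above makes the stakes concrete. The workload size below (one million queries per day) is an illustrative assumption.

```python
# Scale the 0.42 Wh-per-query figure quoted above to a yearly total
# for an illustrative workload of one million queries per day.
WH_PER_QUERY = 0.42

def annual_kwh(queries_per_day, wh_per_query=WH_PER_QUERY):
    return queries_per_day * wh_per_query * 365 / 1000.0  # Wh -> kWh

print(f"{annual_kwh(1_000_000):,.0f} kWh per year")
```

At this scale, even a 10% efficiency gain saves tens of thousands of kWh annually, which is why per-query optimizations compound so strongly.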
The industry is already moving toward greater energy transparency. Benchmarking initiatives are updated regularly to drive progress, and energy efficiency is becoming a key factor in procurement decisions. As Dr. Sasha Luccioni of Hugging Face explains:
"The AI Energy Score represents a major milestone for sustainable AI. By creating a transparent rating system, we address a key blocker for reducing the environmental impact of AI. We're thrilled to launch this project and look forward to seeing wider adoption."
To reduce both costs and environmental impact, measure your models’ energy use and apply optimization techniques. Whether it’s pruning unnecessary parameters, adopting quantization, or transitioning to more efficient architectures like sparse models, every step counts.
Continuous monitoring is equally important. Automate tracking systems, set internal benchmarks for sustainability, and make energy efficiency a priority when selecting models. As highlighted earlier, these practices - when combined with regular updates and fine-tuning - can substantially lower energy costs and carbon footprints.
"For organizations using AI/ML technologies, it is crucial to systematically track the carbon footprint of the ML lifecycle and implement best practices in model development and deployment stages." – Lakshmithejaswi Narasannagari, Senior Developer, InfoQ
Striking a balance between performance and sustainability is essential for the future of AI. By embedding energy benchmarking into your workflow now, you’re not just optimizing models - you’re laying the groundwork for responsible AI that scales without depleting our planet’s resources.
Businesses can strike a balance between AI performance and energy efficiency by using optimization techniques designed to save energy. These approaches help cut down on computational demands while still maintaining accuracy. On top of that, AI itself can play a role in streamlining data center operations - making resource allocation smarter and scheduling workloads more efficiently.
AI-driven tools can also make a big difference in areas like cooling systems and energy grid management. By adopting smarter infrastructure and refining operational strategies, businesses can boost performance while cutting down on energy use, reducing their overall environmental footprint.
To evaluate how energy-efficient AI models are, you can rely on specialized tools and frameworks built for this purpose. One example is the AI Energy Score, which provides a standardized way to assess and compare energy consumption across models. Similarly, MLPerf Power focuses on system-level energy efficiency by monitoring power usage during AI tasks. These tools often feature automated benchmarking, consistent metrics, and public leaderboards to ensure reliable and transparent evaluations.
Using these resources can help you better understand the energy demands of your AI models. They also reveal opportunities for improvement, allowing you to refine your models for better performance while reducing their environmental impact.
Regular monitoring and timely updates are crucial to keep AI models performing at their best and ready to handle new challenges. Even after fine-tuning energy efficiency, unexpected shifts in data patterns, system needs, or operating conditions can arise, potentially affecting how well the models function.
By keeping a close eye on performance, you can catch anomalies early, avoid potential breakdowns, and ensure the model stays dependable. Regular updates not only enhance safety but also help cut operational costs and maintain stability - especially important in dynamic and intricate energy systems.