Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence (2024)

The Cortex-A78 Micro-architecture: PPA Focused

The new Cortex-A78 has been on Arm’s roadmaps for a few years now, and we have been expecting the design to represent the smallest generational microarchitectural jump in Arm’s new Austin family. As the third iteration of Arm's Austin core designs, the A78 follows the sizable 25-30% IPC improvements that Arm delivered with the Cortex-A76 and A77, which is to say that Arm has already picked much of the low-hanging fruit in refining its Austin core.

As the new A78 now finds itself part of a sibling pairing alongside the higher-performance X1 CPU, we naturally see the biggest focus of this particular microarchitecture being on improving the PPA of the design. Arm’s goals were reasonable performance improvements, balanced with reduced power usage and maintaining or reducing the area of the core.

It’s still an Armv8.2 CPU, sharing ISA compatibility with the Cortex-A55 CPU with which it is meant to be paired in a DynamIQ cluster. We see similar scaling possibilities here, with up to 4 cores per DSU and an L3 cache scaling up to 4MB in Arm’s projected average target designs.

Microarchitectural improvements of the core are found throughout the design. On the front-end, the biggest change has been to the branch predictor, which is now able to process up to two taken branches per cycle. Last year, the Cortex-A77 introduced a secondary branch execution unit in the back-end; however, the actual branch unit on the front-end still only resolved a single branch per cycle.

The A78 is now able to concurrently resolve two predictions per cycle, vastly increasing throughput in this part of the core and allowing it to recover better from branch mispredictions and the resulting pipeline bubbles further downstream. Arm claims its microarchitecture is very branch-prediction driven, so the improvements here add a lot to the generational improvements of the core. Naturally, the branch predictors themselves have also been improved in terms of accuracy, which is an ongoing effort with every new generation.

Arm focused on a slew of different aspects of the front-end to improve power efficiency. On the part of the L1I cache, we're now seeing the company offer a 32KB implementation option for vendors, allowing customers to reduce the area of the core for a small hit on performance, but with gains in efficiency. Other changes were made to some structures of the branch predictors, where the company downsized some of the low return-on-investment blocks which had a larger cost in area and power, but didn’t have as large an impact on performance.

The Mop cache on the Cortex-A78 remained the same as on the A77, housing up to 1500 already decoded macro-ops. The bandwidth from the front-end to the mid-core is the same as on the A77, with an up to 4-wide instruction decoder and fetching up to 6 instructions from the macro-op cache to the rename stage, bypassing the decoder.
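As a rough illustration of what these two delivery paths mean for peak throughput, the following Python sketch mixes the two widths quoted above. The function name and the `mop_cycle_fraction` parameter are hypothetical, and the linear mix is a simplifying assumption rather than a description of how Arm's front-end actually arbitrates between the decoder and the macro-op cache.

```python
DECODE_WIDTH = 4      # instructions per cycle through the decoder (from the text)
MOP_CACHE_WIDTH = 6   # instructions per cycle from the macro-op cache (from the text)


def peak_frontend_width(mop_cycle_fraction: float) -> float:
    """Average peak instructions per cycle delivered towards rename.

    mop_cycle_fraction is an assumed input: the share of cycles in which
    delivery comes from the macro-op cache instead of the decoder.
    """
    if not 0.0 <= mop_cycle_fraction <= 1.0:
        raise ValueError("mop_cycle_fraction must be between 0 and 1")
    return (mop_cycle_fraction * MOP_CACHE_WIDTH
            + (1.0 - mop_cycle_fraction) * DECODE_WIDTH)


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, peak_frontend_width(fraction))
```

Under this toy model the front-end only approaches the 6-wide figure when most cycles are served from the macro-op cache, which is why its capacity matters for sustained throughput.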

In the mid-core and execution pipelines, most of the work went into improving the area and power efficiency of the design. We’re now seeing more cases of instruction fusion taking place, which helps not only the performance of the core but also its power efficiency, as it now uses fewer resources, and therefore less energy, for the same amount of work.

The issue queues have also seen fairly large changes in their designs. Arm explains that in any OOO core these are quite power-hungry structures, and the designers have made some good power efficiency improvements here, although the company did not detail any specifics of the changes.

Register renaming structures and register files have also been optimized for efficiency, sometimes seeing a reduction of their sizes. The register files in particular have seen a redesign in the density of the entries they’re able to house, packing in more data in the same amount of space, enabling the designers to reduce the structures’ overall size without reducing their capabilities or performance.

On the re-order-buffer side, although the capacity remains the same at 160 entries, the new A78 improves power efficiency and the density of instructions that can be packed into the buffer, increasing the instructions per unit area of the structure.

Arm has also fine-tuned the out-of-order window size of the A78, actually reducing it in comparison to the A77. The explanation here is that larger window sizes generally do not deliver a good return on investment when scaling up in size, and the goal of the A78 is to maximize efficiency. It’s to be noted that the OOO window here does not refer solely to the ROB, which has remained the same size. Arm employs different buffers, queues, and structures which enable OOO operation, and it’s likely in these blocks that we’re seeing a reduction in capacity.

On the diagram, we see Arm slightly changing its description of the dispatch stage, disclosing a dispatch bandwidth of 6 macro-ops (Mops) per cycle, whereas last year the company had described the A77 as dispatching 10 µops. The apples-to-apples comparison here is that the new A78 increases the dispatch bandwidth to 12 µops per cycle, allowing for a wider execution core which houses some new capabilities.
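To make the apples-to-apples comparison concrete, here is a small Python sketch of the arithmetic using only the dispatch figures quoted above. The constant and function names are hypothetical, and the µops-per-Mop ratio is simply what the two A78 figures imply at peak, not a number Arm disclosed.

```python
# Dispatch figures as quoted in the text.
A77_DISPATCH_UOPS = 10   # micro-ops per cycle, as Arm described the A77
A78_DISPATCH_UOPS = 12   # micro-ops per cycle on the A78
A78_DISPATCH_MOPS = 6    # macro-ops per cycle on the A78


def relative_increase(old: float, new: float) -> float:
    """Fractional increase going from old to new."""
    return (new - old) / old


uplift = relative_increase(A77_DISPATCH_UOPS, A78_DISPATCH_UOPS)
implied_uops_per_mop = A78_DISPATCH_UOPS / A78_DISPATCH_MOPS

if __name__ == "__main__":
    print(f"dispatch uplift: {uplift:.0%}")                    # prints "dispatch uplift: 20%"
    print(f"implied peak uops per Mop: {implied_uops_per_mop}")  # prints 2.0
```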

On the integer execution side, the only big addition has been the upgrade of one of the ALUs to a more complex pipeline which now also handles multiplications, essentially doubling the integer MUL bandwidth of the core.

The rest of the execution units have seen very little to no change this generation, and are pretty much in line with what we’ve already seen in the Cortex-A77. It’s only next year that we expect to see a bigger change in the execution units of Arm’s cores.

On the back-end of the core and the memory subsystem, we actually find some larger changes for performance improvements. The first big change is the addition of a new load AGU which complements the two existing load/store AGUs. This doesn’t change the store operations executed per cycle, but gives the core a 50% increase in load bandwidth.

The interface bandwidth from the LD/ST queues to the L1D cache has been doubled from 16 bytes per cycle to 32 bytes per cycle, and the core’s interfaces to the L2 have also been doubled in terms of both read and write bandwidth.
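The load-side changes described above can be summarized with a short Python sketch. The `MemPipes` class and its field names are hypothetical, the model assumes each AGU starts at most one memory operation per cycle, and the 3.0 GHz clock is an arbitrary illustrative value, not a figure from Arm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemPipes:
    load_store_agus: int       # AGUs that can handle either loads or stores
    load_only_agus: int        # AGUs that can only handle loads
    l1d_bytes_per_cycle: int   # LD/ST queue to L1D interface width

    @property
    def max_loads_per_cycle(self) -> int:
        return self.load_store_agus + self.load_only_agus

    @property
    def max_stores_per_cycle(self) -> int:
        return self.load_store_agus


# Values taken from the text above.
A77 = MemPipes(load_store_agus=2, load_only_agus=0, l1d_bytes_per_cycle=16)
A78 = MemPipes(load_store_agus=2, load_only_agus=1, l1d_bytes_per_cycle=32)


def peak_gb_per_s(bytes_per_cycle: int, clock_ghz: float) -> float:
    """Peak interface bandwidth in GB/s at an assumed clock frequency."""
    return bytes_per_cycle * clock_ghz


if __name__ == "__main__":
    gain = A78.max_loads_per_cycle / A77.max_loads_per_cycle - 1
    print(f"load slots: {A77.max_loads_per_cycle} -> {A78.max_loads_per_cycle} ({gain:.0%})")
    print(f"store slots: {A77.max_stores_per_cycle} -> {A78.max_stores_per_cycle}")
    print(peak_gb_per_s(A77.l1d_bytes_per_cycle, 3.0), "->",
          peak_gb_per_s(A78.l1d_bytes_per_cycle, 3.0), "GB/s at an assumed 3.0 GHz")
```

Whatever clock is assumed, the ratio between the two generations stays at 2x for the L1D interface and 1.5x for load slots.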

Arm seemingly already has some of the most advanced prefetchers in the industry, and here they claim the A78 further improves the designs in terms of memory area coverage, accuracy, and timeliness. Timeliness here refers to quickly latching onto emerging patterns and bringing the data into the lower caches as fast as possible. You also don’t want the prefetchers to kick in too early or too late, such as needlessly prefetching data that’s not going to be used for some time.
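Arm did not disclose how its prefetchers work, so the following Python sketch is only a generic textbook stride prefetcher, not the A78's design. It is included to show how the three qualities map onto simple knobs: the hypothetical `confirm` threshold trades how quickly a pattern is latched onto against accuracy, and the hypothetical `distance` parameter controls timeliness by deciding how far ahead of the current access to fetch.

```python
class StridePrefetcher:
    """Toy single-stream stride detector (illustrative only)."""

    def __init__(self, confirm: int = 2, distance: int = 4):
        self.confirm = confirm      # repeated strides needed before prefetching
        self.distance = distance    # how many strides ahead to prefetch
        self.last_addr = None
        self.stride = None
        self.confidence = 0

    def access(self, addr: int) -> list:
        """Record a demand access and return the addresses to prefetch."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.stride:
                self.confidence += 1
            else:
                # New or broken pattern: start tracking the new stride.
                self.stride = stride
                self.confidence = 0
            if self.confidence >= self.confirm:
                prefetches.append(addr + self.stride * self.distance)
        self.last_addr = addr
        return prefetches


if __name__ == "__main__":
    pf = StridePrefetcher()
    for address in (0, 64, 128, 192, 256):
        print(address, pf.access(address))
```

A larger `distance` risks fetching data long before it is needed, while a smaller one risks the data arriving after the demand access, which is the too-early versus too-late trade-off described above.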

Much like the L1I cache, the A78 now also offers a 32KB L1D option that gives vendors the choice to configure a smaller core setup. The L2 TLB has also been reduced from 1280 to 1024 pages; this improves the power efficiency of the structure whilst still retaining enough entries to allow for complete coverage of a 4MB L3 cache, still minimizing access latency in that regard.
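The coverage claim can be checked with simple arithmetic. The sketch below assumes 4KB pages, which the text does not state but which is the page size at which 1024 entries map exactly 4MB; the function name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB

PAGE_SIZE_BYTES = 4 * KIB  # assumption: 4KB pages


def tlb_reach_bytes(entries: int, page_size: int = PAGE_SIZE_BYTES) -> int:
    """Amount of memory that can be mapped at once by a TLB with this many entries."""
    return entries * page_size


if __name__ == "__main__":
    for entries in (1280, 1024):
        print(entries, "entries ->", tlb_reach_bytes(entries) // MIB, "MiB")
```

With 4KB pages the previous 1280-entry structure reached 5MB, so the reduction to 1024 entries trims reach that went beyond the 4MB L3 target.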

Overall, the Cortex-A78’s microarchitectural disclosures might sound surprising if the core were to be presented in a vacuum, as we’re seeing quite a lot of mentions of reduced structure sizes and overall compromises being made in order to maximize energy efficiency. Naturally this makes sense given that the Cortex-X1 focuses on performance…

FAQs

Which is better, the ARM Cortex-A78 or the ARM Cortex-X1?

The Cortex-X1 design is based on the ARM Cortex-A78, but redesigned purely for performance instead of a balance of performance, power, and area (PPA). The Cortex-X1 is a 5-wide decode out-of-order superscalar design with a 3K macro-op (Mop) cache.

Is the ARM Cortex-A78 good?

The processor is built on the standard Cortex-A roadmap and offers a 2.1 GHz (5 nm) chipset, which makes it better than its predecessor in the following ways: 7% better performance, 4% lower power consumption, and 5% smaller, meaning 15% more area serving for a quad-core cluster, extra GPU, NPU.

How efficient is the Cortex-A78?

In a big.LITTLE configuration, Cortex-A78 extends the performance and efficiency of premium smartphones to multiple form factors. Consumers demand premium-level performance from devices, and Cortex-A78 provides up to a 20% sustained performance increase, along with 50% energy savings when compared to earlier-generation devices.

What is the difference between the Cortex-X1 and A77?

Cortex-X1 is the most powerful Cortex CPU to date, bringing 30 percent peak performance improvements in the next generation over the current Arm Cortex-A77 CPU. It is designed to bring ultimate performance for next-generation custom solutions.

Is the ARM Cortex-A7 good?

The Arm Cortex-A7 processor is the most efficient Armv7-A processor. The Cortex-A7 processor provides up to 20% more single-thread performance than the Cortex-A5.

What is the most efficient ARM processor?

The Arm Cortex-M0+ processor is the most energy-efficient Arm processor available for constrained embedded applications.

Are ARM processors good or bad?

ARM processors tend to be less powerful than traditional CPUs, but they also require less power to run. Many companies choose to work with ARM-based processors in order to create lightweight devices with long battery life that offer solid, well-balanced performance.

Why are ARM processors cheaper?

Because ARM is RISC-based, the architecture requires fewer transistors, which helps to reduce cost, power consumption, and heat output.

Why are ARM processors so popular?

Due to their low costs, low power consumption, and low heat generation, ARM processors are useful for light, portable, battery-powered devices, including smartphones, laptops, and tablet computers, as well as embedded systems.

What is the best Arm Cortex?

The Cortex-A15 is the highest performance member of this series, providing (in a mobile configuration) twice the performance you would get from a Cortex-A9.

Who manufactures the Cortex-A7?

The ARM Cortex-A7 MPCore is a 32-bit microprocessor core licensed by ARM Holdings implementing the ARMv7-A architecture announced in 2011.

What is the difference between the ARM A78 and A77?

Cortex-A78 transforms next-generation user experiences on smartphones through double-digit improvements in sustained performance. It provides a 20 percent sustained performance improvement over the Arm Cortex-A77 CPU in the same mobile thermal power envelope.

What is the difference between the ARM Cortex-A8 and the ARM Cortex-A7?

A single Cortex-A7 processor is five times more energy efficient than a Cortex-A8 processor and delivers a 50% performance improvement, at a fifth of the size of the latter.

What is the Cortex-A77 architecture?

The Cortex-A77 is a 4-wide decode out-of-order superscalar design with a new 1.5K macro-op (Mop) cache. It can fetch 4 instructions and 6 Mops per cycle, and rename and dispatch 6 Mops and 13 μops per cycle. The out-of-order window size has been increased to 160 entries.

What is the fastest Arm Cortex processor?

Cortex-M85 is Arm's fastest core for standalone microcontrollers and MCU-like subsystems. Its integer and floating-point performance eclipses that of Cortex-M7, and it adds the Helium vector processing extensions, which are compatible with Cortex-M55 but faster. The M85 delivers 20% more AI throughput than the M55.

Which Cortex is best?

The Cortex-R4 is a good example for automotive applications: it runs at a clock frequency of up to 600MHz and has an 8-stage, dual-issue pipeline with a low-latency interrupt system that can interrupt multi-cycle operations to serve an incoming interrupt.

Which is the fastest Arm Cortex?

The Cortex-X4 is Arm's highest-performing core to date, featuring an anticipated core clock speed of 3.4 GHz and an increased L2 cache per core, doubling its capacity to 2 MB compared to last year's 1 MB Cortex-X3.

Article information

Author: Van Hayes