ARM processor

ARM architectures
Designer ARM Holdings
Bits 32-bit or 64-bit
Introduced 1985
Design RISC
Type Register-Register
Branching Condition code
Open Proprietary
64/32-bit architecture
Introduced 2011
Version ARMv8-A
Encoding AArch64/A64 and AArch32/A32 use 32-bit instructions, T32 (Thumb2) uses mixed 16- and 32-bit instructions. ARMv7 user-space compatibility[1]
Endianness Bi (Little as default)
Extensions All mandatory: Thumb-2, NEON, Jazelle, VFPv4-D16, VFPv4
Registers
General purpose 31x 64-bit integer registers[1] plus PC and SP / ELR / SPSR for exception levels
Floating point 32× 128-bit registers,[1] scalar 32 and 64-bit FP, SIMD 64 and 128-bit FP and integer
32-bit architectures (Cortex)
Version ARMv8-R, ARMv7-A, ARMv7-R, ARMv7E-M, ARMv7-M, ARMv6-M
Encoding 32-bit except Thumb2 extensions use mixed 16- and 32-bit instructions.
Endianness Bi (Little as default)
Extensions Thumb-2 (mandatory since ARMv7), NEON, Jazelle, FPv4-SP
Registers
General purpose 16x 32-bit integer registers including PC and SP
Floating point Up to 32× 64-bit registers,[2] SIMD/floating-point (optional)
32-bit architectures (legacy)
Version ARMv6, ARMv5, ARMv4T
Encoding 32-bit except Thumb extension uses mixed 16- and 32-bit instructions.
Endianness Bi (Little as default)
Extensions Thumb, Jazelle
Registers
General purpose 16x 32-bit integer registers including PC and SP

ARM is a family of instruction set architectures for computer processors based on a reduced instruction set computing (RISC) architecture developed by British company ARM Holdings. Using a RISC-based approach to computer design, ARM processors require significantly fewer transistors than processors that would typically be found in a traditional computer. The benefits of this approach are reduced cost, heat and power usage compared to more complex chip designs, traits which are desirable for light, portable, battery-powered devices (which in recent years have included smartphones, laptops, tablets, and notepad computers) and other embedded systems. Alternatively, the use of a simpler design allows more efficient multi-core CPUs and higher core counts at lower cost, allowing higher levels of processing power and improved energy efficiency for servers and supercomputers.[3][4][5]

Although it develops the instruction set architectures for them, ARM Holdings does not manufacture ARM-based products itself; the company licenses chip designs and the ARM instruction set architectures to third parties, allowing them to design their own products implementing one of those architectures, including systems-on-chips (SoC) incorporating memory, interfaces, radios, etc., produced by companies such as Apple, Nvidia, Qualcomm, Samsung Electronics, and Texas Instruments. ARM periodically releases updates to its cores: currently the widely used Cortex cores, older "Classic" cores, and specialized SecurCore cores. Variants are available for each of these to include or exclude optional capabilities. ARM's current cores have a 32-bit address space and 32-bit arithmetic, with 32-bit-wide instructions, but accommodate 16-bit-wide instructions for economy and can also handle Java bytecodes, which use 32-bit addresses. The recently introduced ARMv8-A architecture adds support for a 64-bit address space and 64-bit arithmetic; it was first implemented in the Apple A7 in the iPhone 5S.

In 2005, about 98% of all mobile phones sold used at least one ARM processor.[6] Due to its low power consumption, the ARM architecture became very popular, and as of 2013 it is the most widely used architecture in mobile devices and the most popular 32-bit architecture in embedded systems.[7] According to ARM Holdings, in 2010 alone, producers of chips based on ARM architectures reported shipments of 6.1 billion ARM-based processors, representing 95% of smartphones, 35% of digital televisions and set-top boxes and 10% of mobile computers. It is the most widely used 32-bit instruction set architecture in terms of quantity produced.[8][9]

History


ARM was first developed in the 1980s by the British computer manufacturer Acorn Computers for use in its personal computers; the first ARM-based products were co-processor modules for the BBC Micro series of computers. After achieving success with the BBC Micro computer, Acorn Computers considered how to move on from the relatively simple MOS Technology 6502 processor to address business markets like the one that would soon be dominated by the IBM PC, launched in 1981. The Acorn Business Computer (ABC) plan required a number of second processors to be made to work with the BBC Micro platform, but processors such as the Motorola 68000 and National Semiconductor 32016 were considered to be unsuitable, and the 6502 was not powerful enough for a graphics-based user interface.[10]

After testing all of the available processors and finding them lacking, Acorn decided that it needed a new architecture. Inspired by white papers on the Berkeley RISC project, Acorn considered designing its own processor.[11] A visit to the Western Design Center in Phoenix, where the 6502 was being updated by what was effectively a single-person company, showed Acorn engineers Steve Furber and Sophie Wilson they did not need massive resources and state-of-the-art research and development facilities.[12]

Wilson developed the instruction set, writing a simulation of the processor in BBC Basic that ran on a BBC Micro with a second 6502 processor. This convinced the Acorn engineers that they were on the right track. Wilson approached Acorn's CEO, Hermann Hauser, and requested more resources. Once approval was given, a small team was assembled to implement Wilson's model in hardware.

Acorn RISC Machine: ARM2

The official Acorn RISC Machine project started in October 1983. VLSI Technology was chosen as the "silicon partner", as they were a source of ROMs and custom chips for Acorn. The design was led by Wilson and Furber, and was consciously implemented with an efficiency ethos similar to that of the 6502.[13] A key design goal was achieving low-latency input/output (interrupt) handling like the 6502. The 6502's memory access architecture had allowed developers to produce fast machines without using costly direct memory access hardware. VLSI produced the first ARM silicon on 26 April 1985; it worked the first time, and was known as ARM1 by April 1985.[3] The first production systems, named ARM2, were available the following year.

The first practical application of the ARM was as a second processor for the BBC Micro, where it saw use developing the simulation software to finish development of the support chips (VIDC, IOC, MEMC), and to speed up the CAD software used in ARM2 development. Wilson subsequently rewrote BBC Basic in ARM assembly language, and the in-depth knowledge obtained from designing the instruction set enabled the code to be very dense, making ARM BBC Basic an extremely good test for any ARM emulator. The original aim of a principally ARM-based computer was achieved in 1987 with the release of the Acorn Archimedes.[14]

In 1992, Acorn once more won the Queen's Award for Technology for the ARM.

The ARM2 featured a 32-bit data bus, 26-bit address space and 27 32-bit registers. 8 bits from the program counter register were available for other purposes; the top 6 bits (available because of the 26-bit address space), served as status flags, and the bottom 2 bits (available because the program counter was always word-aligned), were used for setting modes. Although the address bus was extended to 32 bits in the ARM6, program code still had to lie within the first 64 megabytes of memory in 26-bit compatibility mode, due to the reserved bits for the status flags.[15] The ARM2 had a transistor count of just 30,000, compared to Motorola's six-year-older 68000 model with 68,000.[16] Much of this simplicity came from the lack of microcode (which represents about one-quarter to one-third of the 68000) and from (like most CPUs of the day) not including any cache. This simplicity enabled low power consumption, yet better performance than the Intel 80286. A successor, ARM3, was produced with a 4 KB cache, which further improved performance.[17]
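
The packing of status bits into the program counter described above can be illustrated with a short C sketch. It is only a model under stated assumptions: the macro and function names are hypothetical, and the code relies solely on the split given in the text (top 6 bits for status flags, bottom 2 bits for the mode, and a word-aligned address in between).

```c
#include <stdint.h>

/* Illustrative model of R15 on a 26-bit ARM such as the ARM2.
   All names here are hypothetical; only the bit split comes from the text. */
#define R15_FLAGS_SHIFT 26u          /* bits 31..26: status flags        */
#define R15_PC_MASK     0x03FFFFFCu  /* bits 25..2:  word-aligned address */
#define R15_MODE_MASK   0x00000003u  /* bits 1..0:   processor mode       */

/* Byte address of the instruction; always a multiple of 4. */
static uint32_t r15_pc(uint32_t r15)    { return r15 & R15_PC_MASK; }

/* The six status flag bits, right-aligned. */
static uint32_t r15_flags(uint32_t r15) { return r15 >> R15_FLAGS_SHIFT; }

/* The two mode bits. */
static uint32_t r15_mode(uint32_t r15)  { return r15 & R15_MODE_MASK; }

/* Combine the three fields into one 32-bit register value. */
static uint32_t r15_pack(uint32_t flags, uint32_t pc, uint32_t mode)
{
    return ((flags & 0x3Fu) << R15_FLAGS_SHIFT)
         | (pc & R15_PC_MASK)
         | (mode & R15_MODE_MASK);
}
```

The highest address that fits in the address field is 0x03FFFFFC, the last word below 2^26 bytes, which is where the 64-megabyte limit on program code comes from.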

Apple, DEC, Intel, Marvell: ARM6, StrongARM, XScale

In the late 1980s, Apple Computer and VLSI Technology started working with Acorn on newer versions of the ARM core. In 1990, Acorn spun off the design team into a new company named Advanced RISC Machines Ltd., which became ARM Ltd when its parent company, ARM Holdings plc, floated on the London Stock Exchange and NASDAQ in 1998.[18]

The new Apple-ARM work would eventually evolve into the ARM6, first released in early 1992. Apple used the ARM6-based ARM 610 as the basis for their Apple Newton PDA. In 1994, Acorn used the ARM 610 as the main central processing unit (CPU) in their RiscPC computers. DEC licensed the ARM6 architecture and produced the StrongARM. At 233 MHz, this CPU drew only one watt (newer versions draw far less). This work was later passed to Intel as a part of a lawsuit settlement, and Intel took the opportunity to supplement their i960 line with the StrongARM. Intel later developed its own high performance implementation named XScale which it has since sold to Marvell.

Licensing


Core license

The ARM core has remained essentially the same size throughout these changes: ARM2 had 30,000 transistors, and the ARM6 grew only to 35,000. ARM's primary business is selling IP cores, which licensees use to create microcontrollers (MCUs) and CPUs based on those cores. The original design manufacturer combines the ARM core with other parts to produce a complete CPU, typically one that can be built in existing semiconductor fabs at low cost and still deliver substantial performance. The most successful implementation has been the ARM7TDMI, with hundreds of millions sold. Atmel has been an early design center for ARM7TDMI-based embedded systems.

The ARM architectures used in smartphones, PDAs and other mobile devices range from ARMv5, used in low-end devices, through ARMv6, to ARMv7 in current high-end devices. ARMv7 includes a hardware floating-point unit (FPU), with improved speed compared to software-based floating-point.

In 2009, some manufacturers introduced netbooks based on ARM architecture CPUs, in direct competition with netbooks based on Intel Atom.[19] According to analyst firm IHS iSuppli, by 2015, ARM ICs are estimated to be in 23% of all laptops.[20]

ARM Holdings offers a variety of licensing terms, varying in cost and deliverables. ARM Holdings provides to all licensees an integratable hardware description of the ARM core as well as complete software development toolset (compiler, debugger, software development kit) and the right to sell manufactured silicon containing the ARM CPU.

SoC packages integrating ARM's core designs include Nvidia Tegra's first three generations, CSR plc's Quatro family, ST-Ericsson's Nova and NovaThor, Silicon Labs's Precision32 MCU, Texas Instruments's OMAP products, Samsung's Hummingbird and Exynos products, Apple's A4, A5, and A5X, and Freescale's i.MX.

Fabless licensees who wish to integrate an ARM core into their own chip design are usually only interested in acquiring a ready-to-manufacture verified IP core. For these customers, ARM delivers a gate netlist description of the chosen ARM core, along with an abstracted simulation model and test programs to aid design integration and verification. More ambitious customers, including integrated device manufacturers (IDM) and foundry operators, choose to acquire the processor IP in synthesizable RTL (Verilog) form. With the synthesizable RTL, the customer has the ability to perform architectural level optimisations and extensions. This allows the designer to achieve exotic design goals not otherwise possible with an unmodified netlist (high clock speed, very low power consumption, instruction set extensions, etc.). While ARM does not grant the licensee the right to resell the ARM architecture itself, licensees may freely sell manufactured products such as chip devices, evaluation boards, and complete systems. Merchant foundries can be a special case; not only are they allowed to sell finished silicon containing ARM cores, they generally hold the right to re-manufacture ARM cores for other customers.

ARM prices its IP based on perceived value; lower performing ARM cores typically have lower licence costs than higher performing cores. In implementation terms, a synthesizable core costs more than a hard macro (blackbox) core. Complicating price matters, a merchant foundry which holds an ARM licence, such as Samsung or Fujitsu, can offer reduced licensing costs to its fab customers. In exchange for acquiring the ARM core through the foundry's in-house design services, the customer can reduce or eliminate payment of ARM's upfront licence fee. Compared to dedicated semiconductor foundries (such as TSMC and UMC) without in-house design services, Fujitsu and Samsung charge two to three times more per manufactured wafer. For low to mid volume applications, a design service foundry offers lower overall pricing (through subsidisation of the licence fee). For high volume mass-produced parts, the long term cost reduction achievable through lower wafer pricing reduces the impact of ARM's NRE (Non-Recurring Engineering) costs, making the dedicated foundry a better choice.

Architectural licence

Companies can also obtain an ARM architectural licence for designing their own CPU cores using the ARM instruction set.

Cores

Main article: List of ARM cores
Architecture | Bit width | Cores designed by ARM Holdings | Cores designed by 3rd parties | Cortex profile | References
ARMv1 | 32/26 | ARM1 | - | - | -
ARMv2 | 32/26 | ARM2, ARM3 | Amber | - | -
ARMv3 | 32 | ARM6, ARM7 | - | - | -
ARMv4 | 32 | ARM8 | StrongARM, FA526 | - | -
ARMv4T | 32 | ARM7TDMI, ARM9TDMI | - | - | -
ARMv5 | 32 | ARM7EJ, ARM9E, ARM10E | XScale, FA626TE, Feroceon, PJ1/Mohawk | - | -
ARMv6 | 32 | ARM11 | - | - | -
ARMv6-M | 32 | ARM Cortex-M0, ARM Cortex-M0+, ARM Cortex-M1 | - | Microcontroller | -
ARMv7-M | 32 | ARM Cortex-M3 | - | Microcontroller | -
ARMv7E-M | 32 | ARM Cortex-M4 | - | Microcontroller | -
ARMv7-R | 32 | ARM Cortex-R4, ARM Cortex-R5, ARM Cortex-R7 | - | Real-time | -
ARMv7-A | 32 | ARM Cortex-A5, ARM Cortex-A7, ARM Cortex-A8, ARM Cortex-A9, ARM Cortex-A12, ARM Cortex-A15 | Krait, Scorpion, PJ4/Sheeva, Apple A6 / A6X (Swift) | Application | -
ARMv8-A | 64/32 | ARM Cortex-A53, ARM Cortex-A57[21] | X-Gene, Denver, Apple A7 (Cyclone) | Application | [22][23]
ARMv8-R | 32 | No announcements yet | - | Real-time | [24][25]

A list of vendors who implement ARM cores in their design (application specific standard products (ASSP), microprocessor and microcontrollers) is provided by ARM.[26]

Example applications of ARM cores

Main article: List of applications of ARM cores

ARM cores are used in a number of products, particularly PDAs and smartphones. Some computing examples are the Microsoft Surface, Apple's iPad and ASUS Eee Pad Transformer. Others include Apple's iPhone smartphone and iPod portable media player, Canon PowerShot A470 digital camera, Nintendo DS handheld game console and TomTom turn-by-turn navigation system.

In 2005, ARM took part in the development of Manchester University's computer, SpiNNaker, which used ARM cores to simulate the human brain.[27]

ARM chips are also used in Raspberry Pi, BeagleBoard, BeagleBone, PandaBoard and other single-board computers, because they are very small, inexpensive and consume very little power.

32-bit architecture

From 1995, the ARM Architecture Reference Manual has been the primary source of documentation on the ARM processor architecture and instruction set, distinguishing interfaces that all ARM processors are required to support (such as instruction semantics) from implementation details that may vary. The architecture has evolved over time, and starting with the Cortex series of cores, three "profiles" are defined:

  • "Application" profile: Cortex-A series
  • "Real-time" profile: Cortex-R series
  • "Microcontroller" profile: Cortex-M series.

Profiles are allowed to subset the architecture. For example, the ARMv6-M profile (used by the Cortex-M0, M0+ and M1) is a subset of the ARMv7-M profile and supports fewer instructions.

CPU modes

The 32-bit ARM architecture specifies several CPU modes, depending on architecture. At any moment in time, the CPU can be in only one mode, but it can switch modes due to external events (interrupts) or programmatically.[28]

User mode
The only non-privileged mode.
Fast Interrupt mode
A privileged mode that is entered whenever the processor accepts an FIQ interrupt.
Interrupt mode
A privileged mode that is entered whenever the processor accepts an IRQ interrupt.
Supervisor (svc) mode
A privileged mode entered whenever the CPU is reset or when a SWI instruction is executed.
Abort mode
A privileged mode that is entered whenever a prefetch abort or data abort exception occurs.
Undefined mode
A privileged mode that is entered whenever an undefined instruction exception occurs.
System mode (ARMv4 and above)
The only privileged mode that is not entered by an exception. It can only be entered by executing an instruction that explicitly writes to the mode bits of the CPSR.
MON Mode (Security Extensions only)
A monitor mode introduced to support the TrustZone extension in ARM cores.
HYP a.k.a. PL2 Mode (ARMv7)
A hypervisor mode, part of the Virtualization Extensions, introduced in the ARMv7 architecture (the latest as of 2012).[29]
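
As an illustration, the modes listed above can be modelled in C by the value of the five mode bits in the CPSR. The numeric encodings below are taken from the ARM Architecture Reference Manual rather than from the text above, and the function names are hypothetical; treat this as a sketch, not as part of any ARM-provided API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Encodings of the CPSR mode field M[4:0], per the ARM Architecture
   Reference Manual. The enum and function names are illustrative only. */
enum arm_mode {
    MODE_USR = 0x10, MODE_FIQ = 0x11, MODE_IRQ = 0x12, MODE_SVC = 0x13,
    MODE_MON = 0x16, MODE_ABT = 0x17, MODE_HYP = 0x1A, MODE_UND = 0x1B,
    MODE_SYS = 0x1F
};

/* Short name of the mode selected by a CPSR value, or NULL if the
   mode field holds a reserved encoding. */
static const char *arm_mode_name(uint32_t cpsr)
{
    switch (cpsr & 0x1Fu) {
    case MODE_USR: return "usr";
    case MODE_FIQ: return "fiq";
    case MODE_IRQ: return "irq";
    case MODE_SVC: return "svc";
    case MODE_MON: return "mon";
    case MODE_ABT: return "abt";
    case MODE_HYP: return "hyp";
    case MODE_UND: return "und";
    case MODE_SYS: return "sys";
    default:       return NULL;
    }
}

/* User mode is the only non-privileged mode; reserved encodings are
   reported as not privileged in this sketch. */
static bool arm_mode_is_privileged(uint32_t cpsr)
{
    return arm_mode_name(cpsr) != NULL && (cpsr & 0x1Fu) != MODE_USR;
}
```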

Instruction set

The original ARM implementation was hardwired without microcode, like the much simpler 8-bit 6502 processor used in prior Acorn microcomputers.

The 32-bit ARM architecture (and the 64-bit architecture for the most part, see below for exceptions) includes the following RISC features:

  • Load/store architecture.
  • No support for unaligned memory accesses in the original version of the architecture. ARMv6 and later, except some microcontroller versions, support unaligned accesses for half-word and single-word load/store instructions with some limitations, such as no guaranteed atomicity.[30][31]
  • Uniform 16× 32-bit register file (including the Program Counter, Stack Pointer and the Link Register).
  • Fixed instruction width of 32 bits to ease decoding and pipelining, at the cost of decreased code density. Later, the Thumb instruction set added 16-bit instructions and increased code density.
  • Mostly single clock-cycle execution.

To compensate for the simpler design, compared with processors like the Intel 80286 and Motorola 68020, some additional design features were used:

  • Conditional execution of most instructions, reducing branch overhead and compensating for the lack of a branch predictor.
  • Arithmetic instructions alter condition codes only when desired.
  • 32-bit barrel shifter which can be used without performance penalty with most arithmetic instructions and address calculations.
  • Powerful indexed addressing modes.
  • A link register for fast leaf function calls.
  • Simple, but fast, 2-priority-level interrupt subsystem with switched register banks.

Arithmetic instructions

The ARM supports add, subtract, and multiply instructions. The integer divide instructions are only implemented by ARM cores based on the following ARM architectures:

  • The ARMv7-M and ARMv7E-M architectures always include divide instructions.[32]
  • The ARMv7-R architecture always includes divide instructions in the Thumb instruction set, but only optionally in the ARM instruction set.[33]
  • The ARMv7-A architecture optionally includes the divide instructions. The instructions might not be implemented, or implemented only in the Thumb instruction set, or implemented in both the Thumb and ARM instruction sets, or implemented if the Virtualization Extensions are included.[33]
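
On cores without divide instructions, division has to be carried out in software. The following C sketch shows one textbook way to do this, shift-and-subtract (restoring) division. It is an assumption-laden illustration, not the routine used by any particular ARM runtime library; the names `soft_udiv` and `udiv_result` and the divide-by-zero policy are hypothetical.

```c
#include <stdint.h>

/* Quotient and remainder of an unsigned 32-bit division. */
typedef struct {
    uint32_t quot;
    uint32_t rem;
} udiv_result;

/* Shift-and-subtract division using only shifts, compares and subtracts,
   i.e. operations available on every ARM core.
   Hypothetical policy: dividing by zero returns quotient 0, remainder n. */
static udiv_result soft_udiv(uint32_t n, uint32_t d)
{
    udiv_result out = { 0u, n };
    if (d == 0u)
        return out;

    uint64_t rem = 0;   /* 64-bit so that (rem << 1) cannot overflow */
    uint32_t quot = 0;
    for (int bit = 31; bit >= 0; --bit) {
        /* Bring down the next bit of the dividend. */
        rem = (rem << 1) | ((n >> bit) & 1u);
        if (rem >= d) {
            rem -= d;
            quot |= 1u << bit;
        }
    }
    out.quot = quot;
    out.rem = (uint32_t)rem;
    return out;
}
```

The loop always runs 32 iterations, which hints at why a hardware divide instruction is attractive where it is available.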

Registers

Registers R0 through R7 are the same across all CPU modes; they are never banked.

R13 and R14 are banked across all privileged CPU modes except system mode. That is, each mode that can be entered because of an exception has its own R13 and R14. These registers generally contain the stack pointer and the return address from function calls, respectively.

Registers across CPU modes
usr sys svc abt und irq fiq
R0
R1
R2
R3
R4
R5
R6
R7
R8 R8_fiq
R9 R9_fiq
R10 R10_fiq
R11 R11_fiq
R12 R12_fiq
R13 R13_svc R13_abt R13_und R13_irq R13_fiq
R14 R14_svc R14_abt R14_und R14_irq R14_fiq
R15
CPSR
SPSR_svc SPSR_abt SPSR_und SPSR_irq SPSR_fiq
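
The banking scheme in the table above can be expressed as a small lookup. The C sketch below covers only the seven modes shown in the table; the enum and function names are hypothetical and chosen for illustration.

```c
#include <stdint.h>

/* Physical register banks and CPU modes from the table above.
   All identifiers are illustrative. */
enum bank { BANK_USR, BANK_SVC, BANK_ABT, BANK_UND, BANK_IRQ, BANK_FIQ };
enum mode { M_USR, M_SYS, M_SVC, M_ABT, M_UND, M_IRQ, M_FIQ };

/* Which physical copy of Rn (n = 0..15) is visible in mode m? */
static enum bank reg_bank(enum mode m, unsigned n)
{
    if (n >= 8 && n <= 12)              /* R8-R12: banked for FIQ only */
        return m == M_FIQ ? BANK_FIQ : BANK_USR;
    if (n == 13 || n == 14) {           /* SP and LR: one copy per exception mode */
        switch (m) {
        case M_SVC: return BANK_SVC;
        case M_ABT: return BANK_ABT;
        case M_UND: return BANK_UND;
        case M_IRQ: return BANK_IRQ;
        case M_FIQ: return BANK_FIQ;
        default:    return BANK_USR;    /* usr and sys share the same copy */
        }
    }
    return BANK_USR;                    /* R0-R7 and R15 are never banked */
}

/* Count the distinct physical general-purpose registers implied by the table. */
static unsigned count_physical_gprs(void)
{
    unsigned count = 0;
    for (unsigned n = 0; n < 16; ++n) {
        unsigned seen = 0;              /* bitmask of banks used by Rn */
        for (int m = M_USR; m <= M_FIQ; ++m)
            seen |= 1u << reg_bank((enum mode)m, n);
        for (; seen != 0; seen &= seen - 1)
            ++count;                    /* one physical register per bank */
    }
    return count;
}
```

Because FIQ mode has private copies of R8 through R14, a fast interrupt handler can use those registers without first saving them, which is what makes the FIQ path fast.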

Aliases:

  • R13 is also referred to as SP, the Stack Pointer.
  • R14 is also referred to as LR, the Link Register.
  • R15 is also referred to as PC, the Program Counter.

The CPSR has the following 32 bits:[34]

  • M (bits 0-4) are the processor mode bits.
  • T (bit 5) is the Thumb state bit.
  • F (bit 6) is the FIQ disable bit.
  • I (bit 7) is the IRQ disable bit.
  • A (bit 8) is the imprecise data abort disable bit.
  • E (bit 9) is the data endianness bit.
  • IT (bits 10-15 and 25-26) are the if-then state bits.
  • GE (bits 16-19) are the greater-than-or-equal-to bits.
  • DNM (bits 20-23) are the do-not-modify bits.
  • J (bit 24) is the Java state bit.
  • Q (bit 27) is the sticky overflow bit.
  • V (bit 28) is the overflow bit.
  • C (bit 29) is the carry/borrow/extend bit.
  • Z (bit 30) is the zero bit.
  • N (bit 31) is the negative/less than bit.
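
The bit positions listed above translate directly into shift-and-mask code. The following C sketch decodes a raw CPSR value into its fields; the struct and function names are hypothetical. One detail is not stated in the list and is taken from the ARM Architecture Reference Manual: CPSR bits 15-10 hold IT[7:2] and bits 26-25 hold IT[1:0].

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of a CPSR value. Field names follow the list above;
   the struct itself is illustrative. */
struct cpsr_fields {
    bool n, z, c, v, q;     /* condition flags and sticky overflow */
    bool j, t;              /* Java and Thumb state bits */
    bool e, a, i, f;        /* endianness and the three disable bits */
    uint8_t ge;             /* GE[3:0] */
    uint8_t it;             /* IT[7:0], reassembled from two places */
    uint8_t mode;           /* M[4:0] */
};

/* Extract bits hi..lo (inclusive) of x; used here for fields of at most 6 bits. */
static uint32_t bits(uint32_t x, unsigned hi, unsigned lo)
{
    return (x >> lo) & ((1u << (hi - lo + 1u)) - 1u);
}

static struct cpsr_fields cpsr_decode(uint32_t cpsr)
{
    struct cpsr_fields out;
    out.n = bits(cpsr, 31, 31) != 0;
    out.z = bits(cpsr, 30, 30) != 0;
    out.c = bits(cpsr, 29, 29) != 0;
    out.v = bits(cpsr, 28, 28) != 0;
    out.q = bits(cpsr, 27, 27) != 0;
    out.j = bits(cpsr, 24, 24) != 0;
    out.ge = (uint8_t)bits(cpsr, 19, 16);
    /* Assumption from the ARM ARM: bits 15-10 are IT[7:2], bits 26-25 are IT[1:0]. */
    out.it = (uint8_t)((bits(cpsr, 15, 10) << 2) | bits(cpsr, 26, 25));
    out.e = bits(cpsr, 9, 9) != 0;
    out.a = bits(cpsr, 8, 8) != 0;
    out.i = bits(cpsr, 7, 7) != 0;
    out.f = bits(cpsr, 6, 6) != 0;
    out.t = bits(cpsr, 5, 5) != 0;
    out.mode = (uint8_t)bits(cpsr, 4, 0);
    return out;
}
```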

Conditional execution

Almost every ARM instruction has a conditional execution feature called predication, which is implemented with a 4-bit condition code selector (the predicate). To allow for unconditional execution, one of the four-bit codes causes the instruction to be always executed. Most other CPU architectures only have condition codes on branch instructions.

Though the predicate takes up 4 of the 32 bits in an instruction code, and thus cuts down significantly on the encoding bits available for displacements in memory access instructions, it avoids branch instructions when generating code for small if statements. Apart from eliminating the branch instructions themselves, this preserves the fetch/decode/execute pipeline at the cost of only one cycle per skipped instruction.

The standard example of conditional execution is the subtraction-based Euclidean algorithm:

In the C programming language, the loop is:

    while (i != j)
    {
       if (i > j)
       {
           i -= j;
       }
       else  /* i < j (since i != j in while condition) */
       {
           j -= i;
       }
    }

In ARM assembly, the loop is:

loop:   CMP  Ri, Rj         ; set condition "NE" if (i != j),
                            ;               "GT" if (i > j),
                            ;            or "LT" if (i < j)
        SUBGT  Ri, Ri, Rj   ; if "GT" (Greater Than), i = i-j;
        SUBLT  Rj, Rj, Ri   ; if "LT" (Less Than), j = j-i;
        BNE  loop           ; if "NE" (Not Equal), then loop

which avoids the branches around the then and else clauses. If Ri and Rj are equal then neither of the SUB instructions will be executed, eliminating the need for a conditional branch to implement the while check at the top of the loop, for example had SUBLE (less than or equal) been used.
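
To make the flag behaviour of this loop concrete, the sketch below simulates the four instructions in C. It is a model only: `cmp_flags`, `cond_passes` and `gcd_predicated` are hypothetical names, the C `if` statements stand in for hardware predication, and only the three conditions used in the loop are implemented. Both inputs are assumed to be positive and below 2^31, since GT and LT are signed comparisons.

```c
#include <stdbool.h>
#include <stdint.h>

/* The four condition flags set by CMP. */
struct flags { bool n, z, c, v; };

/* CMP a, b: compute a - b, keep only the flags. */
static struct flags cmp_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    struct flags f;
    f.n = (r >> 31) != 0;                        /* result negative */
    f.z = (r == 0);                              /* result zero */
    f.c = (a >= b);                              /* no borrow occurred */
    f.v = ((((a ^ b) & (a ^ r)) >> 31) != 0);    /* signed overflow */
    return f;
}

enum cond { COND_NE, COND_GT, COND_LT };

/* Does an instruction with condition c execute under flags f? */
static bool cond_passes(enum cond c, struct flags f)
{
    switch (c) {
    case COND_NE: return !f.z;
    case COND_GT: return !f.z && (f.n == f.v);
    case COND_LT: return f.n != f.v;
    }
    return false;
}

/* Subtraction-based Euclid, one C statement per ARM instruction. */
static uint32_t gcd_predicated(uint32_t i, uint32_t j)
{
    struct flags f;
    do {
        f = cmp_flags(i, j);                     /* CMP   Ri, Rj     */
        if (cond_passes(COND_GT, f)) i -= j;     /* SUBGT Ri, Ri, Rj */
        if (cond_passes(COND_LT, f)) j -= i;     /* SUBLT Rj, Rj, Ri */
    } while (cond_passes(COND_NE, f));           /* BNE   loop       */
    return i;
}
```

The SUB instructions carry no S suffix, so they leave the flags from CMP untouched; that is why a single comparison can steer both subtractions and the final branch.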

One of the ways that Thumb code provides a more dense encoding is to remove the four bit selector from non-branch instructions.

Other features

Another feature of the instruction set is the ability to fold shifts and rotates into the "data processing" (arithmetic, logical, and register-register move) instructions, so that, for example, the C statement

a += (j << 2);

could be rendered as a single-word, single-cycle instruction:[35]

ADD  Ra, Ra, Rj, LSL #2

This results in the typical ARM program being denser than expected with fewer memory accesses; thus the pipeline is used more efficiently.
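
The effect of folding a shift into a data-processing instruction can be sketched in C as follows. This is a simplified model with hypothetical names: it handles only shift amounts from 0 to 31, treats an amount of 0 as "no shift", and ignores carry-out and the special encodings that real instructions attach to some zero shift amounts.

```c
#include <stdint.h>

enum shift_op { SH_LSL, SH_LSR, SH_ASR, SH_ROR };

/* Value produced by the barrel shifter for the second operand.
   Simplification: amount is reduced to 0..31, and 0 means "unchanged". */
static uint32_t shift_operand(uint32_t rm, enum shift_op op, unsigned amount)
{
    amount &= 31u;
    if (amount == 0)
        return rm;
    switch (op) {
    case SH_LSL: return rm << amount;
    case SH_LSR: return rm >> amount;
    case SH_ASR: /* replicate the sign bit into the vacated positions */
        return (rm >> amount)
             | ((rm & 0x80000000u) ? ~(0xFFFFFFFFu >> amount) : 0u);
    case SH_ROR: return (rm >> amount) | (rm << (32u - amount));
    }
    return rm;
}

/* ADD Rd, Rn, Rm, <shift> #amount : shift and add in one instruction. */
static uint32_t add_shifted(uint32_t rn, uint32_t rm,
                            enum shift_op op, unsigned amount)
{
    return rn + shift_operand(rm, op, amount);
}
```

With a = 10 and j = 3, `add_shifted(10, 3, SH_LSL, 2)` mirrors the instruction ADD Ra, Ra, Rj, LSL #2 and yields 22.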

The ARM processor also has features rarely seen in other RISC architectures, such as PC-relative addressing (indeed, on the 32-bit[1] ARM the PC is one of its 16 registers) and pre- and post-increment addressing modes.

The ARM instruction set has increased over time. Some early ARM processors (before ARM7TDMI), for example, have no instruction to store a two-byte quantity.

Pipelines and other implementation issues

The ARM7 and earlier implementations have a three-stage pipeline; the stages being fetch, decode and execute. Higher-performance designs, such as the ARM9, have deeper pipelines: Cortex-A8 has thirteen stages. Additional implementation changes for higher performance include a faster adder and more extensive branch prediction logic. The difference between the ARM7DI and ARM7DMI cores, for example, was an improved multiplier; hence the added "M".

Coprocessors

The ARM architecture provides a non-intrusive way of extending the instruction set using "coprocessors" which can be addressed using MCR, MRC, MRRC, MCRR, and similar instructions. The coprocessor space is divided logically into 16 coprocessors with numbers from 0 to 15, coprocessor 15 (cp15) being reserved for some typical control functions like managing the caches and MMU operation on processors that have one.
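
As an illustration of how the MRC and MCR instructions address this coprocessor space, the C sketch below assembles their 32-bit ARM encoding. The field layout is taken from the ARM Architecture Reference Manual, not from the text above; the function name and parameter names are hypothetical, and the condition field is fixed to "always" for simplicity.

```c
#include <stdint.h>

/* Encode an unconditional MRC (to_arm != 0) or MCR (to_arm == 0).
   Field layout per the ARM Architecture Reference Manual:
     31-28 cond | 27-24 1110 | 23-21 opc1 | 20 L | 19-16 CRn |
     15-12 Rt   | 11-8 coproc | 7-5 opc2  | 4  1 | 3-0  CRm
   L is 1 for a coprocessor-to-ARM transfer (MRC). */
static uint32_t encode_cp_transfer(int to_arm, unsigned coproc, unsigned opc1,
                                   unsigned rt, unsigned crn, unsigned crm,
                                   unsigned opc2)
{
    return (0xEu << 28)                     /* condition: always */
         | (0xEu << 24)
         | ((opc1 & 7u) << 21)
         | ((to_arm ? 1u : 0u) << 20)
         | ((crn & 15u) << 16)
         | ((rt & 15u) << 12)
         | ((coproc & 15u) << 8)            /* coprocessor number 0..15 */
         | ((opc2 & 7u) << 5)
         | (1u << 4)
         | (crm & 15u);
}
```

For example, `encode_cp_transfer(1, 15, 0, 0, 0, 0, 0)` corresponds to MRC p15, 0, r0, c0, c0, 0 and produces 0xEE100F10.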

In ARM-based machines, peripheral devices are usually attached to the processor by mapping their physical registers into ARM memory space or into the coprocessor space or connecting to another device (a bus) which in turn attaches to the processor. Coprocessor accesses have lower latency so some peripherals, for example an XScale interrupt controller, are designed to be accessible in both ways through memory and through coprocessors.

In other cases, chip designers only integrate hardware using the coprocessor mechanism. For example, an image processing engine might be a small ARM7TDMI core combined with a coprocessor that has specialised operations to support a specific set of HDTV transcoding primitives.

Debugging

All modern ARM processors include hardware debugging facilities, allowing software debuggers to perform operations such as halting, stepping, and breakpointing of code starting from reset. These facilities are built using JTAG support, though some newer cores optionally support ARM's own two-wire "SWD" protocol. In ARM7TDMI cores, the "D" represented JTAG debug support, and the "I" represented presence of an "EmbeddedICE" debug module. For ARM7 and ARM9 core generations, EmbeddedICE over JTAG was a de facto debug standard, although it was not architecturally guaranteed.

The ARMv7 architecture defines basic debug facilities at an architectural level. These include breakpoints, watchpoints and instruction execution in a "Debug Mode"; similar facilities were also available with EmbeddedICE. Both "halt mode" and "monitor" mode debugging are supported. The actual transport mechanism used to access the debug facilities is not architecturally specified, but implementations generally include JTAG support.

There is a separate ARM "CoreSight" debug architecture, which is not architecturally required by ARMv7 processors.

Tools

The ARM architecture is supported by a set of development tools such as Emprog ThunderBench for ARM. Such tools allow development engineers to program the ARM architecture device using a high level language like C.[36]

DSP enhancement instructions

To improve the ARM architecture for digital signal processing and multimedia applications, DSP instructions were added to the set.[37] These are signified by an "E" in the name of the ARMv5TE and ARMv5TEJ architectures. E-variants also imply T, D, M and I.

The new instructions are common in digital signal processor architectures. They include variations on signed multiply–accumulate, saturated add and subtract, and count leading zeros.
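
Two of these operations, saturated add and count leading zeros, are easy to describe in portable C. The sketch below models their arithmetic only; the function names are hypothetical, and the `q` parameter stands in for the sticky overflow (Q) flag, which is set on saturation and never cleared by the operation itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Saturating signed 32-bit add: clamp instead of wrapping around.
   *q models the sticky Q flag: it is set on saturation and left
   untouched otherwise. */
static int32_t qadd(int32_t a, int32_t b, bool *q)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) { *q = true; return INT32_MAX; }
    if (sum < INT32_MIN) { *q = true; return INT32_MIN; }
    return (int32_t)sum;
}

/* Count leading zero bits; an input of zero gives 32. */
static unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        ++n;
    }
    return n;
}
```

Saturation matters in signal processing because a clipped sample is a far smaller error than one that wraps from the largest positive value to a large negative one.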

SIMD extensions for multimedia

SIMD extensions for multimedia were introduced in the ARMv6 architecture.[38]

Jazelle

Main article: Jazelle

Jazelle DBX (Direct Bytecode eXecution) is a technique that allows Java bytecode to be executed directly in the ARM architecture as a third execution state (and instruction set) alongside the existing ARM and Thumb states. Support for this state is signified by the "J" in the ARMv5TEJ architecture, and in ARM9EJ-S and ARM7EJ-S core names. Support for this state is required starting in ARMv6 (except for the ARMv7-M profile), although newer cores only include a trivial implementation that provides no hardware acceleration.

Thumb

To improve compiled code density, processors since the ARM7TDMI (released in 1994[39]) have featured the Thumb instruction set, which has its own state. (The "T" in "TDMI" indicates the Thumb feature.) When in this state, the processor executes the Thumb instruction set, a compact 16-bit encoding for a subset of the ARM instruction set.[40] Most of the Thumb instructions are directly mapped to normal ARM instructions. The space saving comes from making some of the instruction operands implicit and limiting the number of possibilities compared to the ARM instructions executed in the ARM instruction set state.

In Thumb, the 16-bit opcodes have less functionality. For example, only branches can be conditional, and many opcodes are restricted to accessing only half of all of the CPU's general-purpose registers. The shorter opcodes give improved code density overall, even though some operations require extra instructions. In situations where the memory port or bus width is constrained to less than 32 bits, the shorter Thumb opcodes allow increased performance compared with 32-bit ARM code, as less program code may need to be loaded into the processor over the constrained memory bandwidth.

Embedded hardware, such as the Game Boy Advance, typically has a small amount of RAM accessible with a full 32-bit datapath; the majority is accessed via a 16-bit or narrower secondary datapath. In this situation, it usually makes sense to compile Thumb code and hand-optimise a few of the most CPU-intensive sections using full 32-bit ARM instructions, placing these wider instructions into the memory accessible over the 32-bit bus.

The first processor with a Thumb instruction decoder was the ARM7TDMI. All ARM9 and later families, including XScale, have included a Thumb instruction decoder.

Thumb-2

Thumb-2 technology was introduced in the ARM1156 core, announced in 2003. Thumb-2 extends the limited 16-bit instruction set of Thumb with additional 32-bit instructions to give the instruction set more breadth, thus producing a variable-length instruction set. A stated aim for Thumb-2 was to achieve code density similar to Thumb with performance similar to the ARM instruction set on 32-bit memory. In ARMv7 this goal can be said to have been met.

Thumb-2 extends both the ARM and Thumb instruction set with bit-field manipulation, table branches and conditional execution. A new "Unified Assembly Language" (UAL) supports generation of either Thumb-2 or ARM instructions from the same source code; versions of Thumb seen on ARMv7 processors are essentially as capable as ARM code (including the ability to write interrupt handlers). This requires a bit of care, and use of a new "IT" (if-then) instruction, which permits up to four successive instructions to execute based on a tested condition. When compiling into ARM code this is ignored, but when compiling into Thumb-2 it generates an actual instruction. For example:

; if (r0 == r1)
CMP r0, r1
ITE EQ        ; ARM: no code ... Thumb: IT instruction
; then r0 = r2;
MOVEQ r0, r2  ; ARM: conditional; Thumb: condition via ITE 'T' (then)
; else r0 = r3;
MOVNE r0, r3  ; ARM: conditional; Thumb: condition via ITE 'E' (else)
; recall that the Thumb MOV instruction has no bits to encode "EQ" or "NE"
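
For readers curious how the IT instruction carries the then/else pattern, the C sketch below builds its 16-bit Thumb encoding. The layout (0xBF in the top byte, then the first condition, then a 4-bit mask) comes from the ARM Architecture Reference Manual rather than from the text above, and `encode_it` is a hypothetical helper, not part of any assembler.

```c
#include <stdint.h>

/* Encode IT{x{y{z}}} <firstcond>.
   firstcond: 4-bit condition code (EQ = 0, NE = 1, ...).
   pattern:   the letters after "IT", e.g. "" for IT, "E" for ITE,
              "TE" for ITTE; at most three letters are used.
   Mask rule (per the ARM ARM): each extra instruction contributes
   firstcond[0] for 'T' or its inverse for 'E', starting at mask bit 3;
   a single 1 bit then terminates the list. */
static uint16_t encode_it(unsigned firstcond, const char *pattern)
{
    unsigned fc0 = firstcond & 1u;
    unsigned mask = 0;
    unsigned pos = 3;

    for (; *pattern != '\0' && pos > 0; ++pattern, --pos) {
        unsigned bit = (*pattern == 'T') ? fc0 : (fc0 ^ 1u);
        mask |= bit << pos;
    }
    mask |= 1u << pos;                      /* terminating 1 bit */

    return (uint16_t)(0xBF00u | ((firstcond & 0xFu) << 4) | mask);
}
```

Under these assumptions, the ITE EQ instruction in the example above is `encode_it(0, "E")`, which evaluates to 0xBF0C.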

All ARMv7 chips support the Thumb-2 instruction set. Other chips in the Cortex and ARM11 series support both "ARM instruction set state" and "Thumb-2 instruction set state".[41][42][43]

Thumb Execution Environment (ThumbEE)

ThumbEE, also termed Thumb-2EE, is marketed as allowing JIT compilers to output smaller compiled code without impacting performance.

New features provided by ThumbEE include automatic null pointer checks on every load and store instruction, an instruction to perform an array bounds check, access to registers r8-r15 (where the Jazelle/DBX Java VM state is held), and special instructions that call a handler.[44] Handlers are small sections of frequently called code, commonly used to implement high level languages, such as allocating memory for a new object. These changes come from repurposing a handful of opcodes, and knowing the core is in the new ThumbEE mode.

On 23 November 2011, ARM deprecated any use of the ThumbEE instruction set.[45]

Floating-point (VFP)

VFP (Vector Floating Point) technology is an FPU coprocessor extension to the ARM architecture. It provides low-cost single-precision and double-precision floating-point computation fully compliant with the ANSI/IEEE Std 754-1985 Standard for Binary Floating-Point Arithmetic. VFP provides floating-point computation suitable for a wide spectrum of applications such as PDAs, smartphones, voice compression and decompression, three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications. The VFP architecture was intended to support execution of short "vector mode" instructions but these operated on each vector element sequentially and thus did not offer the performance of true single instruction, multiple data (SIMD) vector parallelism. This vector mode was therefore removed shortly after its introduction,[46] to be replaced with the much more powerful NEON Advanced SIMD unit.

Some devices such as the ARM Cortex-A8 have a cut-down VFPLite module instead of a full VFP module, and require roughly ten times more clock cycles per float operation.[47] Other floating-point and/or SIMD coprocessors found in ARM-based processors include FPA, FPE and iwMMXt. They provide some of the same functionality as VFP but are not opcode-compatible with it.

  • VFPv1 is obsolete.
  • VFPv2 is an optional extension to the ARM instruction set in the ARMv5TE, ARMv5TEJ and ARMv6 architectures.
  • VFPv3/VFPv3-D32 is implemented on earlier ARMv7 processors (Cortex-A8 and A9) and is backwards compatible with VFPv2, except that it cannot trap floating-point exceptions. VFPv3 has 32 64-bit FPU registers as standard, adds VCVT instructions to convert between scalar, float and double, and adds an immediate mode to VMOV so that constants can be loaded into FPU registers.
  • VFPv3-D16: as above, but it has only 16 64-bit FPU registers.
  • VFPv3-F16 is uncommon; it supports IEEE754-2008 half-precision (16-bit) floating point.
  • VFPv4/VFPv4-D32 is implemented on later ARMv7 processors (Cortex-A12 and A15). VFPv4 has 32 64-bit FPU registers as standard, adds both half-precision extensions and fused multiply-accumulate instructions to the features of VFPv3.
  • VFPv4-D16: as above, but it has only 16 64-bit FPU registers. Implemented on Cortex-A5 and A7 processors.

Advanced SIMD (NEON)

The Advanced SIMD extension (aka NEON or "MPE" Media Processing Engine) is a combined 64- and 128-bit SIMD instruction set that provides standardized acceleration for media and signal processing applications. NEON is included in all Cortex-A8 devices but is optional in Cortex-A9 devices.[48] NEON can execute MP3 audio decoding on CPUs running at 10 MHz and can run the GSM adaptive multi-rate (AMR) speech codec at no more than 13 MHz. It features a comprehensive instruction set, separate register files and independent execution hardware.[49] NEON supports 8-, 16-, 32- and 64-bit integer and single-precision (32-bit) floating-point data and SIMD operations for handling audio and video processing as well as graphics and gaming processing. In NEON, the SIMD supports up to 16 operations at the same time. The NEON hardware shares the same floating-point registers as used in VFP. Devices such as the ARM Cortex-A8 and Cortex-A9 support 128-bit vectors but will execute with 64 bits at a time,[47] whereas newer Cortex-A15 devices can execute 128 bits at a time.
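
The figure of up to 16 simultaneous operations follows from dividing the vector width by the element width, as the sketch below illustrates. The helper names are hypothetical, and the lane-wise addition models only a plain wrap-around integer add, not any particular NEON instruction encoding.

```python
def lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one vector register."""
    if vector_bits % element_bits:
        raise ValueError("element width must divide the vector width")
    return vector_bits // element_bits


def passes(vector_bits: int, datapath_bits: int) -> int:
    """Passes needed when the hardware handles datapath_bits at a time."""
    return -(-vector_bits // datapath_bits)  # ceiling division


def simd_add(a, b, element_bits: int):
    """Lane-wise integer addition with wrap-around in each lane."""
    if len(a) != len(b):
        raise ValueError("both vectors need the same number of lanes")
    mask = (1 << element_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]
```

With 8-bit elements a 128-bit vector holds 16 lanes, and a 64-bit datapath needs two passes to process it, which corresponds to the Cortex-A8 and Cortex-A9 behaviour described above.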

Security extensions (TrustZone)

The Security Extensions, marketed as TrustZone Technology, are found in ARMv6KZ and later application profile architectures. They provide a low-cost alternative to adding a dedicated security core to an SoC, by providing two virtual processors backed by hardware-based access control. This enables the application core to switch between two states, referred to as worlds (to reduce confusion with other names for capability domains), in order to prevent information from leaking from the more trusted world to the less trusted world. This world switch is generally orthogonal to all other capabilities of the processor, so each world can operate independently of the other while using the same core. Memory and peripherals are then made aware of the operating world of the core and may use this to provide access control to secrets and code on the device.
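
The access-control idea can be pictured with a toy model. Everything in it is an assumption made for illustration (the `Region` type, the world names and the rule that the more trusted world may reach every region); it does not describe any actual TrustZone implementation.

```python
from dataclasses import dataclass

SECURE = "secure"          # the more trusted world
NON_SECURE = "non-secure"  # the less trusted world


@dataclass(frozen=True)
class Region:
    """A hypothetical memory or peripheral region."""
    name: str
    secure_only: bool


def access_allowed(world: str, region: Region) -> bool:
    """Toy rule: secure-only regions are reachable from the secure world only."""
    if world not in (SECURE, NON_SECURE):
        raise ValueError("unknown world: " + world)
    return world == SECURE or not region.secure_only
```

In this model a rich operating system running in the less trusted world could use shared memory but would be refused access to a region marked secure-only.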

Typical applications of TrustZone Technology are to run a rich operating system in the less trusted world, and smaller security-specialized code in the more trusted world (named TrustZone Software, a TrustZone optimised version of the Trusted Foundations Software developed by

In practice, since the specific implementation details of TrustZone are proprietary and have not been publicly disclosed for review, it is unclear what level of assurance is provided for a given threat model.

No-execute page protection

As of ARMv6, the ARM architecture supports no-execute page protection, which is referred to as XN, for eXecute Never.[54]
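
The effect of XN can be sketched as a check made on every instruction fetch. This is a simplified model with hypothetical names (`Page`, `fetch_instruction`, `PrefetchAbort`) and an assumed fixed 4 KB page size; real page-table formats and fault reporting are more involved.

```python
from dataclasses import dataclass


class PrefetchAbort(Exception):
    """Raised in this model when an instruction fetch is not permitted."""


@dataclass(frozen=True)
class Page:
    base: int         # start address of the page
    xn: bool          # True if the page is marked eXecute Never
    size: int = 4096  # assumed page size for this sketch


def fetch_instruction(pages, address: int) -> int:
    """Return the address if fetching from it is allowed, else raise."""
    for page in pages:
        if page.base <= address < page.base + page.size:
            if page.xn:
                raise PrefetchAbort("fetch from XN page at %#x" % address)
            return address
    raise PrefetchAbort("no mapping for %#x" % address)
```

Marking data pages such as the stack or heap as XN means that code injected into them cannot be executed, even if an attacker manages to redirect control flow there.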

ARMv8-R

The ARMv8-R subarchitecture, announced after ARMv8-A, shares some features with it, except that it is not 64-bit.

64-bit architecture

ARMv8-A

Announced in October 2011,[55] ARMv8-A represents a fundamental change to the ARM architecture. It adds a 64-bit architecture, named "AArch64", and a new "A64" instruction set. AArch64 provides user-space compatibility with ARMv7-A ISA, the 32-bit architecture, therein referred to as "AArch32" and the old 32-bit instruction set, now named "A32". The Thumb instruction sets are referred to as "T32" and have no 64-bit counterpart. ARMv8-A allows 32-bit applications to be executed in a 64-bit OS, and a 32-bit OS to be under the control of a 64-bit hypervisor.[1] ARM announced their Cortex-A53 and Cortex-A57 cores on 30 October 2012.[21]

To both AArch32 and AArch64, ARMv8-A makes VFPv3/v4 and advanced SIMD (NEON) standard. It also adds cryptography instructions supporting AES and SHA-1/SHA-256.

AArch64 features:

  • New instruction set, A64
    • Has 31 general-purpose 64-bit registers.
    • Has separate dedicated SP and PC.
    • Instructions are still 32 bits long and mostly the same as A32 (with LDM/STM instructions and most conditional execution dropped).
      • Has paired loads/stores (in place of LDM/STM).
    • Most instructions can take 32-bit or 64-bit arguments.
    • Addresses assumed to be 64-bit.
  • Advanced SIMD (NEON) enhanced
    • Has 32× 128-bit registers (up from 16), also accessible via VFPv4.
    • Supports double-precision floating point.
    • Fully IEEE 754 compliant.
    • AES encrypt/decrypt and SHA-1/SHA-2 hashing instructions also use these registers.
  • A new exception system
    • Fewer banked registers and modes.
  • Memory translation from 48-bit virtual addresses based on the existing LPAE, which was designed to be easily extended to 64-bit
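
The 48-bit translation listed above can be pictured by splitting a virtual address into table indices. The sketch assumes a 4 KB page size with four table levels of 9 bits each plus a 12-bit page offset (9 × 4 + 12 = 48); this layout is an assumption not stated in this article, and the function name is hypothetical.

```python
def split_va48(va: int):
    """Split a 48-bit virtual address into four table indices and an offset.

    Assumed layout: bits 47-39, 38-30, 29-21 and 20-12 index the four
    table levels; bits 11-0 are the offset within a 4 KB page.
    """
    if not 0 <= va < (1 << 48):
        raise ValueError("address does not fit in 48 bits")
    indices = [(va >> shift) & 0x1FF for shift in (39, 30, 21, 12)]
    offset = va & 0xFFF
    return indices, offset
```

Each 9-bit index selects one of 512 entries in its table, so under this assumption a table of 8-byte entries occupies exactly one 4 KB page.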

OS support:

  • Linux – patches adding ARMv8-A support were posted for review by Catalin Marinas of ARM Ltd. and were included in Linux kernel version 3.7 in late 2012.[56]
  • iOS – iOS 7 on the 64-bit Apple A7 SoC has ARMv8-A application support.

32-bit operating systems


Historical operating systems

The first ARM-based Acorn Archimedes personal computers ran an interim operating system called Arthur, which evolved into RISC OS, used on later ARM-based systems from Acorn and other vendors.

Embedded operating systems

The ARM architecture is supported by a large number of embedded and real-time operating systems, including Linux, Windows CE, Symbian, ChibiOS/RT, FreeRTOS, eCos, Integrity, Nucleus PLUS, MicroC/OS-II, PikeOS,[57] QNX, RTEMS, RTXC Quadros, ThreadX, VxWorks, DRYOS, MQX, T-Kernel, OSE, SCIOPTA and RISC OS.

Mobile device operating systems

The ARM architecture is the primary hardware environment for most mobile device operating systems such as iOS, Android, Windows Phone, Bada, Blackberry OS/Blackberry 10, MeeGo, Firefox OS, Tizen, Ubuntu Touch and Sailfish.

Desktop operating systems

The ARM architecture is supported by Windows RT, RISC OS and multiple Unix-like operating systems including BSD and various Linux distributions such as Ubuntu and Chrome OS.

64-bit operating systems

Mobile device operating systems

The ARMv8-A architecture is used in mobile device operating systems such as iOS (on 64-bit capable ARM processors).

Desktop operating systems

The ARMv8-A architecture is supported by some Linux distributions.

See also

  • ARM big.LITTLE, ARM's heterogeneous computing architecture
  • ARM Accredited Engineer certification program
  • ARMulator
  • Comparison of current ARM cores
  • Amber (processor core), an open-source ARM-compatible processor core
  • AMULET microprocessor, an asynchronous implementation of the ARM architecture
  • Unicore, a 32-register architecture based heavily on ARM.

References

Further reading

  • Assembly Language Programming : ARM Cortex-M3; 1st Edition; Vincent Mahout; Wiley-ISTE; 256 pages; 2012; ISBN 978-1848213296.
  • The Definitive Guide to the ARM Cortex-M3 and Cortex-M4 Processors; 3rd Edition; Joseph Yiu; Newnes; 600 pages; 2013; ISBN 978-0124080829.
  • The Definitive Guide to the ARM Cortex-M3; 2nd Edition; Joseph Yiu; Newnes; 480 pages; 2009; (Online Sample)
  • The Definitive Guide to the ARM Cortex-M0; 1st Edition; Joseph Yiu; Newnes; 552 pages; 2011; (Online Sample)

External links

  • , ARM Ltd.
  • ARM Virtualization Extensions
Quick Reference Cards
  • Instructions: Thumb (3)
  • Opcodes: Thumb (5.