

GitHub - corsix/amx: Apple AMX Instruction Set
source link: https://github.com/corsix/amx

Contemporary M1 / M2 machines from Apple have (at least) four different ways for low-level programmers to perform heavy computations:
- Standard ARMv8 SIMD/NEON vector instructions on CPU cores (128 bits wide, issue up to four per cycle on Firestorm)
- Apple's undocumented AMX instructions, issued from the CPU and executed on a special accelerator execution unit
- The Neural Engine (called ANE or NPU)
- The GPU (e.g. Metal Compute Shaders)
This repository is all about the second of those: Apple's AMX instructions. Note that these instructions are neither documented nor supported by Apple. As a potential source of great confusion, Apple's AMX instructions are completely distinct from Intel's AMX instructions, though both are intended for issuing matrix-multiply operations from a CPU.
The research was done on an Apple M1 Max (2021). Older or newer chips might have different AMX instructions. Some sources report that the M1 contains version 2 of the AMX instructions, which seems plausible (possibly everything using 7-bit writemasks comes from version 1, and everything using 9-bit writemasks is new in version 2).
A good one-image summary of AMX is the figure from abandoned patent application US20180074824A1. Consider a 32x32 grid of compute units, where each unit can perform a 16-bit multiply-accumulate, a 2x2 subgrid of units can perform a 32-bit multiply-accumulate, or a 4x4 subgrid can perform a 64-bit multiply-accumulate. To feed this grid, there is a pool of X registers, each containing 32 16-bit elements (or 16 32-bit elements, or 8 64-bit elements), and a pool of Y registers with the same layout. A single instruction can perform a full outer product: multiply every element of an X register with every element of a Y register, and accumulate each product onto the Z element in the corresponding position.
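For concreteness, here is a plain-C sketch of what one such f32 outer-product-accumulate computes. This is a reference model only, not the repository's API or real AMX code; the function name and the fixed 16-lane sizing (16 32-bit elements per X/Y register, as described above) are illustrative.

```c
/* Reference model (not real AMX code): one f32 outer-product-accumulate.
 * Multiply every element of X by every element of Y and accumulate each
 * product into the corresponding Z position. With 32-bit elements, X and Y
 * each hold 16 lanes, so Z accumulates a 16x16 tile. */
static void outer_product_accumulate_f32(const float x[16],
                                         const float y[16],
                                         float z[16][16]) {
    for (int j = 0; j < 16; j++) {       /* one Z row per Y element    */
        for (int i = 0; i < 16; i++) {   /* one Z column per X element */
            z[j][i] += x[i] * y[j];
        }
    }
}
```

A single AMX instruction performs the whole double loop at once; the loops here only spell out the arithmetic it stands for.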
A single row of the 32x32 grid can also be used to perform vector operations (rather than matrix operations) between X and Y.
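A sketch of that single-row mode, again as a plain-C model rather than real AMX code (the name and 16-lane sizing are illustrative):

```c
/* Reference model (not real AMX code): the single-row "vector" mode,
 * an elementwise multiply-accumulate into one Z row instead of a full
 * outer-product tile. */
static void vector_fma_f32(const float x[16], const float y[16],
                           float z_row[16]) {
    for (int i = 0; i < 16; i++) {
        z_row[i] += x[i] * y[i];
    }
}
```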
In terms of available data types, the general pattern is:
- IEEE754 f16 or f32 or f64 (same width for all three fused-multiply-add operands)
- IEEE754 f16 multiplicands, accumulating onto f32 (see the sketch after this list)
- Integer 8-bit or 16-bit multiplicands, accumulating onto 16-bit or 32-bit integers (with various signedness combinations)
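To illustrate the mixed-precision pattern, the f16-multiplicands-onto-f32 case behaves roughly like the model below. This is an assumption-level sketch: it uses clang's `__fp16` storage type on AArch64 and the 32-lanes-per-register sizing described above, and it does not claim to capture the hardware's exact rounding or Z-register layout.

```c
/* Reference model (not real AMX code): f16 multiplicands accumulating
 * onto f32. With 16-bit elements, X and Y each hold 32 lanes. */
static void outer_product_f16_to_f32(const __fp16 x[32],
                                     const __fp16 y[32],
                                     float z[32][32]) {
    for (int j = 0; j < 32; j++) {
        for (int i = 0; i < 32; i++) {
            /* model: widen each operand to f32, then multiply-accumulate */
            z[j][i] += (float)x[i] * (float)y[j];
        }
    }
}
```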
This repository provides: