GitHub - corsix/amx: Apple AMX Instruction Set

2 years ago

source link: https://github.com/corsix/amx
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

README.md

Contemporary M1 / M2 machines from Apple have (at least) four different ways for low-level programmers to perform heavy computations:

Standard ARMv8 SIMD/NEON vector instructions on CPU cores (128 bits wide, issue up to four per cycle on Firestorm)
Apple's undocumented AMX instructions, issued from CPU, executed on a special accelerator execution unit
The Neural Engine (called ANE or NPU)
The GPU (e.g. Metal Compute Shaders)

This repository is all about the 2nd of those: Apple's AMX instructions. Note that these instructions are neither documented nor supported by Apple. As a source of potential great confusion, Apple's AMX instructions are completely distinct from Intel's AMX instructions, though both are intended for issuing matrix multiply operations from a CPU.

The research was done on an Apple M1 Max (2021). Older or newer chips might have different AMX instructions. Some sources report that the M1 contains version 2 of the AMX instructions, which seems plausible (possibly everything using 7-bit writemasks comes from version 1, and everything using 9-bit writemasks is new in version 2).

A good one-image summary of AMX is the following figure from abandoned patent US20180074824A1. Consider a 32x32 grid of compute units, where each unit can perform 16-bit multiply-accumulate, or a 2x2 subgrid of units can perform 32-bit multiply-accumulate, or a 4x4 subgrid can perform 64-bit multiply-accumulate. To feed this grid, there is a pool of X registers each containing 32 16-bit elements (or 16 32-bit elements, or 8 64-bit elements) and a pool of Y registers similarly containing 32 16-bit elements (or 16 32-bit elements, or 8 64-bit elements). A single instruction can perform a full outer product: multiply every element of an X register with every element of a Y register, and accumulate with the Z element in the corresponding position.

A single row of the 32x32 grid can also be used to perform vector operations (rather than matrix operations) between X and YT.

In terms of available data types, the general pattern is:

IEEE754 f16 or f32 or f64 (same width for all three fused-multiply-add operands)
IEEE754 f16 multiplicands, accumulating onto f32
Integer 8-bit or 16-bit multiplicands, accumulating onto 16 or 32 bits (in various signednesses)

This repository provides:

Recommend

Github github.com 7 years ago
Cache

GitHub - nemequ/simde: Implementations of SIMD instruction sets for systems whic...

SIMD Everywhere The SIMDe header-only library provides fast, portable implementations of SIMD intrinsics on hardware which doesn't natively support them, such as calling